CN114547323A - Fine-grained knowledge graph fusion method for two-dimensional overlapped large sample data source - Google Patents

Info

Publication number
CN114547323A
Authority
CN
China
Prior art keywords
similarity
entity
attributes
attribute
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111646665.6A
Other languages
Chinese (zh)
Inventor
季白杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Biwan Information Technology Co ltd
Original Assignee
Hangzhou Biwan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Biwan Information Technology Co ltd filed Critical Hangzhou Biwan Information Technology Co ltd
Priority to CN202111646665.6A priority Critical patent/CN114547323A/en
Publication of CN114547323A publication Critical patent/CN114547323A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a fine-grained knowledge graph fusion method for a two-dimensional overlapped large sample data source, comprising the following steps: S1, performing iterative entity alignment on the attribute triples of the knowledge graphs to obtain a set of entity pairs, and classifying the entity pairs at multiple levels to obtain high-confidence entity pairs; using the high-confidence entity pairs as the training data set of an embedding model, performing structured embedding with the relation triples to obtain high-dimensional vector representations of the entities and relations, and assigning weights to the attributes and relations to obtain the final similarity of the attributes and relations; S2, screening the entity attributes according to the obtained similarity to obtain the final similarity of the attributes; S3, automatically completing knowledge feature fusion of the high-confidence entity pairs and attributes, based on a classifier model obtained by machine learning training and an atomic expression algorithm; and S4, completing bidirectionally supervised interactive data fusion based on the knowledge feature fusion.

Description

Fine-grained knowledge graph fusion method for two-dimensional overlapped large sample data source
Technical Field
The invention relates to the technical field of big data processing, in particular to a fine-grained knowledge graph fusion method for a two-dimensional overlapped big sample data source.
Background
A Knowledge Graph (KG) contains a large number of instances. In practical applications each instance is usually expressed as a triple $\langle H, R, T \rangle$, where H and T represent the head and tail entities and R represents the relation between the two entities. For two knowledge graphs KG1 and KG2, the following is defined: the entity pair set $Entity = \{A_1, A_2, A_3, \ldots, A_n\}$, where each $A_i$ ($i = 1, 2, 3, \ldots, n$) is a quadruple $\langle ID, E1, E2, S \rangle$ in which ID is a unique identifier, $E1 \in KG1$, $E2 \in KG2$, and S is the similarity value between the two entities, with $S \in [0, 1]$. In different knowledge graphs, equivalent entities carry different identifying names because of different builders or different languages, yet they remain semantically similar.
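The quadruple structure defined above can be pictured as a small record type. The following is a minimal sketch only; the class name `EntityPair`, the field names and the sample identifiers are illustrative assumptions and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityPair:
    """One element A_i of the entity pair set: <ID, E1, E2, S>."""

    pair_id: int       # unique identifier (ID)
    e1: str            # entity taken from KG1
    e2: str            # entity taken from KG2
    similarity: float  # similarity value S, expected to lie in [0, 1]

    def __post_init__(self):
        # Reject similarity values outside the interval stated in the text.
        if not 0.0 <= self.similarity <= 1.0:
            raise ValueError("similarity must lie in [0, 1]")


# Hypothetical example pair; the entity names are made up for illustration.
pair = EntityPair(pair_id=1, e1="kg1:Bank_A", e2="kg2:BankA", similarity=0.92)
```

Keeping the similarity inside the record makes it straightforward to filter the pair set by a confidence threshold later on.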
In the prior art, knowledge fusion mostly uses the following methods: 1) using the textual information of entities and relations, mostly through simple string matching, which is easy to operate; 2) matching data by node similarity based on the data structure; 3) matching indirectly through a third-party data set; 4) processing data features with machine learning algorithms, including learning a fusion expression, training a classification model and so on. Alternatively, several algorithms are combined through an aggregation function to achieve a better result. These methods have the following shortcomings:
1) They ignore the fact that the knowledge graphs to be fused overlap in both the entity dimension and the attribute dimension.
2) Existing schemes fuse and match large numbers of entity pairs as an end in itself, rather than pursuing the final goal of improving the quality of the knowledge graph. That is, they ignore the key factor of fusing the attributes of entities, so the resulting knowledge graph is large but not accurate. From the viewpoint of improving knowledge graph quality, the cross-fusion of ontology and attributes covers two indispensable dimensions, and making the two promote each other is both the difficulty and the focus of knowledge fusion.
3) Large-scale data processing is ideally suited to computers, but because of technical and other limitations, steps such as data block segmentation and data labeling are still done manually, or are handled by lightweight algorithms with manual assistance, which wastes a great deal of manpower and money.
Disclosure of Invention
Aiming at the problems that the quality of a knowledge graph obtained by the existing knowledge fusion method is low, a large amount of resources are consumed in the fusion process and the like, the invention combines linguistic information, spatial information and a machine learning algorithm, aims to solve one or more difficulties in the existing knowledge graph fusion to a considerable extent, and provides a fine-grained knowledge graph fusion method of a two-dimensional overlapped large-sample data source.
In order to achieve the purpose, the invention adopts the following technical scheme:
A fine-grained knowledge graph fusion method for a two-dimensional overlapped large sample data source comprises the following steps:
S1. Perform iterative entity alignment on the attribute triples of the knowledge graphs to obtain a set of entity pairs, and classify the entity pairs at multiple levels to obtain high-confidence entity pairs; use the high-confidence entity pairs as the training data set of an embedding model, perform structured embedding with the relation triples to obtain high-dimensional vector representations of the entities and relations, and assign weights to the attributes and relations to obtain the final similarity of the attributes and relations;
S2. Screen the entity attributes according to the obtained similarity to obtain the final similarity of the attributes;
S3. Automatically complete knowledge feature fusion of the high-confidence entity pairs and attributes, based on a classifier model obtained by machine learning training and an atomic expression algorithm;
S4. Complete bidirectionally supervised interactive data fusion based on the knowledge feature fusion.
Further, step S1 specifically includes:
S11. Perform entity alignment on the attribute triples based on an iterative model: carry out an entity matching operation based on the attributes and their corresponding attribute values to obtain an entity pair set, and carry out an attribute similarity matching operation on the entity pair set to obtain an attribute pair set, thereby obtaining high-confidence entity pairs;
S12. Use the obtained high-confidence entity pairs as the training data set of an embedding model, perform structured embedding with the relation triples, describe and model the global structure of the knowledge graphs to be fused, and finally obtain high-dimensional vector representations of the entities and relations;
S13. Fuse the attribute alignment and the relation alignment with different weights to obtain alignment results in the two dimensions of relation and attribute, and obtain the total similarity of the attributes and relations by linear combination.
Further, step S2 specifically includes:
S21. Calculate the similarity between attributes;
S22. Calculate the similarity between adjacent entities;
S23. Calculate the similarity of the attribute label sets;
S24. Screen the upper-layer concept paths of the entity attributes in the knowledge graph to form path vectors, and calculate the final similarity of the attributes.
Further, step S3 specifically includes:
S31. Obtain a classifier model through machine learning training, and handle entity fusion with a binary classification method;
S32. Screen attributes using atomic expressions;
S33. Combine the atomic expressions to complete knowledge feature fusion of the high-confidence entity pairs and attributes.
Further, step S4 specifically includes:
S41. Embed the triples as vectors based on the TransE and PTransE algorithms to complete the training of each single knowledge graph;
S42. Remap the high-dimensional vectors of the processed entities and relations into a low-dimensional space, constraining the entity vectors and the relation vectors respectively during the mapping, to complete bidirectionally supervised interactive data fusion.
Further, the step S11 specifically includes:
S111. When aligning attributes, set a uniform weight for the common attributes and calculate the similarity between entities, expressed as:
[Formula image BDA0003445393870000031: definition of $\mathrm{Sim}_A(e_1, e_2)$]
where $\mathrm{Sim}_A(e_1, e_2)$ denotes the similarity between entity $e_1$ and entity $e_2$; the symbol in image BDA0003445393870000032 denotes the k-th attribute common to both entities, for entity $e_1$; the symbol in image BDA0003445393870000033 denotes the k-th attribute common to both entities, for entity $e_2$; n denotes the total number of attributes common to the two entities; and $\mathrm{sim}_v$ denotes the similarity between these two attribute values (images BDA0003445393870000034 and BDA0003445393870000035), expressed as:
[Formula image BDA0003445393870000036: definition of $\mathrm{sim}_v$]
where levenshteinSim denotes the similarity calculated from the Levenshtein distance, and lcsSim denotes the similarity calculated from the longest substring common to the two strings;
S112. Search for potentially aligned attribute pairs according to the aligned entity pairs, expressed as:
[Formula image BDA0003445393870000041: attribute pair similarity]
where the symbol in image BDA0003445393870000042 denotes the similarity of the attribute pair shown in image BDA0003445393870000043; the symbol in image BDA0003445393870000044 denotes the number of elements in the finite set of entities; and the symbol in image BDA0003445393870000045 denotes the similarity between attribute values.
Further, the total similarity of the attributes and relations obtained in step S13 is expressed as:
$$\mathrm{Sim}(E_i, E_j) = \lambda \times \mathrm{sim}_R(e_i, e_j) + (1 - \lambda) \times \mathrm{sim}_A(e_i, e_j)$$
where $\mathrm{sim}_R$ denotes the similarity obtained from the relation triples; $\mathrm{sim}_A$ denotes the similarity obtained from the attribute triples; $\lambda$ denotes a weight; and $\mathrm{Sim}(E_i, E_j)$ denotes the total similarity.
Further, in step S21 the similarity between attributes is calculated as:
$$\mathrm{Sim}_{property} = \cos(\mathrm{Name}_{property1}, \mathrm{Name}_{property2})$$
where $\mathrm{Sim}_{property}$ denotes the similarity of the two attributes property1 and property2 at the attribute name level, and $\mathrm{Name}_{property1}$ and $\mathrm{Name}_{property2}$ denote the high-dimensional vector representations of the two attribute names.
Further, in step S22 the similarity between adjacent entities is calculated as:
$$\mathrm{Sim}_{entity} = \frac{|entityList1 \cap entityList2|}{|entityList1 \cup entityList2|}$$
where $\mathrm{Sim}_{entity}$ denotes the similarity of the adjacent entities, and entityList1 and entityList2 denote the finite sets of entities adjacent to property1 and property2.
Further, in step S23 the similarity of the attribute label sets is calculated as:
$$\mathrm{Sim}_{label} = \cos(\mathrm{label}_{property1}, \mathrm{label}_{property2})$$
where $\mathrm{Sim}_{label}$ denotes the similarity of the finite label sets of the attributes;
in step S24 the final similarity of the attributes is calculated as:
$$\mathrm{Sim}_{con} = \cos(\mathrm{concept}_{property1}, \mathrm{concept}_{property2})$$
where $\mathrm{Sim}_{con}$ denotes the upper-level concept similarity of property1 and property2, and $\mathrm{concept}_{property1}$ and $\mathrm{concept}_{property2}$ denote the upper-level concept path labels of the two attributes.
Compared with the prior art, the invention has the beneficial effects that:
1) The method aims to improve the quality of the knowledge graph. It meets the requirement of fine-grained fusion by fusing knowledge graphs in the financial field at the attribute level: on top of matching a large number of entity pairs in entity fusion, it focuses on attribute fusion as a key factor in knowledge graph fusion, uses the close relation between entities and attributes to improve the fusion effect, and ultimately improves the quality and precision of the fused knowledge graph.
2) Manual intervention in the fusion process is reduced: a machine learning algorithm that learns the fusion expression from positive samples only is realized, achieving a good fusion effect while reducing resource consumption.
Drawings
Fig. 1 is a flowchart of a fine-grained knowledge graph fusion method for a two-dimensional overlapped large sample data source according to an embodiment.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
The invention aims to overcome the shortcomings of the prior art by providing a fine-grained knowledge graph fusion method for a two-dimensional overlapped large sample data source that reduces workload, emphasizes the use of entity attributes in the knowledge graph, and uses the mutually reinforcing relation between attributes and entities to ensure the precision and quality of the fused knowledge graph.
Example one
The embodiment provides a fine-grained knowledge graph fusion method for a two-dimensional overlapped large sample data source, as shown in fig. 1, including:
S1. Perform iterative entity alignment on the attribute triples of the knowledge graphs to obtain a set of entity pairs, and classify the entity pairs at multiple levels to obtain high-confidence entity pairs; use the high-confidence entity pairs as the training data set of an embedding model, perform structured embedding with the relation triples to obtain high-dimensional vector representations of the entities and relations, and assign weights to the attributes and relations to obtain the final similarity of the attributes and relations;
S2. Screen the entity attributes according to the obtained similarity to obtain the final similarity of the attributes;
S3. Automatically complete knowledge feature fusion of the high-confidence entity pairs and attributes, based on a classifier model obtained by machine learning training and an atomic expression algorithm;
S4. Complete bidirectionally supervised interactive data fusion based on the knowledge feature fusion.
Industry knowledge graphs are oriented to a specific vertical domain; they have stricter predefined data schemas and higher accuracy requirements for the data, and they emphasize "depth". Financial data is typical big data with the "4V" characteristics (Volume: large scale; Variety: multi-structured and multi-dimensional; Value: high value; Velocity: timeliness requirements). Moreover, finance is the industry in which data is most representative. The financial industry covers a very wide range of categories; the major ones include banking, investment and insurance. At a finer granularity it can be divided into asset management products such as currency, bonds, funds and trusts, factor markets, credit and lending, and so on. Applications of knowledge graphs in finance mainly include risk control, credit investigation, auditing, anti-fraud, data analysis and automated reporting. Knowledge graphs are constructed repeatedly by different institutions, organizations and individuals for different fields and requirements. The data in these graphs fall into three categories: 1) Structured data: represented by e-government form data, which usually aggregates different information, such as name, occupation and income, around the ID of a person or organization as an anchor; over time this can evolve into organizational forms such as basic databases, subject databases and thematic databases. 2) Unstructured data: represented by video, image, voice and text, most of which must subsequently be analyzed and processed into structured data before use. 3) Spatio-temporal data: represented by geographic information, IoT and trajectory data.
This embodiment addresses the practical problem, given the current situation in the financial field, of how to organize, streamline and automatically fuse heterogeneous knowledge graphs so as to improve the coverage and quality of knowledge and to overcome the low data quality or missing data of a single knowledge graph, thereby achieving better application results. It organically combines the entity alignment algorithm and the attribute alignment algorithm from the field of knowledge graph fusion. Finally, based on a positive and negative sample fusion expression learning algorithm, a fine-grained knowledge graph fusion method for heterogeneous two-dimensional overlapped large sample data sources in the financial field is designed and implemented.
In step S1, iterative entity alignment is performed on the attribute triples of the knowledge graphs to obtain a set of entity pairs, and the entity pairs are classified at multiple levels to obtain high-confidence entity pairs; the high-confidence entity pairs are used as the training data set of the embedding model, structured embedding is performed with the relation triples to obtain high-dimensional vector representations of the entities and relations, and weights are assigned to the attributes and relations to obtain the final similarity of the attributes and relations.
Iterative entity alignment is performed on the attribute triples based on a probabilistic model. The entity pairs are graded by similarity at multiple levels to form a hierarchical tree structure, and different thresholds are set on this tree to obtain high-confidence entity pairs. Because these pairs are of relatively high quality, they are used to train the embedding model, which yields high-dimensional vector representations of the low-dimensional data. Training a logistic regression model on the high-dimensional vectors then gives a unified similarity mapping between the high-dimensional and low-dimensional spaces, and the final similarity is obtained by setting weights. The step specifically includes:
S11. Attribute-triple-based alignment: perform entity alignment on the attribute triples with an iterative model. An entity matching operation based on the attributes and their corresponding attribute values yields an entity pair set; an attribute similarity matching operation on the entity pair set yields an attribute pair set; the two operations are repeated until no new entity pairs or attribute pairs are generated, giving the high-confidence entity pairs;
S111. When aligning attributes, set a uniform weight for the common attributes and calculate the similarity between entities;
Because attribute coverage is low and attribute names and values are expressed in diverse ways, the attributes recorded for the same entity differ between graphs. Since any common attribute is therefore especially important during attribute alignment, the common attributes of the two entities are given uniform weights, and the similarity of the two entities is calculated by the following formula:
[Formula image BDA0003445393870000071: definition of $\mathrm{Sim}_A(e_1, e_2)$]
where $\mathrm{Sim}_A(e_1, e_2)$ denotes the similarity between entity $e_1$ and entity $e_2$; the symbol in image BDA0003445393870000072 denotes the k-th attribute common to both entities, for entity $e_1$; the symbol in image BDA0003445393870000073 denotes the k-th attribute common to both entities, for entity $e_2$; n denotes the total number of attributes common to the two entities; and $\mathrm{sim}_v$ denotes the similarity between these two attribute values (images BDA0003445393870000074 and BDA0003445393870000075), expressed as:
[Formula image BDA0003445393870000076: definition of $\mathrm{sim}_v$]
where levenshteinSim denotes the similarity calculated from the Levenshtein distance, and lcsSim denotes the similarity calculated from the longest substring common to the two strings;
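The two string measures named above can be sketched in plain Python. This is an illustration under stated assumptions, not the patent's formula: the source gives the exact definitions only as images, so the normalization by the longer string length, the use of `max` to combine the two measures in `value_sim`, and the plain average over common attributes in `entity_sim` are all assumptions, as are the function names.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Classic edit distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_sim(a: str, b: str) -> float:
    # ASSUMPTION: normalize by the longer string so the result lies in [0, 1].
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def longest_common_substring_len(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else 0)
            best = max(best, curr[j])
        prev = curr
    return best


def lcs_sim(a: str, b: str) -> float:
    # ASSUMPTION: same normalization as levenshtein_sim.
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else longest_common_substring_len(a, b) / longest


def value_sim(a: str, b: str) -> float:
    # ASSUMPTION: the combining rule is not recoverable from the source; max is a stand-in.
    return max(levenshtein_sim(a, b), lcs_sim(a, b))


def entity_sim(attrs1: dict, attrs2: dict) -> float:
    """Uniformly weighted average of value similarities over the common attributes."""
    common = attrs1.keys() & attrs2.keys()
    if not common:
        return 0.0
    return sum(value_sim(attrs1[k], attrs2[k]) for k in common) / len(common)
```

Giving every common attribute the same weight mirrors the uniform-weight choice described in S111; only attributes present on both entities contribute.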
S112. Search for potentially aligned attribute pairs according to the aligned entity pairs.
From the aligned entity pairs obtained in S111, potentially aligned attribute pairs can be found. Specifically, the set of aligned entity pairs is obtained; for each potentially aligned attribute pair, the subset of aligned entity pairs that contain it is found; and the attribute similarity is measured over these entity pairs. The formula for an attribute pair is:
[Formula image BDA0003445393870000077: attribute pair similarity]
where the symbol in image BDA0003445393870000078 denotes the similarity of the attribute pair shown in image BDA0003445393870000079; the symbol in image BDA00034453938700000710 denotes the number of elements in the finite set of entities; and the symbol in image BDA00034453938700000711 denotes the similarity between attribute values.
Based on this mathematical model, the attributes and entities are aligned interactively by iterating the following algorithm:
[Algorithm listing image BDA0003445393870000081]
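Since the algorithm listing survives only as an image, the following is a hedged sketch of the loop described in S11: match entities using the currently aligned attribute pairs, match attributes using the currently aligned entity pairs, and stop when neither set grows. The function name `iterative_align`, the thresholds, the seed attribute pair and the use of `difflib.SequenceMatcher` as the value similarity are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def value_sim(a: str, b: str) -> float:
    # Stand-in string similarity from the standard library (an assumption).
    return SequenceMatcher(None, a, b).ratio()


def iterative_align(kg1, kg2, seed_attr_pairs, entity_thr=0.9, attr_thr=0.9):
    """kg1 and kg2 map entity -> {attribute: value}; returns (entity_pairs, attr_pairs)."""
    attr_pairs = set(seed_attr_pairs)
    entity_pairs = set()
    while True:
        # Step 1: entity matching over the attribute pairs aligned so far.
        found_entities = set()
        for e1, a1 in kg1.items():
            for e2, a2 in kg2.items():
                shared = [(p, q) for p, q in attr_pairs if p in a1 and q in a2]
                if not shared:
                    continue
                score = sum(value_sim(a1[p], a2[q]) for p, q in shared) / len(shared)
                if score >= entity_thr:
                    found_entities.add((e1, e2))
        aligned = entity_pairs | found_entities

        # Step 2: attribute matching over the entity pairs aligned so far.
        found_attrs = set()
        candidates = {(p, q) for e1, e2 in aligned for p in kg1[e1] for q in kg2[e2]}
        for p, q in candidates:
            support = [(e1, e2) for e1, e2 in aligned if p in kg1[e1] and q in kg2[e2]]
            score = sum(value_sim(kg1[e1][p], kg2[e2][q]) for e1, e2 in support) / len(support)
            if score >= attr_thr:
                found_attrs.add((p, q))

        # Stop when neither set gained a new member.
        if found_entities <= entity_pairs and found_attrs <= attr_pairs:
            return entity_pairs, attr_pairs
        entity_pairs |= found_entities
        attr_pairs |= found_attrs


# Toy graphs with made-up entities and attributes.
kg1 = {"a1": {"name": "Acme Bank", "city": "Hangzhou"}}
kg2 = {"b1": {"title": "Acme Bank", "location": "Hangzhou"}}
entity_pairs, attr_pairs = iterative_align(kg1, kg2, {("name", "title")})
```

Both sets only ever grow and are bounded by the number of possible pairs, so the loop terminates. In this toy run the seed pair aligns the two entities, which in turn exposes the `city`/`location` attribute pair on the next pass.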
S12. Relation-embedding-based alignment: use the obtained high-confidence entity pairs as the training data set of the embedding model, perform structured embedding with the relation triples, describe and model the global structure of the knowledge graphs to be fused, and finally obtain high-dimensional vector representations of the entities and relations;
The structure embedding model optimizes a margin-based loss function so that positive samples score lower than negative samples:
$$O_{SE} = \sum_{tr \in Tr} \sum_{tr' \in Tr'} \left( f(tr) - \alpha f(tr') \right)$$
where $f(tr) = \lVert h + r - t \rVert$ is the score function; $Tr$ and $Tr'$ denote the finite sets of positive and negative sample triples; and $\alpha \in [0, 1]$ is a hyperparameter that weights the positive and negative samples. The embedding process yields high-dimensional entity vectors for the two knowledge graphs, from which the similarity is obtained through the cosine distance.
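The score function and objective of S12 can be evaluated directly on toy vectors. This sketch assumes the Euclidean norm for $\lVert h + r - t \rVert$ and reads the objective as a double sum over positive and negative triples, with $\alpha$ applied to the negative score; the source formula is garbled, so both readings are assumptions, and the function names are hypothetical. It only evaluates the objective and does not train embeddings.

```python
import math


def score(h, r, t):
    """f(tr) = ||h + r - t||, using the Euclidean norm (an assumption)."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def se_objective(positives, negatives, alpha):
    """Sum of f(tr) - alpha * f(tr') over every positive/negative triple pair."""
    return sum(
        score(*pos) - alpha * score(*neg)
        for pos in positives
        for neg in negatives
    )


# Toy 2-dimensional triples (h, r, t); the values are made up.
positive = ([0.0, 0.0], [1.0, 0.0], [1.0, 0.0])  # h + r equals t, so the score is 0
negative = ([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])  # score is 5
value = se_objective([positive], [negative], alpha=0.5)
```

A lower objective means positive triples sit close to `h + r = t` while negative triples sit far from it, which is the separation the loss is meant to produce.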
S13. Fuse the attribute alignment and the relation alignment with different weights to obtain alignment results in the two dimensions of relation and attribute, and obtain the total similarity of the attributes and relations by linear combination, expressed as:
$$\mathrm{Sim}(E_i, E_j) = \lambda \times \mathrm{sim}_R(e_i, e_j) + (1 - \lambda) \times \mathrm{sim}_A(e_i, e_j)$$
where $\mathrm{Sim}(E_i, E_j)$ denotes the total similarity; $\mathrm{sim}_R$ denotes the similarity obtained from the relation triples; $\mathrm{sim}_A$ denotes the similarity obtained from the attribute triples; and $\lambda$ denotes a weight that is learned by a regression model. More specifically, the importance of attributes and relations differs between data sets, and the quality of relations and attributes differs between knowledge graphs. If the relations in a knowledge graph are of high quality, relation-based alignment has higher confidence; in a sparse knowledge graph, attribute-based alignment has higher confidence.
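As a worked instance of the linear combination, the sketch below computes the total similarity for fixed inputs. The source states that $\lambda$ is learned by a regression model; here it is simply passed in, and restricting it to [0, 1] is an assumption. The function name and the sample numbers are illustrative only.

```python
def total_similarity(sim_r: float, sim_a: float, lam: float) -> float:
    """Sim = lam * sim_R + (1 - lam) * sim_A."""
    # ASSUMPTION: lam is treated as a convex-combination weight in [0, 1].
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * sim_r + (1.0 - lam) * sim_a


# A relation-rich graph might favor sim_R (larger lam), a sparse one sim_A.
dense = total_similarity(sim_r=0.8, sim_a=0.6, lam=0.7)   # 0.56 + 0.18, about 0.74
sparse = total_similarity(sim_r=0.8, sim_a=0.6, lam=0.2)  # 0.16 + 0.48, about 0.64
```

With the same two component similarities, moving the weight toward the relation side raises the total whenever the relation similarity is the larger of the two.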
In step S2, the entity attributes are screened according to the obtained similarity to obtain the final similarity of the attributes.
The attributes with common meaning in the knowledge base are screened according to entity similarity. A threshold criterion is determined and an attribute function is designed to screen the entity attributes automatically, using part of the information of the entity attributes, including the entity's upper and lower concepts, labels, attribute values and so on. The similarity obtained from this information is combined with the attribute name similarity to give the final similarity of the attributes, and a pruning operation reduces the redundancy of entity attributes in the fused knowledge graph. Finally, the two are executed interactively and promote each other: the input is the two graphs and the output is the fused graph, rather than only the aligned entity pairs. The step specifically includes:
S21. Calculate the similarity between attributes;
Attribute name similarity: the semantic information of the attribute names is particularly important. This embodiment expects the similarity calculation to reach the specific information carried at the semantic level of the different attributes, rather than simple character-level matching. The similarity is therefore calculated using the open-source AILab Chinese word vector library, based on the following formula:
$$\mathrm{Sim}_{property} = \cos(\mathrm{Name}_{property1}, \mathrm{Name}_{property2})$$
where $\mathrm{Sim}_{property}$ denotes the similarity of the two attributes property1 and property2 at the attribute name level, and $\mathrm{Name}_{property1}$ and $\mathrm{Name}_{property2}$ denote the high-dimensional vector representations of the two attribute names. The result is the cosine of the two attribute name vectors.
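The cosine computation in S21 can be illustrated with a few hand-made vectors. The dictionary `TOY_VECTORS` below is a made-up stand-in for a pretrained word vector library; its words and values are assumptions chosen only so that related names point in similar directions.

```python
import math

# Hypothetical 3-dimensional "word vectors"; a real library would supply these.
TOY_VECTORS = {
    "revenue": [0.9, 0.1, 0.0],
    "income": [0.8, 0.2, 0.1],
    "address": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def name_similarity(name1: str, name2: str, vectors=TOY_VECTORS) -> float:
    """Sim_property: cosine of the two attribute name vectors."""
    return cosine(vectors[name1], vectors[name2])


close = name_similarity("revenue", "income")
far = name_similarity("revenue", "address")
```

Unlike character-level matching, `revenue` and `income` come out as similar here even though the strings share almost no characters, which is the point of comparing at the semantic level.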
S22. Calculate the similarity between adjacent entities;
Associated entity similarity: besides the similarity of the entities themselves, this embodiment notes that the adjacency relations of entities can also improve the quality of knowledge fusion. It is assumed here that if the similarity of the entities adjacent to two attributes reaches a certain threshold, the attribute pair can be considered similar. For the entities adjacent to property1 and property2, the formula is:
$$\mathrm{Sim}_{entity} = \frac{|entityList1 \cap entityList2|}{|entityList1 \cup entityList2|}$$
where $\mathrm{Sim}_{entity}$ denotes the similarity of the adjacent entities, and entityList1 and entityList2 denote the finite sets of entities adjacent to property1 and property2.
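The ratio above is the Jaccard coefficient of the two adjacent-entity sets. A minimal sketch follows; the function name and the sample entity names are illustrative assumptions, and returning 0.0 for two empty sets is a convention chosen here rather than something the source specifies.

```python
def adjacent_entity_similarity(entity_list1, entity_list2) -> float:
    """Sim_entity = |A ∩ B| / |A ∪ B| for the entities adjacent to two attributes."""
    a, b = set(entity_list1), set(entity_list2)
    union = a | b
    if not union:
        return 0.0  # convention for two empty sets (an assumption)
    return len(a & b) / len(union)


# Made-up adjacent entities for property1 and property2.
sim_entity = adjacent_entity_similarity(
    ["Bank A", "Bank B", "Fund C"],
    ["Bank B", "Fund C", "Trust D"],
)
```

Two of the four distinct entities are shared, so the similarity is 0.5; the attribute pair would be treated as similar only if this value reaches the chosen threshold.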
S23. Calculate the similarity of the attribute label sets;
Associated entity label similarity: sites such as Wikipedia, Baidu and Saigao summarize an entity's features in entry labels. For example, searching for "Ren Zhengfei" returns labels such as "president" and "CEO". Such labels tend to be quite representative.
Based on this, and with a view to improving the quality of the knowledge graph, the label vector $\mathrm{label}_{property1} = (x_1, x_2, \ldots, x_n)$ of property1 and the label vector $\mathrm{label}_{property2} = (y_1, y_2, \ldots, y_n)$ of property2 are constructed, and the similarity of the finite attribute label sets is calculated as:
$$\mathrm{Sim}_{label} = \cos(\mathrm{label}_{property1}, \mathrm{label}_{property2})$$
where $\mathrm{Sim}_{label}$ denotes the similarity of the finite label sets of the attributes;
S24. Screen the upper-layer concept paths of the entity attributes in the knowledge graph to form path vectors, and calculate the final similarity of the attributes.
Associated entity upper-concept similarity: a hierarchical concept tree exists in the knowledge graph. Taking the most common case of a person, the root node is "person", which branches into "political field", "economic field", "entertainment industry" and so on; the economic field can be further divided into "real estate", "automobile industry" and so on, finally forming the concept hierarchy tree. On this basis, the upper-layer concept paths of the entity attributes in the two knowledge graphs are extracted and turned into path vectors, and the similarity is calculated as:
$$\mathrm{Sim}_{con} = \cos(\mathrm{concept}_{property1}, \mathrm{concept}_{property2})$$
where $\mathrm{Sim}_{con}$ denotes the upper-level concept similarity of property1 and property2, and $\mathrm{concept}_{property1}$ and $\mathrm{concept}_{property2}$ denote the upper-level concept path labels of the two attributes.
In step S3, knowledge feature fusion of high-confidence entity pairs and attributes is automatically completed based on the classifier model and the atomic expression algorithm obtained by machine learning training.
Negative samples are usually not recorded during knowledge fusion. To address this, data features are extracted automatically with a machine learning algorithm, so that a good fusion effect is achieved with less manual intervention. The step specifically comprises the following:
S31, obtaining a classifier model through machine learning training, and handling entity fusion as a binary classification problem;
the classification function is formulated as follows:
Figure BDA0003445393870000101
According to the high-quality entity pairs and attribute sets obtained in steps S1 and S2, the attributes that meet the standard and can take part in similarity calculation are determined. Atomic expressions are then used to maximize the F-measure of each attribute pair, and an expression tree based on AND operations is created.
S32, screening attributes by using an atomic expression;
An atomic expression presupposes that a suitable metric function has been determined, so that a suitable threshold can be configured for screening the functions that participate in similarity calculation. After steps S1 and S2, attribute pairs still need to be screened in order to further improve the precision and quality of the fused knowledge graph. The first principle of attribute screening is how generally representative an attribute is, and the screening formula is as follows:
Figure BDA0003445393870000113
where cover(p) denotes the universality of attribute p; the numerator is the number of subject entities that contain attribute p, and the denominator is the number of all subject entities. Screening KG1 and KG2 yields the finite attribute sets P1 and P2, respectively. Then, given a custom function M, the Cartesian product of P1 and P2 is taken to find, for each attribute pair (p1, p2), the function $M_{p1,p2}$ and the corresponding threshold θ that maximize the F-measure of that pair, which yields the set of atomic expressions as follows:
Figure BDA0003445393870000111
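The screening and threshold search described above can be sketched as follows. The coverage function follows the verbal definition (entities carrying the attribute over all subject entities). The threshold search over a fixed candidate list, the helper names, and the sample data are illustrative assumptions, since the original formulas are only available as images.

```python
def coverage(prop, triples):
    """Share of subject entities that carry attribute `prop`."""
    subjects = {s for s, _, _ in triples}
    with_prop = {s for s, p, _ in triples if p == prop}
    return len(with_prop) / len(subjects) if subjects else 0.0


def f_measure(predicted, gold):
    """Harmonic mean of precision and recall over sets of entity pairs."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scored_pairs, gold, candidates):
    """Return (theta, F) for the candidate threshold with the highest F-measure.

    `scored_pairs` maps an entity pair to the value of one metric M on one
    attribute pair (p1, p2); `gold` holds the known positive entity pairs.
    """
    best_theta, best_f = None, -1.0
    for theta in candidates:
        predicted = {pair for pair, score in scored_pairs.items() if score >= theta}
        score = f_measure(predicted, gold)
        if score > best_f:
            best_theta, best_f = theta, score
    return best_theta, best_f


# Hypothetical attribute triples (subject, attribute, value).
triples = [
    ("e1", "name", "A"), ("e2", "name", "B"), ("e3", "name", "C"),
    ("e1", "founded", "1987"), ("e4", "ticker", "X"),
]
# Hypothetical metric values and high-confidence positive pairs.
scored = {("a1", "b1"): 0.9, ("a2", "b2"): 0.7, ("a3", "b9"): 0.6, ("a4", "b4"): 0.3}
gold = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}

theta, f_best = best_threshold(scored, gold, [0.2, 0.5, 0.65, 0.8])
```

Running the search once per metric and per attribute pair in P1 × P2 would give one (metric, threshold) atomic expression for each pair.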
and S33, combining and using the atomic expressions to complete the knowledge characteristic fusion of the high-confidence entity pairs and the attributes.
Because a single atomic expression uses only local information about the attributes, atomic expressions are used in combination, expressed as:
Figure BDA0003445393870000112
where φ(E) denotes an operator symbol, and the three operators in the expression denote the OR, AND, and DIFF operators, respectively.
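One way to read the combination is as set algebra over the entity pairs that each atomic expression accepts. The sketch below assumes exactly that reading: OR as union, AND as intersection, DIFF as set difference. The nested-tuple encoding of the expression tree and all names and values are hypothetical.

```python
def atomic(scored_pairs, theta):
    """Entity pairs whose metric value for one attribute pair reaches theta."""
    return {pair for pair, score in scored_pairs.items() if score >= theta}


# Assumed set semantics for the three operators.
OPERATORS = {
    "OR": lambda left, right: left | right,
    "AND": lambda left, right: left & right,
    "DIFF": lambda left, right: left - right,
}


def evaluate(expr):
    """Evaluate a nested tuple ('OP', left, right); leaves are sets of pairs."""
    if isinstance(expr, (set, frozenset)):
        return set(expr)
    op, left, right = expr
    return OPERATORS[op](evaluate(left), evaluate(right))


# Hypothetical atomic expressions on two attribute pairs.
name_match = atomic({("a1", "b1"): 0.9, ("a2", "b2"): 0.8, ("a3", "b3"): 0.4}, 0.7)
address_match = atomic({("a1", "b1"): 0.6, ("a2", "b2"): 0.2, ("a3", "b3"): 0.9}, 0.5)
conflicting = {("a3", "b3")}

either_minus_conflicts = evaluate(("DIFF", ("OR", name_match, address_match), conflicting))
both = evaluate(("AND", name_match, address_match))
```

Combining expressions this way lets evidence from several attributes confirm or veto a candidate pair, which a single attribute-level test cannot do.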
In step S4, bi-directional supervised interactive data fusion is completed based on knowledge feature fusion.
This step implements a bidirectional supervised interactive data fusion algorithm. Based on the assumption that the knowledge graphs to be fused fit each other to a considerable degree, the graphs are trained under mutual supervision, and their quality and quantity are enhanced over repeated fusion cycles. The fusion emphasizes structural information among entities and reduces the weight of linguistic similarity, converting low-dimensional character similarity into similarity between high-dimensional space vectors and thereby achieving cross-domain structural fusion. It specifically comprises the following steps:
S41, embedding the triples as vectors based on the TransE and PTransE algorithms to complete the training of a single knowledge graph;
Vector embedding of RDF triples is implemented based on the TransE and PTransE models. The loss function of TransE is defined as follows:
$$L(h, r, t) = [\gamma + E(h, r, t) - E(h', r', t')]_+$$
where $L(h, r, t)$ denotes the loss function; $\gamma$ denotes the margin; $E(h, r, t)$ denotes the vector embedding term of the triple $(h, r, t)$; and $(h', r', t')$ is an erroneous triple.
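A minimal numeric sketch of this margin loss for a single pair of triples is given below. It assumes the usual TransE choice of E as the Euclidean norm of h + r - t, which the text here does not spell out; the vectors and the default margin are illustrative.

```python
import math


def energy(h, r, t):
    """Assumed TransE energy ||h + r - t||_2; lower means more plausible."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def margin_loss(positive, negative, gamma=1.0):
    """Hinge loss [gamma + E(h, r, t) - E(h', r', t')]_+ for one triple pair."""
    return max(0.0, gamma + energy(*positive) - energy(*negative))


# Hypothetical 2-dimensional embeddings.
head = [0.0, 0.0]
relation = [1.0, 0.0]
tail = [1.0, 0.0]          # correct tail: head + relation lands exactly on it
near_wrong_tail = [1.0, 0.5]
far_wrong_tail = [0.0, 3.0]

loss_near = margin_loss((head, relation, tail), (head, relation, near_wrong_tail))
loss_far = margin_loss((head, relation, tail), (head, relation, far_wrong_tail))
```

The corrupted triple that sits close to the correct one still incurs a loss, while the one already separated by more than the margin contributes nothing.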
The TransE algorithm flow is as follows:
Figure BDA0003445393870000121
TransE differs from other methods, whose models become complicated and hard to prune because training the triples requires too many parameters; it is suited to knowledge graphs that are large in scale but simple in content. Because the processing here necessarily involves both large scale and complex data set content, the PTransE model is adopted, and the specific algorithm flow is as follows:
Figure 2
S42, remapping the high-dimensional space vectors of the processed entities and relations into a low-dimensional space, and forming constraints on the respective entity and relation vectors during the mapping, to complete the bidirectional supervised interactive data fusion.
Step S41 mainly completes the training of a single knowledge graph. The essence of fusion is to remap the high-dimensional space vectors of the processed entities and relations into a low-dimensional space. This step therefore performs bidirectional supervised training of the two knowledge graphs based on the pre-fused entity pair information, forming constraints on the respective vectors in the process. The pseudocode of the algorithm is as follows:
Figure BDA0003445393870000132
S43, knowledge representation learning and supervised learning form a reciprocating iterative process. For entities e1 and e2 in the two networks, a threshold θ is determined; if E(e1, e2) < θ, the two are considered similar entities, and (e1, e2) is called a standard entity pair. Through bidirectional supervision, the standard entity pairs allow more entity pairs to be found during iteration. The specific iteration flow is as follows:
Figure BDA0003445393870000141
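The iteration can be pictured with the simplified sketch below. It is not the flow shown in the figure: it assumes E(e1, e2) is the Euclidean distance between embeddings, picks the nearest unaligned candidate greedily, and takes the supervised retraining step as a caller-supplied `retrain` function. All of these, and the sample embeddings, are hypothetical.

```python
import math


def distance(u, v):
    """Assumed form of E(e1, e2): Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def discover_pairs(emb1, emb2, pairs, theta):
    """One pass: match each unaligned e1 to its nearest unaligned e2 if E < theta."""
    aligned1 = {e1 for e1, _ in pairs}
    aligned2 = {e2 for _, e2 in pairs}
    new_pairs = set()
    for e1, v1 in emb1.items():
        if e1 in aligned1:
            continue
        candidates = [e2 for e2 in emb2 if e2 not in aligned2]
        if not candidates:
            break
        best = min(candidates, key=lambda e2: distance(v1, emb2[e2]))
        if distance(v1, emb2[best]) < theta:
            new_pairs.add((e1, best))
            aligned2.add(best)
    return new_pairs


def iterative_alignment(emb1, emb2, seeds, theta, retrain, max_rounds=10):
    """Alternate retraining and pair discovery until no new pair appears."""
    pairs = set(seeds)
    for _ in range(max_rounds):
        emb1, emb2 = retrain(emb1, emb2, pairs)  # supervised update (placeholder)
        new_pairs = discover_pairs(emb1, emb2, pairs, theta)
        if not new_pairs:
            break
        pairs |= new_pairs
    return pairs


# Hypothetical embeddings of two graphs and one seed (standard) pair.
emb_kg1 = {"a1": [0.0, 0.0], "a2": [1.0, 1.0], "a3": [5.0, 5.0]}
emb_kg2 = {"b1": [0.0, 0.1], "b2": [1.1, 1.0], "b3": [9.0, 9.0]}
result = iterative_alignment(
    emb_kg1, emb_kg2, {("a1", "b1")}, theta=0.5,
    retrain=lambda e1, e2, pairs: (e1, e2),  # identity stand-in for training
)
```

With a real `retrain` step, each newly accepted pair would tighten the constraints on both embedding spaces before the next discovery pass.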
Compared with the prior art, this embodiment has the following beneficial effects:
1) The method aims to improve knowledge graph quality and meets the requirement of fine-grained fusion by fusing financial-domain knowledge graphs at the attribute level. Given that entity fusion already matches a large number of entity pairs, it focuses on attribute fusion as the key factor in knowledge graph fusion, uses the close relation between entities and attributes to promote the fusion effect, and ultimately improves the quality and precision of the fused knowledge graph.
2) Manual intervention in the fusion process is reduced, a machine learning algorithm only needing a positive sample fusion expression is realized, a good fusion effect is achieved, and resource consumption is reduced.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.

Claims (10)

1. A fine-grained knowledge graph fusion method of a two-dimensional overlapped large sample data source is characterized by comprising the following steps:
s1, performing iterative entity alignment on attribute triples corresponding to a knowledge graph to obtain an entity pair set of the attribute triples, and performing multi-level classification on the entity pairs to obtain high-confidence entity pairs; taking the obtained high-confidence entity pair as a training data set of an embedded model, performing structured embedding of the embedded model by using a relation triple to obtain high-dimensional space vector representation of the entities and the relations, and setting weights for the attributes and the relations to obtain final similarity of the attributes and the relations;
s2, screening entity attributes according to the obtained similarity to obtain the final similarity of the attributes;
s3, automatically completing the knowledge characteristic fusion of high-confidence entity pairs and attributes based on a classifier model and an atomic expression algorithm obtained by machine learning training;
and S4, completing the bidirectional supervision interactive data fusion based on knowledge characteristic fusion.
2. The method for fusing the fine-grained knowledge graph of a two-dimensional overlapped large sample data source according to claim 1, wherein the step S1 specifically comprises:
s11, entity alignment is carried out on the attribute triples on the basis of an iterative model, entity matching operation is carried out on the basis of attribute values corresponding to the attributes and the attributes to obtain an entity pair set, attribute similarity matching operation is carried out on the entity pair set to obtain an attribute pair set, and entity pairs with high confidence degrees are obtained;
s12, performing structured embedding on the obtained high-confidence entity pair serving as a training data set of an embedding model by using a relation triple, describing and modeling a global structure of a knowledge graph to be fused, and finally obtaining high-dimensional space vector representation of the entity and the relation;
and S13, fusing the attribute alignment and the relationship alignment based on different weights to obtain alignment results of two dimensions of the relationship and the attribute, and obtaining the total similarity of the attribute and the relationship by adopting a linear combination mode.
3. The method according to claim 2, wherein the step S2 specifically includes:
s21, calculating the similarity between the attributes;
s22, calculating the similarity between adjacent entities;
s23, calculating the similarity of the attribute label set;
and S24, screening upper-layer concept paths of entity attributes in the knowledge graph to form path vectors, and calculating the final similarity of the attributes.
4. The method according to claim 3, wherein the step S3 specifically includes:
s31, obtaining a classifier model by utilizing machine learning training, and processing entity fusion by utilizing a two-classification method;
s32, screening attributes by using an atomic expression;
and S33, combining and using the atomic expressions to complete the knowledge characteristic fusion of the high-confidence entity pairs and the attributes.
5. The method according to claim 4, wherein the step S4 specifically includes:
S41, embedding the triples as vectors based on the TransE and PTransE algorithms to complete the training of a single knowledge graph;
and S42, remapping the high-dimensional space vectors of the processed entities and relations into a low-dimensional space, and forming constraints on the respective entity and relation vectors during the mapping, to complete the bidirectional supervised interactive data fusion.
6. The method according to claim 2, wherein the step S11 specifically includes:
s111, setting uniform weight for the public attributes during attribute alignment, and calculating the similarity between entities, wherein the similarity is expressed as:
Figure FDA0003445393860000021
wherein $Sim_A(e_1, e_2)$ denotes the similarity between entity $e_1$ and entity $e_2$;
Figure FDA0003445393860000022
denotes the k-th attribute of entity $e_1$ that is common to both entities;
Figure FDA0003445393860000023
denotes the k-th attribute of entity $e_2$ that is common to both entities; n denotes the total number of attributes common to the two entities; $Sim_v$ denotes the similarity between the two attribute values
Figure FDA0003445393860000024
and
Figure FDA0003445393860000025
and is expressed as:
Figure FDA0003445393860000026
wherein levenshteinSim denotes the similarity calculated from the Levenshtein distance, and lcsSim denotes the similarity calculated from the longest common substring of the two character strings;
s112, searching potential aligned attribute pairs according to the aligned entity pairs, wherein the potential aligned attribute pairs are expressed as follows:
Figure FDA0003445393860000027
wherein
Figure FDA0003445393860000028
denotes the similarity of the following attribute pair:
Figure FDA0003445393860000029
Figure FDA00034453938600000210
denotes the number of elements in the finite set of entities;
Figure FDA0003445393860000031
denotes the similarity between attribute values.
7. The fine-grained knowledge graph fusion method of a two-dimensional overlapped large sample data source according to claim 6, wherein the total similarity of attributes and relations obtained in step S13 is expressed as:
$$Sim(E_i, E_j) = \lambda \times sim_R(e_i, e_j) + (1 - \lambda) \times sim_A(e_i, e_j)$$
wherein $sim_R$ denotes the similarity obtained from the relation triples; $sim_A$ denotes the similarity obtained from the attribute triples; $\lambda$ denotes a weight; and $Sim(E_i, E_j)$ denotes the overall similarity.
8. The fine-grained knowledge graph fusion method of a two-dimensional overlapped large sample data source according to claim 3, wherein the similarity between the attributes is calculated in step S21 as:
$$Sim_{property} = \operatorname{COS}(Name_{property1}, Name_{property2})$$
wherein $Sim_{property}$ denotes the similarity of the two attributes property1 and property2 at the attribute-name level, and $Name_{property1}$ and $Name_{property2}$ denote their respective high-dimensional space vector representations.
9. The fine-grained knowledge graph fusion method of a two-dimensional overlapped large sample data source according to claim 8, wherein the similarity between adjacent entities is calculated in step S22 as:
$$Sim_{entity} = \frac{|entityList1 \cap entityList2|}{|entityList1 \cup entityList2|}$$
wherein $Sim_{entity}$ denotes the similarity of the two sets of adjacent entities, and entityList1 and entityList2 denote the finite sets of entities adjacent to property1 and property2, respectively.
10. The fine-grained knowledge graph fusion method of a two-dimensional overlapped large sample data source according to claim 9, wherein the similarity of the attribute label sets is calculated in step S23 as:
$$Sim_{label} = \operatorname{COS}(label_{property1}, label_{property2})$$
wherein $Sim_{label}$ denotes the similarity of the finite label sets of the attributes;
in step S24, the final similarity of the attributes is calculated, and is expressed as:
$$Sim_{con} = \operatorname{COS}(concept_{property1}, concept_{property2})$$
wherein $Sim_{con}$ denotes the upper-level concept similarity of property1 and property2, and $concept_{property1}$ and $concept_{property2}$ denote the upper-level concept path labels of the two attributes, respectively.
CN202111646665.6A 2021-12-30 2021-12-30 Fine-grained knowledge graph fusion method for two-dimensional overlapped large sample data source Pending CN114547323A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111646665.6A CN114547323A (en) 2021-12-30 2021-12-30 Fine-grained knowledge graph fusion method for two-dimensional overlapped large sample data source

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111646665.6A CN114547323A (en) 2021-12-30 2021-12-30 Fine-grained knowledge graph fusion method for two-dimensional overlapped large sample data source

Publications (1)

Publication Number Publication Date
CN114547323A true CN114547323A (en) 2022-05-27

Family

ID=81670115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111646665.6A Pending CN114547323A (en) 2021-12-30 2021-12-30 Fine-grained knowledge graph fusion method for two-dimensional overlapped large sample data source

Country Status (1)

Country Link
CN (1) CN114547323A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116090360A (en) * 2023-04-12 2023-05-09 安徽思高智能科技有限公司 RPA flow recommendation method based on multi-modal entity alignment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination