WO2022222226A1 - Structured-information-based relation alignment method and apparatus, and device and medium

Info

Publication number: WO2022222226A1
Application number: PCT/CN2021/096584
Authority: WO
Inventors: 程华东; 李剑锋; 陈又新
Original assignee: 平安科技（深圳）有限公司
Priority date: 2021-04-19
Filing date: 2021-05-28
Publication date: 2022-10-27
Also published as: CN113158668B; CN113158668A

Abstract

A structured-information-based relation alignment method, comprising: constructing a triplet corpus, wherein the triplet corpus comprises a plurality of triplets (S101); acquiring a relation vector representation corresponding to each triplet in the triplet corpus (S102); according to the relation vector representations, constructing relation sets corresponding to head entities in the triplets, and acquiring, from the relation sets, a relation mutually exclusive set having the maximum range (S103); performing clustering analysis on the triplets in the triplet corpus according to relation vector features, and performing clustering cluster merging according to the relation mutually exclusive set, so as to obtain several clusters (S104); and for each cluster, selecting a relation feature vector, which occurs the most frequently in the cluster, as a target relation feature vector, and modifying the relation feature vectors of all triplets in the cluster to be the target relation feature vector (S105). By means of the method, the problems in the prior art of the low accuracy of relation representation and the low precision of relation alignment during the construction of a knowledge graph are solved.

Description

Structured-information-based relationship alignment method, apparatus, device and medium

This application claims priority to Chinese patent application No. 202110420316.6, filed on April 19, 2021 and entitled "Relational Alignment Method, Apparatus, Equipment and Medium Based on Structured Information", the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the field of information technology, and in particular, to a relationship alignment method, apparatus, device and medium based on structured information.

Background

Building a knowledge graph requires a complete knowledge system. The knowledge system can be established manually, or it can be established by computer-based data analysis. Existing Internet knowledge sources, such as online encyclopedias, contain a large amount of triple knowledge. Existing technologies mainly use the triple knowledge provided by the Internet, such as encyclopedic knowledge, to construct a knowledge system, and relationship alignment is required in this process.

The inventor realized that, in the process of relationship alignment, the structured information provided by the Internet may be restored to unstructured information and the relationships then aligned according to the entities in that unstructured information. However, the restored unstructured information is extremely short and lacks the context of the relationship, and the structured information was originally edited by humans, with a certain degree of subjectivity and human error. As a result, the accuracy of relationship representation is low, which reduces the accuracy of relationship alignment. Moreover, when the relationship vector representations are clustered to eliminate ambiguity, the semantic information contained in relationship vector representations obtained from structured information is limited, so the relationship clustering performs poorly and the accuracy of relationship alignment is low.

Summary

The embodiments of the present application provide a relationship alignment method, apparatus, device, and medium based on structured information, so as to solve the problems of low relationship representation accuracy and low relationship alignment accuracy in the prior art when building a knowledge graph.

A relational alignment method based on structured information, including:

constructing a triple corpus, wherein the triple corpus includes several triples;

obtaining the relation vector representation corresponding to each triple in the triple corpus;

constructing a relationship set corresponding to the head entity in the triplet according to the relationship vector representation, and obtaining a relationship mutual exclusion set with the largest range from the relationship set;

performing cluster analysis on the triples in the triple corpus according to the relationship vector feature, and merging clusters according to the relationship mutual exclusion set to obtain several clusters; and

for each cluster, selecting the relationship feature vector with the highest frequency in the cluster as the target relationship feature vector, and modifying the relationship feature vectors of all triples in the cluster to the target relationship feature vector.
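The last step amounts to a majority vote inside each cluster. The following is a minimal sketch of that idea, not the applicant's implementation. It assumes relations are stored as hashable labels (a relationship feature vector could be stored as a tuple), that every cluster is non-empty, and that `clusters` maps a hypothetical cluster id to the triples assigned to it; the function name and sample data are illustrative only.

```python
from collections import Counter

def normalize_clusters(clusters):
    """Replace every relation in a cluster with the cluster's most frequent relation.

    `clusters` maps a cluster id to a non-empty list of (head, relation, tail) triples.
    """
    aligned = {}
    for cluster_id, triples in clusters.items():
        # Count how often each relation occurs in this cluster.
        counts = Counter(relation for _, relation, _ in triples)
        target, _ = counts.most_common(1)[0]
        # Rewrite all triples of the cluster to use the target relation.
        aligned[cluster_id] = [(head, target, tail) for head, _, tail in triples]
    return aligned

# Hypothetical cluster in which two surface forms express the same relation.
clusters = {
    0: [
        ("A", "birthplace", "X"),
        ("B", "place of birth", "Y"),
        ("C", "birthplace", "Z"),
    ],
}
aligned = normalize_clusters(clusters)
```

Here "birthplace" occurs twice and "place of birth" once, so all three triples end up with the relation "birthplace".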

Optionally, the obtaining the relation vector representation corresponding to each triple in the triple corpus includes:

For the triples in the triplet corpus, query the preset word vector list to obtain the first vector representation of the relationship corresponding to the triplet;

obtaining the head entity vector representation and the tail entity vector representation corresponding to the triplet, and constructing a second vector representation of the relationship corresponding to the triplet according to the head entity vector representation and the tail entity vector representation;

The first vector representation and the second vector representation are spliced to obtain the relation vector representation corresponding to the triplet.

Optionally, the obtaining the head entity vector representation and the tail entity vector representation corresponding to the triplet, and constructing a second vector representation of the relationship corresponding to the triplet according to the head entity vector representation and the tail entity vector representation includes:

Obtain the entity type and category information of the head entity corresponding to the triplet, query the word vector list, and obtain the entity type vector representation and the category information vector representation corresponding to the head entity;

Obtain the average value between the entity type vector representation and the category information vector representation corresponding to the head entity, as the head entity vector representation;

Obtain the entity type and category information of the tail entity corresponding to the triplet, query the word vector list, and obtain the entity type vector representation and the category information vector representation corresponding to the tail entity;

Obtain the average value between the entity type vector representation and the category information vector representation corresponding to the tail entity, as the tail entity vector representation;

The difference between the head entity vector representation and the tail entity vector representation is obtained as the second vector representation of the relationship corresponding to the triplet.

For the triples in the triplet corpus, query the preset word vector list to obtain the first vector representation of the attributes corresponding to the triples;

obtaining the head entity vector representation and the attribute value vector representation corresponding to the triplet, and constructing a second vector representation of the attribute corresponding to the triplet according to the head entity vector representation and the attribute value vector representation;

Optionally, the obtaining the head entity vector representation and the attribute value vector representation corresponding to the triplet, and constructing a second vector representation of the attribute corresponding to the triplet according to the head entity vector representation and the attribute value vector representation includes:

Perform word segmentation processing on the attribute value, query a preset word vector list according to the word segmentation result, and obtain the word segmentation vector representation corresponding to each word segmentation;

Obtain the average value between the word segmentation vector representations as the attribute value vector representation;

The difference between the head entity vector representation and the attribute value vector representation is obtained as the second vector representation of the attribute corresponding to the triplet.

Optionally, the constructing a relationship set corresponding to the head entity in the triplet according to the relationship vector representation, and obtaining the relationship mutual exclusion set with the largest range from the relationship set includes:

Acquire triples with the same head entity and their corresponding relationship vector representations, perform deduplication processing on the relationship vector representations, and obtain a relationship set corresponding to the head entity;

Inclusion filtering is performed on the relationship sets corresponding to all head entities to obtain the relationship mutual exclusion set with the largest range.
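One possible reading of inclusion filtering is to discard every relationship set that is strictly contained in another, so that only the widest sets remain. The sketch below illustrates that reading under stated assumptions: relations are represented by hashable labels rather than vectors, and the function name `inclusion_filter` and the sample sets are hypothetical.

```python
def inclusion_filter(relation_sets):
    """Keep only the relation sets that are not strictly contained in another set."""
    # frozenset makes the sets hashable and removes exact duplicates.
    unique = {frozenset(s) for s in relation_sets}
    # `s < other` is True when s is a proper subset of other.
    return [s for s in unique if not any(s < other for other in unique)]

relation_sets = [
    {"nationality", "date of birth"},
    {"nationality", "date of birth", "teacher"},
    {"founder", "location"},
]
maximal = inclusion_filter(relation_sets)
```

The first set is contained in the second and is dropped; the second and third sets neither contain nor are contained in each other, so both are kept.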

Optionally, the performing cluster analysis on the triples in the triple corpus according to the relationship vector feature, and merging clusters according to the relationship mutual exclusion set to obtain several clusters includes:

Using a preset algorithm to perform cluster analysis on the representation of the relation vector corresponding to the triples in the triplet corpus;

In the cluster analysis process, for the two clusters to be merged, determine whether there is at least one relationship vector representation in the two clusters that simultaneously exists in the same mutually exclusive set of relationships;

If so, do not merge the two clusters, otherwise merge the two clusters.
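The merge test above can be sketched as a small predicate that a clustering loop would call before merging two candidate clusters. This is an illustrative reading, not the applicant's code: relations are again assumed to be hashable labels, a merge is blocked only when two different relations, one from each cluster, appear in the same mutual exclusion set, and the name `can_merge` is hypothetical.

```python
def can_merge(cluster_a, cluster_b, exclusion_sets):
    """Return False if a relation from each cluster co-occurs in one exclusion set."""
    rels_a, rels_b = set(cluster_a), set(cluster_b)
    for exclusive in exclusion_sets:
        hits_a = rels_a & exclusive
        hits_b = rels_b & exclusive
        # Two distinct relations that belong to the same head entity's
        # relation set are treated as mutually exclusive.
        if any(a != b for a in hits_a for b in hits_b):
            return False
    return True

exclusion_sets = [frozenset({"nationality", "date of birth"})]
allowed = can_merge({"nationality"}, {"country"}, exclusion_sets)
blocked = can_merge({"nationality"}, {"date of birth"}, exclusion_sets)
```

In this toy case "nationality" and "country" may be merged, while "nationality" and "date of birth" may not, because they occur together in one mutual exclusion set.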

A relationship alignment device based on structured information, comprising:

a corpus building module for constructing a triple corpus, where the triple corpus includes several triples;

a relationship acquisition module, used for acquiring the relationship vector representation corresponding to each triple in the triple corpus;

A mutual exclusion set acquisition module, configured to construct a relationship set corresponding to the head entity in the triplet according to the relationship vector representation, and obtain a relationship mutual exclusion set with the largest scope from the relationship set;

a clustering and merging module, configured to perform cluster analysis on the triples in the triplet corpus according to the relationship vector feature, and merge clusters according to the relationship mutual exclusion set to obtain several clusters;

a correction module, configured to, for each cluster, select the relationship feature vector with the highest frequency in the cluster as the target relationship feature vector, and modify the relationship feature vectors of all triples in the cluster to the target relationship feature vector.

A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer-readable instructions:

constructing a triple corpus, the triple corpus includes several triples;

One or more readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:

constructing a triple corpus, the triple corpus includes several triples;

In the embodiments of the present application, a triple corpus including several triples is constructed; the relationship vector representation corresponding to each triple in the triple corpus is obtained; a relationship set corresponding to the head entity in the triplet is constructed according to the relationship vector representation, and the relationship mutual exclusion set with the largest range is obtained from the relationship set; cluster analysis is performed on the triples in the triple corpus according to the relationship vector feature, and clusters are merged according to the relationship mutual exclusion set to obtain several clusters; and, for each cluster, the relationship feature vector with the highest frequency in the cluster is selected as the target relationship feature vector, and the relationship feature vectors of all triples in the cluster are modified to the target relationship feature vector, thereby improving the accuracy and practicality of relationship alignment.

The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below, and other features and advantages of the application will become apparent from the description, drawings, and claims.

Description of drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the following briefly introduces the drawings that are used in the description of the embodiments of the present application. Obviously, the drawings in the following description show only some embodiments of the present application; for those of ordinary skill in the art, other drawings can also be obtained from these drawings without creative labor.

FIG. 1 is a flowchart of a relationship alignment method based on structured information in an embodiment of the present application;

FIG. 2 is a flowchart of step S102 in the structured information-based relationship alignment method in an embodiment of the present application;

FIG. 3 is a flowchart of step S202 in the structured information-based relationship alignment method according to an embodiment of the present application;

FIG. 4 is a flowchart of step S102 in the structured information-based relationship alignment method in another embodiment of the present application;

FIG. 5 is a flowchart of step S402 in the structured information-based relationship alignment method in another embodiment of the present application;

FIG. 6 is a flowchart of step S103 in the structured information-based relationship alignment method according to an embodiment of the present application;

FIG. 7 is a flowchart of step S104 in the structured information-based relationship alignment method in an embodiment of the present application;

FIG. 8 is a schematic block diagram of an apparatus for relationship alignment based on structured information in an embodiment of the present application;

FIG. 9 is a schematic diagram of a computer device in an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, not all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present application.

This embodiment provides a relationship alignment method based on structured information. The structured information-based relationship alignment method provided in this embodiment will be described in detail below. As shown in FIG. 1 , the structured information-based relationship alignment method includes:

In step S101, a triple corpus is constructed, and the triple corpus includes several triples.

Here, in the embodiment of the present application, the triplet knowledge is obtained by parsing the content of the infobox from the Internet webpage, or the triplet knowledge is obtained from the knowledge graph of the open domain. Wherein, the Internet webpage includes but is not limited to Baidu Encyclopedia and Wikipedia.

According to the Resource Description Framework (RDF), any complex semantics can be expressed by a combination of several triples. In this embodiment of the present application, the triples are of two types: "entity-relationship-entity" and "entity-attribute-attribute value". For example, the triple (Tan, nationality, China) belongs to the "entity-relationship-entity" type, and the triple (Tanluan, date of birth, 476) belongs to the "entity-attribute-attribute value" type.

In step S102, a relation vector representation corresponding to each triple in the triple corpus is obtained.

Here, the relationship vector representation refers to the relationship or attribute in the triplet expressed in the form of a vector. Different from the prior art, the embodiment of the present application combines the representation of the relationship or attribute of the triplet learned from massive data with the relationship representation derived from the context of the triplet itself to obtain the relationship vector representation corresponding to each triplet, thereby improving the accuracy of the relationship representation.

Optionally, as a preferred example of the present application, when the triplet is of the type "entity-relationship-entity", as shown in FIG. 2, the obtaining of the relation vector representation corresponding to each triple in the triple corpus in step S102 includes:

In step S201, for the triples in the triplet corpus, a preset word vector list is queried to obtain a first vector representation of the relationship corresponding to the triples.

The embodiment of the present application first traverses each triplet in the triplet corpus and, according to the relation word in the triplet, queries a preset word vector list to obtain the vector representation of the relation word, which is used as the first vector representation of the relationship corresponding to the triplet, that is, the basic representation of the relationship corresponding to the triplet in this embodiment. The preset word vector list may be the list of 8 million word vectors open-sourced by Tencent. Compared with the prior art, which constructs the representation of a relationship from unstructured data, the embodiment of the present application looks up the vector representation of the relation word of the triplet in an existing word vector list, which helps to speed up the acquisition of the relationship representation of the triplet.

Optionally, because the word vectors contained in the preset word vector list are limited, if the vector representation of the relation word in the triplet cannot be obtained from the preset word vector list, the relation word is segmented, the preset word vector list is queried to obtain the vector corresponding to each segment, and the vectors corresponding to all segments are summed and averaged; the average is used as the basic representation of the relationship corresponding to the triplet.
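The lookup with its segmentation fallback can be sketched as follows. This is a minimal illustration under several assumptions: the dictionary `word_vectors` stands in for the preset word vector list, `segment` is any tokenizer supplied by the caller, vectors are plain Python lists, and segments that are themselves missing from the list are skipped (the text does not say how such segments are handled). All names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]

def first_vector(word, word_vectors, segment):
    """Look up `word`; if it is missing, average the vectors of its segments."""
    if word in word_vectors:
        return word_vectors[word]
    # Fallback: segment the word and keep only segments found in the list.
    parts = [word_vectors[p] for p in segment(word) if p in word_vectors]
    if not parts:
        raise KeyError(f"no vector available for {word!r}")
    return mean_vector(parts)

# Toy 2-dimensional table; a real word vector list has far more dimensions.
word_vectors = {"birth": [1.0, 0.0], "date": [0.0, 1.0]}
vec = first_vector("birth date", word_vectors, str.split)
```

Because "birth date" is not in the toy table, it is split into "birth" and "date" and the two vectors are averaged, giving `[0.5, 0.5]`.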

In step S202, a head entity vector representation and a tail entity vector representation corresponding to the triplet are obtained, and a second vector representation of the relationship corresponding to the triplet is constructed according to the head entity vector representation and the tail entity vector representation.

Since the preset word vector list is also edited manually, it carries a certain degree of subjectivity and error, so the first vector representation of the relationship corresponding to the triplet obtained by querying the preset word vector list will also be biased. The embodiment of the present application therefore further corrects the first vector representation based on the head entity vector representation and the tail entity vector representation of the triplet. Optionally, as a preferred example of the present application, as shown in FIG. 3, step S202 further includes:

In step S301, the entity type and category information of the head entity corresponding to the triplet are obtained, the word vector list is queried, and the entity type vector representation and the category information vector representation corresponding to the head entity are obtained.

Here, according to the application scenario, entities can be given a first-level division into persons, locations, items, and so on, to obtain several types; the entity type refers to the first-level type to which the head entity belongs in the application scenario. Each entity type can be further divided at a second level to obtain several pieces of category information. For ease of understanding, Table 1 shows, by way of example, the entity types obtained by the first-level division of a Buddhist scenario and the category information obtained by the second-level division of those entity types, as provided in this embodiment of the present application.

Table 1

In this embodiment of the present application, a corresponding entity type and category information table is obtained according to the application scenario, and the entity type and category information of the head entity corresponding to the triplet are determined according to the table; the word vector list is then queried to obtain the entity type vector representation and the category information vector representation corresponding to the head entity, respectively. Optionally, the word vector list may be the aforementioned list of 8 million word vectors open-sourced by Tencent.

In step S302, an average value between the entity type vector representation corresponding to the head entity and the category information vector representation is obtained as the head entity vector representation.

After obtaining the entity type vector representation and the category information vector representation corresponding to the head entity, the embodiment of the present application calculates the average of the two, so as to obtain the vector representation of the head entity.

To facilitate understanding, the calculation of the vector representation of the head entity "Virtuous Dharma Venerable" is given below. Based on the division in Table 1, the entity type corresponding to "Virtuous Dharma Venerable" is "person" and the category information is "Buddha". The word vector list is then queried according to the entity type "person" and the category information "Buddha": the vector representation vec_1 of "person" is obtained as the entity type vector representation of "Virtuous Dharma Venerable", and the vector representation vec_2 of "Buddha" is obtained as its category information vector representation. The vector representation v_head of "Virtuous Dharma Venerable" as the head entity is the average of the vector representation vec_1 of "person" and the vector representation vec_2 of "Buddha", that is, v_head=(vec_1+vec_2)/2.

In step S303, the entity type and category information of the tail entity corresponding to the triplet are obtained, the word vector list is queried, and the entity type vector representation and the category information vector representation corresponding to the tail entity are obtained.

In step S304, an average value between the entity type vector representation corresponding to the tail entity and the category information vector representation is obtained as the tail entity vector representation.

Here, the process of obtaining and calculating the tail entity vector representation v_tail is the same as that of the head entity vector v_head. For details, please refer to the descriptions of the above steps S301 to S302, which will not be repeated here.

In step S305, the difference between the head entity vector representation and the tail entity vector representation is obtained as the second vector representation of the relationship corresponding to the triplet.

According to the classic model TransE of triple representation and its variants, the vector representation of relation v_relation=v_head-v_tail. In this embodiment of the present application, the difference between the head entity vector representation and the tail entity vector representation is calculated as the second vector representation of the relationship corresponding to the triplet.

Here, the second vector representation of the relationship corresponding to the triplet is obtained based on the contextual relationship of the triplet itself. To a certain extent it reflects the transformation between the head entity and the tail entity in the triplet, and thus the relationship in the triplet.

In step S203, the first vector representation and the second vector representation are spliced to obtain a relationship vector representation corresponding to the triplet.

Here, in this embodiment of the present application, the first vector representation is used as the basic vector of a triplet of the type "entity-relationship-entity", and the second vector representation is used as its correction vector. The first vector representation and the second vector representation are spliced together, and the resulting combination is used as the relationship vector representation corresponding to the triplet of the type "entity-relationship-entity". If the length of the first vector representation is len1 and the length of the second vector representation is len2, the length of the relationship vector representation obtained after splicing is len1+len2.
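Steps S301 to S305 and S203 can be traced end to end with toy numbers. The sketch below is illustrative only: vectors are plain Python lists, the 2-dimensional values are made up, and the tail entity's type "location" and category "country" are hypothetical stand-ins, since the contents of Table 1 are not reproduced here.

```python
def average(u, v):
    """Element-wise average of two equal-length vectors."""
    return [(a + b) / 2 for a, b in zip(u, v)]

def entity_vector(entity_type, category, word_vectors):
    """S301-S304: average of the entity type vector and the category vector."""
    return average(word_vectors[entity_type], word_vectors[category])

def relation_vector(first, v_head, v_tail):
    """S305 and S203: append the difference v_head - v_tail to the first vector."""
    second = [h - t for h, t in zip(v_head, v_tail)]
    return first + second  # list concatenation, length len1 + len2

# Made-up 2-dimensional vectors standing in for the preset word vector list.
word_vectors = {
    "person": [1.0, 3.0], "Buddha": [3.0, 1.0],
    "location": [0.0, 2.0], "country": [2.0, 0.0],
    "nationality": [0.5, 0.5],
}
v_head = entity_vector("person", "Buddha", word_vectors)
v_tail = entity_vector("location", "country", word_vectors)
v_rel = relation_vector(word_vectors["nationality"], v_head, v_tail)
```

With these values v_head is `[2.0, 2.0]`, v_tail is `[1.0, 1.0]`, and the spliced result is `[0.5, 0.5, 1.0, 1.0]`, whose length 4 equals len1+len2.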

Optionally, as another preferred example of the present application, when the triplet is of the type "entity-attribute-attribute value", as shown in FIG. 4, the obtaining of the relation vector representation corresponding to each triple in the triple corpus in step S102 includes:

In step S401, for triples in the triplet corpus, a preset word vector list is queried to obtain a first vector representation of attributes corresponding to the triples.

The embodiment of the present application first traverses each triple in the triple corpus and, according to the attribute word in the triple, queries a preset word vector list to obtain the vector representation of the attribute word, which is used as the first vector representation of the attribute corresponding to the triple, that is, the basic representation of the attribute corresponding to the triple in this embodiment. The preset word vector list may be the list of 8 million word vectors open-sourced by Tencent. Compared with the prior art, which constructs attribute representations from unstructured data, the embodiment of the present application looks up the vector representation of the attribute word of the triple in an existing word vector list, which helps to speed up the acquisition of the attribute representation of the triple.

Optionally, because the word vectors contained in the preset word vector list are limited, if the vector representation of the attribute word in the triplet cannot be obtained from the preset word vector list, the attribute word is segmented, the preset word vector list is queried to obtain the vector corresponding to each segment, and the vectors corresponding to all segments are summed and averaged; the average is used as the basic representation of the attribute corresponding to the triplet.

In step S402, the head entity vector representation and the attribute value vector representation corresponding to the triplet are obtained, and a second vector representation of the attribute corresponding to the triplet is constructed according to the head entity vector representation and the attribute value vector representation.

Since the preset word vector list is also edited manually, it carries a certain degree of subjectivity and error, so the first vector representation of the attribute corresponding to the triplet obtained by querying the preset word vector list will also be biased. The embodiment of the present application therefore further corrects the first vector representation based on the head entity vector representation and the attribute value vector representation of the triplet. Optionally, as a preferred example of the present application, as shown in FIG. 5, step S402 further includes:

In step S501, the entity type and category information of the head entity corresponding to the triplet are obtained, the word vector list is queried, and the entity type vector representation and the category information vector representation corresponding to the head entity are obtained.

In step S502, an average value between the entity type vector representation corresponding to the head entity and the category information vector representation is obtained as the head entity vector representation.

Here, steps S501 to S502 are the same as the above-mentioned steps S301 to S302. For details, please refer to the descriptions of the above-mentioned steps S301 to S302, which will not be repeated here.

In step S503, a word segmentation process is performed on the attribute value, a preset word vector list is queried according to the word segmentation result, and a word segmentation vector representation corresponding to each word segmentation is obtained.

For the attribute value, the embodiment of the present application performs word segmentation on the attribute value in the triplet to obtain several segments, and then queries a preset word vector list to obtain the vector representation corresponding to each segment. Optionally, the preset word vector list may be the list of 8 million word vectors open-sourced by Tencent, as described above. Word segmentation can be performed by calling the jieba word segmentation tool.

In step S504, an average value between the word segmentation vector representations is obtained as the attribute value vector representation.

After word segmentation, in this embodiment of the present application, an average value is obtained for all word segmentation vector representations corresponding to the attribute values, as the attribute value vector representation corresponding to the triplet.

In step S505, the difference between the head entity vector representation and the attribute value vector representation is obtained as the second vector representation of the attribute corresponding to the triplet.

Similar to the principle of step S305, in this embodiment of the present application, the difference between the head entity vector representation and the attribute value vector representation is calculated as the second vector representation of the attribute corresponding to the triplet.

Here, the second vector representation of the attribute corresponding to the triplet is obtained based on the contextual relationship of the triplet itself. To a certain extent it reflects the transformation between the head entity and the attribute value in the triplet, and thus the attribute in the triplet.

In step S403, the first vector representation and the second vector representation are spliced to obtain a relationship vector representation corresponding to the triplet.

Here, in this embodiment of the present application, the first vector representation is used as the basic vector of a triplet of the type "entity-attribute-attribute value", and the second vector representation is used as its correction vector. The first vector representation and the second vector representation are spliced together, and the resulting combination is used as the relationship vector representation corresponding to the triplet of the type "entity-attribute-attribute value". If the length of the first vector representation is len1 and the length of the second vector representation is len2, the length of the relationship vector representation obtained after splicing is len1+len2.
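For an "entity-attribute-attribute value" triplet, steps S503 to S505 and S403 can be sketched in the same style. The code below is a hedged illustration: `segment` is any tokenizer supplied by the caller (for Chinese text this could be a jieba call such as `jieba.lcut`), segments missing from the word vector list are skipped as an assumption, and the tokens and 2-dimensional vectors are made up.

```python
def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]

def attribute_vector(first, v_head, attribute_value, word_vectors, segment):
    """S503-S505 and S403: splice the first vector with v_head - v_value."""
    # S503: segment the attribute value and look up each segment.
    found = [word_vectors[t] for t in segment(attribute_value) if t in word_vectors]
    if not found:
        raise ValueError("no segment of the attribute value has a vector")
    # S504: the attribute value vector is the average of the segment vectors.
    v_value = mean_vector(found)
    # S505: the second vector is the difference between head and value vectors.
    second = [h - v for h, v in zip(v_head, v_value)]
    # S403: splice the first and second vectors.
    return first + second

word_vectors = {"476": [1.0, 1.0], "year": [3.0, 1.0]}
v_attr = attribute_vector([0.5, 0.5], [2.0, 2.0], "476 year", word_vectors, str.split)
```

Here the attribute value vector is `[2.0, 1.0]`, the second vector is `[0.0, 1.0]`, and the spliced result is `[0.5, 0.5, 0.0, 1.0]`.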

The relationship vector representation obtained by splicing not only considers the lexical representation of the word that serves as the relationship or attribute, but also adds the information provided by the context of the triplet in which the relationship or attribute is located. This can effectively correct the human subjectivity in the first vector representation and improve the accuracy of relationship representation.
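As a minimal sketch of the splicing described above, assuming that splicing means plain concatenation of the two vectors (the function name and the toy values are illustrative assumptions):

```python
from typing import List


def splice(first_vec: List[float], second_vec: List[float]) -> List[float]:
    """Concatenate the lexical (first) vector with the context-derived (second) vector."""
    return list(first_vec) + list(second_vec)


first_vec = [0.1, 0.2, 0.3, 0.4]  # len1 = 4
context_vec = [0.0, 1.0, 1.0]     # len2 = 3
relation_vec = splice(first_vec, context_vec)
print(len(relation_vec))  # 7, i.e. len1 + len2
```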

In step S103, a relationship set corresponding to the head entity in the triplet is constructed according to the relationship vector representation, and a relationship mutual exclusion set with the largest range is obtained from the relationship set.

Here, the relationship set refers to a set obtained by extracting the relationships or attributes in the specified triplets and performing deduplication processing. This embodiment of the present application constructs a corresponding relationship set for each head entity in the triplets. The relationship mutual exclusion set refers to a relationship set that neither contains nor is contained in any other relationship set. In this embodiment of the present application, the relationship mutual exclusion set with the largest range is obtained by performing inclusion screening on the relationship sets. Optionally, as a preferred example of the present application, as shown in FIG. 6, step S103 further includes:

In step S601, a triplet with the same head entity and its corresponding relationship vector representation are obtained, and the relationship vector representation is deduplicated to obtain a relationship set corresponding to the head entity.

This embodiment of the present application classifies the triples according to their head entities, obtains the triples with the same head entity, combines the relationship vector representations corresponding to those triples, and deduplicates identical relationship vector representations in the combination so that only one copy of each is retained. The finally obtained set is used as the relationship set corresponding to the head entity. It should be understood that, after deduplication processing, the relationships or attributes included in a relationship set are mutually exclusive and cannot be clustered together. When a triple corpus includes n different head entities, n relationship sets can be obtained correspondingly.
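The grouping and deduplication of step S601 can be illustrated with the following sketch. For readability it uses relationship labels as stand-ins for the relationship vector representations; the function name and the toy triples are assumptions for this example only.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relationship or attribute, tail or value)


def build_relation_sets(triples: Iterable[Triple]) -> Dict[str, Set[str]]:
    """Group triples by head entity; the set type removes duplicate relationships."""
    relation_sets: Dict[str, Set[str]] = defaultdict(set)
    for head, relation, _tail in triples:
        relation_sets[head].add(relation)
    return dict(relation_sets)


# Hypothetical toy corpus with two head entities.
toy_triples = [
    ("entity_a", "real name", "value_1"),
    ("entity_a", "alias", "value_2"),
    ("entity_a", "real name", "value_3"),  # duplicate relationship, kept once
    ("entity_b", "real name", "value_4"),
    ("entity_b", "era", "value_5"),
]
relation_sets = build_relation_sets(toy_triples)
print(len(relation_sets))  # 2 head entities, so 2 relationship sets
```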

In step S602, inclusion filtering is performed on the relation sets corresponding to all the head entities to obtain the relation mutual exclusion set with the largest range.

The relationship sets corresponding to different head entities may have an inclusion relationship. In this regard, this embodiment of the present application compares the relationship sets corresponding to all the head entities, performs inclusion screening on them, and merges the relationship sets that have an inclusion relationship. After several merges, the relationship mutual exclusion set with the largest range is obtained.

Exemplarily, when a triple corpus includes 4 different head entities, 4 relationship sets can be obtained correspondingly, namely ({real name, alias, era, ethnic group}, {Chinese name, foreign name, place of birth, representative works}, {real name, era}, {Chinese name, foreign name}), where the relationship set {real name, era} ⊆ the relationship set {real name, alias, era, ethnic group}, and the relationship set {Chinese name, foreign name} ⊆ the relationship set {Chinese name, foreign name, place of birth, representative works}. The relationship set {real name, era} is merged with the relationship set {real name, alias, era, ethnic group}, and the relationship set {Chinese name, foreign name} is merged with the relationship set {Chinese name, foreign name, place of birth, representative works}. Finally, two relationship mutual exclusion sets with the largest range are obtained ({real name, alias, era, ethnic group}, {Chinese name, foreign name, place of birth, representative works}).
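The inclusion screening of step S602 can be sketched as follows, under the assumption that merging a contained set into its containing set simply leaves the containing set, so that only sets not strictly contained in any other set survive. The function name is hypothetical, and the input reuses the four relationship sets of the example with "era" as the shared element.

```python
from typing import FrozenSet, Iterable, List


def maximal_relation_sets(relation_sets: Iterable[Iterable[str]]) -> List[FrozenSet[str]]:
    """Keep only the sets that are not a proper subset of another set."""
    unique = {frozenset(s) for s in relation_sets}  # also drops identical sets
    return [s for s in unique if not any(s < other for other in unique)]


example_sets = [
    {"real name", "alias", "era", "ethnic group"},
    {"Chinese name", "foreign name", "place of birth", "representative works"},
    {"real name", "era"},
    {"Chinese name", "foreign name"},
]
exclusion_sets = maximal_relation_sets(example_sets)
print(len(exclusion_sets))  # 2
```

The two surviving sets are the two four-element sets, which matches the result stated in the example.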

In step S104, cluster analysis is performed on the triples in the triple corpus according to the relationship vector feature, and cluster-cluster merging is performed according to the relationship mutually exclusive set to obtain several clusters.

This embodiment of the present application uses a preset clustering algorithm to perform cluster analysis on the relationship vector features, so that all triples with the same or similar relationships in the triple corpus are clustered into one cluster, thereby completing the relationship alignment processing. Optionally, as a preferred example of the present application, as shown in FIG. 7, step S104 further includes:

In step S701, a preset algorithm is used to perform cluster analysis on the relationship vector representations corresponding to the triples in the triplet corpus.

Optionally, in the embodiment of the present application, a semi-supervised hierarchical clustering algorithm is used to perform cluster analysis on the triples in the triplet corpus, and the obtained clusters are merged in pairs from bottom to top.

In step S702, in the cluster analysis process, for the two clusters to be merged, it is determined whether there is at least one relationship vector representation in the two clusters that simultaneously exists in the same mutually exclusive set of relationships.

In this embodiment of the present application, the relationship mutual exclusion set is integrated into the clustering model. Before the clustering model merges a pair of clusters through the hierarchical clustering algorithm, it determines, based on the relationship mutual exclusion set, whether the two clusters to be merged can be merged. As mentioned above, after deduplication processing, the relationships or attributes included in a relationship set are mutually exclusive, so the relationships or attributes included in a relationship mutual exclusion set are also mutually exclusive and cannot be clustered together. Accordingly, this embodiment of the present application determines whether there is at least one relationship vector representation in the two clusters that simultaneously exists in the same relationship mutual exclusion set.

If so, the two clusters contain mutually exclusive elements, are not the same or similar, and their relationships cannot be aligned, so step S703 is executed; otherwise, the two clusters contain no mutually exclusive elements, are the same or similar, and their relationships can be aligned, so step S704 is executed.

In step S703, the two clusters are not merged.

In step S704, the two clusters are merged.

In this embodiment of the present application, the mutually exclusive set of relationships is incorporated into the process of cluster analysis, and the added prior knowledge can effectively improve the accuracy of cluster merging in cluster analysis, and improve the accuracy and practicability of relationship alignment.
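Steps S701 to S704 can be illustrated with the following sketch of bottom-up clustering with a merge veto. Several choices here are assumptions for illustration and are not specified by the application: Euclidean distance between cluster centroids, a distance threshold as the stopping rule, relationship labels standing in for relationship vector representations, and the reading that a merge is blocked only when the two clusters hold two different relationships from the same relationship mutual exclusion set.

```python
import math
from itertools import combinations
from typing import List, Set, Tuple

Item = Tuple[str, List[float]]  # (relationship label, relationship vector)


def can_merge(labels_a: Set[str], labels_b: Set[str], exclusion_sets: List[Set[str]]) -> bool:
    """Return False if the clusters hold two distinct relationships of one exclusion set."""
    for excl in exclusion_sets:
        in_a, in_b = labels_a & excl, labels_b & excl
        if any(ra != rb for ra in in_a for rb in in_b):
            return False
    return True


def centroid(vectors: List[List[float]]) -> List[float]:
    return [sum(c) / len(vectors) for c in zip(*vectors)]


def constrained_clustering(
    items: List[Item], exclusion_sets: List[Set[str]], threshold: float
) -> List[List[str]]:
    """Merge the closest admissible pair of clusters until no pair qualifies."""
    clusters = [([label], [vec]) for label, vec in items]
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            (la, va), (lb, vb) = clusters[i], clusters[j]
            dist = math.dist(centroid(va), centroid(vb))
            if dist > threshold or not can_merge(set(la), set(lb), exclusion_sets):
                continue  # too far apart (S703-like skip) or vetoed by an exclusion set
            if best is None or dist < best[0]:
                best = (dist, i, j)
        if best is None:
            return [labels for labels, _ in clusters]
        _, i, j = best
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]


# Toy 2-dimensional vectors (illustrative values only).
toy_items = [
    ("real name", [0.0, 0.0]),
    ("original name", [0.1, 0.0]),
    ("alias", [0.5, 0.0]),
    ("birthplace", [5.0, 5.0]),
]
toy_exclusion_sets = [{"real name", "alias", "era"}]
result = constrained_clustering(toy_items, toy_exclusion_sets, threshold=1.0)
print(sorted(sorted(c) for c in result))
# [['alias'], ['birthplace'], ['original name', 'real name']]
```

In this toy run "alias" lies within the threshold of the cluster containing "real name", but the merge is refused because both relationships belong to the same exclusion set, which is the effect of the added prior knowledge described above.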

In step S105, for each cluster, the relationship feature vector with the highest frequency in the cluster is selected as the target relationship feature vector, and the relationship feature vectors of all triples in the cluster are modified to the target relationship feature vector.

Each cluster obtained by the final clustering includes several identical or similar relationship feature vectors. This embodiment of the present application further counts and compares the frequencies of the relationship feature vectors in each cluster, and takes the relationship feature vector with the highest frequency of occurrence as the target relationship feature vector of the cluster. For each triplet in the cluster whose relationship feature vector is not the target relationship feature vector, the relationship feature vector is modified to the target relationship feature vector, so as to correct wrong or deviating relationships in the cluster. This can effectively correct errors or biases caused by human subjectivity and improve the accuracy and practicability of relationship alignment.
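The frequency-based correction of step S105 can be sketched as follows. Relationship labels again stand in for relationship feature vectors, and the function name and the toy cluster are assumptions for this example; how ties between equally frequent relationships are resolved is not specified in the application.

```python
from collections import Counter
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relationship, tail or value)


def align_cluster(cluster_triples: List[Triple]) -> List[Triple]:
    """Rewrite every triple in the cluster to use the most frequent relationship."""
    counts = Counter(relation for _head, relation, _tail in cluster_triples)
    target, _frequency = counts.most_common(1)[0]
    return [(head, target, tail) for head, _relation, tail in cluster_triples]


# Hypothetical cluster: two triples use "birthplace", one uses a variant.
toy_cluster = [
    ("entity_a", "birthplace", "value_1"),
    ("entity_b", "place of birth", "value_2"),
    ("entity_c", "birthplace", "value_3"),
]
aligned = align_cluster(toy_cluster)
print({relation for _h, relation, _t in aligned})  # {'birthplace'}
```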

It should be understood that the size of the sequence numbers of the steps in the above embodiments does not mean the sequence of execution, and the execution sequence of each process should be determined by its function and internal logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.

In one embodiment, a relationship alignment device based on structured information is provided, and the device corresponds one-to-one to the relationship alignment method based on structured information in the foregoing embodiment. As shown in FIG. 8, the structured-information-based relationship alignment device includes a corpus construction module 81, a relationship acquisition module 82, a mutual exclusion set acquisition module 83, a clustering and merging module 84, and a correction module 85. The detailed description of each functional module is as follows:

The corpus construction module 81 is used to construct a triple corpus, and the triple corpus includes several triples;

A relationship acquisition module 82, configured to acquire the relationship vector representation corresponding to each triple in the triple corpus;

Mutual exclusion set acquisition module 83, configured to construct a relationship set corresponding to the head entity in the triplet according to the relationship vector representation, and obtain a relationship mutual exclusion set with the largest scope from the relationship set;

The clustering and merging module 84 is configured to perform cluster analysis on the triples in the triplet corpus according to the relationship vector feature, and perform cluster-cluster merging according to the mutually exclusive set of relationships to obtain several clusters;

The correction module 85 is configured to, for each cluster, select the relationship feature vector with the highest frequency of occurrence in the cluster as the target relationship feature vector, and modify the relationship feature vectors of all triples in the cluster to the target relationship feature vector.

Optionally, the relationship acquisition module 82 includes:

a first vector representation acquisition unit, configured to query a preset word vector list for a triple in the triple corpus, and obtain a first vector representation of the relationship corresponding to the triple;

The second vector representation acquisition unit is configured to acquire the head entity vector representation and the tail entity vector representation corresponding to the triplet, and construct the second vector representation of the relationship corresponding to the triplet according to the head entity vector representation and the tail entity vector representation;

The splicing unit is used for splicing the first vector representation and the second vector representation to obtain the relation vector representation corresponding to the triplet.

Optionally, the second vector representation acquisition unit includes:

a first query subunit, configured to obtain entity type and category information of the head entity corresponding to the triplet, query the word vector list, and obtain the entity type vector representation and the category information vector representation corresponding to the head entity;

a first calculation subunit, used to obtain the average value between the entity type vector representation corresponding to the head entity and the category information vector representation, as the head entity vector representation;

The second query subunit is used to obtain entity type and category information of the tail entity corresponding to the triplet, query the word vector list, and obtain the entity type vector representation and the category information vector representation corresponding to the tail entity;

The second calculation subunit is used to obtain the average value between the entity type vector representation and the category information vector representation corresponding to the tail entity, as the tail entity vector representation;

The third calculation subunit is configured to obtain the difference between the head entity vector representation and the tail entity vector representation as the second vector representation of the relationship corresponding to the triplet.

Optionally, the relationship acquisition module 82 includes:

a first vector representation acquisition unit, configured to query a preset word vector list for triples in the triplet corpus, and obtain a first vector representation of attributes corresponding to the triples;

The second vector representation acquisition unit is configured to acquire the head entity vector representation and the attribute value vector representation corresponding to the triplet, and construct the second vector representation of the attribute corresponding to the triplet according to the head entity vector representation and the attribute value vector representation;

Optionally, the second vector representation acquisition unit includes:

The second query subunit is configured to perform word segmentation processing on the attribute value, query a preset word vector list according to the word segmentation result, and obtain a word segmentation vector representation corresponding to each word segmentation;

The second calculation subunit is used to obtain the average value between the representations of the word segmentation vectors as the representation of the attribute value vector;

The third calculation subunit is configured to obtain the difference between the head entity vector representation and the attribute value vector representation as the second vector representation of the attribute corresponding to the triplet.

Optionally, the mutually exclusive set acquisition module 83 includes:

a relationship set obtaining unit, configured to obtain triples with the same head entity and their corresponding relationship vector representations, and perform deduplication processing on the relationship vector representations to obtain a relationship set corresponding to the head entity;

The mutual exclusion set acquisition unit is configured to perform inclusion filtering on the relation sets corresponding to all head entities to obtain the relation mutual exclusion set with the largest range.

Optionally, the cluster merging module 84 includes:

a clustering unit, configured to use a preset algorithm to perform cluster analysis on the relationship vector representations corresponding to the triples in the triplet corpus;

a judging unit, configured to determine, in the cluster analysis process and for the two clusters to be merged, whether there is at least one relationship vector representation in the two clusters that simultaneously exists in the same relationship mutual exclusion set;

a merging processing unit, configured not to merge the two clusters when the judgment result of the judging unit is yes, and otherwise to merge the two clusters.

For the specific definition of the structured-information-based relationship alignment apparatus, reference may be made to the foregoing definition of the structured-information-based relationship alignment method, which will not be repeated here. Each module in the above-mentioned structured-information-based relationship alignment apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.

In one embodiment, a computer device is provided, and the computer device may be a server, and its internal structure diagram may be as shown in FIG. 9 . The computer device includes a processor, memory, a network interface, and a database connected by a system bus. Among them, the processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a computer storage medium and an internal memory. The computer storage medium stores an operating system, computer readable instructions and a database. The internal memory provides an environment for the execution of the operating system and computer readable instructions in the computer storage medium. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer-readable instructions, when executed by a processor, implement a structured information-based relational alignment method. The readable storage medium provided by this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.

In one embodiment, there is provided a computer apparatus comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:

constructing a triple corpus, the triple corpus includes several triples;

In one embodiment, there is provided one or more readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:

constructing a triple corpus, the triple corpus includes several triples;

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through computer-readable instructions. The computer-readable instructions can be stored in a non-volatile computer-readable storage medium or a volatile computer-readable storage medium, and, when executed, may include the processes of the foregoing method embodiments. Any reference to memory, storage, database or other medium used in the various embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Those skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the above-mentioned functional units and modules is used as an example. In practical applications, the above functions may be allocated to different functional units or modules as required, that is, the internal structure of the device may be divided into different functional units or modules to complete all or part of the functions described above.

The above-mentioned embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the above-mentioned embodiments, those of ordinary skill in the art should understand that the technical solutions described in the above embodiments may still be modified, or some technical features thereof may be equivalently replaced; such modifications or replacements do not make the essence of the corresponding technical solutions deviate from the spirit and scope of the technical solutions in the embodiments of the present application, and should be included within the scope of protection of this application.

Claims

A relational alignment method based on structured information, including:

constructing a triple corpus, the triple corpus includes several triples;

obtaining the relation vector representation corresponding to each triple in the triple corpus;

Constructing a relationship set corresponding to the head entity in the triplet according to the relationship vector representation, and obtaining a relationship mutual exclusion set with the largest range from the relationship set;

Cluster analysis is performed on the triples in the triple corpus according to the relationship vector feature, and clusters are merged according to the relationship mutual exclusion set to obtain several clusters;

For each cluster, the relationship feature vector with the highest frequency in the cluster is selected as the target relationship feature vector, and the relationship feature vectors of all triples in the cluster are modified as the target relationship feature vector.
The method for relationship alignment based on structured information according to claim 1, wherein the obtaining the relationship vector representation corresponding to each triple in the triple corpus comprises:

For the triples in the triplet corpus, query the preset word vector list to obtain the first vector representation of the relationship corresponding to the triplet;

obtaining the head entity vector representation and the tail entity vector representation corresponding to the triplet, and constructing a second vector representation of the relationship corresponding to the triplet according to the head entity vector representation and the tail entity vector representation;

The first vector representation and the second vector representation are spliced to obtain the relation vector representation corresponding to the triplet.
The relationship alignment method based on structured information according to claim 2, wherein the obtaining the head entity vector representation and the tail entity vector representation corresponding to the triplet, and constructing the second vector representation of the relationship corresponding to the triplet according to the head entity vector representation and the tail entity vector representation, comprises:

Obtain the entity type and category information of the head entity corresponding to the triplet, query the word vector list, and obtain the entity type vector representation and the category information vector representation corresponding to the head entity;

Obtain the average value between the entity type vector representation and the category information vector representation corresponding to the head entity, as the head entity vector representation;

Obtain the entity type and category information of the tail entity corresponding to the triplet, query the word vector list, and obtain the entity type vector representation and the category information vector representation corresponding to the tail entity;

Obtain the average value between the entity type vector representation and the category information vector representation corresponding to the tail entity, as the tail entity vector representation;

The difference between the head entity vector representation and the tail entity vector representation is obtained as the second vector representation of the relationship corresponding to the triplet.
The relationship alignment method based on structured information according to any one of claims 1 to 3, wherein the acquiring the relationship vector representation corresponding to each triple in the triple corpus comprises:

For the triples in the triplet corpus, query the preset word vector list to obtain the first vector representation of the attributes corresponding to the triples;

obtaining the head entity vector representation and the attribute value vector representation corresponding to the triplet, and constructing a second vector representation of the attribute corresponding to the triplet according to the head entity vector representation and the attribute value vector representation;

The first vector representation and the second vector representation are spliced to obtain the relation vector representation corresponding to the triplet.
The method for relationship alignment based on structured information according to claim 4, wherein the obtaining the head entity vector representation and the attribute value vector representation corresponding to the triplet, and constructing the second vector representation of the attribute corresponding to the triplet according to the head entity vector representation and the attribute value vector representation, comprises:

Obtain the entity type and category information of the head entity corresponding to the triplet, query the word vector list, and obtain the entity type vector representation and the category information vector representation corresponding to the head entity;

Obtain the average value between the entity type vector representation and the category information vector representation corresponding to the head entity, as the head entity vector representation;

Perform word segmentation processing on the attribute value, query a preset word vector list according to the word segmentation result, and obtain the word segmentation vector representation corresponding to each word segmentation;

Obtain the average value between the word segmentation vector representations as the attribute value vector representation;

The difference between the head entity vector representation and the attribute value vector representation is obtained as the second vector representation of the attribute corresponding to the triplet.
The relationship alignment method based on structured information according to any one of claims 1 and 5, wherein the constructing a relationship set corresponding to the head entity in the triplet according to the relationship vector representation, and obtaining a relationship mutual exclusion set with the largest range from the relationship set, comprises:

Acquire triples with the same head entity and their corresponding relationship vector representations, perform deduplication processing on the relationship vector representations, and obtain a relationship set corresponding to the head entity;

Inclusion filtering is performed on the relation sets corresponding to all head entities to obtain the relation mutual exclusion set with the largest range.
The relationship alignment method based on structured information according to claim 6, wherein the performing cluster analysis on the triples in the triple corpus according to the relationship vector feature, and merging clusters according to the relationship mutual exclusion set to obtain several clusters, comprises:

Using a preset algorithm to perform cluster analysis on the representation of the relation vector corresponding to the triples in the triplet corpus;

In the cluster analysis process, for the two clusters to be merged, determine whether there is at least one relationship vector representation in the two clusters that simultaneously exists in the same mutually exclusive set of relationships;

If so, do not merge the two clusters, otherwise merge the two clusters.
A relationship alignment device based on structured information, wherein the device comprises:

a corpus building module for constructing a triple corpus, where the triple corpus includes several triples;

a relationship acquisition module, used for acquiring the relationship vector representation corresponding to each triple in the triple corpus;

A mutual exclusion set acquisition module, configured to construct a relationship set corresponding to the head entity in the triplet according to the relationship vector representation, and obtain a relationship mutual exclusion set with the largest scope from the relationship set;

a clustering and merging module, configured to perform cluster analysis on the triples in the triplet corpus according to the relationship vector feature, and perform cluster-cluster merging according to the relationship mutually exclusive set to obtain several clusters;

a correction module, configured to, for each cluster, select the relationship feature vector with the highest frequency in the cluster as the target relationship feature vector, and modify the relationship feature vectors of all triples in the cluster to the target relationship feature vector.
A computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:

constructing a triple corpus, the triple corpus includes several triples;

obtaining the relation vector representation corresponding to each triple in the triple corpus;

Constructing a relationship set corresponding to the head entity in the triplet according to the relationship vector representation, and obtaining a relationship mutual exclusion set with the largest range from the relationship set;

Cluster analysis is performed on the triples in the triple corpus according to the relationship vector feature, and clusters are merged according to the relationship mutual exclusion set to obtain several clusters;

For each cluster, the relationship feature vector with the highest frequency in the cluster is selected as the target relationship feature vector, and the relationship feature vectors of all triples in the cluster are modified as the target relationship feature vector.
The computer device according to claim 9, wherein the obtaining the relation vector representation corresponding to each triple in the triple corpus comprises:

For the triples in the triplet corpus, query the preset word vector list to obtain the first vector representation of the relationship corresponding to the triplet;

obtaining the head entity vector representation and the tail entity vector representation corresponding to the triplet, and constructing a second vector representation of the relationship corresponding to the triplet according to the head entity vector representation and the tail entity vector representation;

The first vector representation and the second vector representation are spliced to obtain the relation vector representation corresponding to the triplet.
The computer device according to claim 10, wherein the obtaining the head entity vector representation and the tail entity vector representation corresponding to the triplet, and constructing the second vector representation of the relationship corresponding to the triplet according to the head entity vector representation and the tail entity vector representation, comprises:

Obtain the entity type and category information of the head entity corresponding to the triplet, query the word vector list, and obtain the entity type vector representation and the category information vector representation corresponding to the head entity;

Obtain the average value between the entity type vector representation and the category information vector representation corresponding to the head entity, as the head entity vector representation;

Obtain the entity type and category information of the tail entity corresponding to the triplet, query the word vector list, and obtain the entity type vector representation and the category information vector representation corresponding to the tail entity;

Obtain the average value between the entity type vector representation and the category information vector representation corresponding to the tail entity, as the tail entity vector representation;

The difference between the head entity vector representation and the tail entity vector representation is obtained as the second vector representation of the relationship corresponding to the triplet.
The computer device according to any one of claims 9 to 11, wherein the obtaining the relation vector representation corresponding to each triple in the triple corpus comprises:

For the triplets in the triplet corpus, query the preset word vector list to obtain the first vector representation of the attribute corresponding to the triplet;

obtaining the head entity vector representation and the attribute value vector representation corresponding to the triplet, and constructing a second vector representation of the attribute corresponding to the triplet according to the head entity vector representation and the attribute value vector representation;

The first vector representation and the second vector representation are spliced to obtain the relation vector representation corresponding to the triplet.
The computer device according to claim 12, wherein the obtaining of the head entity vector representation and the attribute value vector representation corresponding to the triplet, and the constructing of the second vector representation of the attribute corresponding to the triplet according to the head entity vector representation and the attribute value vector representation, comprises:

Obtain the entity type and category information of the head entity corresponding to the triplet, query the word vector list, and obtain the entity type vector representation and the category information vector representation corresponding to the head entity;

Obtain the average value between the entity type vector representation and the category information vector representation corresponding to the head entity, as the head entity vector representation;

Perform word segmentation processing on the attribute value, query a preset word vector list according to the word segmentation result, and obtain the word segmentation vector representation corresponding to each word segmentation;

Obtain the average value between the word segmentation vector representations as the attribute value vector representation;

The difference between the head entity vector representation and the attribute value vector representation is obtained as the second vector representation of the attribute corresponding to the triplet.
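For attribute triplets, the same idea can be sketched as follows. This is an illustration under stated assumptions: `WORD_VECTORS` is a hypothetical word vector list, whitespace splitting stands in for the word segmentation step (a real system handling Chinese text would use a dedicated segmenter), and the difference is taken as head minus attribute value.

```python
# Hypothetical word vector list covering the attribute, the head entity's
# type and category, and the tokens of the attribute value.
WORD_VECTORS = {
    "height": [0.25, 0.75],
    "person": [1.0, 0.0],
    "athlete": [0.0, 1.0],
    "180": [0.5, 0.0],
    "cm": [0.0, 0.5],
}


def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def attribute_vector(attribute, head, attribute_value):
    """Splice the attribute's own vector with the head/value difference."""
    first = WORD_VECTORS[attribute]
    entity_type, category = head
    head_vec = mean_vector([WORD_VECTORS[entity_type], WORD_VECTORS[category]])
    # Assumption: whitespace split as a stand-in for word segmentation.
    tokens = attribute_value.split()
    value_vec = mean_vector([WORD_VECTORS[token] for token in tokens])
    # Assumption: difference taken as head minus attribute value.
    second = [h - v for h, v in zip(head_vec, value_vec)]
    return first + second


attr_vec = attribute_vector("height", ("person", "athlete"), "180 cm")
print(attr_vec)
```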
The computer device according to any one of claims 9 to 13, wherein the constructing of the relationship set corresponding to the head entity in the triplet according to the relationship vector representation, and the obtaining of the relationship mutual exclusion set with the largest range from the relationship set, comprises:

Acquire triples with the same head entity and their corresponding relationship vector representations, perform deduplication processing on the relationship vector representations, and obtain a relationship set corresponding to the head entity;

Inclusion filtering is performed on the relationship sets corresponding to all head entities to obtain the relationship mutual exclusion set with the largest range.
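The two steps above can be sketched as follows. The sketch reads "inclusion filtering" as discarding every relationship set that is properly contained in another one, so that only the widest sets remain; that reading, the function names, and the use of relation labels in place of relation vectors are assumptions made for illustration.

```python
from collections import defaultdict


def relation_sets(triples):
    """Group relations by head entity; using a set deduplicates them."""
    by_head = defaultdict(set)
    for head, relation, _tail in triples:
        by_head[head].add(relation)
    return {head: frozenset(rels) for head, rels in by_head.items()}


def inclusion_filter(sets):
    """Keep only the sets that are not a proper subset of another set."""
    unique = set(sets)
    return [s for s in unique if not any(s < other for other in unique)]


# Relation labels stand in for relation vector representations here.
triples = [
    ("A", "born_in", "X"),
    ("A", "works_at", "Y"),
    ("A", "born_in", "X2"),      # duplicate relation for the same head
    ("B", "born_in", "Z"),
    ("C", "spouse", "W"),
    ("C", "works_at", "V"),
    ("C", "born_in", "U"),
    ("D", "capital", "T"),
    ("D", "population", "S"),
]

sets_by_head = relation_sets(triples)
maximal_sets = inclusion_filter(sets_by_head.values())
```

In this example the sets for heads A and B are contained in the set for head C and are dropped, while the set for head D survives because no other set contains it. Relations that appear together under one head entity are treated as distinct from one another, which is what makes the surviving sets usable as a constraint during cluster merging.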
One or more readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:

constructing a triplet corpus, the triplet corpus comprising several triplets;

obtaining the relation vector representation corresponding to each triple in the triple corpus;

Constructing a relationship set corresponding to the head entity in the triplet according to the relationship vector representation, and obtaining a relationship mutual exclusion set with the largest range from the relationship set;

Cluster analysis is performed on the triples in the triple corpus according to the relationship vector feature, and clusters are merged according to the relationship mutual exclusion set to obtain several clusters;

For each cluster, the relationship feature vector with the highest frequency in the cluster is selected as the target relationship feature vector, and the relationship feature vectors of all triples in the cluster are modified as the target relationship feature vector.
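The final alignment step can be sketched as a majority vote within each cluster. The sketch assumes that clustering has already been done and that each cluster is a list of (head, relation, tail) tuples; relation labels stand in for relation feature vectors, and the function name is hypothetical.

```python
from collections import Counter


def align_cluster(cluster):
    """Rewrite every triplet in the cluster to use its most frequent relation."""
    counts = Counter(relation for _head, relation, _tail in cluster)
    target, _frequency = counts.most_common(1)[0]
    return [(head, target, tail) for head, _relation, tail in cluster]


# One hypothetical cluster in which two surface forms denote the same relation.
cluster = [
    ("A", "birthplace", "X"),
    ("B", "born_in", "Y"),
    ("C", "born_in", "Z"),
]

aligned = align_cluster(cluster)
print(aligned)
```

Here "born_in" occurs twice and "birthplace" once, so all three triplets are rewritten to "born_in". How ties between equally frequent relations are resolved is not stated in the claims; this sketch simply keeps whichever `Counter.most_common` returns first.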
The readable storage medium according to claim 15, wherein the obtaining the relation vector representation corresponding to each triple in the triple corpus comprises:

For the triples in the triplet corpus, query the preset word vector list to obtain the first vector representation of the relationship corresponding to the triplet;

obtaining the head entity vector representation and the tail entity vector representation corresponding to the triplet, and constructing a second vector representation of the relationship corresponding to the triplet according to the head entity vector representation and the tail entity vector representation;

The first vector representation and the second vector representation are spliced to obtain the relation vector representation corresponding to the triplet.
The readable storage medium according to claim 16, wherein the obtaining of the head entity vector representation and the tail entity vector representation corresponding to the triplet, and the constructing of the second vector representation of the relationship corresponding to the triplet according to the head entity vector representation and the tail entity vector representation, comprises:

Obtain the entity type and category information of the head entity corresponding to the triplet, query the word vector list, and obtain the entity type vector representation and the category information vector representation corresponding to the head entity;

Obtain the average value between the entity type vector representation and the category information vector representation corresponding to the head entity, as the head entity vector representation;

Obtain the entity type and category information of the tail entity corresponding to the triplet, query the word vector list, and obtain the entity type vector representation and the category information vector representation corresponding to the tail entity;

Obtain the average value between the entity type vector representation and the category information vector representation corresponding to the tail entity, as the tail entity vector representation;

The difference between the head entity vector representation and the tail entity vector representation is obtained as the second vector representation of the relationship corresponding to the triplet.
The readable storage medium according to any one of claims 15 to 17, wherein the obtaining the relation vector representation corresponding to each triple in the triple corpus comprises:

For the triplet in the triplet corpus, query a preset word vector list to obtain the first vector representation of the attribute corresponding to the triplet;

obtaining the head entity vector representation and the attribute value vector representation corresponding to the triplet, and constructing a second vector representation of the attribute corresponding to the triplet according to the head entity vector representation and the attribute value vector representation;

The first vector representation and the second vector representation are spliced to obtain the relation vector representation corresponding to the triplet.
The readable storage medium according to claim 18, wherein the obtaining of the head entity vector representation and the attribute value vector representation corresponding to the triplet, and the constructing of the second vector representation of the attribute corresponding to the triplet according to the head entity vector representation and the attribute value vector representation, comprises:

Obtain the entity type and category information of the head entity corresponding to the triplet, query the word vector list, and obtain the entity type vector representation and the category information vector representation corresponding to the head entity;

Obtain the average value between the entity type vector representation and the category information vector representation corresponding to the head entity, as the head entity vector representation;

Perform word segmentation processing on the attribute value, query a preset word vector list according to the word segmentation result, and obtain the word segmentation vector representation corresponding to each word segmentation;

Obtain the average value between the word segmentation vector representations as the attribute value vector representation;

The difference between the head entity vector representation and the attribute value vector representation is obtained as the second vector representation of the attribute corresponding to the triplet.
The readable storage medium according to any one of claims 15 to 19, wherein the constructing of the relationship set corresponding to the head entity in the triplet according to the relationship vector representation, and the obtaining of the relationship mutual exclusion set with the largest range from the relationship set, comprises:

Acquire triples with the same head entity and their corresponding relationship vector representations, perform deduplication processing on the relationship vector representations, and obtain a relationship set corresponding to the head entity;

Inclusion filtering is performed on the relationship sets corresponding to all head entities to obtain the relationship mutual exclusion set with the largest range.