WO2022222226A1 - Structured-information-based relation alignment method and apparatus, and device and medium - Google Patents

Structured-information-based relation alignment method and apparatus, and device and medium

Info

Publication number
WO2022222226A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector representation
relationship
vector
entity
triplet
Prior art date
Application number
PCT/CN2021/096584
Other languages
French (fr)
Chinese (zh)
Inventor
程华东
李剑锋
陈又新
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2022222226A1 publication Critical patent/WO2022222226A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Definitions

  • the present application relates to the field of information technology, and in particular, to a relationship alignment method, apparatus, device and medium based on structured information.
  • the knowledge system can be established manually, or it can be established by computer-based data analysis.
  • Existing Internet knowledge sources, such as encyclopedias, contain a large amount of triple knowledge.
  • Existing technologies mainly use triple knowledge provided by the Internet, such as encyclopedic knowledge, to construct a knowledge system, and in this process, relationship alignment is required.
  • the context of the relationship is missing, and the structured information was originally edited by humans, with a certain degree of subjectivity and human error, resulting in low representation accuracy of relationships and reduced accuracy of relationship alignment.
  • the relationship vector representation is clustered to eliminate ambiguity, because the semantic information contained in the relationship vector representation obtained through structured information is limited, the effect of relationship clustering is not good, and the accuracy of relationship alignment is low.
  • the embodiments of the present application provide a relationship alignment method, apparatus, device, and medium based on structured information, so as to solve the problems of low relationship representation accuracy and low relationship alignment accuracy in the prior art when building a knowledge graph.
  • a relational alignment method based on structured information including:
  • the triple corpus includes several triples
  • Cluster analysis is performed on the triples in the triple corpus according to the relationship vector feature, and clusters are merged according to the relationship mutual exclusion set to obtain several clusters;
  • the relationship feature vector with the highest frequency in the cluster is selected as the target relationship feature vector, and the relationship feature vectors of all triples in the cluster are modified as the target relationship feature vector.
  • the obtaining the relation vector representation corresponding to each triple in the triple corpus includes:
  • the first vector representation and the second vector representation are spliced to obtain the relation vector representation corresponding to the triplet.
  • obtaining the head entity vector representation and tail entity vector representation corresponding to the triplet, and constructing a second vector representation of the relationship corresponding to the triplet according to the head entity vector representation and the tail entity vector representation include:
  • the difference between the head entity vector representation and the tail entity vector representation is obtained as the second vector representation of the relationship corresponding to the triplet.
  • the obtaining the relation vector representation corresponding to each triple in the triple corpus includes:
  • the first vector representation and the second vector representation are spliced to obtain the relation vector representation corresponding to the triplet.
  • obtaining the head entity vector representation and attribute value vector representation corresponding to the triplet, and constructing a second vector representation of the attribute corresponding to the triplet according to the head entity vector representation and the attribute value vector representation include:
  • the difference between the head entity vector representation and the attribute value vector representation is obtained as the second vector representation of the attribute corresponding to the triplet.
  • the constructing a relationship set corresponding to the head entity in the triplet according to the relationship vector representation, and obtaining the relationship mutual exclusion set with the largest range from the relationship set includes:
  • Inclusion filtering is performed on the relation sets corresponding to all head entities to obtain the relation-exclusive set with the largest range.
  • performing cluster analysis on the triples in the triple corpus according to the relationship vector feature, and merging clusters according to the relationship mutual exclusion set to obtain several clusters, includes:
  • a relationship alignment device based on structured information comprising:
  • a corpus building module for constructing a triple corpus, where the triple corpus includes several triples
  • a relationship acquisition module used for acquiring the relationship vector representation corresponding to each triple in the triple corpus
  • a mutual exclusion set acquisition module configured to construct a relationship set corresponding to the head entity in the triplet according to the relationship vector representation, and obtain a relationship mutual exclusion set with the largest scope from the relationship set;
  • a clustering and merging module configured to perform cluster analysis on the triples in the triplet corpus according to the relationship vector feature, and perform cluster-cluster merging according to the relationship mutually exclusive set to obtain several clusters;
  • a correction module for, for each cluster, selecting the relationship feature vector with the highest frequency in the cluster as the target relationship feature vector, and modifying the relationship feature vectors of all triples in the cluster to the target relationship feature vector.
  • a computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer-readable instructions:
  • the triple corpus includes several triples
  • Cluster analysis is performed on the triples in the triple corpus according to the relationship vector feature, and clusters are merged according to the relationship mutual exclusion set to obtain several clusters;
  • the relationship feature vector with the highest frequency in the cluster is selected as the target relationship feature vector, and the relationship feature vectors of all triples in the cluster are modified as the target relationship feature vector.
  • One or more readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • the triple corpus includes several triples
  • Cluster analysis is performed on the triples in the triple corpus according to the relationship vector feature, and clusters are merged according to the relationship mutual exclusion set to obtain several clusters;
  • the relationship feature vector with the highest frequency in the cluster is selected as the target relationship feature vector, and the relationship feature vectors of all triples in the cluster are modified as the target relationship feature vector.
  • a triple corpus is constructed, and the triple corpus includes several triples; the relationship vector representation corresponding to each triple in the triple corpus is obtained; a relationship set corresponding to the head entity in the triplet is constructed according to the relationship vector representation, and the relationship mutual exclusion set with the largest range is obtained from the relationship set; cluster analysis is performed on the triples in the triple corpus according to the relationship vector feature, and clusters are merged according to the relationship mutual exclusion set to obtain several clusters; for each cluster, the relationship feature vector with the highest frequency in the cluster is selected as the target relationship feature vector, and the relationship feature vectors of all triples in the cluster are modified to the target relationship feature vector, thereby improving the accuracy and practicality of relationship alignment.
  • FIG. 1 is a flowchart of a relationship alignment method based on structured information in an embodiment of the present application
  • FIG. 2 is a flowchart of step S102 in the structured information-based relationship alignment method in an embodiment of the present application
  • FIG. 3 is a flowchart of step S202 in the structured information-based relationship alignment method according to an embodiment of the present application
  • FIG. 4 is a flowchart of step S102 in the structured information-based relationship alignment method in another embodiment of the present application.
  • FIG. 5 is a flowchart of step S402 in the structured information-based relationship alignment method in another embodiment of the present application.
  • FIG. 6 is a flowchart of step S103 in the structured information-based relationship alignment method according to an embodiment of the present application.
  • FIG. 7 is a flowchart of step S104 in the structured information-based relationship alignment method in an embodiment of the present application.
  • FIG. 8 is a schematic block diagram of an apparatus for relationship alignment based on structured information in an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a computer device in an embodiment of the present application.
  • This embodiment provides a relationship alignment method based on structured information.
  • the structured information-based relationship alignment method provided in this embodiment will be described in detail below.
  • the structured information-based relationship alignment method includes:
  • step S101 a triple corpus is constructed, and the triple corpus includes several triples.
  • the triplet knowledge is obtained by parsing the content of the infobox from the Internet webpage, or the triplet knowledge is obtained from the knowledge graph of the open domain.
  • the Internet webpage includes but is not limited to Baidu Encyclopedia and Wikipedia.
  • any complex semantics can be expressed by the combination of several triples.
  • the triples take two forms: "entity-relationship-entity" and "entity-attribute-attribute value".
  • the triple (Tan, nationality, China) belongs to the "entity-relationship-entity" type
  • the triple (Tanluan, date of birth, 476) belongs to the "entity-attribute-attribute value" type.
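  • As an illustration of the two triple forms described above, the following minimal sketch stores both kinds of triple in one structure. The class name `Triple`, its field names, and the `kind` labels are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One piece of triple knowledge (field names are illustrative assumptions)."""
    head: str       # head entity
    predicate: str  # relationship word or attribute word
    tail: str       # tail entity or attribute value
    kind: str       # "relation" for entity-relationship-entity,
                    # "attribute" for entity-attribute-attribute value


# The two example triples quoted in the text.
corpus = [
    Triple("Tan", "nationality", "China", "relation"),
    Triple("Tanluan", "date of birth", "476", "attribute"),
]

# Split the corpus by triple form.
relation_triples = [t for t in corpus if t.kind == "relation"]
attribute_triples = [t for t in corpus if t.kind == "attribute"]
```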
  • step S102 a relation vector representation corresponding to each triple in the triple corpus is obtained.
  • the relationship vector representation refers to the relationship or attribute in the triplet expressed in the form of a vector.
  • the embodiment of the present application combines the representation of the relationship or attribute corresponding to the triplet in massive data with the relationship representation derived from the context of the triplet itself to obtain the relationship vector representation corresponding to each triplet, thereby greatly improving the accuracy of the relationship representation.
  • the relation vector representation corresponding to a triple includes:
  • step S201 for the triples in the triplet corpus, a preset word vector list is queried to obtain a first vector representation of the relationship corresponding to the triples.
  • the embodiment of the present application first traverses each triplet in the triplet corpus and, according to the relation word in the triplet, queries a preset word vector list to obtain the vector representation of the relation word, which is used as the first vector representation of the relationship corresponding to the triplet; this is the basic representation of the relationship corresponding to the triplet described in this embodiment.
  • the preset word vector list may be a list of 8 million word vectors open sourced by Tencent.
  • the embodiment of the present application looks up the vector representation of the relation word in the triplet in an existing word vector list, which helps speed up the acquisition of relationship representations for the triples.
  • the word vectors contained in the preset word vector list are limited. If the vector representation of the relation word in the triplet cannot be obtained from the preset word vector list, the relation word is segmented, the preset word vector list is queried to obtain the vector corresponding to each segment, and the vectors of all segments are summed and averaged; the average is used as the basic representation of the relationship corresponding to the triplet.
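  • The lookup with a segmentation fallback described for step S201 can be sketched as follows. This is a toy illustration only: the function names, the two-dimensional vectors, and the whitespace segmenter are assumptions, and a real system would load a large pretrained word vector list and use a proper word segmenter.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def first_vector(word: str,
                 word_vectors: Dict[str, Vector],
                 segment: Callable[[str], List[str]] = str.split) -> Optional[Vector]:
    """Return the basic representation of a relation or attribute word.

    Direct lookup is tried first; if the word is missing, it is segmented
    and the vectors of the known segments are averaged.
    """
    if word in word_vectors:
        return word_vectors[word]
    pieces = [word_vectors[p] for p in segment(word) if p in word_vectors]
    return mean_vector(pieces) if pieces else None


# Hypothetical toy word vector list.
toy_vectors = {
    "nationality": [0.5, 0.5],
    "birth": [1.0, 0.0],
    "date": [0.0, 1.0],
}

direct = first_vector("nationality", toy_vectors)   # found directly
fallback = first_vector("birth date", toy_vectors)  # averaged over segments
missing = first_vector("unknown", toy_vectors)      # no segment is known
```

  • Returning `None` for a word with no known segments is a design choice of this sketch; the text does not say how that case is handled.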
  • step S202 a head entity vector representation and a tail entity vector representation corresponding to the triplet are obtained, and a second vector representation of the relationship corresponding to the triplet is constructed according to the head entity vector representation and the tail entity vector representation.
  • step S202 further includes:
  • step S301 the entity type and category information of the head entity corresponding to the triplet are obtained, the word vector list is queried, and the entity type vector representation and the category information vector representation corresponding to the head entity are obtained.
  • entities can be divided at a first level into persons, locations, items, etc., to obtain several types; the entity type refers to the first-level type to which the head entity belongs in the application scenario.
  • the entity type can also be divided into two levels to obtain several types of information.
  • Table 1 is the entity type obtained by the first-level division of the existing Buddhist scene and the category information obtained by the second-level division of the entity type provided in this embodiment of the present application.
  • a corresponding entity type and category information table is obtained according to the application scenario, and the entity type and category information of the head entity corresponding to the triplet are determined according to the table; then the word vector list is queried to obtain the entity type vector representation and the category information vector representation of the head entity, respectively.
  • the word vector list may be the aforementioned 8 million word vector list open sourced by Tencent.
  • step S302 an average value between the entity type vector representation corresponding to the head entity and the category information vector representation is obtained as the head entity vector representation.
  • the embodiment of the present application calculates the average of the entity type vector representation and the category information vector representation corresponding to the head entity, so as to obtain the head entity vector representation.
  • the following is the calculation process of the vector representation of the head entity "Virtuous Dharma Venerable". Based on the division in Table 1, the entity type corresponding to "Virtuous Dharma Venerable" is "person" and the category information is "Buddha". The word vector list is then queried according to the entity type "person" and the category information "Buddha": the vector representation vec_1 of "person" is used as the entity type vector representation of "Virtuous Dharma Venerable", and the vector representation vec_2 of "Buddha" is used as its category information vector representation.
  • step S303 the entity type and category information of the tail entity corresponding to the triplet are obtained, the word vector list is queried, and the entity type vector representation and the category information vector representation corresponding to the tail entity are obtained.
  • step S304 an average value between the entity type vector representation corresponding to the tail entity and the category information vector representation is obtained as the tail entity vector representation.
  • step S305 the difference between the head entity vector representation and the tail entity vector representation is obtained as the second vector representation of the relationship corresponding to the triplet.
  • the vector representation of the relationship is v_relation = v_head - v_tail.
  • the difference between the head entity vector representation and the tail entity vector representation is calculated as the second vector representation of the relationship corresponding to the triplet.
  • the second vector representation of the relationship corresponding to the triplet is obtained based on the contextual relationship of the triplet itself, which can reflect the transformation of the head entity and the tail entity in the triplet to a certain extent , which reflects the relationship in the triplet.
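  • Steps S301 to S305 can be sketched as below: each entity vector is the average of its entity type vector and its category information vector, and the second vector representation is v_head - v_tail. The lookup table, the toy vectors, and the category "country" are hypothetical stand-ins for Table 1 and for a real word vector list.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def average(a: Vector, b: Vector) -> Vector:
    """Element-wise average of two vectors."""
    return [(x + y) / 2 for x, y in zip(a, b)]


def entity_vector(entity: str,
                  type_table: Dict[str, Tuple[str, str]],
                  word_vectors: Dict[str, Vector]) -> Vector:
    """Average of the entity type vector and the category information vector."""
    entity_type, category = type_table[entity]
    return average(word_vectors[entity_type], word_vectors[category])


def second_vector(head: str, tail: str,
                  type_table: Dict[str, Tuple[str, str]],
                  word_vectors: Dict[str, Vector]) -> Vector:
    """Second vector representation: v_relation = v_head - v_tail."""
    v_head = entity_vector(head, type_table, word_vectors)
    v_tail = entity_vector(tail, type_table, word_vectors)
    return [h - t for h, t in zip(v_head, v_tail)]


# Hypothetical entity type / category table and toy word vectors.
type_table = {
    "Virtuous Dharma Venerable": ("person", "Buddha"),
    "China": ("location", "country"),
}
word_vectors = {
    "person": [1.0, 0.0],
    "Buddha": [0.0, 1.0],
    "location": [2.0, 2.0],
    "country": [0.0, 0.0],
}

v_head = entity_vector("Virtuous Dharma Venerable", type_table, word_vectors)
v_relation = second_vector("Virtuous Dharma Venerable", "China",
                           type_table, word_vectors)
```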
  • step S203 the first vector representation and the second vector representation are spliced to obtain a relationship vector representation corresponding to the triplet.
  • the first vector representation is used as a basic vector of the type "entity-relationship-entity”
  • the second vector representation is used as a correction vector of the type of "entity-relationship-entity”.
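  • The splicing in step S203 is read here as plain vector concatenation, which is an interpretation of "spliced"; the function name and the sample values are illustrative only.

```python
from typing import List

Vector = List[float]


def splice(first: Vector, second: Vector) -> Vector:
    """Concatenate the basic vector and the correction vector."""
    return list(first) + list(second)


basic = [0.5, 0.5]         # first vector representation (from the word vector list)
correction = [-0.5, -0.5]  # second vector representation (head minus tail)
relation_vector = splice(basic, correction)
```

  • With this reading, the dimension of the relationship vector representation is the sum of the dimensions of its two parts.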
  • step S102 of obtaining the relationship vector representation corresponding to each triple in the triple corpus includes:
  • step S401 for triples in the triplet corpus, a preset word vector list is queried to obtain a first vector representation of attributes corresponding to the triples.
  • the embodiment of the present application first traverses each triple in the triple corpus and, according to the attribute word in the triple, queries a preset word vector list to obtain the vector representation of the attribute word, which is used as the first vector representation of the attribute corresponding to the triple; this is the basic representation of the attribute corresponding to the triple described in this embodiment.
  • the preset word vector list may be a list of 8 million word vectors open sourced by Tencent.
  • the embodiments of the present application look up the vector representations of attribute words in triples in an existing word vector list, which helps speed up the acquisition of relationship representations for the triples.
  • the word vectors contained in the preset word vector list are limited. If the vector representation of the attribute word in the triplet cannot be obtained from the preset word vector list, the attribute word is segmented, the preset word vector list is queried to obtain the vector corresponding to each segment, and the vectors of all segments are summed and averaged; the average is used as the basic representation of the attribute corresponding to the triplet.
  • step S402 the head entity vector representation and the attribute value vector representation corresponding to the triplet are obtained, and a second vector representation of the attribute corresponding to the triplet is constructed according to the head entity vector representation and the attribute value vector representation.
  • step S402 further includes:
  • step S501 the entity type and category information of the head entity corresponding to the triplet are obtained, the word vector list is queried, and the entity type vector representation and the category information vector representation corresponding to the head entity are obtained.
  • step S502 an average value between the entity type vector representation corresponding to the head entity and the category information vector representation is obtained as the head entity vector representation.
  • steps S501 to S502 are the same as steps S301 to S302 above; for details, please refer to the descriptions of steps S301 to S302, which are not repeated here.
  • step S503 a word segmentation process is performed on the attribute value, a preset word vector list is queried according to the word segmentation result, and a word segmentation vector representation corresponding to each word segmentation is obtained.
  • the embodiment of the present application performs word segmentation on the attribute value in the triplet to obtain several segments, and then queries a preset word vector list to obtain the word segmentation vector representation corresponding to each segment.
  • the preset word vector list may be the 8 million word vector list open-sourced by Tencent as described above. Word segmentation can be performed by calling the jieba word segmentation tool.
  • step S504 an average value between the word segmentation vector representations is obtained as the attribute value vector representation.
  • an average value is obtained for all word segmentation vector representations corresponding to the attribute values, as the attribute value vector representation corresponding to the triplet.
  • step S505 the difference between the head entity vector representation and the attribute value vector representation is obtained as the second vector representation of the attribute corresponding to the triplet.
  • the difference between the head entity vector representation and the attribute value vector representation is calculated as the second vector representation of the attribute corresponding to the triplet.
  • the second vector representation of the attribute corresponding to the triplet is obtained based on the contextual relationship of the triplet itself, which can reflect the transformation of the head entity and the attribute value in the triplet to a certain extent , which embodies the attributes in the triplet.
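  • Steps S503 to S505 can be sketched as follows, assuming the head entity vector has already been computed. The whitespace segmenter is a stand-in for a real word segmenter; for Chinese text, a tokenizer such as jieba's `lcut` could be passed as `segment`. The toy vectors and the attribute value "476 AD" are hypothetical.

```python
from typing import Callable, Dict, List

Vector = List[float]


def attribute_value_vector(value: str,
                           word_vectors: Dict[str, Vector],
                           segment: Callable[[str], List[str]] = str.split) -> Vector:
    """Average of the word vectors of the segments of an attribute value."""
    vectors = [word_vectors[tok] for tok in segment(value) if tok in word_vectors]
    if not vectors:
        raise KeyError(f"no known segment in attribute value: {value!r}")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def attribute_second_vector(v_head: Vector, value: str,
                            word_vectors: Dict[str, Vector],
                            segment: Callable[[str], List[str]] = str.split) -> Vector:
    """Second vector representation of an attribute: v_head - v_value."""
    v_value = attribute_value_vector(value, word_vectors, segment)
    return [h - a for h, a in zip(v_head, v_value)]


# Hypothetical toy vectors for the segments of an attribute value.
toy_vectors = {"476": [1.0, 3.0], "AD": [3.0, 1.0]}

v_value = attribute_value_vector("476 AD", toy_vectors)
v_attribute = attribute_second_vector([0.5, 0.5], "476 AD", toy_vectors)
```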
  • step S403 the first vector representation and the second vector representation are spliced to obtain a relationship vector representation corresponding to the triplet.
  • the first vector representation is used as a basic vector of the type "entity-attribute-attribute value”
  • the second vector representation is used as a correction vector of the type "entity-attribute-attribute value".
  • the obtained combination is used as the relationship vector representation corresponding to the type triplet of "entity-attribute-attribute value”.
  • the relationship vector representation obtained by splicing not only considers the lexical representation of the word itself as a relationship or attribute, but also adds the representation help provided by the context of the triplet where the relationship or attribute is located, which can effectively correct the first vector Human subjectivity in representation improves the accuracy of relationship representation.
  • step S103 a relationship set corresponding to the head entity in the triplet is constructed according to the relationship vector representation, and a relationship mutual exclusion set with the largest range is obtained from the relationship set.
  • the relationship set refers to a set obtained by extracting the relationship or attribute in the specified triplet and performing deduplication processing.
  • This embodiment of the present application constructs a corresponding relationship set according to the head entity in the triplet.
  • the mutually exclusive set of relationships refers to a relationship set that neither contains nor is contained in any other relationship set.
  • a relationship set with the largest scope is obtained by performing inclusion screening on the relationship set.
  • step S103 further includes:
  • step S601 a triplet with the same head entity and its corresponding relationship vector representation are obtained, and the relationship vector representation is deduplicated to obtain a relationship set corresponding to the head entity.
  • This embodiment of the present application groups the triples by head entity to obtain the triples with the same head entity, combines the relationship vector representations corresponding to those triples, and deduplicates identical relationship vector representations in the combination so that only one of each is retained; the resulting set is used as the relationship set corresponding to the head entity. It should be understood that, after deduplication, the relationships or attributes included in the relationship set are mutually exclusive and cannot be clustered together. When a triple corpus includes n different head entities, n relationship sets are obtained correspondingly.
  • step S602 inclusion filtering is performed on the relation sets corresponding to all the head entities to obtain the relation mutual exclusion set with the largest range.
  • the relationship sets corresponding to different head entities may have an inclusion relationship.
  • the relationship sets corresponding to all the head entities are subjected to inclusion screening, and the relationship sets with the inclusion relationship are merged. After several merges, the largest relational mutual exclusion set will be obtained.
  • relation sets can be obtained, which are ({real name, alias, age, ethnic group}, {Chinese name, foreign name, place of birth, representative works}, {real name, age}, {Chinese name, foreign name}), where the relation set {real name, age} ⊂ the relation set {real name, alias, age, ethnic group}, and the relation set {Chinese name, foreign name} ⊂ the relation set {Chinese name, foreign name, place of birth, representative works}.
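  • Steps S601 and S602 can be sketched as follows. For readability the sketch uses relation names as hashable stand-ins for the relationship vector representations, and it treats inclusion filtering as keeping only the sets that are not proper subsets of another set; both choices, and the function names and head entity names, are assumptions.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple


def relation_sets_by_head(triples: Iterable[Tuple[str, str, str]]) -> Dict[str, Set[str]]:
    """Group relations by head entity; using a set deduplicates them."""
    sets: Dict[str, Set[str]] = defaultdict(set)
    for head, relation, _tail in triples:
        sets[head].add(relation)
    return dict(sets)


def maximal_sets(sets: Iterable[Set[str]]) -> List[FrozenSet[str]]:
    """Inclusion filtering: drop every set that is a proper subset of another."""
    unique = {frozenset(s) for s in sets}
    return [s for s in unique if not any(s < other for other in unique)]


# Toy triples with hypothetical head entities.
triples = [
    ("entity_a", "real name", "x"),
    ("entity_a", "age", "y"),
    ("entity_a", "real name", "x"),  # duplicate relation, removed by the set
    ("entity_b", "alias", "z"),
]
per_head = relation_sets_by_head(triples)

# The four relation sets from the example in the text.
example_sets = [
    {"real name", "alias", "age", "ethnic group"},
    {"Chinese name", "foreign name", "place of birth", "representative works"},
    {"real name", "age"},
    {"Chinese name", "foreign name"},
]
exclusion_sets = maximal_sets(example_sets)
```

  • On the example from the text, the two smaller sets are absorbed by the sets that contain them, leaving two mutual exclusion sets.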
  • step S104 cluster analysis is performed on the triples in the triple corpus according to the relationship vector feature, and cluster-cluster merging is performed according to the relationship mutually exclusive set to obtain several clusters.
  • step S104 further includes:
  • step S701 a preset algorithm is used to perform cluster analysis on the relationship vector representations corresponding to the triples in the triplet corpus.
  • a semi-supervised hierarchical clustering algorithm is used to perform cluster analysis on the triples in the triplet corpus, and the obtained clusters are merged in pairs from bottom to top.
  • step S702 in the cluster analysis process, for the two clusters to be merged, it is determined whether there is at least one relationship vector representation in the two clusters that simultaneously exists in the same mutually exclusive set of relationships.
  • the mutually exclusive set of relationships is integrated into a clustering model. Before the clustering model merges the pairs of clusters through the hierarchical clustering algorithm, it is determined whether the two clusters to be merged can be merged based on the mutually exclusive set of relationships. As mentioned above, after deduplication processing, the relationships or attributes included in the relationship set are mutually exclusive, and the relationships or attributes included in the relationship mutually exclusive set are also mutually exclusive, so clustering cannot be performed. In this embodiment of the present application, it is determined whether there is at least one relationship vector representation in the two clusters that simultaneously exists in the same maximum relationship set.
  • If so, it indicates that the two clusters have mutually exclusive elements, the two clusters are not the same or similar, and the relationships cannot be aligned, and step S703 is executed; otherwise, it indicates that the two clusters have no mutually exclusive elements, the two clusters are the same or similar, and the relationships can be aligned, and step S704 is executed.
  • step S703 the two clusters are not merged.
  • step S704 the two clusters are merged.
  • the mutually exclusive set of relationships is incorporated into the process of cluster analysis, and the added prior knowledge can effectively improve the accuracy of cluster merging in cluster analysis, and improve the accuracy and practicability of relationship alignment.
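  • The merge check of steps S701 to S704 can be sketched as a bottom-up clustering loop with a "cannot merge" test. The text does not specify the linkage or the stopping rule, so centroid distance and a distance threshold are assumptions here, as is the reading that two different relations from the same mutual exclusion set block a merge. All names and toy data are hypothetical.

```python
from itertools import combinations
from math import dist
from typing import List, Sequence, Set


def violates(cluster_a: List[int], cluster_b: List[int],
             labels: Sequence[str], exclusion_sets: Sequence[Set[str]]) -> bool:
    """True if the clusters hold two different relations from one exclusion set."""
    names_a = {labels[i] for i in cluster_a}
    names_b = {labels[i] for i in cluster_b}
    return any(x != y
               for s in exclusion_sets
               for x in names_a & s
               for y in names_b & s)


def centroid(cluster: List[int], vectors: Sequence[Sequence[float]]) -> List[float]:
    """Mean vector of the cluster members."""
    n = len(cluster)
    return [sum(vectors[i][d] for i in cluster) / n for d in range(len(vectors[0]))]


def constrained_clustering(vectors, labels, exclusion_sets, threshold):
    """Repeatedly merge the closest allowed pair of clusters (assumed linkage)."""
    clusters = [[i] for i in range(len(vectors))]
    while True:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            if violates(clusters[a], clusters[b], labels, exclusion_sets):
                continue  # step S703: the two clusters are not merged
            d = dist(centroid(clusters[a], vectors), centroid(clusters[b], vectors))
            if d <= threshold and (best is None or d < best[0]):
                best = (d, a, b)
        if best is None:
            return clusters
        _, a, b = best  # step S704: the two clusters are merged
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]


vectors = [[0.0, 0.0], [0.1, 0.0], [0.5, 0.0], [5.0, 5.0]]
labels = ["real name", "name", "alias", "place of birth"]
constraints = [{"real name", "alias", "place of birth"}]

with_constraints = constrained_clustering(vectors, labels, constraints, 1.0)
without_constraints = constrained_clustering(vectors, labels, [], 1.0)
```

  • In this toy run the constraint keeps "alias" out of the cluster that contains "real name", even though their vectors are close enough to merge when no constraints are given.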
  • step S105 for each cluster, the relationship feature vector with the highest frequency in the cluster is selected as the target relationship feature vector, and the relationship feature vectors of all triples in the cluster are modified to the target relationship feature vector.
  • the clusters obtained by the final clustering include several identical or similar relationship feature vectors. This embodiment of the present application further counts and compares the frequencies of the relationship feature vectors in each cluster, and takes the relationship feature vector with the highest frequency of occurrence as the target relationship feature vector of the cluster.
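  • Step S105 amounts to a majority vote inside each cluster, which can be sketched as follows. Relation names are used as stand-ins for relationship feature vectors (any hashable value, such as a tuple of vector components, would work the same way); the function name and sample labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, List


def align_cluster(relations: List[Hashable]) -> List[Hashable]:
    """Replace every relation in a cluster with the most frequent one."""
    target, _count = Counter(relations).most_common(1)[0]
    return [target] * len(relations)


cluster = ["alias", "alias", "another name"]
aligned = align_cluster(cluster)
```

  • How ties between equally frequent relations are resolved is not stated in the text; this sketch simply takes whichever `Counter.most_common` returns first.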
  • a relationship alignment device based on structured information is provided, and the relationship alignment device based on structured information is in one-to-one correspondence with the relationship alignment method based on structured information in the foregoing embodiment.
  • the structured information-based relationship alignment device includes a corpus construction module 81, a relationship acquisition module 82, a mutually exclusive set acquisition module 83, a clustering and merging module 84, and a correction module 85.
  • the detailed description of each functional module is as follows:
  • the corpus construction module 81 is configured to construct a triple corpus, the triple corpus including several triples;
  • the relationship acquisition module 82 is configured to acquire the relationship vector representation corresponding to each triple in the triple corpus;
  • the mutual exclusion set acquisition module 83 is configured to construct a relationship set corresponding to the head entity in the triples according to the relationship vector representations, and to obtain the relationship mutual exclusion set with the largest range from the relationship sets;
  • the clustering and merging module 84 is configured to perform cluster analysis on the triples in the triple corpus according to the relationship vector features, and to merge clusters according to the relationship mutual exclusion set to obtain several clusters;
  • the correction module 85 is configured to, for each cluster, select the relationship feature vector with the highest frequency of occurrence in the cluster as the target relationship feature vector, and modify the relationship feature vectors of all triples in the cluster to the target relationship feature vector.
  • the relationship acquisition module 82 includes:
  • a first vector representation acquisition unit configured to query a preset word vector list for a triple in the triple corpus, and obtain a first vector representation of the relationship corresponding to the triple;
  • the second vector representation acquisition unit is configured to acquire the head entity vector representation and the tail entity vector representation corresponding to the triple, and to construct the second vector representation of the relationship corresponding to the triple according to the head entity vector representation and the tail entity vector representation;
  • the splicing unit is used for splicing the first vector representation and the second vector representation to obtain the relation vector representation corresponding to the triplet.
  • the second vector representation acquisition unit includes:
  • a first query subunit configured to obtain entity type and category information of the head entity corresponding to the triplet, query the word vector list, and obtain the entity type vector representation and the category information vector representation corresponding to the head entity;
  • a first calculation subunit used to obtain the average value between the entity type vector representation corresponding to the head entity and the category information vector representation, as the head entity vector representation;
  • the second query subunit is used to obtain entity type and category information of the tail entity corresponding to the triplet, query the word vector list, and obtain the entity type vector representation and the category information vector representation corresponding to the tail entity;
  • the second calculation subunit is used to obtain the average value between the entity type vector representation and the category information vector representation corresponding to the tail entity, as the tail entity vector representation;
  • the third calculation subunit is configured to obtain the difference between the head entity vector representation and the tail entity vector representation as the second vector representation of the relationship corresponding to the triplet.
  • the relationship acquisition module 82 includes:
  • a first vector representation acquisition unit configured to query a preset word vector list for triples in the triplet corpus, and obtain a first vector representation of attributes corresponding to the triples;
  • the second vector representation acquisition unit is configured to acquire the head entity vector representation and the attribute value vector representation corresponding to the triple, and to construct the second vector representation of the attribute corresponding to the triple according to the head entity vector representation and the attribute value vector representation;
  • the splicing unit is used for splicing the first vector representation and the second vector representation to obtain the relation vector representation corresponding to the triplet.
  • the second vector representation acquisition unit includes:
  • a first query subunit configured to obtain entity type and category information of the head entity corresponding to the triplet, query the word vector list, and obtain the entity type vector representation and the category information vector representation corresponding to the head entity;
  • a first calculation subunit used to obtain the average value between the entity type vector representation corresponding to the head entity and the category information vector representation, as the head entity vector representation;
  • the second query subunit is configured to perform word segmentation processing on the attribute value, query a preset word vector list according to the word segmentation result, and obtain a word segmentation vector representation corresponding to each word segmentation;
  • the second calculation subunit is used to obtain the average value between the representations of the word segmentation vectors as the representation of the attribute value vector;
  • the third calculation subunit is configured to obtain the difference between the head entity vector representation and the attribute value vector representation as the second vector representation of the attribute corresponding to the triplet.
  • the mutually exclusive set acquisition module 83 includes:
  • a relationship set obtaining unit configured to obtain triples with the same head entity and their corresponding relationship vector representations, and perform deduplication processing on the relationship vector representations to obtain a relationship set corresponding to the head entity;
  • the mutual exclusion set acquisition unit is used to filter the relation sets corresponding to all head entities to obtain the relation mutual exclusion set with the largest range.
  • the cluster merging module 84 includes:
  • a clustering unit, configured to use a preset algorithm to perform cluster analysis on the relationship vector representations corresponding to the triples in the triple corpus;
  • a judging unit, configured to judge, during the cluster analysis and for two clusters to be merged, whether at least one relationship vector representation in each of the two clusters exists in the same maximum relationship set (that is, the same relationship mutual exclusion set);
  • a merging processing unit, configured not to merge the two clusters when the judgment result of the judging unit is yes, and otherwise to merge the two clusters.
  • Each module in the above-mentioned structure-information-based relationship alignment apparatus may be implemented in whole or in part by software, hardware, and combinations thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device is provided, and the computer device may be a server, and its internal structure diagram may be as shown in FIG. 9 .
  • the computer device includes a processor, memory, a network interface, and a database connected by a system bus. Among them, the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a computer storage medium and an internal memory.
  • the computer storage medium stores an operating system, computer readable instructions and a database.
  • the internal memory provides an environment for the execution of the operating system and computer readable instructions in the computer storage medium.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions, when executed by a processor, implement a structured-information-based relationship alignment method.
  • the readable storage medium provided by this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
  • a computer device comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer readable instructions:
  • constructing a triple corpus, the triple corpus including several triples;
  • Cluster analysis is performed on the triples in the triple corpus according to the relationship vector feature, and clusters are merged according to the relationship mutual exclusion set to obtain several clusters;
  • the relationship feature vector with the highest frequency in the cluster is selected as the target relationship feature vector, and the relationship feature vectors of all triples in the cluster are modified as the target relationship feature vector.
  • one or more readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
  • constructing a triple corpus, the triple corpus including several triples;
  • Cluster analysis is performed on the triples in the triple corpus according to the relationship vector feature, and clusters are merged according to the relationship mutual exclusion set to obtain several clusters;
  • the relationship feature vector with the highest frequency in the cluster is selected as the target relationship feature vector, and the relationship feature vectors of all triples in the cluster are modified as the target relationship feature vector.
  • Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A structured-information-based relation alignment method, comprising: constructing a triplet corpus, wherein the triplet corpus comprises a plurality of triplets (S101); acquiring a relation vector representation corresponding to each triplet in the triplet corpus (S102); according to the relation vector representations, constructing relation sets corresponding to head entities in the triplets, and acquiring, from the relation sets, a relation mutually exclusive set having the maximum range (S103); performing clustering analysis on the triplets in the triplet corpus according to relation vector features, and performing clustering cluster merging according to the relation mutually exclusive set, so as to obtain several clusters (S104); and for each cluster, selecting a relation feature vector, which occurs the most frequently in the cluster, as a target relation feature vector, and modifying the relation feature vectors of all triplets in the cluster to be the target relation feature vector (S105). By means of the method, the problems in the prior art of the low accuracy of relation representation and the low precision of relation alignment during the construction of a knowledge graph are solved.

Description

Structured-information-based relationship alignment method, apparatus, device and medium
This application claims priority to Chinese patent application No. 202110420316.6, filed with the China Patent Office on April 19, 2021 and entitled "Structured-information-based relationship alignment method, apparatus, device and medium", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of information technology, and in particular to a structured-information-based relationship alignment method, apparatus, device and medium.
Background
Building a knowledge graph requires a complete knowledge system. The knowledge system can be established manually, or by computer on the basis of data analysis. The existing Internet, for example encyclopedia sites, contains a large amount of triple knowledge. The prior art mainly uses triple knowledge provided by the Internet, such as encyclopedic knowledge, to construct a knowledge system, and this process requires relationship alignment.
The inventors realized that, if the structured information provided by the Internet is restored to unstructured information during relationship alignment and the relationships are then aligned according to the entities of the unstructured information, the representation accuracy of the relationships is low and the accuracy of relationship alignment drops. The restored unstructured information consists of extremely short texts that lack the context of the relationships, and the structured information was originally edited by people, so it carries a degree of subjectivity and human error. Moreover, when the relationship vector representations are clustered to eliminate ambiguity, the semantic information contained in relationship vector representations obtained from structured information is limited, so relationship clustering performs poorly and the accuracy of relationship alignment is low.
Summary
The embodiments of the present application provide a structured-information-based relationship alignment method, apparatus, device and medium, to solve the problems of low relationship representation accuracy and low relationship alignment precision that exist in the prior art when building a knowledge graph.
A structured-information-based relationship alignment method, comprising:
constructing a triple corpus, the triple corpus including several triples;
obtaining the relationship vector representation corresponding to each triple in the triple corpus;
constructing, according to the relationship vector representations, a relationship set corresponding to the head entity in the triples, and obtaining the relationship mutual exclusion set with the largest range from the relationship sets;
performing cluster analysis on the triples in the triple corpus according to the relationship vector features, and merging clusters according to the relationship mutual exclusion set, to obtain several clusters;
for each cluster, selecting the relationship feature vector with the highest frequency of occurrence in the cluster as the target relationship feature vector, and modifying the relationship feature vectors of all triples in the cluster to the target relationship feature vector.
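The final step above (S105) can be illustrated with a minimal Python sketch. It assumes each cluster is a list of (head, relation, tail) tuples and uses relation names as stand-ins for the relationship feature vectors; the function name `normalize_clusters` and the sample data are hypothetical, and ties are broken by first occurrence, which the source does not specify.

```python
from collections import Counter

def normalize_clusters(clusters):
    """Rewrite every triple's relation to the most frequent relation of its cluster.

    `clusters` is a list of clusters; each cluster is a list of
    (head, relation, tail) tuples. Relation names stand in for the
    relationship feature vectors of the text (an assumption of this sketch).
    """
    aligned = []
    for cluster in clusters:
        if not cluster:
            # Nothing to align in an empty cluster.
            aligned.append([])
            continue
        counts = Counter(rel for _, rel, _ in cluster)
        # most_common(1) returns the relation with the highest count;
        # ties resolve to the relation seen first.
        target, _ = counts.most_common(1)[0]
        aligned.append([(h, target, t) for h, _, t in cluster])
    return aligned

# Hypothetical sample data: two surface forms of the same relation in one cluster.
clusters = [
    [("A", "birthplace", "X"), ("B", "birthplace", "Y"), ("C", "place of birth", "Z")],
    [("A", "nationality", "China")],
]
aligned = normalize_clusters(clusters)
```

In this sample the minority form "place of birth" is rewritten to "birthplace", while the single-element cluster is left unchanged.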
Optionally, obtaining the relationship vector representation corresponding to each triple in the triple corpus includes:
for a triple in the triple corpus, querying a preset word vector list to obtain a first vector representation of the relationship corresponding to the triple;
obtaining the head entity vector representation and the tail entity vector representation corresponding to the triple, and constructing a second vector representation of the relationship corresponding to the triple according to the head entity vector representation and the tail entity vector representation;
concatenating the first vector representation and the second vector representation to obtain the relationship vector representation corresponding to the triple.
Optionally, obtaining the head entity vector representation and the tail entity vector representation corresponding to the triple, and constructing the second vector representation of the relationship corresponding to the triple according to the head entity vector representation and the tail entity vector representation, includes:
obtaining the entity type and category information of the head entity corresponding to the triple, and querying the word vector list to obtain the entity type vector representation and the category information vector representation corresponding to the head entity;
taking the average of the entity type vector representation and the category information vector representation corresponding to the head entity as the head entity vector representation;
obtaining the entity type and category information of the tail entity corresponding to the triple, and querying the word vector list to obtain the entity type vector representation and the category information vector representation corresponding to the tail entity;
taking the average of the entity type vector representation and the category information vector representation corresponding to the tail entity as the tail entity vector representation;
taking the difference between the head entity vector representation and the tail entity vector representation as the second vector representation of the relationship corresponding to the triple.
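The construction above for "entity-relationship-entity" triples can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the applicant's implementation: the word vector list is modelled as a plain dict, the vectors are two-dimensional toy values, the entity types and categories are invented, and the difference is taken as head minus tail because the source does not state the direction.

```python
def mean(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(vals) / n for vals in zip(*vectors)]

def relation_vector(info, word_vectors):
    """Build the relationship vector representation of one triple.

    `info` holds the relation name plus the entity type and category of the
    head and tail entities (hypothetical field names).
    """
    # First vector: direct lookup of the relation in the word vector list.
    first = word_vectors[info["relation"]]
    # Head / tail entity vectors: average of entity type and category vectors.
    head = mean([word_vectors[info["head_type"]], word_vectors[info["head_category"]]])
    tail = mean([word_vectors[info["tail_type"]], word_vectors[info["tail_category"]]])
    # Second vector: difference of head and tail (head - tail is assumed here).
    second = [h - t for h, t in zip(head, tail)]
    # Concatenate the first and second vectors.
    return list(first) + second

# Toy word vectors; all values are made up for illustration.
word_vectors = {
    "nationality": [1.0, 0.0],
    "person": [2.0, 4.0],
    "monk": [4.0, 2.0],
    "country": [1.0, 1.0],
    "place": [3.0, 1.0],
}
info = {
    "relation": "nationality",
    "head_type": "person", "head_category": "monk",
    "tail_type": "country", "tail_category": "place",
}
vec = relation_vector(info, word_vectors)
```

With these toy values the head vector is [3.0, 3.0], the tail vector is [2.0, 1.0], and the concatenated result has twice the dimension of a single word vector.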
Optionally, obtaining the relationship vector representation corresponding to each triple in the triple corpus includes:
for a triple in the triple corpus, querying a preset word vector list to obtain a first vector representation of the attribute corresponding to the triple;
obtaining the head entity vector representation and the attribute value vector representation corresponding to the triple, and constructing a second vector representation of the attribute corresponding to the triple according to the head entity vector representation and the attribute value vector representation;
concatenating the first vector representation and the second vector representation to obtain the relationship vector representation corresponding to the triple.
Optionally, obtaining the head entity vector representation and the attribute value vector representation corresponding to the triple, and constructing the second vector representation of the attribute corresponding to the triple according to the head entity vector representation and the attribute value vector representation, includes:
obtaining the entity type and category information of the head entity corresponding to the triple, and querying the word vector list to obtain the entity type vector representation and the category information vector representation corresponding to the head entity;
taking the average of the entity type vector representation and the category information vector representation corresponding to the head entity as the head entity vector representation;
performing word segmentation on the attribute value, and querying the preset word vector list according to the segmentation result to obtain the vector representation of each segmented word;
taking the average of the segmented-word vector representations as the attribute value vector representation;
taking the difference between the head entity vector representation and the attribute value vector representation as the second vector representation of the attribute corresponding to the triple.
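For "entity-attribute-attribute value" triples, the only change from the relationship case is that the attribute value is segmented into words whose vectors are averaged. The sketch below assumes whitespace splitting as a stand-in for a real word segmenter (Chinese text would need a dedicated tokenizer), a dict as the word vector list, toy vector values, and head minus value as the direction of the difference; all names are hypothetical.

```python
def mean(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(vals) / n for vals in zip(*vectors)]

def attribute_vector(attribute, head_type, head_category, value,
                     word_vectors, segment=str.split):
    """Build the vector representation of an attribute triple.

    `segment` is the word segmentation function; whitespace splitting is
    only a placeholder for a real segmenter. Every segmented word is
    assumed to be present in `word_vectors`.
    """
    first = word_vectors[attribute]
    head = mean([word_vectors[head_type], word_vectors[head_category]])
    # Attribute value vector: average over the vectors of its segmented words.
    value_vec = mean([word_vectors[tok] for tok in segment(value)])
    # Second vector: head minus attribute value (direction is an assumption).
    second = [h - v for h, v in zip(head, value_vec)]
    return list(first) + second

# Toy word vectors; all values are made up for illustration.
word_vectors = {
    "date of birth": [0.0, 1.0],
    "person": [2.0, 4.0],
    "monk": [4.0, 2.0],
    "476": [1.0, 1.0],
    "year": [3.0, 3.0],
}
vec = attribute_vector("date of birth", "person", "monk", "476 year", word_vectors)
```

An unknown segmented word raises a `KeyError` in this sketch; how the source handles out-of-vocabulary words is not stated.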
Optionally, constructing the relationship set corresponding to the head entity in the triples according to the relationship vector representations, and obtaining the relationship mutual exclusion set with the largest range from the relationship sets, includes:
obtaining the triples that share the same head entity and their corresponding relationship vector representations, and deduplicating the relationship vector representations to obtain the relationship set corresponding to the head entity;
performing inclusion filtering on the relationship sets corresponding to all head entities to obtain the relationship mutual exclusion set with the largest range.
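One plausible reading of the two sub-steps above is sketched below: relations are grouped per head entity with duplicates removed, and "inclusion filtering" is interpreted as discarding every relationship set that is a proper subset of another, so that only the maximal sets remain. That interpretation, the use of relation names instead of vectors, and the function names are assumptions of this sketch.

```python
from collections import defaultdict

def relation_sets(triples):
    """Group relations by head entity; a set removes duplicates automatically."""
    by_head = defaultdict(set)
    for head, relation, _ in triples:
        by_head[head].add(relation)
    return dict(by_head)

def maximal_sets(sets):
    """Keep only the sets that are not a proper subset of another set."""
    unique = {frozenset(s) for s in sets}
    return [s for s in unique if not any(s < other for other in unique)]

# Hypothetical triples with relation names as stand-ins for relation vectors.
triples = [
    ("A", "nationality", "China"),
    ("A", "birthplace", "X"),
    ("A", "nationality", "China"),   # duplicate, removed by the set
    ("B", "nationality", "China"),
    ("C", "birthplace", "Y"),
    ("C", "occupation", "monk"),
]
per_head = relation_sets(triples)
mutex_sets = maximal_sets(per_head.values())
```

The intuition is that relations attached to the same head entity describe different things about it, so they act as prior knowledge that two relations should not be merged. Here the set for head B is dropped because it is contained in the set for head A.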
Optionally, performing cluster analysis on the triples in the triple corpus according to the relationship vector features, and merging clusters according to the relationship mutual exclusion set to obtain several clusters, includes:
using a preset algorithm to perform cluster analysis on the relationship vector representations corresponding to the triples in the triple corpus;
during the cluster analysis, for two clusters to be merged, determining whether at least one relationship vector representation in each of the two clusters exists in the same relationship mutual exclusion set;
if so, not merging the two clusters; otherwise, merging the two clusters.
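The merge rule above can be sketched as a constrained agglomerative clustering. The source only refers to a "preset algorithm", so the single-linkage strategy, the Euclidean distance, the distance threshold, and the simplification of one vector per relation name are all assumptions of this sketch.

```python
import math
from itertools import combinations

def conflicts(cluster_a, cluster_b, mutex_sets):
    """True if both clusters contain a relation from the same mutual exclusion set."""
    return any((cluster_a & m) and (cluster_b & m) for m in mutex_sets)

def cluster_relations(vectors, mutex_sets, threshold):
    """Single-linkage agglomerative clustering that refuses conflicting merges."""
    clusters = [frozenset([name]) for name in vectors]
    while True:
        best = None
        for a, b in combinations(clusters, 2):
            if conflicts(a, b, mutex_sets):
                continue  # prior knowledge says these must stay apart
            d = min(math.dist(vectors[x], vectors[y]) for x in a for y in b)
            if d <= threshold and (best is None or d < best[0]):
                best = (d, a, b)
        if best is None:
            return clusters  # no admissible merge left
        _, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]

# Toy relation vectors; values are made up for illustration.
vectors = {
    "birthplace": [0.0, 0.0],
    "place of birth": [0.1, 0.0],
    "nationality": [0.3, 0.0],
    "occupation": [5.0, 5.0],
}
mutex_sets = [frozenset({"birthplace", "nationality"})]
result = cluster_relations(vectors, mutex_sets, threshold=0.5)
```

In this toy run "birthplace" and "place of birth" are merged first. "nationality" lies within the threshold of that cluster, but the merge is rejected because "nationality" and "birthplace" share a mutual exclusion set, which is the effect the constraint is meant to have.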
A structured-information-based relationship alignment apparatus, comprising:
a corpus construction module, configured to construct a triple corpus, the triple corpus including several triples;
a relationship acquisition module, configured to obtain the relationship vector representation corresponding to each triple in the triple corpus;
a mutual exclusion set acquisition module, configured to construct, according to the relationship vector representations, a relationship set corresponding to the head entity in the triples, and to obtain the relationship mutual exclusion set with the largest range from the relationship sets;
a clustering and merging module, configured to perform cluster analysis on the triples in the triple corpus according to the relationship vector features, and to merge clusters according to the relationship mutual exclusion set to obtain several clusters;
a correction module, configured to, for each cluster, select the relationship feature vector with the highest frequency of occurrence in the cluster as the target relationship feature vector, and modify the relationship feature vectors of all triples in the cluster to the target relationship feature vector.
A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
constructing a triple corpus, the triple corpus including several triples;
obtaining the relationship vector representation corresponding to each triple in the triple corpus;
constructing, according to the relationship vector representations, a relationship set corresponding to the head entity in the triples, and obtaining the relationship mutual exclusion set with the largest range from the relationship sets;
performing cluster analysis on the triples in the triple corpus according to the relationship vector features, and merging clusters according to the relationship mutual exclusion set, to obtain several clusters;
for each cluster, selecting the relationship feature vector with the highest frequency of occurrence in the cluster as the target relationship feature vector, and modifying the relationship feature vectors of all triples in the cluster to the target relationship feature vector.
One or more readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
constructing a triple corpus, the triple corpus including several triples;
obtaining the relationship vector representation corresponding to each triple in the triple corpus;
constructing, according to the relationship vector representations, a relationship set corresponding to the head entity in the triples, and obtaining the relationship mutual exclusion set with the largest range from the relationship sets;
performing cluster analysis on the triples in the triple corpus according to the relationship vector features, and merging clusters according to the relationship mutual exclusion set, to obtain several clusters;
for each cluster, selecting the relationship feature vector with the highest frequency of occurrence in the cluster as the target relationship feature vector, and modifying the relationship feature vectors of all triples in the cluster to the target relationship feature vector.
The embodiments of the present application construct a triple corpus that includes several triples; obtain the relationship vector representation corresponding to each triple in the triple corpus; construct, according to the relationship vector representations, a relationship set corresponding to the head entity in the triples, and obtain the relationship mutual exclusion set with the largest range from the relationship sets; perform cluster analysis on the triples in the triple corpus according to the relationship vector features, and merge clusters according to the relationship mutual exclusion set to obtain several clusters; and, for each cluster, select the relationship feature vector with the highest frequency of occurrence in the cluster as the target relationship feature vector and modify the relationship feature vectors of all triples in the cluster to the target relationship feature vector, thereby improving the precision and practicability of relationship alignment.
The details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below; other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
Description of drawings
In order to explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of the structured-information-based relationship alignment method in an embodiment of the present application;
FIG. 2 is a flowchart of step S102 of the structured-information-based relationship alignment method in an embodiment of the present application;
FIG. 3 is a flowchart of step S202 of the structured-information-based relationship alignment method in an embodiment of the present application;
FIG. 4 is a flowchart of step S102 of the structured-information-based relationship alignment method in another embodiment of the present application;
FIG. 5 is a flowchart of step S402 of the structured-information-based relationship alignment method in another embodiment of the present application;
FIG. 6 is a flowchart of step S103 of the structured-information-based relationship alignment method in an embodiment of the present application;
FIG. 7 is a flowchart of step S104 of the structured-information-based relationship alignment method in an embodiment of the present application;
FIG. 8 is a schematic block diagram of the structured-information-based relationship alignment apparatus in an embodiment of the present application;
FIG. 9 is a schematic diagram of the computer device in an embodiment of the present application.
具体实施方式Detailed ways
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, not all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present application.
本实施例提供了一种基于结构化信息的关系对齐方法。以下将对本实施例提供的基于结构化信息的关系对齐方法进行详细的描述,如图1所示,所述基于结构化信息的关系对齐方法包括:This embodiment provides a relationship alignment method based on structured information. The structured information-based relationship alignment method provided in this embodiment will be described in detail below. As shown in FIG. 1 , the structured information-based relationship alignment method includes:
In step S101, a triplet corpus is constructed; the triplet corpus includes a number of triplets.
Here, the embodiment of the present application obtains triplet knowledge either by parsing the infobox content of Internet web pages or from an open-domain knowledge graph. The web pages include, but are not limited to, Baidu Baike and Wikipedia.
According to the Resource Description Framework (RDF), any complex semantics can be expressed by combining several triplets. In the embodiment of the present application, triplets take two forms: "entity-relationship-entity" and "entity-attribute-attribute value". For example, the triplet (昙度, nationality, China) is of the "entity-relationship-entity" type, and the triplet (昙鸾, date of birth, year 476) is of the "entity-attribute-attribute value" type.
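The two triplet forms can be illustrated with a small data structure. This is only a sketch: the class name `Triplet` and its field names are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triplet:
    """One entry of the triplet corpus (hypothetical layout)."""
    head: str       # head entity
    predicate: str  # relationship word or attribute word
    tail: str       # tail entity or attribute value
    kind: str       # "relationship" or "attribute"

# The two examples given in the description.
corpus = [
    Triplet("昙度", "国籍", "中国", "relationship"),
    Triplet("昙鸾", "出生日期", "476年", "attribute"),
]
```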
In step S102, the relationship vector representation corresponding to each triplet in the triplet corpus is obtained.
Here, the relationship vector representation is the relationship or attribute of a triplet expressed in vector form. Unlike the prior art, the embodiment of the present application combines two sources: the representation of the triplet's relationship or attribute learned from massive data, and a relationship representation derived from the triplet's own context. Combining them yields the relationship vector representation of each triplet and improves the accuracy of the relationship representation.
Optionally, as a preferred example of the present application, when the triplet is of the "entity-relationship-entity" type, as shown in FIG. 2, obtaining the relationship vector representation corresponding to each triplet in the triplet corpus in step S102 includes:
In step S201, for each triplet in the triplet corpus, a preset word vector list is queried to obtain a first vector representation of the relationship corresponding to the triplet.
The embodiment of the present application first traverses every triplet in the triplet corpus and, using the relationship word in the triplet, queries the preset word vector list to obtain the vector representation of that word. This vector is the first vector representation of the relationship corresponding to the triplet and serves as its basic representation in this embodiment. The preset word vector list may be the list of 8 million word vectors open-sourced by Tencent. Compared with the prior art, which builds relationship representations from unstructured data, looking up the relationship word in an existing word vector list speeds up the acquisition of the relationship representation.
Optionally, because the preset word vector list contains a limited number of word vectors, the relationship word of a triplet may be missing from it. In that case, the relationship word is segmented into words, the preset word vector list is queried for each segmented word, and the vectors of all segmented words are summed and averaged. The average is used as the basic representation of the relationship corresponding to the triplet.
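The lookup with its segmentation fallback can be sketched as follows. This is a minimal illustration under stated assumptions: the word vector list is modelled as a plain dictionary, the `segment` callable stands in for any word segmenter, the function names are hypothetical, and silently skipping segmented words that are missing from the list is an assumption the application does not address.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]

def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def first_vector(word: str,
                 table: Dict[str, Vector],
                 segment: Callable[[str], List[str]]) -> Optional[Vector]:
    """Basic representation of a relationship word (step S201)."""
    if word in table:                      # direct hit in the word vector list
        return table[word]
    # Fallback: segment the word and average the vectors of its parts.
    parts = [table[p] for p in segment(word) if p in table]
    return mean_vector(parts) if parts else None

# Toy word vector list and toy segmenter (both hypothetical).
table = {"出生": [1.0, 0.0], "日期": [0.0, 1.0], "国籍": [0.5, 0.5]}

def toy_segment(word: str) -> List[str]:
    return {"出生日期": ["出生", "日期"]}.get(word, [word])

direct = first_vector("国籍", table, toy_segment)
fallback = first_vector("出生日期", table, toy_segment)
```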
In step S202, the head entity vector representation and the tail entity vector representation corresponding to the triplet are obtained, and a second vector representation of the relationship corresponding to the triplet is constructed from them.
Because the preset word vector list is itself manually edited, it carries some subjectivity and error, so the first vector representation obtained by querying it may be biased. The embodiment of the present application therefore corrects the first vector representation using the head entity vector representation and the tail entity vector representation of the triplet. Optionally, as a preferred example of the present application, as shown in FIG. 3, step S202 further includes:
In step S301, the entity type and category information of the head entity corresponding to the triplet are obtained, and the word vector list is queried to obtain the entity type vector representation and the category information vector representation corresponding to the head entity.
Here, depending on the application scenario, entities can be given a first-level division into several types, such as persons, places and objects. The entity type is the first-level type to which the head entity belongs in the application scenario. Each entity type can be given a second-level division into several items of category information. For ease of understanding, Table 1 shows, for an existing Buddhist-studies scenario, the entity types obtained by the first-level division and the category information obtained by the second-level division of those types.
Figure PCTCN2021096584-appb-000001
Table 1
The embodiment of the present application obtains the entity type and category information table for the application scenario and uses it to determine the entity type and category information of the head entity corresponding to the triplet. The word vector list is then queried to obtain the entity type vector representation and the category information vector representation corresponding to the head entity. Optionally, the word vector list may be the list of 8 million word vectors open-sourced by Tencent mentioned above.
In step S302, the average of the entity type vector representation and the category information vector representation corresponding to the head entity is computed and used as the head entity vector representation.
After obtaining the entity type vector representation and the category information vector representation corresponding to the head entity, the embodiment of the present application computes their average to obtain the vector representation of the head entity.
For ease of understanding, the calculation for the head entity "德妙法尊者" is as follows. Based on the division in Table 1, its entity type is "person" (人) and its category information is "Buddha" (佛陀). The word vector list is queried for each: the vector vec_1 of "person" is the entity type vector representation of "德妙法尊者", and the vector vec_2 of "Buddha" is its category information vector representation. The vector representation v_head of "德妙法尊者" as a head entity is the average of vec_1 and vec_2, that is, v_head = (vec_1 + vec_2) / 2.
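The averaging v_head = (vec_1 + vec_2) / 2 can be sketched in a few lines. The numeric values are made up for illustration, and the function name `entity_vector` is hypothetical.

```python
from typing import List

def entity_vector(type_vec: List[float], category_vec: List[float]) -> List[float]:
    """Average the entity type vector and the category information vector (step S302)."""
    if len(type_vec) != len(category_vec):
        raise ValueError("both vectors must have the same length")
    return [(a + b) / 2 for a, b in zip(type_vec, category_vec)]

vec_1 = [1.0, 0.0, 2.0]   # hypothetical vector of the entity type "person"
vec_2 = [3.0, 2.0, 0.0]   # hypothetical vector of the category "Buddha"
v_head = entity_vector(vec_1, vec_2)
```

The same function applies unchanged to the tail entity, since its type and category vectors are averaged in the same way.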
In step S303, the entity type and category information of the tail entity corresponding to the triplet are obtained, and the word vector list is queried to obtain the entity type vector representation and the category information vector representation corresponding to the tail entity.
In step S304, the average of the entity type vector representation and the category information vector representation corresponding to the tail entity is computed and used as the tail entity vector representation.
Here, the tail entity vector representation v_tail is obtained and calculated in the same way as the head entity vector representation v_head. For details, refer to steps S301 to S302 above; they are not repeated here.
In step S305, the difference between the head entity vector representation and the tail entity vector representation is computed and used as the second vector representation of the relationship corresponding to the triplet.
The classic triplet representation model TransE and its variants treat a relationship as a translation between the head entity vector and the tail entity vector, so the relationship can be expressed as the difference of the two. In the embodiment of the present application the relationship vector is taken as v_relation = v_head - v_tail, and this difference is used as the second vector representation of the relationship corresponding to the triplet.
Here, the second vector representation is derived from the context of the triplet itself. To a certain extent it reflects the transformation between the head entity and the tail entity of the triplet, and therefore reflects the relationship in the triplet.
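The difference in step S305 can be sketched as an element-wise subtraction. The values below are illustrative only, and `relation_from_entities` is a hypothetical name.

```python
from typing import List

def relation_from_entities(v_head: List[float], v_tail: List[float]) -> List[float]:
    """Second vector representation: element-wise v_head - v_tail (step S305)."""
    if len(v_head) != len(v_tail):
        raise ValueError("head and tail vectors must have the same length")
    return [h - t for h, t in zip(v_head, v_tail)]

v_head = [2.0, 1.0, 1.0]   # hypothetical head entity vector
v_tail = [0.5, 1.0, 3.0]   # hypothetical tail entity vector
v_second = relation_from_entities(v_head, v_tail)
```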
In step S203, the first vector representation and the second vector representation are concatenated to obtain the relationship vector representation corresponding to the triplet.
Here, the embodiment of the present application uses the first vector representation as the base vector of an "entity-relationship-entity" triplet and the second vector representation as its correction vector. The two are concatenated, and the result is the relationship vector representation of the "entity-relationship-entity" triplet. If the first vector representation has length len1 and the second has length len2, the concatenated relationship vector representation has length len1 + len2.
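Concatenation keeps both parts intact rather than mixing them, which is why the lengths add. A minimal sketch with made-up vectors (the function name is hypothetical):

```python
from typing import List

def relation_vector(first: List[float], second: List[float]) -> List[float]:
    """Concatenate the base vector and the correction vector (step S203)."""
    return list(first) + list(second)

first = [0.5, 0.5]           # hypothetical first vector representation, len1 = 2
second = [1.5, 0.0, -2.0]    # hypothetical second vector representation, len2 = 3
combined = relation_vector(first, second)
```

Note that the two parts need not have the same length; only vectors of the same kind (for example head and tail vectors) must match.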
Optionally, as another preferred example of the present application, when the triplet is of the "entity-attribute-attribute value" type, as shown in FIG. 4, obtaining the relationship vector representation corresponding to each triplet in the triplet corpus in step S102 includes:
In step S401, for each triplet in the triplet corpus, a preset word vector list is queried to obtain a first vector representation of the attribute corresponding to the triplet.
The embodiment of the present application first traverses every triplet in the triplet corpus and, using the attribute word in the triplet, queries the preset word vector list to obtain the vector representation of that word. This vector is the first vector representation of the attribute corresponding to the triplet and serves as its basic representation in this embodiment. The preset word vector list may be the list of 8 million word vectors open-sourced by Tencent. Compared with the prior art, which builds attribute representations from unstructured data, looking up the attribute word in an existing word vector list speeds up the acquisition of the attribute representation.
Optionally, because the preset word vector list contains a limited number of word vectors, the attribute word of a triplet may be missing from it. In that case, the attribute word is segmented into words, the preset word vector list is queried for each segmented word, and the vectors of all segmented words are summed and averaged. The average is used as the basic representation of the attribute corresponding to the triplet.
In step S402, the head entity vector representation and the attribute value vector representation corresponding to the triplet are obtained, and a second vector representation of the attribute corresponding to the triplet is constructed from them.
Because the preset word vector list is itself manually edited, it carries some subjectivity and error, so the first vector representation of the attribute obtained by querying it may be biased. The embodiment of the present application therefore corrects the first vector representation using the head entity vector representation and the attribute value vector representation of the triplet. Optionally, as a preferred example of the present application, as shown in FIG. 5, step S402 further includes:
In step S501, the entity type and category information of the head entity corresponding to the triplet are obtained, and the word vector list is queried to obtain the entity type vector representation and the category information vector representation corresponding to the head entity.
In step S502, the average of the entity type vector representation and the category information vector representation corresponding to the head entity is computed and used as the head entity vector representation.
Here, steps S501 to S502 are the same as steps S301 to S302 above. For details, refer to the description of steps S301 to S302; they are not repeated here.
In step S503, the attribute value is segmented into words, and the preset word vector list is queried according to the segmentation result to obtain the vector representation of each segmented word.
For the attribute value, the embodiment of the present application segments the attribute value of the triplet into several words and then queries the preset word vector list to obtain the vector representation of each segmented word. Optionally, the preset word vector list may be the list of 8 million word vectors open-sourced by Tencent mentioned above. Word segmentation may be performed by calling the jieba word segmentation tool.
In step S504, the average of the segmented-word vector representations is computed and used as the attribute value vector representation.
After segmentation, the embodiment of the present application averages the vector representations of all segmented words of the attribute value and uses the average as the attribute value vector representation corresponding to the triplet.
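Steps S503 and S504 can be sketched as below. The segmenter is passed in as a parameter so that the example runs without third-party packages; in practice a function such as `jieba.lcut` could be supplied. The toy word vector list, the toy segmenter, and the choice to ignore segmented words missing from the list are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

def attribute_value_vector(value: str,
                           table: Dict[str, List[float]],
                           segment: Callable[[str], List[str]]) -> List[float]:
    """Average the vectors of the segmented words of an attribute value."""
    vectors = [table[token] for token in segment(value) if token in table]
    if not vectors:
        raise KeyError(f"no segmented word of {value!r} is in the word vector list")
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Toy word vector list and toy segmenter (both hypothetical).
table = {"476": [1.0, 3.0], "年": [3.0, 1.0]}

def toy_segment(text: str) -> List[str]:
    return ["476", "年"] if text == "476年" else [text]

value_vec = attribute_value_vector("476年", table, toy_segment)
```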
In step S505, the difference between the head entity vector representation and the attribute value vector representation is computed and used as the second vector representation of the attribute corresponding to the triplet.
On the same principle as step S305, the embodiment of the present application computes the difference between the head entity vector representation and the attribute value vector representation and uses it as the second vector representation of the attribute corresponding to the triplet.
Here, the second vector representation of the attribute is derived from the context of the triplet itself. To a certain extent it reflects the transformation between the head entity and the attribute value of the triplet, and therefore reflects the attribute in the triplet.
In step S403, the first vector representation and the second vector representation are concatenated to obtain the relationship vector representation corresponding to the triplet.
Here, the embodiment of the present application uses the first vector representation as the base vector of an "entity-attribute-attribute value" triplet and the second vector representation as its correction vector. The two are concatenated, and the result is the relationship vector representation of the "entity-attribute-attribute value" triplet. If the first vector representation has length len1 and the second has length len2, the concatenated relationship vector representation has length len1 + len2.
The concatenated relationship vector representation captures the lexical meaning of the relationship or attribute word itself and adds the information supplied by the context of the triplet in which the relationship or attribute appears. This can effectively correct the human subjectivity in the first vector representation and improve the accuracy of the relationship representation.
In step S103, the relationship set corresponding to each head entity in the triplets is constructed according to the relationship vector representations, and the mutually exclusive relationship sets with the largest scope are obtained from the relationship sets.
Here, a relationship set is the set obtained by extracting the relationships or attributes of the specified triplets and deduplicating them. The embodiment of the present application builds one relationship set per head entity. A mutually exclusive relationship set is a relationship set that neither contains nor is contained in any other relationship set. The embodiment of the present application obtains the mutually exclusive relationship sets with the largest scope by inclusion screening of the relationship sets. Optionally, as a preferred example of the present application, as shown in FIG. 6, step S103 further includes:
In step S601, the triplets that share the same head entity and their corresponding relationship vector representations are obtained, and the relationship vector representations are deduplicated to obtain the relationship set corresponding to that head entity.
The embodiment of the present application groups the triplets by head entity, collects the relationship vector representations of the triplets that share a head entity, and deduplicates identical relationship vector representations so that only one copy of each remains. The resulting set is the relationship set corresponding to the head entity. It should be understood that, after deduplication, the relationships or attributes in a relationship set are mutually exclusive and cannot be clustered together. When a triplet corpus contains n different head entities, n relationship sets are obtained.
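Grouping by head entity and deduplicating can be sketched as follows. For readability the sketch uses the relationship or attribute words as stand-ins for their vector representations; this substitution, the tuple layout, and the function name are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

def build_relation_sets(triplets: Iterable[Tuple[str, str, str]]) -> Dict[str, Set[str]]:
    """Map each head entity to the deduplicated set of its relationships (step S601)."""
    relation_sets: Dict[str, Set[str]] = defaultdict(set)
    for head, predicate, _tail in triplets:
        relation_sets[head].add(predicate)   # a set removes duplicates automatically
    return dict(relation_sets)

# Hypothetical triplets as (head, relationship or attribute, tail or value).
triplets = [
    ("A", "real name", "x"),
    ("A", "era", "y"),
    ("A", "real name", "z"),     # duplicate relationship for the same head entity
    ("B", "Chinese name", "w"),
]
relation_sets = build_relation_sets(triplets)
```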
In step S602, inclusion screening is performed on the relationship sets corresponding to all head entities to obtain the mutually exclusive relationship sets with the largest scope.
The relationship sets of different head entities may contain one another. The embodiment of the present application therefore compares the relationship sets of all head entities, screens them for inclusion, and merges each relationship set into any relationship set that contains it. After several merges, the mutually exclusive relationship sets with the largest scope are obtained.
For example, when a triplet corpus contains 4 different head entities, 4 relationship sets are obtained: ({real name, alias, era, ethnic group}, {Chinese name, foreign name, birthplace, representative work}, {real name, era}, {Chinese name, foreign name}). Here {real name, era} ⊆ {real name, alias, era, ethnic group}, and {Chinese name, foreign name} ⊆ {Chinese name, foreign name, birthplace, representative work}. Merging {real name, era} into {real name, alias, era, ethnic group}, and {Chinese name, foreign name} into {Chinese name, foreign name, birthplace, representative work}, yields the two mutually exclusive relationship sets with the largest scope: ({real name, alias, era, ethnic group}, {Chinese name, foreign name, birthplace, representative work}).
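Inclusion screening amounts to keeping only the relationship sets that are not contained in another set. The sketch below reproduces the four-set example; the function name is hypothetical, and relationship words again stand in for vector representations.

```python
from typing import FrozenSet, Iterable, List

def maximal_sets(relation_sets: Iterable[Iterable[str]]) -> List[FrozenSet[str]]:
    """Keep the sets that are not a proper subset of any other set (step S602)."""
    unique = {frozenset(s) for s in relation_sets}   # identical sets collapse to one
    return [s for s in unique if not any(s < other for other in unique)]

relation_sets = [
    {"real name", "alias", "era", "ethnic group"},
    {"Chinese name", "foreign name", "birthplace", "representative work"},
    {"real name", "era"},
    {"Chinese name", "foreign name"},
]
mutex_sets = maximal_sets(relation_sets)
```

Merging a contained set into its containing set leaves the containing set unchanged, so discarding proper subsets gives the same result as the repeated merging described in the text.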
In step S104, cluster analysis is performed on the triplets in the triplet corpus according to the relationship vector representations, and clusters are merged according to the mutually exclusive relationship sets to obtain several clusters.
The embodiment of the present application applies a preset clustering algorithm to the relationship vector representations so that all triplets in the triplet corpus with the same or similar relationships are grouped into one cluster, which completes the relationship alignment. Optionally, as a preferred example of the present application, as shown in FIG. 7, step S104 further includes:
In step S701, a preset algorithm is used to perform cluster analysis on the relationship vector representations corresponding to the triplets in the triplet corpus.
Optionally, the embodiment of the present application uses a semi-supervised hierarchical clustering algorithm to cluster the triplets in the triplet corpus, merging the resulting clusters pairwise from the bottom up.
In step S702, during the cluster analysis, for two clusters about to be merged, it is determined whether at least one relationship vector representation from each of the two clusters exists in the same mutually exclusive relationship set.
The embodiment of the present application integrates the mutually exclusive relationship sets into the clustering model. Before the hierarchical clustering algorithm merges a pair of clusters, the mutually exclusive relationship sets are used to decide whether the two clusters may be merged. As described above, after deduplication the relationships or attributes within a relationship set are mutually exclusive, and so are those within a mutually exclusive relationship set, so they cannot be clustered together. The embodiment of the present application therefore checks whether relationship vector representations from both clusters exist in the same mutually exclusive relationship set.
If so, the two clusters contain mutually exclusive elements, are not the same or similar, and cannot be aligned, and step S703 is executed. Otherwise, the two clusters contain no mutually exclusive elements, are the same or similar, and can be aligned, and step S704 is executed.
In step S703, the two clusters are not merged.
In step S704, the two clusters are merged.
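Steps S701 to S704 can be sketched as a bottom-up clustering with a cannot-merge check. This is an illustrative simplification rather than the application's algorithm: it clusters distinct relationship words, each with one vector, uses centroid distance, and stops at a distance threshold; all of these choices, the names, and the toy data are assumptions.

```python
import math
from itertools import combinations
from typing import Dict, FrozenSet, List

def conflicts(a: FrozenSet[str], b: FrozenSet[str],
              mutex_sets: List[FrozenSet[str]]) -> bool:
    """True if some mutually exclusive set has members in both clusters (S702)."""
    return any((a & m) and (b & m) for m in mutex_sets)

def centroid(cluster: FrozenSet[str], vectors: Dict[str, List[float]]) -> List[float]:
    points = [vectors[r] for r in cluster]
    return [sum(column) / len(points) for column in zip(*points)]

def constrained_clustering(vectors: Dict[str, List[float]],
                           mutex_sets: List[FrozenSet[str]],
                           threshold: float) -> List[FrozenSet[str]]:
    clusters = [frozenset([r]) for r in vectors]       # start with singletons
    while True:
        best = None
        for a, b in combinations(clusters, 2):
            if conflicts(a, b, mutex_sets):
                continue                               # S703: do not merge
            d = math.dist(centroid(a, vectors), centroid(b, vectors))
            if d <= threshold and (best is None or d < best[0]):
                best = (d, a, b)
        if best is None:
            return clusters                            # no admissible pair left
        _, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]  # S704

vectors = {
    "real name": [0.0, 0.0],
    "original name": [0.1, 0.0],
    "alias": [0.3, 0.0],
    "birthplace": [5.0, 5.0],
}
mutex_sets = [frozenset({"real name", "alias"})]
result = constrained_clustering(vectors, mutex_sets, threshold=1.0)
```

In this toy run "alias" lies close to the merged cluster of "real name" and "original name", yet it stays separate because "real name" and "alias" share a mutually exclusive set.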
The embodiment of the present application integrates the mutually exclusive relationship sets into the cluster analysis. This added prior knowledge effectively improves the accuracy of cluster merging, and thereby the precision and practicality of relationship alignment.
In step S105, for each cluster, the relationship feature vector that occurs most frequently in the cluster is selected as the target relationship feature vector, and the relationship feature vectors of all triplets in the cluster are changed to the target relationship feature vector.
Each cluster obtained by the final clustering contains several identical or similar relationship feature vectors. The embodiment of the present application counts and compares the frequency of the relationship feature vectors in each cluster and takes the most frequent one as the target relationship feature vector of that cluster. For every triplet in the cluster whose relationship feature vector is not the target relationship feature vector, the relationship feature vector is changed to the target. This corrects wrong or deviating relationships in the cluster, effectively corrects errors or deviations caused by human subjectivity, and improves the precision and practicality of relationship alignment.
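The frequency-based normalization of step S105 can be sketched with a counter. Relationship words stand in for the relationship feature vectors, and the function name, the tuple layout, and the sample triplets are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

TripletTuple = Tuple[str, str, str]   # (head, relationship, tail)

def normalize_cluster(cluster: List[TripletTuple]) -> List[TripletTuple]:
    """Rewrite every triplet in a cluster to use the most frequent relationship."""
    counts = Counter(predicate for _head, predicate, _tail in cluster)
    target, _count = counts.most_common(1)[0]
    return [(head, target, tail) for head, _predicate, tail in cluster]

cluster = [
    ("A", "real name", "x"),
    ("B", "real name", "y"),
    ("C", "original name", "z"),
]
aligned = normalize_cluster(cluster)
```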
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The order in which the processes are executed is determined by their functions and internal logic and does not limit the implementation of the embodiments of the present application in any way.
In an embodiment, a relationship alignment apparatus based on structured information is provided, which corresponds one-to-one to the relationship alignment method based on structured information in the above embodiments. As shown in FIG. 8, the apparatus includes a corpus construction module 81, a relationship acquisition module 82, a mutually exclusive set acquisition module 83, a cluster merging module 84, and a correction module 85. The functional modules are described in detail as follows:
语料库构建模块81,用于构建三元组语料库,所述三元组语料库中包括若干个三元组;The corpus construction module 81 is used to construct a triple corpus, and the triple corpus includes several triples;
关系获取模块82,用于获取所述三元组语料库中每一个三元组对应的关系向量表征;A relationship acquisition module 82, configured to acquire the relationship vector representation corresponding to each triple in the triple corpus;
互斥集获取模块83,用于根据所述关系向量表征构建三元组中的头实体对应的关系集,从所述关系集中获取范围最大的关系互斥集;Mutual exclusion set acquisition module 83, configured to construct a relationship set corresponding to the head entity in the triplet according to the relationship vector representation, and obtain a relationship mutual exclusion set with the largest scope from the relationship set;
聚类合并模块84,用于根据所述关系向量特征对所述三元组语料库中的三元组进行聚类分析,并根据所述关系互斥集进行聚类簇合并,得到若干个簇;The clustering and merging module 84 is configured to perform cluster analysis on the triples in the triplet corpus according to the relationship vector feature, and perform cluster-cluster merging according to the mutually exclusive set of relationships to obtain several clusters;
修正模块85,用于对于每一个簇,选择所述簇中出现频率最高的关系特征向量作为目标关系特征向量,将所述簇中的所有三元组的关系特征向量修改为所述目标关系特征向量。The modification module 85 is configured to, for each cluster, select the relationship feature vector with the highest frequency of occurrence in the cluster as the target relationship feature vector, and modify the relationship feature vectors of all triples in the cluster to the target relationship feature vector.
Optionally, the relation acquisition module 82 includes:
a first vector representation acquisition unit, configured to query a preset word vector list for a triple in the triple corpus to obtain a first vector representation of the relation corresponding to the triple;
a second vector representation acquisition unit, configured to acquire the head entity vector representation and the tail entity vector representation corresponding to the triple, and to construct a second vector representation of the relation corresponding to the triple according to the head entity vector representation and the tail entity vector representation;
a splicing unit, configured to concatenate the first vector representation and the second vector representation to obtain the relation vector representation corresponding to the triple.
Optionally, the second vector representation acquisition unit includes:
a first query subunit, configured to acquire the entity type and category information of the head entity corresponding to the triple, and to query the word vector list to obtain the entity type vector representation and the category information vector representation corresponding to the head entity;
a first calculation subunit, configured to compute the average of the entity type vector representation and the category information vector representation corresponding to the head entity, as the head entity vector representation;
a second query subunit, configured to acquire the entity type and category information of the tail entity corresponding to the triple, and to query the word vector list to obtain the entity type vector representation and the category information vector representation corresponding to the tail entity;
a second calculation subunit, configured to compute the average of the entity type vector representation and the category information vector representation corresponding to the tail entity, as the tail entity vector representation;
a third calculation subunit, configured to compute the difference between the head entity vector representation and the tail entity vector representation, as the second vector representation of the relation corresponding to the triple.
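The units and subunits above can be read as a short pipeline: look up the relation, average the type and category vectors of each entity, subtract tail from head, and concatenate. The sketch below illustrates that reading under stated assumptions: the word vector list is modeled as a plain dictionary, vectors are Python lists of equal length, and all function names and dictionary keys (`relation_representation`, `head_type`, and so on) are hypothetical.

```python
def mean_vec(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def entity_vector(entity_type, category, word_vectors):
    """Average the entity type vector and the category information vector."""
    return mean_vec([word_vectors[entity_type], word_vectors[category]])


def relation_representation(triple, word_vectors):
    """Build the relation vector representation for one triple."""
    # First vector representation: direct lookup of the relation.
    first = word_vectors[triple["relation"]]
    # Head and tail entity vector representations.
    head = entity_vector(triple["head_type"], triple["head_category"], word_vectors)
    tail = entity_vector(triple["tail_type"], triple["tail_category"], word_vectors)
    # Second vector representation: head minus tail, element by element.
    second = [h - t for h, t in zip(head, tail)]
    # Concatenate the two representations.
    return first + second


# Toy 2-dimensional vectors, invented purely for illustration.
toy_vectors = {
    "born_in": [1.0, 0.0],
    "person": [2.0, 2.0],
    "scientist": [4.0, 0.0],
    "place": [1.0, 1.0],
    "city": [1.0, 3.0],
}
toy_triple = {
    "relation": "born_in",
    "head_type": "person",
    "head_category": "scientist",
    "tail_type": "place",
    "tail_category": "city",
}
representation = relation_representation(toy_triple, toy_vectors)
```

With these toy values the head entity vector is `[3.0, 1.0]`, the tail entity vector is `[1.0, 2.0]`, and the result is the 4-dimensional concatenation `[1.0, 0.0, 2.0, -1.0]`. Because the second half depends only on entity types and categories, two differently named relations between the same kinds of entities end up close together in that half of the vector.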
Optionally, the relation acquisition module 82 includes:
a first vector representation acquisition unit, configured to query a preset word vector list for a triple in the triple corpus to obtain a first vector representation of the attribute corresponding to the triple;
a second vector representation acquisition unit, configured to acquire the head entity vector representation and the attribute value vector representation corresponding to the triple, and to construct a second vector representation of the attribute corresponding to the triple according to the head entity vector representation and the attribute value vector representation;
a splicing unit, configured to concatenate the first vector representation and the second vector representation to obtain the relation vector representation corresponding to the triple.
Optionally, the second vector representation acquisition unit includes:
a first query subunit, configured to acquire the entity type and category information of the head entity corresponding to the triple, and to query the word vector list to obtain the entity type vector representation and the category information vector representation corresponding to the head entity;
a first calculation subunit, configured to compute the average of the entity type vector representation and the category information vector representation corresponding to the head entity, as the head entity vector representation;
a second query subunit, configured to perform word segmentation on the attribute value, and to query the preset word vector list according to the segmentation result to obtain a vector representation for each segmented word;
a second calculation subunit, configured to compute the average of the segmented-word vector representations, as the attribute value vector representation;
a third calculation subunit, configured to compute the difference between the head entity vector representation and the attribute value vector representation, as the second vector representation of the attribute corresponding to the triple.
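For attribute triples, the only new step relative to the entity case is turning a free-text attribute value into a vector. A minimal sketch follows, assuming a dictionary-based word vector list and using whitespace splitting as a stand-in for the word segmentation step; a real system handling Chinese text would substitute a proper segmenter through the `tokenize` parameter. The function names are hypothetical, and every token is assumed to be present in the word vector list.

```python
def mean_vec(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def attribute_value_vector(value, word_vectors, tokenize=str.split):
    """Segment the attribute value and average the vectors of its tokens."""
    tokens = tokenize(value)
    return mean_vec([word_vectors[token] for token in tokens])


def attribute_second_vector(head_vec, value, word_vectors, tokenize=str.split):
    """Second vector representation of an attribute: head minus attribute value."""
    value_vec = attribute_value_vector(value, word_vectors, tokenize)
    return [h - v for h, v in zip(head_vec, value_vec)]


# Toy 2-dimensional vectors, invented purely for illustration.
toy_value_vectors = {"180": [1.0, 0.0], "cm": [3.0, 2.0]}
value_vec = attribute_value_vector("180 cm", toy_value_vectors)
second_vec = attribute_second_vector([3.0, 3.0], "180 cm", toy_value_vectors)
```

Here the attribute value "180 cm" averages to `[2.0, 1.0]`, and with a head entity vector of `[3.0, 3.0]` the second vector representation is `[1.0, 2.0]`. It would then be concatenated with the first vector representation of the attribute in the same way as for entity relations.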
Optionally, the mutual exclusion set acquisition module 83 includes:
a relation set acquisition unit, configured to acquire the triples that share the same head entity together with their corresponding relation vector representations, and to deduplicate the relation vector representations to obtain the relation set corresponding to that head entity;
a mutual exclusion set acquisition unit, configured to perform inclusion filtering on the relation sets corresponding to all head entities to obtain the relation mutual exclusion sets with the largest scope.
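One way to read the two units above is: group relations by head entity, deduplicate them into sets, then discard any set that is strictly contained in another, leaving only the maximal sets. The sketch below follows that reading, which is an interpretation of "inclusion filtering" rather than a confirmed detail of the application. Relation labels stand in for relation vector representations, and the function names are hypothetical.

```python
from collections import defaultdict


def relation_sets_by_head(triples):
    """Collect the deduplicated set of relations for each head entity."""
    sets = defaultdict(set)
    for head, relation, _ in triples:
        sets[head].add(relation)
    return dict(sets)


def maximal_relation_sets(relation_sets):
    """Keep only the sets that are not a strict subset of another set."""
    unique = {frozenset(s) for s in relation_sets}
    return [s for s in unique if not any(s < other for other in unique)]


toy_triples = [
    ("A", "born_in", "X"),
    ("A", "works_at", "Y"),
    ("A", "born_in", "X2"),   # duplicate relation for head A, removed by the set
    ("B", "born_in", "Z"),
    ("C", "born_in", "X"),
    ("C", "works_at", "Y"),
    ("C", "spouse", "D"),
    ("D", "capital_of", "W"),
]
per_head = relation_sets_by_head(toy_triples)
exclusion_sets = maximal_relation_sets(per_head.values())
```

In the toy data, the relation sets of heads A and B are both contained in the set of head C, so only `{born_in, works_at, spouse}` and `{capital_of}` survive. The comparison `s < other` on frozensets is a strict-subset test, so two identical sets do not eliminate each other; they are already merged by the set comprehension.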
Optionally, the clustering and merging module 84 includes:
a clustering unit, configured to perform cluster analysis, using a preset algorithm, on the relation vector representations corresponding to the triples in the triple corpus;
a judging unit, configured to determine, during the cluster analysis and for two clusters that are candidates for merging, whether at least one relation vector representation from each of the two clusters exists in the same maximal relation set;
a merging processing unit, configured not to merge the two clusters when the judging unit's result is yes, and otherwise to merge the two clusters.
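The judging and merging units act as a veto on an otherwise ordinary clustering step. The sketch below shows only that veto, not the clustering algorithm itself, which the text leaves as a "preset algorithm". It assumes one particular reading of the test: a merge is blocked when some relation in one cluster and a different relation in the other cluster both belong to the same mutual exclusion set. Clusters are modeled as sets of relation labels, and the name `can_merge` is hypothetical.

```python
def can_merge(cluster_a, cluster_b, exclusion_sets):
    """Return False if merging would join two relations from one exclusion set.

    cluster_a, cluster_b: sets of relation labels (stand-ins for vectors).
    exclusion_sets: iterable of sets of relations that must stay apart.
    """
    for exclusion in exclusion_sets:
        in_a = cluster_a & exclusion
        in_b = cluster_b & exclusion
        # Both clusters touch this exclusion set, and together they cover at
        # least two distinct relations from it: the merge is vetoed.
        if in_a and in_b and len(in_a | in_b) > 1:
            return False
    return True


toy_exclusion = [frozenset({"born_in", "works_at"})]
allowed = can_merge({"born_in"}, {"birthplace"}, toy_exclusion)
blocked = can_merge({"born_in", "birthplace"}, {"works_at"}, toy_exclusion)
```

The intuition behind this reading is that relations appearing together under one head entity are presumed to mean different things, so a cluster should not absorb two of them even when their vectors are close. Two clusters that share only the same single relation from an exclusion set are still allowed to merge under this sketch.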
For specific limitations on the relation alignment apparatus based on structured information, reference may be made to the limitations on the relation alignment method based on structured information above, which are not repeated here. Each module in the above apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in a computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 9. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a computer storage medium and an internal memory. The computer storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions in the computer storage medium. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer-readable instructions, when executed by the processor, implement a relation alignment method based on structured information. The readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
constructing a triple corpus, the triple corpus including several triples;
acquiring the relation vector representation corresponding to each triple in the triple corpus;
constructing, according to the relation vector representations, the relation set corresponding to the head entity of each triple, and obtaining the relation mutual exclusion sets with the largest scope from the relation sets;
performing cluster analysis on the triples in the triple corpus according to the relation vector representations, and merging clusters according to the relation mutual exclusion sets to obtain several clusters;
for each cluster, selecting the relation feature vector that occurs most frequently in the cluster as the target relation feature vector, and changing the relation feature vectors of all triples in the cluster to the target relation feature vector.
In one embodiment, one or more readable storage media storing computer-readable instructions are provided, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
constructing a triple corpus, the triple corpus including several triples;
acquiring the relation vector representation corresponding to each triple in the triple corpus;
constructing, according to the relation vector representations, the relation set corresponding to the head entity of each triple, and obtaining the relation mutual exclusion sets with the largest scope from the relation sets;
performing cluster analysis on the triples in the triple corpus according to the relation vector representations, and merging clusters according to the relation mutual exclusion sets to obtain several clusters;
for each cluster, selecting the relation feature vector that occurs most frequently in the cluster as the target relation feature vector, and changing the relation feature vectors of all triples in the cluster to the target relation feature vector.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by computer-readable instructions instructing the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium or a volatile computer-readable storage medium, and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art will clearly understand that, for convenience and brevity of description, only the above division into functional units and modules is used as an example. In practical applications, the above functions may be assigned to different functional units and modules as required; that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.

Claims (20)

  1. A relation alignment method based on structured information, comprising:
    constructing a triple corpus, the triple corpus including several triples;
    acquiring the relation vector representation corresponding to each triple in the triple corpus;
    constructing, according to the relation vector representations, the relation set corresponding to the head entity of each triple, and obtaining the relation mutual exclusion sets with the largest scope from the relation sets;
    performing cluster analysis on the triples in the triple corpus according to the relation vector representations, and merging clusters according to the relation mutual exclusion sets to obtain several clusters;
    for each cluster, selecting the relation feature vector that occurs most frequently in the cluster as the target relation feature vector, and changing the relation feature vectors of all triples in the cluster to the target relation feature vector.
  2. The relation alignment method based on structured information according to claim 1, wherein acquiring the relation vector representation corresponding to each triple in the triple corpus comprises:
    for a triple in the triple corpus, querying a preset word vector list to obtain a first vector representation of the relation corresponding to the triple;
    acquiring the head entity vector representation and the tail entity vector representation corresponding to the triple, and constructing a second vector representation of the relation corresponding to the triple according to the head entity vector representation and the tail entity vector representation;
    concatenating the first vector representation and the second vector representation to obtain the relation vector representation corresponding to the triple.
  3. The relation alignment method based on structured information according to claim 2, wherein acquiring the head entity vector representation and the tail entity vector representation corresponding to the triple, and constructing the second vector representation of the relation corresponding to the triple according to the head entity vector representation and the tail entity vector representation, comprises:
    acquiring the entity type and category information of the head entity corresponding to the triple, and querying the word vector list to obtain the entity type vector representation and the category information vector representation corresponding to the head entity;
    computing the average of the entity type vector representation and the category information vector representation corresponding to the head entity, as the head entity vector representation;
    acquiring the entity type and category information of the tail entity corresponding to the triple, and querying the word vector list to obtain the entity type vector representation and the category information vector representation corresponding to the tail entity;
    computing the average of the entity type vector representation and the category information vector representation corresponding to the tail entity, as the tail entity vector representation;
    computing the difference between the head entity vector representation and the tail entity vector representation, as the second vector representation of the relation corresponding to the triple.
  4. The relation alignment method based on structured information according to any one of claims 1 to 3, wherein acquiring the relation vector representation corresponding to each triple in the triple corpus comprises:
    for a triple in the triple corpus, querying a preset word vector list to obtain a first vector representation of the attribute corresponding to the triple;
    acquiring the head entity vector representation and the attribute value vector representation corresponding to the triple, and constructing a second vector representation of the attribute corresponding to the triple according to the head entity vector representation and the attribute value vector representation;
    concatenating the first vector representation and the second vector representation to obtain the relation vector representation corresponding to the triple.
  5. The relation alignment method based on structured information according to claim 4, wherein acquiring the head entity vector representation and the attribute value vector representation corresponding to the triple, and constructing the second vector representation of the attribute corresponding to the triple according to the head entity vector representation and the attribute value vector representation, comprises:
    acquiring the entity type and category information of the head entity corresponding to the triple, and querying the word vector list to obtain the entity type vector representation and the category information vector representation corresponding to the head entity;
    computing the average of the entity type vector representation and the category information vector representation corresponding to the head entity, as the head entity vector representation;
    performing word segmentation on the attribute value, and querying the preset word vector list according to the segmentation result to obtain a vector representation for each segmented word;
    computing the average of the segmented-word vector representations, as the attribute value vector representation;
    computing the difference between the head entity vector representation and the attribute value vector representation, as the second vector representation of the attribute corresponding to the triple.
  6. The relation alignment method based on structured information according to any one of claims 1 and 5, wherein constructing, according to the relation vector representations, the relation set corresponding to the head entity of each triple, and obtaining the relation mutual exclusion sets with the largest scope from the relation sets, comprises:
    acquiring the triples that share the same head entity together with their corresponding relation vector representations, and deduplicating the relation vector representations to obtain the relation set corresponding to that head entity;
    performing inclusion filtering on the relation sets corresponding to all head entities to obtain the relation mutual exclusion sets with the largest scope.
  7. The relation alignment method based on structured information according to claim 6, wherein performing cluster analysis on the triples in the triple corpus according to the relation vector representations, and merging clusters according to the relation mutual exclusion sets to obtain several clusters, comprises:
    performing cluster analysis, using a preset algorithm, on the relation vector representations corresponding to the triples in the triple corpus;
    during the cluster analysis, for two clusters that are candidates for merging, determining whether at least one relation vector representation from each of the two clusters exists in the same relation mutual exclusion set;
    if so, not merging the two clusters; otherwise, merging the two clusters.
  8. A relation alignment apparatus based on structured information, wherein the apparatus comprises:
    a corpus construction module, configured to construct a triple corpus, the triple corpus including several triples;
    a relation acquisition module, configured to acquire the relation vector representation corresponding to each triple in the triple corpus;
    a mutual exclusion set acquisition module, configured to construct, according to the relation vector representations, the relation set corresponding to the head entity of each triple, and to obtain the relation mutual exclusion sets with the largest scope from the relation sets;
    a clustering and merging module, configured to perform cluster analysis on the triples in the triple corpus according to the relation vector representations, and to merge clusters according to the relation mutual exclusion sets to obtain several clusters;
    a correction module, configured to, for each cluster, select the relation feature vector that occurs most frequently in the cluster as the target relation feature vector, and change the relation feature vectors of all triples in the cluster to the target relation feature vector.
  9. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
    constructing a triple corpus, the triple corpus including several triples;
    acquiring the relation vector representation corresponding to each triple in the triple corpus;
    constructing, according to the relation vector representations, the relation set corresponding to the head entity of each triple, and obtaining the relation mutual exclusion sets with the largest scope from the relation sets;
    performing cluster analysis on the triples in the triple corpus according to the relation vector representations, and merging clusters according to the relation mutual exclusion sets to obtain several clusters;
    for each cluster, selecting the relation feature vector that occurs most frequently in the cluster as the target relation feature vector, and changing the relation feature vectors of all triples in the cluster to the target relation feature vector.
  10. The computer device according to claim 9, wherein acquiring the relation vector representation corresponding to each triple in the triple corpus comprises:
    for a triple in the triple corpus, querying a preset word vector list to obtain a first vector representation of the relation corresponding to the triple;
    acquiring the head entity vector representation and the tail entity vector representation corresponding to the triple, and constructing a second vector representation of the relation corresponding to the triple according to the head entity vector representation and the tail entity vector representation;
    concatenating the first vector representation and the second vector representation to obtain the relation vector representation corresponding to the triple.
  11. The computer device according to claim 10, wherein acquiring the head entity vector representation and the tail entity vector representation corresponding to the triple, and constructing the second vector representation of the relation corresponding to the triple according to the head entity vector representation and the tail entity vector representation, comprises:
    acquiring the entity type and category information of the head entity corresponding to the triple, and querying the word vector list to obtain the entity type vector representation and the category information vector representation corresponding to the head entity;
    computing the average of the entity type vector representation and the category information vector representation corresponding to the head entity, as the head entity vector representation;
    acquiring the entity type and category information of the tail entity corresponding to the triple, and querying the word vector list to obtain the entity type vector representation and the category information vector representation corresponding to the tail entity;
    computing the average of the entity type vector representation and the category information vector representation corresponding to the tail entity, as the tail entity vector representation;
    computing the difference between the head entity vector representation and the tail entity vector representation, as the second vector representation of the relation corresponding to the triple.
  12. The computer device according to any one of claims 9 to 11, wherein acquiring the relation vector representation corresponding to each triple in the triple corpus comprises:
    for a triple in the triple corpus, querying a preset word vector list to obtain a first vector representation of the attribute corresponding to the triple;
    acquiring the head entity vector representation and the attribute value vector representation corresponding to the triple, and constructing a second vector representation of the attribute corresponding to the triple according to the head entity vector representation and the attribute value vector representation;
    concatenating the first vector representation and the second vector representation to obtain the relation vector representation corresponding to the triple.
  13. The computer device according to claim 12, wherein obtaining the head entity vector representation and the attribute value vector representation corresponding to the triplet, and constructing the second vector representation of the attribute corresponding to the triplet according to the head entity vector representation and the attribute value vector representation, comprises:
    obtaining the entity type and category information of the head entity corresponding to the triplet, and querying the word vector list to obtain an entity type vector representation and a category information vector representation corresponding to the head entity;
    computing the average of the entity type vector representation and the category information vector representation corresponding to the head entity as the head entity vector representation;
    performing word segmentation on the attribute value, and querying the preset word vector list according to the segmentation result to obtain a word vector representation corresponding to each segmented word;
    computing the average of the segmented-word vector representations as the attribute value vector representation;
    computing the difference between the head entity vector representation and the attribute value vector representation as the second vector representation of the attribute corresponding to the triplet.
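For attribute triplets, the steps of claims 12 and 13 can be sketched as follows. This is an illustration under stated assumptions: whitespace splitting stands in for a real word segmenter, the word vector list is a plain dictionary, and the function names and sample vectors are hypothetical.

```python
from typing import Callable, Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def attribute_value_vector(value: str, word_vectors: Dict[str, Vector],
                           tokenize: Callable[[str], List[str]] = str.split) -> Vector:
    """Segment the attribute value and average the segmented-word vectors.

    `tokenize` defaults to whitespace splitting, which is only a stand-in
    for whatever word segmenter is actually used.
    """
    return mean_vector([word_vectors[token] for token in tokenize(value)])


def attribute_relation_vector(first_vec: Vector, head_vec: Vector, value: str,
                              word_vectors: Dict[str, Vector]) -> Vector:
    """Concatenate the first vector with (head vector - attribute value vector)."""
    value_vec = attribute_value_vector(value, word_vectors)
    second_vec = [h - v for h, v in zip(head_vec, value_vec)]
    return first_vec + second_vec  # list concatenation, not element-wise addition


# Toy two-dimensional vectors (illustrative values only).
word_vectors = {"1990": [2.0, 2.0], "Beijing": [4.0, 0.0]}
relation_vec = attribute_relation_vector(
    first_vec=[1.0, 0.0], head_vec=[5.0, 5.0],
    value="1990 Beijing", word_vectors=word_vectors,
)
```

Because the two parts are concatenated, the resulting relationship vector has twice the dimensionality of the word vectors.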
  14. The computer device according to claim 9 or 13, wherein constructing the relationship set corresponding to the head entity in the triplets according to the relationship vector representations, and obtaining the mutually exclusive relationship set with the largest range from the relationship sets, comprises:
    obtaining the triplets that share the same head entity and their corresponding relationship vector representations, and deduplicating the relationship vector representations to obtain the relationship set corresponding to the head entity;
    performing inclusion filtering on the relationship sets corresponding to all head entities to obtain the mutually exclusive relationship set with the largest range.
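One possible reading of the two steps in claim 14 is sketched below. It assumes that "inclusion filtering" means discarding every relationship set that is strictly contained in another set; the claim itself does not spell this out. Relationship labels are used in place of relationship vectors for readability (vectors could be stored as tuples), and all names and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Triplet = Tuple[str, str, str]  # (head entity, relationship, tail entity or value)


def relation_sets_by_head(triplets: Iterable[Triplet]) -> Dict[str, Set[str]]:
    """Group relationships by head entity; set membership removes duplicates."""
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for head, relation, _tail in triplets:
        grouped[head].add(relation)
    return dict(grouped)


def inclusion_filter(relation_sets: Iterable[Set[str]]) -> List[FrozenSet[str]]:
    """Keep only the sets that are not a proper subset of any other set."""
    unique = {frozenset(s) for s in relation_sets}
    return [s for s in unique if not any(s < other for other in unique)]


# Illustrative triplets only.
triplets = [
    ("A", "birth_date", "1990"), ("A", "spouse", "B"), ("A", "spouse", "B"),
    ("B", "birth_date", "1991"), ("B", "spouse", "A"), ("B", "occupation", "actor"),
    ("C", "date_of_birth", "1985"),
]
sets_by_head = relation_sets_by_head(triplets)
maximal = inclusion_filter(sets_by_head.values())
```

Under this reading, relationships that co-occur on one head entity are treated as distinct from each other, so the surviving sets record which relationships should not be merged later.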
  15. One or more readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    constructing a triplet corpus, the triplet corpus comprising a plurality of triplets;
    obtaining the relationship vector representation corresponding to each triplet in the triplet corpus;
    constructing, according to the relationship vector representations, the relationship set corresponding to the head entity in the triplets, and obtaining the mutually exclusive relationship set with the largest range from the relationship sets;
    performing cluster analysis on the triplets in the triplet corpus according to the relationship vector features, and merging clusters according to the mutually exclusive relationship set to obtain a plurality of clusters;
    for each cluster, selecting the relationship feature vector that occurs most frequently in the cluster as a target relationship feature vector, and modifying the relationship feature vectors of all triplets in the cluster to the target relationship feature vector.
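The final step of claim 15 can be sketched as follows. The sketch covers only the per-cluster alignment and assumes the clusters have already been formed and merged; relationship labels stand in for relationship feature vectors, and the function name and sample cluster are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Triplet = Tuple[str, str, str]  # (head entity, relationship, tail entity or value)


def align_cluster(cluster: List[Triplet]) -> List[Triplet]:
    """Rewrite every triplet in the cluster to use its most frequent relationship."""
    counts = Counter(relation for _head, relation, _tail in cluster)
    target, _frequency = counts.most_common(1)[0]
    return [(head, target, tail) for head, _relation, tail in cluster]


# Illustrative cluster: three spellings of the same relationship.
cluster = [
    ("A", "birth_date", "1990"),
    ("B", "date_of_birth", "1991"),
    ("C", "birth_date", "1985"),
    ("D", "birthday", "1979"),
]
aligned = align_cluster(cluster)
```

When two relationships are equally frequent, `Counter.most_common` returns the one encountered first; the claim does not state a tie-breaking rule, so this behavior is an artifact of the sketch.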
  16. The readable storage medium according to claim 15, wherein obtaining the relationship vector representation corresponding to each triplet in the triplet corpus comprises:
    for a triplet in the triplet corpus, querying a preset word vector list to obtain a first vector representation of the relationship corresponding to the triplet;
    obtaining a head entity vector representation and a tail entity vector representation corresponding to the triplet, and constructing a second vector representation of the relationship corresponding to the triplet according to the head entity vector representation and the tail entity vector representation;
    concatenating the first vector representation and the second vector representation to obtain the relationship vector representation corresponding to the triplet.
  17. The readable storage medium according to claim 16, wherein obtaining the head entity vector representation and the tail entity vector representation corresponding to the triplet, and constructing the second vector representation of the relationship corresponding to the triplet according to the head entity vector representation and the tail entity vector representation, comprises:
    obtaining the entity type and category information of the head entity corresponding to the triplet, and querying the word vector list to obtain an entity type vector representation and a category information vector representation corresponding to the head entity;
    computing the average of the entity type vector representation and the category information vector representation corresponding to the head entity as the head entity vector representation;
    obtaining the entity type and category information of the tail entity corresponding to the triplet, and querying the word vector list to obtain an entity type vector representation and a category information vector representation corresponding to the tail entity;
    computing the average of the entity type vector representation and the category information vector representation corresponding to the tail entity as the tail entity vector representation;
    computing the difference between the head entity vector representation and the tail entity vector representation as the second vector representation of the relationship corresponding to the triplet.
  18. The readable storage medium according to any one of claims 15 to 17, wherein obtaining the relationship vector representation corresponding to each triplet in the triplet corpus comprises:
    for a triplet in the triplet corpus, querying a preset word vector list to obtain a first vector representation of the attribute corresponding to the triplet;
    obtaining a head entity vector representation and an attribute value vector representation corresponding to the triplet, and constructing a second vector representation of the attribute corresponding to the triplet according to the head entity vector representation and the attribute value vector representation;
    concatenating the first vector representation and the second vector representation to obtain the relationship vector representation corresponding to the triplet.
  19. The readable storage medium according to claim 18, wherein obtaining the head entity vector representation and the attribute value vector representation corresponding to the triplet, and constructing the second vector representation of the attribute corresponding to the triplet according to the head entity vector representation and the attribute value vector representation, comprises:
    obtaining the entity type and category information of the head entity corresponding to the triplet, and querying the word vector list to obtain an entity type vector representation and a category information vector representation corresponding to the head entity;
    computing the average of the entity type vector representation and the category information vector representation corresponding to the head entity as the head entity vector representation;
    performing word segmentation on the attribute value, and querying the preset word vector list according to the segmentation result to obtain a word vector representation corresponding to each segmented word;
    computing the average of the segmented-word vector representations as the attribute value vector representation;
    computing the difference between the head entity vector representation and the attribute value vector representation as the second vector representation of the attribute corresponding to the triplet.
  20. The readable storage medium according to claim 15 or 19, wherein constructing the relationship set corresponding to the head entity in the triplets according to the relationship vector representations, and obtaining the mutually exclusive relationship set with the largest range from the relationship sets, comprises:
    obtaining the triplets that share the same head entity and their corresponding relationship vector representations, and deduplicating the relationship vector representations to obtain the relationship set corresponding to the head entity;
    performing inclusion filtering on the relationship sets corresponding to all head entities to obtain the mutually exclusive relationship set with the largest range.
PCT/CN2021/096584 2021-04-19 2021-05-28 Structured-information-based relation alignment method and apparatus, and device and medium WO2022222226A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110420316.6 2021-04-19
CN202110420316.6A CN113158668B (en) 2021-04-19 2021-04-19 Relationship alignment method, device, equipment and medium based on structured information

Publications (1)

Publication Number Publication Date
WO2022222226A1 true WO2022222226A1 (en) 2022-10-27

Family

ID=76868936

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/096584 WO2022222226A1 (en) 2021-04-19 2021-05-28 Structured-information-based relation alignment method and apparatus, and device and medium

Country Status (2)

Country Link
CN (1) CN113158668B (en)
WO (1) WO2022222226A1 (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160092549A1 (en) * 2014-09-26 2016-03-31 International Business Machines Corporation Information Handling System and Computer Program Product for Deducing Entity Relationships Across Corpora Using Cluster Based Dictionary Vocabulary Lexicon
US20190087724A1 (en) * 2017-09-21 2019-03-21 Foundation Of Soongsil University Industry Cooperation Method of operating knowledgebase and server using the same
CN109992673A (en) * 2019-04-10 2019-07-09 广东工业大学 A kind of knowledge mapping generation method, device, equipment and readable storage medium storing program for executing
CN110516078A (en) * 2019-08-27 2019-11-29 合肥工业大学 Alignment schemes and device
CN111026865A (en) * 2019-10-18 2020-04-17 平安科技(深圳)有限公司 Relation alignment method, device and equipment of knowledge graph and storage medium
CN111061841A (en) * 2019-12-19 2020-04-24 京东方科技集团股份有限公司 Knowledge graph construction method and device
CN111198950A (en) * 2019-12-24 2020-05-26 浙江工业大学 Knowledge graph representation learning method based on semantic vector
CN112149400A (en) * 2020-09-23 2020-12-29 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150127323A1 (en) * 2013-11-04 2015-05-07 Xerox Corporation Refining inference rules with temporal event clustering
CN104933164B (en) * 2015-06-26 2018-10-09 华南理工大学 In internet mass data name entity between relationship extracting method and its system
CN110851609A (en) * 2018-07-24 2020-02-28 华为技术有限公司 Representation learning method and device

Also Published As

Publication number Publication date
CN113158668B (en) 2023-02-28
CN113158668A (en) 2021-07-23

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21937451

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21937451

Country of ref document: EP

Kind code of ref document: A1