CN113190690B - Unsupervised knowledge graph inference processing method, unsupervised knowledge graph inference processing device, unsupervised knowledge graph inference processing equipment and unsupervised knowledge graph inference processing medium - Google Patents


Info

Publication number
CN113190690B
CN113190690B (application number CN202110586895.1A)
Authority
CN
China
Prior art keywords
entities
entity
target
knowledge graph
relation
Prior art date
Legal status
Active
Application number
CN202110586895.1A
Other languages
Chinese (zh)
Other versions
CN113190690A
Inventor
徐菁
王吉星
张文志
孟竹喧
陈庆印
Current Assignee
Evaluation Argument Research Center Academy Of Military Sciences Pla China
Original Assignee
Evaluation Argument Research Center Academy Of Military Sciences Pla China
Priority date
Filing date
Publication date
Application filed by Evaluation Argument Research Center Academy Of Military Sciences Pla China filed Critical Evaluation Argument Research Center Academy Of Military Sciences Pla China
Priority to CN202110586895.1A
Publication of CN113190690A
Application granted; publication of CN113190690B
Legal status: Active

Classifications

    • G06F16/367 — Information retrieval; creation of semantic tools (e.g. ontology or thesauri); ontology
    • G06F16/35 — Information retrieval of unstructured textual data; clustering; classification
    • G06F18/241 — Pattern recognition; classification techniques relating to the classification model
    • G06F40/242 — Handling natural language data; lexical tools; dictionaries
    • G06F40/30 — Handling natural language data; semantic analysis
    • G06N3/044 — Neural networks; recurrent networks, e.g. Hopfield networks
    • G06N3/08 — Neural networks; learning methods


Abstract

The application relates to an unsupervised knowledge graph inference processing method, device, equipment, and medium. The method comprises the following steps: acquiring an input knowledge graph, a source entity to be inferred, and a network corpus; training a word2vec word vector model on the network corpus and vectorizing all entities and relations in the knowledge graph; performing cosine similarity calculation between the source entity and the vectorized representations of the entities in the knowledge graph to obtain a target entity candidate set; training a bidirectional long short-term memory network classifier model on the vectorized representations of the entities to determine the candidate target entities in the target entity candidate set that have an association relation with the source entity; determining target relation words and obtaining target relation triples from the source entity, the candidate target entities, and the network corpus; and, after calculating credibility scores for the target relation triples, outputting all target relation triples related to the source entity together with their credibility scores. Processing cost is low and generalization capability is strong.

Description

Unsupervised knowledge graph inference processing method, unsupervised knowledge graph inference processing device, unsupervised knowledge graph inference processing equipment and unsupervised knowledge graph inference processing medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to an unsupervised knowledge graph inference processing method, apparatus, device, and medium.
Background
Knowledge graph inference is the process of deriving implicit knowledge by reasoning over the existing knowledge in a knowledge graph. It is important for knowledge verification and completion, association relation analysis, and implicit knowledge mining: it provides efficient association discovery over resources in large-scale heterogeneous knowledge graphs and automatically verifies and completes the knowledge they contain. It therefore has significant application value in reducing cost and improving efficiency, and can be widely applied in scenarios such as medicine, finance, intelligent question answering, and recommendation systems, for example in assisted disease diagnosis, fraud identification, and understanding of user search intent.
Existing knowledge graph reasoning techniques mainly include methods based on description-logic inference, rule inference, distributed-representation inference, and neural network inference. These techniques rely on manually formulated logic rules and labeled training corpora to perform the reasoning task. In the process of implementing the present invention, however, the inventors found that conventional knowledge graph reasoning techniques suffer from the technical problem of high processing cost.
Disclosure of Invention
In view of the above, it is necessary to provide an unsupervised knowledge-graph inference processing method, an unsupervised knowledge-graph inference processing apparatus, a computer device, and a computer-readable storage medium, which have low processing cost.
In order to achieve the above purpose, the embodiment of the invention adopts the following technical scheme:
in one aspect, an embodiment of the present invention provides an unsupervised knowledge graph inference processing method, including:
acquiring an input knowledge graph, a source entity to be inferred and a network corpus; the knowledge graph comprises an entity set, a relation set and a category set of the entity;
obtaining a word2vec word vector model by utilizing network corpus training;
vectorizing all entities and relations in the knowledge graph by using a word2vec word vector model;
respectively carrying out cosine similarity calculation with a source entity according to vectorization expression of the entities in the knowledge graph to obtain a target entity candidate set;
training a bidirectional long short-term memory network classifier model according to the vectorized representation of the entities in the knowledge graph, and determining, by using the bidirectional long short-term memory network classifier model, the candidate target entities in the target entity candidate set that have an association relation with the source entity;
determining target relation words and obtaining target relation triples according to the source entity, the candidate target entities and the network corpus; the target relation triple comprises a source entity, target relation words and candidate target entities;
and after calculating the credibility scores of the target relationship triples, outputting all the target relationship triples related to the source entity and corresponding credibility scores.
In one embodiment, the step of vectorizing all entities and relations in the knowledge graph by using the word2vec word vector model includes:
respectively obtaining the attribute value of each entity from the knowledge graph, and respectively searching each entity and the vector representation form of the attribute value of each entity in the word2vec word vector model;
and determining the average value of the vector representation form of each entity and the vector representation form of the attribute value of each entity as the vectorized representation of each entity.
In one embodiment, the step of obtaining the candidate set of target entities by performing cosine similarity calculation with the source entity according to the vectorized representation of the entities in the knowledge-graph includes:
acquiring an entity set which is directly connected with a source entity in a knowledge graph and vectorization representation of each entity in the entity set;
performing cosine similarity calculation on the vectorized representation of the source entity and the vectorized representation of each entity in the entity set, and taking the average value of calculation results as a cosine threshold;
acquiring vectorized representation of other entities in the knowledge graph, wherein the other entities do not belong to the entity set;
respectively carrying out cosine similarity calculation on the vectorized representation of each other entity and the vectorized representation of the source entity to obtain corresponding similarity calculation results of each other entity;
and putting the other entities whose similarity calculation results exceed the cosine threshold into the target entity candidate set.
In one embodiment, the step of training a bidirectional long short-term memory network classifier model according to the vectorized representation of the entities in the knowledge graph, and determining, by using the classifier model, the candidate target entities in the target entity candidate set that have an association relation with the source entity, includes:
forming a first quintuple from any two entities connected by a direct edge in the knowledge graph, the edge relation, and the categories of the two entities;
replacing the two entities in the first quintuple with two other unrelated entities that have no association relation in the knowledge graph, and modifying the categories accordingly, to form a second quintuple;
taking the two entities and corresponding categories in the first quintuple as positive examples and the two unrelated entities and corresponding categories in the second quintuple as negative examples, and training on the vectorized representations of the positive and negative examples to obtain the bidirectional long short-term memory network classifier model, wherein the vectorized representation of a category is obtained by the one-hot method;
and judging, with the trained bidirectional long short-term memory network classifier model, whether each entity in the target entity candidate set has an association relation with the source entity, thereby determining the candidate target entities associated with the source entity.
In one embodiment, obtaining a vectorized representation of a category using a one-hot method comprises the steps of:
constructing the categories of all entities in the knowledge graph into a dictionary;
when performing the vectorized representation, setting the position corresponding to the selected category to 1 and the positions corresponding to all other categories to 0, and padding with zeros when the vector dimension is smaller than the set dimension value.
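The one-hot scheme above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the dictionary contents, the function names, and the padding length are assumptions.

```python
def build_category_dict(categories):
    """Construct a dictionary mapping each distinct entity category to a stable index."""
    return {c: i for i, c in enumerate(sorted(set(categories)))}

def one_hot(category, cat_dict, set_dim):
    """Set the selected category's position to 1 and all other positions to 0,
    then pad with zeros when the vector dimension is smaller than set_dim."""
    vec = [0] * max(len(cat_dict), set_dim)
    vec[cat_dict[category]] = 1
    return vec

# Hypothetical categories for illustration only.
cats = ["person", "weapon", "place"]
d = build_category_dict(cats)
v = one_hot("weapon", d, set_dim=5)
```

With three categories and a set dimension of 5, the resulting vector carries a single 1 at the selected category's index and two trailing zero-padding positions.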
In one embodiment, the process of determining the target relation words according to the source entity, the candidate target entities, and the network corpora includes:
acquiring third quintuples in which the categories of the two entities are the same as the categories of the source entity and the candidate target entity, and putting the relation words of the third quintuples into a candidate relation set;
extracting relation indicator words from network corpora of a set length in which the source entity and the candidate target entity appear simultaneously;
and performing cosine similarity calculation between the vectorized representations of the relation indicator words and the vectorized representations of the relation words in the candidate relation set, and determining the candidate relation word with the maximum similarity as the target relation word.
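The maximum-similarity selection above can be sketched as follows (a hedged illustration; the vectors and relation words are invented toy values, not from the patent):

```python
import math

def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return num / den if den else 0.0

def pick_target_relation(indicator_vec, candidate_vecs):
    """Return the candidate relation word whose vector is most similar
    to the relation indicator word's vector."""
    return max(candidate_vecs, key=lambda w: cosine(indicator_vec, candidate_vecs[w]))

# Toy 2-dimensional vectors standing in for word2vec representations.
indicator = [1.0, 0.0]
candidates = {"located_in": [0.9, 0.1], "member_of": [0.1, 0.9]}
best = pick_target_relation(indicator, candidates)
```

Here "located_in" wins because its vector points nearly the same way as the indicator's.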
In one embodiment, the process of calculating a confidence score for a target relationship triplet includes:
obtaining the reciprocal of the minimum number of connecting edges between the two entities of the target relation triple in the knowledge graph, and the ratio of the number of times the two entities appear simultaneously in network corpora of the set length to the number of times the two entities appear individually;
and adding the reciprocal and the ratio to obtain the credibility score of the target relation triple.
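A minimal sketch of this score follows. Note one assumption: the patent's "ratio of the number of times the two entities appear simultaneously to the number of times they appear individually" is read here as co-occurrences divided by the sum of the two entities' individual occurrence counts; other readings are possible.

```python
def credibility_score(min_edge_count, co_occurrences, occ_a, occ_b):
    """Credibility of a target relation triple: reciprocal of the minimum
    number of connecting edges between the two entities in the graph, plus
    the co-occurrence ratio in the corpus (assumed denominator: occ_a + occ_b)."""
    return 1.0 / min_edge_count + co_occurrences / (occ_a + occ_b)

# Toy numbers: entities 2 edges apart, co-occurring 10 times out of 100 mentions.
s = credibility_score(min_edge_count=2, co_occurrences=10, occ_a=40, occ_b=60)
```

With these toy counts the score is 0.5 + 0.1 = 0.6; a shorter graph path or more frequent co-occurrence raises it.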
In another aspect, an unsupervised knowledge graph inference processing apparatus is provided, including:
the input acquisition module is used for acquiring an input knowledge graph, a source entity to be inferred and a network corpus; the knowledge graph comprises an entity set, a relation set and a category set of the entity;
the model training module is used for obtaining a word2vec word vector model by utilizing network corpus training;
the vectorization module is used for vectorizing all entities and relations in the knowledge graph by using a word2vec word vector model;
the entity candidate module is used for respectively carrying out cosine similarity calculation with the source entity according to vectorization representation of the entities in the knowledge graph to obtain a target entity candidate set;
the association target module is used for training a bidirectional long short-term memory network classifier model according to the vectorized representation of the entities in the knowledge graph, and determining, by using the classifier model, the candidate target entities in the target entity candidate set that have an association relation with the source entity;
the target relation module is used for determining target relation words and obtaining target relation triples according to the source entities, the candidate target entities and the network linguistic data; the target relation triple comprises a source entity, target relation words and candidate target entities;
and the credibility output module is used for outputting all the target relation triples related to the source entity and corresponding credibility scores after calculating the credibility scores of the target relation triples.
In yet another aspect, a computer device is provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of any one of the above-mentioned unsupervised knowledge-graph inference processing methods when executing the computer program.
In yet another aspect, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the steps of any of the above unsupervised knowledge-graph inference processing methods.
One of the above technical solutions has the following advantages and beneficial effects:
by adopting the processing steps, the unsupervised knowledge graph reasoning processing method, the unsupervised knowledge graph reasoning processing device, the unsupervised knowledge graph reasoning processing equipment and the unsupervised knowledge graph reasoning processing medium do not need to manually label training corpora and make rules, and a deep learning model classifier is trained by using the existing knowledge and structure in the knowledge graph and combining large-scale internet corpora aiming at the input source entity to be deduced, so that an entity relation triple consisting of a target entity and relation words is obtained, and the credibility score of the entity relation triple is output. The structural information of the knowledge graph is fully utilized, the semantic information of knowledge is deeply mined, the characteristics of large-scale knowledge statistical information, strong learning of a neural network model and the like are integrated to improve the accuracy of the result, the generalization capability is strong, the implementation is easy, and the technical effects of greatly reducing the processing cost and improving the processing performance are achieved.
Drawings
FIG. 1 is a schematic flow diagram of an unsupervised knowledge-graph inference processing method in one embodiment;
FIG. 2 is a schematic diagram of an embodiment of a process for unsupervised knowledge-graph inference;
FIG. 3 is a flow diagram of entity vectorization representation in one embodiment;
FIG. 4 is a flowchart illustrating the target entity candidate set acquisition process in one embodiment;
FIG. 5 is a flowchart illustrating a process for determining a target entity, according to one embodiment;
FIG. 6 is a flowchart illustrating processing of determining target relational terms, in one embodiment;
fig. 7 is a block diagram of an unsupervised knowledge-graph inference processing apparatus in an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
In addition, the technical solutions in the embodiments of the present invention may be combined with each other, provided such combinations can be realized by those skilled in the art; when technical solutions are contradictory or cannot be realized, the combination should be considered absent and outside the protection scope of the present invention.
Knowledge graph reasoning is the process of deriving implicit knowledge by reasoning over the existing knowledge in a knowledge graph. During construction of a knowledge graph, because the quality of data sources is difficult to guarantee and extraction technology is imperfect, erroneous knowledge is easily extracted, so the graph contains noisy and mutually contradictory knowledge. In addition, owing to the incompleteness of data sources and omissions in knowledge extraction, it is difficult to construct a knowledge graph with complete information, and information such as entity attributes and association relations between entities is seriously deficient.
Existing knowledge graph reasoning techniques mainly include methods based on description-logic inference, rule inference, distributed-representation inference, and neural network inference. These techniques rely on manually formulated logic rules and labeled training corpora to perform the reasoning task. In practical application, however, the inventors found that they suffer from the high cost of manually labeling corpora and formulating rules, low coverage, poor generalization capability, lack of deep semantic information, and complex models, resulting in high processing cost and poor processing performance, and making them difficult to apply widely to large-scale knowledge graphs whose content grows increasingly rich.
In summary, addressing the technical problem that traditional knowledge graph reasoning is costly, the invention provides an unsupervised knowledge graph reasoning processing method: for a source entity to be inferred, a classifier is trained from the existing knowledge and structure of the knowledge graph combined with large-scale network corpora; target entities and target relation words are acquired; and target entity relation triples are output together with their credibility scores. This design eliminates the cost of manually labeling training corpora and formulating rules, has strong generalization capability, and is easy to realize.
It is understood that the problem to be solved by the present invention can be formally defined as follows. Input: a knowledge graph G = (E, R, C), where E = (e_1, …, e_i, …, e_m) represents the set of entities, R = (r_1, …, r_k, …, r_n) represents the set of relations, and C = (c_1, …, c_i, …, c_m) represents the set of categories of the entities; a source entity e_i to be inferred; and a network corpus D = (d_1, d_2, …, d_n), which may cover the corpora from which the knowledge graph was constructed, Wikipedia, Baidu Encyclopedia, news and blogs related to entities in the knowledge graph, and the like. Output: all relation triples (e_i, r_k, e_j) or (e_j, r_k, e_i) related to the source entity e_i (i.e., the source entity is not limited to the head or tail position; e_j denotes a candidate target entity associated with the source entity, and r_k denotes the target relation word), together with a credibility score s for each relation triple.
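The formal inputs can be sketched as a simple data structure. This is an illustration only; the class, field names, and example entities are assumptions, not part of the patent.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeGraph:
    entities: list      # E = (e_1, ..., e_m)
    relations: list     # R = (r_1, ..., r_n)
    categories: dict    # C: maps each entity to its category
    triples: list = field(default_factory=list)  # known facts as (head, relation, tail)

# Hypothetical toy graph for illustration.
G = KnowledgeGraph(
    entities=["tank_x", "factory_y"],
    relations=["produced_by"],
    categories={"tank_x": "weapon", "factory_y": "organization"},
    triples=[("tank_x", "produced_by", "factory_y")],
)
```

The reasoning task then amounts to adding new triples involving a given source entity, each with a credibility score.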
Referring to fig. 1 and fig. 2, in one aspect, the present invention provides an unsupervised knowledge-graph inference processing method, including the following steps S12 to S24:
s12, acquiring an input knowledge graph, a source entity to be inferred and a network corpus; the knowledge graph includes a set of entities, a set of relationships, and a set of categories of entities.
It will be appreciated that the input knowledge graph can be denoted as G = (E, R, C), the source entity to be inferred as e_i, and the network corpus as D. The entity set is E = (e_1, …, e_i, …, e_m), the relation set is R = (r_1, …, r_k, …, r_n), and the category set of the entities is C = (c_1, …, c_i, …, c_m). The computing device may obtain the above input data by, but not limited to, receiving external input, downloading from a database server, or reading from a hard disk.
And S14, obtaining a word2vec word vector model by utilizing network corpus training.
It can be understood that the Word2Vec word vector model is a distributed representation model that uses training schemes such as CBOW (Continuous Bag of Words) or Skip-Gram to vectorize the words in natural language texts. The trained word2vec word vector model takes the form of a file containing the words extracted from the training corpus and a representation vector for each word; the default vector length is 200 and can be adjusted according to experimental conditions.
S16, vectorizing all entities and relations in the knowledge graph by using a word2vec word vector model;
s18, respectively carrying out cosine similarity calculation with a source entity according to vectorization representation of the entities in the knowledge graph to obtain a target entity candidate set;
s20, training a bidirectional long short-term memory network classifier model according to the vectorized representation of the entities in the knowledge graph, and determining, by using the classifier model, the candidate target entities in the target entity candidate set that have an association relation with the source entity;
s22, determining target relation words and obtaining target relation triples according to the source entities, the candidate target entities and the network linguistic data; the target relation triple comprises a source entity, target relation words and candidate target entities;
and S24, after calculating the credibility scores of the target relationship triples, outputting all the target relationship triples related to the source entity and corresponding credibility scores.
According to the unsupervised knowledge graph reasoning processing method, by adopting the above processing steps, no manual labeling of training corpora or formulation of rules is needed. For the input source entity to be inferred, a deep learning model classifier is trained from the existing knowledge and structure in the knowledge graph combined with large-scale internet corpora, entity relation triples composed of the target entity and the relation word are obtained, and their credibility scores are output. The structural information of the knowledge graph is fully utilized, the semantic information of the knowledge is deeply mined, and large-scale knowledge statistics are combined with the strong learning ability of neural network models to improve the accuracy of the results; generalization capability is strong, implementation is easy, and the technical effects of greatly reducing processing cost and improving processing performance are achieved.
Referring to fig. 3, in an embodiment, the step S16 may specifically include the following processing steps S162 and S164:
s162, respectively obtaining the attribute value of each entity from the knowledge graph, and respectively searching each entity and the vector representation form of the attribute value of each entity in the word2vec word vector model;
s164, determining the average of the vector representation of each entity and the vector representation of its attribute values as the vectorized representation of that entity.
It will be appreciated that all entities and relations in the knowledge graph G are represented in vector form using the word2vec word vector model trained in step S14. Considering that entities can be homonymous or synonymous (for example, "firearm" may denote both a weapon and a registered brand), and in order to mine deep-level entity information, the attribute values of each entity are obtained from the knowledge graph G, and the vector representations of each entity and of its attribute values are looked up in the word2vec word vector model file; the final vector representation (i.e., the vectorized representation) of an entity is the average of the vectorized representation of the entity itself and the vectorized representation of its attribute values.
For ease of understanding, an example follows. Suppose entity e_j has the initial vector v'_{e_j} (i.e., its vector form in the word2vec word vector model file) and attribute values (a_{j1}, a_{j2}, …, a_{jy}) with initial vectors v'_{a_{j1}}, v'_{a_{j2}}, …, v'_{a_{jy}}. The vectorized representation of the attribute values is their mean,
v'_{a_j} = (1/y) · Σ_{k=1..y} v'_{a_{jk}},
and the final vector form of entity e_j is
v_{e_j} = (v'_{e_j} + v'_{a_j}) / 2.
By looking up the word2vec word vector model file, the vector representation forms of all relation words in the knowledge graph G can likewise be obtained.
Through the steps, high-accuracy vectorized representation of each entity is obtained.
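The entity-plus-attributes averaging can be sketched as follows (a minimal illustration; the function name and toy vectors are assumptions):

```python
def entity_vector(entity_vec, attr_vecs):
    """Final representation of an entity: the average of the entity's own
    word vector and the mean of its attribute-value vectors."""
    dim = len(entity_vec)
    attr_mean = [sum(v[k] for v in attr_vecs) / len(attr_vecs) for k in range(dim)]
    return [(entity_vec[k] + attr_mean[k]) / 2.0 for k in range(dim)]

# Toy 2-dimensional vectors: entity vector plus two attribute-value vectors.
v = entity_vector([1.0, 1.0], [[0.0, 2.0], [2.0, 0.0]])
```

With these toy inputs the attribute mean is [1.0, 1.0], so the final entity vector is also [1.0, 1.0].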
Referring to fig. 4, in an embodiment, the step S18 may specifically include the following processing steps S181 to S185:
s181, acquiring an entity set which is directly connected with the source entity in the knowledge graph and vectorization representation of each entity in the entity set;
s182, performing cosine similarity calculation on the vectorized representation of the source entity and the vectorized representation of each entity in the entity set, and taking the average value of calculation results as a cosine threshold;
s183, acquiring vectorization representation of other entities in the knowledge graph, wherein the other entities do not belong to the entity set;
s184, respectively carrying out cosine similarity calculation on the vectorized representation of each other entity and the vectorized representation of the source entity to obtain corresponding similarity calculation results of each other entity;
S185, putting the corresponding other entities whose similarity calculation results exceed the cosine threshold into the target entity candidate set.

It can be understood that the set of entities in the knowledge graph G that have direct edge connections to the source entity e_i (denoted N(e_i)) is obtained, the final vector forms of these entities are obtained from step S16, and cosine similarity calculation is performed between the final vector form of the source entity e_i and the final vector form of each entity in N(e_i), with the calculation formula:

cos(x, y) = (Σ_k x_k · y_k) / (√(Σ_k x_k²) · √(Σ_k y_k²))

where x_k and y_k denote the element values of the two vectors in dimension k. The mean of the calculation results is taken as the cosine threshold. For example, if v_e_i is the final vector form of the source entity e_i and v_1, v_2, …, v_m are the final vector forms of the m entities directly connected to it, the cosine threshold is:

threshold = (cos(v_e_i, v_1) + cos(v_e_i, v_2) + … + cos(v_e_i, v_m)) / m
other entities (other than those in knowledge-graph G) are obtained from step S16
Figure BDA0003088083300000114
Except) and the source entity e, and separately connecting these other entities to the source entity e i Final vector representation of
Figure BDA0003088083300000115
And performing cosine similarity calculation, and putting the entity of which the calculation result exceeds the cosine threshold value into a candidate target entity set.
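The candidate-selection procedure above can be sketched in Python; the vectors and entity names are toy values, and `candidate_targets` is an illustrative name, not from the patent:

```python
import numpy as np

def cosine(x, y):
    """cos(x, y) = sum_k x_k*y_k / (||x|| * ||y||), as in the formula above."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def candidate_targets(source_vec, neighbor_vecs, other_vecs):
    """Threshold = mean similarity to directly connected entities;
    other entities exceeding it enter the candidate target entity set."""
    threshold = np.mean([cosine(source_vec, v) for v in neighbor_vecs])
    cands = [name for name, v in other_vecs.items()
             if cosine(source_vec, v) > threshold]
    return cands, threshold

src = np.array([1.0, 1.0, 0.0])
neighbors = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
others = {"e5": np.array([1.0, 1.0, 0.1]),   # nearly parallel to src
          "e6": np.array([0.0, 0.0, 1.0])}   # orthogonal to src
cands, thr = candidate_targets(src, neighbors, others)
print(cands, round(thr, 3))  # → ['e5'] 0.707
```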
Referring to fig. 5, in an embodiment, the step S20 may specifically include the following processing steps S202 to S208:
S202, forming a first quintuple from any two entities connected by a direct edge in the knowledge graph, the edge relation, and the categories of the two entities;
S204, replacing the two entities in the first quintuple with two other unrelated entities in the knowledge graph that have no association relation, and correspondingly modifying the categories, to form a second quintuple;
S206, taking the two entities and their corresponding categories in the first quintuple as a positive example, taking the two unrelated entities and their corresponding categories in the second quintuple as a negative example, and training with the vectorized representations of the positive and negative examples to obtain a bidirectional long-short term memory network classifier model, where the vectorized representation of a category is obtained with a one-hot method;
S208, judging the association relation between the entities in the target entity candidate set and the source entity by using the trained bidirectional long-short term memory network classifier model, and determining the candidate target entities in the target entity candidate set that have an association relation with the source entity.
It can be understood that any two entities connected by a direct edge in the knowledge graph G, the edge relation, and the categories of the two entities are combined into a first quintuple, denoted (e_a, c_a, r_k, e_b, c_b), where e_a ∈ E, c_a ∈ C, r_k ∈ R, e_b ∈ E, c_b ∈ C. For example, entity e_a denotes "a certain combat squad" (whose category is organization) and entity e_b denotes "a certain aviation base" (whose category is place); there is a direct edge between them whose relation word is "stationed", so the first quintuple formed from these five pieces of information is (a certain combat squad, organization, stationed, a certain aviation base, place).
The two entities in the first quintuple (i.e., e_a and e_b) are replaced with other entities in the knowledge graph G, and the category information of the entities is modified accordingly, to form a second quintuple. The constraint on the replacement entities is: the two replacement entities must not appear simultaneously in a network corpus of the set length (default is one complete sentence, such as a sentence segmented by "." "!" or other punctuation marks), which ensures that there is no association between the two replacement entities.
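The replacement constraint can be sketched in Python; the toy corpus, entity names, and function names (`cooccur`, `negative_quintuple`) are illustrative assumptions, not from the patent:

```python
import re
from itertools import combinations

def cooccur(e1, e2, corpus):
    """True if both entities appear in the same sentence (the default
    'set length'), sentences being split on sentence-ending punctuation."""
    for sent in re.split(r"[.!?。！？]", corpus):
        if e1 in sent and e2 in sent:
            return True
    return False

def negative_quintuple(pos, entities, categories, corpus):
    """Replace both entities of a positive quintuple (e_a, c_a, r_k, e_b, c_b)
    with a pair that never co-occurs in the corpus, updating the categories."""
    e_a, _, r_k, e_b, _ = pos
    pool = [e for e in entities if e not in (e_a, e_b)]
    for x, y in combinations(pool, 2):
        if not cooccur(x, y, corpus):
            return (x, categories[x], r_k, y, categories[y])
    return None

corpus = "squad A is stationed at base B. squad D trains. base C opened."
pos = ("squad A", "organization", "stationed", "base B", "place")
cats = {"squad A": "organization", "base B": "place",
        "squad D": "organization", "base C": "place"}
neg = negative_quintuple(pos, list(cats), cats, corpus)
print(neg)  # → ('squad D', 'organization', 'stationed', 'base C', 'place')
```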
Two entities and their category information are obtained from the first quintuple as a positive example, and two entities and their category information are obtained from the second quintuple as a negative example; each obtained entity and its category are expressed in vector form and used to train a bidirectional long-short term memory network classifier model (BiLSTM for short, which handles long-range order dependence well). The final vector forms of these entities are obtained from step S14, and the vector representation of the entity categories uses a one-hot method.
In one embodiment, obtaining the vectorized representation of a category with the one-hot method specifically includes the steps of:
constructing the categories of all entities in the knowledge graph into a dictionary;
and, when performing vectorized representation, setting the position corresponding to the selected category to 1 and the positions corresponding to the other categories to 0, and appending zeros to any vectorized representation whose vector dimension is smaller than the set dimension value.
Specifically, all entity categories in the knowledge graph G are constructed into a dictionary; when a certain category is expressed in vector form, the position corresponding to that category is set to 1 and the positions corresponding to the other categories are set to 0. To keep the vector dimensions of an entity and of its category consistent, zeros are appended after the vector representation with the smaller dimension. The set dimension value is a dimension limit preset according to the vector dimension of the entities, used to judge whether the vector dimension of a category is smaller than that of the entities. A BiLSTM is formed by combining a forward LSTM (Long Short-Term Memory) and a backward LSTM.
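The one-hot encoding with zero padding can be sketched in Python; the category list and the name `one_hot_category` are illustrative, and the sketch assumes the entity vector dimension is at least the number of categories (the case the padding step addresses):

```python
def one_hot_category(category, all_categories, dim):
    """One-hot over the category dictionary, zero-padded on the right so
    the vector dimension matches the entity vector dimension `dim`
    (assumed >= number of categories)."""
    idx = sorted(all_categories).index(category)  # dictionary position
    vec = [0.0] * dim
    vec[idx] = 1.0
    return vec

cats = ["organization", "place", "person"]
v = one_hot_category("place", cats, dim=6)
print(v)  # → [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```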
The trained BiLSTM classifier model is used to judge whether an association relation exists between each candidate target entity from step S18 and the source entity e_i; if an association relation exists, the candidate target entity is processed in the subsequent steps. Through these processing steps, the candidate target entities that have an association relation with the source entity can be determined efficiently and accurately.
Referring to fig. 6, in an embodiment, regarding the process of determining the target relation term according to the source entity, the candidate target entity and the network corpus in the step S22, the process may specifically include the following processing steps S222 to S226:
S222, acquiring, for the source entity and the candidate target entity, third quintuples whose two entities have the same categories as the two entities in the first quintuple, and putting the relation words of the third quintuples into a candidate relation set;
S224, extracting relation indicator words from the network corpus of the set length in which the source entity and the candidate target entity appear simultaneously;
S226, performing cosine similarity calculation between the vectorized representations of the relation indicator words and the vectorized representations of the relation words in the candidate relation set, and determining the candidate relation word with the maximum similarity as the target relation word.
Specifically, for the source entity e_i and its candidate target entity (denoted e_j) output with an association relation in step S208, quintuples are obtained whose entity categories are the same as those of the two entities (e_a and e_b) in the first quintuple of step S202 (these are the third quintuples), and the relation word r_k of each third quintuple is put into the candidate relation set. Here, "the same category" means: the category of entity e_i is the same as that of either entity e_a or e_b, or the category of entity e_j is the same as that of either entity e_a or e_b.
Relation indicator words are extracted from the network corpus of the set length (default is one complete sentence, such as a sentence segmented by punctuation marks like "." and "!") in which the source entity e_i and its candidate target entity e_j appear simultaneously; the vector representation forms of the relation indicator words and of the candidate relation words (i.e., the relation words in the candidate relation set) are obtained from step S14 for cosine similarity calculation, and the candidate relation word with the maximum similarity (denoted r_k) is taken as the target relation word.

Through these processing steps, the target relation word can be determined efficiently and accurately, so that it can form, together with the source entity e_i and its candidate target entity e_j, the target relation triple (e_i, r_k, e_j).
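The maximum-similarity selection of the target relation word can be sketched in Python; the toy vectors and relation words stand in for word2vec lookups and are not from the patent:

```python
import numpy as np

def cosine(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def pick_relation(indicator_vec, candidate_relations):
    """Choose the candidate relation word whose vector is most
    cosine-similar to the relation indicator word's vector."""
    return max(candidate_relations,
               key=lambda w: cosine(indicator_vec, candidate_relations[w]))

# Toy vectors standing in for word2vec lookups of relation words.
indicator = np.array([0.9, 0.1, 0.0])
candidates = {"stationed": np.array([1.0, 0.0, 0.0]),
              "commands":  np.array([0.0, 1.0, 0.0])}
r_k = pick_relation(indicator, candidates)
print(r_k)  # → stationed
```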
In an embodiment, the process of calculating the credibility score of the target relation triple in step S24 may specifically include the following processing steps: obtaining the reciprocal of the minimum number of connecting edges between the two entities of the target relation triple in the knowledge graph, and the ratio of the number of times the two entities appear simultaneously in the network corpus of the set length to the number of times the two entities appear respectively;
and adding the reciprocal and the ratio to obtain the credibility score of the target relation triple.
Specifically, the reciprocal of the minimum number of connecting edges between the two entities of the target relation triple (e_i, r_k, e_j) in the knowledge graph G and the ratio of the number of times the two entities appear simultaneously in the network corpus of the set length (default is one complete sentence, such as a sentence segmented by punctuation marks like "." and "!") to the number of times the two entities appear respectively are added to obtain the credibility score of the target relation triple.

For ease of understanding, for example: the minimum number of connecting edges between the source entity e_i and the candidate target entity e_j in the knowledge graph G is 4, the two entities appear simultaneously 5 times in the network corpus of the set length, the source entity e_i appears 15 times in the network corpus, and the candidate target entity e_j appears 25 times; the credibility score of the target relation triple (e_i, r_k, e_j) is then obtained by adding the reciprocal 1/4 to the ratio of the co-occurrence count 5 to the respective occurrence counts 15 and 25.
similarly, through the processing steps, the credibility scores of the target relation triples can be accurately obtained.
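The score can be sketched in Python under one reading of the wording above, namely that the co-occurrence ratio is the co-occurrence count divided by the sum of the two entities' individual counts; this interpretation, the BFS helper, and all names are assumptions for illustration:

```python
from collections import deque

def min_edges(graph, a, b):
    """BFS shortest path (in edges) between two entities in the graph."""
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == b:
            return dist
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None

def credibility(graph, a, b, co_count, a_count, b_count):
    """Score = 1 / (minimum connecting edges) + co-occurrence ratio.
    The ratio is read here as co_count / (a_count + b_count) -- one
    possible reading of the patent's wording."""
    return 1.0 / min_edges(graph, a, b) + co_count / (a_count + b_count)

# Chain e_i - x - y - z - e_j gives a minimum of 4 connecting edges.
g = {"e_i": ["x"], "x": ["e_i", "y"], "y": ["x", "z"],
     "z": ["y", "e_j"], "e_j": ["z"]}
score = credibility(g, "e_i", "e_j", co_count=5, a_count=15, b_count=25)
print(score)  # → 0.375 (= 1/4 + 5/40)
```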
It should be understood that although the steps in the flowcharts of figs. 1-6 are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in figs. 1-6 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
Referring to fig. 7, in an embodiment, there is further provided an unsupervised knowledge-graph inference processing apparatus 100, which includes an input obtaining module 11, a model training module 13, a vectorization module 15, an entity candidate module 17, an association target module 19, a target relation module 21, and a trusted output module 23. The input acquisition module 11 is used for acquiring an input knowledge graph, a source entity to be inferred and a network corpus; the knowledge graph includes a set of entities, a set of relationships, and a set of categories of entities. The model training module 13 is configured to obtain a word2vec word vector model by using the web corpus training. The vectorization module 15 is configured to perform vectorization representation on all entities and relationships in the knowledge graph by using the word2vec word vector model. The entity candidate module 17 is configured to perform cosine similarity calculation with the source entity according to vectorization representation of the entity in the knowledge graph, and obtain a target entity candidate set. The association target module 19 is configured to train a two-way long-short term memory network classifier model according to vectorized representation of the entities in the knowledge graph, and determine candidate target entities in the target entity candidate set, which have an association relationship with the source entity, by using the two-way long-short term memory network classifier model. The target relation module 21 is configured to determine a target relation word and obtain a target relation triple according to the source entity, the candidate target entity, and the network corpus; the target relation triple comprises a source entity, target relation words and candidate target entities. 
The credibility output module 23 is configured to output all the target relationship triples related to the source entity and corresponding credibility scores after calculating the credibility scores of the target relationship triples.
Through the cooperation of its modules, the unsupervised knowledge graph reasoning processing device 100 uses the existing knowledge and structure in the knowledge graph, combined with large-scale Internet corpus, to train a deep-learning classifier for the input source entity to be inferred, without manually labeled training corpus or hand-crafted rules, obtaining entity relation triples formed by target entities and relation words and outputting their credibility values. The structural information of the knowledge graph is fully utilized, the semantic information of the knowledge is deeply mined, and large-scale statistical information and the strong learning capability of neural network models are integrated to improve the accuracy of the results; the approach has strong generalization ability and is easy to implement, achieving the technical effects of greatly reducing processing cost and improving processing performance.
In one embodiment, the vectorization module 15 described above may include an attribute lookup sub-module and a representation determination sub-module. The attribute lookup sub-module is used for obtaining the attribute values of each entity from the knowledge graph and looking up, in the word2vec word vector model, the vector representation forms of each entity and of its attribute values. The representation determination sub-module is used for determining the mean of the vector representation form of each entity and the vector representation forms of its attribute values as the vectorized representation of that entity.
In one embodiment, the entity candidate module 17 may include a connection representation sub-module, a threshold calculation sub-module, a remainder representation sub-module, a similarity result sub-module, and a candidate discrimination sub-module. The connection representation sub-module is used for obtaining the set of entities directly connected to the source entity in the knowledge graph and the vectorized representation of each entity in the set. The threshold calculation sub-module is used for performing cosine similarity calculation between the vectorized representation of the source entity and the vectorized representation of each entity in the set, and taking the mean of the calculation results as the cosine threshold. The remainder representation sub-module is used for obtaining the vectorized representations of the remaining entities in the knowledge graph that do not belong to the set. The similarity result sub-module is used for performing cosine similarity calculation between the vectorized representation of each remaining entity and that of the source entity to obtain the corresponding similarity calculation result for each. The candidate discrimination sub-module is used for putting the remaining entities whose similarity calculation results exceed the cosine threshold into the target entity candidate set.
In one embodiment, the association target module 19 may include a first quintuple sub-module, a second quintuple sub-module, a classifier training sub-module, and an association determination sub-module. The first quintuple sub-module is used for forming a first quintuple from any two entities connected by a direct edge, the edge relation, and the categories of the two entities. The second quintuple sub-module is used for replacing the two entities in the first quintuple with two other unrelated entities in the knowledge graph that have no association relation, and correspondingly modifying the categories, to form a second quintuple. The classifier training sub-module is used for taking the two entities and their corresponding categories in the first quintuple as a positive example, taking the two unrelated entities and their corresponding categories in the second quintuple as a negative example, and training with the vectorized representations of the positive and negative examples to obtain the bidirectional long-short term memory network classifier model, where the vectorized representation of a category is obtained with a one-hot method. The association determination sub-module is used for judging the association relation between the entities in the target entity candidate set and the source entity by using the trained bidirectional long-short term memory network classifier model, and determining the candidate target entities in the target entity candidate set that have an association relation with the source entity.
In an embodiment, the vectorized representation of a category is obtained with a one-hot method, which specifically includes the following implementation steps: constructing the categories of all entities in the knowledge graph into a dictionary; and, when performing vectorized representation, setting the position corresponding to the selected category to 1 and the positions corresponding to the other categories to 0, and appending zeros to any vectorized representation whose vector dimension is smaller than the set dimension value.
In an embodiment, the target relationship module 21 may be specifically configured to implement the following process in the process of determining the target relationship word according to the source entity, the candidate target entity and the network corpus: acquiring a source entity and a candidate target entity, and a third quintuple with the same category as any two entities in the first quintuple, and putting the relation words of the third quintuple into a candidate relation set; extracting a relation indicator from a network corpus with a set length in which a source entity and a candidate target entity appear simultaneously; and performing cosine similarity calculation on the vectorized representation of the relation indicator words and the vectorized representation of the relation words in the candidate relation set, and determining the candidate relation words with the maximum similarity as target relation words.
In an embodiment, in the process of calculating the credibility score of the target relation triple, the credibility output module 23 may be specifically configured to implement the following process: obtaining the reciprocal of the minimum number of connecting edges between the two entities of the target relation triple in the knowledge graph, and the ratio of the number of times the two entities appear simultaneously in the network corpus of the set length to the number of times the two entities appear respectively; and adding the reciprocal and the ratio to obtain the credibility score of the target relation triple.
For specific limitations of the unsupervised knowledge graph inference processing apparatus 100, reference may be made to the corresponding limitations of the unsupervised knowledge graph inference processing method above, and details are not repeated here. The modules in the unsupervised knowledge graph inference processing apparatus 100 may be implemented in whole or in part by software, by hardware, or by a combination thereof. Each module may be embedded in hardware form in, or independent of, a device with specific data processing functions, or stored in software form in the memory of the device, so that a processor can invoke and execute the operations corresponding to each module; the device may be, but is not limited to, the various data computation and analysis devices existing in the art.
In still another aspect, a computer device is provided, which includes a memory and a processor, the memory stores a computer program, and the processor executes the computer program to implement the following steps: acquiring an input knowledge graph, a source entity to be inferred and a network corpus; the knowledge graph comprises an entity set, a relation set and a category set of the entities; obtaining a word2vec word vector model by utilizing network corpus training; vectorizing all entities and relations in the knowledge graph by using the word2vec word vector model; respectively carrying out cosine similarity calculation with the source entity according to the vectorized representation of the entities in the knowledge graph to obtain a target entity candidate set; training a bidirectional long-short term memory network classifier model according to the vectorized representation of the entities in the knowledge graph, and determining candidate target entities in the target entity candidate set, which have an association relation with the source entity, by using the bidirectional long-short term memory network classifier model; determining target relation words and obtaining target relation triples according to the source entity, the candidate target entities and the network corpus; the target relation triple comprises the source entity, the target relation word and the candidate target entity; and after calculating the credibility scores of the target relation triples, outputting all the target relation triples related to the source entity and the corresponding credibility scores.
In one embodiment, the processor, when executing the computer program, may further implement the additional steps or sub-steps of the above-described unsupervised knowledge-graph inference processing method in various embodiments.
In yet another aspect, there is also provided a computer readable storage medium having a computer program stored thereon, the computer program when executed by a processor implementing the steps of: acquiring an input knowledge graph, a source entity to be inferred and a network corpus; the knowledge graph comprises an entity set, a relation set and a category set of the entities; obtaining a word2vec word vector model by utilizing network corpus training; vectorizing all entities and relations in the knowledge graph by using the word2vec word vector model; respectively carrying out cosine similarity calculation with the source entity according to the vectorized representation of the entities in the knowledge graph to obtain a target entity candidate set; training a bidirectional long-short term memory network classifier model according to the vectorized representation of the entities in the knowledge graph, and determining candidate target entities in the target entity candidate set that have an association relation with the source entity by using the bidirectional long-short term memory network classifier model; determining target relation words and obtaining target relation triples according to the source entity, the candidate target entities and the network corpus; the target relation triple comprises the source entity, the target relation word and the candidate target entity; and after calculating the credibility scores of the target relation triples, outputting all the target relation triples related to the source entity and the corresponding credibility scores.
In one embodiment, the computer program, when executed by the processor, may further implement the additional steps or sub-steps of the embodiments of the unsupervised knowledge-graph inference processing method described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program, which may be stored in a non-volatile computer-readable storage medium and which, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus DRAM (RDRAM), and direct Rambus DRAM (DRDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above examples only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for those skilled in the art, various changes and modifications can be made without departing from the spirit of the present application, and all of them fall within the scope of the present application. Therefore, the protection scope of the present patent should be subject to the appended claims.

Claims (10)

1. An unsupervised knowledge graph inference processing method is characterized by comprising the following steps:
acquiring an input knowledge graph, a source entity to be inferred and a network corpus; the knowledge graph comprises an entity set, a relation set and a category set of entities;
training by utilizing the network corpus to obtain a word2vec word vector model;
vectorizing all entities and relations in the knowledge graph by using the word2vec word vector model;
respectively carrying out cosine similarity calculation with the source entity according to vectorization representation of the entities in the knowledge graph to obtain a target entity candidate set;
training a bidirectional long-short term memory network classifier model according to vectorization representation of the entities in the knowledge graph, and determining candidate target entities in the target entity candidate set, which have an association relation with the source entity, by using the bidirectional long-short term memory network classifier model;
determining target relation words and obtaining target relation triples according to the source entities, the candidate target entities and the network linguistic data; the target relationship triplets include the source entity, the target relationship terms, and the candidate target entities;
and after calculating the credibility scores of the target relationship triples, outputting all the target relationship triples related to the source entity and corresponding credibility scores.
2. The unsupervised knowledge graph inference processing method of claim 1, wherein the step of vectorizing all entities and relations in the knowledge graph by using the word2vec word vector model comprises:
respectively acquiring an attribute value of each entity from the knowledge graph, and respectively searching each entity and a vector representation form of the attribute value of each entity in the word2vec word vector model;
and determining the average value of the vector representation form of each entity and the vector representation form of the attribute value of each entity as the vectorized representation of each entity.
3. The unsupervised knowledge-graph inference processing method of claim 1 or 2, wherein the step of obtaining a candidate set of target entities by performing a cosine similarity calculation with the source entities according to the vectorized representation of the entities in the knowledge-graph comprises:
acquiring an entity set which is directly connected with the source entity in the knowledge graph and vectorized representation of each entity in the entity set;
performing cosine similarity calculation on the vectorized representation of the source entity and the vectorized representation of each entity in the entity set, and taking the mean value of calculation results as a cosine threshold;
obtaining vectorized representations of the remaining entities in the knowledge-graph that do not belong to the set of entities;
respectively carrying out cosine similarity calculation on the vectorized representation of each of the other entities and the vectorized representation of the source entity to obtain a similarity calculation result corresponding to each of the other entities;
and putting the other entities whose similarity calculation results exceed the cosine threshold into the target entity candidate set.
4. The unsupervised knowledge graph inference processing method according to claim 3, wherein the step of training a two-way long-short term memory network classifier model based on vectorized representations of entities in the knowledge graph, and using the two-way long-short term memory network classifier model to determine candidate target entities in the target entity candidate set that have an association relationship with the source entity comprises:
forming a first quintuple by any two entities with direct edges connected in the knowledge graph, edge relations and the classes of the any two entities;
replacing any two entities in the first quintuple with two other unrelated entities in the knowledge graph that have no association relation, and correspondingly modifying the categories, to form a second quintuple;
taking any two entities and corresponding classes in the first quintuple as positive examples, taking two unrelated entities and corresponding classes in the second quintuple as negative examples, and obtaining the two-way long-short term memory network classifier model by vectorization representation training of the positive examples and the negative examples; wherein the vectorization representation of the category is obtained by adopting a one-hot method;
and judging the association relation between the entities in the target entity candidate set and the source entity by using the trained bidirectional long-short term memory network classifier model, and determining the candidate target entities in the target entity candidate set that have an association relation with the source entity.
5. The unsupervised knowledge graph inference processing method of claim 4, wherein the vectorized representation of a category is obtained using a one-hot method comprising the steps of:
constructing the categories of all the entities in the knowledge graph into a dictionary;
and, when performing vectorized representation, setting the position corresponding to the selected category to 1 and the positions corresponding to the other categories to 0, and appending zeros to any vectorized representation whose vector dimension is smaller than the set dimension value.
6. The unsupervised knowledge graph inference processing method of claim 4, wherein the process of determining the target relation word from the source entity, the candidate target entity and the network corpus comprises:
acquiring a third quintuple, formed in the same way as the first quintuple, whose two entities have the same categories as the source entity and the candidate target entity, and putting the relation word of the third quintuple into a candidate relation set;
extracting relation indicator words from network corpora of a set length in which the source entity and the candidate target entity appear simultaneously;
and calculating the cosine similarity between the vectorized representation of each relation indicator word and the vectorized representations of the relation words in the candidate relation set, and determining the candidate relation word with the maximum similarity as the target relation word.
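A minimal sketch of the similarity step in claim 6, assuming the vectors come from a trained word2vec model (the relation words and vector values below are made up for illustration):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

def pick_target_relation(indicator_vec, candidate_vecs):
    """Return the candidate relation word whose vector is most similar
    to the relation indicator word's vector."""
    return max(candidate_vecs, key=lambda w: cosine(indicator_vec, candidate_vecs[w]))

# Toy vectors standing in for word2vec embeddings (hypothetical values).
candidates = {"located_in": [0.9, 0.1, 0.0], "works_for": [0.0, 1.0, 0.2]}
indicator = [0.8, 0.2, 0.1]
pick_target_relation(indicator, candidates)  # -> "located_in"
```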
7. The unsupervised knowledge graph inference processing method of any one of claims 4 to 6, wherein the process of calculating the confidence score of a target relation triple comprises:
obtaining the reciprocal of the minimum number of connecting edges between the two entities of the target relation triple in the knowledge graph, and the ratio of the number of times the two entities appear simultaneously in network corpora of the set length to the number of times the two entities appear individually;
and adding the reciprocal and the ratio to obtain the confidence score of the target relation triple.
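One possible reading of the score in claim 7, sketched with a plain breadth-first search for the minimum edge count (the graph, the co-occurrence windows, and dividing by the sum of the individual counts are assumptions, since the claim does not fix how "the number of times the two entities appear respectively" is combined):

```python
from collections import deque

def min_edge_count(graph, src, dst):
    """BFS shortest-path length (number of edges) between two entities
    in an undirected adjacency-list graph."""
    if src == dst:
        return 0
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        for nb in graph.get(node, ()):
            if nb == dst:
                return d + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, d + 1))
    return float("inf")

def confidence_score(graph, corpus_windows, e1, e2):
    """Score = reciprocal of the minimum edge count, plus the ratio of
    joint occurrences to the sum of individual occurrences (one reading)."""
    co = sum(1 for w in corpus_windows if e1 in w and e2 in w)
    n1 = sum(1 for w in corpus_windows if e1 in w)
    n2 = sum(1 for w in corpus_windows if e2 in w)
    ratio = co / (n1 + n2) if (n1 + n2) else 0.0
    return 1.0 / min_edge_count(graph, e1, e2) + ratio

graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
windows = [{"A", "C"}, {"A"}, {"C"}]
confidence_score(graph, windows, "A", "C")  # 1/2 + 1/4 = 0.75
```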
8. An unsupervised knowledge graph inference processing apparatus, comprising:
an input acquisition module for acquiring an input knowledge graph, a source entity to be inferred, and a network corpus, the knowledge graph comprising an entity set, a relation set and a set of entity categories;
a model training module for training a word2vec word vector model on the network corpus;
a vectorization module for vectorizing all entities and relations in the knowledge graph with the word2vec word vector model;
an entity candidate module for calculating the cosine similarity between the vectorized representation of each entity in the knowledge graph and that of the source entity to obtain a target entity candidate set;
an association target module for training a bidirectional long short-term memory network classifier model on the vectorized representations of the entities in the knowledge graph, and using the bidirectional long short-term memory network classifier model to determine the candidate target entities in the target entity candidate set that have an association relation with the source entity;
a target relation module for determining target relation words and obtaining target relation triples from the source entity, the candidate target entities and the network corpus, each target relation triple comprising the source entity, a target relation word and a candidate target entity;
and a confidence output module for calculating the confidence scores of the target relation triples and then outputting all the target relation triples related to the source entity together with their confidence scores.
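The entity candidate module above can be sketched as a cosine-similarity filter over entity vectors (the threshold and the toy vectors are hypothetical; the claim itself does not fix the selection criterion):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

def candidate_targets(source, entity_vecs, threshold=0.5):
    """Compare every entity's vector against the source entity's vector;
    entities above a (hypothetical) similarity threshold form the
    target entity candidate set."""
    sv = entity_vecs[source]
    return [e for e, v in entity_vecs.items()
            if e != source and cosine(sv, v) >= threshold]

# Toy 2-d vectors standing in for word2vec embeddings.
vecs = {"source": [1.0, 0.0], "near": [0.9, 0.1], "far": [0.0, 1.0]}
candidate_targets("source", vecs)  # -> ["near"]
```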
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the unsupervised knowledge graph inference processing method of any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the unsupervised knowledge graph inference processing method of any one of claims 1 to 7.
CN202110586895.1A 2021-05-27 2021-05-27 Unsupervised knowledge graph inference processing method, unsupervised knowledge graph inference processing device, unsupervised knowledge graph inference processing equipment and unsupervised knowledge graph inference processing medium Active CN113190690B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110586895.1A CN113190690B (en) 2021-05-27 2021-05-27 Unsupervised knowledge graph inference processing method, unsupervised knowledge graph inference processing device, unsupervised knowledge graph inference processing equipment and unsupervised knowledge graph inference processing medium

Publications (2)

Publication Number Publication Date
CN113190690A CN113190690A (en) 2021-07-30
CN113190690B true CN113190690B (en) 2022-10-04

Family

ID=76985490

Country Status (1)

Country Link
CN (1) CN113190690B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113836244B (en) * 2021-09-27 2023-04-07 天弘基金管理有限公司 Sample acquisition method, model training method, relation prediction method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019839A (en) * 2018-01-03 2019-07-16 中国科学院计算技术研究所 Medical knowledge map construction method and system based on neural network and remote supervisory
CN111914550A (en) * 2020-07-16 2020-11-10 华中师范大学 Knowledge graph updating method and system for limited field

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on the improved TransH model in the field of knowledge representation and reasoning; Chang Pan et al.; Journal of Guangxi University (Natural Science Edition); 2020-04-25 (Issue 02); full text *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant