CN113298160B - Triple verification method, apparatus, device and medium - Google Patents


Info

Publication number: CN113298160B
Application number: CN202110594046.0A
Authority: CN (China)
Prior art keywords: triple, target, entity, relation, training
Legal status: Active
Other versions: CN113298160A (Chinese)
Inventor: 曾钢欣
Current and original assignee: Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Application filed by Shenzhen Shuliantianxia Intelligent Technology Co Ltd; published as CN113298160A; granted and published as CN113298160B.


Classifications

    • G06F18/214 Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F16/367 Information retrieval, semantic tools: ontology
    • G06F18/24 Pattern recognition: classification techniques
    • G06N3/044 Neural networks: recurrent networks, e.g. Hopfield networks
    • G06N3/045 Neural networks: combinations of networks
    • G06N3/048 Neural networks: activation functions
    • G06N3/08 Neural networks: learning methods


Abstract

The invention discloses a triple verification method comprising the following steps. First, the triples to be verified are randomly sampled and labeled, yielding labeled first triples and unlabeled second triples. Next, the input vectors and label information of the first triples are obtained and used to train a first binary classification model, which then verifies the second triples. Using a self-training approach, a labeled dataset is determined from the first triples, the second triples, and the verification results; this dataset trains a second binary classification model, and the trained target classification model finally completes the verification of the triples to be verified. With this scheme, a labeled dataset of the sample size required for training can be obtained while manually labeling only a small number of triples, the fitting capacity of the trained target classification model is improved, and good results are achieved with only a small amount of labeled data. A triple verification apparatus, a device, and a storage medium are also provided.

Description

Triple verification method, apparatus, device and medium
Technical Field
The invention relates to the technical field of knowledge graphs, and in particular to a triple verification method, apparatus, device, and medium.
Background
With the development of artificial intelligence, knowledge graphs have become increasingly important as an application foundation of artificial intelligence. Traditional knowledge graph construction consumes substantial manpower and material resources throughout the process, so unsupervised knowledge graph construction has become the mainstream at the present stage. In an unsupervised scheme, however, the extracted triples are not very accurate for lack of manual intervention and must be corrected by hand, and erroneous triples degrade the constructed knowledge graph. How to ensure the accuracy of the triples with little manual intervention is therefore very important.
Disclosure of Invention
Based on this, it is necessary to provide a triple verification method, apparatus, device, and medium that ensure the accuracy of triples with little human intervention.
A triple verification method, the method comprising:
acquiring triples to be verified, and randomly sampling and labeling them to obtain labeled first triples and unlabeled second triples;
embedding the triple information of each first triple into an input vector through a pre-training model, the triple information comprising the head entity, the relation, the tail entity, and the sentence in which the first triple is located; acquiring the label information of the first triples, the label information indicating whether each first triple is trustworthy; and training a first binary classification model according to the input vectors and the label information;
performing a first verification of the second triples with the trained first binary classification model to obtain a first verification result for each second triple, the first verification result being whether the second triple is trustworthy;
determining a labeled dataset according to the first triples, the second triples, and the first verification results, training a second binary classification model according to the labeled dataset, and performing a second verification of the triples to be verified with the trained second binary classification model to obtain a second verification result for each triple to be verified, the second verification result being whether the triple to be verified is trustworthy.
In one embodiment, embedding the triple information of the first triple into an input vector through the pre-training model comprises:
for each first triple, separately encoding the head entity, the relation, and the tail entity through the pre-training model to obtain a first vector for the head entity, a second vector for the relation, and a third vector for the tail entity;
concatenating the first vector, the second vector, and the third vector in that order to obtain a first input vector;
and encoding the sentence through the pre-training model and taking the encoded sentence as a second input vector, the input vector comprising the first input vector and the second input vector.
In one embodiment, the first binary classification model comprises a feedforward neural network and an activation function, and training the first binary classification model according to the input vector and the label information comprises:
the first binary classification model mapping the trustworthiness of the triple to be verified to a value between 0 and 1 according to the input vector, and computing a mapping error from the label information and the mapping result;
and adjusting the model parameters of the first binary classification model according to the mapping error until the mapping result meets a preset verification standard.
In one embodiment, determining the labeled dataset from the first triples, the second triples, and the first verification results comprises:
taking the first triples, together with the second triples whose first verification result is trustworthy, as the labeled dataset.
In one embodiment, acquiring the triples to be verified comprises:
acquiring text data and extracting the triples to be verified from the text data, the extraction being rule-based or based on syntactic analysis.
In one embodiment, the pre-training model is any one of BERT, word2vec, XLNet, and ALBERT.
In one embodiment, after the target classification model is used to verify the triples to be verified and the verification result is obtained, the method further comprises:
acquiring the trustworthy target triples from the labeled dataset according to the second verification result, each target triple comprising a target head entity, a target relation, and a target tail entity;
constructing a plurality of first co-occurrence matrices of the target head entities and the target relations, screening out the first target matrices exceeding a first segmentation threshold, and combining the head entity types and relation types corresponding to the first target matrices to obtain a first combination;
constructing a plurality of second co-occurrence matrices of the target tail entities and the target relations, screening out the second target matrices exceeding a second segmentation threshold, and combining the tail entity types and relation types corresponding to the second target matrices to obtain a second combination;
and cross-combining the first combination and the second combination to obtain a knowledge graph.
A triple verification apparatus, the apparatus comprising:
a labeling module configured to acquire the triples to be verified, and to randomly sample and label them to obtain labeled first triples and unlabeled second triples;
an initial training module configured to embed the triple information of each first triple into an input vector through a pre-training model, the triple information comprising the head entity, the relation, the tail entity, and the sentence in which the first triple is located; to acquire the label information of the first triples, the label information indicating whether each first triple is trustworthy; and to train a first binary classification model according to the input vectors and the label information;
a first verification module configured to perform a first verification of the second triples with the trained first binary classification model to obtain a first verification result for each second triple, the first verification result being whether the second triple is trustworthy;
and a training and verification module configured to determine a labeled dataset according to the first triples, the second triples, and the first verification results, to train a second binary classification model according to the labeled dataset, and to verify the triples to be verified with the trained second binary classification model (the target classification model) to obtain a second verification result for each triple to be verified, the second verification result being whether the triple to be verified is trustworthy.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
acquiring triples to be verified, and randomly sampling and labeling them to obtain labeled first triples and unlabeled second triples;
embedding the triple information of each first triple into an input vector through a pre-training model, the triple information comprising the head entity, the relation, the tail entity, and the sentence in which the first triple is located; acquiring the label information of the first triples, the label information indicating whether each first triple is trustworthy; and training a first binary classification model according to the input vectors and the label information;
performing a first verification of the second triples with the trained first binary classification model to obtain a first verification result for each second triple, the first verification result being whether the second triple is trustworthy;
determining a labeled dataset according to the first triples, the second triples, and the first verification results, training a second binary classification model according to the labeled dataset, and performing a second verification of the triples to be verified with the trained second binary classification model to obtain a second verification result for each triple to be verified, the second verification result being whether the triple to be verified is trustworthy.
A triple verification device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:
acquiring triples to be verified, and randomly sampling and labeling them to obtain labeled first triples and unlabeled second triples;
embedding the triple information of each first triple into an input vector through a pre-training model, the triple information comprising the head entity, the relation, the tail entity, and the sentence in which the first triple is located; acquiring the label information of the first triples, the label information indicating whether each first triple is trustworthy; and training a first binary classification model according to the input vectors and the label information;
performing a first verification of the second triples with the trained first binary classification model to obtain a first verification result for each second triple, the first verification result being whether the second triple is trustworthy;
determining a labeled dataset according to the first triples, the second triples, and the first verification results, training a second binary classification model according to the labeled dataset, and performing a second verification of the triples to be verified with the trained second binary classification model to obtain a second verification result for each triple to be verified, the second verification result being whether the triple to be verified is trustworthy.
The present invention provides a triple verification method, apparatus, device, and medium. The triples to be verified are first randomly sampled and labeled, yielding labeled first triples and unlabeled second triples. The input vectors and label information of the first triples are then obtained to train a first binary classification model, which is used to verify the second triples; the scheme therefore needs only a small amount of data labeling, greatly reducing labeling cost. A self-training approach then determines a labeled dataset from the first triples, the second triples, and the verification results; the labeled dataset trains a second binary classification model, and the target classification model finally completes the verification of the triples to be verified. A labeled dataset of the sample size required for training is thus obtained while labeling only a small number of triples, the fitting capacity of the trained target classification model is improved, and good results are achieved with only a small amount of labeled data.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Wherein:
FIG. 1 is a schematic flow chart of a triple verification method in an embodiment;
FIG. 2 is a schematic diagram of a triple verification apparatus in an embodiment;
FIG. 3 is a block diagram of a triple verification device in an embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention will now be described clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art from the given embodiments without creative effort fall within the protection scope of the present invention.
As shown in FIG. 1, a schematic flow chart of a triple verification method in an embodiment, the method of this embodiment comprises the following steps.
Step 102: acquire the triples to be verified, and randomly sample and label them to obtain labeled first triples and unlabeled second triples.
A triple to be verified is a group consisting of a head entity, a relation, and a tail entity; whether it is trustworthy means whether its information is correct, and a knowledge graph for artificial intelligence can be built from the trustworthy triples. Illustratively, in (Jay Chou, singer, Nunchucks), "Jay Chou" is the head entity, "singer" is the relation, and "Nunchucks" is the tail entity.
Specifically, a passage of text data is acquired first; the text data is a paragraph composed of several sentences, and its length can be supplied by the user as needed. The triples to be verified are then extracted from the text data, either by rules or by syntactic analysis. Illustratively, rule-based extraction proceeds in three steps. First, define the set of relations to extract, such as (father, mother, son, daughter). Second, traverse each sentence of the text data and remove every word that is neither a head entity nor in the relation set. Third, traverse from the second word of each sentence; whenever a word in the relation set is encountered, select the head entity closest to that word. Rule-based extraction needs no training and its rules are simple, so it is the more common choice. Extraction based on syntactic analysis must determine the syntactic structure of the sentence (for example, identifying its subject-verb-object structure) or the dependency relations between its words (for example, attributive, adverbial, or right-adjunct relations); its logic is more complicated to set up, but its results are more accurate than rule-based extraction.
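The three rule-based steps above can be sketched as follows. The relation set, head-entity list, and sentence are illustrative assumptions, and since the text does not spell out how the tail entity is attached, the sketch stops at (head entity, relation) pairs:

```python
# Hypothetical relation set and head-entity list for illustration.
RELATIONS = {"father", "mother", "son", "daughter"}
HEAD_ENTITIES = {"Alice", "Bob"}

def extract_pairs(sentence_tokens):
    """Step 2: keep only head entities and relation words.
    Step 3: pair each relation word with the nearest preceding head entity."""
    kept = [w for w in sentence_tokens if w in RELATIONS or w in HEAD_ENTITIES]
    pairs = []
    for i in range(1, len(kept)):
        if kept[i] in RELATIONS:
            # Scan backwards for the closest head entity.
            for j in range(i - 1, -1, -1):
                if kept[j] in HEAD_ENTITIES:
                    pairs.append((kept[j], kept[i]))
                    break
    return pairs
```

On the sample sentence `["Alice", "is", "the", "mother", "of", "Bob"]`, the backward scan pairs "mother" with the nearest preceding head entity "Alice".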
After the triples to be verified are extracted, part of them are labeled by random sampling. To guarantee randomness and avoid drawing too many similar triples, in a specific application scenario the triples to be verified can be divided into strata by type, for example by the kind of words they contain (concrete nouns: car, room; abstract nouns: life, friendship), and samples are then drawn at random from each stratum in a specified proportion. This finally yields the labeled first triples, for example (Jay Chou, is a, singer) labeled 1 and (Jay Chou, is a, writer) labeled 0, together with the unlabeled second triples.
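A minimal sketch of the stratified random sampling just described, assuming a hypothetical `type_of` function that assigns each triple to a stratum:

```python
import random
from collections import defaultdict

def stratified_sample(triples, type_of, ratio, seed=0):
    """Group triples into strata by type, then randomly draw `ratio` of each
    stratum for manual labeling; the rest stay unlabeled (the second triples)."""
    random.seed(seed)
    strata = defaultdict(list)
    for t in triples:
        strata[type_of(t)].append(t)
    to_label = []
    for group in strata.values():
        k = max(1, int(len(group) * ratio))   # at least one per stratum
        to_label.extend(random.sample(group, k))
    unlabeled = [t for t in triples if t not in to_label]
    return to_label, unlabeled
```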
Step 104: embed the triple information of the first triples into input vectors through a pre-training model, acquire the label information of the first triples, and train the first binary classification model according to the input vectors and the label information.
The pre-training model adopted in this embodiment is any one of BERT, word2vec, XLNet, and ALBERT. The triple information of a first triple is its related attribute information: the head entity, relation, and tail entity, the sentence of the text data in which the triple is located, and information such as the triple's position in that sentence. The label information indicates whether each first triple is trustworthy, i.e., whether the statement the triple composes is correct, and is generated when step 102 is executed.
In one embodiment, the embedded input vector comprises a first input vector and a second input vector, and embedding the triple information proceeds as follows. For each first triple, the head entity, the relation, and the tail entity are separately encoded through the pre-training model to obtain a first vector for the head entity, a second vector for the relation, and a third vector for the tail entity; one available encoding is the existing doc2vec. The three vectors are then concatenated in order to obtain the first input vector. For example, with the first vector A = [1,2,3], the second vector B = [4,5,6], and the third vector C = [7,8,9], concatenation gives the first input vector A⊕B⊕C = [1,2,3,4,5,6,7,8,9]. Each sentence is likewise encoded by doc2vec, and the encoded sentence is taken as the second input vector. The triple's position can be encoded by index: the index is a number, so for a sentence of 128 words a 128 × 10 matrix is initialized, and based on the triple's position n in the sentence, row n of the matrix is looked up as the position encoding.
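The concatenation and position-lookup steps can be illustrated directly; the vectors here are the toy values from the text rather than real pre-trained encodings:

```python
import numpy as np

head = np.array([1, 2, 3])   # first vector  (head entity)
rel  = np.array([4, 5, 6])   # second vector (relation)
tail = np.array([7, 8, 9])   # third vector  (tail entity)

# First input vector: the three vectors joined in order.
first_input = np.concatenate([head, rel, tail])

# Position encoding by index: a 128 x 10 table, with row n looked up
# from the triple's position n in the sentence.
pos_table = np.random.default_rng(0).normal(size=(128, 10))
n = 5                               # triple's position in the sentence
position_encoding = pos_table[n]
```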
The first binary classification model in this embodiment comprises a feedforward neural network (FFN) and a Sigmoid activation function. Each neuron in the feedforward network consists of a linear fit and a nonlinear activation function; the neurons are arranged in layers, adjacent layers are fully connected, and each neuron connects only to neurons of the previous layer, receiving that layer's output and passing its own output to the next layer. The Sigmoid activation introduces a nonlinear factor into the neurons so that the network can approximate arbitrary nonlinear functions, mapping a variable to a value between 0 and 1.
During training, the first binary classification model takes the input vector as its input and maps the trustworthiness of the triple to be verified to a value between 0 and 1. Specifically, the feedforward network uses the formula ffn_output = Relu(Wx · input + b), where Wx and b are trainable parameters and Relu is an activation function. In the original space, the input features do not distinguish well between the trustworthy and untrustworthy classes; the nonlinear activation transforms the input into another implicit space that separates the two classes better. The feedforward network is then followed by a fully connected layer with the formula Sigmoid(Wy · ffn_output + b2), where Wy and b2 are also trainable. Sigmoid maps the features to a value between 0 and 1, i.e., the trustworthiness of the triple to be verified. A mapping result of at least 0.5 is considered trustworthy and one below 0.5 untrustworthy, so the output separates into two classes. The mapping error is then computed from the mapping results and the label information, namely the ratio of wrong mapping results to all mapping results. Finally, the loss is back-propagated to update the parameters, such as the weights and biases of the model; after multiple iterations the mapping error falls below a preset error threshold, yielding accurate parameters Wx and Wy.
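A forward pass of the FFN-plus-Sigmoid classifier can be sketched with NumPy; the layer sizes and random weights are placeholder assumptions standing in for the trained parameters Wx, b, Wy, b2:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d_in, d_hid = 9, 16                       # toy dimensions
Wx, b = rng.normal(size=(d_hid, d_in)), np.zeros(d_hid)
Wy, b2 = rng.normal(size=(1, d_hid)), np.zeros(1)

def score(x):
    ffn_output = relu(Wx @ x + b)         # ffn_output = Relu(Wx * input + b)
    return sigmoid(Wy @ ffn_output + b2)  # Sigmoid(Wy * ffn_output + b2)

p = score(rng.normal(size=d_in)).item()   # trustworthiness in (0, 1)
trusted = p >= 0.5                        # threshold from the text
```

In training, the mapping error computed against the labels would drive gradient-descent updates of Wx, b, Wy, and b2; only the inference pass is shown here.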
In this way, the feedforward network maps samples that are inseparable in the original space into another feature space where the two classes separate better; Sigmoid maps the features to a value between 0 and 1, taken as the probability of a class, with maximum likelihood estimation assigning each sample to the class of highest probability. Because Sigmoid is a continuous function it can be differentiated normally, and the parameters are updated by gradient descent. The main purpose of Sigmoid is thus to map feature values to probabilities and to enable parameter updates.
Step 106: perform the first verification of the second triples with the trained first binary classification model to obtain the first verification result for each second triple.
The triple information of each second triple is embedded into an input vector through the pre-training model, and the embedded input vector is fed into the first binary classification model for verification. After the input vectors pass through the feedforward network and activation function, a first verification result of whether each second triple is trustworthy is obtained.
Step 108: determine the labeled dataset according to the first triples, the second triples, and the first verification results; train the second binary classification model according to the labeled dataset; and perform the second verification of the triples to be verified with the second binary classification model to obtain the second verification result.
Specifically, the first triples, together with the second triples whose first verification result is trustworthy, form the labeled dataset. A labeled dataset of the sample size required for training is thus obtained while labeling only a few triples, improving the fitting capacity of the trained second classification model.
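The self-training selection just described amounts to a simple filter; the list shapes here are illustrative assumptions:

```python
def build_labeled_dataset(first, first_labels, second, second_preds):
    """Self-training selection: keep all manually labeled first triples, plus
    only those second triples the first model judged trustworthy (pred == 1)."""
    data = list(zip(first, first_labels))
    data += [(t, 1) for t, p in zip(second, second_preds) if p == 1]
    return data
```

For example, with one labeled first triple and two machine-verified second triples of which only the first is trustworthy, the dataset keeps two entries.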
The model parameters of the second binary classification model are initialized afresh; it is identical to the first binary classification model only in structure. Training the second classification model with the labeled dataset proceeds essentially as in step 104, only the samples differ, so the description is omitted. The trained second classification model then performs the second verification on all triples to be verified; this is essentially consistent with step 106, only the samples differ, so the description is likewise omitted.
With the above triple verification method, the triples to be verified are first randomly sampled and labeled to obtain labeled first triples and unlabeled second triples. The input vectors and label information of the first triples are then obtained to train a first binary classification model, which is used to verify the second triples, so the scheme needs only a small amount of data labeling and greatly reduces labeling cost. A self-training approach then determines a labeled dataset from the first triples, the second triples, and the verification results; the labeled dataset trains a second binary classification model, and the target classification model finally completes the verification of the triples to be verified. A labeled dataset of the sample size required for training is thus obtained while labeling only a small number of triples, the fitting capacity of the trained target classification model is improved, and good results are achieved with only a small amount of labeled data.
Further, after the verification of the triples is completed, the downstream task of building the knowledge graph can proceed from the second verification result. Specifically, the trustworthy triples are first acquired from the labeled dataset according to the second verification result as the target triples, and the head entity, relation, and tail entity of each target triple serve as the target head entity, target relation, and target tail entity.
Then a plurality of first co-occurrence matrices of the target head entities and the target relations are constructed; a first co-occurrence matrix is a matrix whose rows and columns are indexed by the target head entities and target relations. Construction begins with several two-dimensional matrices of target head entities and target relations, each comprising several head-entity columns and several relation rows: the column of data in which any target head entity sits is called a head-entity column, and the row of data in which any target relation sits is called a relation row. A type of target head entity is chosen as the target head-entity type and a type of target relation as the target relation type; starting from an all-zero two-dimensional matrix, 1 is added to every head-entity column whose entity matches the target head-entity type and to every relation row whose relation matches the target relation type, giving a first co-occurrence matrix. Repeatedly choosing target head-entity types and/or target relation types yields the plurality of first co-occurrence matrices. The first target matrices exceeding a first segmentation threshold are then screened out, for example those whose entries sum to more than K, and the target head entities and target relations corresponding to each first target matrix are combined to obtain the first combination.
For example, suppose the target head entities include Jay Chou, JJ Lin, Jiang Wen and the like, the target relations include singer, composer, brother and the like, a column of data in the two-dimensional matrix serves as a head entity column, and a row of data serves as a relation series. If "person" is determined to be the target head entity type, 1 is added to the columns in which Jay Chou, JJ Lin, Jiang Wen and the like lie; if "occupation" is determined to be the target relation type, 1 is added to the rows in which singer, composer and the like lie, finally yielding a first co-occurrence matrix. If the first co-occurrence matrix of this example is determined to be a first target matrix based on K, combination yields first combinations including "Jay Chou-singer", "Jay Chou-composer", "JJ Lin-singer", "JJ Lin-composer" and the like.
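The construction of a first co-occurrence matrix and the threshold screening described above can be sketched as follows. The toy data, the type labels ("person", "occupation") and the helper names are assumptions for illustration, not part of the patent:

```python
import numpy as np

# Toy data; entity/relation types are assumptions for illustration.
head_entities = ["Jay Chou", "JJ Lin", "Jiang Wen"]
relations = ["singer", "composer", "brother"]
head_type = {"Jay Chou": "person", "JJ Lin": "person", "Jiang Wen": "person"}
rel_type = {"singer": "occupation", "composer": "occupation", "brother": "kinship"}

def first_cooccurrence(target_head_type, target_rel_type):
    """Start from an all-zero matrix (rows = relations, columns = head entities),
    add 1 to each head entity column matching the chosen head entity type and
    1 to each relation row matching the chosen relation type."""
    m = np.zeros((len(relations), len(head_entities)), dtype=int)
    for j, h in enumerate(head_entities):
        if head_type[h] == target_head_type:
            m[:, j] += 1
    for i, r in enumerate(relations):
        if rel_type[r] == target_rel_type:
            m[i, :] += 1
    return m

def first_combinations(m, K):
    """If the matrix total exceeds the segmentation threshold K, combine every
    head entity and relation whose column and row were both incremented."""
    if m.sum() <= K:
        return []
    return [f"{h}-{r}" for i, r in enumerate(relations)
            for j, h in enumerate(head_entities) if m[i, j] == 2]

m = first_cooccurrence("person", "occupation")
pairs = first_combinations(m, K=5)
```

Cells that received both increments (value 2) mark head entity / relation pairs of the selected types, which become the first combinations.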
Similarly, a plurality of second co-occurrence matrices of the target tail entities and target relations are constructed, where a second co-occurrence matrix is a symmetric matrix with the target tail entities and target relations as its rows and columns. Constructing a second co-occurrence matrix involves constructing a plurality of two-dimensional matrices of target tail entities and target relations, each comprising a plurality of tail entity columns and a plurality of relation series, where the row or column of data in which any tail entity lies is called a tail entity column. The type of a chosen target tail entity is determined as the target tail entity type, and the type of a chosen target relation as the target relation type; in an all-zero two-dimensional matrix, 1 is added to every tail entity column matching the target tail entity type and 1 to every relation series matching the target relation type, yielding a second co-occurrence matrix. Repeatedly reselecting the target tail entity type and/or the target relation type yields a plurality of second co-occurrence matrices. Second target matrices greater than a second segmentation threshold are then screened out from the plurality of second co-occurrence matrices; for example, any matrix whose values sum to more than K is taken as a second target matrix. The target tail entities and target relations corresponding to each second target matrix are combined to obtain second combinations.
For example, suppose the target tail entities include "Rice Fragrance", "Jiangnan", Jiang Wu and the like, the target relations include singer, composer, brother and the like, a column of data in the two-dimensional matrix serves as a tail entity column, and a row of data serves as a relation series. If "work" is determined to be the target tail entity type, 1 is added to the columns in which "Rice Fragrance", "Jiangnan" and the like lie; if "occupation" is determined to be the target relation type, 1 is added to the rows in which singer, composer and the like lie, finally yielding a second co-occurrence matrix. If the second co-occurrence matrix of this example is determined to be a second target matrix based on K, combination yields second combinations including "singer-Rice Fragrance", "singer-Jiangnan", "composer-Rice Fragrance", "composer-Jiangnan" and the like. Finally, first combinations and second combinations having the same target relation are cross-combined to obtain the knowledge graph.
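The final cross-combination step can be sketched as follows, joining first (head-relation) and second (relation-tail) combinations on the shared target relation; the data is assumed toy data:

```python
# Hypothetical first and second combinations that share some target relations.
first_combos = [("Jay Chou", "singer"), ("Jay Chou", "composer"), ("JJ Lin", "singer")]
second_combos = [("singer", "Rice Fragrance"), ("composer", "Rice Fragrance"),
                 ("singer", "Jiangnan")]

def cross_combine(first_pairs, second_pairs):
    """Join head-relation and relation-tail pairs on the shared target relation
    to produce (head, relation, tail) triples for the knowledge graph."""
    return [(h, r, t)
            for h, r in first_pairs
            for r2, t in second_pairs
            if r == r2]

triples = cross_combine(first_combos, second_combos)
```

Each resulting triple links a head entity to a tail entity only through a relation that both combinations agree on.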
In one embodiment, as shown in fig. 2, a triple verifying apparatus is provided, the apparatus including:
the labeling module 202 is configured to obtain triples to be verified, and randomly sample and label them to obtain labeled first triples and unlabeled second triples;

the initial training module 204 is configured to embed the triple information of each first triple into an input vector through a pre-training model, where the triple information includes the head entity, relation, tail entity and the sentence in which the first triple is located; obtain the labeling information of the first triples, where the labeling information indicates whether each first triple is credible; and train a first binary classification model according to the input vectors and the labeling information;

the first verification module 206 is configured to perform a first verification on the second triples by using the trained first binary classification model to obtain first verification results corresponding to the second triples;

the training and verification module 208 is configured to determine an annotated dataset according to the first triples, the second triples and the first verification results, train a second binary classification model according to the annotated dataset, and verify the triples to be verified with the trained second binary classification model to obtain second verification results of the triples to be verified.
The triple verification apparatus first performs random sampling and labeling on the triples to be verified to obtain labeled first triples and unlabeled second triples. It then acquires the input vectors and labeling information of the first triples to train a first binary classification model, and verifies the second triples with that model, so the scheme requires only a small amount of data labeling and greatly reduces labeling cost. The scheme then adopts a self-training method: it determines an annotated dataset from the first triples, the second triples and the verification results, trains a second binary classification model on the annotated dataset, and finally completes verification of the triples to be verified with the trained model. In this way, an annotated dataset of the sample size required for training is obtained while labeling only a small number of triples, the fitting capacity of the trained model is improved, and a good result is achieved with only a small amount of labeled data.
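The self-training flow described above can be sketched as follows; the helper names (annotate, train, predict) and the control flow are assumptions for illustration, not the patent's API:

```python
import random

def self_train(triples, label_fraction, annotate, train, predict):
    """Sketch of the self-training flow: label a small random sample, train the
    first binary classifier on it, pseudo-label the remaining triples with that
    classifier, keep the ones predicted credible, and retrain a second classifier
    on the enlarged annotated dataset."""
    triples = list(triples)
    random.Random(0).shuffle(triples)                 # fixed seed: reproducible sketch
    k = max(1, int(len(triples) * label_fraction))
    first, second = triples[:k], triples[k:]          # labeled / unlabeled split
    labels = [annotate(t) for t in first]             # the only manual labeling cost
    model_1 = train(first, labels)                    # first binary classification model
    trusted = [t for t in second if predict(model_1, t)]  # first verification
    annotated = first + trusted                       # annotated dataset
    model_2 = train(annotated, labels + [True] * len(trusted))
    return model_2, annotated
```

In practice annotate would be a human labeler and train/predict would wrap the binary classifiers; here they are placeholders so the control flow itself can be read and exercised.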
In one embodiment, the initial training module 204 is specifically configured to: for each first triple, encode the head entity, relation and tail entity separately through the pre-training model to obtain a first vector corresponding to the head entity, a second vector corresponding to the relation and a third vector corresponding to the tail entity; connect the first vector, the second vector and the third vector in sequence to obtain a first input vector; and encode the sentence through the pre-training model and take the encoded sentence as a second input vector, the input vector comprising the first input vector and the second input vector.
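A minimal sketch of this embedding step, with a deterministic toy encode() standing in for a real pre-training model such as BERT (an assumption made so the example runs without model weights):

```python
import numpy as np

def encode(text, dim=8):
    """Stand-in for a pre-trained encoder: a deterministic toy embedding seeded
    from the text's bytes, so repeated calls give the same vector."""
    seed = sum(text.encode("utf-8")) % (2**32)
    rng = np.random.default_rng(seed)
    return rng.standard_normal(dim)

def build_input_vectors(head, relation, tail, sentence):
    """Encode head, relation and tail separately, concatenate them in that order
    as the first input vector, and encode the sentence as the second."""
    v1, v2, v3 = encode(head), encode(relation), encode(tail)
    first_input = np.concatenate([v1, v2, v3])   # head, then relation, then tail
    second_input = encode(sentence)
    return first_input, second_input

fi, si = build_input_vectors("Jay Chou", "singer", "Rice Fragrance",
                             "Jay Chou sang Rice Fragrance.")
```

The first input vector is three times the encoder dimension because the three component vectors are connected in sequence rather than pooled.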
In one embodiment, the initial training module 204 is specifically configured to: have the first binary classification model map the credibility of the triple to be verified to a value between 0 and 1 according to the input vector, and calculate a mapping error according to the labeling information and the mapping result; and adjust the model parameters of the first binary classification model according to the mapping error until the mapping result meets a preset verification standard.
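This mapping-and-adjustment loop can be sketched with a single linear layer plus a sigmoid trained by gradient descent; the toy data, learning rate and epoch count are assumptions, and a real implementation would use the pre-training model's vectors as input:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_first_classifier(X, y, lr=0.5, epochs=500):
    """Minimal sketch of the first binary classification model: a linear layer
    plus a sigmoid maps each input vector to a credibility in (0, 1); the mapping
    error (binary cross-entropy against the labels) drives gradient updates of
    the model parameters."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)          # credibility between 0 and 1
        grad = p - y                    # gradient of the cross-entropy mapping error
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

# Toy separable data: first two vectors labeled credible, last two not.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w, b = train_first_classifier(X, y)
credibility = sigmoid(X @ w + b)
```

Training stops here after a fixed number of epochs; the patent's "preset verification standard" would instead be an accuracy or error threshold checked each iteration.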
In one embodiment, the training and verification module 208 is specifically configured to: take the first triples, together with the second triples whose first verification results are credible, as the annotated dataset.
In one embodiment, the labeling module 202 is specifically configured to: acquire text data and extract the triples to be verified from the text data, the extraction being rule-based or based on syntactic analysis.
In one embodiment, the triple verification apparatus further includes a knowledge graph construction module, configured to: acquire credible target triples from the annotated dataset according to the second verification result, each target triple comprising a target head entity, a target relation and a target tail entity; construct a plurality of first co-occurrence matrices of the target head entities and target relations, screen out first target matrices greater than a first segmentation threshold from the plurality of first co-occurrence matrices, and combine the head entity types and relation types corresponding to the first target matrices to obtain first combinations; construct a plurality of second co-occurrence matrices of the target tail entities and target relations, screen out second target matrices greater than a second segmentation threshold from the plurality of second co-occurrence matrices, and combine the tail entity types and relation types corresponding to the second target matrices to obtain second combinations; and cross-combine the first combinations and second combinations to obtain the knowledge graph.
Fig. 3 shows an internal configuration diagram of a triple verification device in one embodiment. As shown in fig. 3, the triple verification device includes a processor, a memory and a network interface connected by a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the triple verification device stores an operating system and may further store a computer program which, when executed by the processor, enables the processor to implement the triple verification method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the triple verification method. It will be understood by those skilled in the art that the structure shown in fig. 3 is only a block diagram of part of the structure related to the present application and does not constitute a limitation on the triple verification device to which the present application is applied; a specific triple verification device may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.
A triple verification device comprises a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer program: acquiring triples to be verified, and randomly sampling and labeling them to obtain labeled first triples and unlabeled second triples; embedding the triple information of the first triples into input vectors through a pre-training model, where the triple information includes the head entity, relation, tail entity and the sentence in which the first triple is located; acquiring the labeling information of the first triples, where the labeling information indicates whether each first triple is credible, and training a first binary classification model according to the input vectors and the labeling information; performing a first verification on the second triples by using the trained first binary classification model to obtain first verification results corresponding to the second triples; and determining an annotated dataset according to the first triples, the second triples and the first verification results, training a second binary classification model according to the annotated dataset, and performing a second verification on the triples to be verified by using the trained second binary classification model to obtain second verification results of the triples to be verified.
In one embodiment, embedding the triple information of a first triple into an input vector through the pre-training model includes: encoding the head entity, relation and tail entity of each first triple separately through the pre-training model to obtain a first vector corresponding to the head entity, a second vector corresponding to the relation and a third vector corresponding to the tail entity; connecting the first vector, the second vector and the third vector in sequence to obtain a first input vector; and encoding the sentence through the pre-training model and taking the encoded sentence as a second input vector, the input vector comprising the first input vector and the second input vector.
In one embodiment, the first binary classification model includes a feedforward neural network and an activation function, and training the first binary classification model according to the input vector and the labeling information includes: mapping, by the first binary classification model, the credibility of the triple to be verified to a value between 0 and 1 according to the input vector, and calculating a mapping error according to the labeling information and the mapping result; and adjusting the model parameters of the first binary classification model according to the mapping error until the mapping result meets a preset verification standard.
In one embodiment, determining the annotated dataset from the first triple, the second triple, and the first verification result includes: and taking the first triple and the second triple with the credible first verification result as an annotation data set.
In one embodiment, obtaining a triplet to be checked includes: and acquiring text data, extracting the triples to be checked from the text data, wherein the extraction is based on rules or syntactic analysis.
In one embodiment, after the second verification is performed on the triple to be verified by using the trained second binary classification model to obtain the second verification result of the triple to be verified, the method further includes: acquiring credible target triples from the annotated dataset according to the second verification result, each target triple comprising a target head entity, a target relation and a target tail entity; constructing a plurality of first co-occurrence matrices of the target head entities and target relations, screening out first target matrices greater than a first segmentation threshold from the plurality of first co-occurrence matrices, and combining the head entity types and relation types corresponding to the first target matrices to obtain first combinations; constructing a plurality of second co-occurrence matrices of the target tail entities and target relations, screening out second target matrices greater than a second segmentation threshold from the plurality of second co-occurrence matrices, and combining the tail entity types and relation types corresponding to the second target matrices to obtain second combinations; and cross-combining the first combinations and second combinations to obtain the knowledge graph.
A computer-readable storage medium stores a computer program which, when executed by a processor, performs the following steps: acquiring triples to be verified, and randomly sampling and labeling them to obtain labeled first triples and unlabeled second triples; embedding the triple information of the first triples into input vectors through a pre-training model, where the triple information includes the head entity, relation, tail entity and the sentence in which the first triple is located; acquiring the labeling information of the first triples, where the labeling information indicates whether each first triple is credible, and training a first binary classification model according to the input vectors and the labeling information; performing a first verification on the second triples by using the trained first binary classification model to obtain first verification results corresponding to the second triples; and determining an annotated dataset according to the first triples, the second triples and the first verification results, training a second binary classification model according to the annotated dataset, and performing a second verification on the triples to be verified by using the trained second binary classification model to obtain second verification results of the triples to be verified.
In one embodiment, embedding the triple information of a first triple into an input vector through the pre-training model includes: encoding the head entity, relation and tail entity of each first triple separately through the pre-training model to obtain a first vector corresponding to the head entity, a second vector corresponding to the relation and a third vector corresponding to the tail entity; connecting the first vector, the second vector and the third vector in that order to obtain a first input vector; and encoding the sentence through the pre-training model and taking the encoded sentence as a second input vector, the input vector comprising the first input vector and the second input vector.
In one embodiment, the first binary classification model includes a feedforward neural network and an activation function, and training the first binary classification model according to the input vector and the labeling information includes: mapping, by the first binary classification model, the credibility of the triple to be verified to a value between 0 and 1 according to the input vector, and calculating a mapping error according to the labeling information and the mapping result; and adjusting the model parameters of the first binary classification model according to the mapping error until the mapping result meets a preset verification standard.
In one embodiment, determining the annotated dataset from the first triple, the second triple, and the first verification result includes: and taking the first triple and the second triple with the credible first verification result as an annotation data set.
In one embodiment, acquiring the triple to be verified includes: acquiring text data and extracting the triple to be verified from the text data, the extraction being rule-based or based on syntactic analysis.
In one embodiment, after the second verification is performed on the triple to be verified by using the trained second binary classification model to obtain the second verification result of the triple to be verified, the method further includes: acquiring credible target triples from the annotated dataset according to the second verification result, each target triple comprising a target head entity, a target relation and a target tail entity; constructing a plurality of first co-occurrence matrices of the target head entities and target relations, screening out first target matrices greater than a first segmentation threshold from the plurality of first co-occurrence matrices, and combining the head entity types and relation types corresponding to the first target matrices to obtain first combinations; constructing a plurality of second co-occurrence matrices of the target tail entities and target relations, screening out second target matrices greater than a second segmentation threshold from the plurality of second co-occurrence matrices, and combining the tail entity types and relation types corresponding to the second target matrices to obtain second combinations; and cross-combining the first combinations and second combinations to obtain the knowledge graph.
It should be noted that the triple verification method, apparatus, device and computer-readable storage medium described above belong to a single general inventive concept, and the content of their respective embodiments is mutually applicable.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, as long as a combination contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they are not to be construed as limiting the scope of the application. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these fall within its scope of protection. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (8)

1. A method for verifying a triplet, the method comprising:
acquiring a triple to be verified, and randomly sampling and labeling the triple to be verified to obtain a labeled first triple and an unlabeled second triple;

embedding triple information of the first triple into an input vector through a pre-training model, wherein the triple information comprises a head entity, a relation, a tail entity and a sentence in which the first triple is located; acquiring labeling information of the first triple, wherein the labeling information comprises whether the first triple is credible; and training a first binary classification model according to the input vector and the labeling information;

performing a first verification on the second triple by using the trained first binary classification model to obtain a first verification result corresponding to the second triple, wherein the first verification result is whether the second triple is credible;

determining an annotated dataset according to the first triple, the second triple and the first verification result, training a second binary classification model according to the annotated dataset, and performing a second verification on the triple to be verified by using the trained second binary classification model to obtain a second verification result of the triple to be verified, wherein the second verification result is whether the triple to be verified is credible;

wherein the acquiring of the triple to be verified comprises: acquiring text data, and extracting the triple to be verified from the text data, the extraction being rule-based or based on syntactic analysis, and the text data being a text paragraph composed of a plurality of sentences;

wherein, after the second verification is performed on the triple to be verified by using the trained second binary classification model to obtain the second verification result of the triple to be verified, the method further comprises: acquiring credible target triples from the annotated dataset according to the second verification result, each target triple comprising a target head entity, a target relation and a target tail entity; constructing a plurality of first co-occurrence matrices of the target head entity and the target relation, the first co-occurrence matrices being symmetric matrices with the target head entity and the target relation as rows and columns, screening out first target matrices greater than a first segmentation threshold from the plurality of first co-occurrence matrices, and combining the head entity types and relation types corresponding to the first target matrices to obtain a first combination; constructing a plurality of second co-occurrence matrices of the target tail entity and the target relation, the second co-occurrence matrices being symmetric matrices with the target tail entity and the target relation as rows and columns, screening out second target matrices greater than a second segmentation threshold from the plurality of second co-occurrence matrices, and combining the tail entity types and relation types corresponding to the second target matrices to obtain a second combination; and cross-combining the first combination and the second combination to obtain a knowledge graph.
2. The verification method of claim 1, wherein embedding the triple information of the first triple into an input vector through the pre-training model comprises:
respectively encoding the head entity, the relation and the tail entity through the pre-training model to obtain a first vector corresponding to the head entity, a second vector corresponding to the relation and a third vector corresponding to the tail entity;
connecting the first vector, the second vector and the third vector in sequence to obtain a first input vector;
and coding the sentence through the pre-training model, and taking the coded sentence as a second input vector, wherein the input vector comprises the first input vector and the second input vector.
3. The verification method of claim 1, wherein training the first binary classification model according to the input vector and the labeling information comprises:

mapping, by the first binary classification model, the credibility of the triple to be verified to a value between 0 and 1 according to the input vector, and calculating a mapping error according to the labeling information and the mapping result;

and adjusting the model parameters of the first binary classification model according to the mapping error until the mapping result meets a preset verification standard.
4. The verification method of claim 1, wherein determining an annotated dataset from the first triple, the second triple, and the first verification result comprises:
taking the first triple, together with the second triple whose first verification result is credible, as the annotated dataset.
5. The verification method according to claim 1, wherein the pre-training model is any one of BERT, word2vec, XLNet and ALBERT.
6. A triple verification apparatus, comprising:
a labeling module, configured to acquire triples to be verified, and randomly sample and label the triples to be verified to obtain labeled first triples and unlabeled second triples;

an initial training module, configured to embed the triple information of the first triple into an input vector through a pre-training model, the triple information comprising a head entity, a relation, a tail entity and the sentence in which the first triple is located; acquire labeling information of the first triple, the labeling information comprising whether each first triple is credible; and train a first binary classification model according to the input vector and the labeling information;

a first verification module, configured to perform a first verification on the second triple by using the trained first binary classification model to obtain a first verification result corresponding to the second triple, the first verification result being whether the second triple is credible;

a training and verification module, configured to determine an annotated dataset according to the first triple, the second triple and the first verification result, train a second binary classification model according to the annotated dataset, and verify the triple to be verified by using the trained second binary classification model to obtain a second verification result of the triple to be verified, the second verification result being whether the triple to be verified is credible;

wherein the labeling module is specifically configured to: acquire text data and extract the triples to be verified from the text data, the extraction being rule-based or based on syntactic analysis, and the text data being a text paragraph composed of a plurality of sentences;

the triple verification apparatus further comprising a knowledge graph construction module, configured to: acquire credible target triples from the annotated dataset according to the second verification result, each target triple comprising a target head entity, a target relation and a target tail entity; construct a plurality of first co-occurrence matrices of the target head entity and the target relation, the first co-occurrence matrices being symmetric matrices with the target head entity and the target relation as rows and columns, screen out first target matrices greater than a first segmentation threshold from the plurality of first co-occurrence matrices, and combine the head entity types and relation types corresponding to the first target matrices to obtain a first combination; construct a plurality of second co-occurrence matrices of the target tail entity and the target relation, the second co-occurrence matrices being symmetric matrices with the target tail entity and the target relation as rows and columns, screen out second target matrices greater than a second segmentation threshold from the plurality of second co-occurrence matrices, and combine the tail entity types and relation types corresponding to the second target matrices to obtain a second combination; and cross-combine the first combination and the second combination to obtain a knowledge graph.
7. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 5.
8. A triple verification device comprising a memory and a processor, characterized in that the memory stores a computer program which, when executed by the processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 5.
CN202110594046.0A 2021-05-28 2021-05-28 Triple verification method, apparatus, device and medium Active CN113298160B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110594046.0A CN113298160B (en) 2021-05-28 2021-05-28 Triple verification method, apparatus, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110594046.0A CN113298160B (en) 2021-05-28 2021-05-28 Triple verification method, apparatus, device and medium

Publications (2)

Publication Number Publication Date
CN113298160A CN113298160A (en) 2021-08-24
CN113298160B true CN113298160B (en) 2023-03-07

Family

ID=77326020

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110594046.0A Active CN113298160B (en) 2021-05-28 2021-05-28 Triple verification method, apparatus, device and medium

Country Status (1)

Country Link
CN (1) CN113298160B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115982352B (en) * 2022-12-12 2024-04-02 北京百度网讯科技有限公司 Text classification method, device and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108021595A (en) * 2016-10-28 2018-05-11 北大方正集团有限公司 Method and device for examining knowledge base triples
CN109378053A (en) * 2018-11-30 2019-02-22 安徽影联云享医疗科技有限公司 Knowledge graph construction method for medical images
CN109816027A (en) * 2019-01-29 2019-05-28 北京三快在线科技有限公司 Training method and device for unmanned driving decision model, and unmanned device
CN110263697A (en) * 2019-06-17 2019-09-20 哈尔滨工业大学(深圳) Unsupervised-learning-based pedestrian re-identification method, device and medium
CN112818138A (en) * 2021-04-19 2021-05-18 中译语通科技股份有限公司 Knowledge graph ontology construction method and device, terminal device and readable storage medium

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN109165385B (en) * 2018-08-29 2022-08-09 中国人民解放军国防科技大学 Multi-triple extraction method based on entity relationship joint extraction model
CN112015859B (en) * 2019-05-31 2023-08-18 百度在线网络技术(北京)有限公司 Knowledge hierarchy extraction method and device for text, computer equipment and readable medium
CN112613306A (en) * 2020-12-31 2021-04-06 恒安嘉新(北京)科技股份公司 Method, device, electronic equipment and storage medium for extracting entity relationship

Similar Documents

Publication Publication Date Title
CN111767707B (en) Method, device, equipment and storage medium for detecting duplicate cases
CN110059320B (en) Entity relationship extraction method and device, computer equipment and storage medium
CN110032739B (en) Method and system for extracting named entities of Chinese electronic medical record
CN112016318B (en) Triage information recommendation method, device, equipment and medium based on interpretation model
CN111738004A (en) Training method of named entity recognition model and named entity recognition method
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN111666775B (en) Text processing method, device, equipment and storage medium
CN112507039A (en) Text understanding method based on external knowledge embedding
CN110472049B (en) Disease screening text classification method, computer device and readable storage medium
CN111191457A (en) Natural language semantic recognition method and device, computer equipment and storage medium
CN110321426B (en) Digest extraction method and device and computer equipment
CN111710383A (en) Medical record quality control method and device, computer equipment and storage medium
Gong et al. Continual pre-training of language models for math problem understanding with syntax-aware memory network
CN113298160B (en) Triple verification method, apparatus, device and medium
CN110808095B (en) Diagnostic result recognition method, model training method, computer equipment and storage medium
CN116245107A (en) Electric power audit text entity identification method, device, equipment and storage medium
CN115526234A (en) Cross-domain model training and log anomaly detection method and device based on transfer learning
CN111191439A (en) Natural sentence generation method and device, computer equipment and storage medium
CN112036151B (en) Gene disease relation knowledge base construction method, device and computer equipment
CN115238645A (en) Asset data identification method and device, electronic equipment and computer storage medium
CN113283461A (en) Financial big data processing system and method based on block chain
CN115422357A (en) Text classification method and device, computer equipment and storage medium
CN114638229A (en) Entity identification method, device, medium and equipment for record data
CN113821571A (en) Food safety relation extraction method based on BERT and improved PCNN
CN111562943A (en) Code clone detection method and device based on event embedded tree and GAT network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant