CN113515630A

CN113515630A - Triple generating and checking method and device, electronic equipment and storage medium

Info

Publication number: CN113515630A
Application number: CN202110650253.3A
Authority: CN
Inventors: 曾钢欣
Original assignee: Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Current assignee: Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority date: 2021-06-10
Filing date: 2021-06-10
Publication date: 2021-10-19
Anticipated expiration: 2041-06-10
Also published as: CN113515630B

Abstract

The application discloses a triple generating and checking method and device, electronic equipment and a storage medium. The method comprises the following steps: performing part-of-speech analysis and syntactic analysis processing on the target text to obtain an analysis result, wherein the analysis result comprises the part-of-speech and syntactic labels of each word in the target text, and the syntactic label corresponding to each word comprises the dependency relationship between each word and the head entity word thereof; matching a triple corresponding to the target text according to a preset matching rule and an analysis result, wherein the triple comprises a head entity word, a tail entity word and a relation word for marking a syntactic relation between the head entity word and the tail entity word; acquiring input data, including vectors of original sentences corresponding to the triples, vectors of the triples, position vectors corresponding to the triples and part-of-speech vectors; and inputting the input data into a preset classification model, and processing the input data through the preset classification model to verify the triples.

Description

Triple generating and checking method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of natural language processing technologies, and in particular, to a triplet generating and checking method, apparatus, electronic device, and storage medium.

Background

With the deployment of various intelligent products, the knowledge graph plays a key role as the intelligent brain that empowers these products. However, the cost of constructing a knowledge graph is very high. Usually, experts in the corresponding field are required to define the data schema of the knowledge graph in advance, massive labeled data are required for knowledge extraction and knowledge fusion, and a storage database capable of fast response and massive storage is required. Therefore, only large companies are able to build large knowledge graphs; for many small and medium-sized companies, building a knowledge graph is very difficult, because the required data volume is large, a large amount of manual participation is required, and the cost is high.

Entity extraction is one of the classical tasks of natural language processing (NLP). Its aim is to extract entities from structured, semi-structured or unstructured data. For example, if the defined entity type is "country" and the unstructured text is "China has five thousand years of cultural history", the extracted entity is "China". Some existing methods obtain triples by segmenting a collected corpus into words and using the segmented words as the candidate set of entities, but the accuracy of triples generated in this way is not high enough.

Disclosure of Invention

The application provides a triple generating and checking method and device, electronic equipment and a storage medium.

In a first aspect, a triple generation and verification method is provided, including:

acquiring a target text;

performing part-of-speech analysis and syntactic analysis processing on the target text to obtain an analysis result corresponding to the target text, wherein the analysis result comprises the part of speech of each word in the target text and a syntactic label corresponding to each word, and the syntactic label corresponding to each word comprises the dependency relationship between each word and a head entity word corresponding to each word;

matching a triple corresponding to the target text according to a preset matching rule and the analysis result, wherein the triple comprises a head entity word, a tail entity word and a relation word for marking a syntactic relation between the head entity word and the tail entity word;

acquiring input data, wherein the input data comprises vectors of original sentences corresponding to the triples, vectors of the triples, position vectors corresponding to the triples and part-of-speech vectors corresponding to the triples, the original sentences are sentences extracted from the target text, the vectors of the original sentences are used for indicating information of the original sentences on a feature space, the vectors of the triples are used for indicating information of the target triples on the feature space, the position vectors corresponding to the triples are used for indicating information of positions of the head entity words, the tail entity words and the relation words on the feature space, and the part-of-speech vectors corresponding to the target triples are used for indicating information of parts-of-speech of the head entity words, the tail entity words and the relation words on the feature space;

and inputting the input data into a preset classification model, and processing the input data through the preset classification model to verify the triples.

In a second aspect, there is provided a triplet generating and checking apparatus, including:

the acquisition module is used for acquiring a target text;

the analysis module is used for performing part-of-speech analysis and syntactic analysis processing on the target text to obtain an analysis result corresponding to the target text, wherein the analysis result comprises the part-of-speech of each word in the target text and a syntactic label corresponding to each word, and the syntactic label corresponding to each word comprises the dependency relationship between each word and a head entity word corresponding to each word;

the matching module is used for matching a triple corresponding to the target text according to a preset matching rule and the analysis result, wherein the triple comprises a head entity word, a tail entity word and a relation word for marking the syntactic relation between the head entity word and the tail entity word;

a checking module, configured to obtain input data, where the input data includes a vector of an original sentence corresponding to the triplet, a vector of the triplet, a position vector corresponding to the triplet, and a part-of-speech vector corresponding to the triplet, where the original sentence is a sentence extracted from the target text, the vector of the original sentence is used to indicate information of the original sentence on a feature space, the vector of the triplet is used to indicate information of the target triplet on the feature space, the position vector corresponding to the triplet is used to indicate information of positions of the head entity word, the tail entity word, and the relation word on the feature space, and the part-of-speech vector corresponding to the target triplet is used to indicate information of parts-of-speech of the head entity word, the tail entity word, and the relation word on the feature space;

the checking module is further configured to perform checking processing on the triple corresponding to the target text, and obtain a checking result of the triple corresponding to the target text.

In a third aspect, an electronic device is provided, comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps as in the first aspect and any one of its possible implementations.

In a fourth aspect, there is provided a computer storage medium storing one or more instructions adapted to be loaded by a processor and to perform the steps of the first aspect and any possible implementation thereof.

In the embodiments of the application, part-of-speech analysis and syntactic analysis processing are performed on a target text to obtain an analysis result corresponding to the target text, wherein the analysis result comprises the part of speech of each word in the target text and a syntactic label corresponding to each word, and the syntactic label corresponding to each word comprises the dependency relationship between each word and the head entity word corresponding to each word; a triple corresponding to the target text is matched according to a preset matching rule and the analysis result, wherein the triple comprises a head entity word, a tail entity word and a relation word for marking a syntactic relation between the head entity word and the tail entity word; input data is acquired, wherein the input data comprises a vector of the original sentence corresponding to the triple, a vector of the triple, a position vector corresponding to the triple and a part-of-speech vector corresponding to the triple, the original sentence is the sentence in the target text from which the triple is extracted, the vector of the original sentence is used for indicating information of the original sentence on a feature space, the vector of the triple is used for indicating information of the triple on the feature space, the position vector corresponding to the triple is used for indicating information of the positions of the head entity word, the tail entity word and the relation word on the feature space, and the part-of-speech vector corresponding to the triple is used for indicating information of the parts of speech of the head entity word, the tail entity word and the relation word on the feature space; and the input data is input into a preset classification model and processed through the preset classification model to check the triple. In this way, triples are extracted by combining syntactic analysis and part-of-speech analysis, so that the syntactic and part-of-speech information of the triples is obtained, and checking is then carried out according to the triples and this information. Because the features of the original sentence to which a triple belongs, the features of the triple itself, and the position features and part-of-speech features of the head entity word, the tail entity word and the relation word in the triple are all considered, the triples can be analyzed and checked more comprehensively, the accuracy of the triples is improved, and the triples can be used to build a more accurate knowledge graph.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments or the background art of the present application, the drawings required to be used in the embodiments or the background art of the present application will be described below.

Fig. 1 is a schematic flowchart of a triple generating and verifying method according to an embodiment of the present application;

fig. 2 is a schematic diagram of triple pattern matching according to an embodiment of the present disclosure;

fig. 3 is a schematic flowchart of a method for calculating a vector of a triplet according to an embodiment of the present disclosure;

fig. 4 is a schematic diagram of a verification structure provided in the embodiment of the present application;

fig. 5 is a schematic structural diagram of a triple generating and verifying apparatus according to an embodiment of the present application;

fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

The knowledge graph involved in the embodiments of the application is essentially a semantic network that reveals the relationships between entities. It generally consists of triples expressed in the form head node - relation - tail node, and both nodes and edges can store attributes. There are generally two storage modes: first, the Resource Description Framework (RDF); second, a graph database. A triple is generally extracted from three words among the sentence components, including a subject entity (subject), an object entity (object), and a relation between the two entities, and can be expressed as (subject, relation, object), for example: a triple is (department, employment, beauty basket), and a large number of such triples form a specific knowledge graph. Correspondingly, in the embodiments of the present application, the three words in a triple are referred to as the head entity word (head), the relation word (relation), and the tail entity word (tail).
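The head node - relation - tail node form described above can be sketched as a small in-memory structure. This is only an illustrative sketch and not the storage scheme of the application: the names `Triple` and `TripleStore` are hypothetical, and the sample triple is made up for the illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One (head, relation, tail) fact of a knowledge graph."""
    head: str
    relation: str
    tail: str


class TripleStore:
    """Hypothetical in-memory store: head node -> outgoing (relation, tail) edges."""

    def __init__(self):
        self._edges = defaultdict(list)

    def add(self, triple):
        self._edges[triple.head].append((triple.relation, triple.tail))

    def neighbors(self, head):
        # Return a copy so callers cannot mutate the store by accident.
        return list(self._edges.get(head, []))


store = TripleStore()
# Illustrative triple only; it is not taken from the application.
store.add(Triple("cosmetologist", "works in", "beauty salon"))
```

A real system would normally persist such triples in an RDF store or a graph database, as noted above.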

The embodiments of the present application will be described below with reference to the drawings.

Referring to fig. 1, fig. 1 is a schematic flowchart of a triple generating and verifying method according to an embodiment of the present disclosure. The method can comprise the following steps:

101. Acquiring a target text.

The execution subject of the embodiments of the present application may be a triplet generating and verifying apparatus, and may be an electronic device, which may be a terminal, also referred to as a terminal device in a specific implementation, including but not limited to other portable devices such as a laptop or tablet computer having a touch-sensitive surface (e.g., a touch screen display and/or a touch pad). It should also be understood that in some embodiments, the devices described above are not portable communication devices, but rather are desktop computers having touch-sensitive surfaces (e.g., touch screen displays and/or touch pads).

The target text may be an explanation text of a certain keyword, for example, a text which is queried on a network such as encyclopedia by using the keyword and takes the word as an entry, or may be other texts, which is not limited in this embodiment of the present application.

Optionally, the step 101 includes:

011. acquiring a text to be processed;

012. removing the special characters in the text to be processed, and performing sentence splitting processing on the text to be processed to obtain a target text, wherein each sentence in the target text comprises a subject.

Specifically, the text to be processed may be obtained first, and the text to be processed may be preprocessed, and the text to be processed may be obtained in any manner. Optionally, a public general crawler tool may be used to crawl the corresponding website to obtain the text corresponding to the keyword, such as: crawling a text taking a cosmetologist as an entry, wherein the obtained text is as follows:

"A cosmetologist is a professional title in the professional beauty field. Mainly works in beauty salons and places where beauty services can be provided for customers. The job function is to provide customers with beauty services, such as skin care work including face washing, skin care, massage, aromatherapy and weight loss."

Data preprocessing may be performed on the text to be processed. Special characters may be removed first; for example, "xxxI カ 2020 visits the United States" (where xxx is the name of a leader) becomes "xxx 2020 visits the United States" after preprocessing. In addition, punctuation marks such as ".", ";" and "!" may be used to split the text into sentences, and the subject of each sentence in the text to be processed may be supplemented to obtain the target text. After this stage, the text has become individual sentences for subsequent syntactic analysis. Other preprocessing steps may be added depending on the data set, which is not limited in the embodiments of the application.

For example, the text with "cosmetologist" as an entry is preprocessed to obtain the following clauses by punctuation clause segmentation:

"a cosmetologist is a professional title in the field of professional beauty";

"mainly works in beauty salons and places where beauty services can be provided for customers";

"the job function is to provide customers with beauty services, such as skin care work including face washing, skin care, massage, aromatherapy and weight loss".

The second sentence has no subject, so the subject "cosmetologist" of the first sentence can be supplemented, i.e. the second sentence becomes:

"a cosmetologist mainly works in beauty salons and places where beauty services can be provided for customers".
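The preprocessing of steps 011 and 012 can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the application: the set of characters treated as "special" is an assumption, and `preprocess`, `subject` and `has_subject` are hypothetical names; in particular, deciding whether a sentence already has a subject is delegated to a caller-supplied function, because the text does not specify how that is determined.

```python
import re

# Assumption: anything outside letters, digits, whitespace and basic
# punctuation counts as a "special character".
SPECIAL_CHARS = re.compile(r"[^A-Za-z0-9\s.;!,'-]")
# Sentence-ending punctuation used for splitting.
SENTENCE_END = re.compile(r"[.;!]")


def preprocess(text, subject, has_subject):
    """Remove special characters, split into sentences, supplement subjects.

    `subject` is the subject to prepend (e.g. the entry keyword) and
    `has_subject` is a hypothetical predicate supplied by the caller.
    """
    cleaned = SPECIAL_CHARS.sub("", text)
    sentences = []
    for part in SENTENCE_END.split(cleaned):
        sentence = " ".join(part.split())  # normalise whitespace
        if not sentence:
            continue
        if not has_subject(sentence):
            # Prepend the subject and lower-case the old first letter.
            sentence = subject + " " + sentence[0].lower() + sentence[1:]
        sentences.append(sentence)
    return sentences


raw = ("Cosmetologist is a professional title. "
       "Mainly works in beauty salons ★; "
       "The job function is to provide beauty services!")
target_text = preprocess(raw, "Cosmetologist",
                         has_subject=lambda s: not s.startswith("Mainly"))
```

Here the lambda is only a stand-in; in practice the subject check could be driven by the syntactic analysis of step 102.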

102. And performing part-of-speech analysis and syntactic analysis processing on the target text to obtain an analysis result corresponding to the target text, wherein the analysis result comprises the part of speech of each word in the target text and a syntactic label corresponding to each word, and the syntactic label corresponding to each word comprises the dependency relationship between each word and the head entity word corresponding to each word.

In the embodiments of the present application, part of speech refers to the characteristics of a word that serve as the basis for classifying it. Part of speech is a linguistic term: it is a grammatical classification of the words in a language, obtained by dividing words mainly according to grammatical features (including syntactic function and morphological change) while also taking lexical meaning into account. The words of modern Chinese can be divided into 13 parts of speech, which may include:

prep. preposition

pron. pronoun

n. noun

v. verb

conj. conjunction

s subject

sc subject complement (predicative)

o object

oc object complement

vi. intransitive verb

vt. transitive verb

aux.v. auxiliary verb

adj. adjective

adv. adverb

art. article

num. numeral

The syntactic analysis (parsing) involved in the embodiments of the present application refers to analyzing the grammatical functions of the words in a sentence. For example, in "I came late", "I" is the subject, "came" is the predicate, and "late" is the complement.

In one embodiment, performing part-of-speech analysis and syntactic analysis processing on a target text to obtain an analysis result corresponding to the target text, includes:

performing word segmentation processing on each sentence in the target text to obtain a plurality of words in the target text;

performing part-of-speech analysis on the plurality of words to determine the part-of-speech of each word;

and performing the syntactic analysis processing on the target text according to the part of speech of each word to obtain a syntactic label corresponding to each word.

Two parts of the result of the part of speech analysis and the result of the syntactic analysis, namely the part of speech of each word in the target text and the syntactic label corresponding to each word, can be obtained through the part of speech analysis and the syntactic analysis. The syntactic label comprises a dependency relationship between each word and a head entity word corresponding to the word.

The embodiments of the application rely on dependency theory, in which "dependency" refers to the relationship between a dominating word and a dominated word; this relationship is directional. The dominating word is called the governor, i.e. the head entity word (head), and the dominated word is called the dependent.

Dependencies can be subdivided into different types, each representing a specific dependency between two words. For example, in the sentence "I send her a bunch of flowers", (I <- send) is a subject-verb relationship (SBV) and (send -> flowers) is a verb-object relationship (VOB); and in "red apple", (red <- apple) is an attribute relationship (ATT), and the like.

The syntactic label obtained through syntactic analysis processing in the application can include a position index of the current word, a head entity word index of the current word, and a dependency relationship between the current word and the head entity word.

For example, for the above sentence "xxx 2020 visits the United States", the analysis result obtained after part-of-speech analysis and syntactic analysis includes the result of the part-of-speech analysis: ['nh', 'nt', 'v', 'ns'], and the result of the syntactic analysis (syntactic labels): [(1, 3, 'SBV'), (2, 3, 'ADV'), (3, 0, 'HED'), (4, 3, 'VOB')].

Here "nh" means that "xxx" is a person name, "nt" means that "2020" is a temporal noun, "v" means that "visits" is a verb, and "ns" means that "the United States" is a geographical name. In (1, 3, 'SBV'), 1 is the position index of the current word "xxx", indicating that "xxx" is the first word in the text; 3 is the index of the head entity word "visits" of the current word, indicating that "visits" is the third word in the text; and SBV denotes a subject-verb relationship, i.e. "xxx" and "visits" are in a subject-verb relationship; and so on.

In this embodiment of the present application, when a word has no corresponding head entity word (for example, the word itself is not an entity word), its parsing result has no head entity word index, which may be recorded as 0.
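The analysis result described above (one part of speech and one syntactic label per word) can be held in a simple record type. The following is a minimal sketch using the example values from the text; the names `Token`, `build_tokens` and `head_word` are hypothetical, and the labels are assumed to be listed in word order.

```python
from dataclasses import dataclass


@dataclass
class Token:
    index: int      # 1-based position of the word in the sentence
    word: str
    pos: str        # part-of-speech tag, e.g. 'nh', 'v'
    head: int       # index of the head entity word, 0 if there is none
    relation: str   # dependency relation to the head, e.g. 'SBV'


def build_tokens(words, pos_tags, labels):
    """Combine words, POS tags and (index, head, relation) labels."""
    tokens = []
    for word, pos, (index, head, relation) in zip(words, pos_tags, labels):
        tokens.append(Token(index, word, pos, head, relation))
    return tokens


def head_word(tokens, token):
    """Return the head entity word of `token`, or None when head == 0."""
    if token.head == 0:
        return None
    return tokens[token.head - 1].word


words = ["xxx", "2020", "visits", "the United States"]
pos_tags = ["nh", "nt", "v", "ns"]
labels = [(1, 3, "SBV"), (2, 3, "ADV"), (3, 0, "HED"), (4, 3, "VOB")]
tokens = build_tokens(words, pos_tags, labels)
```

With this layout, `head_word(tokens, tokens[0])` looks up the word at index 3, and a head index of 0 maps to "no head entity word".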

In the embodiments of the application, the syntactic analysis, word segmentation and the like may use LTP, the open-source natural language processing tool from Harbin Institute of Technology, or other open-source tools such as NLTK and fastNLP; a specific model may also be trained to perform syntactic analysis and part-of-speech analysis as required, which is not limited in the embodiments of the application.

103. And matching a triple corresponding to the target text according to a preset matching rule and the analysis result, wherein the triple comprises a head entity word, a tail entity word and a relation word for marking the syntactic relation between the head entity word and the tail entity word.

The preset matching rule specifies how different triples should be generated according to the dependency relationship between different entries, and after the merging and updating are completed, the triples corresponding to the target text can be matched according to the preset matching rule and the syntactic analysis result.

In an optional implementation manner, the preset matching rule includes a preset dependency relationship pattern and a triple expression corresponding to the preset dependency relationship pattern;

the matching of the triples corresponding to the target text according to the preset matching rules and the syntactic analysis results includes:

031. determining, according to the part of speech of each word and the dependency relationship corresponding to each word in the analysis result, a group of words among the plurality of words that satisfies the dependency relationships and parts of speech specified by the preset dependency relationship pattern;

032. constructing the group of words into a corresponding triple according to the triple expression corresponding to the preset dependency relationship pattern.

Specifically, a plurality of dependency relationship patterns may be preset as needed, and the triplet expression corresponding to the dependency relationship pattern may be predefined, so as to match one dependency relationship pattern corresponding to a group of words according to the part of speech of each word in the analysis result and the dependency relationship between different words, and then substitute the group of words into the expression according to the triplet expression corresponding to the pattern, so as to obtain the final triplet. Each pattern may include at least two groups of dependencies, where each dependency indicates a dependency between two words (an entity word and its corresponding head entity word), and is not described herein again. Corresponding relation triple expressions can be preset according to the dependency relationship among different entries and the part of speech of each word, and when the dependency relationship among the different entries and the part of speech of each word in a certain mode is met, the corresponding triple expressions can be adopted to substitute the entries into the triple expressions to obtain specific triple results.
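One way to picture a preset matching rule is as plain data: the parts of speech required for each slot, the dependency relationships required between slots, and the triple expression into which matched words are substituted. The sketch below is illustrative only; the dictionary layout and the names `DSNF2_RULE` and `apply_rule` are assumptions, and the brute-force search over word combinations is chosen for clarity rather than efficiency.

```python
from itertools import permutations

# Hypothetical encoding of one preset dependency relationship pattern.
DSNF2_RULE = {
    "pos": {"E1": "n", "Pred": "v", "E2": "n"},              # slot -> POS
    "deps": [("E1", "Pred", "SBV"), ("E2", "Pred", "VOB")],  # (dependent, head, relation)
    "triple": ("E1", "Pred", "E2"),                          # triple expression
}


def apply_rule(rule, words, pos, labels):
    """Return all triples of words that satisfy `rule`.

    `words` and `pos` map a 1-based word index to the word / its POS tag;
    `labels` is a list of (index, head index, relation) tuples.
    """
    edges = {(dep, head): rel for dep, head, rel in labels}
    slots = list(rule["pos"])
    triples = []
    for combo in permutations(sorted(pos), len(slots)):
        binding = dict(zip(slots, combo))
        # Every slot must have the part of speech the pattern specifies.
        if any(pos[binding[s]] != rule["pos"][s] for s in slots):
            continue
        # Every required dependency must exist with the right relation.
        if all(edges.get((binding[d], binding[h])) == rel
               for d, h, rel in rule["deps"]):
            triples.append(tuple(words[binding[s]] for s in rule["triple"]))
    return triples


words = {1: "I", 2: "send", 4: "flowers"}
pos = {1: "n", 2: "v", 4: "n"}
labels = [(1, 2, "SBV"), (4, 2, "VOB")]
matched = apply_rule(DSNF2_RULE, words, pos, labels)
```

Keeping rules as data means further patterns can be added without changing the matching function.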

Fig. 2 is a schematic diagram of triple pattern matching provided in the embodiment of the present application. As shown in fig. 2, seven matching patterns are given, DSNF1 to DSNF7, together with the logical expression and graphical expression corresponding to each pattern and the corresponding relation triple; each word and its part of speech are marked in the boxes of the graphical expression.

The arrowed lines indicate the dependency between two words, and "-" indicates the combination of two words. "{1,2}+" denotes a word that appears once or twice, and "[ ]?" denotes a word that appears once or not at all. In the embodiments of the present application, triples can be matched against the seven patterns in fig. 2 through the dependency relationships.

For example, in the DSNF1 pattern, the part of speech of E1 is n (noun), the part of speech of the core word of E1 is n, and the relationship between E1 and the core word is an ATT (attribute) relationship; the part of speech of E2 is n, and the relationship between the core word of E1 and E2 is also an ATT relationship. When the words in a sentence satisfy these relationships, the core word can be denoted as attword, and the triple (E1, attword, E2) can be extracted, where E1 is the head entity word, attword is the relation word, and E2 is the tail entity word.

As another example, the DSNF2 pattern is a subject-verb-object pattern. Specifically, E1 and E2 are nouns and the core word Pred is a verb; E1 and Pred are in a subject-verb relationship (SBV), and E2 and Pred are in a verb-object relationship (VOB), which together form a "subject-verb-object" relationship. If these relationships are satisfied, the subject, predicate and object of the sentence can be extracted as a triple, expressed as (E1, Pred, E2), where E1 is the head entity word, Pred is the relation word, and E2 is the tail entity word.

Specifically, after the analysis result is obtained, the dependency relationship in the syntactic analysis result may be matched with the dependency relationship specified in each preset pattern, whether the dependency relationship specified in the corresponding pattern exists is determined, if yes, the corresponding dependency relationship is referred to as a current dependency relationship, and then, according to the part of speech in the analysis result, whether the part of speech of each word in the current dependency relationship is consistent with the part of speech of each word specified in the dependency relationship pattern may be determined, if yes, the current dependency relationship is matched to the dependency relationship pattern, and if not, the current dependency relationship is not matched.

For example, for the original sentence "I send her a bunch of flowers", the results obtained by syntactic analysis include (1, 2, SBV), (4, 2, VOB), ..., and the part-of-speech analysis results include "I": 'n', "send": 'v', "flowers": 'n'. This is only an illustration of the matching and generation process of a triple, so some results are omitted. Here (1, 2, SBV) indicates (I <- send), a subject-verb relationship (SBV), and (4, 2, VOB) indicates (send -> flowers), a verb-object relationship (VOB). Comparing these with the preset dependency relationship patterns, the pattern satisfied is DSNF2. The three words E1, Pred and E2 in that pattern are a noun (n), a verb (v) and a noun (n) respectively, which is consistent with the part-of-speech analysis results, so the triple (E1, Pred, E2), namely (I, send, flowers), can be extracted.
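The two-stage check described above (dependency relationships first, parts of speech second) can be sketched for the DSNF2 pattern as follows. This is a hedged illustration using only the labels given in the example; the function name `match_dsnf2` and the index-keyed dictionaries are hypothetical choices.

```python
def match_dsnf2(words, pos, labels):
    """Two-stage DSNF2 matching: dependency check, then POS check.

    `words` and `pos` map a 1-based word index to the word / its POS tag;
    `labels` is a list of (index, head index, relation) tuples.
    """
    # Stage 1: find (E1, Pred, E2) index groups where E1 -SBV-> Pred
    # and E2 -VOB-> Pred share the same Pred.
    subjects = [(dep, head) for dep, head, rel in labels if rel == "SBV"]
    objects = [(dep, head) for dep, head, rel in labels if rel == "VOB"]
    candidates = [
        (subj, pred, obj)
        for subj, pred in subjects
        for obj, obj_head in objects
        if obj_head == pred
    ]

    # Stage 2: keep only groups whose parts of speech are noun, verb, noun.
    expected = ("n", "v", "n")
    triples = []
    for candidate in candidates:
        if tuple(pos.get(i) for i in candidate) == expected:
            triples.append(tuple(words[i] for i in candidate))
    return triples


words = {1: "I", 2: "send", 4: "flowers"}
pos = {1: "n", 2: "v", 4: "n"}
labels = [(1, 2, "SBV"), (4, 2, "VOB")]
triples = match_dsnf2(words, pos, labels)
```

If the part of speech of any of the three words differs from the pattern, the group passes stage 1 but is rejected in stage 2, so no triple is produced.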

104. Acquiring input data, wherein the input data comprises a vector of an original sentence corresponding to the triplet, a vector of the triplet, a position vector corresponding to the triplet and a part-of-speech vector corresponding to the triplet, the original sentence is a sentence from which the triplet is extracted from the target text, the vector of the original sentence is used for indicating information of the original sentence on a feature space, the vector of the triplet is used for indicating information of the target triplet on the feature space, the position vector corresponding to the triplet is used for indicating information of positions of the head entity word, the tail entity word and the relation word on the feature space, and the part-of-speech vector corresponding to the target triplet is used for indicating information of parts-of-speech of the head entity word, the tail entity word and the relation word on the feature space.

Most of the existing BERT models are trained to determine the relationship of entities, and in the embodiment of the application, whether the triples are trusted can be determined based on the entities and the relationship of the input triples by using the trained BERT models. Specifically, the preset classification model may be obtained by training based on labeled sample data, where the labeled sample data includes a plurality of triple samples, the triple samples are labeled with confidence identifiers, and the confidence identifiers indicate that the triple samples are false triples or correct triples.

The input data of the classification model in the embodiment of the present application includes a vector of an original sentence corresponding to a target triple, a vector of the target triple, a position vector corresponding to the target triple, and a part-of-speech vector corresponding to the target triple. The following describes the obtaining of the four-part vector for a current triplet.

Specifically, the original sentence is the sentence from which the current triple was extracted, and the vector may be initialized using the weights of a trained BERT model to obtain the vector of the original sentence corresponding to the current triple. The vector of the original sentence represents the information of the sentence in vector form; for example, if each word is represented by a 128-dimensional vector and the sentence has 10 words, the vector of the sentence is a 10 x 128-dimensional vector.
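The shape of the sentence vector can be illustrated with a toy embedding lookup. This is only a sketch of the dimensions involved: the random table below is a stand-in assumption for the weights of a trained BERT model, and `sentence_matrix` is a hypothetical helper name.

```python
import random


def sentence_matrix(words, table, dim=128, seed=0):
    """Stack one `dim`-dimensional vector per word into a len(words) x dim matrix.

    `table` maps a word to its vector; unseen words get a random vector,
    standing in for weights that would come from a trained model.
    """
    rng = random.Random(seed)
    rows = []
    for word in words:
        if word not in table:
            table[word] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        rows.append(table[word])
    return rows


sentence = ["w%d" % i for i in range(10)]   # a 10-word placeholder sentence
matrix = sentence_matrix(sentence, table={})
shape = (len(matrix), len(matrix[0]))       # (number of words, vector size)
```

For the 10-word sentence and 128-dimensional word vectors used here, the resulting matrix has 10 rows of 128 values each.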

Optionally, for a target triple, a vector of a head entity word, a vector of a tail entity word, and a vector of a relation word may be obtained, and then the vector of the head entity word, the vector of the tail entity word, and the vector of the relation word are added to obtain the vector of the target triple.

Specifically, the vector of the target triplet is the sum of the vectors of the three words of the current triplet. Similarly, the vectors are initialized with the weights of the trained BERT model to obtain the vector of each word in the current triplet, and the vectors of all the words are added to obtain the vector of the current triplet. The vector of a word referred to in the embodiments of the present application is a two-dimensional vector, and the summation referred to in the above steps means that the data are superimposed (concatenated) along the second dimension of the vector.

For example, fig. 3 is a schematic flow chart of a vector calculation method for a triplet, where C refers to a concatenate function. For one-dimensional data a: [1,2,3] and b: [4,5,6], a concatenate b = [1,2,3,4,5,6]. For a two-dimensional vector, the data are superimposed along the second dimension, that is, the processing shown above for one-dimensional data is performed on the second dimension. For example, if the vector dimension of the head entity word is 1 × 250 and the vector dimension of the tail entity word is 1 × 250, the vector dimension after concatenation is 1 × 500.
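The concatenation described above can be sketched with NumPy as follows. This is only an illustration: the zero and one arrays are placeholders standing in for trained word vectors, and the variable names are not taken from the application.

```python
import numpy as np

# One-dimensional case from the text: a = [1, 2, 3], b = [4, 5, 6].
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
ab = np.concatenate([a, b])

# Two-dimensional case: placeholder 1 x 250 word vectors
# (zeros and ones stand in for trained weights).
head_vec = np.zeros((1, 250))
tail_vec = np.ones((1, 250))

# axis=1 joins the data along the second dimension.
head_tail = np.concatenate([head_vec, tail_vec], axis=1)
```

Here `ab` holds the six values in order and `head_tail` has shape 1 × 500, matching the dimensions given in the example.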

In one embodiment, the calculation of the position vector of the triplet may comprise the steps of:

41. acquiring the head entity words, the tail entity words and the position information of the relation words in the original sentences;

42. calculating a plurality of pieces of relative position information according to the position information, wherein the plurality of pieces of relative position information comprise relative position information between every two of head entity words, tail entity words and relation words and relative position information between respective head and tail characters of the head entity words, the tail entity words and the relation words;

43. coding the relative position information to obtain a plurality of position codes;

44. and carrying out linear transformation on the sum of the position codes to obtain the position vector corresponding to the target triple.

The position coding in the embodiment of the present application refers to the coding, on a feature space, of the position of a certain character or word in a sentence. It is similar to a word vector, except that a word vector is the coding of a certain word on the feature space, whereas the position coding is the coding of a position on the feature space.

The three words in the triplet are respectively a head entity word, a relation word and a tail entity word, and the position information of each word may include a start index and a stop index of the word, where the start index of a word indicates the position of the first character of the word in the original sentence, and the stop index indicates the position information of the last character of the word in the original sentence, i.e., the position of a word in the sentence may be determined by the start index and the stop index.

In the embodiment of the application, head refers to the head entity word, rel refers to the relation word, and tail refers to the tail entity word; head[i] refers to the start index of the head entity word, head[j] refers to the stop index of the head entity word, rel[i] refers to the start index of the relation word, rel[j] refers to the stop index of the relation word, tail[i] refers to the start index of the tail entity word, and tail[j] refers to the stop index of the tail entity word. For example, in the sentence "Xiaoming is the class leader", the head entity word is "Xiaoming", its start index is 0 (the first character of the sentence is numbered 0), and its stop index is 2, which together indicate the location of the word "Xiaoming" in the sentence.

Thus, for a triplet, the position information that may be obtained includes: the start and end indices of the head entity word, the start and end indices of the tail entity word, and the start and end indices of the relation word. From these, the relative position information between every two of the head entity word, the tail entity word and the relation word in the triple, and the relative position information between the first and last characters of each of these words, can be calculated.

Specifically, the step 42 may include:

calculating the difference between each start index and each end index according to the start index and the end index of the head entity word, the start index and the end index of the tail entity word, and the start index and the end index of the relation word, to obtain a plurality of pieces of relative position information.

In the embodiment of the present application, a plurality of pieces of relative position information, for example, may be obtained by calculating a difference between each start index and each end index. For three words in the above triplet, in the case that each word has a corresponding start index and an end index, eight pieces of relative position information may be calculated, which respectively include:

Difference between head's start index and head's end index

Difference between head's start index and tail's end index

Difference between tail's start index and head's end index

Difference between tail's start index and tail's end index

Difference between head's start index and rel's end index

Difference between rel's start index and rel's end index

Difference between rel's start index and tail's end index

Difference between tail's start index and rel's end index

Specifically, each piece of relative position information d may be calculated according to the following formula:
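A minimal sketch of the eight differences, under stated assumptions: each word is given as a (start index, end index) pair, every difference is taken as start index minus end index, and the last item of the list is read as the tail's start index minus the relation word's end index. The function name, the dictionary keys and the example spans are hypothetical.

```python
from typing import Dict, Tuple

Span = Tuple[int, int]  # (start index, end index) of a word in the original sentence

def relative_positions(head: Span, rel: Span, tail: Span) -> Dict[str, int]:
    """Return the eight start-minus-end differences for one triple."""
    return {
        "head_start-head_end": head[0] - head[1],
        "head_start-tail_end": head[0] - tail[1],
        "tail_start-head_end": tail[0] - head[1],
        "tail_start-tail_end": tail[0] - tail[1],
        "head_start-rel_end": head[0] - rel[1],
        "rel_start-rel_end": rel[0] - rel[1],
        "rel_start-tail_end": rel[0] - tail[1],
        "tail_start-rel_end": tail[0] - rel[1],
    }

# Hypothetical spans: head at (0, 2), relation word at (2, 3), tail at (3, 5).
d = relative_positions(head=(0, 2), rel=(2, 3), tail=(3, 5))
```

Each value in `d` is one piece of relative position information that is subsequently encoded.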

Further, the position code corresponding to each relative position d may be calculated. A function may be evaluated separately on the odd dimensions and the even dimensions and the results combined, specifically by the following formula:

where 2i indicates the even dimensions of the code of position d, and 2i + 1 indicates the odd dimensions of the code of position d. The two results are combined to give the final position code of position d, so that the position code PE corresponding to each piece of relative position information d can be obtained.
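One common encoding that matches this even/odd description is the sinusoidal position encoding of the original Transformer, with a sine on the even dimensions and a cosine on the odd dimensions. The sketch below assumes that form; the choice of sine and cosine, the scaling base 10000 and the dimension 128 are assumptions, not values stated in this application.

```python
import numpy as np

def position_code(d: int, dim: int = 128, base: float = 10000.0) -> np.ndarray:
    """Encode one relative position d as a dim-dimensional vector.

    Assumption: sine on even dimensions 2i and cosine on odd dimensions
    2i + 1, with the scaling base used by the original Transformer.
    dim is assumed to be even.
    """
    i = np.arange(dim // 2)
    angle = d / np.power(base, 2 * i / dim)
    pe = np.zeros(dim)
    pe[0::2] = np.sin(angle)  # even dimensions
    pe[1::2] = np.cos(angle)  # odd dimensions
    return pe

# A relative position may be negative (start index minus end index).
pe = position_code(-2)
```

Because sine and cosine are defined for negative arguments, negative differences need no special handling in this sketch.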

Further, the sum of the above position codes may be linearly transformed and taken as the final position vector. In particular, it may be calculated based on an activation function and an initialization matrix of the model. In an alternative embodiment, the activation function may be the ReLU activation function, and the final position vector of the target triplet may be calculated by the following formula:

where Wr denotes a matrix initialized at random.
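A minimal sketch of this step, assuming the position vector is obtained as ReLU(Wr · ΣPE). The order of operations, the dimension 128 and the random stand-ins for the eight position codes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 128

# Stand-ins for the eight position codes PE (one row per relative position).
position_codes = rng.normal(size=(8, dim))

# W_r: randomly initialized matrix; in a real model it would be updated by training.
W_r = rng.normal(scale=0.02, size=(dim, dim))

def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)

summed = position_codes.sum(axis=0)    # sum of the position codes
position_vector = relu(W_r @ summed)   # linear transformation followed by ReLU
```

The result is a single vector of length `dim` that summarizes all eight relative positions of the triple.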

Optionally, the acquiring the input data further includes:

acquiring the part of speech of each word in the target triple;

and acquiring a part-of-speech vector corresponding to the part-of-speech of each word according to a preset mapping relation between the part-of-speech and the part-of-speech vector, and taking the part-of-speech vector corresponding to the part-of-speech of each word as the part-of-speech vector corresponding to the target triple.

The classification model in the embodiment of the application can initialize a multi-dimensional vector matrix according to a preset part-of-speech type, and the multi-dimensional vector matrix is used as a part-of-speech vector candidate set, namely the mapping relation between the preset part-of-speech and a part-of-speech vector is included. For example, there are 10 total part-of-speech categories, and 10 x 128-dimensional vectors can be initialized to represent the 10 parts-of-speech, each part-of-speech being a 128-dimensional vector. When a part of speech is input, a vector representing the corresponding part of speech can be found in the matrix according to the input part of speech as a part of speech vector.
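The lookup described above can be sketched as an index into a randomly initialized matrix. The tag names below are hypothetical placeholders for whatever part-of-speech tag set is used; only the 10 x 128 shape comes from the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tag set with 10 part-of-speech categories.
POS_TAGS = ["n", "v", "a", "d", "p", "r", "m", "q", "c", "u"]
POS_INDEX = {tag: idx for idx, tag in enumerate(POS_TAGS)}

# 10 x 128 matrix: one 128-dimensional vector per part of speech.
pos_matrix = rng.normal(size=(len(POS_TAGS), 128))

def pos_vector(tag: str) -> np.ndarray:
    """Look up the vector of one part of speech in the candidate set."""
    return pos_matrix[POS_INDEX[tag]]

# Part-of-speech vectors for a triple tagged (noun, verb, noun).
triple_pos = np.stack([pos_vector(t) for t in ("n", "v", "n")])
```

Words that share a part of speech receive the same vector, so the first and third rows of `triple_pos` are identical here.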

Through the above steps, the characteristics of each aspect of the triples are fully considered, including the characteristics of the original sentences to which the triples belong, the characteristics of the triples themselves, and the position characteristics and part-of-speech characteristics of the head entity words, the tail entity words and the relation words in the triples. The triples can therefore be analyzed more comprehensively to identify whether they are correct, and the accuracy of the triples can be further improved through verification.

105. And inputting the input data into a preset classification model, and processing the input data through the preset classification model to verify the triples.

In the embodiment of the application, the triple can be checked through a preset classification model. In particular, the classification model may use a BERT model, which may perform text classification. The trained classification model can judge whether each triplet is correct or incorrect.

Referring to fig. 4, fig. 4 is a schematic diagram of a verification structure provided in the embodiment of the present application. As shown in fig. 4, during training, a vector of the triplet sample, a vector of the original sentence corresponding to the triplet sample, a position vector corresponding to the triplet sample, and a part-of-speech vector of the triplet sample may first be input into the classification model. These vectors may be obtained by random initialization, with their weights then updated through model training; alternatively, vector data of already trained triples may be used, which is not limited herein. Further, the vectors are added (using the concatenate function, see the description above and fig. 3) and input into the encoder of the model for encoding; the output of the last layer is then linearly transformed by a linear layer and processed by an activation function layer.

The encoder may adopt the encoder structure of the Transformer, a seq2seq model proposed by Google Brain. The processing flow of the encoder mainly comprises the following steps: the input is processed by an attention mechanism (Multi-Head Attention), a residual connection is applied, and a layer normalization (LayerNorm) transformation is then performed (the hidden layer of the neural network is normalized toward a standard normal distribution, which accelerates convergence); finally, a linear mapping is carried out through a feed-forward network (Feed Forward) and activated by an activation function. The encoding task is completed by repeating this process N times.
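The processing flow above can be sketched in NumPy as follows. This is a simplified illustration rather than the model of the application: it uses a single attention head instead of multi-head attention, random untrained weights, small hypothetical dimensions, and it follows the order given in the text (the standard Transformer encoder also applies a second residual connection and normalization after the feed-forward network, which is omitted here).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each row to roughly zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def init_params(dim, hidden, rng):
    shapes = {"wq": (dim, dim), "wk": (dim, dim), "wv": (dim, dim),
              "w1": (dim, hidden), "w2": (hidden, dim)}
    return {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

def encoder_layer(x, p):
    # Single-head self-attention: every position attends to all positions.
    q, k, v = x @ p["wq"], x @ p["wk"], x @ p["wv"]
    weights = softmax(q @ k.T / np.sqrt(x.shape[-1]))
    # Residual connection followed by layer normalization.
    x = layer_norm(x + weights @ v)
    # Feed-forward network with a ReLU activation.
    return np.maximum(x @ p["w1"], 0.0) @ p["w2"]

def encode(x, layers):
    # Repeat the layer N times, where N = len(layers).
    for p in layers:
        x = encoder_layer(x, p)
    return x

rng = np.random.default_rng(0)
layers = [init_params(dim=16, hidden=32, rng=rng) for _ in range(2)]
tokens = rng.normal(size=(10, 16))   # 10 positions, 16 features each
encoded = encode(tokens, layers)
```

The output keeps the shape of the input, which is what allows the same layer structure to be stacked N times.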

Specifically, as shown in fig. 4, FFN + sigmoid means that a fully connected layer is followed by a sigmoid activation function to perform binary classification. The resulting probability can be converted into a label, which in the present application is the classification result of whether a triplet is reliable, so the trained classification model can make predictions on triplets. The activation function maps a number x to a value between 0 and 1, which can be interpreted as a probability, and the probability finally obtained is the result after the sigmoid.
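A minimal sketch of a fully connected layer followed by a sigmoid. The 0.5 decision threshold, the feature size of 16 and the random weights are assumptions made for illustration; the application does not state them.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def classify(encoded: np.ndarray, w: np.ndarray, b: float, threshold: float = 0.5):
    """Fully connected layer followed by a sigmoid, then a hard label.

    Returns (probability that the triple is correct, label 1 or 0).
    """
    logit = float(encoded @ w + b)
    prob = float(sigmoid(logit))
    label = 1 if prob >= threshold else 0
    return prob, label

rng = np.random.default_rng(0)
encoded = rng.normal(size=16)   # stand-in for the encoder output of one triple
w = rng.normal(size=16)         # untrained weights of the fully connected layer
prob, label = classify(encoded, w, b=0.0)
```

With trained weights, a label of 1 would mark the triple as trustworthy and a label of 0 as untrustworthy.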

Through the attention mechanism, each character in a sentence contains information of all other characters in the sentence, data correlation is increased, and the feature expression capability of the model is improved by utilizing part of speech analysis and position coding, so that the accuracy of the model can be improved.

In practical application, the four vectors related to the sentence and the triplet can be obtained through the above steps. Similar to the training process, the sum of the four vectors is input into the model; after encoding by the encoder of the model, the resulting value is converted into a probability by the sigmoid function, and the credibility identifier corresponding to the triplet is obtained. If the result is 1, the triplet is trustworthy; if the result is 0, the triplet is untrustworthy and needs to be filtered out.

For example, if the vector after neural network processing is [1, 3] and its values are converted into the probabilities [0.25, 0.75], the probability of 1 (correct) is 0.25 and the probability of 0 (wrong) is 0.75, so the final classification prediction result is 0 and the triplet is a wrong triplet. According to the credibility identifiers of the triples, whether each triple is a wrong triple or a correct triple can be judged, and the triples can thus be checked.

The classification model used in the embodiments of the present application is BERT. Alternatively, it may be replaced with a machine learning classification model such as a support vector machine (SVM) or logistic regression, or with another deep learning model such as a convolutional neural network (CNN) or ALBERT, which is not limited in this embodiment of the present application.

Optionally, after the step 105, the method further includes:

filtering out the triples that the classification model considers erroneous, taking the triples that remain after the erroneous triples are deleted as the final triple result, and constructing a knowledge graph according to them.

In the case that the extracted triples may have errors, this step may use the classification model to remove some obviously erroneous triples, so as to improve the accuracy. And then, taking the obtained entities in the triples as nodes and taking the space-time relationship among different entities as connecting lines, wherein one entity can emit one or more connecting lines to be connected with other entities, so that an entity relationship network presented in a graph structure, namely a knowledge graph, can be constructed.
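The filtering and graph construction can be sketched as follows, assuming each triple arrives with the 0/1 label produced by the classification model and that each connecting line is labeled with the relation word of the triple. The function name, the adjacency-list representation and the example triples are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation word, tail entity)

def build_graph(triples: List[Triple], labels: List[int]) -> Dict[str, List[Tuple[str, str]]]:
    """Keep triples labeled 1 and return an adjacency list of the graph.

    Each entity is a node; each kept triple adds one connecting line
    from its head entity to its tail entity.
    """
    graph = defaultdict(list)
    for (head, rel, tail), label in zip(triples, labels):
        if label == 1:                      # drop triples judged erroneous
            graph[head].append((rel, tail))
            graph.setdefault(tail, [])      # make sure the tail is also a node
    return dict(graph)

triples = [("Xiaoming", "is", "class leader"), ("Xiaoming", "is", "the")]
labels = [1, 0]  # hypothetical model output: the second triple is erroneous
graph = build_graph(triples, labels)
```

An adjacency list is used only to keep the sketch self-contained; the same nodes and connecting lines could equally be written to a graph database.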

Optionally, after the above steps, the knowledge graph may be stored in a database, for example in the graph database ArangoDB, and the stored graph may be displayed. In the embodiment of the application, the triplets can be verified through the classification model, so that the accuracy of triplet extraction can be improved and a more accurate and reliable graph can be established.

In the embodiment of the application, a part-of-speech analysis and a syntactic analysis processing are performed on a target text to obtain an analysis result corresponding to the target text, wherein the analysis result comprises a part-of-speech of each word in the target text and a syntactic label corresponding to each word, and the syntactic label corresponding to each word comprises a dependency relationship between each word and a head entity word corresponding to each word; matching a triple corresponding to the target text according to a preset matching rule and the analysis result, wherein the triple comprises a head entity word, a tail entity word and a relation word for marking a syntactic relation between the head entity word and the tail entity word; acquiring input data, wherein the input data comprises vectors of original sentences corresponding to the triples, vectors of the triples, position vectors corresponding to the triples and part-of-speech vectors corresponding to the triples, the original sentences are sentences extracted from the target text, the vectors of the original sentences are used for indicating information of the original sentences on a feature space, the vectors of the triples are used for indicating information of the target triples on the feature space, the position vectors corresponding to the triples are used for indicating information of positions of the head entity words, the tail entity words and the relation words on the feature space, and the part-of-speech vectors corresponding to the target triples are used for indicating information of parts-of-speech of the head entity words, the tail entity words and the relation words on the feature space; and inputting the input data into a preset classification model, and processing the input data through the preset classification model to verify the triples. 
The triplet is generated by an unsupervised method of syntactic analysis, part of speech analysis and pattern recognition, manual tagging is not needed, and tagging cost is reduced. And subsequently, verification processing is carried out according to the triples and the information such as the syntax and the part of speech of the triples, the characteristics of the original sentences to which the triples belong, the characteristics of the triples, and the position characteristics and the part of speech characteristics of the head entity words, the tail entity words and the relation words in the triples are considered, so that the triples can be analyzed and verified more comprehensively, the accuracy of the triples is improved, and the triples can be used for establishing a more accurate knowledge map.

Based on the description of the embodiment of the triple generating and checking method, the embodiment of the application also discloses a triple generating and checking device. Referring to fig. 5, the triple generating and verifying apparatus 500 includes:

an obtaining module 510, configured to obtain a target text;

an analysis module 520, configured to perform part-of-speech analysis and syntactic analysis processing on the target text to obtain an analysis result corresponding to the target text, where the analysis result includes a part-of-speech of each word in the target text and a syntactic label corresponding to each word, and the syntactic label corresponding to each word includes a dependency relationship between each word and a head entity word corresponding to each word;

a matching module 530, configured to match a triple corresponding to the target text according to a preset matching rule and the analysis result, where the triple includes a head entity word, a tail entity word, and a relation word that identifies a syntactic relationship between the head entity word and the tail entity word;

a checking module 540, configured to obtain input data, where the input data includes a vector of an original sentence corresponding to the triplet, a vector of the triplet, a position vector corresponding to the triplet, and a part-of-speech vector corresponding to the triplet, where the original sentence is a sentence extracted from the target text, the vector of the original sentence is used to indicate information of the original sentence on a feature space, the vector of the triplet is used to indicate information of the target triplet on the feature space, the position vector corresponding to the triplet is used to indicate information of positions of the head entity word, the tail entity word, and the relation word on the feature space, and the part-of-speech vector corresponding to the target triplet is used to indicate information of parts-of-speech of the head entity word, the tail entity word, and the relation word on the feature space;

the checking module 540 is further configured to check the triple corresponding to the target text, and obtain a checking result of the triple corresponding to the target text.

According to an embodiment of the present application, each step involved in the method shown in fig. 1 may be performed by each module in the triple generating and verifying apparatus 500 shown in fig. 5, and is not described herein again.

Based on the description of the method embodiment and the device embodiment, the embodiment of the application further provides an electronic device. Referring to fig. 6, the electronic device 600 at least includes a processor 601, a memory 602, and an input/output unit 603. The processor 601 may be a Central Processing Unit (CPU), which is a final execution unit for information processing and program operation as an operation and control core of the computer system.

A computer storage medium may be stored in the memory 602 of the electronic device 600, the computer storage medium being used to store a computer program comprising program instructions, and the processor 601 may execute the program instructions stored in the memory 602. The preset classification models and the like in the embodiment of the present application may also be stored in the memory 602.

In an embodiment, the electronic device 600 according to the embodiment of the present application may be configured to perform a series of processes, including the method according to any embodiment shown in fig. 1, and so on, which are not described herein again.

An embodiment of the present application further provides a computer storage medium (Memory), which is a Memory device in an electronic device and is used to store programs and data. It is understood that the computer storage medium herein may include both a built-in storage medium in the electronic device and, of course, an extended storage medium supported by the electronic device. Computer storage media provide storage space that stores an operating system for an electronic device. Also stored in the memory space are one or more instructions, which may be one or more computer programs (including program code), suitable for loading and execution by the processor. The computer storage medium may be a high-speed RAM memory, or may be a non-volatile memory (non-volatile memory), such as at least one disk memory; and optionally at least one computer storage medium located remotely from the processor.

In one embodiment, one or more instructions stored in a computer storage medium may be loaded and executed by a processor to perform the corresponding steps in the above embodiments; in particular implementations, one or more instructions in the computer storage medium may be loaded by the processor and perform any steps of the method in fig. 1, which are not described herein again.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and modules may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the division of the module is only one logical division, and other divisions may be possible in actual implementation, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not performed. The shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection of devices or modules through some interfaces, and may be in an electrical, mechanical or other form.

Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the present application are wholly or partially generated when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on or transmitted over a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)), or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that includes one or more of the available media. The usable medium may be a read-only memory (ROM), or a Random Access Memory (RAM), or a magnetic medium, such as a floppy disk, a hard disk, a magnetic tape, a magnetic disk, or an optical medium, such as a Digital Versatile Disk (DVD), or a semiconductor medium, such as a Solid State Disk (SSD).

Claims

1. A triplet generation and verification method, comprising:

acquiring a target text;

2. The triplet generation and verification method of claim 1 wherein the obtaining input data comprises:

acquiring the vector of the head entity word, the vector of the tail entity word and the vector of the relation word;

adding the vector of the head entity word, the vector of the tail entity word and the vector of the relation word to obtain a vector of the target triple;

acquiring the head entity words, the tail entity words and the position information of the relation words in the original sentences;

calculating a plurality of pieces of relative position information according to the position information, wherein the plurality of pieces of relative position information comprise relative position information between every two of head entity words, tail entity words and relation words and relative position information between respective head and tail characters of the head entity words, the tail entity words and the relation words;

coding the relative position information to obtain a plurality of position codes;

and carrying out linear transformation on the sum of the position codes to obtain the position vector corresponding to the target triple.

3. The triplet generation and verification method of claim 2 wherein the obtaining input data further comprises:

acquiring the part of speech of each word in the target triple;

4. The triplet generating and checking method according to claim 2, characterised in that the location information comprises: the starting index and the ending index of the head entity word, the starting index and the ending index of the tail entity word and the starting index and the ending index of the relation word, the starting index of the target word is used for indicating the position of the first character of the target word in the original sentence, the ending index of the target word is used for indicating the position information of the last character of the target word in the original sentence, and the target word is the head entity word, the tail entity word or the relation word;

the calculating a plurality of relative position information according to the position information comprises:

and calculating the difference between each initial index and each terminal index according to the initial index and the terminal index of the head entity word, the initial index and the terminal index of the tail entity word and the initial index and the terminal index of the relation word to obtain the relative position information.

5. The triple generation and verification method according to claim 1, wherein the preset matching rule includes a preset relationship pattern and a triple expression corresponding to the preset relationship pattern, and the preset relationship pattern specifies a dependency relationship between every two words and a part of speech of each word;

determining a group of words meeting the dependency relationship and the part of speech specified by the preset relationship mode in the words according to the part of speech of each word and the dependency relationship corresponding to each word in the analysis result;

and constructing the group of words into corresponding triples according to the triple expressions corresponding to the preset relation modes.

6. The triplet generating and checking method according to any one of claims 1-5, wherein the obtaining the target text comprises:

acquiring a text to be processed;

and removing special characters in the text to be processed, and performing sentence division processing on the text to be processed to obtain a target text, wherein each sentence in the target text comprises a subject.

7. The triplet generation and verification method according to claim 1, wherein after inputting the input data into a preset classification model and processing the input data through the preset classification model to verify the triplets, the method further comprises:

deleting the erroneous triples, and constructing a knowledge graph according to the triples that remain after the erroneous triples are deleted.

8. A triplet generating and verifying apparatus comprising:

the acquisition module is used for acquiring a target text;

9. An electronic device comprising a processor and a memory, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the triplet generation and verification method of any one of claims 1 to 7.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, causes the processor to carry out the steps of the triplet generation and verification method according to any one of claims 1 to 7.