CN113515630B

CN113515630B - Triplet generation and verification method and device, electronic equipment and storage medium

Info

Publication number: CN113515630B
Application number: CN202110650253.3A
Authority: CN
Inventors: 曾钢欣
Original assignee: Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Current assignee: Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority date: 2021-06-10
Filing date: 2021-06-10
Publication date: 2024-04-09
Anticipated expiration: 2041-06-10
Also published as: CN113515630A

Abstract

The application discloses a triplet generation and verification method, a triplet generation and verification device, electronic equipment and a storage medium. The method comprises the following steps: performing part-of-speech analysis and syntactic analysis on a target text to obtain an analysis result, wherein the analysis result comprises the part of speech and the syntactic label of each word in the target text, and the syntactic label corresponding to each word comprises the dependency relationship between the word and its head entity word; matching a triplet corresponding to the target text according to a preset matching rule and the analysis result, wherein the triplet comprises a head entity word, a tail entity word and a relation word for identifying the syntactic relation between the head entity word and the tail entity word; acquiring input data, wherein the input data comprises a vector of the original sentence corresponding to the triplet, a vector of the triplet, a position vector corresponding to the triplet and a part-of-speech vector; and inputting the input data into a preset classification model and processing the input data through the preset classification model to verify the triplet.

Description

Triplet generation and verification method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, an electronic device, and a storage medium for generating and verifying a triplet.

Background

As intelligent products of all kinds come into use, the knowledge graph plays an important role as the knowledge "brain" that empowers them. However, the cost of constructing a knowledge graph is very high. Experts in the relevant field are usually required to define the data schema of the knowledge graph in advance, massive amounts of labeled data are required for knowledge extraction and knowledge fusion, and a storage database that responds quickly and can hold massive amounts of data is required. Consequently, only large companies are capable of constructing large-scale knowledge graphs. For many small and medium-sized companies this is very difficult work, because a large-scale knowledge graph requires a large volume of data and extensive manual participation, and is costly.

Entity extraction is one of the classical tasks of Natural Language Processing (NLP); its aim is to extract entities from structured, semi-structured or unstructured data. For example, if the defined entity type is "country" and the unstructured text is "China has five thousand years of history and culture", the extracted entity is "China". At present, some methods obtain triples by segmenting an acquired corpus into words and using the segmented words as the candidate set of entities, but the triples generated in this way are not accurate enough.

Disclosure of Invention

The application provides a triplet generation and verification method, a triplet generation and verification device, electronic equipment and a storage medium.

In a first aspect, a method for generating and checking a triplet is provided, including:

acquiring a target text;

performing part-of-speech analysis and syntactic analysis on the target text to obtain an analysis result corresponding to the target text, wherein the analysis result comprises the part of speech of each word in the target text and a syntactic label corresponding to each word, and the syntactic label corresponding to each word comprises the dependency relationship between each word and a head entity word corresponding to each word;

according to a preset matching rule and the analysis result, matching a triplet corresponding to the target text, wherein the triplet comprises a head entity word, a tail entity word and a relation word for identifying the syntactic relation between the head entity word and the tail entity word;

acquiring input data, wherein the input data comprises a vector of an original sentence corresponding to a triplet, a vector of the triplet, a position vector corresponding to the triplet and a part-of-speech vector corresponding to the triplet, the original sentence is a sentence of the triplet extracted from the target text, the vector of the original sentence is used for indicating information of the original sentence on a feature space, the vector of the triplet is used for indicating information of the target triplet on the feature space, the position vector corresponding to the triplet is used for indicating information of positions of the head entity word, the tail entity word and the relation word on the feature space, and the part-of-speech vector corresponding to the target triplet is used for indicating information of the head entity word, the tail entity word and the part-of-speech of the relation word on the feature space;

and inputting the input data into a preset classification model, and processing the input data through the preset classification model to verify the triplet.

In a second aspect, there is provided a triplet generation and verification device, comprising:

the acquisition module is used for acquiring the target text;

the analysis module is used for performing part-of-speech analysis and syntactic analysis on the target text to obtain an analysis result corresponding to the target text, wherein the analysis result comprises the part-of-speech of each word in the target text and a syntactic label corresponding to each word, and the syntactic label corresponding to each word comprises the dependency relationship between each word and a head entity word corresponding to each word;

the matching module is used for matching a triplet corresponding to the target text according to a preset matching rule and the analysis result, wherein the triplet comprises a head entity word, a tail entity word and a relation word for identifying the syntactic relation between the head entity word and the tail entity word;

the verification module is used for acquiring input data, wherein the input data comprises a vector of an original sentence corresponding to the triplet, a vector of the triplet, a position vector corresponding to the triplet and a part-of-speech vector corresponding to the triplet, the original sentence is a sentence of the triplet extracted from the target text, the vector of the original sentence is used for indicating information of the original sentence on a feature space, the vector of the triplet is used for indicating information of the target triplet on the feature space, the position vector corresponding to the triplet is used for indicating information of positions of the head entity word, the tail entity word and the relation word on the feature space, and the part-of-speech vector corresponding to the target triplet is used for indicating information of the head entity word, the tail entity word and the part-of-speech of the relation word on the feature space;

The verification module is further used for performing verification processing on the triples corresponding to the target text to obtain verification results of the triples corresponding to the target text.

In a third aspect, there is provided an electronic device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps as in the first aspect and any one of its possible implementations.

In a fourth aspect, there is provided a computer storage medium storing one or more instructions adapted to be loaded by a processor and to perform the steps of the first aspect and any one of its possible implementations described above.

In the present application, part-of-speech analysis and syntactic analysis are performed on the target text to obtain an analysis result corresponding to the target text, wherein the analysis result comprises the part of speech of each word in the target text and a syntactic label corresponding to each word, and the syntactic label corresponding to each word comprises the dependency relationship between the word and its corresponding head entity word. A triplet corresponding to the target text is matched according to a preset matching rule and the analysis result, wherein the triplet comprises a head entity word, a tail entity word and a relation word for identifying the syntactic relation between the head entity word and the tail entity word. Input data is then acquired, wherein the input data comprises a vector of the original sentence corresponding to the triplet, a vector of the triplet, a position vector corresponding to the triplet and a part-of-speech vector corresponding to the triplet; the original sentence is the sentence in the target text from which the triplet is extracted; the vector of the original sentence indicates information of the original sentence in a feature space; the vector of the triplet indicates information of the triplet in the feature space; the position vector corresponding to the triplet indicates information, in the feature space, of the positions of the head entity word, the tail entity word and the relation word; and the part-of-speech vector corresponding to the triplet indicates information, in the feature space, of the parts of speech of the head entity word, the tail entity word and the relation word. The input data is input into a preset classification model and processed by the preset classification model to verify the triplet. Because the triplet is extracted by means of syntactic analysis and part-of-speech analysis, the syntactic and part-of-speech information of the triplet is obtained, and the subsequent verification is performed according to the triplet together with that information. The verification takes into account the features of the original sentence to which the triplet belongs, the features of the triplet itself, and the position features and part-of-speech features of the head entity word, the tail entity word and the relation word in the triplet, so that the triplet can be analyzed and verified more comprehensively, the accuracy of the triplets is improved, and the method can be used to build a more accurate knowledge graph.

Drawings

In order to more clearly describe the technical solutions in the embodiments or the background of the present application, the following description will describe the drawings that are required to be used in the embodiments or the background of the present application.

FIG. 1 is a schematic flow chart of a method for generating and checking triples according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a triplet pattern matching provided in an embodiment of the present application;

FIG. 3 is a schematic flow chart of a method for calculating a vector of a triplet according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a verification structure according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a triplet generating and checking device according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order to make the present application solution better understood by those skilled in the art, the following description will clearly and completely describe the technical solution in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.

The terms first, second and the like in the description and in the claims of the present application and in the above-described figures, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.

The knowledge graph referred to in the embodiments of the present application is essentially a semantic network that reveals the relationships between entities. It is generally composed of triples expressed in the form head node-relationship-tail node, and both nodes and edges can store attributes. There are generally two storage modes: 1. the Resource Description Framework (RDF); 2. a graph database. A triplet typically consists of three words extracted from the constituents of a sentence: a subject entity (subject), an object entity (object), and the relationship between the two entities (relation), which may be expressed as (subject, relation, object). For example, one triplet is (Kobe, occupation, basketball), and a large number of such triples constitute a specific knowledge graph. Correspondingly, in the embodiments of the present application, the three words in a triplet are called the head entity word (head), the relation word (relation) and the tail entity word (tail).
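As an illustration of the head-relation-tail form described above, the following minimal Python sketch stores triples and indexes them by head node. The names `Triple` and `build_graph` are hypothetical and chosen only for this example; the application does not prescribe a storage structure beyond RDF or a graph database.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One (head, relation, tail) fact; field names follow the text above."""
    head: str      # head entity word
    relation: str  # relation word
    tail: str      # tail entity word


def build_graph(triples):
    """Index triples as head -> list of (relation, tail) edges.

    This in-memory dictionary is only a stand-in (an assumption of this
    sketch) for a real RDF store or graph database.
    """
    graph = defaultdict(list)
    for triple in triples:
        graph[triple.head].append((triple.relation, triple.tail))
    return dict(graph)


# Illustrative triples taken from the examples in the description.
triples = [
    Triple("Kobe", "occupation", "basketball"),
    Triple("I", "send", "flower"),
]
graph = build_graph(triples)
print(graph["Kobe"])
```

A large collection of such triples, indexed in this way, is what the description refers to as a specific knowledge graph.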

Embodiments of the present application are described below with reference to the accompanying drawings in the embodiments of the present application.

Referring to fig. 1, fig. 1 is a flow chart of a method for generating and checking triples according to an embodiment of the present application. The method may include:

101. Acquiring the target text.

The execution body of the embodiments of the present application may be a triplet generation and verification device, which may be an electronic device. In a specific implementation, the electronic device may be a terminal (also referred to as a terminal device), including but not limited to portable devices such as a laptop computer or a tablet computer having a touch-sensitive surface (e.g., a touch screen display and/or a touch pad). It should also be appreciated that in some embodiments the device is not a portable communication device but a desktop computer having a touch-sensitive surface (e.g., a touch screen display and/or a touch pad).

The target text may be explanatory text for a certain keyword, for example text queried online from a website that uses the keyword as an entry, or it may be other text; the embodiments of the present application are not limited in this respect.

Optionally, the step 101 includes:

011. Acquiring a text to be processed;

012. Removing special characters from the text to be processed, and splitting the text to be processed into sentences to obtain the target text, wherein each sentence in the target text contains a subject.

Specifically, the text to be processed may first be obtained, in any manner, and then preprocessed. Optionally, a publicly available general-purpose crawler tool may be used to crawl the corresponding website and obtain the text corresponding to a keyword. For example, crawling the text with "beautician" as the entry yields:

"Beauticians are a professional title in the field of professional beauty care. Mainly work in beauty salons and places that can provide beauty services for customers. The job duty is to provide beauty services for customers, such as skin-care tasks like face washing, maintenance, massage, aromatherapy and weight loss."

The text to be processed may be subjected to data preprocessing. Special characters may be removed first; for example, "xxx[special character]2020 visits the United States" (where xxx is the name of a leader) becomes "xxx 2020 visits the United States" after preprocessing. In addition, punctuation marks such as ".", ";" and "!" can be used to split the text into sentences, and the subject of each sentence in the text to be processed can be supplemented to obtain the target text. Specifically, if no subject is detected in a sentence obtained after splitting, the subject of the previous sentence is taken as the subject of the current sentence. After this stage, the text has become a sequence of individual sentences for subsequent syntactic analysis. Other preprocessing steps can be added depending on the data set; the embodiments of the present application do not limit the preprocessing.

For example, after data preprocessing and punctuation-based sentence splitting, the text with "beautician" as the entry yields the following sentences:

"Beauticians are a professional title in the field of professional beauty care";

"Mainly work in beauty salons and places that can provide beauty services for customers";

"The job duty is to provide beauty services for customers, such as skin-care tasks like face washing, maintenance, massage, aromatherapy and weight loss".

The second sentence has no subject, so the subject "beauticians" of the first sentence can be supplied, and the second sentence becomes:

"Beauticians mainly work in beauty salons and places that can provide beauty services for customers".
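The preprocessing described in steps 011 and 012 can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions: the set of "special characters" and the sentence-ending punctuation are guesses made for this example, and `toy_find_subject` is a hypothetical stand-in for the subject detection that a real syntactic parser would provide.

```python
import re

# Assumption: sentences end at these Chinese or Western punctuation marks.
SPLIT_PATTERN = re.compile(r"[。；！？.;!?]+")
# Assumption: anything that is not a word character, whitespace, a comma,
# or sentence punctuation counts as a "special character".
SPECIAL_PATTERN = re.compile(r"[^\w\s。；！？.;!?，,]")


def toy_find_subject(sentence):
    """Hypothetical subject detector: a leading known noun phrase is the subject."""
    for noun in ("Beauticians", "The job duty"):
        if sentence.startswith(noun):
            return noun
    return None


def preprocess(text, find_subject=toy_find_subject):
    """Remove special characters, split into sentences, supplement missing subjects."""
    cleaned = SPECIAL_PATTERN.sub("", text)
    sentences = [s.strip() for s in SPLIT_PATTERN.split(cleaned) if s.strip()]

    result = []
    previous_subject = None
    for sentence in sentences:
        subject = find_subject(sentence)
        if subject is None and previous_subject is not None:
            # Reuse the subject of the previous sentence, as described above.
            sentence = f"{previous_subject} {sentence[0].lower()}{sentence[1:]}"
        else:
            previous_subject = subject
        result.append(sentence)
    return result


raw = ("Beauticians are a professional title#. Mainly work in beauty salons. "
       "The job duty is to provide beauty services!")
target_text = preprocess(raw)
for line in target_text:
    print(line)
```

Splitting on every full stop would also break decimal numbers and abbreviations, so a production pipeline would likely need a more careful sentence splitter; the sketch only shows the order of the three operations.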

102. And performing part-of-speech analysis and syntactic analysis processing on the target text to obtain an analysis result corresponding to the target text, wherein the analysis result comprises the part of speech of each word in the target text and a syntactic label corresponding to each word, and the syntactic label corresponding to each word comprises the dependency relationship between each word and the head entity word corresponding to each word.

The part of speech referred to in the embodiments of the present application is the property of a word that serves as the basis for dividing words into word classes. Word class is a linguistic term: it is the grammatical classification of the words of a language, based mainly on grammatical characteristics (including syntactic function and morphological change) while also taking lexical meaning into account. The words of modern Chinese can be divided into 13 word classes, which can include:

prep. preposition

pron. pronoun

n. noun

v. verb

conj. conjunction

s. subject

sc. subject complement (predicative)

o. object

oc. object complement

vi. intransitive verb

vt. transitive verb

aux.v. auxiliary verb

adj. adjective

adv. adverb

art. article

num. numeral

The syntactic analysis (parsing) referred to in the embodiments of the present application means analyzing the grammatical function of the words in a sentence. For example, in "I came late", "I" is the subject, "came" is the predicate, and "late" is the complement.

In one embodiment, part-of-speech analysis and syntactic analysis are performed on a target text to obtain an analysis result corresponding to the target text, including:

word segmentation processing is carried out on each sentence in the target text, and a plurality of words in the target text are obtained;

performing part-of-speech analysis on the plurality of words to determine the part-of-speech of each word;

and carrying out the syntactic analysis processing on the target text according to the part of speech of each word to obtain the syntactic label corresponding to each word.

The part of speech analysis and the syntactic analysis can obtain a part of speech analysis result and a syntactic analysis result, namely the part of speech of each word in the target text and the syntactic label corresponding to each word. Wherein, the syntax label comprises the dependency relationship between each word and the head entity word corresponding to the word.

The embodiments of the present application draw on dependency theory, where "dependency" refers to a relationship of governing and being governed between two words; this relationship is directional. The governing word is called the governor, i.e., the head entity word (head), and the governed word is called the dependent (dependency).

Dependencies can be subdivided into different types that express the specific relationship between two words. For example, in the sentence "I sent her a bunch of flowers": (I <- sent) is a subject-verb relationship (SBV), and (sent -> flowers) is a verb-object relationship (VOB); in "red apple", (red <- apple) is an attributive relationship (ATT); and so on.

The syntax tag obtained through the syntactic analysis processing in the application can comprise a position index of the current word, a head entity word index of the current word, and a dependency relationship between the current word and the head entity word.

For example, for the sentence "xxx 2020 visits the United States" (segmented into the words "xxx", "2020", "visit" and "United States"), the analysis result obtained by part-of-speech analysis and syntactic analysis includes the part-of-speech analysis result ['nh', 'nt', 'v', 'ns'] and the syntactic analysis result (syntactic labels) [(1, 3, 'SBV'), (2, 3, 'ADV'), (3, 0, 'HED'), (4, 3, 'VOB')].

Here, "nh" indicates that "xxx" is a person name, "nt" indicates that "2020" is a time noun, "v" indicates that "visit" is a verb, and "ns" indicates that "United States" is a place name. In (1, 3, 'SBV'), 1 is the position index of the current word "xxx", indicating that "xxx" is the first word of the text; 3 is the index of the head entity word "visit" of the current word, indicating that "visit" is the third word of the text; and 'SBV' denotes the subject-verb relationship, i.e., "xxx" and "visit" stand in a subject-verb relationship; and so on.

In the embodiments of the present application, when a word has no corresponding head entity word, for example because the word itself is not an entity word, the corresponding parsing result contains no head entity word index for it (the index may be recorded as 0).
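The structure of the syntactic labels can be made concrete with a short Python sketch that resolves each (position index, head index, relation) tuple against the word list. The function name `resolve_labels` is hypothetical; the data is the example given above, with indices taken to be 1-based and a head index of 0 taken to mean "no head entity word".

```python
def resolve_labels(words, pos_tags, syntax_tags):
    """Turn (index, head_index, relation) tuples into readable records.

    Indices are 1-based; a head index of 0 means the word has no head.
    """
    rows = []
    for index, head_index, relation in syntax_tags:
        rows.append({
            "word": words[index - 1],
            "pos": pos_tags[index - 1],
            "head": words[head_index - 1] if head_index > 0 else None,
            "relation": relation,
        })
    return rows


words = ["xxx", "2020", "visit", "United States"]
pos_tags = ["nh", "nt", "v", "ns"]
syntax_tags = [(1, 3, "SBV"), (2, 3, "ADV"), (3, 0, "HED"), (4, 3, "VOB")]

rows = resolve_labels(words, pos_tags, syntax_tags)
for row in rows:
    print(row)
```

Each printed record pairs a word with its part of speech, its head word and the dependency type, which is the information the matching step relies on.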

In the embodiments of the present application, syntactic analysis, word segmentation and the like can be performed with LTP, the open-source natural language processing toolkit from Harbin Institute of Technology. Other open-source tools such as NLTK and fastNLP can also be used, or a dedicated model can be trained as required to perform the syntactic analysis and part-of-speech analysis; the embodiments of the present application are not limited in this respect.

103. Matching a triplet corresponding to the target text according to a preset matching rule and the analysis result, wherein the triplet comprises a head entity word, a tail entity word and a relation word for identifying the syntactic relation between the head entity word and the tail entity word.

The preset matching rule specifies how different triples should be generated according to different inter-term dependencies, and after the merging and updating are completed, the triples corresponding to the target text can be matched according to the preset matching rule and the syntactic analysis result.

In an alternative embodiment, the preset matching rule includes a preset dependency relationship mode, and a triplet expression corresponding to the preset dependency relationship mode;

and matching the triples corresponding to the target text according to a preset matching rule and the syntactic analysis result, wherein the matching comprises the following steps:

031. determining a group of words meeting the dependency relationship and the part of speech specified by the preset relation mode in the words according to the part of speech of each word and the dependency relationship corresponding to each word in the analysis result;

032. constructing the group of words into a corresponding triplet according to the triplet expression corresponding to the preset dependency relationship mode.

Specifically, multiple dependency relationship modes can be preset as required, and a triplet expression corresponding to each dependency relationship mode is predefined. The dependency relationship mode corresponding to a group of words is matched according to the part of speech of each word and the dependency relationships among the words in the analysis result, and the group of words is then substituted into the triplet expression corresponding to that mode to obtain the final triplet. Each mode may include at least two sets of dependencies, each of which indicates a dependency between two words (one entity word and its corresponding head entity word); these are not described in detail herein. In other words, relation triplet expressions can be preset according to the dependency relationships among words and the part of speech of each word; when the parts of speech and the dependency relationships required by a mode are satisfied, the corresponding triplet expression is used and the words are substituted into it to obtain a specific triplet.

FIG. 2 is a schematic diagram of triplet pattern matching provided in an embodiment of the present application. As shown in FIG. 2, seven matching modes, DSNF1 to DSNF7, are given, together with the logic expression and the graphic expression corresponding to each mode and the corresponding relation triplet; the part of speech of each word is marked in the boxes of the graphic expression.

The arrowed lines mark the dependency between two words, and "-" represents a combination of two words. "{1,2}+" denotes a word that appears once or twice, and "[ ]?" denotes a word that appears once or not at all. In the embodiments of the present application, triples can thus be matched through the dependency relationships according to the seven modes in FIG. 2.

For example, in the DSNF1 mode, the part of speech of E1 is n (n denotes a noun), the part of speech of the center word of E1 is n, and the relationship between E1 and the center word is an ATT relationship, i.e., an attributive relationship; the part of speech of E2 is n, and the relationship between the center word of E1 and E2 is also an ATT relationship. When the words in a sentence satisfy these relationships, the center word may be denoted attword, and a triplet (E1, attword, E2) may be extracted, where E1 is the head entity word, attword is the relation word, and E2 is the tail entity word.

As another example, the DSNF2 mode is a subject-verb-object mode. Specifically, E1 and E2 are nouns, the center word Pred is a verb, E1 and Pred stand in a subject-verb relationship, and E2 and Pred stand in a verb-object relationship, so that together they form a subject-verb-object structure. If these relationships are satisfied, the subject, predicate and object of the sentence can be extracted as a triplet (E1, Pred, E2), where E1 is the head entity word, Pred is the relation word, and E2 is the tail entity word.

Specifically, after the analysis result is obtained, the dependency relationships in the syntactic analysis result can be matched against the dependency relationships specified in each preset mode to determine whether a dependency relationship conforming to a mode exists. If so, the conforming dependency relationship is taken as the current dependency relationship, and the part of speech of each word in the current dependency relationship is then matched, using the parts of speech in the analysis result, against the part of speech of each word specified in the dependency relationship mode.

For example, for the original sentence "I send her a bunch of flowers", the results obtained by syntactic analysis include (1, 2, SBV) and (4, 2, VOB), and the part-of-speech analysis results include "I": 'n', "send": 'v', "flower": 'n'. This is only an illustration, so part of the results is omitted and only the matching and generation of one triplet is described. Here (1, 2, SBV) indicates (I <- send), a subject-verb relationship (SBV), and (4, 2, VOB) indicates (send -> flower), a verb-object relationship (VOB). Comparing these with the preset dependency relationship modes, the mode that is satisfied is DSNF2; the three words E1, Pred and E2 in this mode are a noun n, a verb v and a noun n respectively, which is consistent with the part-of-speech analysis results, so the triplet (E1, Pred, E2), namely (I, send, flower), can be extracted.
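The DSNF2 matching walked through above can be sketched in Python as follows. This is only an illustrative reading of the mode (dependencies checked first, parts of speech second); the function name `match_dsnf2` and the dictionary-based inputs keyed by 1-based word index are assumptions of this sketch, and only the partial analysis results quoted in the example are used.

```python
def match_dsnf2(words, pos_tags, syntax_tags, noun_tags=("n",), verb_tags=("v",)):
    """Extract (E1, Pred, E2) where E1 -SBV-> Pred and E2 -VOB-> Pred.

    words and pos_tags map a 1-based word index to the word / its tag;
    syntax_tags is a list of (index, head_index, relation) tuples.
    """
    subjects, objects = {}, {}
    # Step 1: collect dependencies that fit the DSNF2 dependency pattern.
    for index, head_index, relation in syntax_tags:
        if relation == "SBV":
            subjects.setdefault(head_index, []).append(index)
        elif relation == "VOB":
            objects.setdefault(head_index, []).append(index)

    triples = []
    for pred, subject_indices in subjects.items():
        # Step 2: check the parts of speech required by the mode.
        if pos_tags.get(pred) not in verb_tags:
            continue
        for e1 in subject_indices:
            for e2 in objects.get(pred, []):
                if pos_tags.get(e1) in noun_tags and pos_tags.get(e2) in noun_tags:
                    triples.append((words[e1], words[pred], words[e2]))
    return triples


# Partial analysis results from the example; omitted entries stay omitted.
words = {1: "I", 2: "send", 4: "flower"}
pos_tags = {1: "n", 2: "v", 4: "n"}
syntax_tags = [(1, 2, "SBV"), (4, 2, "VOB")]

matched = match_dsnf2(words, pos_tags, syntax_tags)
print(matched)
```

The other modes in FIG. 2 could be handled in the same style, each with its own dependency pattern, part-of-speech constraints and triplet expression.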

104. The method comprises the steps of obtaining input data, wherein the input data comprises vectors of original sentences corresponding to triples, vectors of triples, position vectors corresponding to triples and part-of-speech vectors corresponding to triples, the original sentences are sentences of the triples extracted from target texts, the vectors of the original sentences are used for indicating information of the original sentences in a feature space, the vectors of the triples are used for indicating information of the target triples in the feature space, the position vectors corresponding to triples are used for indicating information of positions of head entity words, tail entity words and relation words in the feature space, and the part-of-speech vectors corresponding to the target triples are used for indicating information of the head entity words, the tail entity words and the part-of-speech of the relation words in the feature space.

Most current BERT models are trained to determine the relationships between entities. In the embodiments of the present application, a trained BERT model can be used to determine, based on the entities and the relation of an input triplet, whether the triplet is trustworthy. Specifically, the preset classification model may be obtained by training on labeled sample data comprising a plurality of triplet samples, each labeled with a confidence identifier that indicates whether the triplet sample is an incorrect triplet or a correct triplet.

The input data of the classification model in the embodiment of the application comprises a vector of an original sentence corresponding to the target triplet, a vector of the target triplet, a position vector corresponding to the target triplet and a part-of-speech vector corresponding to the target triplet. The four-part vector acquisition is described below for a current triplet.

Specifically, the original sentence is the sentence from which the current triplet is extracted, and the weights of the trained BERT model can be used to initialize the vectors so as to obtain the vector of the original sentence corresponding to the current triplet. The vector of the original sentence represents the information of the sentence in vector form; for example, if each word is a 128-dimensional vector and a sentence has 10 words, the vector of the sentence is a 10 × 128-dimensional vector.

Optionally, for a target triplet, a vector of the head entity word, a vector of the tail entity word and a vector of the relation word may be obtained, and then the vector of the head entity word, the vector of the tail entity word and the vector of the relation word are added to obtain the vector of the target triplet.

Specifically, the vector of the target triplet is the sum of the vectors of the three words of the current triplet. The weights of the trained BERT model are used to initialize the vectors, so that the vector of each word in the current triplet can be obtained, and the vectors of the words are then added to obtain the vector of the current triplet. The vector of a word referred to in the embodiment of the present application is a two-dimensional vector, and the summation referred to in the above step means superposition (concatenation) of data in the second dimension of the vector.

For example, fig. 3 is a schematic flow chart of a method for calculating the vector of a triplet, where C refers to a concatenate function. For one-dimensional data a: [1,2,3] and b: [4,5,6], a concatenate b = [1,2,3,4,5,6]. For two-dimensional vectors, the data superposition is performed in the second dimension, that is, the one-dimensional processing is applied along the second dimension; for example, if the vector dimension of the head entity word is 1×250 and the vector dimension of the tail entity word is 1×250, the vector dimension after concatenation is 1×500.
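The concatenation described for fig. 3 can be illustrated with a short Python sketch. The helper name `concat_second_dim` and the placeholder values are assumptions introduced for illustration only; the embodiment does not prescribe a particular implementation.

```python
def concat_second_dim(*vectors):
    """Concatenate 2-D vectors (lists of rows) along the second dimension."""
    rows = len(vectors[0])
    if any(len(v) != rows for v in vectors):
        raise ValueError("all vectors must have the same number of rows")
    # Join row r of every input into one longer row.
    return [[x for v in vectors for x in v[r]] for r in range(rows)]


# One-dimensional case from the text, written as a single row.
a = [[1, 2, 3]]
b = [[4, 5, 6]]
print(concat_second_dim(a, b))  # [[1, 2, 3, 4, 5, 6]]

# Two 1 x 250 word vectors give a 1 x 500 vector (values are placeholders).
head_vec = [[0.0] * 250]
tail_vec = [[1.0] * 250]
joined = concat_second_dim(head_vec, tail_vec)
print(len(joined), len(joined[0]))  # 1 500
```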

In one embodiment, the calculation of the position vector of the triplet may comprise the steps of:

41. acquiring position information of the head entity word, the tail entity word and the relation word in the original sentence;

42. calculating a plurality of pieces of relative position information according to the position information, wherein the plurality of pieces of relative position information comprise pairwise relative position information among the head entity word, the tail entity word and the relation word, and relative position information between the first and last characters of each of the head entity word, the tail entity word and the relation word;

43. coding the plurality of relative position information to obtain a plurality of position codes;

44. And performing linear transformation on the sum of the plurality of position codes to obtain a position vector corresponding to the target triplet.

Position coding in the embodiments of the present application refers to encoding, in the feature space, the position of a character or word in a sentence. It is similar to a word vector, except that a word vector encodes a word in the feature space, whereas a position code encodes a position in the feature space.

The three words in the triplet are a head entity word, a relation word and a tail entity word, and the position information of each word may include a start index and an end index of the word, where the start index of a word indicates the position of the first character of the word in the original sentence, and the end index indicates the position of the last character of the word in the original sentence; that is, the position of a word in the sentence can be determined from its start index and end index.

In the embodiment of the application, head refers to the head entity word, rel refers to the relation word and tail refers to the tail entity word; head[i] refers to the start index of the head entity word, head[j] to its end index, rel[i] to the start index of the relation word, rel[j] to its end index, tail[i] to the start index of the tail entity word, and tail[j] to its end index. For example, in a sentence that begins with the head entity word "Xiaoming", the start index is 0 (the first character of the sentence is counted as 0) and the end index is 2, which indicates the position of "Xiaoming" in the sentence.

Thus, for a triplet, the available position information includes: the start index and the end index of the head entity word, the start index and the end index of the tail entity word, and the start index and the end index of the relation word. From these, the relative position information among the head entity word, the tail entity word and the relation word in a triplet, and the relative position information between the first and last characters of each of these words, need to be calculated.

Specifically, the step 42 may include:

calculating the difference between each start index and each end index according to the start index and the end index of the head entity word, the start index and the end index of the tail entity word and the start index and the end index of the relation word, to obtain the plurality of pieces of relative position information.
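A minimal Python sketch of this calculation follows. It computes the eight start/end differences enumerated in this embodiment for one triplet. The sign convention (start index minus end index), the function name `relative_positions`, the dictionary keys and the example spans are assumptions made for illustration.

```python
from typing import Dict, Tuple

# (start index, end index) of a word in the original sentence
Span = Tuple[int, int]


def relative_positions(head: Span, rel: Span, tail: Span) -> Dict[str, int]:
    """Return the eight start/end differences for one triplet.

    Assumption: each difference is taken as start index minus end index.
    """
    h_i, h_j = head
    r_i, r_j = rel
    t_i, t_j = tail
    return {
        "head_start-head_end": h_i - h_j,
        "head_start-tail_end": h_i - t_j,
        "tail_start-head_end": t_i - h_j,
        "tail_start-tail_end": t_i - t_j,
        "head_start-rel_end": h_i - r_j,
        "rel_start-rel_end": r_i - r_j,
        "rel_start-tail_end": r_i - t_j,
        "tail_start-rel_end": t_i - r_j,
    }


# Hypothetical spans: head at 0..2, relation word at 3..4, tail at 5..8.
d = relative_positions(head=(0, 2), rel=(3, 4), tail=(5, 8))
for name, value in d.items():
    print(name, value)
```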

In the embodiment of the application, the plurality of pieces of relative position information can be obtained by calculating the differences between the start indexes and the end indexes. For the three words in the triplet, each of which has a start index and an end index, eight pieces of relative position information may be calculated, including:

the difference between the start index and the end index of head;

the difference between the start index of head and the end index of tail;

the difference between the start index of tail and the end index of head;

the difference between the start index and the end index of tail;

the difference between the start index of head and the end index of rel;

the difference between the start index and the end index of rel;

the difference between the start index of rel and the end index of tail;

the difference between the start index of tail and the end index of rel.

Specifically, these differences may be calculated according to the following formula:

Further, the position code corresponding to each relative position d may be calculated. Functions are evaluated separately on the even dimensions and the odd dimensions and the results are combined, specifically by the following formula:

where 2i denotes the even dimensions and 2i+1 denotes the odd dimensions of the code for position d. The two results are combined to give the final position code of d, so that the position code PE corresponding to each piece of relative position information d can be obtained.
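The even/odd scheme described here matches the sinusoidal position encoding of the Transformer. The sketch below assumes that standard form, with sine on dimension 2i, cosine on dimension 2i+1 and a base of 10000; the function name `position_code` and the code dimension are likewise assumptions, not values taken from this application.

```python
import math


def position_code(d: int, dim: int = 8) -> list:
    """Sinusoidal code for a relative position d (assumed standard form).

    Even dimension 2i:   sin(d / 10000 ** (2i / dim))
    Odd dimension 2i+1:  cos(d / 10000 ** (2i / dim))
    """
    code = [0.0] * dim
    for k in range(0, dim, 2):  # k plays the role of 2i
        angle = d / (10000 ** (k / dim))
        code[k] = math.sin(angle)
        if k + 1 < dim:
            code[k + 1] = math.cos(angle)
    return code


# Relative positions may be negative; the code is still well defined.
for rel_pos in (-2, 0, 3):
    print(rel_pos, [round(v, 4) for v in position_code(rel_pos, dim=4)])
```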

Further, the sum of the above position codes may be taken and linearly transformed to give the final position vector. In particular, the calculation may be based on an activation function and an initialization matrix of the model. In an alternative embodiment, the activation function may be a ReLU activation function, and the final position vector of the target triplet may be calculated by the following formula:

Where Wr denotes a randomly initialized matrix.
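As a rough illustration, the sketch below assumes that the position vector takes the form ReLU(sum of position codes multiplied by Wr). This form is inferred from the surrounding description and is an assumption, as are the function names, the small dimensions and the random initialization range.

```python
import random


def relu(x: float) -> float:
    return max(0.0, x)


def position_vector(codes, w_r):
    """Sum the position codes, apply the matrix w_r, then ReLU.

    codes: list of position codes, each a list of length dim
    w_r:   dim x out_dim matrix (list of rows), randomly initialized
    """
    dim = len(codes[0])
    summed = [sum(code[k] for code in codes) for k in range(dim)]
    out_dim = len(w_r[0])
    return [
        relu(sum(summed[k] * w_r[k][j] for k in range(dim)))
        for j in range(out_dim)
    ]


# Hypothetical 4-dimensional codes for two relative positions.
rng = random.Random(0)
w_r = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(4)]
codes = [[0.0, 1.0, 0.0, 1.0], [0.5, -0.5, 0.25, 0.75]]
print(position_vector(codes, w_r))
```

In a trained model, Wr would be updated together with the other weights rather than left at its initial values.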

Optionally, the acquiring the input data further includes:

acquiring the part of speech of each word in the target triplet;

and acquiring the part-of-speech vector corresponding to the part-of-speech of each word according to a preset mapping relation between the part-of-speech and the part-of-speech vector, and taking the part-of-speech vector as the part-of-speech vector corresponding to the target triplet.

The classification model in the embodiment of the application can initialize a multidimensional vector matrix according to the preset part-of-speech class as a vector candidate set of the part-of-speech, namely, the classification model comprises the mapping relation between the preset part-of-speech and the part-of-speech vector. For example, a total of 10 parts of speech categories, a 10 x 128 dimensional vector may be initialized to represent the 10 parts of speech, each part of speech being a 128 dimensional vector. When inputting a part of speech, a vector representing the corresponding part of speech can be found in the above matrix as a part of speech vector according to the inputted part of speech.
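The lookup described above can be sketched in Python as follows. The tag set in `POS_TAGS`, the initialization distribution and the function names are hypothetical; only the 10 x 128 shape comes from the example in the text.

```python
import random

# Hypothetical set of 10 part-of-speech categories.
POS_TAGS = ["n", "v", "a", "d", "p", "c", "u", "m", "q", "r"]


def init_pos_table(tags, dim=128, seed=0):
    """Initialize one dim-dimensional vector per part-of-speech category."""
    rng = random.Random(seed)
    return {tag: [rng.gauss(0.0, 0.02) for _ in range(dim)] for tag in tags}


def pos_vectors(table, triplet_pos):
    """Look up the vector for the part of speech of each word in a triplet."""
    return [table[pos] for pos in triplet_pos]


table = init_pos_table(POS_TAGS)           # 10 x 128 candidate set
vectors = pos_vectors(table, ["n", "v", "n"])  # head, relation word, tail
print(len(table), len(vectors), len(vectors[0]))  # 10 3 128
```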

Through the steps, the characteristics of each aspect of the triples, including the characteristics of the original sentences to which the triples belong, the characteristics of the triples, and the position characteristics and the part-of-speech characteristics of the head entity words, the tail entity words and the relation words in the triples, can be fully considered, so that the triples can be more comprehensively analyzed to identify the correctness of the triples, and the accuracy of the triples can be further improved through verification.

105. Inputting the input data into a preset classification model, and processing the input data through the preset classification model to verify the triples.

In the embodiment of the application, the triples can be checked through a preset classification model. In particular, the classification model may use a BERT model, which may perform text classification. The trained classification model can determine whether each triplet is correct or incorrect.

Referring to fig. 4, fig. 4 is a schematic diagram of a verification structure according to an embodiment of the present application. As shown in fig. 4, during training, the vector of the triplet sample, the vector of the original sentence corresponding to the triplet sample, the position vector corresponding to the triplet sample and the part-of-speech vector of the triplet sample may first be input into the classification model. These vectors may be initialized randomly and their weights then updated through model training; alternatively, vector data of triplets that have already been trained may be used, which is not limited here. The vectors are then added (using the concatenate function; see the description above relating to fig. 3) and input to the encoder of the model for encoding; the output of the last layer is then linearly transformed by a linear layer and processed by an activation function layer.

The encoder may adopt the encoder structure of the Transformer, a seq2seq model proposed by Google Brain. The processing flow of the encoder mainly comprises: the input passes through an attention mechanism (Multi-Head Attention), then a residual connection, then layer normalization (Layer Normalization, whose function is to normalize the hidden layers of the neural network to a standard normal distribution and accelerate convergence), and finally a feedforward network (Feed Forward) that performs linear mapping followed by an activation function; the encoding task is completed by repeating this N times.
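The residual connection and layer normalization steps can be illustrated on a single hidden vector. This is a simplified sketch: the learnable gain and bias of a real layer normalization are omitted, the attention sub-layer is replaced by a placeholder output, and the names `layer_norm` and `add_and_norm` are assumptions.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])


hidden = [1.0, 2.0, 3.0, 4.0]
attention_out = [0.5, 0.5, 0.5, 0.5]  # placeholder for the attention output
normed = add_and_norm(hidden, attention_out)
print([round(v, 4) for v in normed])
```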

Specifically, as shown in fig. 4, FFN+sigmoid means that a fully connected layer followed by a sigmoid activation function performs binary classification. The sigmoid activation function maps a number x to a value between 0 and 1, which can be interpreted as a probability; the probability finally obtained is the output of the sigmoid. The probability can then be converted into a label, which in the present application is the classification result of whether the triplet is trusted, so that the trained classification model can make predictions for triplets.

Through the attention mechanism, each word in a sentence contains information of all other words in the sentence, so that the data correlation is increased, and the feature expression capability of the model is improved by using part-of-speech analysis and position coding, so that the accuracy of the model can be improved.

In practical application, the four vectors related to the sentence and the triplet can be obtained through the above steps, and their sum is input into the model in the same way as in the training process. After encoding by the encoder of the model, the value is converted into a probability by the sigmoid function, and the credibility identifier corresponding to the triplet is obtained: if the result is 1, the triplet is trusted; if the result is 0, the triplet is not trusted and needs to be filtered out.

For example, if the vector output by the neural network is [1,3] and the values after conversion into probabilities are [0.25,0.75], where the probability of 1 (correct) is 0.25 and the probability of 0 (wrong) is 0.75, then the final classification prediction is 0 and the triplet is an erroneous triplet. According to the credibility identifier of each triplet, it can be judged whether each triplet is an erroneous triplet or a correct triplet, which completes the verification of the triplets.
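The conversion from a model output to a credibility identifier can be sketched as follows. The sketch assumes a single output value per triplet and a decision threshold of 0.5; both are illustrative assumptions, as are the function names.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable sigmoid, mapping any real number into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def credibility_label(logit: float, threshold: float = 0.5) -> int:
    """Return 1 (trusted) if the probability reaches the threshold, else 0."""
    return 1 if sigmoid(logit) >= threshold else 0


for logit in (2.0, 0.0, -1.0):
    print(logit, round(sigmoid(logit), 4), credibility_label(logit))
```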

The classification model employed in the embodiments of the present application is BERT. Alternatively, machine learning classification models such as support vector machines (Support Vector Machine, SVM) and logistic regression, or other deep learning models such as convolutional neural networks (Convolutional Neural Networks, CNN) and ALBERT, may be used; the embodiments of the present application are not limited in this respect.

Optionally, after the step 105, the method further includes:

filtering out the triplets that the classification model considers to be erroneous, taking the remaining triplets as the final triplet result, and constructing a knowledge graph from the triplets that remain after the erroneous triplets are deleted.

The extracted triplets may contain errors, and in this step the classification model can be used to delete obviously erroneous triplets to improve accuracy. The entities in the remaining triplets are then taken as nodes and the space-time relationships among different entities as connecting lines, where one entity may have one or more connecting lines to other entities, so that an entity relationship network presented as a graph structure, namely a knowledge graph, can be constructed.
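The filtering and graph construction can be sketched with a plain adjacency mapping. The function name `build_knowledge_graph`, the example triplets and their labels are hypothetical; a real system would typically write the result to a graph database instead.

```python
def build_knowledge_graph(triplets, labels):
    """Keep triplets labeled 1 and index them as node -> [(relation, node)]."""
    graph = {}
    for (head, rel, tail), label in zip(triplets, labels):
        if label != 1:  # drop triplets the classifier marked as erroneous
            continue
        graph.setdefault(head, []).append((rel, tail))
        graph.setdefault(tail, [])  # make sure the tail is also a node
    return graph


# Hypothetical triplets with credibility identifiers from the classifier.
triplets = [
    ("Xiaoming", "studies_at", "School A"),
    ("Xiaoming", "is", "School A"),
    ("School A", "located_in", "Shenzhen"),
]
labels = [1, 0, 1]
graph = build_knowledge_graph(triplets, labels)
for node, edges in graph.items():
    print(node, edges)
```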

Optionally, after the above steps, the knowledge graph may be stored in a database; for example, the graph may be stored using the graph database ArangoDB, and the stored graph may be displayed. In the embodiment of the application, the triplets can be verified by the classification model, so that the accuracy of triplet extraction is improved and a more accurate and reliable graph can be built.

In the embodiment of the application, part-of-speech analysis and syntactic analysis are performed on a target text to obtain an analysis result corresponding to the target text, wherein the analysis result comprises the part of speech of each word in the target text and a syntactic label corresponding to each word, and the syntactic label corresponding to each word comprises the dependency relationship between the word and its corresponding head entity word; a triplet corresponding to the target text is matched according to a preset matching rule and the analysis result, wherein the triplet comprises a head entity word, a tail entity word and a relation word identifying the syntactic relation between the head entity word and the tail entity word; input data is acquired, wherein the input data comprises a vector of the original sentence corresponding to the triplet, a vector of the triplet, a position vector corresponding to the triplet and a part-of-speech vector corresponding to the triplet, the original sentence is the sentence in the target text from which the triplet is extracted, the vector of the original sentence indicates information of the original sentence in a feature space, the vector of the triplet indicates information of the target triplet in the feature space, the position vector corresponding to the triplet indicates information, in the feature space, of the positions of the head entity word, the tail entity word and the relation word, and the part-of-speech vector corresponding to the target triplet indicates information, in the feature space, of the parts of speech of the head entity word, the tail entity word and the relation word; and the input data is input into a preset classification model and processed by the preset classification model to verify the triplet.

According to the embodiment of the application, triplets are generated by an unsupervised method of syntactic analysis, part-of-speech analysis and pattern recognition, so that manual labeling is not needed and labeling cost is reduced. Verification is then performed according to the triplet and information such as its syntax and parts of speech, taking into account the characteristics of the original sentence to which the triplet belongs, the characteristics of the triplet itself, and the position characteristics and part-of-speech characteristics of the head entity word, the tail entity word and the relation word in the triplet. The triplet can thus be analyzed and verified more comprehensively, the accuracy of the triplets is improved, and the method can be used to build a more accurate knowledge graph.

Based on the description of the triplet generation and verification method embodiment, the embodiment of the application also discloses a triplet generation and verification device. Referring to fig. 5, the triplet generating and verifying apparatus 500 includes:

an obtaining module 510, configured to obtain a target text;

the analysis module 520 is configured to perform part-of-speech analysis and syntax analysis on the target text to obtain an analysis result corresponding to the target text, where the analysis result includes a part-of-speech of each word in the target text and a syntax tag corresponding to each word, and the syntax tag corresponding to each word includes a dependency relationship between each word and a head entity word corresponding to each word;

the matching module 530 is configured to match, according to a preset matching rule and the analysis result, a triplet corresponding to the target text, where the triplet includes a head entity word, a tail entity word, and a relation word identifying a syntactic relationship between the head entity word and the tail entity word;

a verification module 540, configured to obtain input data, where the input data includes a vector of an original sentence corresponding to the triplet, a vector of the triplet, a position vector corresponding to the triplet, and a part-of-speech vector corresponding to the triplet, where the original sentence is a sentence in which the triplet is extracted from the target text, the vector of the original sentence is used to indicate information of the original sentence in a feature space, the vector of the triplet is used to indicate information of the target triplet in a feature space, the position vector corresponding to the triplet is used to indicate information of positions of the head entity word, the tail entity word, and the relation word in a feature space, and the part-of-speech vector corresponding to the target triplet is used to indicate information of the head entity word, the tail entity word, and the part-of-speech of the relation word in a feature space;

The verification module 540 is further configured to perform verification processing on the triplet corresponding to the target text, and obtain a verification result of the triplet corresponding to the target text.

According to an embodiment of the present application, each step involved in the method shown in fig. 1 may be performed by each module in the triplet generating and checking device 500 shown in fig. 5, which is not described herein.

Based on the description of the method embodiment and the device embodiment, the embodiment of the application also provides electronic equipment. Referring to fig. 6, the electronic device 600 at least includes a processor 601, a memory 602, and an input/output unit 603. The processor 601 may be a central processing unit (central processing unit, CPU), and serves as an arithmetic and control core of the computer system, and is a final execution unit for information processing and program execution.

A computer storage medium may be stored in the memory 602 of the electronic device 600, where the computer storage medium is used to store a computer program, where the computer program includes program instructions, and where the processor 601 may execute the program instructions stored in the memory 602. The preset classification model and the like in the embodiment of the present application may also be stored in the above-described memory 602.

In one embodiment, the electronic device 600 described in the embodiments of the present application may be used to perform a series of processes, including the method in any embodiment shown in fig. 1, and so on, which are not described herein.

The embodiment of the application also provides a computer storage medium (Memory), which is a Memory device in the electronic device and is used for storing programs and data. It is understood that the computer storage media herein may include both built-in storage media in the electronic device and extended storage media supported by the electronic device. The computer storage medium provides a storage space that stores an operating system of the electronic device. Also stored in the memory space are one or more instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor. The computer storage medium herein may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory; optionally, at least one computer storage medium remote from the processor may be present.

In one embodiment, one or more instructions stored in a computer storage medium may be loaded and executed by a processor to implement the corresponding steps in the above embodiments; in particular, one or more instructions in the computer storage medium may be loaded by the processor and perform any steps of the method of fig. 1, which are not described herein.

It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus and modules described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the division of the module is merely a logical function division, and there may be another division manner when actually implemented, for example, a plurality of modules or components may be combined or may be integrated into another system, or some features may be omitted or not performed. The coupling or direct coupling or communication connection shown or discussed with each other may be through some interface, device or module indirect coupling or communication connection, which may be in electrical, mechanical, or other form.

The modules illustrated as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted via a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a read-only memory (ROM), a random-access memory (RAM), a magnetic medium such as a floppy disk, a hard disk, a magnetic tape or a magnetic disk, an optical medium such as a digital versatile disc (DVD), or a semiconductor medium such as a solid state disk (SSD).

Claims

1. A method of triplet generation and verification, comprising:

acquiring a target text;

Inputting the input data into a preset classification model, and processing the input data through the preset classification model to verify the triples;

wherein the obtaining input data includes: obtaining the vector of the head entity word, the vector of the tail entity word and the vector of the relation word; adding the vector of the head entity word, the vector of the tail entity word and the vector of the relation word to obtain the vector of the target triplet; acquiring position information of the head entity word, the tail entity word and the relation word in the original sentence; calculating a plurality of pieces of relative position information according to the position information, wherein the plurality of pieces of relative position information comprise pairwise relative position information among the head entity word, the tail entity word and the relation word, and relative position information between the first and last characters of each of the head entity word, the tail entity word and the relation word; encoding the plurality of pieces of relative position information to obtain a plurality of position codes; performing linear transformation on the sum of the plurality of position codes to obtain a position vector corresponding to the target triplet;

The acquiring input data further includes: acquiring the part of speech of each word in the target triplet;

And acquiring the part-of-speech vector corresponding to the part-of-speech of each word according to the preset mapping relation between the part-of-speech and the part-of-speech vector, and taking the part-of-speech vector as the part-of-speech vector corresponding to the target triplet.

2. The triplet generation and verification method of claim 1, wherein the position information comprises: the start index and the end index of the head entity word, the start index and the end index of the tail entity word, and the start index and the end index of the relation word, wherein the start index of a target word indicates the position of the first character of the target word in the original sentence, the end index of the target word indicates the position of the last character of the target word in the original sentence, and the target word is the head entity word, the tail entity word or the relation word;

the calculating a plurality of relative position information according to the position information includes:

and calculating the difference between each start index and each end index according to the start index and the end index of the head entity word, the start index and the end index of the tail entity word and the start index and the end index of the relation word, and obtaining the plurality of relative position information.

3. The triplet generation and verification method according to claim 1, wherein the preset matching rule includes a preset relationship pattern, which designates a dependency relationship between every two words and a part of speech of each word, and a triplet expression corresponding to the preset relationship pattern;

determining a group of words meeting the dependency relationship and the part of speech specified by the preset relation mode in the words according to the part of speech of each word and the dependency relationship corresponding to each word in the analysis result;

and constructing the group of words into corresponding triples according to the triples expression corresponding to the preset relation mode.

4. A triplet generation and verification method according to any one of claims 1-3, wherein said obtaining target text comprises:

acquiring a text to be processed;

and removing special characters in the text to be processed, and carrying out sentence segmentation on the text to be processed to obtain a target text, wherein each sentence in the target text contains a subject.

5. The triplet generation and verification method according to claim 1, wherein after the input data is input into the preset classification model and processed through the preset classification model to verify the triplet, the method further comprises:

after the erroneous triplet is deleted, a knowledge graph is constructed from the triples after the erroneous triplet is deleted.

6. A triplet generation and verification apparatus, comprising:

the acquisition module is used for acquiring the target text;

the verification module is further used for performing verification processing on the triples corresponding to the target text to obtain verification results of the triples corresponding to the target text;

the acquiring input data includes: obtaining the vector of the head entity word, the vector of the tail entity word and the vector of the relation word; adding the vector of the head entity word, the vector of the tail entity word and the vector of the relation word to obtain the vector of the target triplet; acquiring position information of the head entity word, the tail entity word and the relation word in the original sentence; calculating a plurality of pieces of relative position information according to the position information, wherein the plurality of pieces of relative position information comprise pairwise relative position information among the head entity word, the tail entity word and the relation word, and relative position information between the first and last characters of each of the head entity word, the tail entity word and the relation word; encoding the plurality of pieces of relative position information to obtain a plurality of position codes; performing linear transformation on the sum of the plurality of position codes to obtain a position vector corresponding to the target triplet;

7. An electronic device comprising a processor and a memory, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the triplet generation and verification method of any one of claims 1 to 5.

8. A computer readable storage medium, characterized in that a computer program is stored, which, when being executed by a processor, causes the processor to perform the steps of the triplet generation and verification method according to any one of claims 1 to 5.