CN113282762B - Knowledge graph construction method, knowledge graph construction device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113282762B
CN113282762B (application CN202110585997.1A)
Authority
CN
China
Prior art keywords
target
triplet
entry
syntactic analysis
analysis result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110585997.1A
Other languages
Chinese (zh)
Other versions
CN113282762A (en)
Inventor
曾钢欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Original Assignee
Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Shuliantianxia Intelligent Technology Co Ltd filed Critical Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority to CN202110585997.1A priority Critical patent/CN113282762B/en
Publication of CN113282762A publication Critical patent/CN113282762A/en
Application granted granted Critical
Publication of CN113282762B publication Critical patent/CN113282762B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 - Ontology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/205 - Parsing
    • G06F 40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/284 - Lexical analysis, e.g. tokenisation or collocates
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a knowledge graph construction method, a knowledge graph construction device, an electronic device and a storage medium. The method comprises the following steps: acquiring a target text; performing syntactic analysis processing on the target text to obtain the syntactic analysis result corresponding to each entry in the target text, wherein the syntactic analysis result corresponding to each entry comprises the dependency relationship between the entry and the head entity word corresponding to the entry; obtaining the target syntactic analysis results whose dependency relationship is the centering relationship, and obtaining the target entry corresponding to each target syntactic analysis result and the head entity word corresponding to that target entry; merging the target entry with the head entity word corresponding to the target entry to obtain a merged entry, updating the target entry corresponding to the target syntactic analysis result to the merged entry, and then matching the triples corresponding to the target text according to a preset matching rule and the syntactic analysis results; and constructing a knowledge graph according to the triples.

Description

Knowledge graph construction method, knowledge graph construction device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a knowledge graph construction method, a knowledge graph construction device, an electronic device, and a storage medium.
Background
With the deployment of various intelligent products, the knowledge graph plays an important role as the knowledge "brain" that powers such products behind the scenes. However, the cost of constructing a knowledge graph is very high: experts in the corresponding field are usually required to define the data schema of the knowledge graph in advance, massive amounts of labeled data are required for knowledge extraction and knowledge fusion, and a storage database capable of fast response and massive storage is required. Therefore, only large companies have the capability of constructing large-scale knowledge graphs; for many small and medium-sized companies, constructing a knowledge graph is very difficult work, since a large-scale knowledge graph requires a large data volume and extensive manual participation, and the cost is high.
Entity extraction is one of the classical tasks of natural language processing (NLP). Its aim is to extract entities from structured, semi-structured or unstructured data. For example, if the defined entity type is "country" and the unstructured text is "China has a five-thousand-year history of culture", the extracted entity is "China". Currently, some methods segment the words in an acquired corpus, use the segmented words as the candidate set of entities, and obtain triples to construct a knowledge graph; that is, the entities in the triples are obtained only through word segmentation and may deviate from the entities in the actual sentences.
Disclosure of Invention
The application provides a knowledge graph construction method, a knowledge graph construction device, electronic equipment and a storage medium.
In a first aspect, a knowledge graph construction method is provided, including:
acquiring a target text;
carrying out syntactic analysis processing on the target text to obtain syntactic analysis results corresponding to each entry in the target text, wherein the syntactic analysis results corresponding to each entry comprise dependency relations between the entry and head entity words corresponding to the entry;
obtaining a target syntactic analysis result of which the dependency relationship is a centering relationship, and obtaining a target entry corresponding to the target syntactic analysis result and a head entity word corresponding to the target entry;
merging the target entry with the head entity word corresponding to the target entry to obtain a merged entry, updating the syntactic analysis result according to the merged entry, and then matching the triplet corresponding to the target text according to a preset matching rule and the syntactic analysis result;
and constructing a knowledge graph according to the triples.
In a second aspect, there is provided a knowledge graph construction apparatus, including:
the acquisition module is used for acquiring the target text;
The syntactic analysis module is used for carrying out syntactic analysis processing on the target text to obtain syntactic analysis results corresponding to each entry in the target text, wherein the syntactic analysis results corresponding to each entry comprise dependency relations between the entry and the head entity word corresponding to the entry;
the merging module is used for acquiring a target syntactic analysis result of which the dependency relationship is a centering relationship and acquiring a target entry corresponding to the target syntactic analysis result and a head entity word corresponding to the entry;
the merging module is further configured to merge the target entry with a head entity word corresponding to the target entry, obtain a merged entry, update the target entry corresponding to the target syntax analysis result to the merged entry, and then match a triplet corresponding to the target text according to a preset matching rule and the syntax analysis result;
and the construction module is used for constructing a knowledge graph according to the triples.
In a third aspect, there is provided an electronic device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps as in the first aspect and any one of its possible implementations.
In a fourth aspect, there is provided a computer storage medium storing one or more instructions adapted to be loaded by a processor and to perform the steps of the first aspect and any one of its possible implementations described above.
According to the embodiments of the application, the target text is acquired; syntactic analysis processing is performed on the target text to obtain the syntactic analysis result corresponding to each entry in the target text, wherein the syntactic analysis result corresponding to each entry comprises the dependency relationship between the entry and the head entity word corresponding to the entry; the target syntactic analysis results whose dependency relationship is the centering relationship are obtained, and the target entry corresponding to each target syntactic analysis result and the head entity word corresponding to that entry are obtained; the target entry is merged with the head entity word corresponding to the target entry to obtain a merged entry, the target entry corresponding to the target syntactic analysis result is updated to the merged entry, and the triples corresponding to the target text are then matched according to a preset matching rule and the syntactic analysis results; and a knowledge graph is constructed according to the triples. In this way, the knowledge graph can be constructed automatically from a piece of text, which reduces the degree of manual participation and the construction cost of the knowledge graph. In general methods, the words obtained after word segmentation are used as the candidate set of entities, so the final entities and relations are necessarily single words; if, in a real situation, the final entity is a combination of several words, the final extraction result will be incorrect no matter how correct the dependency analysis tree is. In the present application, the entry in a centering relationship obtained after syntactic analysis of the text and its corresponding head entity word are merged into one entity, and the merged entity word, which contains the modifier of the original entity, is used as the entity in the triplet, so that more accurate triples can be obtained.
Drawings
In order to more clearly describe the technical solutions in the embodiments or the background of the present application, the following description will describe the drawings that are required to be used in the embodiments or the background of the present application.
Fig. 1 is a schematic flow chart of a knowledge graph construction method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a triplet pattern matching provided in an embodiment of the present application;
fig. 3 is a flow chart of another knowledge graph construction method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a classification model according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a knowledge graph construction device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will clearly and completely describe the technical solution in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The terms first, second and the like in the description and in the claims of the present application and in the above-described figures, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The knowledge graph related to the embodiments of the present application is essentially a semantic network that reveals the relationships between entities and is generally composed of triples expressed in the form of head node-relation-tail node. The nodes and edges can store attributes, and there are generally two storage modes: 1. the RDF (Resource Description Framework); 2. a graph database. A triplet is typically extracted from three components of a sentence, including a host entity (subject), a guest entity (object) and the relationship between the two entities (relation), and may be expressed as (subject, relation, object). For example, one triplet is (Kobe, occupation, basketball player), and a large number of such triples constitute a specific knowledge graph.
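To make the triple form concrete, the following is a minimal Python sketch, not taken from the application itself; the class name, field names and the example fact are illustrative assumptions only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    """One knowledge-graph fact in (host entity, relation, guest entity) form."""
    subject: str   # host entity, e.g. "Kobe" (illustrative)
    relation: str  # relation word, e.g. "occupation"
    obj: str       # guest entity, e.g. "basketball player"

# A knowledge graph is then simply a large collection of such facts.
kg = {Triple("Kobe", "occupation", "basketball player")}
```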
Embodiments of the present application are described below with reference to the accompanying drawings in the embodiments of the present application.
Referring to fig. 1, fig. 1 is a flow chart of a knowledge graph construction method according to an embodiment of the present application. The method may include:
101. and acquiring the target text.
The execution body of the embodiments of the present application may be a knowledge graph construction apparatus or an electronic device. In a specific implementation, the electronic device may be a terminal, which may also be referred to as a terminal device, including but not limited to portable devices such as a laptop computer or a tablet computer having a touch-sensitive surface (e.g., a touch-screen display and/or a touch pad). It should also be appreciated that, in some embodiments, the device is not a portable communication device but a desktop computer having a touch-sensitive surface (e.g., a touch-screen display and/or a touch pad).
The target text may be an explanatory text for a certain keyword, for example, a text retrieved from websites such as online encyclopedias that uses the keyword as an entry.
Optionally, the step 101 includes:
011. acquiring a text to be processed corresponding to the key term;
012. and carrying out sentence segmentation on the text to be processed to obtain target texts, wherein each sentence in the target texts contains a subject.
Specifically, a publicly available general-purpose crawler tool may be adopted to crawl the corresponding website to obtain the text corresponding to the keyword, i.e., the text to be processed. For example, crawling the text that uses "beautician" as an entry, the obtained text is:
"Beauticians are a professional designation in the field of professional cosmetology. They mainly work in beauty parlors and places that can provide beauty services for customers. Their job duty is to provide cosmetic services to customers, such as skin-care tasks including face washing, maintenance, massage, aromatherapy and weight loss."
The text to be processed may be subjected to data preprocessing, such as clause segmentation using punctuation marks like '。', '，', '；' and '！', and the subject of each sentence in the text to be processed may be supplemented to obtain the target text. The specific operation may include: if it is detected that a sentence obtained after segmentation has no subject, the subject of the previous sentence is taken as the subject of the current sentence. After this stage, the text becomes individual sentences for subsequent syntactic analysis.
For example, after data preprocessing, the text that uses "beautician" as an entry can be segmented by punctuation into the following clauses:
"Beauticians are a professional designation in the field of professional cosmetology";
"mainly work in beauty parlors and places that can provide beauty services for customers";
"the job duty is to provide customers with cosmetic services, such as skin-care tasks including face washing, maintenance, massage, aromatherapy and weight loss".
The second clause has no subject, and the subject "beautician" in the first sentence can be used as a supplement, i.e., the second sentence becomes:
"Beauticians mainly work in beauty parlors and places that can provide customers with beauty services."
102. And carrying out syntactic analysis processing on the target text to obtain syntactic analysis results corresponding to each entry in the target text, wherein the syntactic analysis results corresponding to each entry comprise the dependency relationship between the entry and the head entity word corresponding to the entry.
The syntactic analysis (parsing) referred to in the embodiments of the present application refers to analyzing the grammatical function of each word in a sentence. For example, in "I came late", "I" is the subject, "came" is the predicate, and "late" is the complement.
Specifically, syntactic analysis processing may be performed on the target text to obtain the syntactic analysis result corresponding to each entry in the target text. The entries in the embodiments of the present application are the basic units that form a sentence; an entry may be a character or a word, or may be composed of characters, words, etc. Each entry corresponds to a grammatical function, which may be determined through the syntactic analysis; for example, "I", "came" and "late" are three separate entries. The syntactic analysis result of an entry includes the relationship between the entry and its corresponding head entity word.
The embodiments of the present application relate to dependency grammar, where "dependency" refers to a relationship in which one word governs another and the other is governed; such a relationship is directional. The governing word is called the governor, i.e., the head entity word (head), and the governed word may be called the dependent.
Dependencies can be subdivided into different types that represent the specific relationship between two words. For example, in the sentence "I sent her a bunch of flowers", (I <- sent) is a subject-verb relationship (SBV) and (sent -> flowers) is a verb-object relationship (VOB); in "red apples", (red <- apples) is a centering relationship (ATT); and so on. According to the syntactic analysis, for two words between which a dependency exists, the word pointed to by the arrow is the head entity word (head) of the word from which the arrow originates.
It should be noted that, in general, the centering relationship exists between a head entity word and an attributive (centering) word. When processing is performed with the natural language tool LTP 4.0 developed by the Harbin Institute of Technology, the head entity word (head) and the attributive word (att) are recorded respectively; for example, in "human society", "human" is the att and "society" is the head.
In one embodiment, the step 102 includes:
Word segmentation processing is carried out on each sentence in the target text, and a plurality of entries in the target text are obtained;
performing part-of-speech recognition on each entry to determine the part of speech of each entry;
and carrying out syntactic analysis processing on the target text according to the part of speech of each term to obtain a syntactic analysis result corresponding to each term in the target text.
The part of speech referred to in the embodiments of the present application is the characteristic of a word that serves as the basis for dividing word classes. Word class is a linguistic term that refers to the grammatical classification of words in a language, which is mainly based on grammatical features (including syntactic function and morphological change) while taking lexical meaning into account. The words of modern Chinese can be divided into 13 word classes, and the grammatical abbreviations involved may include:
prep. preposition
pron. pronoun
n. noun
v. verb
conj. conjunction
s. subject
sc. subject complement (predicative)
o. object
oc. object complement
vi. intransitive verb
vt. transitive verb
aux.v. auxiliary verb
adj. adjective
adv. adverb
art. article
num. numeral
For example, for the sentence "A free radical is an atom or ion bearing unpaired electrons; its chemical properties are active and it has extremely high oxidation performance", word segmentation is first performed using the LTP tool from the Harbin Institute of Technology, and the result is:
['free radical', 'bearing', 'not', 'forming', 'pairing', 'electron', '', 'atom', 'or', 'ion', ',', 'it', 'chemical', 'property', 'active', '', 'have', 'pole', 'high', 'oxidation', 'performance', 'the like']; part-of-speech recognition after word segmentation is as follows:
['n','v','d','v','p','n','wp','n','c','n','wp','r','n','n','a','wp','v','d','a','u','v','n'];
then, syntactic analysis is carried out to obtain the following results:
[(1,2,'SBV'),(2,0,'HED'),(3,4,'ADV'),(4,2,'CMP'),(5,4,'VOB'),(6,2,
'VOB'),(7,8,'WP'),(8,6,'COO'),(9,10,'LAD'),(10,6,'COO'),(11,2,'WP'),(12,14,'ATT'),(13,14,'ATT'),(14,15,'SBV'),(15,2,'COO'),(16,2,'WP'),(17,2,
'COO'),(18,19,'ADV'),(19,24,'ATT'),(20,19,'RAD'),(21,22,'ATT'),(22,24,
'ATT'),(23,24,'ATT'),(24,17,'VOB'),(25,17,'COO'),(26,27,'ADV'),(27,25,
'CMP'),(28,27,'CMP'),(29,28,'POB'),(30,31,'WP'),(31,29,'COO'),(32,33,
'LAD'),(33,29,'COO'),(34,2,'WP'),(35,37,'ATT'),(36,37,'ATT'),(37,38,'SBV'),(38,2,'COO'),(39,38,'WP'),(40,38,'COO'),(41,42,'ADV'),(42,45,'ATT'),(43,42,'RAD'),(44,45,'ATT'),(45,40,'VOB')]。
Here, the first element of each group of data is the position index of the word, the second element is the index of the word's head entity word, and the third element represents the dependency relationship between the word and its head entity word. For example, in (1, 2, 'SBV'), the first element 1 represents the position of the word "free radical", i.e., the first entry in the original sentence; the second element 2 represents the position of the head entity word corresponding to "free radical", i.e., the second entry "bearing" in the original sentence; and 'SBV' represents that the head entity word "bearing" and "free radical" are in a subject-verb relationship, that is, "bearing" is the predicate of "free radical". As another example, in (5, 4, 'VOB'), the first element 5 represents the position of "pairing", i.e., the fifth entry in the original sentence; the second element 4 represents the position of the head entity word corresponding to "pairing", i.e., the fourth entry "forming" in the original sentence; and 'VOB' represents that the head entity word "forming" and "pairing" are in a verb-object relationship.
It should be noted that the head entity word (head) defined in the dependency relationship is conceptually different from the host entity in a triplet. The host entity in a triplet is the subject of the triplet and represents the initiator of the action, whereas the dependency logic simply defines the head entity word as the word pointed to by the syntactic analysis arrow from the word that the arrow originates from, which is not necessarily the initiator of the action.
In the embodiments of the present application, the syntactic analysis, word segmentation and the like are implemented using the natural language processing tool LTP open-sourced by the Harbin Institute of Technology. Other open-source tools such as NLTK and FastNLP may also be used, and the syntactic analysis may also be implemented by training a specific model according to requirements, which is not limited by the embodiments of the present application.
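For illustration, a sketch of how segmentation, part-of-speech tagging and dependency parsing might be obtained with the LTP 4.0 Python interface is shown below; the method names (seg, pos, dep) and the (index, head index, relation) return format match the output shown above but are assumptions that may differ across LTP releases.

```python
from ltp import LTP  # HIT-SCIR LTP 4.x; API assumed, check the installed version

ltp = LTP()  # loads a default pretrained model
# Example sentence, back-translated from the "human society" example in the text
sentences = ["人类社会先后经历了农业社会、工业社会等发展阶段"]

segments, hidden = ltp.seg(sentences)   # word segmentation
pos_tags = ltp.pos(hidden)              # part-of-speech tags, e.g. 'n', 'v'
arcs = ltp.dep(hidden)                  # dependency arcs per sentence

# Each arc is assumed to be (word_index, head_index, relation) with 1-based
# indices, matching the (1, 2, 'SBV')-style output shown above.
for word_idx, head_idx, relation in arcs[0]:
    word = segments[0][word_idx - 1]
    head = segments[0][head_idx - 1] if head_idx > 0 else "ROOT"
    print(word, "->", head, relation)
```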
103. And obtaining a target syntactic analysis result of which the dependency relationship is a centering relationship, and obtaining a target entry corresponding to the target syntactic analysis result and a head entity word corresponding to the entry.
Specifically, after the syntactic analysis, a target syntactic analysis result of the centering relationship can be determined according to the dependency relationship in the syntactic analysis result, and then a target entry corresponding to the target syntactic analysis result is determined, namely, a target entry att belonging to the centering relationship and a head entity word head corresponding to the target entry att can be found.
In the results obtained by the foregoing syntactic analysis, 'ATT' indicates a centering relationship. For example, for the sentence "Human society has successively undergone development stages such as the agricultural society, the industrial society and the post-industrial society.", assume that the obtained results include (1, 2, 'ATT'), where the position indicated by 1 corresponds to the word "human", the position indicated by 2 corresponds to the word "society", which is the head entity word of "human", and ATT indicates that "human" and "society" are in a centering relationship.
In an optional embodiment, the syntax analysis result corresponding to each term further includes: first position information indicating a position of the term in the target text, and second position information indicating a position of a head entity word corresponding to the term in the target text;
the obtaining the target entry corresponding to the target syntactic analysis result and the head entity word corresponding to the entry includes:
031. acquiring the target entry corresponding to the target syntactic analysis result according to the first position information in the target syntactic analysis result;
032. and acquiring the head entity word corresponding to the target entry according to the second position information in the target syntactic analysis result.
The first position information in a syntactic analysis result is the position index of the entry, and the second position information is the index of the entry's head entity word. For example, in (42, 45, 'ATT'), 42 indicates the position of entry a in the target text (the 42nd entry), 45 indicates the position of the head entity word b corresponding to entry a in the target text (the 45th entry), and ATT indicates that the dependency relationship between entry a and head entity word b is a centering relationship.
For example, for the sentence "Human society has successively undergone development stages such as the agricultural society, the industrial society and the post-industrial society", assume that the obtained results include (1, 2, 'ATT'), where the position indicated by 1 corresponds to the word "human", the position indicated by 2 corresponds to the word "society", which is the head entity word of "human", and ATT indicates that "human" and "society" are in a centering relationship.
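A short sketch of selecting the centering-relation results from such output is given below; parse_result is assumed to be the list of (index, head index, relation) tuples described above, with 1-based positions.

```python
def find_att_pairs(parse_result):
    """Return (target_entry_index, head_entry_index) pairs whose dependency is
    the centering relation 'ATT'; indices are 1-based as in the parser output."""
    return [(word_idx, head_idx)
            for word_idx, head_idx, relation in parse_result
            if relation == "ATT" and head_idx != 0]

# For the "human society ..." example above:
# find_att_pairs([(1, 2, "ATT"), (2, 4, "SBV")]) -> [(1, 2)],
# i.e. entry 1 ("human") modifies its head, entry 2 ("society").
```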
104. And merging the target entry with the head entity word corresponding to the target entry to obtain a merged entry, updating the syntactic analysis result according to the merged entry, and then matching the triplet corresponding to the target text according to a preset matching rule and the syntactic analysis result.
Through the above steps, the target entry att and the head entity word head belonging to a centering relationship can be obtained, and the two are then merged to obtain the merged entry (att + head). In this case, the syntactic analysis results need to be updated, and the original entry head involved in the other syntactic analysis results needs to be replaced with the merged entry (att + head).
For a further example, for the syntactic analysis result (1, 2, 'ATT'), which indicates that "human" and "society" are in a centering relationship, the two may be merged into "human society". The head corresponding to the entry before merging is "society", and at this point the original entry involved in all syntactic analysis results may be replaced with "human society".
In the embodiments of the present application, when an entry does not belong to a centering phrase, the merging process is not performed on that entry; and when an entry has no corresponding head entity word, for example when the entry itself does not belong to an entity word, the second position information, i.e., the head index (0), is not included in the corresponding syntactic analysis result, and the entry is not involved in the merging process.
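A simplified sketch of this merge-and-update step is shown below; it handles a single level of ATT merging and keeps the original 1-based indices, which is enough to illustrate the idea but is not the full procedure of the application.

```python
def merge_att_with_head(words, parse_result):
    """Absorb every ATT (centering) entry into its head entity word and drop
    the corresponding ATT result, so later pattern matching sees merged entries."""
    words = list(words)                      # 0-based list of segmented entries
    remaining = []
    for word_idx, head_idx, relation in parse_result:
        if relation == "ATT" and head_idx != 0:
            # Chinese entries are concatenated directly, e.g. "人类" + "社会"
            words[head_idx - 1] = words[word_idx - 1] + words[head_idx - 1]
        else:
            remaining.append((word_idx, head_idx, relation))
    return words, remaining

words, parse = merge_att_with_head(
    ["人类", "社会", "经历", "农业", "阶段"],
    [(1, 2, "ATT"), (2, 3, "SBV"), (5, 3, "VOB"), (4, 5, "ATT")])
# words -> ["人类", "人类社会", "经历", "农业", "农业阶段"]
# parse -> [(2, 3, "SBV"), (5, 3, "VOB")]
```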
The preset matching rules specify how different triples should be generated according to different inter-term dependency relationships, and after the merging and updating are completed, the triples corresponding to the target text can be matched according to the preset matching rules and the syntactic analysis results.
In an alternative embodiment, the preset matching rule includes a preset inter-term dependency mode, and a triplet expression corresponding to the preset inter-term dependency mode;
And matching the triples corresponding to the target text according to a preset matching rule and the syntactic analysis result, wherein the matching comprises the following steps:
041. determining a group of entries meeting the preset inter-entry dependency relation mode in each entry according to the dependency relation of each entry in the syntactic analysis result;
042. and constructing the group of entries into corresponding triples according to the triples expression corresponding to the preset inter-entry dependency relation mode.
Specifically, multiple inter-entry dependency relationship modes can be preset according to requirements, and the triplet expression corresponding to each inter-entry dependency relationship mode is predefined, so that the inter-entry dependency relationship mode corresponding to a group of entries is matched according to the dependency relationships between different entries in the syntactic analysis results, and the group of entries is substituted into the triplet expression corresponding to that mode to obtain the final triplet. Each mode may include at least two groups of dependency relationships, each dependency relationship indicating a dependency between two entries, which are not necessarily two entity words. Corresponding relationship triplet expressions can be preset according to the dependency relationships between different entries; when the inter-entry dependency relationships of a certain mode are satisfied, the corresponding triplet expression can be adopted and the entries substituted into it, thereby obtaining a specific triplet result.
Fig. 2 is a schematic diagram of triplet pattern matching provided in an embodiment of the present application. As shown in Fig. 2, six matching patterns are given, DSNF2-DSNF7, together with the logical expression and graphical expression corresponding to each pattern and the corresponding relationship triplet. In the graphical expression, the type of each word is marked in a box; for example, the entity word E1 in the DSNF2 pattern is marked with n, i.e., a noun. The dependency relationship between two words is marked on the arrowed line; "-" indicates a combination of two words, "{1,2}+" indicates a word appearing once or twice, and "[]?" indicates a word that appears once or not at all.
For example, the DSNF2 pattern is a subject-predicate-object pattern. Specifically: E1 and E2 are nouns, Pred is a verb, E1 and Pred are in a subject-verb relationship, and E2 and Pred are in a verb-object relationship, i.e., they form a subject-predicate-object structure. If this relationship is satisfied, the subject, predicate and object of the sentence can be correspondingly extracted as a triplet, i.e., (E1, Pred, E2), where E1 is the host entity, Pred is the relation word, and E2 is the guest entity.
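As an illustration of how such a pattern could be matched in code, the following sketch covers only the DSNF2 (subject-predicate-object) case and omits the part-of-speech constraints described above; it is not the full rule set of the application.

```python
def match_dsnf2(words, parse_result):
    """Emit (E1, Pred, E2) whenever some predicate has both an SBV dependent
    (the subject E1) and a VOB dependent (the object E2)."""
    subject_of, object_of = {}, {}
    for word_idx, head_idx, relation in parse_result:
        if relation == "SBV":
            subject_of[head_idx] = word_idx
        elif relation == "VOB":
            object_of[head_idx] = word_idx
    return [(words[subject_of[p] - 1], words[p - 1], words[object_of[p] - 1])
            for p in subject_of.keys() & object_of.keys()]

# Continuing the example from the merging sketch above:
# match_dsnf2(words, parse) -> [("人类社会", "经历", "农业阶段")]
```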
Before entity words with a centering relationship are merged, the difference mainly lies in the DSNF1 pattern: the part of speech of E1 is n (n represents noun), the part of speech of the center word of E1 is n, the relationship between E1 and the center word is the ATT relationship, i.e., the centering relationship, the part of speech of E2 is n, and the relationship between the center word of E1 and E2 is also the ATT relationship. When the words in a sentence satisfy the above relationships, the center word may be denoted as attword, and a triplet (E1, attword, E2) can then be extracted, where E1 is the host entity, attword is the relation word, and E2 is the guest entity.
After the att and the head entity word head of a centering relationship are merged, results with the centering relationship (ATT) no longer exist in the syntactic analysis results; that is, matching of the DSNF1 pattern no longer needs to be performed, and triplet results of the form (E1, attword, E2) are reduced. Triples can then be matched through the dependency relationships according to the latter six patterns (DSNF2-DSNF7) in Fig. 2. For these six patterns, the matching before and after merging is consistent, but after merging the results involving the merged entry will differ, because the att involved in all syntactic analysis results of the original text has been replaced with the merged entry (att + head).
105. And constructing a knowledge graph according to the triples.
Specifically, with the entities in the obtained triples as nodes and the relationships between different entities as connecting edges, one entity can branch out one or more edges to connect with other entities, thereby constructing an entity-relationship network represented in a graph structure, i.e., the knowledge graph.
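A minimal sketch of this graph-building step is shown below, using the networkx library as one possible in-memory representation; the application itself does not prescribe a particular library.

```python
import networkx as nx

def build_knowledge_graph(triples):
    """Entities become nodes; each (subject, relation, object) triple becomes
    one directed edge carrying the relation word as an attribute."""
    graph = nx.MultiDiGraph()
    for subject, relation, obj in triples:
        graph.add_edge(subject, obj, relation=relation)
    return graph

kg = build_knowledge_graph([
    ("人类社会", "经历", "农业社会"),
    ("人类社会", "经历", "工业社会"),
])
print(kg.number_of_nodes(), kg.number_of_edges())   # 3 nodes, 2 edges
```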
According to the embodiments of the application, the target text is acquired; syntactic analysis processing is performed on the target text to obtain the syntactic analysis result corresponding to each entry in the target text, wherein the syntactic analysis result corresponding to each entry comprises the dependency relationship between the entry and the head entity word corresponding to the entry; the target syntactic analysis results whose dependency relationship is the centering relationship are obtained, and the target entry corresponding to each target syntactic analysis result and the head entity word corresponding to that entry are obtained; the target entry is merged with the head entity word corresponding to the target entry to obtain a merged entry, the target entry corresponding to the target syntactic analysis result is updated to the merged entry, and the triples corresponding to the target text are then matched according to a preset matching rule and the syntactic analysis results; and a knowledge graph is constructed according to the triples. In this way, the knowledge graph can be constructed automatically from a piece of text, which reduces the degree of manual participation and the construction cost of the knowledge graph. In general methods, the words obtained after word segmentation are used as the candidate set of entities, so the final entities and relations are necessarily single words; if, in a real situation, the final entity is a combination of several words, the final extraction result will be incorrect no matter how correct the dependency analysis tree is. In the present application, the entry in a centering relationship obtained after syntactic analysis of the text and its corresponding head entity word are merged into one entity, and the merged entity word, which contains the modifier of the original entity, is used as the entity in the triplet, so that more accurate triples can be obtained. For example, for the sentence "Human society has successively undergone development stages such as the agricultural society and the industrial society", the triples obtained by directly performing pattern matching are (society, experience, society) and (society, experience, stage); after the att and the corresponding head are aggregated by the method in the present application, the results will be (human society, experience, agricultural society development stage) and (human society, experience, industrial society), i.e., the correct results are obtained.
Referring to fig. 3, fig. 3 is a flow chart of another knowledge graph construction method according to an embodiment of the present application. The method may include:
301. and obtaining the text to be processed corresponding to the key term.
Specifically, the text corresponding to the keyword can be queried from websites such as online encyclopedias and used as the text to be processed, and various publicly available general-purpose crawler tools can be used to crawl the corresponding websites, which is not limited here.
302. And carrying out sentence segmentation on the text to be processed to obtain target texts, wherein each sentence in the target texts contains a subject.
Specifically, the text to be processed may be subjected to data preprocessing, such as deleting some useless characters, and then segmented into clauses using punctuation marks such as '。', '；' and '！'. Other preprocessing modes can be added according to the data set; for example, if no subject is detected in a sentence, the subject of the previous sentence is acquired as the subject of the current sentence. The target text after the above processing consists of individual sentences, and step 303 may then be performed.
303. And carrying out syntactic analysis processing on the target text to obtain syntactic analysis results corresponding to each entry in the target text, wherein the syntactic analysis results corresponding to each entry comprise the dependency relationship between the entry and the head entity word corresponding to the entry.
304. And obtaining a target syntactic analysis result of which the dependency relationship is a centering relationship, and obtaining a target entry corresponding to the target syntactic analysis result and a head entity word corresponding to the entry.
305. Merging the target entry with the head entity word corresponding to the target entry to obtain a merged entry, updating the target entry corresponding to the target syntactic analysis result into the merged entry, and then matching the triplet corresponding to the target text according to a preset matching rule and the syntactic analysis result.
For the above steps 303 to 305, reference may be made to the descriptions of steps 102 to 104 in the embodiment shown in Fig. 1 respectively, which are not repeated here.
306. And performing verification processing on the triples corresponding to the target text, screening out error triples in the triples corresponding to the target text, and deleting the error triples.
307. And constructing a knowledge graph according to the triples corresponding to the target text.
In the embodiment of the application, the triples can be checked through a preset classification model. The triples extracted in the previous step can have errors, and the step mainly uses a classification model to delete some obviously erroneous triples so as to improve the accuracy.
In particular, the classification model may use a BERT model, which can perform text classification. The trained classification model can determine whether each triplet is correct or incorrect; the triples that the classification model considers erroneous are filtered out, and the remaining triples are taken as the final triplet result.
In one embodiment, the step 306 includes:
acquiring input data, wherein the input data comprises a vector of an original sentence corresponding to a target triplet, a vector of each element in the target triplet and a vector of each element position in the target triplet, the original sentence is a sentence from which the target triplet is extracted, and the elements in the target triplet comprise entries in the target triplet and dependency relations among entries of the target triplet, and the target triplet is any triplet corresponding to a target text;
inputting the input data into a preset classification model, and processing the input data through the preset classification model to obtain a credibility identifier of the target triplet, wherein the credibility identifier indicates whether the target triplet is an error triplet;
and screening out the error triples in the triples corresponding to the target text based on the credibility identification of the triples corresponding to the target text.
Most current BERT models are trained to determine the relationships between entities; in the embodiments of the present application, the trained BERT model can be used to determine, based on the entities and the relation of an input triplet, whether the triplet is trustworthy. Specifically, the preset classification model may be obtained through training based on labeled sample data, where the labeled sample data includes a plurality of triplet samples, each triplet sample is labeled with a credibility identifier, and the credibility identifier indicates whether the triplet sample is an erroneous triplet.
Fig. 4 is a schematic structural diagram of a classification model according to an embodiment of the present application. As shown in Fig. 4, the model has a BERT model structure, where the numerical labels refer to the encoding of the corresponding content in vector space. During training, the three elements of a triplet sample (including the host entity, the relation and the guest entity) and the positions of the three elements are first encoded into vectors (3-8). Specifically, the vectors may be randomly initialized and their weights then updated through model training, so that the vectors corresponding to this information can be obtained; alternatively, vectors of triples that have already been trained may be used, which is not limited here. Further, the vectors (3-8) are summed and then added to the vector (2) of the input original sentence, and the result is used as the input of the BERT model. After BERT encoding, the last layer performs a linear transformation through a linear layer and is processed by an activation function layer, which converts the value into a probability and then into a label. In the present application, the label is the credibility classification result of the triplet, and the trained classification model can make predictions for triples. The activation function may be a Sigmoid function, which maps numbers to between 0 and 1 so as to represent the probability of a certain value x; the final probability is the result after the Sigmoid.
In model prediction, the vector of each element of the triplet, the vector of each element position in the triplet and the vector of the sentence are input to predict whether a triplet is trustworthy, so that obviously erroneous triples can be filtered out. The position information of each element in the triplet can be represented by the vector of each element position. The triplet-related vectors are input into the trained model, and the output is a binary classification, 1 or 0, indicating whether the triplet is a trusted triplet; whether a triplet is trustworthy can thus be predicted by the model obtained after training on the labeled sample data.
Specifically, in practical application, the weights of the trained BERT model may be used to initialize the vectors to obtain the vector of the original sentence, the vector of each element (host entity, relation and tail entity) in the triplet and the vector of each element position in the triplet. These vectors are input into the BERT model, and after BERT encoding the value is converted into a probability through the Sigmoid function, obtaining the credibility identifier corresponding to the triplet, including the probability of being correct (1) and the probability of being erroneous (0), so as to determine whether the triplet is erroneous. For example, if the vector processed by the neural network is [1, 3], the value is converted into a probability, and the result after the Sigmoid is [0.25, 0.75]; the probability of 1 (correct) is considered to be 0.25 and the probability of 0 (erroneous) to be 0.75, so the final classification prediction result is 0 and the triplet is an erroneous triplet. According to the credibility identifier of each triplet, it can be determined whether each triplet is an erroneous triplet, thereby screening out the erroneous triples among all the triples.
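For illustration, the following is a simplified PyTorch/transformers stand-in for such a verifier: it packs the original sentence and the three triplet elements into one input sequence instead of summing separate element and position embeddings as in Fig. 4, and the pretrained model name and separator scheme are assumptions rather than the construction described above.

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizer

class TripleVerifier(nn.Module):
    """BERT encoder followed by a linear layer and a Sigmoid, outputting the
    probability that the input triplet is trustworthy."""
    def __init__(self, pretrained="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        self.classifier = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        logit = self.classifier(hidden[:, 0])        # [CLS] representation
        return torch.sigmoid(logit).squeeze(-1)      # probability of "trusted"

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
sentence = "人类社会先后经历了农业社会等发展阶段"
subject, relation, obj = "人类社会", "经历", "农业社会"
inputs = tokenizer(sentence, f"{subject} [SEP] {relation} [SEP] {obj}",
                   return_tensors="pt")
model = TripleVerifier()
probability = model(inputs["input_ids"], inputs["attention_mask"])
```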
The classification model used in the embodiments of the present application is BERT, and it may be replaced by a machine learning classification model such as a support vector machine (Support Vector Machine, SVM) or logistic regression, or by a deep learning model such as a convolutional neural network (Convolutional Neural Networks, CNN) or ALBERT, which is not limited in the embodiments of the present application.
For the above step 307, reference may be made to the specific description of step 105 in the embodiment shown in Fig. 1, which is not repeated here. Optionally, after the above step 307, the knowledge graph may be stored in a database; for example, the graph database ArangoDB may be used for graph storage, and the stored graph may be displayed.
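A rough sketch of such storage with the python-arango driver is shown below; the connection details are placeholders, the collection layout is only one reasonable design, and no deduplication of entities is attempted.

```python
from arango import ArangoClient   # python-arango driver (assumed installed)

client = ArangoClient(hosts="http://localhost:8529")           # placeholder host
db = client.db("kg", username="root", password="password")     # placeholder credentials

entities = (db.collection("entities") if db.has_collection("entities")
            else db.create_collection("entities"))
relations = (db.collection("relations") if db.has_collection("relations")
             else db.create_collection("relations", edge=True))

def store_triple(subject, relation, obj):
    """Store one triple as two vertex documents and one labelled edge."""
    s = entities.insert({"name": subject})
    o = entities.insert({"name": obj})
    relations.insert({"_from": s["_id"], "_to": o["_id"], "label": relation})

store_triple("人类社会", "经历", "农业社会")
```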
In the embodiments of the present application, the triples can be verified through the classification model, so that the accuracy of triplet extraction can be improved and a more accurate and reliable knowledge graph can be established.
Based on the description of the embodiment of the knowledge graph construction method, the embodiment of the application also discloses a knowledge graph construction device. Referring to fig. 5, the knowledge graph construction apparatus 500 includes:
an obtaining module 510, configured to obtain a target text;
the syntax analysis module 520 is configured to perform syntax analysis processing on the target text, and obtain syntax analysis results corresponding to each term in the target text, where the syntax analysis results corresponding to each term include dependency relationships between the term and a head entity word corresponding to the term;
A merging module 530, configured to obtain a target syntax analysis result that the dependency relationship is a centering relationship, and obtain a target term corresponding to the target syntax analysis result and a head entity word corresponding to the term;
the merging module 530 is further configured to merge the target term with a head entity word corresponding to the target term, obtain a merged term, update the target term corresponding to the target syntax analysis result to the merged term, and then match a triplet corresponding to the target text according to a preset matching rule and the syntax analysis result;
and a construction module 540, configured to construct a knowledge graph according to the triples.
According to an embodiment of the present application, each step involved in the methods shown in fig. 1 and fig. 3 may be performed by each module in the knowledge graph construction apparatus 500 shown in fig. 5, which is not described herein.
The knowledge graph construction apparatus 500 in the embodiments of the present application may acquire the target text; perform syntactic analysis processing on the target text to obtain the syntactic analysis result corresponding to each entry in the target text, wherein the syntactic analysis result corresponding to each entry comprises the dependency relationship between the entry and the head entity word corresponding to the entry; obtain the target syntactic analysis results whose dependency relationship is the centering relationship, and obtain the target entry corresponding to each target syntactic analysis result and the head entity word corresponding to that entry; merge the target entry with the head entity word corresponding to the target entry to obtain a merged entry, update the target entry corresponding to the target syntactic analysis result to the merged entry, and then match the triples corresponding to the target text according to a preset matching rule and the syntactic analysis results; and construct a knowledge graph according to the triples. The knowledge graph can thus be constructed automatically from a piece of text, which reduces the degree of manual participation and the construction cost of the knowledge graph. In general methods, the words obtained after word segmentation are used as the candidate set of entities, so the final entities and relations are necessarily single words; if, in a real situation, the final entity is a combination of several words, the final extraction result will be incorrect no matter how correct the dependency analysis tree is. In the present application, the entry in a centering relationship obtained after syntactic analysis of the text and its corresponding head entity word are merged into one entity, thereby solving the problem that triplet extraction depends on word segmentation results that may be inaccurate.
Based on the description of the method embodiment and the device embodiment, the embodiment of the application also provides electronic equipment. Referring to fig. 6, the electronic device 600 at least includes a processor 601, a memory 602, and an input/output unit 603. The processor 601 may be a central processing unit (central processing unit, CPU), and serves as an arithmetic and control core of the computer system, and is a final execution unit for information processing and program execution.
A computer storage medium may be stored in the memory 602 of the electronic device 600, where the computer storage medium is used to store a computer program, where the computer program includes program instructions, and where the processor 601 may execute the program instructions stored in the memory 602. The preset classification model and the like in the embodiment of the present application may also be stored in the above-described memory 602.
In one embodiment, the electronic device 600 described above in the embodiments of the present application may be used to perform a series of processes, including the method of any of the embodiments shown in fig. 1 and 3, and so on.
The embodiment of the application also provides a computer storage medium (Memory), which is a Memory device in the electronic device and is used for storing programs and data. It is understood that the computer storage media herein may include both built-in storage media in the electronic device and extended storage media supported by the electronic device. The computer storage medium provides a storage space that stores an operating system of the electronic device. Also stored in the memory space are one or more instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor. The computer storage medium herein may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory; optionally, at least one computer storage medium remote from the processor may be present.
In one embodiment, one or more instructions stored in a computer storage medium may be loaded and executed by a processor to implement the corresponding steps in the above embodiments; in specific implementations, one or more instructions in the computer storage medium may be loaded by the processor and perform any steps of the methods in fig. 1 and 3, which are not described herein.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus and modules described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the division of the module is merely a logical function division, and there may be another division manner when actually implemented, for example, a plurality of modules or components may be combined or may be integrated into another system, or some features may be omitted or not performed. The coupling or direct coupling or communication connection shown or discussed with each other may be through some interface, device or module indirect coupling or communication connection, which may be in electrical, mechanical, or other form.
The modules illustrated as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces a flow or function in accordance with embodiments of the present application, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in or transmitted across a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line (digital subscriber line, DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a read-only memory (ROM), or a random-access memory (random access memory, RAM), or a magnetic medium, such as a floppy disk, a hard disk, a magnetic tape, a magnetic disk, or an optical medium, such as a digital versatile disk (digital versatile disc, DVD), or a semiconductor medium, such as a Solid State Disk (SSD), or the like.

Claims (8)

1. A knowledge graph construction method, characterized by comprising the following steps:
acquiring a target text;
carrying out syntactic analysis processing on the target text to obtain a syntactic analysis result corresponding to each entry in the target text, wherein the syntactic analysis result corresponding to each entry comprises the dependency relation between the entry and the head entity word corresponding to the entry;
obtaining a target syntactic analysis result whose dependency relation is an attributive (modifier-head) relation, and obtaining the target entry corresponding to the target syntactic analysis result and the head entity word corresponding to the target entry;
merging the target entry with the head entity word corresponding to the target entry to obtain a merged entry, updating the syntactic analysis result according to the merged entry, and then matching the triples corresponding to the target text according to a preset matching rule and the syntactic analysis result; acquiring input data, wherein the input data comprises a vector of the original sentence corresponding to a target triplet, a vector of each element in the target triplet, and a vector of each element position in the target triplet, the original sentence being the sentence in the target text from which the target triplet is extracted, the elements in the target triplet comprising the entries in the target triplet and the dependency relations among those entries, and the target triplet being any triplet corresponding to the target text; inputting the input data into a preset classification model and processing the input data through the classification model to obtain a credibility identifier of the target triplet, wherein the credibility identifier indicates whether the target triplet is an erroneous triplet; screening out the erroneous triples among the triples corresponding to the target text based on the credibility identifiers of those triples; and deleting the erroneous triples from the triples;
wherein the classification model has a BERT model structure, and the step of inputting the input data into the preset classification model and processing the input data through the preset classification model to obtain the credibility identifier of the target triplet, the credibility identifier indicating whether the target triplet is an erroneous triplet, comprises: initializing vectors with the trained weights in the classification model to obtain the vector of the original sentence, the vector of each element in the target triplet, and the vector of each element position in the target triplet, the elements comprising a main entity, a relation, and a tail entity; adding the vector of each element in the target triplet to the vector of the corresponding element position in the target triplet, then adding the result to the vector of the original sentence to serve as the input of the BERT model; after BERT encoding, applying a linear transformation to the last layer through a linear layer and converting the result into probabilities through the activation function of the classification model to obtain the credibility identifier of the target triplet, wherein the activation function of the classification model is a Sigmoid function and the credibility identifier comprises a correct probability and an error probability; and determining whether the target triplet is an erroneous triplet according to the correct probability and the error probability;
and constructing a knowledge graph according to the triples.
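As an illustration of the triple-screening step in claim 1, the following is a minimal sketch of a BERT-plus-Sigmoid credibility classifier. It assumes the Hugging Face transformers library and the bert-base-chinese checkpoint, and it simplifies the input construction by pairing the original sentence with a serialized triple instead of summing separately initialized element, position, and sentence vectors; the names TripleCredibilityClassifier and score_triplet are invented for illustration and do not appear in the patent.

# Hypothetical sketch of a triple credibility classifier in the spirit of claim 1.
# Assumes: pip install torch transformers; all names and thresholds are illustrative.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class TripleCredibilityClassifier(nn.Module):
    def __init__(self, pretrained: str = "bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        # Linear layer over the last-layer [CLS] vector, followed by Sigmoid,
        # yielding the correct probability; the error probability is its complement.
        self.linear = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        encoded = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        cls_vector = encoded.last_hidden_state[:, 0, :]   # [CLS] representation
        return torch.sigmoid(self.linear(cls_vector)).squeeze(-1)


def score_triplet(model, tokenizer, sentence: str, triplet):
    """Return (correct_probability, error_probability) for one candidate triple."""
    head, relation, tail = triplet
    # Simplification: the triple is serialized and paired with its original sentence.
    inputs = tokenizer(sentence, f"{head} {relation} {tail}",
                       return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        p_correct = model(**inputs).item()
    return p_correct, 1.0 - p_correct


if __name__ == "__main__":
    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = TripleCredibilityClassifier()
    p_ok, p_err = score_triplet(model, tokenizer,
                                "知识图谱由三元组构成。",
                                ("知识图谱", "构成", "三元组"))
    # Triples whose error probability exceeds the correct probability would be deleted.
    print(p_ok, p_err)

In practice the linear layer (and optionally the BERT encoder) would be fine-tuned on labeled correct/erroneous triples before the scores are used for deletion.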
2. The knowledge graph construction method according to claim 1, wherein the syntactic analysis result corresponding to each entry further comprises first position information and second position information, the first position information indicating the position of the entry in the target text and the second position information indicating the position of the head entity word corresponding to the entry in the target text;
and the obtaining of the target entry corresponding to the target syntactic analysis result and the head entity word corresponding to the target entry comprises:
acquiring the target entry corresponding to the target syntactic analysis result according to the first position information in the target syntactic analysis result;
and acquiring the head entity word corresponding to the target entry according to the second position information in the target syntactic analysis result.
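To make the role of the two position fields in claim 2 concrete, here is a small illustrative sketch; the ParseResult and resolve names, and the "ATT" label, are invented for illustration and do not appear in the patent.

# Illustrative data layout for claim 2: each parse result carries the entry's own
# position and the position of its head entity word, so both can be fetched by index.
from dataclasses import dataclass


@dataclass
class ParseResult:
    dependency: str      # e.g. an attributive (modifier-head) relation
    entry_pos: int       # first position information: index of the entry itself
    head_pos: int        # second position information: index of the head entity word


def resolve(entries, result: ParseResult):
    """Return (target entry, head entity word) using the two position fields."""
    return entries[result.entry_pos], entries[result.head_pos]


entries = ["深度", "学习", "模型"]
target = ParseResult(dependency="ATT", entry_pos=0, head_pos=2)
print(resolve(entries, target))   # ('深度', '模型')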
3. The knowledge graph construction method according to claim 2, wherein the preset matching rule comprises a preset inter-entry dependency relation pattern and a triplet expression corresponding to the preset inter-entry dependency relation pattern;
and the matching of the triples corresponding to the target text according to the preset matching rule and the syntactic analysis result comprises:
determining, among the entries, a group of entries that meets the preset inter-entry dependency relation pattern according to the dependency relation of each entry in the syntactic analysis result;
and constructing the group of entries into a corresponding triplet according to the triplet expression corresponding to the preset inter-entry dependency relation pattern.
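The matching rule of claim 3 can be pictured as a mapping from a dependency pattern to a triple template. The sketch below uses LTP-style labels (SBV for subject, VOB for object, HED for the root) purely as an example pattern; the Token and match_svo_triples names are illustrative and not taken from the patent.

# Illustrative matching rule: a preset inter-entry dependency pattern (SBV + VOB
# attached to the same predicate) and the triple expression (subject, predicate, object).
from dataclasses import dataclass


@dataclass
class Token:
    index: int
    text: str
    head: int        # index of the head entity word
    dependency: str  # dependency relation to the head


def match_svo_triples(tokens):
    """Emit (subject, predicate, object) triples wherever the preset pattern holds.
    Assumes each token's index equals its position in the list."""
    triples = []
    for subj in tokens:
        if subj.dependency != "SBV":          # subject attached to a predicate
            continue
        predicate = tokens[subj.head]
        for obj in tokens:
            if obj.dependency == "VOB" and obj.head == predicate.index:
                triples.append((subj.text, predicate.text, obj.text))
    return triples


tokens = [
    Token(0, "知识图谱", head=1, dependency="SBV"),
    Token(1, "包含", head=1, dependency="HED"),
    Token(2, "三元组", head=1, dependency="VOB"),
]
print(match_svo_triples(tokens))   # [('知识图谱', '包含', '三元组')]

Other preset patterns (for example, attributive chains or prepositional structures) would each carry their own triple expression and be scanned in the same way.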
4. The knowledge graph construction method according to claim 1, wherein the obtaining of the target text comprises:
acquiring a text to be processed corresponding to the key term;
and carrying out sentence segmentation processing on the text to be processed to obtain the target text, wherein each sentence in the target text contains a subject.
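A rough sketch of the sentence segmentation in claim 4, assuming a split on sentence-ending punctuation and a caller-supplied subject check; both the punctuation set and the has_subject predicate are placeholders, not details given by the patent.

# Split the text to be processed into sentences and keep only those with a subject.
import re
from typing import Callable


def split_sentences(text: str):
    # Split after common Chinese/Latin sentence-ending punctuation.
    parts = re.split(r"(?<=[。！？!?.])\s*", text)
    return [p.strip() for p in parts if p.strip()]


def build_target_text(text: str, has_subject: Callable[[str], bool]):
    return [s for s in split_sentences(text) if has_subject(s)]


# Example with a trivial stand-in predicate; a real implementation would consult
# the syntactic analysis result (e.g. the presence of a subject relation).
print(build_target_text("知识图谱很有用。真的吗？", lambda s: "知识图谱" in s))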
5. The knowledge graph construction method according to claim 4, wherein the carrying out of syntactic analysis processing on the target text to obtain the syntactic analysis result corresponding to each entry in the target text comprises:
word segmentation processing is carried out on each sentence in the target text, and a plurality of entries in the target text are obtained;
identifying the parts of speech of each entry, and determining the parts of speech of each entry;
and carrying out syntactic analysis processing on the target text according to the part of speech of each term to obtain a syntactic analysis result corresponding to each term in the target text.
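Claim 5 chains word segmentation, part-of-speech recognition, and syntactic analysis. The patent does not name a parser; the sketch below uses spaCy's Chinese pipeline (zh_core_web_sm) purely for illustration, and the analyze function and its output fields are assumptions mirroring the claim's terminology.

# Sketch of the three-stage analysis; assumes:
#   pip install spacy && python -m spacy download zh_core_web_sm
import spacy

nlp = spacy.load("zh_core_web_sm")   # tokenization + POS tagging + dependency parsing


def analyze(target_text: str):
    """Return, per entry: text, part of speech, dependency relation, head word, positions."""
    doc = nlp(target_text)
    return [
        {
            "entry": tok.text,
            "pos": tok.pos_,            # part of speech of the entry
            "dependency": tok.dep_,     # dependency relation to the head entity word
            "head": tok.head.text,
            "entry_pos": tok.i,         # first position information
            "head_pos": tok.head.i,     # second position information
        }
        for tok in doc
    ]


for row in analyze("知识图谱由三元组构成。"):
    print(row)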
6. A knowledge graph construction device, characterized by comprising:
the acquisition module is used for acquiring the target text;
the syntactic analysis module, used for carrying out syntactic analysis processing on the target text to obtain a syntactic analysis result corresponding to each entry in the target text, wherein the syntactic analysis result corresponding to each entry comprises the dependency relation between the entry and the head entity word corresponding to the entry;
the merging module, used for obtaining a target syntactic analysis result whose dependency relation is an attributive (modifier-head) relation, and obtaining the target entry corresponding to the target syntactic analysis result and the head entity word corresponding to the target entry;
the merging module being further configured to merge the target entry with the head entity word corresponding to the target entry to obtain a merged entry, update the target entry corresponding to the target syntactic analysis result to the merged entry, and then match the triples corresponding to the target text according to a preset matching rule and the syntactic analysis result;
the merging module being further configured to acquire input data, wherein the input data comprises a vector of the original sentence corresponding to a target triplet, a vector of each element in the target triplet, and a vector of each element position in the target triplet, the original sentence being the sentence in the target text from which the target triplet is extracted, the elements in the target triplet comprising the entries in the target triplet and the dependency relations among those entries, and the target triplet being any triplet corresponding to the target text; input the input data into a preset classification model and process the input data through the preset classification model to obtain a credibility identifier of the target triplet, wherein the credibility identifier indicates whether the target triplet is an erroneous triplet; screen out the erroneous triples among the triples corresponding to the target text based on the credibility identifiers of those triples; and delete the erroneous triples from the triples;
the screening module further comprising a classification module of a BERT model structure, the classification module being used for initializing vectors with the trained weights in the classification model to obtain the vector of the original sentence, the vector of each element in the target triplet, and the vector of each element position in the target triplet, the elements comprising a main entity, a relation, and a tail entity; adding the vector of each element in the target triplet to the vector of the corresponding element position in the target triplet, then adding the result to the vector of the original sentence to serve as the input of the BERT model; after BERT encoding, applying a linear transformation to the last layer through a linear layer and converting the result into probabilities through the activation function of the classification model to obtain the credibility identifier of the target triplet, wherein the activation function of the classification model is a Sigmoid function and the credibility identifier comprises a correct probability and an error probability; and determining whether the target triplet is an erroneous triplet according to the correct probability and the error probability;
and the construction module is used for constructing a knowledge graph according to the triples.
7. An electronic device, comprising a processor and a memory, wherein the memory stores a computer program which, when executed by the processor, causes the processor to perform the steps of the knowledge graph construction method according to any one of claims 1 to 5.
8. A computer-readable storage medium, characterized in that a computer program is stored thereon which, when executed by a processor, causes the processor to perform the steps of the knowledge graph construction method according to any one of claims 1 to 5.
CN202110585997.1A 2021-05-27 2021-05-27 Knowledge graph construction method, knowledge graph construction device, electronic equipment and storage medium Active CN113282762B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110585997.1A CN113282762B (en) 2021-05-27 2021-05-27 Knowledge graph construction method, knowledge graph construction device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110585997.1A CN113282762B (en) 2021-05-27 2021-05-27 Knowledge graph construction method, knowledge graph construction device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113282762A CN113282762A (en) 2021-08-20
CN113282762B (en) 2023-06-02

Family

ID=77282232

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110585997.1A Active CN113282762B (en) 2021-05-27 2021-05-27 Knowledge graph construction method, knowledge graph construction device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113282762B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114595686B (en) * 2022-03-11 2023-02-03 北京百度网讯科技有限公司 Knowledge extraction method, and training method and device of knowledge extraction model
CN115827884A (en) * 2022-07-27 2023-03-21 腾讯科技(深圳)有限公司 Text processing method, text processing device, electronic equipment, medium and program product
CN115329772B (en) * 2022-08-09 2023-06-23 抖音视界有限公司 Entry interaction method, apparatus, device and storage medium
CN115936114B (en) * 2022-11-28 2023-06-20 中国科学院空天信息创新研究院 Knowledge graph construction method, knowledge graph construction device and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507125A (en) * 2020-12-03 2021-03-16 平安科技(深圳)有限公司 Triple information extraction method, device, equipment and computer readable storage medium
CN112836064A (en) * 2021-02-24 2021-05-25 吉林大学 Knowledge graph complementing method and device, storage medium and electronic equipment

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291687B (en) * 2017-04-27 2021-03-26 同济大学 Chinese unsupervised open type entity relation extraction method based on dependency semantics
CN107797991B (en) * 2017-10-23 2020-11-24 南京云问网络技术有限公司 Dependency syntax tree-based knowledge graph expansion method and system
US11562133B2 (en) * 2018-12-04 2023-01-24 Foundation Of Soongsil Univ-Industry Cooperation System and method for detecting incorrect triple
CN110347798B (en) * 2019-07-12 2021-06-01 之江实验室 Knowledge graph auxiliary understanding system based on natural language generation technology
CN110704411B (en) * 2019-09-27 2022-12-09 京东方科技集团股份有限公司 Knowledge graph building method and device suitable for art field and electronic equipment
CN112651226B (en) * 2020-09-21 2022-03-29 深圳前海黑顿科技有限公司 Knowledge analysis system and method based on dependency syntax tree
CN112232074B (en) * 2020-11-13 2022-01-04 完美世界控股集团有限公司 Entity relationship extraction method and device, electronic equipment and storage medium
CN112819162B (en) * 2021-02-02 2024-02-27 东北大学 Quality inspection method for knowledge-graph triples

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507125A (en) * 2020-12-03 2021-03-16 平安科技(深圳)有限公司 Triple information extraction method, device, equipment and computer readable storage medium
CN112836064A (en) * 2021-02-24 2021-05-25 吉林大学 Knowledge graph complementing method and device, storage medium and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A survey of entity relation extraction methods (实体关系抽取方法研究综述); 李冬梅; 张扬; 李东远; 林丹琼; Journal of Computer Research and Development (计算机研究与发展) (07); pp. 80-104 *

Also Published As

Publication number Publication date
CN113282762A (en) 2021-08-20

Similar Documents

Publication Publication Date Title
CN113282762B (en) Knowledge graph construction method, knowledge graph construction device, electronic equipment and storage medium
Wijeratne et al. Emojinet: An open service and api for emoji sense discovery
US9361587B2 (en) Authoring system for bayesian networks automatically extracted from text
CN104361127B (en) The multilingual quick constructive method of question and answer interface based on domain body and template logic
US20200334381A1 (en) Systems and methods for natural pseudonymization of text
JP5936698B2 (en) Word semantic relation extraction device
CN110457676B (en) Evaluation information extraction method and device, storage medium and computer equipment
KR102491172B1 (en) Natural language question-answering system and learning method
CN111143884A (en) Data desensitization method and device, electronic equipment and storage medium
CN111194401B (en) Abstraction and portability of intent recognition
CN108228701A (en) A kind of system for realizing Chinese near-nature forest language inquiry interface
CN112149427B (en) Verb phrase implication map construction method and related equipment
Gómez-Adorno et al. A graph based authorship identification approach
CN111651569B (en) Knowledge base question-answering method and system in electric power field
CN112507089A (en) Intelligent question-answering engine based on knowledge graph and implementation method thereof
Hwang et al. KoRASA: Pipeline Optimization for Open‐Source Korean Natural Language Understanding Framework Based on Deep Learning
De Melo et al. UWN: A large multilingual lexical knowledge base
CN113515630B (en) Triplet generation and verification method and device, electronic equipment and storage medium
EP3901875A1 (en) Topic modelling of short medical inquiries
KR102206742B1 (en) Method and apparatus for representing lexical knowledge graph from natural language text
Liu et al. Unsupervised knowledge graph generation using semantic similarity matching
Iftikhar et al. Domain specific query generation from natural language text
Vu-Manh et al. Improving Vietnamese dependency parsing using distributed word representations
US11017172B2 (en) Proposition identification in natural language and usage thereof for search and retrieval
CN111199170B (en) Formula file identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant