CN111061882A - Knowledge graph construction method - Google Patents

Knowledge graph construction method

Info

Publication number
CN111061882A
CN111061882A
Authority
CN
China
Prior art keywords
class
corpus
phrase
noun
relationship
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910766428.XA
Other languages
Chinese (zh)
Inventor
金耀初
何卫灵
刘华
张宏辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Liko Technology Co Ltd
Original Assignee
Guangzhou Liko Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Liko Technology Co Ltd filed Critical Guangzhou Liko Technology Co Ltd
Priority to CN201910766428.XA
Publication of CN111061882A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Abstract

The invention relates to the technical field of natural language processing, and in particular to a knowledge graph construction method comprising the following steps: step S1: obtaining a corpus set; step S2: preprocessing the corpus set; step S3: converting the corpus set and storing it in a database; step S4: constructing a knowledge graph from the database. Compared with existing knowledge graph construction methods, the constructed knowledge graph is of higher quality.

Description

Knowledge graph construction method
Technical Field
The invention relates to the technical field of natural language processing, in particular to a knowledge graph construction method.
Background
Natural language refers to the language people use in daily life, such as Chinese, English and French. It evolved naturally with the development of human society rather than being artificially designed, and it is an important tool for human study and life. In general, natural language refers to a convention of human society, as distinguished from artificial languages such as programming languages.
Natural Language Processing (NLP) refers to the use of computers to process the form, sound and meaning of natural language, that is, to input, output, recognize, analyze, understand and generate characters, words, sentences and discourse. Concrete forms of natural language processing include machine translation, text summarization, text classification, text proofreading, information extraction, speech synthesis, speech recognition and so on. In short, natural language processing means handling natural language with a computer, and its mechanism involves two processes: natural language understanding and natural language generation.
In today's society, with the development of information technology and the popularization of the internet, big data, cloud computing and artificial intelligence have become hot topics in academia. Natural language processing is one of the most difficult problems in artificial intelligence; how to realize information exchange between humans and machines and how to intelligently screen and process massive data are key technical breakthroughs for artificial intelligence, computer science and linguistics. Because of the specificity and complexity of human language, having machines understand it is a difficult task. In particular, machine understanding of Chinese is far more complex than understanding of English. Therefore, enabling machines to analyze Chinese well has become an unavoidable problem in the field of artificial intelligence.
A knowledge graph is a form and specification of knowledge organization that takes Natural Language Processing (NLP) as its core and combines techniques from applied mathematics, graphics and information visualization. In a knowledge graph, each node represents an "entity" existing in the real world, and each edge is a "relationship" between entities; the knowledge graph is the most effective way to represent relationships. An entity generally refers to a noun phrase or verb phrase in text with a specific meaning or strong referentiality, typically including person names, place names, organization names, times, proper nouns and the like. In general, a knowledge graph is a relational network obtained by connecting all kinds of heterogeneous information, and it provides the ability to analyze problems from the perspective of "relationships". Knowledge graphs have recently seen mature applications in many artificial intelligence industries, such as search engines, chat robots, intelligent medicine and intelligent hardware. Although knowledge graphs are widely applied, current knowledge graph construction methods are not mature, and the drawbacks of manual construction and low data quality remain. Therefore, a method for constructing a higher-quality knowledge graph is needed.
Disclosure of Invention
In order to solve the above problems, the present invention provides a method for constructing a knowledge graph, which can construct a higher quality knowledge graph.
The technical scheme adopted by the invention is as follows:
a method of knowledge-graph construction, the method comprising:
step S1: obtaining a corpus set;
step S2: preprocessing a corpus;
step S3: converting the corpus and storing the corpus into a database;
step S4: and constructing a knowledge graph according to the database.
A corpus set is a collection of linguistic materials; linguistic materials are the basic units of a corpus and take the form of text. The corpus of this scheme is acquired from the internet in the following way: a target web page is selected, the page is parsed as a document, all data are converted into text, the converted page is traversed to obtain all the text data it contains, and finally a tuple set "element" is established to store all the obtained text data. A web page converted into text contains not only textual content but also html tags, comments and the like, and the tags carry information such as text styles. The tuple set stores the html content and the html tags separately for the convenience of subsequent reading and parsing. After the corpus set is obtained, it is preprocessed to reduce its noise. Because triples offer high storage performance in a database, the preprocessed corpus set is converted into triples and then stored in the database. Finally, a knowledge graph is constructed from the data stored in the database. Compared with existing knowledge graph construction methods, the constructed knowledge graph is of higher quality.
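As a minimal sketch of this acquisition step (assuming the requests and BeautifulSoup libraries are used; the URL and the (text, tag) tuple layout are illustrative, since the patent does not fix a concrete format), the page can be downloaded, converted to text, traversed, and stored as a tuple set:

import requests
from bs4 import BeautifulSoup

def fetch_page_tuples(url):
    # Download the target web page and parse it as a document.
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Traverse the parsed page and store each piece of text together with its html tag,
    # keeping content and tag separate, as the tuple set "element" does.
    element = []
    for text_node in soup.find_all(string=True):
        text = text_node.strip()
        if text and text_node.parent.name not in ("script", "style"):
            element.append((text, text_node.parent.name))
    return element

# usage example (the URL is illustrative):
# element = fetch_page_tuples("https://example.com/target-page")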
Further, the step S2 includes:
step S2.1: cleaning the corpus;
step S2.2: using a word segmentation tool to segment words of the cleaned corpus;
step S2.3: performing part-of-speech tagging on the corpus after word segmentation;
step S2.4: analyzing the part-of-speech labeled corpus through a dependency syntax analyzer;
step S2.5: and extracting noun phrases from the parsed corpus set, and establishing a noun phrase set.
In this scheme, cleaning the corpus set includes deletion cleaning and regular-expression matching cleaning. After the corpus set is obtained, because it necessarily contains unnecessary information, it must be filtered to remove useless content, for example by deleting advertisements, useless links, html comments and the like; then a regular expression describing the keywords is written, the deletion-cleaned corpus is matched against the regular expression, and the parts that do not match are filtered out. After the useful text content has been extracted, an applicable tag set is selected, for example a daily-newspaper annotated corpus for policy texts. A word segmentation tool is used to segment the text according to the meanings of its words, and each word is then given the corresponding part-of-speech tag. The corpus set is syntactically analyzed according to the parts of speech; syntactic analysis can identify the syntactic components contained in a sentence and the relations among those components, so the sentence is understood more deeply in context, ambiguity between words is eliminated, and the information error rate is lower. The syntactic analysis of this scheme is dependency syntax analysis, whose form is simple and convenient to apply. Because a deep learning method based on a labeled data set is not adopted, and training is instead performed with a traditional unsupervised learning approach, a large amount of manual data labeling is not needed; the method is efficient and fast, avoids errors caused by a poor-quality labeled data set, and saves considerable manpower and financial resources. Finally, noun phrases are accurately extracted from the corpus set analyzed by the dependency syntax analyzer, and a noun phrase set is established.
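A minimal sketch of the cleaning, word segmentation and part-of-speech tagging steps, using jieba as an example segmentation tool (the patent does not name a specific tool); the regular expressions and the keyword pattern are illustrative assumptions:

import re
import jieba.posseg as pseg

def clean_and_tag(raw_text, keyword_pattern):
    # Deletion cleaning: remove html comments, html tags and similar useless content.
    text = re.sub(r"<!--.*?-->", "", raw_text, flags=re.S)
    text = re.sub(r"<[^>]+>", "", text)
    # Regular-expression matching cleaning: keep only the parts that match the keywords.
    sentences = [s for s in re.split(r"[。！？\n]", text) if re.search(keyword_pattern, s)]
    # Word segmentation plus part-of-speech tagging for each retained sentence.
    return [[(p.word, p.flag) for p in pseg.cut(s)] for s in sentences]

# usage example (the keyword pattern is illustrative):
# tagged = clean_and_tag(page_text, r"知识图谱|语料")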
Further, the ways of extracting noun phrases in step S2.5 include:
(1) extraction according to the modifier-head ("centering") relation structure within a phrase;
(2) extraction of nouns of a certain length that are not in a modifier-head relation.
Specifically, the modifier-head relation in (1) is commonly used to describe compound vocabulary: the modifier can restrict the head word in terms of number, place, scope, characteristics and quality, forming restrictive modifiers, descriptive modifiers and other modifier-head phrases, so that, according to the result of the syntactic analysis, the words in the modifier-head relation can be combined to restore the phrase semantics. Because there is a combination order among the modifier-head words, this scheme also uses binary insertion to search for and insert the word ids, which improves the extraction speed and ensures the completeness of the phrases. (2) is a supplement to (1), because the word segmentation tool can sometimes directly recognize relatively long noun phrases as nouns or proper nouns.
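A minimal sketch of combining words linked by the modifier-head dependency relation into a noun phrase, with binary insertion of the word ids to preserve the original word order; the (token, word_id, head_id, relation) input format and the "ATT" relation label are assumptions, since the patent does not fix a particular parser output:

import bisect

def extract_np_by_modifier_head(parsed_words):
    # parsed_words: list of (token, word_id, head_id, relation) from the dependency parser.
    tokens = {wid: tok for tok, wid, _, _ in parsed_words}
    groups = {}                                   # head word id -> ordered ids of the phrase
    for tok, wid, head, rel in parsed_words:
        if rel == "ATT":                          # modifier-head relation
            member_ids = groups.setdefault(head, [head])
            bisect.insort(member_ids, wid)        # binary insertion keeps the word order
    # Concatenate the ordered members of each group to restore the phrase semantics.
    return ["".join(tokens[i] for i in ids) for ids in groups.values()]

# e.g. the segments "华南 / 理工 / 大学" with two ATT arcs onto "大学" are restored as "华南理工大学"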
Further, the step S3 includes: converting the corpus set into triples of the form (Field, Predicate, Value);
wherein Field is the name of the data column, Value is the value corresponding to Field, and Predicate is the relationship between Field and Value;
the Value comprises a number class, an address class and a noun class.
The method described above extracts all possible noun phrases from the corpus set; however, not every noun phrase belongs to an entity. Therefore, the resulting noun phrase set needs to be further screened before it can be stored in the database. The screening method adopted by this scheme is a seed-library template method. The specific process is as follows: a database is constructed and its data are stored as triples; the existing data of the database are taken as a template set; the corpus set is converted into triples, the triples are matched against the template set, and a triple is stored in the database only if it matches.
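A minimal sketch of this seed-library template matching under an assumed, simplified matching criterion (a candidate is kept when its Field and Predicate already occur in the template set); the example triples are purely illustrative:

def match_against_templates(candidate_triples, template_triples):
    # Template set: (Field, Predicate, Value) triples already stored in the database.
    templates = {(field, predicate) for field, predicate, _ in template_triples}
    # Keep a candidate only when its Field and Predicate appear in the template set.
    return [t for t in candidate_triples if (t[0], t[1]) in templates]

# usage example (triples are illustrative):
# templates = [("学费", "大于", "5000"), ("校区", "位于", "广州")]
# kept = match_against_templates([("学费", "大于", "6800")], templates)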
Further, the process of acquiring the Field includes:
step S3.11: acquiring an existing data set of a database as a phrase template set, and converting the phrase template set into a sentence vector by using a language model pre-trained by BERT;
step S3.12: converting noun phrase sets into sentence vectors by using a language model pre-trained by BERT;
step S3.13: calculating the distance between the phrases of the two data sets of the phrase template set and the noun phrase set;
step S3.14: if the similarity meets a certain threshold, the phrases of the noun phrase set are classified as a phrase class of the phrase template set;
the calculation formula of the similarity is as follows:
S_similar = cos(vec_1, vec_2) = (vec_1 · vec_2) / (‖vec_1‖ ‖vec_2‖)
wherein S_similar represents the similarity of the two phrases, the cosine value ranges over [-1, 1], and vec_1 and vec_2 are the outputs of the penultimate layer of the BERT model, i.e., 768-dimensional vectors.
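A minimal sketch of obtaining the 768-dimensional sentence vectors from the penultimate layer of a pre-trained BERT model and computing S_similar; the model name bert-base-chinese and the mean pooling over token positions are assumptions, since the patent only specifies that the penultimate-layer output is used:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese", output_hidden_states=True)

def sentence_vector(phrase):
    # Take the penultimate hidden layer, averaged over tokens, as the 768-dimensional vector.
    inputs = tokenizer(phrase, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states
    return hidden_states[-2].mean(dim=1).squeeze(0)

def similarity(phrase_a, phrase_b):
    # Cosine similarity S_similar in [-1, 1] between the two sentence vectors.
    v1, v2 = sentence_vector(phrase_a), sentence_vector(phrase_b)
    return torch.nn.functional.cosine_similarity(v1, v2, dim=0).item()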
Further, the obtaining process of Value includes:
step S3.21: identification of a number class;
performing syntactic analysis on the sentence through the dependency syntax analyzer, analyzing whether the part of speech of each word is a numeral or a quantifier and the word distance between the quantifier and the numeral, and thereby identifying whether the value belongs to the number class (see the sketch below);
step S3.22: identifying an address class;
training on a corpus that has undergone word segmentation and entity labeling, where the trained model is a word-based BiLSTM-CRF model;
step S3.23: identifying the noun class;
converting a phrase template set and a noun phrase set into sentence vectors by using a language model pre-trained by BERT;
calculating the distance between the phrases of the phrase template set and of the noun phrase set; if any column of the phrase template set contains a word whose similarity meets the threshold, the phrase of the current noun phrase set is classified into the phrase class of that column of the phrase template set.
Because the values in the database do not necessarily all exist in the form of noun phrases (they may also be sentences), noun phrase extraction is also required for this content. In the Value acquisition process, the phrase template set in the database meets the requirements of an entity, but because the database contains a large amount of content, there is considerable noise, so it must be clustered to eliminate irrelevant noun phrases. Most noun phrases of the same class should be semantically close and reflect the information of the column, whereas, when noisy data are present, the noise words will not necessarily be similar to each other. Taking the noun phrase set acquired from a certain column of the database as an example, a counting method records, for each noun phrase, the number of other noun phrases similar to it, so that the minority of noun phrases with low counts can be removed.
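A minimal sketch of the number-class check in step S3.21: after part-of-speech tagging, a value is treated as number class when a numeral and a quantifier occur within a small word distance of each other; the tag prefixes m (numeral) and q (quantifier) follow common Chinese POS tag sets, and the distance of 2 is an assumed threshold:

def is_number_class(tagged_words, max_distance=2):
    # tagged_words: list of (word, pos_tag) pairs from the segmentation / POS tagging step.
    numeral_positions = [i for i, (_, tag) in enumerate(tagged_words) if tag.startswith("m")]
    quantifier_positions = [i for i, (_, tag) in enumerate(tagged_words) if tag.startswith("q")]
    # Number class when some numeral and some quantifier are close enough together.
    return any(abs(i - j) <= max_distance
               for i in numeral_positions for j in quantifier_positions)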
Further, the relationship categories among the number class, the address class and the noun class include: greater than, less than, located at, yes, no; the relationship classes carry relationship constraints: greater than and less than are used only to describe the number class; located at is used to describe the address class; yes and no are used to describe the remaining classes.
Specifically, the entities are in the triples, and the relationships between entities are: greater than, less than, located at, yes, no. The entity Value is divided into three types of entity: number class, address class and noun class. The relationship categories between these three types of entity carry relationship constraints: greater than and less than are used only to describe the number class; located at is used to describe the address class; yes and no are used to describe the remaining classes.
Further, the number class, the address class and the noun class have template vocabularies; the template vocabularies are used to describe the relationship classes, and the process of identifying a relationship class is as follows:
(1) defining a relationship type and a corresponding relationship constraint;
(2) defining a template vocabulary corresponding to the relation category;
(3) on the premise that the relationship constraint is satisfied, the noun phrases are checked against the template vocabularies; a noun phrase that does not satisfy them or cannot be identified is classified into a new large class.
(4) words expressing the opposite (antonymous) meaning are treated as a new class.
First, the relationship classes between entities are defined and relationship constraints are added to them. Then, a template vocabulary for describing each relationship class is defined. Finally, on the premise that the relationship class of an entity conforms to its relationship constraint, the relationship class of the entity is identified through the template vocabulary; if the template vocabulary cannot identify the relationship class of an entity, the entity is classified into a new class, and vocabulary expressing the opposite meaning of a class (for example, "not greater than" or "not less than") is additionally treated as a new class. If performance is poor in the future, template words can be added or deleted, which makes it convenient to improve accuracy and ensures the robustness and extensibility of the algorithm.
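A minimal sketch of identifying the relationship class through the relationship constraints and an externally configured template vocabulary; the concrete template words and the class names "number", "address" and "noun" are illustrative assumptions:

# Which relationship may describe which entity class (the relationship constraints above).
CONSTRAINTS = {"greater than": {"number"}, "less than": {"number"},
               "located at": {"address"}, "yes": {"noun"}, "no": {"noun"}}
# Externally configured template vocabulary describing each relationship class.
TEMPLATES = {"greater than": ["大于", "超过"], "less than": ["小于", "低于"],
             "located at": ["位于", "在"], "yes": ["是"], "no": ["不是", "否"]}

def identify_relation(phrase, value_class):
    for relation, words in TEMPLATES.items():
        # Only test relationships whose constraint allows this entity class.
        if value_class in CONSTRAINTS[relation] and any(w in phrase for w in words):
            return relation
    return "new class"       # not satisfied or not recognizable: put into a new large class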
Further, the triples have inter-triple relationship constraints, which can be identified by constructing a classification model; the classification model distinguishes the types "and" and "or", where the "and" type is defined as 1 and the "or" type as 0.
Further, the cross entropy loss function of the classification model is:
J = -[y·log(p) + (1-y)·log(1-p)]
where y represents label of the sample and p represents the probability that the sample is predicted to be positive.
Specifically, more than one triple may be extracted from a single sentence; to further describe the relationship between the triples and thus sufficiently restore the semantics of the sentence, a classification model needs to be constructed to determine the relationship between the triples.
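A minimal sketch of such a binary classifier for the "and"/"or" relationship between two triples, trained with the cross entropy loss J above; using the concatenated 768-dimensional sentence vectors of the two triples as input is an assumption, since the patent does not specify the features:

import torch
import torch.nn as nn

class TripleRelationClassifier(nn.Module):
    # Output 1 means the "and" relationship between two triples, 0 means "or".
    def __init__(self, dim=768):
        super().__init__()
        self.fc = nn.Linear(2 * dim, 1)    # two concatenated sentence vectors as input
    def forward(self, vec_pair):
        return torch.sigmoid(self.fc(vec_pair)).squeeze(-1)   # p: probability of "and"

# J = -[y*log(p) + (1-y)*log(1-p)] is exactly the binary cross entropy loss:
criterion = nn.BCELoss()
# loss = criterion(model(vec_pair), labels)   # labels: 1.0 for "and", 0.0 for "or"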
Compared with the prior art, the invention has the beneficial effects that:
(1) entity extraction and labeling are performed on the basis of unsupervised learning, which avoids errors caused by a poor-quality labeled data set, so the constructed knowledge graph is of higher quality.
(2) word segmentation and part-of-speech tagging require no manual labeling, which saves considerable manpower and financial resources.
(3) the template vocabulary is configured externally, so it can be modified and configured quickly and flexibly, which enhances the generality and robustness of the algorithm.
Drawings
FIG. 1 is a system requirements diagram of the present invention;
FIG. 2 is a syntactic analysis diagram of the present invention;
fig. 3 is a map visualization diagram of the present invention.
Detailed Description
The drawings are only for purposes of illustration and are not to be construed as limiting the invention. For a better understanding of the following embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
Examples
The embodiment provides a knowledge graph construction method, which comprises the following steps:
step S1: obtaining a corpus set;
step S2: preprocessing a corpus;
step S3: converting the corpus and storing the corpus into a database;
step S4: and constructing a knowledge graph according to the database.
Fig. 1 is a system requirements diagram of the present invention. As shown in the figure, obtaining the text data first corresponds to step S1: obtaining the corpus set. The specific acquisition method is as follows: a target web page is selected from the network, the page is parsed as a document, all data are converted into text, the converted page is traversed to obtain all the text data it contains, and finally a tuple set "element" is established to store all the obtained text data. A web page converted into text contains not only textual content but also html tags, comments and the like, and the tags carry information such as text styles. The tuple set stores the html content and the html tags separately for the convenience of subsequent reading and parsing. Next, preprocessing and cleaning the text corresponds to step S2: the corpus set is preprocessed to reduce the noise of the text. Because triples offer high storage performance in a database, the preprocessed corpus set is converted into triples and stored in the database; the specific process is as follows: entity recognition is performed on the cleaned text, relations are extracted, and finally the triples are constructed and stored in the database. This process corresponds to step S3: converting the corpus set and storing it in the database.
A knowledge graph can then be constructed from the triples stored in the database. Compared with existing knowledge graph construction methods, the constructed knowledge graph is of higher quality. Fig. 3 is a map visualization diagram of the present invention, showing the constructed graph.
Further, the step S2 includes:
step S2.1: cleaning the corpus;
step S2.2: using a word segmentation tool to segment words of the cleaned corpus;
step S2.3: performing part-of-speech tagging on the corpus after word segmentation;
step S2.4: analyzing the part-of-speech labeled corpus through a dependency syntax analyzer;
step S2.5: and extracting noun phrases from the parsed corpus set, and establishing a noun phrase set.
In this embodiment, cleaning the corpus set includes deletion cleaning and regular-expression matching cleaning. After the corpus set is obtained, because it necessarily contains unnecessary information, it must be filtered to remove useless content, for example by deleting advertisements, useless links, html comments and the like; then a regular expression describing the keywords is written, the deletion-cleaned corpus is matched against the regular expression, and the parts that do not match are filtered out. After the useful text content has been extracted, an applicable tag set is selected, for example a daily-newspaper annotated corpus for policy texts. A word segmentation tool is used to segment the text according to the meanings of its words, and each word is then given the corresponding part-of-speech tag. The corpus set is syntactically analyzed according to the parts of speech; syntactic analysis can identify the syntactic components contained in a sentence and the relations among those components, so the sentence is understood more deeply in context, ambiguity between words is eliminated, and the information error rate is lower. The syntactic analysis of this embodiment is dependency syntax analysis, whose form is simple and convenient to apply. Because a deep learning method based on a labeled data set is not adopted, and training is instead performed with a traditional unsupervised learning approach, a large amount of manual data labeling is not needed; the method is efficient and fast, avoids errors caused by a poor-quality labeled data set, and saves considerable manpower and financial resources. Finally, noun phrases are accurately extracted from the corpus set analyzed by the dependency syntax analyzer, and a noun phrase set is established.
Further, the ways of extracting noun phrases in step S2.5 include:
(1) extraction according to the modifier-head ("centering") relation structure within a phrase;
(2) extraction of nouns of a certain length that are not in a modifier-head relation.
Specifically, (1) the modifier-head relation within phrases is commonly used to describe compound vocabulary: the modifier can restrict the head word in terms of number, place, scope, characteristics and quality, forming restrictive modifiers, descriptive modifiers and other modifier-head phrases. Fig. 2 is a syntactic analysis diagram of the present invention; as shown in the figure, simply using a word segmentation tool would segment "South China University of Technology" into "South China", "Technology" and "University", whereas combining the words in the modifier-head relation according to the result of the syntactic analysis restores the phrase semantics. Because there is a combination order among the modifier-head words, this embodiment uses binary insertion to search for and insert the word ids, which improves the extraction speed and ensures the completeness of the phrases. (2) is a supplement to (1), because the word segmentation tool can sometimes directly recognize relatively long noun phrases as nouns or proper nouns.
Further, the step S3 includes: converting the corpus set into triples of the form (Field, Predicate, Value);
wherein Field is the name of the data column, Value is the value corresponding to Field, and Predicate is the relationship between Field and Value;
the Value comprises a number class, an address class and a noun class.
The method described above extracts all possible noun phrases from the corpus set; however, not every noun phrase belongs to an entity. Therefore, the resulting noun phrase set needs to be further screened before it can be stored in the database. The screening method used in this embodiment is a seed-library template method. The specific process is as follows: a database is constructed and its data are stored as triples; the existing data of the database are taken as a template set; the corpus set is converted into triples, the triples are matched against the template set, and a triple is stored in the database only if it matches.
Further, the process of acquiring the Field includes:
step S3.11: acquiring an existing data set of a database as a phrase template set, and converting the phrase template set into a sentence vector by using a language model pre-trained by BERT;
step S3.12: converting noun phrase sets into sentence vectors by using a language model pre-trained by BERT;
step S3.13: calculating the distance between the phrases of the two data sets of the phrase template set and the noun phrase set;
step S3.14: if the similarity meets a certain threshold, the phrases of the noun phrase set are classified as a phrase class of the phrase template set;
the calculation formula of the similarity is as follows:
S_similar = cos(vec_1, vec_2) = (vec_1 · vec_2) / (‖vec_1‖ ‖vec_2‖)
wherein S_similar represents the similarity of the two phrases, the cosine value ranges over [-1, 1], and vec_1 and vec_2 are the outputs of the penultimate layer of the BERT model, i.e., 768-dimensional vectors.
Further, the obtaining process of Value includes:
step S3.21: identification of a number class;
performing syntactic analysis on the sentence through the dependency syntax analyzer, analyzing whether the part of speech of each word is a numeral or a quantifier and the word distance between the quantifier and the numeral, and thereby identifying whether the value belongs to the number class;
step S3.22: identifying an address class;
training on a corpus that has undergone word segmentation and entity labeling, where the trained model is a word-based BiLSTM-CRF model;
the first layer of the model is a look-up layer, in which each word of a sentence is mapped to its embedding vector; the second layer is a bidirectional LSTM, which fully extracts sentence features; the third layer is a CRF layer, which performs sequence labeling prediction on the sentence in combination with the training labels (see the sketch after step S3.23).
Step S3.23: identifying the noun class;
converting a phrase template set and a noun phrase set into sentence vectors by using a language model pre-trained by BERT;
calculating the distance between the phrases of the phrase template set and of the noun phrase set; if any column of the phrase template set contains a word whose similarity meets the threshold, the phrase of the current noun phrase set is classified into the phrase class of that column of the phrase template set.
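A minimal sketch of the three layers described above (look-up layer, bidirectional LSTM, CRF layer) for the address-class recognizer, using the pytorch-crf package for the CRF layer; the vocabulary size, tag set and dimensions are illustrative assumptions:

import torch.nn as nn
from torchcrf import CRF            # pytorch-crf package

class BiLstmCrf(nn.Module):
    def __init__(self, vocab_size, num_tags, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)            # look-up layer
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2,
                            bidirectional=True, batch_first=True)       # bidirectional LSTM
        self.hidden2tag = nn.Linear(hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)                      # CRF layer

    def loss(self, token_ids, tags, mask):
        emissions = self.hidden2tag(self.lstm(self.embedding(token_ids))[0])
        return -self.crf(emissions, tags, mask=mask)                    # negative log-likelihood

    def decode(self, token_ids, mask):
        emissions = self.hidden2tag(self.lstm(self.embedding(token_ids))[0])
        return self.crf.decode(emissions, mask=mask)                    # predicted tag sequences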
Because the values in the database do not necessarily all exist in the form of noun phrases (they may also be sentences), noun phrase extraction is also required for this content. In the Value acquisition process, the phrase template set in the database meets the requirements of an entity, but because the database contains a large amount of content, there is considerable noise, so it must be clustered to eliminate irrelevant noun phrases. Most noun phrases of the same class should be semantically close and reflect the information of the column, whereas, when noisy data are present, the noise words will not necessarily be similar to each other. Taking the noun phrase set acquired from a certain column of the database as an example, a counting method records, for each noun phrase, the number of other noun phrases similar to it, so that the minority of noun phrases with low counts can be removed.
A runnable sketch of this counting-based filtering follows (the 40% keep ratio comes from the pseudocode; the similarity threshold is left as a parameter, as the pseudocode does not fix it; the vectors are the sentence vectors described above):

import numpy as np

def filter_noun_phrases(phrases, vectors, threshold, keep_ratio=0.4):
    # Record, for each noun phrase, how many other noun phrases are similar to it.
    counts = [0] * len(phrases)
    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            v1, v2 = vectors[i], vectors[j]
            similarity = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
            if similarity >= threshold:
                counts[i] += 1
                counts[j] += 1
    # New noun phrase set: the first 40% of noun phrases, ordered by count from big to small.
    order = sorted(range(len(phrases)), key=lambda k: counts[k], reverse=True)
    return [phrases[k] for k in order[: int(len(phrases) * keep_ratio)]]
Further, the relationship categories among the number class, the address class and the noun class include: greater than, less than, located at, yes, no; the relationship classes carry relationship constraints: greater than and less than are used only to describe the number class; located at is used to describe the address class; yes and no are used to describe the remaining classes.
Specifically, the entities are in the triples, and the relationships between entities are: greater than, less than, located at, yes, no. The entity Value is divided into three types of entity: number class, address class and noun class. The relationship categories between these three types of entity carry relationship constraints: greater than and less than are used only to describe the number class; located at is used to describe the address class; yes and no are used to describe the remaining classes.
Further, the number class, the address class and the noun class have template vocabularies; the template vocabularies are used to describe the relationship classes, and the process of identifying a relationship class is as follows:
(1) defining a relationship type and a corresponding relationship constraint;
(2) defining a template vocabulary corresponding to the relation category;
(3) on the premise that the relationship constraint is satisfied, the noun phrases are checked against the template vocabularies; a noun phrase that does not satisfy them or cannot be identified is classified into a new large class.
(4) words expressing the opposite (antonymous) meaning are treated as a new class.
First, the relationship classes between entities are defined and relationship constraints are added to them. Then, a template vocabulary for describing each relationship class is defined. Finally, on the premise that the relationship class of an entity conforms to its relationship constraint, the relationship class of the entity is identified through the template vocabulary; if the template vocabulary cannot identify the relationship class of an entity, the entity is classified into a new class, and vocabulary expressing the opposite meaning of a class (for example, "not greater than" or "not less than") is additionally treated as a new class. If performance is poor in the future, template words can be added or deleted, which makes it convenient to improve accuracy and ensures the robustness and extensibility of the algorithm.
Further, the triples have inter-triple relationship constraints, which can be identified by constructing a classification model; the classification model distinguishes the types "and" and "or", where the "and" type is defined as 1 and the "or" type as 0.
Further, the cross entropy loss function of the classification model is:
J = -[y·log(p) + (1-y)·log(1-p)]
where y represents label of the sample and p represents the probability that the sample is predicted to be positive.
Specifically, more than one triple may be extracted from a single sentence; to further describe the relationship between the triples and thus sufficiently restore the semantics of the sentence, a classification model needs to be constructed to determine the relationship between the triples.
It should be understood that the above-mentioned embodiments of the present invention are only examples for clearly illustrating the technical solutions of the present invention, and are not intended to limit the specific embodiments of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention claims should be included in the protection scope of the present invention claims.

Claims (10)

1. A method of knowledge graph construction, the method comprising:
step S1: obtaining a corpus set;
step S2: preprocessing a corpus;
step S3: converting the corpus and storing the corpus into a database;
step S4: and constructing a knowledge graph according to the database.
2. The method of knowledge-graph construction according to claim 1, wherein the step S2 includes:
step S2.1: cleaning the corpus;
step S2.2: using a word segmentation tool to segment words of the cleaned corpus;
step S2.3: performing part-of-speech tagging on the corpus after word segmentation;
step S2.4: analyzing the part-of-speech labeled corpus through a dependency syntax analyzer;
step S2.5: and extracting noun phrases from the parsed corpus set, and establishing a noun phrase set.
3. The method of claim 2, wherein the step S2.5 of extracting noun phrases comprises:
(1) extracting according to a centering relation structure in the phrase;
(2) a noun of a certain length under a non-centered relationship is extracted.
4. The method of knowledge-graph construction according to claim 2, wherein the step S3 includes: the corpus is converted into a triple, and the triple is (Field, Predicate, Value);
wherein Field is the name of the data column, Value is the Value corresponding to Field, and Predicate is the relationship between Field and Value;
the Value comprises a number class, an address class and a nominal class.
5. The method for constructing a knowledge graph according to claim 4, wherein the Field acquisition process comprises: step S3.11: acquiring an existing data set of a database as a phrase template set, and converting the phrase template set into a sentence vector by using a language model pre-trained by BERT;
step S3.12: converting noun phrase sets into sentence vectors by using a language model pre-trained by BERT;
step S3.13: calculating the distance between the phrases of the two data sets of the phrase template set and the noun phrase set;
step S3.14: if the similarity meets a certain threshold, the phrases of the noun phrase set are classified as a current phrase class of the phrase template set;
the calculation formula of the similarity is as follows:
S_similar = cos(vec_1, vec_2) = (vec_1 · vec_2) / (‖vec_1‖ ‖vec_2‖)
wherein S_similar represents the similarity of the two phrases, the cosine value ranges over [-1, 1], and vec_1 and vec_2 are the outputs of the penultimate layer of the BERT model, i.e., 768-dimensional vectors.
6. The method of claim 4, wherein the Value obtaining process comprises:
step S3.21: identification of a number class;
performing syntactic analysis on the sentence through the dependency syntax analyzer, analyzing whether the part of speech of each word is a numeral or a quantifier and the word distance between the quantifier and the numeral, and thereby identifying whether the value belongs to the number class;
step S3.22: identifying an address class;
training on a corpus that has undergone word segmentation and entity labeling, wherein the trained model is a word-based BiLSTM-CRF model;
step S3.23: identifying a part of speech class;
converting a phrase template set and a noun phrase set into sentence vectors by using a language model pre-trained by BERT;
and calculating the distance between the phrases of the two data sets of the phrase template set and the noun phrase set, and if one word meeting a threshold value exists in any one column of the phrase template set, classifying the phrases of the current noun phrase set into a phrase class of the current column of the phrase template set.
7. The method of claim 4, wherein the relationship between the number class, the address class and the part of speech class comprises: greater than, less than, at, yes, no; the relationship class has a relationship constraint of: greater than, less than are used to describe the numerical class only; is located in the description address class; yes, no are used to describe the remaining categories.
8. The method of claim 7, wherein the numeric class, the address class and the nominal class have a template vocabulary, the template vocabulary is used for describing the relationship class, and the process of identifying the relationship class is as follows:
(1) defining a relationship type and a corresponding relationship constraint;
(2) defining a template vocabulary corresponding to the relation category;
(3) on the premise of ensuring the relation constraint, the noun phrases are checked and judged through the template vocabularies, and if the noun phrases are not satisfied or cannot be identified, the noun phrases are classified into a new large class.
(4) Words with antisense, treated as a new class.
9. The method for constructing a knowledge graph according to claim 4, wherein the triples have an intergroup relationship constraint, and the intergroup relationship constraint can be identified by constructing a classification model, and the identification type of the classification model is and or, the and class is defined as 1, and the or class is defined as 0.
10. The method of constructing a knowledge graph according to claim 9, wherein the cross entropy loss function of the classification model is:
J = -[y·log(p) + (1-y)·log(1-p)]
where y represents label of the sample and p represents the probability that the sample is predicted to be positive.
CN201910766428.XA 2019-08-19 2019-08-19 Knowledge graph construction method Pending CN111061882A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910766428.XA CN111061882A (en) 2019-08-19 2019-08-19 Knowledge graph construction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910766428.XA CN111061882A (en) 2019-08-19 2019-08-19 Knowledge graph construction method

Publications (1)

Publication Number Publication Date
CN111061882A true CN111061882A (en) 2020-04-24

Family

ID=70297442

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910766428.XA Pending CN111061882A (en) 2019-08-19 2019-08-19 Knowledge graph construction method

Country Status (1)

Country Link
CN (1) CN111061882A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108664615A (en) * 2017-05-12 2018-10-16 华中师范大学 A kind of knowledge mapping construction method of discipline-oriented educational resource
CN107908671A (en) * 2017-10-25 2018-04-13 南京擎盾信息科技有限公司 Knowledge mapping construction method and system based on law data
CN110019839A (en) * 2018-01-03 2019-07-16 中国科学院计算技术研究所 Medical knowledge map construction method and system based on neural network and remote supervisory
CN109885698A (en) * 2019-02-13 2019-06-14 北京航空航天大学 A kind of knowledge mapping construction method and device, electronic equipment

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737496A (en) * 2020-06-29 2020-10-02 东北电力大学 Power equipment fault knowledge map construction method
CN111951079A (en) * 2020-08-14 2020-11-17 国网电子商务有限公司 Credit rating method and device based on knowledge graph and electronic equipment
CN111967761A (en) * 2020-08-14 2020-11-20 国网电子商务有限公司 Monitoring and early warning method and device based on knowledge graph and electronic equipment
CN111967761B (en) * 2020-08-14 2024-04-02 国网数字科技控股有限公司 Knowledge graph-based monitoring and early warning method and device and electronic equipment
CN111951079B (en) * 2020-08-14 2024-04-02 国网数字科技控股有限公司 Credit rating method and device based on knowledge graph and electronic equipment
CN112163097A (en) * 2020-09-23 2021-01-01 中国电子科技集团公司第十五研究所 Military knowledge graph construction method and system
CN112487206B (en) * 2020-12-09 2022-09-20 中国电子科技集团公司第三十研究所 Entity relationship extraction method for automatically constructing data set
CN112487206A (en) * 2020-12-09 2021-03-12 中国电子科技集团公司第三十研究所 Entity relationship extraction method for automatically constructing data set
CN113657111A (en) * 2021-07-30 2021-11-16 上海明略人工智能(集团)有限公司 Address recognition method, system, storage medium and electronic device
CN113554175B (en) * 2021-09-18 2021-11-26 平安科技(深圳)有限公司 Knowledge graph construction method and device, readable storage medium and terminal equipment
CN113554175A (en) * 2021-09-18 2021-10-26 平安科技(深圳)有限公司 Knowledge graph construction method and device, readable storage medium and terminal equipment
CN114462400A (en) * 2021-12-31 2022-05-10 深圳市东信时代信息技术有限公司 Directional package script generation method, device, equipment and storage medium
CN116720819A (en) * 2023-08-10 2023-09-08 福建省闽清双棱纸业有限公司 Impregnated paper raw material management system integrating knowledge graph and neural network
CN116720819B (en) * 2023-08-10 2023-10-27 福建省闽清双棱纸业有限公司 Impregnated paper raw material management system integrating knowledge graph and neural network

Similar Documents

Publication Publication Date Title
CN111061882A (en) Knowledge graph construction method
CN106776562B (en) Keyword extraction method and extraction system
CN111737496A (en) Power equipment fault knowledge map construction method
CN109145260B (en) Automatic text information extraction method
CN110609983B (en) Structured decomposition method for policy file
CN107562919B (en) Multi-index integrated software component retrieval method and system based on information retrieval
CN110765277B (en) Knowledge-graph-based mobile terminal online equipment fault diagnosis method
CN113486667B (en) Medical entity relationship joint extraction method based on entity type information
CN112541337B (en) Document template automatic generation method and system based on recurrent neural network language model
CN113569050B (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
CN114254653A (en) Scientific and technological project text semantic extraction and representation analysis method
CN113168499A (en) Method for searching patent document
CN113191148A (en) Rail transit entity identification method based on semi-supervised learning and clustering
CN113196277A (en) System for retrieving natural language documents
CN113312922B (en) Improved chapter-level triple information extraction method
CN113196278A (en) Method for training a natural language search system, search system and corresponding use
CN111444704A (en) Network security keyword extraction method based on deep neural network
CN114970536A (en) Combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition
CN111178080A (en) Named entity identification method and system based on structured information
CN114579695A (en) Event extraction method, device, equipment and storage medium
CN116628173B (en) Intelligent customer service information generation system and method based on keyword extraction
CN113159969A (en) Financial long text rechecking system
Oo et al. An analysis of ambiguity detection techniques for software requirements specification (SRS)
CN112380848A (en) Text generation method, device, equipment and storage medium
CN115618883A (en) Business semantic recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination