CN111061882A - Knowledge graph construction method - Google Patents

Knowledge graph construction method

Info

Publication number
CN111061882A
CN111061882A
Authority
CN
China
Prior art keywords
class
corpus
phrase
noun
relationship
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910766428.XA
Other languages
Chinese (zh)
Inventor
金耀初
何卫灵
刘华
张宏辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Liko Technology Co Ltd
Original Assignee
Guangzhou Liko Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Liko Technology Co Ltd filed Critical Guangzhou Liko Technology Co Ltd
Priority to CN201910766428.XA
Publication of CN111061882A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Abstract

The invention relates to the technical field of natural language processing, and in particular to a knowledge graph construction method comprising the following steps: step S1: obtaining a corpus set; step S2: preprocessing the corpus set; step S3: converting the corpus set and storing it in a database; step S4: constructing a knowledge graph from the database. Compared with existing knowledge graph construction methods, the constructed knowledge graph is of higher quality.

Description

Knowledge graph construction method
Technical Field
The invention relates to the technical field of natural language processing, in particular to a knowledge graph construction method.
Background
Natural language refers to the language people use in daily life, such as Chinese, English and French. It evolved naturally with the development of human society rather than being artificially designed, and it is an important tool for human study and life. In general, natural language refers to a convention of human society, as distinguished from artificial languages such as programming languages.
Natural Language Processing (NLP) refers to the use of computers to process the form, sound and meaning of natural language, that is, to input, output, recognize, analyze, understand and generate characters, words, sentences and discourse. Concrete forms of natural language processing include machine translation, text summarization, text classification, text proofreading, information extraction, speech synthesis, speech recognition and so on. In short, natural language processing means handling natural language with a computer, and its mechanism involves two processes: natural language understanding and natural language generation.
In today's society, with the development of information technology and the popularization of the internet, big data, cloud computing and artificial intelligence have become hot topics in academia. Natural language processing is one of the most difficult problems in artificial intelligence; how to realize information exchange between humans and machines and how to intelligently screen and process massive data are key technical breakthroughs for artificial intelligence, computer science and linguistics. Because of the specificity and complexity of human language, having machines understand it is a difficult task. In particular, machine understanding of Chinese is far more complex than understanding of English. Therefore, enabling machines to analyze Chinese well has become an unavoidable problem in the field of artificial intelligence.
A knowledge graph is a form and specification of knowledge organization that takes Natural Language Processing (NLP) as its core and combines techniques from applied mathematics, graphics and information visualization. In a knowledge graph, each node represents an "entity" existing in the real world, and each edge is a "relationship" between entities; the knowledge graph is the most effective way to represent relationships. An entity generally refers to a noun phrase or verb phrase in text with a specific meaning or strong referentiality, typically including person names, place names, organization names, times, proper nouns and the like. In general, a knowledge graph is a relational network obtained by connecting all kinds of heterogeneous information, and it provides the ability to analyze problems from the perspective of "relationships". Knowledge graphs have recently seen mature applications in many artificial intelligence industries, such as search engines, chat robots, intelligent medicine and intelligent hardware. Although knowledge graphs are widely applied, current knowledge graph construction methods are not mature, and the drawbacks of manual construction and low data quality remain. Therefore, a method for constructing a higher-quality knowledge graph is needed.
Disclosure of Invention
In order to solve the above problems, the present invention provides a method for constructing a knowledge graph, which can construct a higher quality knowledge graph.
The technical scheme adopted by the invention is as follows:
a method of knowledge-graph construction, the method comprising:
step S1: obtaining a corpus set;
step S2: preprocessing a corpus;
step S3: converting the corpus and storing the corpus into a database;
step S4: and constructing a knowledge graph according to the database.
A corpus set is a collection of linguistic materials; linguistic materials are the basic units of a corpus and take the form of text. The corpus of this scheme is acquired from the internet in the following way: a target web page is selected, the page is parsed as a document, all data are converted into text, the converted page is traversed to obtain all the text data it contains, and finally a tuple set "element" is established to store all the obtained text data. A web page converted into text contains not only textual content but also html tags, comments and the like, and the tags carry information such as text styles. The tuple set stores the html content and the html tags separately for the convenience of subsequent reading and parsing. After the corpus set is obtained, it is preprocessed to reduce its noise. Because triples offer high storage performance in a database, the preprocessed corpus set is converted into triples and then stored in the database. Finally, a knowledge graph is constructed from the data stored in the database. Compared with existing knowledge graph construction methods, the constructed knowledge graph is of higher quality.
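As a minimal sketch of this acquisition step (assuming the requests and BeautifulSoup libraries are used; the URL and the (text, tag) tuple layout are illustrative, since the patent does not fix a concrete format), the page can be downloaded, converted to text, traversed, and stored as a tuple set:

import requests
from bs4 import BeautifulSoup

def fetch_page_tuples(url):
    # Download the target web page and parse it as a document.
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Traverse the parsed page and store each piece of text together with its html tag,
    # keeping content and tag separate, as the tuple set "element" does.
    element = []
    for text_node in soup.find_all(string=True):
        text = text_node.strip()
        if text and text_node.parent.name not in ("script", "style"):
            element.append((text, text_node.parent.name))
    return element

# usage example (the URL is illustrative):
# element = fetch_page_tuples("https://example.com/target-page")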
Further, the step S2 includes:
step S2.1: cleaning the corpus;
step S2.2: using a word segmentation tool to segment words of the cleaned corpus;
step S2.3: performing part-of-speech tagging on the corpus after word segmentation;
step S2.4: analyzing the part-of-speech labeled corpus through a dependency syntax analyzer;
step S2.5: and extracting noun phrases from the parsed corpus set, and establishing a noun phrase set.
In this scheme, cleaning the corpus set includes deletion cleaning and regular-expression matching cleaning. After the corpus set is obtained, because it necessarily contains unnecessary information, it must be filtered to remove useless content, for example by deleting advertisements, useless links, html comments and the like; then a regular expression describing the keywords is written, the deletion-cleaned corpus is matched against the regular expression, and the parts that do not match are filtered out. After the useful text content has been extracted, an applicable tag set is selected, for example a daily-newspaper annotated corpus for policy texts. A word segmentation tool is used to segment the text according to the meanings of its words, and each word is then given the corresponding part-of-speech tag. The corpus set is syntactically analyzed according to the parts of speech; syntactic analysis can identify the syntactic components contained in a sentence and the relations among those components, so the sentence is understood more deeply in context, ambiguity between words is eliminated, and the information error rate is lower. The syntactic analysis of this scheme is dependency syntax analysis, whose form is simple and convenient to apply. Because a deep learning method based on a labeled data set is not adopted, and training is instead performed with a traditional unsupervised learning approach, a large amount of manual data labeling is not needed; the method is efficient and fast, avoids errors caused by a poor-quality labeled data set, and saves considerable manpower and financial resources. Finally, noun phrases are accurately extracted from the corpus set analyzed by the dependency syntax analyzer, and a noun phrase set is established.
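A minimal sketch of the cleaning, word segmentation and part-of-speech tagging steps, using jieba as an example segmentation tool (the patent does not name a specific tool); the regular expressions and the keyword pattern are illustrative assumptions:

import re
import jieba.posseg as pseg

def clean_and_tag(raw_text, keyword_pattern):
    # Deletion cleaning: remove html comments, html tags and similar useless content.
    text = re.sub(r"<!--.*?-->", "", raw_text, flags=re.S)
    text = re.sub(r"<[^>]+>", "", text)
    # Regular-expression matching cleaning: keep only the parts that match the keywords.
    sentences = [s for s in re.split(r"[。！？\n]", text) if re.search(keyword_pattern, s)]
    # Word segmentation plus part-of-speech tagging for each retained sentence.
    return [[(p.word, p.flag) for p in pseg.cut(s)] for s in sentences]

# usage example (the keyword pattern is illustrative):
# tagged = clean_and_tag(page_text, r"知识图谱|语料")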
Further, the ways of extracting noun phrases in step S2.5 include:
(1) extraction according to the modifier-head ("centering") relation structure within a phrase;
(2) extraction of nouns of a certain length that are not in a modifier-head relation.
Specifically, the modifier-head relation in (1) is commonly used to describe compound vocabulary: the modifier can restrict the head word in terms of number, place, scope, characteristics and quality, forming restrictive modifiers, descriptive modifiers and other modifier-head phrases, so that, according to the result of the syntactic analysis, the words in the modifier-head relation can be combined to restore the phrase semantics. Because there is a combination order among the modifier-head words, this scheme also uses binary insertion to search for and insert the word ids, which improves the extraction speed and ensures the completeness of the phrases. (2) is a supplement to (1), because the word segmentation tool can sometimes directly recognize relatively long noun phrases as nouns or proper nouns.
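A minimal sketch of combining words linked by the modifier-head dependency relation into a noun phrase, with binary insertion of the word ids to preserve the original word order; the (token, word_id, head_id, relation) input format and the "ATT" relation label are assumptions, since the patent does not fix a particular parser output:

import bisect

def extract_np_by_modifier_head(parsed_words):
    # parsed_words: list of (token, word_id, head_id, relation) from the dependency parser.
    tokens = {wid: tok for tok, wid, _, _ in parsed_words}
    groups = {}                                   # head word id -> ordered ids of the phrase
    for tok, wid, head, rel in parsed_words:
        if rel == "ATT":                          # modifier-head relation
            member_ids = groups.setdefault(head, [head])
            bisect.insort(member_ids, wid)        # binary insertion keeps the word order
    # Concatenate the ordered members of each group to restore the phrase semantics.
    return ["".join(tokens[i] for i in ids) for ids in groups.values()]

# e.g. the segments "华南 / 理工 / 大学" with two ATT arcs onto "大学" are restored as "华南理工大学"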
Further, the step S3 includes: converting the corpus set into triples of the form (Field, Predicate, Value);
wherein Field is the name of the data column, Value is the value corresponding to Field, and Predicate is the relationship between Field and Value;
the Value comprises a number class, an address class and a noun class.
The method described above extracts all possible noun phrases from the corpus set; however, not every noun phrase belongs to an entity. Therefore, the resulting noun phrase set needs to be further screened before it can be stored in the database. The screening method adopted by this scheme is a seed-library template method. The specific process is as follows: a database is constructed and its data are stored as triples; the existing data of the database are taken as a template set; the corpus set is converted into triples, the triples are matched against the template set, and a triple is stored in the database only if it matches.
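A minimal sketch of this seed-library template matching under an assumed, simplified matching criterion (a candidate is kept when its Field and Predicate already occur in the template set); the example triples are purely illustrative:

def match_against_templates(candidate_triples, template_triples):
    # Template set: (Field, Predicate, Value) triples already stored in the database.
    templates = {(field, predicate) for field, predicate, _ in template_triples}
    # Keep a candidate only when its Field and Predicate appear in the template set.
    return [t for t in candidate_triples if (t[0], t[1]) in templates]

# usage example (triples are illustrative):
# templates = [("学费", "大于", "5000"), ("校区", "位于", "广州")]
# kept = match_against_templates([("学费", "大于", "6800")], templates)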
Further, the process of acquiring the Field includes:
step S3.11: acquiring an existing data set of a database as a phrase template set, and converting the phrase template set into a sentence vector by using a language model pre-trained by BERT;
step S3.12: converting noun phrase sets into sentence vectors by using a language model pre-trained by BERT;
step S3.13: calculating the distance between the phrases of the two data sets of the phrase template set and the noun phrase set;
step S3.14: if the similarity meets a certain threshold, the phrases of the noun phrase set are classified as a phrase class of the phrase template set;
the calculation formula of the similarity is as follows:
S_similar = cos(vec_1, vec_2) = (vec_1 · vec_2) / (‖vec_1‖ ‖vec_2‖)
wherein S_similar represents the similarity of the two phrases, the cosine value ranges over [-1, 1], and vec_1 and vec_2 are the outputs of the penultimate layer of the BERT model, i.e., 768-dimensional vectors.
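A minimal sketch of obtaining the 768-dimensional sentence vectors from the penultimate layer of a pre-trained BERT model and computing S_similar; the model name bert-base-chinese and the mean pooling over token positions are assumptions, since the patent only specifies that the penultimate-layer output is used:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese", output_hidden_states=True)

def sentence_vector(phrase):
    # Take the penultimate hidden layer, averaged over tokens, as the 768-dimensional vector.
    inputs = tokenizer(phrase, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states
    return hidden_states[-2].mean(dim=1).squeeze(0)

def similarity(phrase_a, phrase_b):
    # Cosine similarity S_similar in [-1, 1] between the two sentence vectors.
    v1, v2 = sentence_vector(phrase_a), sentence_vector(phrase_b)
    return torch.nn.functional.cosine_similarity(v1, v2, dim=0).item()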
Further, the obtaining process of Value includes:
step S3.21: identification of a number class;
performing syntactic analysis on the sentence through the dependency syntax analyzer, analyzing whether the part of speech of each word is a numeral or a quantifier and the word distance between the quantifier and the numeral, and thereby identifying whether the value belongs to the number class (see the sketch below);
step S3.22: identifying an address class;
training on a corpus that has undergone word segmentation and entity labeling, where the trained model is a word-based BiLSTM-CRF model;
step S3.23: identifying the noun class;
converting a phrase template set and a noun phrase set into sentence vectors by using a language model pre-trained by BERT;
calculating the distance between the phrases of the phrase template set and of the noun phrase set; if any column of the phrase template set contains a word whose similarity meets the threshold, the phrase of the current noun phrase set is classified into the phrase class of that column of the phrase template set.
Because the values in the database do not necessarily all exist in the form of noun phrases (they may also be sentences), noun phrase extraction is also required for this content. In the Value acquisition process, the phrase template set in the database meets the requirements of an entity, but because the database contains a large amount of content, there is considerable noise, so it must be clustered to eliminate irrelevant noun phrases. Most noun phrases of the same class should be semantically close and reflect the information of the column, whereas, when noisy data are present, the noise words will not necessarily be similar to each other. Taking the noun phrase set acquired from a certain column of the database as an example, a counting method records, for each noun phrase, the number of other noun phrases similar to it, so that the minority of noun phrases with low counts can be removed.
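A minimal sketch of the number-class check in step S3.21: after part-of-speech tagging, a value is treated as number class when a numeral and a quantifier occur within a small word distance of each other; the tag prefixes m (numeral) and q (quantifier) follow common Chinese POS tag sets, and the distance of 2 is an assumed threshold:

def is_number_class(tagged_words, max_distance=2):
    # tagged_words: list of (word, pos_tag) pairs from the segmentation / POS tagging step.
    numeral_positions = [i for i, (_, tag) in enumerate(tagged_words) if tag.startswith("m")]
    quantifier_positions = [i for i, (_, tag) in enumerate(tagged_words) if tag.startswith("q")]
    # Number class when some numeral and some quantifier are close enough together.
    return any(abs(i - j) <= max_distance
               for i in numeral_positions for j in quantifier_positions)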
Further, the relationship categories among the number class, the address class and the noun class include: greater than, less than, located at, yes, no; the relationship classes carry relationship constraints: greater than and less than are used only to describe the number class; located at is used to describe the address class; yes and no are used to describe the remaining classes.
Specifically, the entities are in the triples, and the relationships between entities are: greater than, less than, located at, yes, no. The entity Value is divided into three types of entity: number class, address class and noun class. The relationship categories between these three types of entity carry relationship constraints: greater than and less than are used only to describe the number class; located at is used to describe the address class; yes and no are used to describe the remaining classes.
Further, the number class, the address class and the noun class have template vocabularies; the template vocabularies are used to describe the relationship classes, and the process of identifying a relationship class is as follows:
(1) defining a relationship type and a corresponding relationship constraint;
(2) defining a template vocabulary corresponding to the relation category;
(3) on the premise that the relationship constraint is satisfied, the noun phrases are checked against the template vocabularies; a noun phrase that does not satisfy them or cannot be identified is classified into a new large class.
(4) words expressing the opposite (antonymous) meaning are treated as a new class.
First, the relationship classes between entities are defined and relationship constraints are added to them. Then, a template vocabulary for describing each relationship class is defined. Finally, on the premise that the relationship class of an entity conforms to its relationship constraint, the relationship class of the entity is identified through the template vocabulary; if the template vocabulary cannot identify the relationship class of an entity, the entity is classified into a new class, and vocabulary expressing the opposite meaning of a class (for example, "not greater than" or "not less than") is additionally treated as a new class. If performance is poor in the future, template words can be added or deleted, which makes it convenient to improve accuracy and ensures the robustness and extensibility of the algorithm.
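A minimal sketch of identifying the relationship class through the relationship constraints and an externally configured template vocabulary; the concrete template words and the class names "number", "address" and "noun" are illustrative assumptions:

# Which relationship may describe which entity class (the relationship constraints above).
CONSTRAINTS = {"greater than": {"number"}, "less than": {"number"},
               "located at": {"address"}, "yes": {"noun"}, "no": {"noun"}}
# Externally configured template vocabulary describing each relationship class.
TEMPLATES = {"greater than": ["大于", "超过"], "less than": ["小于", "低于"],
             "located at": ["位于", "在"], "yes": ["是"], "no": ["不是", "否"]}

def identify_relation(phrase, value_class):
    for relation, words in TEMPLATES.items():
        # Only test relationships whose constraint allows this entity class.
        if value_class in CONSTRAINTS[relation] and any(w in phrase for w in words):
            return relation
    return "new class"       # not satisfied or not recognizable: put into a new large class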
Further, the triples have inter-triple relationship constraints, which can be identified by constructing a classification model; the classification model distinguishes the types "and" and "or", where the "and" type is defined as 1 and the "or" type as 0.
Further, the cross entropy loss function of the classification model is:
J = -[y·log(p) + (1-y)·log(1-p)]
where y represents label of the sample and p represents the probability that the sample is predicted to be positive.
Specifically, more than one triple may be extracted from a single sentence; to further describe the relationship between the triples and thus sufficiently restore the semantics of the sentence, a classification model needs to be constructed to determine the relationship between the triples.
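A minimal sketch of such a binary classifier for the "and"/"or" relationship between two triples, trained with the cross entropy loss J above; using the concatenated 768-dimensional sentence vectors of the two triples as input is an assumption, since the patent does not specify the features:

import torch
import torch.nn as nn

class TripleRelationClassifier(nn.Module):
    # Output 1 means the "and" relationship between two triples, 0 means "or".
    def __init__(self, dim=768):
        super().__init__()
        self.fc = nn.Linear(2 * dim, 1)    # two concatenated sentence vectors as input
    def forward(self, vec_pair):
        return torch.sigmoid(self.fc(vec_pair)).squeeze(-1)   # p: probability of "and"

# J = -[y*log(p) + (1-y)*log(1-p)] is exactly the binary cross entropy loss:
criterion = nn.BCELoss()
# loss = criterion(model(vec_pair), labels)   # labels: 1.0 for "and", 0.0 for "or"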
Compared with the prior art, the invention has the beneficial effects that:
(1) entity extraction and labeling are performed on the basis of unsupervised learning, which avoids errors caused by a poor-quality labeled data set, so the constructed knowledge graph is of higher quality.
(2) word segmentation and part-of-speech tagging require no manual labeling, which saves considerable manpower and financial resources.
(3) the template vocabulary is configured externally, so it can be modified and configured quickly and flexibly, which enhances the generality and robustness of the algorithm.
Drawings
FIG. 1 is a system requirements diagram of the present invention;
FIG. 2 is a syntactic analysis diagram of the present invention;
fig. 3 is a map visualization diagram of the present invention.
Detailed Description
The drawings are only for purposes of illustration and are not to be construed as limiting the invention. For a better understanding of the following embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
Examples
The embodiment provides a knowledge graph construction method, which comprises the following steps:
step S1: obtaining a corpus set;
step S2: preprocessing a corpus;
step S3: converting the corpus and storing the corpus into a database;
step S4: and constructing a knowledge graph according to the database.
Fig. 1 is a system requirements diagram of the present invention. As shown in the figure, obtaining the text data first corresponds to step S1: obtaining the corpus set. The specific acquisition method is as follows: a target web page is selected from the network, the page is parsed as a document, all data are converted into text, the converted page is traversed to obtain all the text data it contains, and finally a tuple set "element" is established to store all the obtained text data. A web page converted into text contains not only textual content but also html tags, comments and the like, and the tags carry information such as text styles. The tuple set stores the html content and the html tags separately for the convenience of subsequent reading and parsing. Next, preprocessing and cleaning the text corresponds to step S2: the corpus set is preprocessed to reduce the noise of the text. Because triples offer high storage performance in a database, the preprocessed corpus set is converted into triples and stored in the database; the specific process is as follows: entity recognition is performed on the cleaned text, relations are extracted, and finally the triples are constructed and stored in the database. This process corresponds to step S3: converting the corpus set and storing it in the database.
A knowledge graph can then be constructed from the triples stored in the database. Compared with existing knowledge graph construction methods, the constructed knowledge graph is of higher quality. Fig. 3 is a map visualization diagram of the present invention, showing the constructed graph.
Further, the step S2 includes:
step S2.1: cleaning the corpus;
step S2.2: using a word segmentation tool to segment words of the cleaned corpus;
step S2.3: performing part-of-speech tagging on the corpus after word segmentation;
step S2.4: analyzing the part-of-speech labeled corpus through a dependency syntax analyzer;
step S2.5: and extracting noun phrases from the parsed corpus set, and establishing a noun phrase set.
In this embodiment, cleaning the corpus set includes deletion cleaning and regular-expression matching cleaning. After the corpus set is obtained, because it necessarily contains unnecessary information, it must be filtered to remove useless content, for example by deleting advertisements, useless links, html comments and the like; then a regular expression describing the keywords is written, the deletion-cleaned corpus is matched against the regular expression, and the parts that do not match are filtered out. After the useful text content has been extracted, an applicable tag set is selected, for example a daily-newspaper annotated corpus for policy texts. A word segmentation tool is used to segment the text according to the meanings of its words, and each word is then given the corresponding part-of-speech tag. The corpus set is syntactically analyzed according to the parts of speech; syntactic analysis can identify the syntactic components contained in a sentence and the relations among those components, so the sentence is understood more deeply in context, ambiguity between words is eliminated, and the information error rate is lower. The syntactic analysis of this embodiment is dependency syntax analysis, whose form is simple and convenient to apply. Because a deep learning method based on a labeled data set is not adopted, and training is instead performed with a traditional unsupervised learning approach, a large amount of manual data labeling is not needed; the method is efficient and fast, avoids errors caused by a poor-quality labeled data set, and saves considerable manpower and financial resources. Finally, noun phrases are accurately extracted from the corpus set analyzed by the dependency syntax analyzer, and a noun phrase set is established.
Further, the ways of extracting noun phrases in step S2.5 include:
(1) extraction according to the modifier-head ("centering") relation structure within a phrase;
(2) extraction of nouns of a certain length that are not in a modifier-head relation.
Specifically, (1) the modifier-head relation within phrases is commonly used to describe compound vocabulary: the modifier can restrict the head word in terms of number, place, scope, characteristics and quality, forming restrictive modifiers, descriptive modifiers and other modifier-head phrases. Fig. 2 is a syntactic analysis diagram of the present invention; as shown in the figure, simply using a word segmentation tool would segment "South China University of Technology" into "South China", "Technology" and "University", whereas combining the words in the modifier-head relation according to the result of the syntactic analysis restores the phrase semantics. Because there is a combination order among the modifier-head words, this embodiment uses binary insertion to search for and insert the word ids, which improves the extraction speed and ensures the completeness of the phrases. (2) is a supplement to (1), because the word segmentation tool can sometimes directly recognize relatively long noun phrases as nouns or proper nouns.
Further, the step S3 includes: converting the corpus set into triples of the form (Field, Predicate, Value);
wherein Field is the name of the data column, Value is the value corresponding to Field, and Predicate is the relationship between Field and Value;
the Value comprises a number class, an address class and a noun class.
The method described above extracts all possible noun phrases from the corpus set; however, not every noun phrase belongs to an entity. Therefore, the resulting noun phrase set needs to be further screened before it can be stored in the database. The screening method used in this embodiment is a seed-library template method. The specific process is as follows: a database is constructed and its data are stored as triples; the existing data of the database are taken as a template set; the corpus set is converted into triples, the triples are matched against the template set, and a triple is stored in the database only if it matches.
Further, the process of acquiring the Field includes:
step S3.11: acquiring an existing data set of a database as a phrase template set, and converting the phrase template set into a sentence vector by using a language model pre-trained by BERT;
step S3.12: converting noun phrase sets into sentence vectors by using a language model pre-trained by BERT;
step S3.13: calculating the distance between the phrases of the two data sets of the phrase template set and the noun phrase set;
step S3.14: if the similarity meets a certain threshold, the phrases of the noun phrase set are classified as a phrase class of the phrase template set;
the calculation formula of the similarity is as follows:
S_similar = cos(vec_1, vec_2) = (vec_1 · vec_2) / (‖vec_1‖ ‖vec_2‖)
wherein S_similar represents the similarity of the two phrases, the cosine value ranges over [-1, 1], and vec_1 and vec_2 are the outputs of the penultimate layer of the BERT model, i.e., 768-dimensional vectors.
Further, the obtaining process of Value includes:
step S3.21: identification of a number class;
performing syntactic analysis on the sentence through the dependency syntax analyzer, analyzing whether the part of speech of each word is a numeral or a quantifier and the word distance between the quantifier and the numeral, and thereby identifying whether the value belongs to the number class;
step S3.22: identifying an address class;
training on a corpus that has undergone word segmentation and entity labeling, where the trained model is a word-based BiLSTM-CRF model;
the first layer of the model is a look-up layer, in which each word of a sentence is mapped to its embedding vector; the second layer is a bidirectional LSTM, which fully extracts sentence features; the third layer is a CRF layer, which performs sequence labeling prediction on the sentence in combination with the training labels (see the sketch after step S3.23).
Step S3.23: identifying the noun class;
converting a phrase template set and a noun phrase set into sentence vectors by using a language model pre-trained by BERT;
calculating the distance between the phrases of the phrase template set and of the noun phrase set; if any column of the phrase template set contains a word whose similarity meets the threshold, the phrase of the current noun phrase set is classified into the phrase class of that column of the phrase template set.
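A minimal sketch of the three layers described above (look-up layer, bidirectional LSTM, CRF layer) for the address-class recognizer, using the pytorch-crf package for the CRF layer; the vocabulary size, tag set and dimensions are illustrative assumptions:

import torch.nn as nn
from torchcrf import CRF            # pytorch-crf package

class BiLstmCrf(nn.Module):
    def __init__(self, vocab_size, num_tags, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)            # look-up layer
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2,
                            bidirectional=True, batch_first=True)       # bidirectional LSTM
        self.hidden2tag = nn.Linear(hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)                      # CRF layer

    def loss(self, token_ids, tags, mask):
        emissions = self.hidden2tag(self.lstm(self.embedding(token_ids))[0])
        return -self.crf(emissions, tags, mask=mask)                    # negative log-likelihood

    def decode(self, token_ids, mask):
        emissions = self.hidden2tag(self.lstm(self.embedding(token_ids))[0])
        return self.crf.decode(emissions, mask=mask)                    # predicted tag sequences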
Because the values in the database do not necessarily all exist in the form of noun phrases (they may also be sentences), noun phrase extraction is also required for this content. In the Value acquisition process, the phrase template set in the database meets the requirements of an entity, but because the database contains a large amount of content, there is considerable noise, so it must be clustered to eliminate irrelevant noun phrases. Most noun phrases of the same class should be semantically close and reflect the information of the column, whereas, when noisy data are present, the noise words will not necessarily be similar to each other. Taking the noun phrase set acquired from a certain column of the database as an example, a counting method records, for each noun phrase, the number of other noun phrases similar to it, so that the minority of noun phrases with low counts can be removed.
A runnable sketch of this counting-based filtering follows (the 40% keep ratio comes from the pseudocode; the similarity threshold is left as a parameter, as the pseudocode does not fix it; the vectors are the sentence vectors described above):

import numpy as np

def filter_noun_phrases(phrases, vectors, threshold, keep_ratio=0.4):
    # Record, for each noun phrase, how many other noun phrases are similar to it.
    counts = [0] * len(phrases)
    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            v1, v2 = vectors[i], vectors[j]
            similarity = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
            if similarity >= threshold:
                counts[i] += 1
                counts[j] += 1
    # New noun phrase set: the first 40% of noun phrases, ordered by count from big to small.
    order = sorted(range(len(phrases)), key=lambda k: counts[k], reverse=True)
    return [phrases[k] for k in order[: int(len(phrases) * keep_ratio)]]
Further, the relationship categories among the number class, the address class and the noun class include: greater than, less than, located at, yes, no; the relationship classes carry relationship constraints: greater than and less than are used only to describe the number class; located at is used to describe the address class; yes and no are used to describe the remaining classes.
Specifically, the entities are in the triples, and the relationships between entities are: greater than, less than, located at, yes, no. The entity Value is divided into three types of entity: number class, address class and noun class. The relationship categories between these three types of entity carry relationship constraints: greater than and less than are used only to describe the number class; located at is used to describe the address class; yes and no are used to describe the remaining classes.
Further, the number class, the address class and the noun class have template vocabularies; the template vocabularies are used to describe the relationship classes, and the process of identifying a relationship class is as follows:
(1) defining a relationship type and a corresponding relationship constraint;
(2) defining a template vocabulary corresponding to the relation category;
(3) on the premise that the relationship constraint is satisfied, the noun phrases are checked against the template vocabularies; a noun phrase that does not satisfy them or cannot be identified is classified into a new large class.
(4) words expressing the opposite (antonymous) meaning are treated as a new class.
First, the relationship classes between entities are defined and relationship constraints are added to them. Then, a template vocabulary for describing each relationship class is defined. Finally, on the premise that the relationship class of an entity conforms to its relationship constraint, the relationship class of the entity is identified through the template vocabulary; if the template vocabulary cannot identify the relationship class of an entity, the entity is classified into a new class, and vocabulary expressing the opposite meaning of a class (for example, "not greater than" or "not less than") is additionally treated as a new class. If performance is poor in the future, template words can be added or deleted, which makes it convenient to improve accuracy and ensures the robustness and extensibility of the algorithm.
Further, the triples have inter-triple relationship constraints, which can be identified by constructing a classification model; the classification model distinguishes the types "and" and "or", where the "and" type is defined as 1 and the "or" type as 0.
Further, the cross entropy loss function of the classification model is:
J = -[y·log(p) + (1-y)·log(1-p)]
where y represents label of the sample and p represents the probability that the sample is predicted to be positive.
Specifically, more than one triple may be extracted from a single sentence; to further describe the relationship between the triples and thus sufficiently restore the semantics of the sentence, a classification model needs to be constructed to determine the relationship between the triples.
It should be understood that the above-mentioned embodiments of the present invention are only examples for clearly illustrating the technical solutions of the present invention, and are not intended to limit the specific embodiments of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention claims should be included in the protection scope of the present invention claims.

Claims (10)

1. A method of knowledge graph construction, the method comprising:
step S1: obtaining a corpus set;
step S2: preprocessing a corpus;
step S3: converting the corpus and storing the corpus into a database;
step S4: and constructing a knowledge graph according to the database.
2. The method of knowledge-graph construction according to claim 1, wherein the step S2 includes:
step S2.1: cleaning the corpus;
step S2.2: using a word segmentation tool to segment words of the cleaned corpus;
step S2.3: performing part-of-speech tagging on the corpus after word segmentation;
step S2.4: analyzing the part-of-speech labeled corpus through a dependency syntax analyzer;
step S2.5: and extracting noun phrases from the parsed corpus set, and establishing a noun phrase set.
3. The method of claim 2, wherein the step S2.5 of extracting noun phrases comprises:
(1) extracting according to a centering relation structure in the phrase;
(2) a noun of a certain length under a non-centered relationship is extracted.
4. The method of knowledge-graph construction according to claim 2, wherein the step S3 includes: the corpus is converted into a triple, and the triple is (Field, Predicate, Value);
wherein Field is the name of the data column, Value is the Value corresponding to Field, and Predicate is the relationship between Field and Value;
the Value comprises a number class, an address class and a nominal class.
5. The method for constructing a knowledge graph according to claim 4, wherein the Field acquisition process comprises: step S3.11: acquiring an existing data set of a database as a phrase template set, and converting the phrase template set into a sentence vector by using a language model pre-trained by BERT;
step S3.12: converting noun phrase sets into sentence vectors by using a language model pre-trained by BERT;
step S3.13: calculating the distance between the phrases of the two data sets of the phrase template set and the noun phrase set;
step S3.14: if the similarity meets a certain threshold, the phrases of the noun phrase set are classified as a current phrase class of the phrase template set;
the calculation formula of the similarity is as follows:
S_similar = cos(vec_1, vec_2) = (vec_1 · vec_2) / (‖vec_1‖ ‖vec_2‖)
wherein S_similar represents the similarity of the two phrases, the cosine value ranges over [-1, 1], and vec_1 and vec_2 are the outputs of the penultimate layer of the BERT model, i.e., 768-dimensional vectors.
6. The method of claim 4, wherein the Value obtaining process comprises:
step S3.21: identification of a number class;
performing syntactic analysis on the sentence through the dependency syntax analyzer, analyzing whether the part of speech of each word is a numeral or a quantifier and the word distance between the quantifier and the numeral, and thereby identifying whether the value belongs to the number class;
step S3.22: identifying an address class;
training on a corpus that has undergone word segmentation and entity labeling, wherein the trained model is a word-based BiLSTM-CRF model;
step S3.23: identifying a part of speech class;
converting a phrase template set and a noun phrase set into sentence vectors by using a language model pre-trained by BERT;
and calculating the distance between the phrases of the two data sets of the phrase template set and the noun phrase set, and if one word meeting a threshold value exists in any one column of the phrase template set, classifying the phrases of the current noun phrase set into a phrase class of the current column of the phrase template set.
7. The method of claim 4, wherein the relationship between the number class, the address class and the part of speech class comprises: greater than, less than, at, yes, no; the relationship class has a relationship constraint of: greater than, less than are used to describe the numerical class only; is located in the description address class; yes, no are used to describe the remaining categories.
8. The method of claim 7, wherein the numeric class, the address class and the nominal class have a template vocabulary, the template vocabulary is used for describing the relationship class, and the process of identifying the relationship class is as follows:
(1) defining a relationship type and a corresponding relationship constraint;
(2) defining a template vocabulary corresponding to the relation category;
(3) on the premise of ensuring the relation constraint, the noun phrases are checked and judged through the template vocabularies, and if the noun phrases are not satisfied or cannot be identified, the noun phrases are classified into a new large class.
(4) Words with antisense, treated as a new class.
9. The method for constructing a knowledge graph according to claim 4, wherein the triples have an intergroup relationship constraint, and the intergroup relationship constraint can be identified by constructing a classification model, and the identification type of the classification model is and or, the and class is defined as 1, and the or class is defined as 0.
10. The method of constructing a knowledge graph according to claim 9, wherein the cross entropy loss function of the classification model is:
J = -[y·log(p) + (1-y)·log(1-p)]
where y represents label of the sample and p represents the probability that the sample is predicted to be positive.
CN201910766428.XA 2019-08-19 2019-08-19 Knowledge graph construction method Pending CN111061882A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910766428.XA CN111061882A (en) 2019-08-19 2019-08-19 Knowledge graph construction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910766428.XA CN111061882A (en) 2019-08-19 2019-08-19 Knowledge graph construction method

Publications (1)

Publication Number Publication Date
CN111061882A true CN111061882A (en) 2020-04-24

Family

ID=70297442

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910766428.XA Pending CN111061882A (en) 2019-08-19 2019-08-19 Knowledge graph construction method

Country Status (1)

Country Link
CN (1) CN111061882A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108664615A (en) * 2017-05-12 2018-10-16 华中师范大学 A kind of knowledge mapping construction method of discipline-oriented educational resource
CN107908671A (en) * 2017-10-25 2018-04-13 南京擎盾信息科技有限公司 Knowledge mapping construction method and system based on law data
CN110019839A (en) * 2018-01-03 2019-07-16 中国科学院计算技术研究所 Medical knowledge map construction method and system based on neural network and remote supervisory
CN109885698A (en) * 2019-02-13 2019-06-14 北京航空航天大学 A kind of knowledge mapping construction method and device, electronic equipment

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737496A (en) * 2020-06-29 2020-10-02 东北电力大学 Power equipment fault knowledge map construction method
CN111951079A (en) * 2020-08-14 2020-11-17 国网电子商务有限公司 Credit rating method and device based on knowledge graph and electronic equipment
CN111967761A (en) * 2020-08-14 2020-11-20 国网电子商务有限公司 Monitoring and early warning method and device based on knowledge graph and electronic equipment
CN111967761B (en) * 2020-08-14 2024-04-02 国网数字科技控股有限公司 Knowledge graph-based monitoring and early warning method and device and electronic equipment
CN111951079B (en) * 2020-08-14 2024-04-02 国网数字科技控股有限公司 Credit rating method and device based on knowledge graph and electronic equipment
CN112163097A (en) * 2020-09-23 2021-01-01 中国电子科技集团公司第十五研究所 Military knowledge graph construction method and system
CN112487206B (en) * 2020-12-09 2022-09-20 中国电子科技集团公司第三十研究所 Entity relationship extraction method for automatically constructing data set
CN112487206A (en) * 2020-12-09 2021-03-12 中国电子科技集团公司第三十研究所 Entity relationship extraction method for automatically constructing data set
CN113657111A (en) * 2021-07-30 2021-11-16 上海明略人工智能(集团)有限公司 Address recognition method, system, storage medium and electronic device
CN113554175B (en) * 2021-09-18 2021-11-26 平安科技(深圳)有限公司 Knowledge graph construction method and device, readable storage medium and terminal equipment
CN113554175A (en) * 2021-09-18 2021-10-26 平安科技(深圳)有限公司 Knowledge graph construction method and device, readable storage medium and terminal equipment
CN114462400A (en) * 2021-12-31 2022-05-10 深圳市东信时代信息技术有限公司 Directional package script generation method, device, equipment and storage medium
CN116720819A (en) * 2023-08-10 2023-09-08 福建省闽清双棱纸业有限公司 Impregnated paper raw material management system integrating knowledge graph and neural network
CN116720819B (en) * 2023-08-10 2023-10-27 福建省闽清双棱纸业有限公司 Impregnated paper raw material management system integrating knowledge graph and neural network

Similar Documents

Publication Publication Date Title
CN111061882A (en) Knowledge graph construction method
CN106776562B (en) Keyword extraction method and extraction system
CN111737496A (en) Power equipment fault knowledge map construction method
CN109145260B (en) Automatic text information extraction method
CN110609983B (en) Structured decomposition method for policy file
CN107562919B (en) Multi-index integrated software component retrieval method and system based on information retrieval
CN110765277B (en) Knowledge-graph-based mobile terminal online equipment fault diagnosis method
CN113486667B (en) Medical entity relationship joint extraction method based on entity type information
CN112541337B (en) Document template automatic generation method and system based on recurrent neural network language model
CN113569050B (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
CN114254653A (en) Scientific and technological project text semantic extraction and representation analysis method
CN113168499A (en) Method for searching patent document
CN113191148A (en) Rail transit entity identification method based on semi-supervised learning and clustering
CN113196277A (en) System for retrieving natural language documents
CN113312922B (en) Improved chapter-level triple information extraction method
CN113196278A (en) Method for training a natural language search system, search system and corresponding use
CN111444704A (en) Network security keyword extraction method based on deep neural network
CN114970536A (en) Combined lexical analysis method for word segmentation, part of speech tagging and named entity recognition
CN111178080A (en) Named entity identification method and system based on structured information
CN114579695A (en) Event extraction method, device, equipment and storage medium
CN116628173B (en) Intelligent customer service information generation system and method based on keyword extraction
CN113159969A (en) Financial long text rechecking system
Oo et al. An analysis of ambiguity detection techniques for software requirements specification (SRS)
CN112380848A (en) Text generation method, device, equipment and storage medium
CN115618883A (en) Business semantic recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination