Disclosure of Invention
To solve the above technical problems, the invention aims to provide a method for constructing a journal-literature knowledge graph through cyclic update iteration. Based on automatic knowledge graph construction, the method takes the CNKI journal literature base as its data source, organically combines multiple construction modules such as concept design, dictionary management, corpus management, model training, knowledge-element extraction, and entity disambiguation, and forms a closed loop in which iterative updates continuously improve the accuracy of both the knowledge graph and the training, thereby truly realizing intelligent, cyclically updated, iterative construction of a journal-literature knowledge graph.
The purpose of the invention is realized by the following technical scheme:
A method for constructing a journal-literature knowledge graph with cyclic update iteration comprises the following steps:
a, designing a concept model: defining the ontology structure of the journal-literature knowledge graph, including the definition of ontologies, the relationship attributes between ontologies, and the data attributes inside each ontology;
b, managing the word lists and the corpus: the word lists are divided into a subject word list and a relational word list, and the corpus is divided into a text library and a sentence library and covers corpora from multiple sources;
c, building an entity-relation extraction model through deep-learning-based labeling, training, recognition, and calibration: a deep-learning entity-relation extraction technique is combined with the dictionaries and the corpus to perform entity extraction and relation extraction, with update iteration;
d, performing corpus attribute extraction through the ontology structure defined in the concept design and by introducing templates;
e, checking and disambiguating the results of entity recognition and relation extraction, and performing entity disambiguation on the results of attribute extraction;
f, storing the recognition results into the knowledge graph, updating the subject dictionary, the relation dictionary, and the training model accordingly, and recognizing new material with the updated dictionaries and model, thereby realizing cyclic update iteration and constructing the knowledge graph.
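For illustration only, the closed loop of steps a-f can be sketched as follows. This is a minimal sketch, not the disclosed implementation: the "model" is reduced to dictionary matching, and the manual calibration of step f is reduced to a fixed set of reviewed new words; all names are hypothetical placeholders.

```python
def update_iteration(sentences, subject_dict, calibrated_new_words, rounds=2):
    """Simplified sketch of the closed loop (steps c, e, f): recognize
    entities with the current dictionary, store results, then feed
    calibrated new words back into the subject dictionary."""
    graph = set()
    for _ in range(rounds):
        for sent in sentences:
            for word in sent.split():
                if word in subject_dict:
                    graph.add(("Entity", word))
        # step f: calibrated new words grow the dictionary for the next round
        subject_dict = subject_dict | calibrated_new_words
    return graph, subject_dict

graph, final_dict = update_iteration(
    ["bert extracts entities", "crf labels sequences"],
    {"bert"}, {"crf"})
```

In the first round only "bert" is recognized; after the dictionary update, the second round additionally recognizes "crf", which is the growth effect the closed loop relies on.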
One or more embodiments of the present invention may have the following advantages over the prior art:
The invention provides a standard process reference for knowledge graph construction, so that the constructed knowledge graph is truly oriented toward intelligence, human-resource waste is reduced, and the usability and practicality of the knowledge graph are improved.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.
As shown in fig. 1, the method for constructing a journal-literature knowledge graph with cyclic update iteration includes step 10, conceptual model design: specifying the knowledge graph's ontology definitions, data attributes, and relationship attributes.
The ontology model refers to ontology models and data standards that are widely applied internationally, such as CIDOC CRM, EDM, FOAF, EVENT, and FRBR, and is expanded and customized according to the service characteristics of journal literature, improving its reusability and degree of internationalization.
The ontology construction of the journal-literature knowledge graph comprises the definition of the ontology and data-model layer of the knowledge graph, and includes: defining the ontologies, defining the relationship attributes between ontologies, and defining the data attributes inside each ontology.
An ontology is an object or a collection of objects, for example: text, author, and institution. The relationship attributes of ontologies mainly define the association relationships between ontologies, for example: collaborations between authors, or affiliations between authors and institutions. The data attributes inside an ontology are intrinsic characteristics of the ontology itself that involve no association relationship, for example: an author's name, age, and place of origin.
The invention defines a triple specification for the knowledge graph: (E1, R, E2) and (E, P, V), wherein E represents an ontology, R represents a relationship attribute, P represents a data attribute, and V represents an attribute value. In an entity-relationship-entity triple, the value range of each entity is an ontology.
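The two triple forms can be represented with simple typed records; the field values below are hypothetical examples, not entries from the disclosed graph.

```python
from typing import NamedTuple

class RelationTriple(NamedTuple):
    """(E1, R, E2): ontology, relationship attribute, ontology."""
    e1: str
    r: str
    e2: str

class AttributeTriple(NamedTuple):
    """(E, P, V): ontology, data attribute, attribute value."""
    e: str
    p: str
    v: str

# Hypothetical instances illustrating the two triple forms
t_rel = RelationTriple("Author", "affiliated_with", "Institution")
t_attr = AttributeTriple("Author", "name", "Zhang San")
```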
The ontology structure of the journal-literature part is defined as follows:
TABLE 1
Identifier | Ontology
E1 | Text
E2 | Author
E3 | Institution
E4 | Time
E5 | Relation type
E6 | Domain entity
E7 | Region
The relational attributes of the journal-literature part are defined as follows:
TABLE 2
Step 20, managing the word lists and the corpus: the word lists are divided into a subject word list and a relational word list, and the corpus is divided into a text library and a sentence library and covers corpora from multiple sources.
The word lists and corpus of the journal-literature knowledge graph are divided into data of multiple fields according to the Chinese Library Classification. The word lists are divided formally into a subject word list and a relational word list: the subject word list defines attributes of entity words such as source, field, and sub-field, while the relational word list defines relations between the entity words of the subject word list; the word relations define 10 relations in journal literature, including hypernym-hyponym, synonym, antonym, and related-term relations.
The corpus is divided into a text library and a sentence library. The text library is a collection of network journal documents and local resources, mainly storing document data. To facilitate deep text mining, the journal documents in the text library are preprocessed to form the sentence library. The sentence library includes sentences from the journal literature together with the positions, within each sentence, of the entity words found in the subject word list. The structure of the subject word list is shown in fig. 2.
Here, content is the entity word itself, english is its English translation, catalog is its Chinese Library Classification number, and domain is the source field of the word.
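A subject-word-list entry with these fields can be sketched as a simple record; the field names follow the description of fig. 2, while the example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SubjectWord:
    """One illustrative subject-word-list entry (fields per fig. 2)."""
    content: str  # the entity word itself
    english: str  # English translation of the word
    catalog: str  # Chinese Library Classification number (value hypothetical)
    domain: str   # source field of the word

entry = SubjectWord(content="知识图谱", english="knowledge graph",
                    catalog="TP391", domain="computer science")
```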
The relational word table is shown in Table 3:
TABLE 3
Here, orgid and tarid are the index ids, in the subject word list, of the source and target entity words, and relatype is the word-relation id. The text library and sentence library of the corpus are shown in fig. 3 and fig. 4.
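A relational-word-list row with these fields can likewise be sketched as a record; the field names come from Table 3 as described above, and the relation-id constant is a hypothetical example.

```python
from dataclasses import dataclass

@dataclass
class WordRelation:
    """One illustrative relational-word-list row (fields per Table 3)."""
    orgid: int     # index id of the source word in the subject word list
    tarid: int     # index id of the target word in the subject word list
    relatype: int  # word-relation id (e.g. hypernym, synonym, antonym)

HYPERNYM = 1  # hypothetical relation id; the method defines 10 such relations
rel = WordRelation(orgid=101, tarid=205, relatype=HYPERNYM)
```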
Step 30: building the entity-relation extraction model through deep-learning-based labeling, training, recognition, and calibration; a deep-learning entity-relation extraction technique is combined with the dictionaries and corpus to perform entity extraction and relation extraction, with update iteration.
Update iteration of entity extraction:
1. Label the corpus using the dictionary, marking the entity words that appear in the corpus.
2. Select an entity recognition algorithm and train on the labeled set. Entity recognition algorithms have themselves gone through an update iteration from machine learning to deep learning, for example: HMM, CRF, BiLSTM+CRF, and BERT+BiLSTM+CRF. The invention adopts the BERT+BiLSTM+CRF algorithm for entity recognition.
3. Continuously recognize the corpus with the trained labeling model, calibrate the recognition results, and store new words not yet present in the subject dictionary into the subject dictionary.
4. Label again with the updated dictionary, and retrain the model with the updated dictionary.
The entity extraction process forms a closed loop of update iteration by growing the subject dictionary and cyclically re-labeling the corpus and retraining the model, so that the model is continuously optimized and the accuracy of entity recognition improves.
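Step 1 of this loop, labeling the corpus with the dictionary, can be sketched as dictionary-driven BIO tagging. This is a toy whitespace-tokenized sketch, not the BERT+BiLSTM+CRF model itself.

```python
def bio_label(tokens, subject_dict):
    """Dictionary-driven BIO labeling (step 1 of the loop). Multi-token
    dictionary entries are matched greedily, longest match first."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        for j in range(len(tokens), i, -1):  # longest span starting at i
            if " ".join(tokens[i:j]) in subject_dict:
                labels[i] = "B"
                labels[i + 1:j] = ["I"] * (j - i - 1)
                i, matched = j, True
                break
        if not matched:
            i += 1
    return labels

tags = bio_label("conditional random field models sequences".split(),
                 {"conditional random field", "sequences"})
```

The resulting tag sequence becomes the training input for the sequence-labeling model, and any calibrated new entities flow back into `subject_dict` for the next labeling round.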
Update iteration of relationship extraction:
1. Label the sentence set using the relation dictionary and the existing relation extraction templates, and form the training set. Relation extraction spans a wide range of fields, and a conventional deep learning model alone struggles to perform well on relation extraction training; traditional relation extraction therefore designs a large number of templates containing part-of-speech and grammatical features. The method labels the sentence set through both the templates and the relational word list, forming training samples.
2. Select a relation extraction algorithm and train on the labeled set; the relation extraction model uses the PCNN+Attention algorithm, with CNN/PCNN as the sentence encoder and a sentence-level attention mechanism.
3. Perform relation recognition on new corpora with the trained model, store the recognition results in a database, correct them through manual review, store the corrected results in the relation dictionary and the sentence set, and add them to the new training samples.
4. Recognize the corpus again with the new training samples and iterate the loop.
Relation recognition adopts the same loop iteration process as entity recognition, while templates distilled from extensive past experience further improve recognition accuracy.
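The template-based labeling of step 1 can be sketched as pattern matching over the sentence set; the template patterns and relation names below are hypothetical, and a real system would use the part-of-speech and grammatical feature templates described above rather than bare regexes.

```python
import re

def label_relations(sentences, templates):
    """Template-based labeling of the sentence set (step 1 of relation
    extraction). Each template maps a regex with two capture groups
    (the two entities) to a relation type."""
    samples = []
    for sent in sentences:
        for pattern, rel_type in templates:
            m = re.search(pattern, sent)
            if m:
                samples.append((m.group(1), rel_type, m.group(2)))
    return samples

templates = [
    (r"(\w+) works at (\w+)", "affiliated_with"),
    (r"(\w+) collaborates with (\w+)", "collaborates_with"),
]
samples = label_relations(["Zhang works at Tsinghua"], templates)
```

The matched triples serve as distantly supervised training samples for the PCNN+Attention model, and manually corrected outputs are appended as new samples in the next round.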
A flowchart of the method for constructing a journal-literature knowledge graph with cyclic update iteration is shown in fig. 5. Local data and journal-literature data are unified, mapped, and organized into the text library, and the text-library data is preprocessed to form the sentence library. The data of the text library and sentence library are the input corpora of the entity extraction model and the relation recognition model; the subject words and relation words of the word lists accompany the corpora into the models, and the conceptual model is also introduced into the attribute extraction model. The outputs of the entity extraction and relation recognition models are, respectively, recognized entities and new relation phrases, and the output of the attribute extraction model is entity-attribute triples. After entity disambiguation, the results are calibrated and used to update the word-list database and the journal-literature knowledge graph. The new word lists, combined with new corpora, feed model training whose output updates the word lists and knowledge graph again; this process realizes update iteration, continuously correcting the models, word lists, and knowledge graph, improving accuracy and usability, and forming an organic, intelligent, cyclic update iteration mechanism.
As shown in fig. 6, the entity recognition update iteration model performs entity tagging on the corpus through the word list, and the tagged samples are input into the model for training. The trained model then performs entity recognition on the corpus, and the word list and knowledge graph are updated again according to the recognition results, forming the update iteration model of entity recognition.
A flowchart of the relation recognition update iteration model is likewise shown in fig. 7.
Step 40: perform corpus attribute extraction through the ontology structure defined in the concept design and by introducing templates.
Attribute extraction adopts a dependency syntactic analysis model; the process is as follows:
1. Combine the ontology structure and data attributes defined in the concept design to form entity-attribute templates, and traverse the sentence set for sentences containing an entity and a potentially related attribute.
2. Use a CRF algorithm for part-of-speech tagging. Entity words often have fixed parts of speech, but part-of-speech tagging struggles to judge the parts of speech of out-of-vocabulary words and to distinguish phrases from single words, and the tagging result strongly affects the subsequent syntactic analysis; part-of-speech tagging with a CRF therefore allows more entity features to be learned and facilitates update iteration.
3. Feed the tagging results into a syntactic analyzer for syntactic analysis. The analyzer adopts a dependency algorithm whose core is the arc-standard system: a classifier predicts the correct transition operation from features extracted from the current configuration, making computation very efficient.
4. Analyze the syntactic results by matching syntactic templates and extract attributes such as subject-predicate structures.
As shown in fig. 8, the attribute extraction model takes the conceptual model and the subject dictionary as input and extracts sentences from the sentence library as sample data. The samples are part-of-speech tagged by the CRF, dependency syntactic analysis is performed on the tagging results, the sentences are analyzed for grammatical features, and attribute extraction is performed through the grammar templates, forming entity-attribute triples that are stored in the knowledge graph. Update iteration of the attribute extraction model is realized mainly by cyclically training the CRF model to calibrate the accuracy of part-of-speech tagging.
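A drastically simplified sketch of the template-matching stage: a real system performs CRF tagging and arc-standard dependency parsing first, whereas here a template is just a POS-tag sequence whose first and last positions give the entity and the attribute value. The POS tags and the "age" template are hypothetical.

```python
def extract_attributes(tagged_sentence, attribute_templates):
    """Toy attribute extraction over a POS-tagged sentence. Each
    template is (pos_tag_sequence, property_name); the first matched
    word is taken as the entity, the last as the attribute value."""
    words = [w for w, _ in tagged_sentence]
    tags = [t for _, t in tagged_sentence]
    triples = []
    for pos_seq, prop in attribute_templates:
        n = len(pos_seq)
        for i in range(len(tags) - n + 1):
            if tags[i:i + n] == list(pos_seq):
                triples.append((words[i], prop, words[i + n - 1]))
    return triples

sentence = [("Zhang", "NR"), ("is", "VC"), ("42", "CD")]
triples = extract_attributes(sentence, [(("NR", "VC", "CD"), "age")])
```

The resulting (E, P, V) triples are then stored in the knowledge graph after disambiguation.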
Step 50: entity disambiguation and verification, i.e., verifying and disambiguating the results of entity recognition and relation extraction, and performing entity disambiguation on the results of attribute extraction.
Entity disambiguation mainly resolves the phenomena of polysemy (one word with multiple meanings) and synonymy (multiple words with one meaning) in natural language. Entity disambiguation proceeds in two steps: first, deep-learning disambiguation before entity recognition and relation recognition; second, matching disambiguation using the relation dictionary and the subject dictionary. The results of entity recognition, relation recognition, and attribute extraction are all disambiguated.
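The second, dictionary-matching step can be sketched as context-overlap scoring; the sense entries, entity ids, and domain words below are hypothetical illustrations, not entries from the disclosed dictionaries.

```python
def match_disambiguate(context_words, candidate_senses):
    """Matching disambiguation (second step): choose the candidate
    sense whose dictionary domain words overlap most with the
    context of the ambiguous mention."""
    def overlap(sense):
        return len(set(sense["domain_words"]) & set(context_words))
    return max(candidate_senses, key=overlap)["entity_id"]

senses = [
    {"entity_id": "apple_fruit", "domain_words": ["fruit", "tree", "eat"]},
    {"entity_id": "apple_inc", "domain_words": ["iphone", "company", "mac"]},
]
chosen = match_disambiguate(["company", "released", "iphone"], senses)
```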
Step 60: store the recognition results in the knowledge graph and update the subject dictionary, the relation dictionary, and the training model accordingly; recognize new material with the updated dictionaries and model, realizing cyclic update iteration and constructing the knowledge graph.
Although the embodiments of the present invention have been described above, the above descriptions are only for the convenience of understanding the present invention, and are not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.