CN111209412B

CN111209412B - Periodical literature knowledge graph construction method for cyclic updating iteration

Info

Publication number: CN111209412B
Application number: CN202010084144.5A
Authority: CN
Inventors: 吕强; 段飞虎; 蔡陨; 谢一鸣; 胡磊; 冯自强; 张宏伟
Original assignee: Tongfang Knowledge Network Digital Publishing Technology Co ltd
Current assignee: Tongfang Knowledge Network Digital Publishing Technology Co ltd
Priority date: 2020-02-10
Filing date: 2020-02-10
Publication date: 2023-05-12
Anticipated expiration: 2040-02-10
Also published as: CN111209412A

Abstract

The invention discloses a journal literature knowledge graph construction method with cyclic update iteration, comprising: designing a conceptual model and defining the ontology structure of the journal literature knowledge graph, including the ontologies, the relationship attributes between ontologies, and the data attributes within each ontology; managing vocabularies and a corpus, wherein the vocabulary is divided into a subject vocabulary and a relation vocabulary, and the corpus, which draws on multiple sources, is divided into a text library and a sentence library; labeling, training, recognizing with, and calibrating a deep-learning entity and relation extraction model, which is combined with the dictionaries and the corpus to perform entity extraction and relation extraction with update iteration; extracting attributes from the corpus using the ontology structure defined in the conceptual design together with templates; auditing and disambiguating the results of entity recognition and relation extraction, and disambiguating the results of attribute extraction; and storing the recognition results in the knowledge graph, updating the topic dictionary, the relation dictionary, and the trained model at intervals, and recognizing the corpus with the new dictionaries and model, thereby building the knowledge graph through cyclic iterative updates.

Description

Periodical literature knowledge graph construction method for cyclic updating iteration

Technical Field

The invention relates to the technical field of natural language processing and computer information processing, in particular to a periodical literature knowledge graph construction method for cyclic updating iteration.

Background

Existing knowledge graphs are large, networked knowledge systems built on a semantic network framework; they aim to describe the concepts, entities, and events of the objective world and the relations among them. Concepts are the conceptual representations of things that people form while coming to understand the world. The key technologies of knowledge graphs span natural language processing, data mining, information retrieval, and other fields, and fall mainly into two types, knowledge-driven and data-driven. With the development of big data, knowledge graphs have been widely applied, for example in legal, social network, and medical knowledge graphs.

The key technologies of knowledge graph construction include entity and relation extraction, knowledge fusion, entity linking, and knowledge reasoning, and construction involves technologies for every stage from data sources to applications. However, current work on knowledge graph construction focuses mainly on enriching and optimizing graph content, for example through entity and relation extraction and semantic analysis, while the construction process itself has not been explored in depth. In particular, there is no systematic standard for the update iteration and calibration of a knowledge graph that would allow construction to form a closed loop and become truly intelligent and automated.

Disclosure of Invention

To solve the above technical problems, the invention aims to provide a journal literature knowledge graph construction method with cyclic update iteration. Approaching the task from the perspective of automated knowledge graph construction, the method takes the knowledge network journal literature library as its data source and organically combines several construction modules, including concept design, dictionary management, corpus management, model training, knowledge element extraction, and entity disambiguation. Through update iteration it continuously improves the accuracy of the knowledge graph and of the trained models, thereby achieving intelligent, cyclically updated construction of the journal literature knowledge graph.

The aim of the invention is achieved by the following technical scheme:

A journal literature knowledge graph construction method with cyclic update iteration comprises the following steps:

A, designing a conceptual model and defining the ontology structure of the journal literature knowledge graph, including the ontologies, the relationship attributes between ontologies, and the data attributes within each ontology;

B, managing vocabularies and a corpus, wherein the vocabulary is divided into a subject vocabulary and a relation vocabulary, and the corpus, which draws on multiple sources, is divided into a text library and a sentence library;

C, labeling, training, recognizing with, and calibrating a deep-learning entity and relation extraction model, which is combined with the dictionaries and the corpus to perform entity extraction and relation extraction with update iteration;

D, extracting attributes from the corpus using the ontology structure defined in the conceptual design together with templates;

E, auditing and disambiguating the results of entity recognition and relation extraction, and disambiguating the results of attribute extraction;

F, storing the recognition results in the knowledge graph, updating the topic dictionary, the relation dictionary, and the trained model at intervals, and recognizing the corpus with the new dictionaries and the new model, thereby building the knowledge graph through cyclic iterative updates.

One or more embodiments of the present invention may have the following advantages over the prior art:

The invention provides a standard reference workflow for constructing knowledge graphs, making construction genuinely intelligent, reducing the manual effort required, and improving the usability and practicality of the resulting knowledge graph.

Drawings

FIG. 1 is a flow chart of a journal literature knowledge graph construction method for loop update iteration;

FIG. 2 is a diagram of a thesaurus structure;

FIG. 3 is a diagram of a text database structure;

FIG. 4 is a diagram of a statement database structure;

FIG. 5 is a flow chart of a journal literature knowledge graph construction method for loop update iteration;

FIG. 6 is a flow chart of an entity identification update iterative model;

FIG. 7 is a flowchart of a relationship identification update iteration model;

FIG. 8 is a flow chart of the attribute extraction model.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following examples and the accompanying drawings.

As shown in FIG. 1, the journal literature knowledge graph construction method with cyclic update iteration begins with step 10, conceptual model design, which defines the specification of ontologies, data attributes, and relationship attributes for the knowledge graph.

The ontology model reuses ontology models or data standards that are widely applied internationally, such as CIDOC CRM, EDM, FOAF, EVENT, and FRBR, and extends and customizes them according to the service's own characteristics, which improves the reusability and degree of internationalization of the ontology model.

Ontology construction for the journal literature knowledge graph covers the definition of the ontology and of the data model layer, and comprises the following steps: defining the ontologies, defining the relationship attributes of the ontologies, and defining the data attributes within each ontology.

An ontology is an object or a collection of objects, for example text, author, and institution information. The relationship attributes of an ontology mainly define the associations between ontologies, for example collaboration relationships between authors and affiliation relationships between authors and institutions. The data attributes within an ontology are characteristics of the ontology itself that involve no association with other ontologies, for example an author's name, age, and native place.

The invention defines a triple specification for the knowledge graph: (E₁, R, E₂) and (E, P, V), where E represents an ontology, R represents a relationship attribute, P represents a data attribute, and V represents an attribute value. In an entity-relationship-entity triple, the value range of each entity is an ontology.

Part of the ontology structure defined for journal literature is as follows:

TABLE 1

Identifier	Ontology
E1	Text
E2	Author
E3	Institution
E4	Time
E5	Relationship type
E6	Domain entity
E7	Region
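The triple specification and the ontology identifiers of Table 1 can be illustrated with a small data model. This is only a sketch under stated assumptions: the class names, the English member names (E3 is read here as "institution"), the relation name `affiliated_with`, and the sample values are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from enum import Enum


class Ontology(Enum):
    """Ontology identifiers from Table 1 (member names are illustrative)."""
    TEXT = "E1"
    AUTHOR = "E2"
    INSTITUTION = "E3"
    TIME = "E4"
    RELATION_TYPE = "E5"
    DOMAIN_ENTITY = "E6"
    REGION = "E7"


@dataclass(frozen=True)
class Entity:
    name: str
    ontology: Ontology  # the value range of an entity is an ontology


@dataclass(frozen=True)
class RelationTriple:
    """(E1, R, E2): two entities linked by a relationship attribute."""
    head: Entity
    relation: str
    tail: Entity


@dataclass(frozen=True)
class AttributeTriple:
    """(E, P, V): an entity, a data attribute, and the attribute value."""
    entity: Entity
    attribute: str
    value: str


# Hypothetical sample data, not taken from the patent.
author = Entity("Author A", Ontology.AUTHOR)
institute = Entity("Institute X", Ontology.INSTITUTION)
affiliation = RelationTriple(author, "affiliated_with", institute)
native_place = AttributeTriple(author, "native_place", "Beijing")
```

Keeping the two triple shapes as separate types mirrors the distinction the text draws between relationship attributes, which connect ontologies, and data attributes, which belong to a single ontology.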

Part of the relationship attributes defined for journal literature are as follows:

TABLE 2

Step 20, managing vocabularies and a corpus, wherein the vocabulary is divided into a subject vocabulary and a relation vocabulary, and the corpus, which draws on multiple sources, is divided into a text library and a sentence library.

The vocabularies and corpus of the journal literature knowledge graph are divided into data for multiple fields according to the Chinese Library Classification. The vocabulary is formally divided into a subject vocabulary and a relation vocabulary: the subject vocabulary defines attributes of entity words such as source, field, and sub-field, and the relation vocabulary defines the relations among the entity words of the subject vocabulary. For journal literature, 10 types of word relations are defined, including hypernym-hyponym, similarity, antonymy, and correlation.

The corpus is divided into a text library and a sentence library. The text library is a collection of online journal documents and local resources and mainly stores document data. To facilitate deep text mining, the journal documents in the text library are preprocessed to form the sentence library. The sentence library contains sentences from journal documents together with the positions, within each sentence, of the entity words found in the subject vocabulary. The structure of the subject vocabulary is shown in FIG. 2.

Here content is the entity word, englist is the English translation, category is the Chinese Library Classification category, domain is the word source, and so on.

The relation vocabulary is shown in Table 3:

TABLE 3

Here orgid and tarid are the index ids, in the subject vocabulary, of the two entity words, and reltype is the word relation id. The text library and the sentence library of the corpus are shown in FIG. 3 and FIG. 4.
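The vocabulary entries and the sentence library described above can be sketched as follows. The field names `content`, `englist`, `category`, `domain`, `orgid`, `tarid`, and `reltype` are taken from the text; everything else (the `id` field, the `SentenceRecord` type, the punctuation-based sentence splitter, and the sample data) is an assumption made for illustration, since the actual layouts in FIGS. 2 to 4 are not reproduced here.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SubjectTerm:
    """One subject-vocabulary entry."""
    id: int         # index id (assumed field)
    content: str    # entity word
    englist: str    # English translation (field name kept as in the source)
    category: str   # classification category
    domain: str     # word source


@dataclass
class RelationEntry:
    """One relation-vocabulary entry."""
    orgid: int      # index id of one entity word in the subject vocabulary
    tarid: int      # index id of the other entity word
    reltype: int    # word relation id


@dataclass
class SentenceRecord:
    text: str
    # (term id, start offset, end offset) for each vocabulary hit
    mentions: list = field(default_factory=list)


def build_sentence_library(document, vocabulary):
    """Split a document into sentences and record where vocabulary words occur."""
    records = []
    for sentence in re.split(r"(?<=[.!?。！？])\s*", document):
        if not sentence:
            continue
        record = SentenceRecord(sentence)
        for term in vocabulary:
            for match in re.finditer(re.escape(term.content), sentence):
                record.mentions.append((term.id, match.start(), match.end()))
        records.append(record)
    return records


# Hypothetical sample data.
vocabulary = [
    SubjectTerm(1, "knowledge graph", "knowledge graph", "TP391", "journal"),
    SubjectTerm(2, "deep learning", "deep learning", "TP18", "journal"),
]
document = ("A knowledge graph stores entities. "
            "Training uses deep learning to fill the knowledge graph.")
library = build_sentence_library(document, vocabulary)
```

Storing character offsets alongside each sentence is one way to keep "the positions of the entity words" available for later labeling without re-scanning the text.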

Step 30, labeling, training, recognizing with, and calibrating a deep-learning entity and relation extraction model, which is combined with the dictionaries and the corpus to perform entity extraction and relation extraction with update iteration.

Update iteration of entity extraction:

1. Label the corpus using the dictionary, tagging the entity words that appear in the corpus.

2. Select an entity recognition algorithm and train it on the annotated set. Entity recognition algorithms have evolved from machine learning to deep learning, for example HMM, CRF, BiLSTM+CRF, and BERT+BiLSTM+CRF. The invention uses the BERT+BiLSTM+CRF algorithm for entity recognition.

3. Use the trained labeling model to recognize further corpus, calibrate the recognition results, and store new words that do not yet appear in the topic dictionary into the topic dictionary.

4. Label the corpus again with the updated dictionary and retrain, updating both the model and the dictionary.

By growing the topic dictionary and cyclically labeling the corpus and training the model, the entity extraction process forms a closed loop of update iteration, so the model is continuously optimized and the accuracy of entity recognition improves.
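The dictionary-driven part of this loop (steps 1, 3, and 4) can be sketched in a few functions. This is a minimal illustration, not the patented implementation: the BIO tag scheme, the longest-match strategy, the function names, and the `approve` callback standing in for calibration are all assumptions, and the trained recognizer of step 2 is replaced by a hand-written tag sequence.

```python
def label_with_dictionary(tokens, dictionary):
    """Tag tokens with BIO labels by longest match against dictionary entries."""
    tags = ["O"] * len(tokens)
    max_len = max((len(entry) for entry in dictionary), default=0)
    i = 0
    while i < len(tokens):
        matched = 0
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + span]) in dictionary:
                matched = span
                break
        if matched:
            tags[i] = "B-ENT"
            for j in range(i + 1, i + matched):
                tags[j] = "I-ENT"
            i += matched
        else:
            i += 1
    return tags


def entities_from_tags(tokens, tags):
    """Collect the token spans that a BIO tag sequence marks as entities."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-ENT":
            if current:
                entities.append(tuple(current))
            current = [token]
        elif tag == "I-ENT" and current:
            current.append(token)
        else:
            if current:
                entities.append(tuple(current))
            current = []
    if current:
        entities.append(tuple(current))
    return entities


def update_dictionary(dictionary, recognized, approve):
    """Add calibrated new words that the dictionary does not yet contain."""
    new_words = {e for e in recognized if e not in dictionary and approve(e)}
    return dictionary | new_words


# Step 1: label a sentence with the current dictionary (entries are token tuples).
dictionary = {("knowledge", "graph"), ("entity", "extraction")}
tokens = "the knowledge graph needs entity extraction".split()
tags = label_with_dictionary(tokens, dictionary)

# Step 3: a hand-written stand-in for the trained model's output on new text.
new_tokens = "relation extraction feeds the knowledge graph".split()
predicted = ["B-ENT", "I-ENT", "O", "O", "B-ENT", "I-ENT"]
recognized = entities_from_tags(new_tokens, predicted)
updated = update_dictionary(dictionary, recognized, approve=lambda e: True)

# Step 4: relabel with the updated dictionary, ready for retraining.
relabeled = label_with_dictionary(new_tokens, updated)
```

After one pass the dictionary contains the newly recognized word, so the next labeling round produces richer training data, which is the closed loop the text describes.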

Update iteration of relation extraction:

1. Label the sentence set using the relation dictionary and the existing relation extraction templates, as the basis for a training model. Relation extraction covers a wide range of fields, and conventional deep learning models struggle to perform well when trained for it. Traditional relation extraction therefore relies on a large number of designed templates that encode part-of-speech and grammatical features. The invention labels the sentence set in two ways, with templates and with the relation vocabulary, to form training samples.

2. Select a relation extraction algorithm and train it on the annotated set. The relation extraction model uses the PCNN+Attention algorithm, with CNN/PCNN as the sentence encoder and a sentence-level attention mechanism.

3. Use the trained model to recognize relations in new corpus and store the recognition results in a database; after correction by manual audit, store the results in the relation dictionary and the sentence set, where they serve as new training samples.

4. Use the new training samples to recognize the corpus again, and iterate in a loop.

Relation recognition follows the same cyclic iteration flow as entity recognition, and additionally improves recognition accuracy by incorporating templates distilled from extensive prior experience.
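Assuming the relation extraction algorithm named in step 2 is the piecewise convolutional network (PCNN) with sentence-level attention, its two characteristic operations can be sketched in plain Python. The segment boundaries, the dot-product scoring, and all names and numbers below are simplifying assumptions for illustration; a real model would learn the convolution filters and the relation query.

```python
import math


def piecewise_max_pool(feature_map, head_pos, tail_pos):
    """Max-pool one convolution filter's outputs over three sentence segments.

    The two entity positions split the sentence into left, middle, and right
    pieces; each piece contributes one pooled value.
    """
    first, second = sorted((head_pos, tail_pos))
    pieces = [
        feature_map[: first + 1],
        feature_map[first + 1 : second + 1],
        feature_map[second + 1 :],
    ]
    # An empty piece (entity at the sentence boundary) pools to 0.0.
    return [max(piece) if piece else 0.0 for piece in pieces]


def sentence_level_attention(sentence_vectors, relation_query):
    """Combine the sentences that mention one entity pair with softmax attention."""
    scores = [
        sum(x * q for x, q in zip(vector, relation_query))
        for vector in sentence_vectors
    ]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(score - peak) for score in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(sentence_vectors[0])
    bag_vector = [
        sum(w * vector[d] for w, vector in zip(weights, sentence_vectors))
        for d in range(dim)
    ]
    return weights, bag_vector


# Hypothetical filter outputs for a six-token sentence with entities at 1 and 3.
pooled = piecewise_max_pool([0.1, 0.7, 0.3, 0.9, 0.2, 0.4], head_pos=1, tail_pos=3)

# Two sentence encodings for the same entity pair and a toy relation query.
weights, bag = sentence_level_attention([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0])
```

Pooling per segment keeps coarse positional information relative to the two entities, and the attention weights let sentences that better match the relation dominate the combined representation.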

The flow of the journal literature knowledge graph construction method with cyclic update iteration is shown in FIG. 5. Local data and journal literature data are uniformly mapped and organized into the text library, and the text library data are preprocessed to form the sentence library. The data of the text library and the sentence library are the inputs to the entity extraction model and the relation recognition model; the subject words and relation words in the vocabularies are fed into the models along with the corpus, and the attribute extraction model additionally takes the conceptual model as input. The entity extraction and relation recognition models output the recognized entities and new relation phrases, and the attribute extraction model outputs entity attribute triples. After entity disambiguation, the results are calibrated and used to update the vocabulary database and the journal literature knowledge graph. The new vocabulary is then combined with new corpus to update the vocabulary and the knowledge graph again. The process thus achieves update iteration, continuously correcting the models, the vocabulary, and the knowledge graph, improving accuracy and usability, and forming an organic, intelligent mechanism of cyclic update iteration.

FIG. 6 shows the update iteration model for entity recognition: the corpus is labeled using the vocabulary, and the labeled samples are input into the model for training. The trained model performs entity recognition on the corpus, and the recognition results are used to update the vocabulary and the knowledge graph again, forming the update iteration model of entity recognition.

Similarly, the flow chart of the relation recognition update iteration model is shown in FIG. 7.

Step 40, extracting attributes from the corpus using the ontology structure defined in the conceptual design together with templates.

The attribute extraction adopts a dependency syntactic analysis model, and the attribute extraction process is as follows:

1. Combine the ontology structure and data attributes defined in the conceptual design to form entity attribute templates, and traverse the sentence set for sentences that contain an entity and related attributes.

2. Perform part-of-speech tagging on the sentences with the CRF algorithm. Entity words often have fixed parts of speech; the difficulty of part-of-speech tagging lies in determining the part of speech of out-of-vocabulary words and of phrase words. The part-of-speech tagging results have a significant impact on the analysis of the sentence. Using a CRF for part-of-speech tagging therefore allows more entity features to be learned and facilitates update iteration.

3. Feed the tagging results into a syntactic parser. The parser uses a dependency parsing algorithm whose core is the arc-standard system: a classifier predicts the correct transition from features extracted from the configuration, which makes parsing computationally very efficient.

4. Parse the syntactic results by matching them against syntactic templates, such as the subject-predicate structure, and extract the attributes.
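The arc-standard system mentioned in step 3 can be sketched as a stack-and-buffer machine with three transitions. In this illustration the transition sequence is supplied by hand, standing in for the classifier that would predict each transition from the current configuration; the function name and the example sentence are hypothetical.

```python
def arc_standard_parse(words, transitions):
    """Apply arc-standard transitions and return (head, dependent) index arcs.

    Index 0 is an artificial ROOT; words are numbered from 1.
    """
    stack = [0]
    buffer = list(range(1, len(words) + 1))
    arcs = []
    for action in transitions:
        if action == "SHIFT":
            # Move the next buffer item onto the stack.
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":
            # The second-from-top item becomes a dependent of the top item.
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT-ARC":
            # The top item becomes a dependent of the second-from-top item.
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
        else:
            raise ValueError(f"unknown transition: {action}")
    return arcs


words = ["author", "publishes", "paper"]
transitions = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"]
arcs = arc_standard_parse(words, transitions)
```

Each word is shifted once and attached once, so a sentence of n words is parsed in 2n transitions, which is consistent with the efficiency the text attributes to this approach.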

As shown in FIG. 8, the attribute extraction model takes the conceptual model and the topic dictionary as inputs and extracts sentences from the sentence library as samples. The samples are part-of-speech tagged by the CRF, dependency parsing is performed on the tagging results to obtain sentence structures with grammatical features, and attributes are extracted through grammar templates to form entity attribute triples, which are stored in the knowledge graph. Update iteration of the attribute extraction model mainly calibrates the accuracy of part-of-speech tagging through cyclic training of the CRF model.
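The final template-matching stage can be sketched as a lookup over labeled dependency arcs. The patent does not specify a label scheme, so this example assumes Universal Dependencies style labels (`nsubj`, `obj`, `obl`) and Penn Treebank style part-of-speech tags; the template table, the function name, and the sample sentence are likewise hypothetical.

```python
def extract_attributes(tokens, pos_tags, arcs, templates, entities):
    """Match subject-predicate patterns and emit (entity, attribute, value) triples.

    arcs holds (head_index, label, dependent_index) tuples over token indexes.
    templates maps a predicate word to the data attribute it expresses.
    """
    triples = []
    for head, label, dependent in arcs:
        # The subject must be a known entity attached to the predicate.
        if label != "nsubj" or tokens[dependent] not in entities:
            continue
        attribute = templates.get(tokens[head])
        if attribute is None or not pos_tags[head].startswith("V"):
            continue
        # The value is taken from object-like dependents of the same predicate.
        for other_head, other_label, value_index in arcs:
            if other_head == head and other_label in ("obj", "obl"):
                triples.append((tokens[dependent], attribute, tokens[value_index]))
    return triples


tokens = ["Li", "was", "born", "in", "Shandong"]
pos_tags = ["NNP", "VBD", "VBN", "IN", "NNP"]
arcs = [(2, "nsubj", 0), (2, "aux", 1), (4, "case", 3), (2, "obl", 4)]
templates = {"born": "native_place"}  # predicate word -> data attribute
triples = extract_attributes(tokens, pos_tags, arcs, templates, entities={"Li"})
```

The output has the (E, P, V) shape of an entity attribute triple, so it could be stored in the knowledge graph directly after disambiguation.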

Step 50, entity disambiguation and auditing: the results of entity recognition and relation extraction are audited and disambiguated, and the results of attribute extraction undergo entity disambiguation.

Entity disambiguation mainly addresses two phenomena in natural language: one word having multiple meanings, and multiple words sharing one meaning. It is carried out in two steps: first, deep-learning-based disambiguation is performed before entity recognition and relation recognition; second, matching-based disambiguation is performed, mainly using the relation dictionary and the topic dictionary. The results of entity recognition, relation recognition, and attribute extraction are all disambiguated.
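The dictionary-matching step of disambiguation can be sketched with two small lookups. This is an assumption-laden illustration: the patent does not describe the matching rules, so the synonym table (standing in for similarity relations from the relation dictionary), the per-sense domain words (standing in for subject vocabulary attributes), and the overlap score are all hypothetical.

```python
def canonicalize(mention, synonyms):
    """Map a surface form to its canonical entity word, if one is listed."""
    return synonyms.get(mention, mention)


def resolve_sense(mention, context_words, senses):
    """Pick the listed sense whose domain words overlap the context most.

    Returns None when the mention has no listed senses.
    """
    candidates = senses.get(mention, [])
    if not candidates:
        return None
    best = max(candidates, key=lambda s: len(s["domain_words"] & context_words))
    return best["id"]


# Hypothetical dictionaries.
synonyms = {"KG": "knowledge graph"}
senses = {
    "graph": [
        {"id": "graph/math", "domain_words": {"vertex", "edge"}},
        {"id": "graph/chart", "domain_words": {"axis", "plot"}},
    ]
}

canonical = canonicalize("KG", synonyms)
sense = resolve_sense("graph", {"edge", "vertex", "path"}, senses)
```

Results that pass this step unchanged, or that map cleanly to one canonical entry, can then go on to auditing before being stored.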

Step 60, the recognition results are stored in the knowledge graph, and the topic dictionary, the relation dictionary, and the trained model are updated at intervals. The corpus is then recognized with the new dictionaries and the new model, so that the knowledge graph is built through cyclic iterative updates.
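The interval-based update of step 60 can be sketched as a driver loop. The patent does not say how the interval is measured, so this sketch assumes it is counted in corpus batches; `train`, `recognize`, and `audit` are caller-supplied stand-ins for the training, recognition, and review steps, and the toy implementations below exist only to make the example run.

```python
def run_update_cycles(batches, dictionary, train, recognize, audit, interval):
    """Recognize corpus batches, store audited results, and retrain at intervals."""
    graph = set()
    pending = set()
    model = train(dictionary)
    for count, batch in enumerate(batches, start=1):
        accepted = audit(recognize(model, batch))
        graph.update(accepted)    # store recognition results in the graph
        pending.update(accepted)  # remember them for the next dictionary update
        if count % interval == 0:
            # Fold accepted words into the dictionary and retrain the model.
            dictionary = dictionary | pending
            pending.clear()
            model = train(dictionary)
    return graph, dictionary, model


# Toy stand-ins: the "model" is just a snapshot of the dictionary.
def train(dictionary):
    return set(dictionary)


def recognize(model, batch):
    # Known words plus capitalized words as new candidates.
    return [word for word in batch if word in model or word.istitle()]


def audit(results):
    return {word for word in results if len(word) > 2}


batches = [["graph", "Ontology"], ["Ontology", "Corpus", "of"], ["Corpus"]]
graph, dictionary, model = run_update_cycles(
    batches, {"graph"}, train, recognize, audit, interval=2
)
```

Passing the three stages in as callables keeps the loop independent of any particular model, which suits a process where the model itself is replaced on every cycle.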

Although the embodiments of the present invention are described above, the embodiments are only used for facilitating understanding of the present invention, and are not intended to limit the present invention. Any person skilled in the art can make any modification and variation in form and detail without departing from the spirit and scope of the present disclosure, but the scope of the present disclosure is still subject to the scope of the appended claims.

Claims

1. The periodical literature knowledge graph construction method based on cyclic updating iteration is characterized by comprising the following steps of:

C, labeling, training, recognizing with, and calibrating a deep-learning entity and relation extraction model, and combining the deep-learning entity and relation extraction technique with the vocabularies and the corpus to perform entity extraction and relation extraction with update iteration;

and F, storing the recognition result into the knowledge graph, updating the subject vocabulary, the relation vocabulary and the training model at intervals, and recognizing the language materials by using the new vocabulary and the new model to realize cyclic iteration updating and constructing the knowledge graph.

2. The periodical document knowledge graph construction method of loop update iteration according to claim 1, wherein in the step A:

the ontology is an object or a collection of objects;

the relationship attribute of the ontology is used for defining the association relationship between the ontologies;

the data attributes within the ontology are characteristics of the ontology itself that involve no association relationship.

3. The periodical document knowledge graph construction method of loop update iteration according to claim 1, wherein in the step B:

the subject vocabulary defines the source, domain and sub-domain attributes of the entity words;

the relation vocabulary defines the relations among the entity words of the subject vocabulary, and defines hypernym-hyponym, similarity, antonymy, and correlation relations for word relations in journal literature;

the text library is a collection of online journal documents and local resources and mainly stores document data; the journal documents in the text library are preprocessed to form the sentence library; the sentence library comprises sentences from journal documents and the positions, within each sentence, of the entity words found in the subject vocabulary.

4. The journal literature knowledge graph construction method of cyclic update iteration of claim 1, wherein the update iteration of entity extraction in step C comprises:

marking the corpus by using a word list, and marking entity words appearing in the corpus with labels;

selecting an entity recognition algorithm to train the annotation set;

continuously identifying the corpus by using the trained labeling model, calibrating the identification result, and storing new words which do not appear in the subject word list into the subject word list;

marking with the updated vocabulary again, and training the updated model and the vocabulary again.

5. The periodical document knowledge graph construction method of claim 1, wherein the updating iteration of the relation extraction in the step C comprises:

labeling the sentence set using the relation vocabulary and the existing relation extraction templates, and forming a training model;

selecting a relation extraction algorithm to train on the annotated set, the relation extraction model using the PCNN+Attention algorithm;

performing relation recognition on new corpus using the trained model, storing the recognition results in a database, and, after correction by manual audit, storing them in the relation vocabulary and the sentence set as new training samples;

using the new training samples to recognize the corpus again, and iterating in a loop.

6. The periodical document knowledge graph construction method of loop update iteration according to claim 1, wherein the attribute extraction in the step D adopts a dependency syntactic analysis model, and the attribute extraction process is as follows:

combining the ontology structure and the data attributes defined in the conceptual design to form entity attribute templates, and traversing the sentence set for sentences that contain an entity and related attributes;

performing part-of-speech tagging on the sentences by adopting a CRF algorithm;

substituting the labeling result into a syntax analyzer for syntax analysis;

the syntactic result is parsed and attributes are extracted by matching the syntactic templates.