Disclosure of Invention
To solve the above technical problems, the invention provides a method for constructing a journal literature knowledge graph through cyclic update iteration. Starting from the goal of constructing knowledge graphs automatically, the method takes a knowledge network journal literature database as its data source and combines several knowledge graph construction modules, including concept design, dictionary management, corpus management, model training, knowledge element extraction and entity disambiguation. Through repeated update iterations that continuously improve the accuracy of both the trained models and the knowledge graph, it realizes intelligent, cyclically updated construction of the journal literature knowledge graph.
The object of the invention is achieved by the following technical solution:
A method for constructing a journal literature knowledge graph through cyclic update iteration comprises the following steps:
a, designing a concept model and defining the ontology structure of the journal literature knowledge graph, wherein the ontology structure comprises ontologies, relationship attributes between ontologies, and data attributes inside each ontology;
b, managing the word lists and the corpus, wherein the word lists are divided into a subject word list and a relation word list, and the corpus, which draws on a plurality of sources, is divided into a text library and a sentence library;
c, building a deep-learning entity and relation extraction model through labeling, training, recognition and calibration, combining it with the dictionaries and the corpus to perform entity extraction and relation extraction, and iterating the updates;
d, extracting attributes from the corpus by using the ontology structure defined in the concept design and by introducing templates;
e, auditing and disambiguating the results of entity recognition and relation extraction, and disambiguating the results of attribute extraction;
f, storing the recognition results in the knowledge graph, updating the topic dictionary, the relation dictionary and the trained models at intervals, and recognizing the corpus again with the new dictionaries and models, thereby constructing the knowledge graph through cyclic iterative updating.
One or more embodiments of the present invention may have the following advantages over the prior art:
The invention provides a standard reference flow for constructing a knowledge graph, so that construction becomes intelligent, the manual effort required is reduced, and the usability and practicability of the knowledge graph are improved.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following examples and the accompanying drawings.
As shown in FIG. 1, the method for constructing a journal literature knowledge graph through cyclic update iteration comprises the following steps. Step 10, conceptual model design: specifications for ontologies, data attributes and relationship attributes are defined for the knowledge graph.
The ontology model reuses ontology models or data standards that are widely applied internationally, such as CIDOC CRM, EDM, FOAF, EVENT and FRBR, and extends and customizes them according to the characteristics of the business at hand, which improves the reusability and internationalization of the ontology model.
The ontology construction of the journal literature knowledge graph defines the ontology and data model layer of the graph and comprises three parts: defining the ontologies, defining the relationship attributes between ontologies, and defining the data attributes inside each ontology.
An ontology is an object or a collection of objects, for example text, author, or institution information. Relationship attributes define the associations between ontologies, for example the collaboration relationship between two authors or the affiliation relationship between an author and an institution. Data attributes are characteristics of an ontology itself that involve no association with other ontologies, for example an author's name, age, or native place.
The invention defines two triple specifications for the knowledge graph: (E_1, R, E_2) and (E, P, V), wherein E represents an ontology, R represents a relationship attribute, P represents a data attribute, and V represents an attribute value. In an entity-relationship-entity triple, the value range of each entity is an ontology.
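The two triple forms can be illustrated with a small data model. This is a minimal sketch rather than the patented implementation: the class names, the `is_valid` check, the relation names and the sample entities are all assumptions introduced for illustration.

```python
from dataclasses import dataclass

# Illustrative subset of ontologies; the names are assumptions for this sketch.
KNOWN_ONTOLOGIES = {"Text", "Author", "Institution"}


@dataclass(frozen=True)
class Entity:
    ontology: str  # ontology the entity belongs to, e.g. "Author"
    name: str


@dataclass(frozen=True)
class RelationTriple:
    """Entity-relationship-entity triple (E_1, R, E_2)."""
    head: Entity
    relation: str
    tail: Entity


@dataclass(frozen=True)
class AttributeTriple:
    """Entity-attribute-value triple (E, P, V)."""
    entity: Entity
    prop: str
    value: str


def is_valid(triple):
    """Check that every entity in the triple belongs to a defined ontology."""
    if isinstance(triple, RelationTriple):
        return (triple.head.ontology in KNOWN_ONTOLOGIES
                and triple.tail.ontology in KNOWN_ONTOLOGIES)
    if isinstance(triple, AttributeTriple):
        return triple.entity.ontology in KNOWN_ONTOLOGIES
    return False


# Hypothetical sample data.
author_a = Entity("Author", "Author A")
author_b = Entity("Author", "Author B")
institute = Entity("Institution", "Institute X")
triples = [
    RelationTriple(author_a, "collaborates_with", author_b),
    RelationTriple(author_a, "affiliated_with", institute),
    AttributeTriple(author_a, "name", "Author A"),
]
```

Keeping the two triple kinds as separate types makes the value-range rule easy to enforce: both ends of a relation triple must be entities of a known ontology, while the value of an attribute triple is a plain literal.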
Part of the ontology structure defined for journal literature is as follows:
TABLE 1
Identifier | Ontology
E1 | Text
E2 | Author
E3 | Institution
E4 | Time
E5 | Relationship type
E6 | Domain entity
E7 | Region
Part of the relationship attributes defined for journal literature are as follows:
TABLE 2
Step 20, managing the word lists and the corpus, wherein the word lists are divided into a subject word list and a relation word list, and the corpus, which draws on a plurality of sources, is divided into a text library and a sentence library.
The word lists and the corpus of the journal literature knowledge graph are divided into a plurality of fields according to the Chinese Library Classification. Formally, the word lists are divided into a subject word list and a relation word list. The subject word list defines attributes of each entity word such as its source, field and sub-field; the relation word list defines the relationships among the entity words of the subject word list. For journal literature, 10 types of word relationship are defined, including hypernym-hyponym, synonym, antonym and related-term relationships.
The corpus is divided into a text library and a sentence library. The text library collects online journal documents and local resources and mainly stores document data. To facilitate deep text mining, the journal documents in the text library are preprocessed to form the sentence library, which contains the sentences from the journal documents together with the positions, within each sentence, of the entity words found in the subject word list. The structure of the subject word list is shown in FIG. 2.
Here content is the entity word, englist is its English translation, category is its Chinese Library Classification category, domain is the source of the word, and so on.
The relation word list is shown in Table 3:
TABLE 3
Here orgid and tarid are the index ids of the entity words in the subject word list, and reltype is the id of the word relationship. The text library and the sentence library of the corpus are shown in FIG. 3 and FIG. 4.
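As an illustration of these structures, the following sketch models one subject word list entry, one relation word list entry, and the preprocessing that turns a text into sentence library records. The field names content, englist, category, domain, orgid, tarid and reltype come from the description above; the `id` field, the function name, the record layout and all sample values are assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class SubjectEntry:
    id: int        # index id in the subject word list (assumed field)
    content: str   # entity word
    englist: str   # English translation (field name as given in the source)
    category: str  # classification category
    domain: str    # source of the word


@dataclass
class RelationEntry:
    orgid: int    # index id of the first entity word
    tarid: int    # index id of the second entity word
    reltype: int  # id of the word relationship


def build_sentence_library(text, subject_vocab):
    """Split a text into sentences and record where entity words occur."""
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    library = []
    for sentence in (p.strip() for p in parts):
        if not sentence:
            continue
        hits = []
        for entry in subject_vocab:
            pattern = re.escape(entry.content)
            for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
                hits.append((entry.id, match.start(), match.end()))
        library.append({"sentence": sentence, "entities": sorted(hits, key=lambda h: h[1])})
    return library


# Hypothetical sample data.
subject_vocab = [
    SubjectEntry(1, "knowledge graph", "knowledge graph", "TP", "journal"),
    SubjectEntry(2, "deep learning", "deep learning", "TP", "journal"),
]
relation_vocab = [RelationEntry(orgid=2, tarid=1, reltype=4)]
text = "A knowledge graph stores entities. Deep learning improves entity recognition."
library = build_sentence_library(text, subject_vocab)
```

Each record stores character offsets rather than copies of the matched words, so a later labeling step can align the dictionary hits with whatever tokenization the extraction model uses.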
Step 30, building a deep-learning entity and relation extraction model through labeling, training, recognition and calibration, combining it with the dictionaries and the corpus to perform entity extraction and relation extraction, and iterating the updates.
Update iteration of entity extraction:
1. Label the corpus with the dictionary, marking the entity words that appear in the corpus.
2. Select an entity recognition algorithm and train it on the labeled set. Entity recognition algorithms have evolved from machine learning to deep learning, for example HMM, CRF, BiLSTM+CRF and BERT+BiLSTM+CRF. The invention adopts the BERT+BiLSTM+CRF algorithm for entity recognition.
3. Use the trained labeling model to keep recognizing entities in the corpus, calibrate the recognition results, and store new words that do not yet appear in the topic dictionary into the topic dictionary.
4. Label the corpus again with the updated dictionary and retrain, so that both the model and the dictionary are updated.
By enlarging the topic dictionary and cyclically labeling the corpus and training the model, the entity extraction process forms a closed loop of update iterations, in which the model is optimized continuously and the accuracy of entity recognition improves.
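The closed loop can be sketched in a few lines. The dictionary labeling below uses B/I/O tags with longest-match lookup, which is an assumed labeling scheme. The `train` and `recognize` functions are deliberately trivial stand-ins for the BERT+BiLSTM+CRF model (they only learn which word tends to precede an entity), and `calibrate` stands in for the calibration step; none of these names come from the source.

```python
def bio_label(tokens, dictionary):
    """Tag tokens with B/I/O using longest-match lookup in the dictionary."""
    tags = ["O"] * len(tokens)
    max_len = max((len(entry) for entry in dictionary), default=0)
    i = 0
    while i < len(tokens):
        matched = 0
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in dictionary:
                matched = n
                break
        if matched:
            tags[i] = "B"
            for j in range(i + 1, i + matched):
                tags[j] = "I"
            i += matched
        else:
            i += 1
    return tags


def train(labeled):
    """Stand-in trainer: remember the words seen directly before an entity."""
    cues = set()
    for tokens, tags in labeled:
        for i, tag in enumerate(tags):
            if tag == "B" and i > 0:
                cues.add(tokens[i - 1])
    return cues


def recognize(model, tokens):
    """Stand-in recognizer: propose the token that follows a learned cue word."""
    return [(tokens[i + 1],) for i in range(len(tokens) - 1) if tokens[i] in model]


def iterate(corpus, dictionary, calibrate, rounds=2):
    """Label, train, recognize, calibrate, extend the dictionary, and repeat."""
    dictionary = set(dictionary)
    for _ in range(rounds):
        labeled = [(tokens, bio_label(tokens, dictionary)) for tokens in corpus]
        model = train(labeled)
        for tokens in corpus:
            for candidate in recognize(model, tokens):
                if candidate not in dictionary and calibrate(candidate):
                    dictionary.add(candidate)
    return dictionary


# Hypothetical corpus: only "BERT" is known at the start.
corpus = [["we", "use", "BERT", "for", "tagging"],
          ["we", "use", "CRF", "for", "decoding"]]
updated = iterate(corpus, {("BERT",)}, calibrate=lambda candidate: True)
```

In this toy run the first round learns the cue word "use" from the labeled occurrence of "BERT", proposes "CRF" as a new entity, and adds it to the dictionary, so the second round labels both sentences.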
Update iteration of relation extraction:
1. Label the sentence set with the relation dictionary and the existing relation extraction templates to form training samples. Relation extraction spans a wide range of fields, and a conventional deep learning model alone rarely performs well when trained for it. Conventional relation extraction therefore relies on a large number of hand-designed templates that contain both part-of-speech and grammatical features. The invention labels the sentence set in two ways, by template and by relation word list, to form the training samples.
2. Select a relation extraction algorithm and train it on the labeled set. The relation extraction model adopts the PCNN+Attention algorithm: a CNN/PCNN serves as the sentence encoder, combined with a sentence-level attention mechanism.
3. Use the trained model to recognize relations in new corpus material and store the recognition results in the database. After correction by manual audit, store the results in the relation dictionary and the sentence set, where they serve as new training samples.
4. Use the new training samples to recognize the corpus again, and iterate in a loop.
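The PCNN encoder and the sentence-level mechanism named in step 2, read here as sentence-level attention, can be illustrated in plain Python. This is a simplified sketch under stated assumptions: the convolution output is taken as given, the segment boundaries include each entity position in the segment it closes, and attention scores are plain dot products with a relation query vector rather than a learned bilinear form.

```python
import math


def piecewise_max_pool(features, head_pos, tail_pos):
    """Max-pool conv features separately over the three segments cut by the two entities.

    features: list of per-token feature vectors (one value per filter).
    Returns the concatenation of the three per-segment maxima.
    """
    first, second = sorted((head_pos, tail_pos))
    segments = [features[:first + 1],
                features[first + 1:second + 1],
                features[second + 1:]]
    width = len(features[0])
    pooled = []
    for segment in segments:
        if segment:
            pooled.extend(max(column) for column in zip(*segment))
        else:
            pooled.extend([0.0] * width)  # empty segment contributes zeros
    return pooled


def sentence_attention(sentence_vecs, query):
    """Weight the sentences of one entity pair by softmax of their dot product with query."""
    scores = [sum(a * b for a, b in zip(vec, query)) for vec in sentence_vecs]
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(sentence_vecs[0])
    bag = [sum(w * vec[d] for w, vec in zip(weights, sentence_vecs)) for d in range(dim)]
    return weights, bag


# Hypothetical conv output for a 5-token sentence with 2 filters.
features = [[1, 0], [0, 2], [3, 1], [0, 0], [2, 5]]
pooled = piecewise_max_pool(features, head_pos=1, tail_pos=3)
weights, bag = sentence_attention([[1.0, 0.0], [1.0, 0.0]], query=[1.0, 0.0])
```

Pooling per segment keeps coarse positional information about what lies before, between and after the two entities, which a single max over the whole sentence would discard.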
Relation recognition follows the same cyclic iteration flow as entity recognition, and additionally uses templates accumulated from extensive past experience to improve recognition accuracy.
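The two labeling modes used to build relation training samples, by template and by relation dictionary, might look like the following. The sketch simplifies templates to surface regular expressions, whereas the templates described above also carry part-of-speech and grammatical features; the template text, relation names and dictionary entries are hypothetical.

```python
import re

# Hypothetical surface template: "<head> is a kind of <tail>".
TEMPLATES = [
    (re.compile(r"(?P<head>\w[\w ]*?) is a kind of (?P<tail>\w[\w ]*)"), "hyponym_of"),
]

# Hypothetical relation dictionary: (head word, tail word) -> relation type.
RELATION_DICT = {("CRF", "HMM"): "related_to"}


def label_sentence(sentence, entities):
    """Return training samples (sentence, head, tail, relation, source) for one sentence."""
    samples = []
    # Mode 1: template matching; both matched spans must be known entities.
    for pattern, relation in TEMPLATES:
        for match in pattern.finditer(sentence):
            head = match.group("head").strip()
            tail = match.group("tail").strip()
            if head in entities and tail in entities:
                samples.append((sentence, head, tail, relation, "template"))
    # Mode 2: relation dictionary lookup; both words must occur in the sentence.
    for (head, tail), relation in RELATION_DICT.items():
        if head in sentence and tail in sentence:
            samples.append((sentence, head, tail, relation, "dictionary"))
    return samples


s1 = "BiLSTM is a kind of neural network."
s2 = "CRF and HMM are sequence models."
samples_1 = label_sentence(s1, {"BiLSTM", "neural network"})
samples_2 = label_sentence(s2, {"CRF", "HMM"})
```

Recording which mode produced each sample makes the later manual audit easier, since template hits and dictionary hits tend to fail in different ways.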
The flow of the method is shown in FIG. 5. Local data and journal literature data are mapped and organized uniformly into the text library, and the text library data are preprocessed to form the sentence library. The data of the text library and the sentence library are the input corpus of the entity extraction model and the relation recognition model; the subject words and relation words in the word lists are fed into the models together with the corpus, and the attribute extraction model additionally takes the concept model as input. The entity extraction and relation recognition models output the recognized entities and new relation phrases, and the attribute extraction model outputs entity attribute triples. After entity disambiguation and calibration, the word lists and the journal literature knowledge graph are updated. The new word lists are then combined with new corpus material to update the word lists and the knowledge graph again. In this way the process iterates, continuously correcting the models, the word lists and the knowledge graph, improving accuracy and usability, and forming an intelligent cyclic update iteration mechanism.
FIG. 6 shows the update iteration model for entity recognition: the corpus is labeled with the word list, and the labeled samples are input into the model for training. The trained model performs entity recognition on the corpus, and the recognition results are used to update the word list and the knowledge graph again, which forms the update iteration model of entity recognition.
The flow chart of the update iteration model for relation recognition is shown in FIG. 7.
Step 40, extracting attributes from the corpus by using the ontology structure defined in the concept design and by introducing templates.
Attribute extraction adopts a dependency syntactic analysis model and proceeds as follows:
1. Combine the ontology structure and data attributes defined in the concept design to form entity attribute templates, and traverse the sentence set for sentences that contain an entity and its related attributes.
2. Tag the parts of speech of the sentences with the CRF algorithm. Entity words usually have fixed parts of speech; the difficulty of part-of-speech tagging lies in judging the part of speech of out-of-vocabulary words and of phrase words. Because the tagging result strongly affects the analysis of the sentence, using a CRF for part-of-speech tagging allows more entity features to be learned and makes update iteration easier.
3. Feed the tagging result into a syntactic parser. The parser uses a dependency parsing algorithm built on the arc-standard transition system: a classifier predicts the correct transition from features extracted from the current configuration, which makes parsing computationally efficient.
4. Analyze the parsing result by matching grammatical templates, such as the subject-predicate structure, and extract the attributes.
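The arc-standard system mentioned in step 3 can be shown with a small interpreter that applies a sequence of transitions to a stack and a buffer. In this sketch the transition sequence is supplied by hand in place of the classifier's predictions; the function name, the label names and the example sentence are assumptions.

```python
def arc_standard(words, transitions):
    """Apply arc-standard transitions and return (head, dependent, label) arcs.

    Word indices start at 1; index 0 is the artificial ROOT node.
    """
    stack = [0]
    buffer = list(range(1, len(words) + 1))
    arcs = []
    for action, label in transitions:
        if action == "SHIFT":
            if not buffer:
                raise ValueError("SHIFT with empty buffer")
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":
            # The second-from-top item becomes a dependent of the top item.
            if len(stack) < 3:
                raise ValueError("LEFT-ARC needs two non-ROOT items on the stack")
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent, label))
        elif action == "RIGHT-ARC":
            # The top item becomes a dependent of the item below it.
            if len(stack) < 2:
                raise ValueError("RIGHT-ARC needs two items on the stack")
            dependent = stack.pop()
            arcs.append((stack[-1], dependent, label))
        else:
            raise ValueError(f"unknown transition: {action}")
    return arcs


# Hypothetical sentence and hand-written transition sequence.
words = ["author", "publishes", "papers"]
transitions = [
    ("SHIFT", None),
    ("SHIFT", None),
    ("LEFT-ARC", "nsubj"),   # publishes -> author
    ("SHIFT", None),
    ("RIGHT-ARC", "obj"),    # publishes -> papers
    ("RIGHT-ARC", "root"),   # ROOT -> publishes
]
arcs = arc_standard(words, transitions)
```

Each word is shifted once and attached once, so a sentence of n words is parsed in 2n transitions, which is the source of the efficiency noted above.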
FIG. 8 shows the attribute extraction model. Taking the conceptual model and the topic dictionary as inputs, the model extracts sentences from the sentence library as samples. The samples are part-of-speech tagged by the CRF, dependency syntactic analysis is performed on the tagging results, the resulting sentences with grammatical features are analyzed, and attributes are extracted through grammatical templates to form entity attribute triples, which are stored in the knowledge graph. The update iteration of the attribute extraction model mainly calibrates the accuracy of part-of-speech tagging through cyclic training of the CRF model.
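The final template matching stage can be sketched as a lookup over dependency arcs. The example assumes that parsing has already produced (head, dependent, label) arcs with 1-based word indices, and that a hypothetical mapping `attribute_predicates` links predicate words to data attributes of the concept model; the single subject-predicate-object template and the sample sentence are illustrative only.

```python
def extract_attributes(words, arcs, entities, attribute_predicates):
    """Match a subject-predicate-object template and emit (E, P, V) triples.

    words: sentence tokens; arcs: (head, dependent, label) with 1-based indices.
    entities: entity words known from the topic dictionary.
    attribute_predicates: predicate word -> data attribute name (assumed mapping).
    """
    subjects, objects = {}, {}
    for head, dependent, label in arcs:
        if label == "nsubj":
            subjects[head] = dependent
        elif label == "obj":
            objects[head] = dependent
    triples = []
    for predicate, subject in subjects.items():
        if predicate not in objects:
            continue  # template needs both a subject and an object
        entity = words[subject - 1]
        predicate_word = words[predicate - 1]
        value = words[objects[predicate] - 1]
        if entity in entities and predicate_word in attribute_predicates:
            triples.append((entity, attribute_predicates[predicate_word], value))
    return triples


# Hypothetical parsed sentence.
words = ["AuthorA", "studies", "linguistics"]
arcs = [(2, 1, "nsubj"), (2, 3, "obj"), (0, 2, "root")]
triples = extract_attributes(
    words, arcs,
    entities={"AuthorA"},
    attribute_predicates={"studies": "research_field"},
)
```

Restricting matches to known entities and to predicates listed in the concept model keeps the template from emitting triples for arbitrary subject-verb-object sentences.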
Step 50, entity disambiguation and auditing: the results of entity recognition and relation extraction are audited and disambiguated, and the results of attribute extraction are subjected to entity disambiguation.
Entity disambiguation mainly addresses two phenomena of natural language: one word having several meanings, and several words sharing one meaning. It is carried out in two steps: first, deep-learning disambiguation is performed before entity recognition and relation recognition; second, matching-based disambiguation is performed, mainly with the relation dictionary and the topic dictionary. The results of entity recognition, relation recognition and attribute extraction are all disambiguated.
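The dictionary matching step might be sketched as follows. The approach shown is an assumption rather than the method of the invention: words that share one meaning are folded onto a canonical word through synonym entries of the relation dictionary, and a word with several meanings is resolved by choosing the sense whose domain matches the document. The relation type id, function names and sample data are hypothetical.

```python
SYNONYM_RELTYPE = 2  # hypothetical id of the synonym relationship


def build_canonical_map(subject_vocab, relation_vocab):
    """Map each synonym to its canonical word.

    subject_vocab: {index id: entity word}
    relation_vocab: list of (orgid, tarid, reltype) entries
    """
    canonical = {}
    for orgid, tarid, reltype in relation_vocab:
        if reltype == SYNONYM_RELTYPE:
            canonical[subject_vocab[orgid]] = subject_vocab[tarid]
    return canonical


def resolve(mention, canonical):
    """Follow synonym links to the canonical word, guarding against cycles."""
    seen = set()
    while mention in canonical and mention not in seen:
        seen.add(mention)
        mention = canonical[mention]
    return mention


def pick_sense(mention, senses, context_domain):
    """Choose the sense whose domain matches the context; fall back to the first sense."""
    candidates = senses.get(mention, [])
    for sense_id, domain in candidates:
        if domain == context_domain:
            return sense_id
    return candidates[0][0] if candidates else None


# Hypothetical sample data.
subject_vocab = {1: "KG", 2: "knowledge graph"}
relation_vocab = [(1, 2, SYNONYM_RELTYPE)]
canonical = build_canonical_map(subject_vocab, relation_vocab)
senses = {"transformer": [("transformer#power", "electrical engineering"),
                          ("transformer#model", "computer science")]}
```

For example, `resolve("KG", canonical)` folds the abbreviation onto "knowledge graph", and `pick_sense("transformer", senses, "computer science")` selects the sense registered for that domain.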
Step 60, storing the recognition results in the knowledge graph and updating the topic dictionary, the relation dictionary and the trained models at intervals. The corpus is recognized again with the new dictionaries and models, thereby constructing the knowledge graph through cyclic iterative updating.
Although the embodiments of the present invention are described above, they are provided only to facilitate understanding of the invention and are not intended to limit it. Any person skilled in the art may make modifications and variations in form and detail without departing from the spirit and scope of the present disclosure, but the scope of protection remains subject to the appended claims.