Disclosure of Invention
To solve the above technical problems, the invention aims to provide a method for constructing a journal-literature knowledge graph through cyclic update iteration. Based on automatic knowledge graph construction, the method takes the CNKI journal literature base as its data source, organically combines multiple construction modules such as concept design, dictionary management, corpus management, model training, knowledge-element extraction, and entity disambiguation, and forms a closed loop in which iterative updates continuously improve the accuracy of both the knowledge graph and the training, thereby truly realizing intelligent, cyclically updated, iterative construction of a journal-literature knowledge graph.
The purpose of the invention is realized by the following technical scheme:
A method for constructing a journal-literature knowledge graph with cyclic update iteration comprises the following steps:
a, designing a concept model: defining the ontology structure of the journal-literature knowledge graph, including the definition of ontologies, the relationship attributes between ontologies, and the data attributes inside each ontology;
b, managing the word lists and the corpus: the word lists are divided into a subject word list and a relational word list, and the corpus is divided into a text library and a sentence library and covers corpora from multiple sources;
c, building an entity-relation extraction model through deep-learning-based labeling, training, recognition, and calibration: a deep-learning entity-relation extraction technique is combined with the dictionaries and the corpus to perform entity extraction and relation extraction, with update iteration;
d, performing corpus attribute extraction through the ontology structure defined in the concept design and by introducing templates;
e, checking and disambiguating the results of entity recognition and relation extraction, and performing entity disambiguation on the results of attribute extraction;
f, storing the recognition results into the knowledge graph, updating the subject dictionary, the relation dictionary, and the training model accordingly, and recognizing new material with the updated dictionaries and model, thereby realizing cyclic update iteration and constructing the knowledge graph.
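For illustration only, the closed loop of steps a-f can be sketched as follows. This is a minimal sketch, not the disclosed implementation: the "model" is reduced to dictionary matching, and the manual calibration of step f is reduced to a fixed set of reviewed new words; all names are hypothetical placeholders.

```python
def update_iteration(sentences, subject_dict, calibrated_new_words, rounds=2):
    """Simplified sketch of the closed loop (steps c, e, f): recognize
    entities with the current dictionary, store results, then feed
    calibrated new words back into the subject dictionary."""
    graph = set()
    for _ in range(rounds):
        for sent in sentences:
            for word in sent.split():
                if word in subject_dict:
                    graph.add(("Entity", word))
        # step f: calibrated new words grow the dictionary for the next round
        subject_dict = subject_dict | calibrated_new_words
    return graph, subject_dict

graph, final_dict = update_iteration(
    ["bert extracts entities", "crf labels sequences"],
    {"bert"}, {"crf"})
```

In the first round only "bert" is recognized; after the dictionary update, the second round additionally recognizes "crf", which is the growth effect the closed loop relies on.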
One or more embodiments of the present invention may have the following advantages over the prior art:
The invention provides a standard process reference for knowledge graph construction, so that the constructed knowledge graph is truly oriented toward intelligence, human-resource waste is reduced, and the usability and practicality of the knowledge graph are improved.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.
As shown in fig. 1, the method for constructing a journal-literature knowledge graph with cyclic update iteration includes step 10, conceptual model design: specifying the knowledge graph's ontology definitions, data attributes, and relationship attributes.
The ontology model refers to ontology models and data standards that are widely applied internationally, such as CIDOC CRM, EDM, FOAF, EVENT, and FRBR, and is expanded and customized according to the service characteristics of journal literature, improving its reusability and degree of internationalization.
The ontology construction of the journal-literature knowledge graph comprises the definition of the ontology and data-model layer of the knowledge graph, and includes: defining the ontologies, defining the relationship attributes between ontologies, and defining the data attributes inside each ontology.
An ontology is an object or a collection of objects, for example: text, author, and institution. The relationship attributes of ontologies mainly define the association relationships between ontologies, for example: collaborations between authors, or affiliations between authors and institutions. The data attributes inside an ontology are intrinsic characteristics of the ontology itself that involve no association relationship, for example: an author's name, age, and place of origin.
The invention defines a triple specification for the knowledge graph: (E1, R, E2) and (E, P, V), wherein E represents an ontology, R represents a relationship attribute, P represents a data attribute, and V represents an attribute value. In an entity-relationship-entity triple, the value range of each entity is an ontology.
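The two triple forms can be represented with simple typed records; the field values below are hypothetical examples, not entries from the disclosed graph.

```python
from typing import NamedTuple

class RelationTriple(NamedTuple):
    """(E1, R, E2): ontology, relationship attribute, ontology."""
    e1: str
    r: str
    e2: str

class AttributeTriple(NamedTuple):
    """(E, P, V): ontology, data attribute, attribute value."""
    e: str
    p: str
    v: str

# Hypothetical instances illustrating the two triple forms
t_rel = RelationTriple("Author", "affiliated_with", "Institution")
t_attr = AttributeTriple("Author", "name", "Zhang San")
```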
The ontology structure of the journal-literature part is defined as follows:
TABLE 1
Identifier | Ontology
E1 | Text
E2 | Author
E3 | Institution
E4 | Time
E5 | Relation type
E6 | Domain entity
E7 | Region
The relational attributes of the journal-literature part are defined as follows:
TABLE 2
Step 20, managing the word lists and the corpus: the word lists are divided into a subject word list and a relational word list, and the corpus is divided into a text library and a sentence library and covers corpora from multiple sources.
The word lists and corpus of the journal-literature knowledge graph are divided into data of multiple fields according to the Chinese Library Classification. The word lists are divided formally into a subject word list and a relational word list: the subject word list defines attributes of entity words such as source, field, and sub-field, while the relational word list defines relations between the entity words of the subject word list; the word relations define 10 relations in journal literature, including hypernym-hyponym, synonym, antonym, and related-term relations.
The corpus is divided into a text library and a sentence library. The text library is a collection of network journal documents and local resources, mainly storing document data. To facilitate deep text mining, the journal documents in the text library are preprocessed to form the sentence library. The sentence library includes sentences from the journal literature together with the positions, within each sentence, of the entity words found in the subject word list. The structure of the subject word list is shown in fig. 2.
Here, content is the entity word itself, english is its English translation, catalog is its Chinese Library Classification number, and domain is the source field of the word.
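A subject-word-list entry with these fields can be sketched as a simple record; the field names follow the description of fig. 2, while the example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SubjectWord:
    """One illustrative subject-word-list entry (fields per fig. 2)."""
    content: str  # the entity word itself
    english: str  # English translation of the word
    catalog: str  # Chinese Library Classification number (value hypothetical)
    domain: str   # source field of the word

entry = SubjectWord(content="知识图谱", english="knowledge graph",
                    catalog="TP391", domain="computer science")
```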
The relational word table is shown in Table 3:
TABLE 3
Here, orgid and tarid are the index ids, in the subject word list, of the source and target entity words, and relatype is the word-relation id. The text library and sentence library of the corpus are shown in fig. 3 and fig. 4.
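A relational-word-list row with these fields can likewise be sketched as a record; the field names come from Table 3 as described above, and the relation-id constant is a hypothetical example.

```python
from dataclasses import dataclass

@dataclass
class WordRelation:
    """One illustrative relational-word-list row (fields per Table 3)."""
    orgid: int     # index id of the source word in the subject word list
    tarid: int     # index id of the target word in the subject word list
    relatype: int  # word-relation id (e.g. hypernym, synonym, antonym)

HYPERNYM = 1  # hypothetical relation id; the method defines 10 such relations
rel = WordRelation(orgid=101, tarid=205, relatype=HYPERNYM)
```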
Step 30: building the entity-relation extraction model through deep-learning-based labeling, training, recognition, and calibration; a deep-learning entity-relation extraction technique is combined with the dictionaries and corpus to perform entity extraction and relation extraction, with update iteration.
Update iteration of entity extraction:
1. Label the corpus using the dictionary, marking the entity words that appear in the corpus.
2. Select an entity recognition algorithm and train on the labeled set. Entity recognition algorithms have themselves gone through an update iteration from machine learning to deep learning, for example: HMM, CRF, BiLSTM+CRF, and BERT+BiLSTM+CRF. The invention adopts the BERT+BiLSTM+CRF algorithm for entity recognition.
3. Continuously recognize the corpus with the trained labeling model, calibrate the recognition results, and store new words not yet present in the subject dictionary into the subject dictionary.
4. Label again with the updated dictionary, and retrain the model with the updated dictionary.
The entity extraction process forms a closed loop of update iteration by growing the subject dictionary and cyclically re-labeling the corpus and retraining the model, so that the model is continuously optimized and the accuracy of entity recognition improves.
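Step 1 of this loop, labeling the corpus with the dictionary, can be sketched as dictionary-driven BIO tagging. This is a toy whitespace-tokenized sketch, not the BERT+BiLSTM+CRF model itself.

```python
def bio_label(tokens, subject_dict):
    """Dictionary-driven BIO labeling (step 1 of the loop). Multi-token
    dictionary entries are matched greedily, longest match first."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        for j in range(len(tokens), i, -1):  # longest span starting at i
            if " ".join(tokens[i:j]) in subject_dict:
                labels[i] = "B"
                labels[i + 1:j] = ["I"] * (j - i - 1)
                i, matched = j, True
                break
        if not matched:
            i += 1
    return labels

tags = bio_label("conditional random field models sequences".split(),
                 {"conditional random field", "sequences"})
```

The resulting tag sequence becomes the training input for the sequence-labeling model, and any calibrated new entities flow back into `subject_dict` for the next labeling round.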
Update iteration of relationship extraction:
1. Label the sentence set using the relation dictionary and the existing relation extraction templates, and form the training set. Relation extraction spans a wide range of fields, and a conventional deep learning model alone struggles to perform well on relation extraction training; traditional relation extraction therefore designs a large number of templates containing part-of-speech and grammatical features. The method labels the sentence set through both the templates and the relational word list, forming training samples.
2. Select a relation extraction algorithm and train on the labeled set; the relation extraction model uses the PCNN+Attention algorithm, with CNN/PCNN as the sentence encoder and a sentence-level attention mechanism.
3. Perform relation recognition on new corpora with the trained model, store the recognition results in a database, correct them through manual review, store the corrected results in the relation dictionary and the sentence set, and add them to the new training samples.
4. Recognize the corpus again with the new training samples and iterate the loop.
Relation recognition adopts the same loop iteration process as entity recognition, while templates distilled from extensive past experience further improve recognition accuracy.
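The template-based labeling of step 1 can be sketched as pattern matching over the sentence set; the template patterns and relation names below are hypothetical, and a real system would use the part-of-speech and grammatical feature templates described above rather than bare regexes.

```python
import re

def label_relations(sentences, templates):
    """Template-based labeling of the sentence set (step 1 of relation
    extraction). Each template maps a regex with two capture groups
    (the two entities) to a relation type."""
    samples = []
    for sent in sentences:
        for pattern, rel_type in templates:
            m = re.search(pattern, sent)
            if m:
                samples.append((m.group(1), rel_type, m.group(2)))
    return samples

templates = [
    (r"(\w+) works at (\w+)", "affiliated_with"),
    (r"(\w+) collaborates with (\w+)", "collaborates_with"),
]
samples = label_relations(["Zhang works at Tsinghua"], templates)
```

The matched triples serve as distantly supervised training samples for the PCNN+Attention model, and manually corrected outputs are appended as new samples in the next round.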
A flowchart of the method for constructing a journal-literature knowledge graph with cyclic update iteration is shown in fig. 5. Local data and journal-literature data are unified, mapped, and organized into the text library, and the text-library data is preprocessed to form the sentence library. The data of the text library and sentence library are the input corpora of the entity extraction model and the relation recognition model; the subject words and relation words of the word lists accompany the corpora into the models, and the conceptual model is also introduced into the attribute extraction model. The outputs of the entity extraction and relation recognition models are, respectively, recognized entities and new relation phrases, and the output of the attribute extraction model is entity-attribute triples. After entity disambiguation, the results are calibrated and used to update the word-list database and the journal-literature knowledge graph. The new word lists, combined with new corpora, feed model training whose output updates the word lists and knowledge graph again; this process realizes update iteration, continuously correcting the models, word lists, and knowledge graph, improving accuracy and usability, and forming an organic, intelligent, cyclic update iteration mechanism.
As shown in fig. 6, the entity recognition update iteration model performs entity tagging on the corpus through the word list, and the tagged samples are input into the model for training. The trained model then performs entity recognition on the corpus, and the word list and knowledge graph are updated again according to the recognition results, forming the update iteration model of entity recognition.
A flowchart of the relation recognition update iteration model is likewise shown in fig. 7.
Step 40: perform corpus attribute extraction through the ontology structure defined in the concept design and by introducing templates.
Attribute extraction adopts a dependency syntactic analysis model; the process is as follows:
1. Combine the ontology structure and data attributes defined in the concept design to form entity-attribute templates, and traverse the sentence set for sentences containing an entity and a potentially related attribute.
2. Use a CRF algorithm for part-of-speech tagging. Entity words often have fixed parts of speech, but part-of-speech tagging struggles to judge the parts of speech of out-of-vocabulary words and to distinguish phrases from single words, and the tagging result strongly affects the subsequent syntactic analysis; part-of-speech tagging with a CRF therefore allows more entity features to be learned and facilitates update iteration.
3. Feed the tagging results into a syntactic analyzer for syntactic analysis. The analyzer adopts a dependency algorithm whose core is the arc-standard system: a classifier predicts the correct transition operation from features extracted from the current configuration, making computation very efficient.
4. Analyze the syntactic results by matching syntactic templates and extract attributes such as subject-predicate structures.
As shown in fig. 8, the attribute extraction model takes the conceptual model and the subject dictionary as input and extracts sentences from the sentence library as sample data. The samples are part-of-speech tagged by the CRF, dependency syntactic analysis is performed on the tagging results, the sentences are analyzed for grammatical features, and attribute extraction is performed through the grammar templates, forming entity-attribute triples that are stored in the knowledge graph. Update iteration of the attribute extraction model is realized mainly by cyclically training the CRF model to calibrate the accuracy of part-of-speech tagging.
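A drastically simplified sketch of the template-matching stage: a real system performs CRF tagging and arc-standard dependency parsing first, whereas here a template is just a POS-tag sequence whose first and last positions give the entity and the attribute value. The POS tags and the "age" template are hypothetical.

```python
def extract_attributes(tagged_sentence, attribute_templates):
    """Toy attribute extraction over a POS-tagged sentence. Each
    template is (pos_tag_sequence, property_name); the first matched
    word is taken as the entity, the last as the attribute value."""
    words = [w for w, _ in tagged_sentence]
    tags = [t for _, t in tagged_sentence]
    triples = []
    for pos_seq, prop in attribute_templates:
        n = len(pos_seq)
        for i in range(len(tags) - n + 1):
            if tags[i:i + n] == list(pos_seq):
                triples.append((words[i], prop, words[i + n - 1]))
    return triples

sentence = [("Zhang", "NR"), ("is", "VC"), ("42", "CD")]
triples = extract_attributes(sentence, [(("NR", "VC", "CD"), "age")])
```

The resulting (E, P, V) triples are then stored in the knowledge graph after disambiguation.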
Step 50: entity disambiguation and verification, i.e., verifying and disambiguating the results of entity recognition and relation extraction, and performing entity disambiguation on the results of attribute extraction.
Entity disambiguation mainly resolves the phenomena of polysemy (one word with multiple meanings) and synonymy (multiple words with one meaning) in natural language. Entity disambiguation proceeds in two steps: first, deep-learning disambiguation before entity recognition and relation recognition; second, matching disambiguation using the relation dictionary and the subject dictionary. The results of entity recognition, relation recognition, and attribute extraction are all disambiguated.
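The second, dictionary-matching step can be sketched as context-overlap scoring; the sense entries, entity ids, and domain words below are hypothetical illustrations, not entries from the disclosed dictionaries.

```python
def match_disambiguate(context_words, candidate_senses):
    """Matching disambiguation (second step): choose the candidate
    sense whose dictionary domain words overlap most with the
    context of the ambiguous mention."""
    def overlap(sense):
        return len(set(sense["domain_words"]) & set(context_words))
    return max(candidate_senses, key=overlap)["entity_id"]

senses = [
    {"entity_id": "apple_fruit", "domain_words": ["fruit", "tree", "eat"]},
    {"entity_id": "apple_inc", "domain_words": ["iphone", "company", "mac"]},
]
chosen = match_disambiguate(["company", "released", "iphone"], senses)
```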
Step 60: store the recognition results in the knowledge graph and update the subject dictionary, the relation dictionary, and the training model accordingly; recognize new material with the updated dictionaries and model, realizing cyclic update iteration and constructing the knowledge graph.
Although the embodiments of the present invention have been described above, the above descriptions are only for the convenience of understanding the present invention, and are not intended to limit the present invention. It will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.