CN109145071B - Automatic construction method and system for geophysical field knowledge graph - Google Patents
- Publication number
- CN109145071B
- Authority
- CN
- China
- Prior art keywords
- relation
- entities
- knowledge
- geophysical
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to an automatic construction method for a knowledge graph in the geophysical field, comprising the following steps. First, a concept knowledge base for the geophysical field is established. Second, a relation indicator word library is established for each relation in the geophysical field. A geophysical knowledge data set is then acquired and its text undergoes NLP processing; entities labeled as geophysical domain concepts are identified in the text, and candidate entity pairs are selected based on word distance and entity distance. A candidate relation indicator word set containing noise is generated from part-of-speech tags and position information, and the noise is filtered out using the relation indicator word library. Next, the relation indicator words predefined for each relation are converted into vectors and compared for similarity with the vector of the candidate relation indicator words, to find the relation whose indicator words have the highest similarity. Finally, the structured data are imported into the graph database Neo4j to build the geophysical domain knowledge graph.
Description
Technical Field
The invention particularly relates to an automatic construction method and system for a knowledge graph in the geophysical field.
Background
As theoretical research in geophysics deepens and its applications expand, the body of knowledge in the discipline keeps growing, but this knowledge is scattered across discrete sources and therefore lacks systematic organization. In addition, storing knowledge as linear text hinders the rapid circulation of geophysical knowledge between people and the outside world, and does not meet the demand for fast knowledge acquisition. With the advent of the big data era, the conflict has become increasingly prominent between the demand for rapid access to massive knowledge on the one hand and, on the other, the difficulty of retrieving information from discretely distributed data and the low comprehension efficiency of linearly structured text.
In order to solve the above problems, this patent proposes an automated method for constructing a knowledge graph, so as to establish a professional domain knowledge graph for the geophysical field. The input is unstructured text in the geophysical domain, and the output is structured knowledge data, commonly referred to as triples.
Many methods for automatically constructing a knowledge graph already exist, but most of them extract triples only for specified relations and are not suited to professional fields with many, more complex relations. Open triple extraction has been studied mostly for English, with comparatively little work on Chinese. Because the linguistic phenomena of Chinese and English differ greatly, English methods cannot be transplanted directly to Chinese, and their precision is low when they are.
Disclosure of Invention
The technical problem to be solved by the invention is the inadequacy of existing techniques for automatic open triple extraction. The invention provides a method and a system for automatically constructing a knowledge graph in the geophysical field by combining the structural characteristics of theoretical knowledge in the geophysical field, an established concept knowledge base and relation indicator word library, and a similarity matching algorithm between the generated candidate relation indicator word group and the indicator word group of each relation.
An automated construction method for a knowledge graph in the geophysical field comprises the following steps:
step 1: establishing a concept knowledge base containing professional vocabularies in the geophysical field;
step 2: establishing a knowledge data set containing unstructured text in the geophysical field;
step 3: acquiring, from the knowledge data set established in step 2, all relations contained in the knowledge data set and the relation indicator words corresponding to each relation, and establishing a relation indicator word library for the geophysical field;
step 4: performing NLP processing on the knowledge data set according to the concept knowledge base, the NLP processing comprising word segmentation, part-of-speech tagging and entity identification in the geophysical field;
step 5: identifying whether a relationship exists between any two entities identified in step 4, and if so, acquiring the relationship between the two entities;
step 6: extracting the nouns and verbs located between and after any two entities as candidate relation indicator words, the candidate relation indicator words being able to reflect the relationship between the two entities acquired in step 5;
step 7: denoising the candidate relation indicator words extracted in step 6 according to the relation indicator word library established in step 3, to obtain high-precision candidate relation indicator words;
step 8: converting the relation indicator word library and the high-precision candidate relation indicator words obtained in step 7 into vectors, calculating the similarity between them, selecting the relation corresponding to the relation indicator words with the highest similarity to the high-precision candidate relation indicator words as the relation between the two entities, and finally obtaining the structured knowledge data;
step 9: importing the structured knowledge data obtained in step 8 into a graph database to automatically build the geophysical domain knowledge graph.
Further, the knowledge data set in step 2 is established using the Scrapy crawler framework.
Further, in step 3, all the relationships included in the knowledge data set and the relationship indicators corresponding to the relationships are obtained by an exhaustion method.
Further, the method for identifying whether a relationship exists between any two entities in step 5 is as follows: when the word distance between the two entities does not exceed a preset maximum distance and the number of entities between them is less than a preset minimum distance, a relationship is judged to exist between the two entities.
Further, in step 8, the high-precision candidate relation indicator words are converted into vectors using the bag-of-words method.
Further, the structured knowledge data finally obtained in step 8 are triple data.
An automated construction system for a geophysical domain knowledge graph, comprising:
a vocabulary collection module: used for establishing a concept knowledge base containing professional vocabulary of the geophysical field;
a text collection module: used for establishing a knowledge data set containing unstructured text of the geophysical field;
a relationship acquisition module: used for acquiring, from the knowledge data set established in step 2, all relations contained in the knowledge data set and the relation indicator words corresponding to each relation, and establishing a relation indicator word library for the geophysical field;
an entity identification module: used for performing NLP processing on the knowledge data set according to the concept knowledge base, including word segmentation, part-of-speech tagging and entity identification in the geophysical field;
a relationship identification module: used for identifying whether a relationship exists between any two entities identified in step 4, and if so, acquiring the relationship between the two entities;
an indicator extraction module: used for extracting nouns or verbs located between and after any two entities as candidate relation indicator words, which can reflect the relationship between the two entities acquired in step 5;
an indicator denoising module: used for denoising the candidate relation indicator words extracted in step 6 according to the relation indicator word library established in step 3, to obtain high-precision candidate relation indicator words;
a relationship calculation module: used for converting the relation indicator word library and the high-precision candidate relation indicator words obtained in step 7 into vectors, calculating the similarity between the vectors, selecting the relation corresponding to the relation indicator words with the highest similarity to the high-precision candidate relation indicator words as the relation between the two entities, and finally obtaining structured knowledge data;
an automatic building module: used for importing the structured knowledge data obtained in step 8 into a graph database and automatically building the geophysical domain knowledge graph.
The resulting knowledge graph of professional theory can speed up the flow of knowledge data between people and between machines, and the structured geophysical knowledge data lay a foundation for machines to understand human knowledge through representation learning and to provide intelligent knowledge services (such as intelligent question answering and intelligent dialogue).
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flowchart of the automated construction method for a geophysical domain knowledge graph according to the present invention;
FIG. 2 shows the resulting geophysical domain knowledge graph of the present invention.
Detailed Description
For a more clear understanding of the technical features, objects and effects of the present invention, embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
An automatic construction method for a knowledge graph in the geophysical field comprises the following specific steps:
Step 1: establish a concept knowledge base of the geophysical field, containing the professional vocabulary of the field, and load it into the Language Technology Platform (LTP) of Harbin Institute of Technology.
Step 2: use the Scrapy crawler framework to establish a knowledge data set of the geophysical field, consisting of unstructured geophysical text. Using the concept knowledge base established in step 1 as supervision data, entities (concepts such as gravity field, gravity anomaly and Kirchhoff interface) can be identified in the knowledge data set, for example "gravity field of the earth" and "geophysics". The relationship between two entities (for example, "research branch"), however, cannot be identified this way. That relationship is contained in the knowledge data set itself (for example, in the sentence "the gravity field of the earth is one of the important branches of geophysics research") and is left to the automated method of step 3 to mine, rather than relying on manual effort.
Step 3: from the knowledge data set established in step 2, obtain by an exhaustive method all relations contained in the knowledge data set and the relation indicator words corresponding to each relation, and establish a relation indicator word library for the geophysical field. For example, the relationship between the entity "geophysics" and the entity "gravity field of the earth" is "research branch", and its relation indicator words may be "research" and "branch". Conversely, in the subsequent step 5, once the two entities have been identified in the unstructured text "the gravity field of the earth is one of the important branches of geophysics research", the relation indicator words "research" and "branch" are found in it, so the relationship between the two entities is determined to be "research branch", and the triple (geophysics, research branch, gravity field of the earth) is obtained. The purpose of the relation indicator word library is to provide the basis for inferring the relation from the relation indicator words found in the unstructured text in step 8.
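As a rough illustration of the relation indicator word library described in step 3, the sketch below stores it as a mapping from relation name to indicator words and derives an inverted index for looking up relations by indicator word. The names `RELATION_INDICATORS` and `build_inverted_index` are hypothetical, and the "measured by" entry is invented for the example; only the "research branch" entry comes from the text.

```python
from collections import defaultdict

# Hypothetical lexicon: relation name -> set of indicator words.
# "research branch" follows the example in the text; "measured by" is invented.
RELATION_INDICATORS = {
    "research branch": {"research", "branch"},
    "measured by": {"measure", "instrument"},
}


def build_inverted_index(lexicon):
    """Map each indicator word to the set of relations it can signal."""
    index = defaultdict(set)
    for relation, words in lexicon.items():
        for word in words:
            index[word].add(relation)
    return dict(index)


INDEX = build_inverted_index(RELATION_INDICATORS)
print(INDEX["branch"])
```

Keeping the forward mapping as the source of truth and deriving the inverted index from it is one possible design; the patent does not say how the library is stored.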
Step 4: perform NLP processing on the knowledge data set using the Language Technology Platform (LTP) of Harbin Institute of Technology, loaded with the concept knowledge base, to carry out word segmentation, part-of-speech tagging and entity identification in the geophysical field.
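The patent performs this step with LTP, whose API is not reproduced here. As a stand-in that only illustrates dictionary-driven entity identification, the following sketch does a greedy longest-match lookup of concept phrases over an already segmented token list. The function name `tag_entities`, the `max_len` parameter and the sample concept set are assumptions made for the example.

```python
def tag_entities(tokens, concepts, max_len=6):
    """Greedy longest-match lookup of concept phrases in a token list.

    Returns (start, end, phrase) spans, with `end` exclusive.
    This is a simplified stand-in, not the LTP pipeline used in the patent.
    """
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest window first so multi-word concepts win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in concepts:
                match = (i, i + n, phrase)
                break
        if match:
            spans.append(match)
            i = match[1]
        else:
            i += 1
    return spans


# Hypothetical concept knowledge base and example sentence from the text.
CONCEPTS = {"gravity field of the earth", "geophysics", "gravity anomaly"}
TOKENS = ("the gravity field of the earth is one of the important "
          "branches of geophysics research").split()

print(tag_entities(TOKENS, CONCEPTS))
```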
Step 5: judge whether a relationship exists between any two entities identified in step 4. A relationship is judged to exist when the word distance between the two entities does not exceed a preset maxDistance and the number of entities between them is less than a preset maxEntityDistance, because the shorter the word distance between two entities and the fewer the entities between them, the greater the probability that a relationship exists.
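The distance test in step 5 might be sketched as follows. The thresholds correspond to maxDistance and maxEntityDistance from the text, but their default values here (10 and 2) are placeholders, and the span format `(start, end, text)` with an exclusive `end` is an assumption. "Number of entities" is read as the count of entities lying between the two candidates.

```python
from itertools import combinations


def candidate_pairs(entity_spans, max_distance=10, max_entity_distance=2):
    """Select entity pairs that are close enough to plausibly be related.

    entity_spans: list of (start, end, text) sorted by start, end exclusive.
    The default thresholds are placeholders, not values from the patent.
    """
    pairs = []
    for (i, first), (j, second) in combinations(enumerate(entity_spans), 2):
        word_distance = second[0] - first[1]   # tokens between the two entities
        entities_between = j - i - 1           # entities lying between them
        if (word_distance <= max_distance
                and entities_between < max_entity_distance):
            pairs.append((first, second))
    return pairs


# Hypothetical spans: A and B are close, C is far from both.
SPANS = [(1, 6, "A"), (13, 14, "B"), (40, 41, "C")]
print(candidate_pairs(SPANS))
```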
Step 6: extract the nouns and verbs located between and after each entity pair as candidate relation indicator words that can reflect the relationship between the two entities identified in step 5. About 70% of relation indicator words are located between the two entities and 10% to 20% after them; the small remainder either appear before the first entity or do not exist. Relation indicator words mostly take the form of nouns or verbs.
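A minimal sketch of this extraction, assuming the text has already been segmented and tagged so that each token is a `(word, pos)` pair, with `"n"` marking nouns and `"v"` marking verbs (a simplified tag set chosen for the example). The `window_after` parameter, which bounds how far past the second entity to look, is also an assumption; the patent does not state a limit.

```python
def candidate_indicators(tagged, first, second, window_after=5):
    """Collect nouns and verbs between two entities and just after the second.

    tagged: list of (word, pos) pairs; first/second: (start, end) token spans
    with end exclusive. Tags "n" and "v" are an assumed simplified tag set.
    """
    between = tagged[first[1]:second[0]]
    after = tagged[second[1]:second[1] + window_after]
    return [word for word, pos in between + after if pos in {"n", "v"}]


# Hypothetical tagged sentence with each entity merged into a single token.
TAGGED = [
    ("earth_gravity_field", "ent"), ("is", "v"), ("one", "m"), ("of", "p"),
    ("important", "a"), ("branch", "n"), ("of", "p"),
    ("geophysics", "ent"), ("research", "n"),
]

print(candidate_indicators(TAGGED, (0, 1), (7, 8)))
```

The result still contains words such as "is" that carry no relation information, which is why the candidate set is described as noisy.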
Step 7: denoise the candidate relation indicator words extracted in step 6 according to the relation indicator word library established in step 3, to obtain high-precision candidate relation indicator words.
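One simple reading of this denoising step is a vocabulary filter: a candidate is kept only if it occurs somewhere in the relation indicator word library. The sketch below assumes the library is a mapping from relation name to a set of indicator words; the function name `denoise` and the sample data are hypothetical.

```python
def denoise(candidates, lexicon):
    """Keep only candidates that appear in the relation indicator word library.

    lexicon: mapping of relation name -> set of indicator words (assumed layout).
    """
    vocabulary = set().union(*lexicon.values())
    return [word for word in candidates if word in vocabulary]


# Hypothetical library and noisy candidate list.
LEXICON = {"research branch": {"research", "branch"}}
print(denoise(["is", "branch", "research"], LEXICON))
```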
Step 8: use the bag-of-words method to convert the relation indicator words of each relation in the library and the high-precision candidate relation indicator words obtained in step 7 into vectors, calculate the similarity between the vectors, and select the relation whose indicator words have the highest similarity to the high-precision candidate relation indicator words as the relation between the two entities, finally obtaining the structured knowledge data, namely the triple data.
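The matching in step 8 can be sketched with plain bag-of-words count vectors. The patent does not name the similarity measure; cosine similarity is used here as an assumption, and the function names and the second lexicon entry are hypothetical.

```python
import math
from collections import Counter


def bow_vector(words, vocabulary):
    """Bag-of-words count vector of `words` over an ordered vocabulary."""
    counts = Counter(words)
    return [counts[term] for term in vocabulary]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_relation(candidates, lexicon):
    """Return the relation whose indicator words best match the candidates."""
    vocabulary = sorted(set().union(*lexicon.values()))
    candidate_vec = bow_vector(candidates, vocabulary)
    scores = {
        relation: cosine(candidate_vec, bow_vector(words, vocabulary))
        for relation, words in lexicon.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical library; only "research branch" comes from the text.
LEXICON = {
    "research branch": {"research", "branch"},
    "measured by": {"measure", "instrument"},
}
relation, scores = best_relation(["branch", "research"], LEXICON)
triple = ("geophysics", relation, "gravity field of the earth")
print(triple)
```

If no candidate overlaps any relation, every score is 0.0 and `max` simply returns the first relation, so a real pipeline would probably want a minimum-score threshold before emitting a triple.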
Step 9: import the triple data obtained in step 8 into the graph database Neo4j to automatically build the geophysical domain knowledge graph.
The knowledge graph is visualized by importing the resulting structured triple data into the graph database Neo4j, as shown in FIG. 2.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (7)
1. An automatic construction method for a knowledge graph in the geophysical field is characterized by comprising the following steps:
step 1: establishing a concept knowledge base containing professional vocabularies in the geophysical field;
step 2: establishing a knowledge data set containing unstructured text in the geophysical field;
step 3: acquiring, from the knowledge data set established in step 2, all relations contained in the knowledge data set and the relation indicator words corresponding to each relation, and establishing a relation indicator word library for the geophysical field;
step 4: performing NLP processing on the knowledge data set according to the concept knowledge base, the NLP processing comprising word segmentation, part-of-speech tagging and entity identification in the geophysical field;
step 5: identifying whether a relationship exists between any two entities identified in step 4, and if so, acquiring the relationship between the two entities;
step 6: extracting the nouns or verbs located between and after any two entities as candidate relation indicator words, the candidate relation indicator words being able to reflect the relationship between the two entities acquired in step 5;
step 7: denoising the candidate relation indicator words extracted in step 6 according to the relation indicator word library established in step 3, to obtain high-precision candidate relation indicator words;
step 8: converting the relation indicator word library and the high-precision candidate relation indicator words obtained in step 7 into vectors, calculating the similarity between them, selecting the relation corresponding to the relation indicator words with the highest similarity to the high-precision candidate relation indicator words as the relation between the two entities, and finally obtaining the structured knowledge data;
step 9: importing the structured knowledge data obtained in step 8 into a graph database to automatically build the geophysical domain knowledge graph.
2. The automated construction method for the geophysical domain knowledge graph according to claim 1, wherein the knowledge data set in step 2 is established using the Scrapy crawler framework.
3. The automated construction method for the geophysical field knowledge graph according to claim 1, wherein an exhaustion method is adopted in step 3 to obtain all the relationships contained in the knowledge data set and the relationship indicators corresponding to the relationships.
4. The automated construction method for the geophysical domain knowledge graph according to claim 1, wherein the method for identifying whether a relationship exists between any two entities in step 5 is as follows: when the word distance between the two entities does not exceed a preset maximum distance and the number of entities between them is less than a preset minimum distance, a relationship is judged to exist between the two entities.
5. The automated construction method for the knowledge graph of the geophysical field according to claim 1, wherein the high-precision candidate relational indicator is converted into a vector by a Bag-of-words method in step 8.
6. The automated construction method for the geophysical domain knowledge graph according to claim 1, wherein the structured knowledge data finally obtained in step 8 is triple data.
7. An automated construction system for a geophysical domain knowledge graph, comprising:
a vocabulary collection module: used for establishing a concept knowledge base containing professional vocabulary of the geophysical field;
a text collection module: used for establishing a knowledge data set containing unstructured text of the geophysical field;
a relationship acquisition module: used for acquiring, from the knowledge data set established by the text collection module, all relations contained in the knowledge data set and the relation indicator words corresponding to each relation, and establishing a relation indicator word library for the geophysical field;
an entity identification module: used for performing NLP processing on the knowledge data set according to the concept knowledge base, including word segmentation, part-of-speech tagging and entity identification in the geophysical field;
a relationship identification module: used for identifying whether a relationship exists between any two entities identified by the entity identification module, and if so, acquiring the relationship between the two entities;
an indicator extraction module: used for extracting nouns or verbs located between and after any two entities as candidate relation indicator words, which can reflect the relationship between the two entities acquired by the relationship identification module;
an indicator denoising module: used for denoising the candidate relation indicator words extracted by the indicator extraction module according to the relation indicator word library established by the relationship acquisition module, to obtain high-precision candidate relation indicator words;
a relationship calculation module: used for converting the relation indicator word library and the high-precision candidate relation indicator words obtained by the indicator denoising module into vectors, calculating the similarity between the vectors, selecting the relation corresponding to the relation indicator words with the highest similarity to the high-precision candidate relation indicator words as the relation between the two entities, and finally obtaining structured knowledge data;
an automatic building module: used for importing the structured knowledge data obtained by the relationship calculation module into a graph database and automatically building the geophysical domain knowledge graph.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810883507.4A CN109145071B (en) | 2018-08-06 | 2018-08-06 | Automatic construction method and system for geophysical field knowledge graph |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810883507.4A CN109145071B (en) | 2018-08-06 | 2018-08-06 | Automatic construction method and system for geophysical field knowledge graph |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109145071A (en) | 2019-01-04 |
CN109145071B (en) | 2021-08-27 |
Family
ID=64791709
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810883507.4A Active CN109145071B (en) | 2018-08-06 | 2018-08-06 | Automatic construction method and system for geophysical field knowledge graph |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109145071B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109933789B (en) * | 2019-02-27 | 2021-04-13 | 中国地质大学(武汉) | Neural network-based judicial domain relation extraction method and system |
CN110222196A (en) * | 2019-06-18 | 2019-09-10 | 卓尔智联(武汉)研究院有限公司 | Fishery knowledge mapping construction device, method and computer readable storage medium |
CN110222198A (en) * | 2019-06-18 | 2019-09-10 | 卓尔智联(武汉)研究院有限公司 | Non-ferrous metal industry knowledge mapping construction method, electronic device and storage medium |
CN112559765B (en) * | 2020-12-11 | 2023-06-16 | 中电科大数据研究院有限公司 | Semantic integration method for multi-source heterogeneous database |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105760425A (en) * | 2016-01-17 | 2016-07-13 | 曲阜师范大学 | Ontology data storage method |
CN105760495A (en) * | 2016-02-17 | 2016-07-13 | 扬州大学 | Method for carrying out exploratory search for bug problem based on knowledge map |
EP3051435A1 (en) * | 2013-09-29 | 2016-08-03 | Peking University Founder Group Co., Ltd | Method and system for obtaining a knowledge point implicit relationship |
CN106844658A (en) * | 2017-01-23 | 2017-06-13 | 中山大学 | A kind of Chinese text knowledge mapping method for auto constructing and system |
EP3051434A4 (en) * | 2013-09-29 | 2017-06-14 | Peking University Founder Group Co., Ltd | Method and system for measurement of knowledge point relationship strength |
CN107609152A (en) * | 2017-09-22 | 2018-01-19 | 百度在线网络技术(北京)有限公司 | Method and apparatus for expanding query formula |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10019538B2 (en) * | 2015-04-01 | 2018-07-10 | Tata Consultancy Services Limited | Knowledge representation on action graph database |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3051435A1 (en) * | 2013-09-29 | 2016-08-03 | Peking University Founder Group Co., Ltd | Method and system for obtaining a knowledge point implicit relationship |
EP3051434A4 (en) * | 2013-09-29 | 2017-06-14 | Peking University Founder Group Co., Ltd | Method and system for measurement of knowledge point relationship strength |
CN105760425A (en) * | 2016-01-17 | 2016-07-13 | 曲阜师范大学 | Ontology data storage method |
CN105760495A (en) * | 2016-02-17 | 2016-07-13 | 扬州大学 | Method for carrying out exploratory search for bug problem based on knowledge map |
CN106844658A (en) * | 2017-01-23 | 2017-06-13 | 中山大学 | A kind of Chinese text knowledge mapping method for auto constructing and system |
CN107609152A (en) * | 2017-09-22 | 2018-01-19 | 百度在线网络技术(北京)有限公司 | Method and apparatus for expanding query formula |
Also Published As
Publication number | Publication date |
---|---|
CN109145071A (en) | 2019-01-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109145071B (en) | Automatic construction method and system for geophysical field knowledge graph | |
CN107679039B (en) | Method and device for determining statement intention | |
CN108629414B (en) | Deep hash learning method and device | |
CN112199938B (en) | Science and technology project similarity analysis method, computer equipment and storage medium | |
CN113553412A (en) | Question and answer processing method and device, electronic equipment and storage medium | |
CN115238029A (en) | Construction method and device of power failure knowledge graph | |
CN111241209A (en) | Method and apparatus for generating information | |
CN112069324B (en) | Classification label adding method, device, equipment and storage medium | |
CN109614612A (en) | A kind of Chinese text error correction method based on seq2seq+attention | |
CN113505583A (en) | Sentiment reason clause pair extraction method based on semantic decision diagram neural network | |
CN113763937A (en) | Method, device and equipment for generating voice processing model and storage medium | |
CN116028798A (en) | Water damage early warning data processing method, device, computer equipment and storage medium | |
CN114120166B (en) | Video question-answering method and device, electronic equipment and storage medium | |
CN112599211B (en) | Medical entity relationship extraction method and device | |
CN110580337A (en) | professional entity disambiguation implementation method based on entity similarity calculation | |
CN112818072A (en) | Tourism knowledge map updating method, system, equipment and storage medium | |
CN109754159B (en) | Method and system for extracting information of power grid operation log | |
CN111930959A (en) | Method and device for generating text by using map knowledge | |
CN111814457A (en) | Power grid engineering contract text generation method | |
CN117494806B (en) | Relation extraction method, system and medium based on knowledge graph and large language model | |
CN112837148B (en) | Risk logic relationship quantitative analysis method integrating domain knowledge | |
CN112819205B (en) | Method, device and system for predicting working hours | |
CN116227598B (en) | Event prediction method, device and medium based on dual-stage attention mechanism | |
CN118093785B (en) | Distributed collaboration-oriented avionic fault knowledge fusion method | |
CN113536751B (en) | Processing method and device of form data, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||