CN115757831A

CN115757831A - Method and device for semi-automatically constructing domain knowledge graph

Info

Publication number: CN115757831A
Application number: CN202211502425.3A
Authority: CN
Inventors: 孙羽菲; 赵晓群; 李正丹; 龙肖明; 廖添胤; 朱函蝶
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2022-11-28
Filing date: 2022-11-28
Publication date: 2023-03-07

Abstract

The invention provides a method and a device for semi-automatically constructing a domain knowledge graph. The method comprises the following steps: labeling the original text documents of books to obtain entity-relationship triple data in the domain, organizing the triple data into structured data, and constructing an accurately labeled knowledge graph from the structured data; generating descriptions of subjects and objects with a seq2seq model according to the entity relationships and the context content of the triple data, and completing the entity description information in the accurately labeled knowledge graph; and performing entity and relationship prediction with a Bert model on the accurately labeled knowledge graph with the completed entity description information, and constructing the domain knowledge graph in combination with the manually labeled entity relationship information. The method and the device use manual labeling to improve the accuracy of the data, while the semi-automatic construction algorithm reduces labor cost and mines latent relationships among the data, so that a more comprehensive domain knowledge graph is constructed.

Description

Method and device for semi-automatically constructing domain knowledge graph

Technical Field

The invention relates to the fields of knowledge graph, automatic construction and natural language processing.

Background

With the advent of the big data era and the development of computing, the characteristics of big data are appearing in more and more fields. Data volume is growing exponentially, but the relevance between data items is weak, which makes the data difficult to process. A knowledge graph describes the entities that exist in the real world and the relationships among them; its basic unit is the entity-relationship-entity triple, which represents two entities and the relationship between them. As data volume grows, so does the demand for understanding the data, and the way a knowledge graph organizes information can effectively improve people's ability to acquire knowledge from large amounts of information.

In order to solve the problem of weak relevance of a large amount of data, a method and a device for semi-automatically constructing a domain knowledge graph are provided, and the method and the device relate to multiple fields of knowledge graphs, natural language processing and the like. According to the method and the device for semi-automatic construction, the accuracy of the data is improved by utilizing manual marking, and meanwhile, the potential relation among the data can be mined while the labor cost is reduced by using a semi-automatic construction algorithm, so that a more comprehensive domain knowledge graph is constructed.

Disclosure of Invention

The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.

Therefore, the invention aims to provide a method and a device for semi-automatically constructing a domain knowledge graph, which are used for realizing the semi-automatic construction of the domain knowledge graph, realizing the interconnection and intercommunication of all knowledge under the condition of reducing the participation of manpower and constructing a relatively comprehensive domain knowledge graph.

In order to achieve the above purpose, an embodiment of the present invention provides a method for semi-automatically constructing a domain knowledge graph, including the following steps:

labeling the original text documents of books to obtain entity-relationship triple data in the domain, organizing the triple data into structured data, and constructing an accurately labeled knowledge graph from the structured data;

generating descriptions of subjects and objects based on a seq2seq model according to the entity relationships and the context content of the triple data, and completing the entity description information in the accurately labeled knowledge graph;

and performing entity and relationship prediction through a Bert model on the accurately labeled knowledge graph with the completed entity description information, and constructing a domain knowledge graph in combination with the manually labeled entity relationship information.

In addition, the method and the device for semi-automatically constructing the domain knowledge graph according to the embodiment of the invention can also have the following additional technical characteristics:

Further, in an embodiment of the present invention, labeling the original text documents of books to obtain entity-relationship triple data in the domain, organizing the triple data into structured data, and constructing an accurately labeled knowledge graph from the structured data includes:

performing triple labeling on the book original text document, wherein the labeling form is <S, R, O>, S is the Subject of the triple, R is the Relationship (the predicate), and O is the Object of the triple; and organizing the book original text document into structured data according to the labeled content, and constructing the accurately labeled knowledge graph from the structured data.

Further, in an embodiment of the present invention, generating descriptions of subjects and objects based on a seq2seq model according to the entity relationships and the context content of the triple data, and completing the entity description information in the accurately labeled knowledge graph includes:

using a seq2seq model, splicing the context content with the associated entities in the triple data and inputting the result into an encoder, and having a decoder decode the data encoded by the encoder according to a decoding rule to generate description information of the entities, wherein the seq2seq model comprises the encoder and the decoder.

Further, in an embodiment of the present invention, performing entity and relationship prediction through a Bert model on the accurately labeled knowledge graph with the completed entity description information, and constructing a domain knowledge graph in combination with the manually labeled entity relationship information includes:

pre-training a Bert model on the book original text document, with retrieval information supplemented by a retrieval-augmented TF-IDF method, and fine-tuning the pre-trained model on the downstream task with the triple data, so that the model can automatically predict triple data in articles of the same domain, or unlabeled triple data in the book original text document, to form structured data; and updating the labeled knowledge graph with Cypher according to the structured data and constructing the domain knowledge graph.

In order to achieve the above object, a second aspect of the present invention provides a domain knowledge graph labeling apparatus, including:

a labeling module, used for labeling the original text documents of books to obtain entity-relationship triple data in the domain, organizing the triple data into structured data, and constructing an accurately labeled knowledge graph from the structured data;

a completion module, used for generating descriptions of subjects and objects based on a seq2seq model according to the entity relationships and the context content of the triple data, and completing the entity description information in the accurately labeled knowledge graph;

and a construction module, used for performing entity and relationship prediction through a Bert model on the accurately labeled knowledge graph with the completed entity description information, and constructing the domain knowledge graph in combination with the manually labeled entity relationship information.

Further, in an embodiment of the present invention, the labeling module is further configured to:

visually labeling triples in the book original text documents, labeling the entities and the relationships between them by drawing connections between entities, forming triple data and generating structured data, and constructing an accurately labeled knowledge graph from the structured data and the manually labeled data.

Further, in an embodiment of the present invention, the labeling module further includes:

and the counting unit is used for counting the number of the entities and the triples marked in the book original document.

Further, in an embodiment of the present invention, the statistical unit is further configured to:

counting, on the basis of the domain knowledge graph, the valid triple data in the book original text document according to the labeled data and the data generated by seq2seq model prediction.

To achieve the above object, a third aspect of the present invention provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for semi-automatically constructing the domain knowledge graph as described above.

To achieve the above object, a fourth aspect of the present invention provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for semi-automatically constructing the domain knowledge graph as described above.

The method for semi-automatically constructing the domain knowledge graph extracts data from the original text of the domain, structures the data, and then constructs a knowledge graph from the structured data. The entity description information missing from the knowledge graph is supplemented with the seq2seq model, and a relatively comprehensive knowledge graph of the domain is then predicted and supplemented from the labeled content and the pre-trained model; the knowledge graph is stored in a Neo4j graph database. Constructing the knowledge graph allows the knowledge in the domain to be interconnected. In addition, the patent provides a domain knowledge graph triple labeling device: entities in the book such as people, places, events, officers, organizations, documents, time, and knowledge points are labeled together with the relationship information between them, and the domain knowledge graph is constructed from the labeled entities and the relationships between them. The method and the device use manual labeling to improve the accuracy of the data, while the semi-automatic construction algorithm reduces labor cost and mines latent relationships among the data, so that a more comprehensive domain knowledge graph is constructed.

Drawings

The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

Fig. 1 is a schematic flow chart of a semi-automatic domain knowledge graph construction method according to an embodiment of the present invention.

Fig. 2 is a detailed flowchart of the labeling tool according to the embodiment of the present invention.

Fig. 3 is a presentation interface of a labeling tool according to an embodiment of the present invention.

Fig. 4 is a flow chart of pre-training according to an embodiment of the present invention.

Fig. 5 is a schematic structural diagram of a domain knowledge graph semi-automatic construction apparatus according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer throughout to the same or similar elements or to elements having the same or similar function. The embodiments described below with reference to the drawings are illustrative, are intended to explain the invention, and are not to be construed as limiting the invention.

The method and apparatus for semi-automated domain knowledge graph construction according to embodiments of the present invention are described below with reference to the accompanying drawings.

Fig. 1 is a schematic flowchart of a method for semi-automatically constructing a domain knowledge graph according to an embodiment of the present invention.

As shown in Fig. 1, the method for semi-automatically constructing the domain knowledge graph includes the following steps:

S101: labeling the original text documents of books to obtain entity-relationship triple data in the domain, organizing the triple data into structured data, and constructing an accurately labeled knowledge graph from the structured data.

Further, in an embodiment of the present invention, labeling the book original text documents to obtain entity-relationship triple data in the domain and organizing the triple data into structured data includes:

performing triple labeling on the book original text document, wherein the labeling form is <S, R, O>, S is the Subject of the triple, R is the Relationship (the predicate), and O is the Object of the triple; and organizing the book original text document into structured data according to the labeled content.

The apparatus organizes the labeled data into a fixed format of the form {"id", "type", "name", "altName", "hitpos": {"bookName", "pageStart", "pageEnd", "hitposStart", "hitposEnd"}, "relationship", "object": {"id", "type", "name", "altName", "hitpos": {"bookName", "pageStart", "pageEnd", "hitposStart", "hitposEnd"}}}. For example, the subject has id 1, type "character", name "Confucius", and altName "Zhongni, Kong Qiu"; its hitpos gives pageStart 271, pageEnd 271, hitposStart 44, and hitposEnd 46; the relationship is "student"; and the object has id 2, type "character", and name "Zengzi", with its own altName and hitpos fields. Here, name is the name of the labeled entity, altName is its alternative names, bookName is the name of the book in which the labeled text appears, pageStart and pageEnd are the start and end pages of the labeled text, and hitposStart and hitposEnd are the start and end positions of the labeled text.
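The record layout described above can be sketched as a small set of Python dataclasses. This is only an illustration under stated assumptions: the class names (`HitPos`, `Entity`, `TripleRecord`), the use of integers for ids and positions, and the book name "ExampleBook" are hypothetical and do not come from the patent; only the field names follow the format given in the text.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class HitPos:
    """Location of the labeled text (field names follow the described format)."""
    bookName: str
    pageStart: int
    pageEnd: int
    hitposStart: int
    hitposEnd: int


@dataclass
class Entity:
    """A labeled entity; integer ids are an assumption of this sketch."""
    id: int
    type: str
    name: str
    altName: str
    hitpos: HitPos


@dataclass
class TripleRecord:
    """One <S, R, O> triple, serialized with the subject fields at top level."""
    subject: Entity
    relationship: str
    object: Entity

    def to_json(self) -> str:
        record = asdict(self.subject)          # subject fields at the top level
        record["relationship"] = self.relationship
        record["object"] = asdict(self.object)  # object nested under "object"
        return json.dumps(record, ensure_ascii=False)


# "ExampleBook" is a placeholder book name, not a value from the source.
position = HitPos("ExampleBook", 271, 271, 44, 46)
subject = Entity(1, "character", "Confucius", "Zhongni, Kong Qiu", position)
obj = Entity(2, "character", "Zengzi", "", position)
print(TripleRecord(subject, "student", obj).to_json())
```

Nesting the object under its own key keeps one triple per record, which makes each record directly convertible into a pair of graph nodes and one edge.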

The labeling flow of the labeling device is shown in Fig. 2. First, the entities in a document are labeled; then the relationships among the entities are labeled by dragging a connection between them; finally, counted entities with the same name and the same type are selected for entity and relationship fusion, which removes duplicate data. The system presentation interface is shown in Fig. 3.
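The fusion step can be illustrated with a minimal sketch. It assumes that two entities are duplicates exactly when their name and type match, as the text states, and that the first labeled occurrence is kept as the canonical one; the function name `fuse_entities` and the dictionary layout are hypothetical.

```python
def fuse_entities(entities, triples):
    """Merge entities sharing (name, type) and drop duplicate triples.

    entities: list of dicts with "id", "name", "type"
    triples:  list of (subject_id, relationship, object_id) tuples
    """
    canonical = {}  # (name, type) -> id of the first entity seen
    id_map = {}     # every entity id -> canonical id
    for entity in entities:
        key = (entity["name"], entity["type"])
        canonical.setdefault(key, entity["id"])
        id_map[entity["id"]] = canonical[key]

    # Keep only the canonical entity of each group.
    fused_entities = [e for e in entities if id_map[e["id"]] == e["id"]]

    # Rewrite triples onto canonical ids and remove repeats, keeping order.
    seen = set()
    fused_triples = []
    for subject_id, relationship, object_id in triples:
        triple = (id_map[subject_id], relationship, id_map[object_id])
        if triple not in seen:
            seen.add(triple)
            fused_triples.append(triple)
    return fused_entities, fused_triples


entities = [
    {"id": 1, "name": "Confucius", "type": "character"},
    {"id": 2, "name": "Zengzi", "type": "character"},
    {"id": 3, "name": "Confucius", "type": "character"},  # duplicate of id 1
]
triples = [(1, "student", 2), (3, "student", 2)]
print(fuse_entities(entities, triples))
```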

And constructing a basic knowledge graph according to the labeled structured data, and performing prediction supplementation on the knowledge graph by using the following steps.

S102: generating descriptions of subjects and objects based on a seq2seq model according to the entity relationships and the context content of the triple data, and completing the entity description information in the accurately labeled knowledge graph.

Further, generating descriptions of subjects and objects based on a seq2seq model according to the entity relationships and the context content of the triple data, and completing the entity description information in the accurately labeled knowledge graph includes:

using a seq2seq model, splicing the context content with the associated entities in the triple data and inputting the result into an encoder, and having a decoder decode the data encoded by the encoder according to a decoding rule to generate description information of the entities, wherein the seq2seq model comprises the encoder and the decoder.
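How the context and the associated entities are spliced is not specified in the text. The following sketch shows one plausible arrangement, assuming a plain-text separator token and a fixed ordering of subject, relationship, object, and context; the function name `build_encoder_input` and the `[SEP]` separator are assumptions of this example.

```python
def build_encoder_input(subject, relationship, obj, context, sep=" [SEP] "):
    """Concatenate the triple's entities and its context into one input string.

    The ordering and the separator are illustrative choices, not the
    patent's stated encoding.
    """
    return sep.join([subject, relationship, obj, context])


text = build_encoder_input(
    "Confucius", "student", "Zengzi",
    "Zengzi studied under Confucius.",
)
print(text)
```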

Specifically, the basic idea of a common seq2seq (Sequence to Sequence) model is that an encoder analyzes the input sequence and a decoder generates the output sequence. The model comprises an Encoder and a Decoder, that is, an Encoder-Decoder structure. The context content and the associated entities in the triples are spliced and input into the Encoder, and the Decoder decodes the data encoded by the Encoder according to a decoding rule and outputs description information of the associated entities. Retrieval uses the TF-IDF method, whose main idea is that if a word or phrase appears frequently in one article (its TF is high) and rarely appears in other articles, the word or phrase is considered to have good category-discriminating ability. TF-IDF is used to determine the article that best matches the retrieval content, that article is taken as the retrieval result, and the retrieval result is used in pre-training the seq2seq model. In the fine-tuning stage, the input information is encoded and then decoded in combination with the retrieval result, so that the retrieval result supports better semantic understanding of the input. This realizes generation from professional-domain text to plain-language explanation, so that terms in the professional domain can be explained more conveniently.
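The TF-IDF retrieval idea can be sketched in plain Python as follows. This is a minimal illustration, not the patent's implementation: it assumes whitespace tokenization (Chinese text would need word segmentation first), non-empty documents, a smoothed IDF of log(N/df) + 1, and cosine similarity for ranking. The function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    idf = {t: math.log(len(docs) / df) + 1.0 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in counts.items()})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs):
    """Return the index of the best-matching document and all scores."""
    vectors, idf = tf_idf_vectors(docs)
    tokens = query.lower().split()
    counts = Counter(tokens)
    # Terms unseen in the corpus get weight 0 and cannot match anything.
    q_vec = {t: (c / len(tokens)) * idf.get(t, 0.0) for t, c in counts.items()}
    scores = [cosine(q_vec, vec) for vec in vectors]
    best = max(range(len(docs)), key=scores.__getitem__)
    return best, scores


docs = [
    "confucius taught zengzi the rites",
    "the river flows east to the sea",
    "zengzi wrote about filial piety",
]
best, scores = retrieve("confucius rites", docs)
print(best, scores)
```

A term that occurs in every document receives the lowest IDF, so it contributes little to the ranking, which matches the category-discriminating intuition described above.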

A pre-trained language model is characterized by big data, a large model, and large computing power, and on this basis it shows better results, with substantial improvements in many areas of natural language processing. A pre-trained language model with the seq2seq structure is suited to generation tasks. The Encoder part consists of five parts: character embedding, numerical transformation, layer-by-layer encoding, LN & dropout, and storage. The Decoder continues generating after receiving the input; at each time step, the candidate with the highest conditional probability is selected for inference generation. The Decoder part consists of six parts: character embedding, numerical transformation, layer-by-layer encoding, LN & dropout, storage, and acquisition of the generated characters. The pre-training flow is shown in Fig. 4.
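Selecting the most probable candidate at each time step corresponds to greedy decoding, which can be sketched as below. The sketch assumes the decoder is exposed as a function that maps the tokens generated so far to a probability table for the next token; `toy_model` and its table are hypothetical stand-ins for a trained decoder.

```python
def greedy_decode(next_token_probs, start_token, end_token, max_len=20):
    """Generate a sequence by always taking the most probable next token."""
    sequence = [start_token]
    for _ in range(max_len):
        probs = next_token_probs(sequence)  # dict: token -> probability
        token = max(probs, key=probs.get)   # highest-probability candidate
        sequence.append(token)
        if token == end_token:
            break
    return sequence


# Toy stand-in for a decoder: the next-token table depends only on the last token.
TOY_TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"teacher": 0.7, "</s>": 0.3},
    "teacher": {"</s>": 0.9, "a": 0.1},
}


def toy_model(prefix):
    return TOY_TABLE[prefix[-1]]


print(greedy_decode(toy_model, "<s>", "</s>"))
```

Greedy decoding is cheap but can miss sequences whose overall probability is higher; beam search is a common alternative when that matters.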

S103: performing entity and relationship prediction through a Bert model on the accurately labeled knowledge graph with the completed entity description information, and constructing a domain knowledge graph in combination with the manually labeled entity relationship information.

Further, performing entity and relationship prediction through a Bert model on the accurately labeled knowledge graph with the completed entity description information, and constructing a domain knowledge graph in combination with the manually labeled entity relationship information includes the following steps:

According to the book original text documents, a Bert (Bidirectional Encoder Representations from Transformers) model is pre-trained, with retrieval information supplemented by the retrieval-augmented TF-IDF method. The model learns the semantic information in sentences by randomly masking words and predicting them from context, or by splicing two sentences together and having the network judge whether they are adjacent sentences in the original text. The pre-trained model is then fine-tuned on the downstream task with the triple data, so that the model can automatically predict triple data in articles of the same domain, or triple data that have not been labeled in the book original text document, and form structured data; the labeled knowledge graph is updated with Cypher according to the structured data, and the domain knowledge graph is constructed.
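The random masking used in pre-training can be illustrated with a simplified sketch. It assumes a 15% masking rate and always substitutes a `[MASK]` token; the original BERT recipe additionally replaces some selected tokens with random tokens or leaves them unchanged, which is omitted here. The function name `mask_tokens` is hypothetical.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Randomly mask tokens and return (masked_tokens, labels).

    labels[i] holds the original token where a mask was applied and None
    elsewhere, so the loss is computed only on masked positions.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)   # the model must predict this token
        else:
            masked.append(token)
            labels.append(None)    # position ignored by the loss
    return masked, labels


tokens = "zengzi was a student of confucius".split()
print(mask_tokens(tokens, mask_prob=0.3, seed=0))
```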

The method for semi-automatically constructing the domain knowledge graph extracts data from the original text of the domain, structures the data, and then constructs a knowledge graph from the structured data. The entity descriptions missing from the knowledge graph are supplemented with the seq2seq model, and a relatively comprehensive knowledge graph of the domain is then predicted and supplemented from the labeled content and the pre-trained model; the knowledge graph is stored in a Neo4j graph database. Constructing the knowledge graph allows the knowledge in the domain to be interconnected. In addition, the patent provides a domain knowledge graph triple labeling device: entities in the book such as people, places, events, officers, organizations, documents, time, and knowledge points are labeled together with the relationship information between them, and the domain knowledge graph is constructed from the labeled entities and the relationships between them. The method and the device use manual labeling to improve the accuracy of the data, while the semi-automatic construction algorithm reduces labor cost and mines latent relationships among the data, so that a more comprehensive domain knowledge graph is constructed.
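Writing a predicted or labeled triple into the Neo4j graph can be sketched as a parameterized Cypher statement. The node label `Entity`, the relationship type `RELATION`, and the property names are assumptions of this example, not the patent's schema. Because Cypher does not allow a relationship type to be passed as a parameter, the relationship name is stored here as a property on a generic relationship type.

```python
def triple_to_cypher(subject, relationship, obj):
    """Build a parameterized Cypher statement that upserts one triple.

    subject / obj: dicts with "id", "name", "type".
    MERGE creates each node or relationship only if it does not exist yet,
    so running the statement twice does not duplicate data.
    """
    query = (
        "MERGE (s:Entity {id: $s_id}) "
        "SET s.name = $s_name, s.type = $s_type "
        "MERGE (o:Entity {id: $o_id}) "
        "SET o.name = $o_name, o.type = $o_type "
        "MERGE (s)-[r:RELATION {name: $rel}]->(o)"
    )
    params = {
        "s_id": subject["id"], "s_name": subject["name"], "s_type": subject["type"],
        "o_id": obj["id"], "o_name": obj["name"], "o_type": obj["type"],
        "rel": relationship,
    }
    return query, params


query, params = triple_to_cypher(
    {"id": 1, "name": "Confucius", "type": "character"},
    "student",
    {"id": 2, "name": "Zengzi", "type": "character"},
)
print(query)
print(params)
```

The statement and its parameters could then be submitted through a Neo4j client, for example with `session.run(query, params)` in the official Python driver; passing values as parameters rather than formatting them into the string avoids quoting problems in entity names.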

To achieve the above object, a second aspect of the present invention provides a domain knowledge graph semi-automatic construction apparatus. Fig. 5 is a schematic structural diagram of a domain knowledge graph semi-automatic construction apparatus according to an embodiment of the present invention.

As shown in Fig. 5, the semi-automatic domain knowledge graph construction apparatus includes a labeling module 100, a completion module 200, and a construction module 300, wherein:

the labeling module is used for labeling the book original text documents to obtain entity-relationship triple data in the domain, organizing the triple data into structured data, and constructing an accurately labeled knowledge graph from the structured data;

the completion module is used for generating descriptions of subjects and objects based on a seq2seq model according to the entity relationships and the context content of the triple data, and completing the entity description information in the accurately labeled knowledge graph;

the construction module is used for performing entity and relationship prediction through a Bert model on the accurately labeled knowledge graph with the completed entity description information, and constructing the domain knowledge graph in combination with the manually labeled entity relationship information.

The labeling module further includes a statistical unit, used for counting the number of entities and triples labeled in the book original text document.

The statistical unit is further used for counting, on the basis of the domain knowledge graph, the valid triple data in the book original text document according to the labeled data and the data generated by seq2seq model prediction.
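The counting performed by the statistical unit can be sketched as follows. The text does not define what makes a triple "valid", so this example assumes that valid triples are simply the distinct triples across the labeled and predicted data; the function name `count_statistics` is hypothetical.

```python
def count_statistics(labeled_triples, predicted_triples):
    """Count distinct entities and triples across labeled and predicted data.

    Each triple is a (subject, relationship, object) tuple. Treating
    "valid" as "distinct" is an assumption of this sketch.
    """
    valid = set(labeled_triples) | set(predicted_triples)
    entities = {entity for s, _, o in valid for entity in (s, o)}
    return {"entities": len(entities), "triples": len(valid)}


labeled = [("Confucius", "student", "Zengzi")]
predicted = [
    ("Confucius", "student", "Zengzi"),   # already labeled, counted once
    ("Zengzi", "teacher", "Confucius"),
]
print(count_statistics(labeled, predicted))
```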

To achieve the above object, a third aspect of the present invention provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the domain knowledge graph semi-automatic construction method as described above.

To achieve the above object, a fourth aspect of the present invention provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the domain knowledge graph semi-automatic construction method as described above.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A semi-automatic construction method of a domain knowledge graph is characterized by comprising the following steps:

labeling the original text documents of books to obtain entity-relationship triple data in the domain, organizing the triple data into structured data, and constructing an accurately labeled knowledge graph from the structured data;

generating descriptions of subjects and objects based on a seq2seq model according to the entity relationships and the context content of the triple data, and completing the entity description information in the accurately labeled knowledge graph;

2. The method of claim 1, wherein labeling the book original text documents to obtain entity-relationship triple data in the domain and organizing the triple data into structured data comprises:

performing triple labeling on the book original text document, wherein the labeling form is <S, R, O>, S is the Subject of the triple, R is the Relationship (the predicate), and O is the Object of the triple; and organizing the book original text document into structured data according to the labeled content.

3. The method of claim 1, wherein generating descriptions of subjects and objects based on a seq2seq model according to the entity relationships and the context content of the triple data, and completing the entity description information in the accurately labeled knowledge graph comprises:

4. The method of claim 1, wherein performing entity and relationship prediction through a Bert model on the accurately labeled knowledge graph with the completed entity description information, and constructing a domain knowledge graph in combination with the manually labeled entity relationship information comprises:

pre-training a Bert model on the book original text document, with retrieval information supplemented by a retrieval-augmented TF-IDF method, and fine-tuning the pre-trained model on the downstream task with the triple data, so that the model can automatically predict triple data in articles of the same domain, or unlabeled triple data in the book original text document, to form structured data; and, using Cypher, a declarative graph query language, updating the labeled knowledge graph with the entity relationship data according to the structured data and constructing the domain knowledge graph.

5. A domain knowledge graph semi-automatic construction device is characterized by comprising:

6. The apparatus of claim 5, the tagging module further to:

7. The apparatus of claim 5, the tagging module, further comprising:

8. The apparatus of claims 5 and 7, the statistics unit to further:

counting, on the basis of the domain knowledge graph, the valid triple data in the book original text document according to the labeled data and the data generated by seq2seq model prediction.

9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for semi-automated domain knowledge graph construction as claimed in any one of claims 1-4.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a method for semi-automated domain knowledge graph construction according to any one of claims 1 to 4.