CN114064910A

CN114064910A - Knowledge graph construction method and system

Info

Publication number: CN114064910A
Application number: CN202111152235.9A
Authority: CN
Inventors: 李涓子; 刘丁枭; 侯磊; 张鹏; 唐杰; 许斌
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2021-09-29
Filing date: 2021-09-29
Publication date: 2022-02-18

Abstract

The invention provides a method and a system for constructing a knowledge graph. Original data and a background knowledge graph are acquired, and knowledge modeling is performed on the original data based on the background knowledge graph to generate concept layer data, so that more complete concept layer data is obtained. In parallel, encyclopedic triple extraction and/or relation triple extraction is performed on the original data based on the background knowledge graph to obtain example layer data; combining encyclopedic triple extraction with open relation extraction yields more complete example layer data. Finally, a new knowledge graph is constructed from the concept layer data and the example layer data, and the background knowledge graph is updated with the new knowledge graph, so that the knowledge graph can be dynamically updated and expanded while in use.

Description

Knowledge graph construction method and system

Technical Field

The invention relates to the technical field of computers, in particular to a method and a system for constructing a knowledge graph.

Background

The knowledge graph is a knowledge base used by Google and its services. It is mainly used to improve search efficiency and user experience in an era of rapid Internet development and explosive growth of network data. With its semantic processing capability and interconnectivity, the knowledge graph lays a foundation for intelligent information applications, is widely used in search, question answering, information analysis and the like, and drives the shift of information technology from information services to knowledge services. In recent years, many industries have been studying how to apply the knowledge graph to their professional fields so that it better serves those specific fields.

However, at present, knowledge graphs are basically constructed once and then put into use; methods for dynamically updating a knowledge graph and expanding it during use are rarely adopted.

Disclosure of Invention

The invention provides a method and a system for constructing a knowledge graph, which address the shortcoming in the prior art that a knowledge graph is difficult to update dynamically and to expand while in use.

In a first aspect, the present invention provides a method for constructing a knowledge graph, the method comprising:

acquiring original data and a background knowledge graph;

performing knowledge modeling on the original data based on the background knowledge graph to generate conceptual layer data;

performing encyclopedic triple extraction and/or relation triple extraction on the original data based on the background knowledge graph to obtain example layer data;

and constructing a new knowledge graph based on the conceptual layer data and the example layer data.

According to the method for constructing the knowledge graph, provided by the invention, on the basis of the background knowledge graph, encyclopedic triple extraction and/or relation triple extraction are carried out on the original data to obtain example layer data, and the method comprises the following steps:

dividing the original data into an original entity list and/or unstructured text;

querying the original entity list based on the background knowledge graph to obtain a first entity list;

performing entity linking, keyword extraction and named entity recognition on the unstructured text to obtain a second entity list;

obtaining a merged entity list according to the first entity list and/or the second entity list;

extracting entity related information from the background knowledge graph by taking each entity in the merged entity list as a keyword to obtain an encyclopedic triple extraction result;

performing relation extraction on the unstructured text, and mounting the extracted relation triples on entities related to the unstructured text to obtain a relation triplet extraction result;

and constructing and obtaining example layer data according to the encyclopedic triple extraction result and/or the relation triple extraction result.

According to the method for constructing the knowledge graph provided by the invention, before extracting entity related information from the background knowledge graph by taking each entity in the merged entity list as a keyword to obtain the encyclopedic triple extraction result, the method further comprises the following steps:

and according to the similarity relation among the entities, extracting the entities with the similarity higher than a preset similarity threshold value from the merged entity list to obtain a candidate entity extraction result.

According to the method for constructing the knowledge graph provided by the invention, before entity linking, keyword extraction and named entity recognition are carried out on the unstructured text to obtain a second entity list, the method further comprises the following steps:

and judging the importance of the unstructured text, and dividing the unstructured text into important texts and/or non-important texts.

According to the method for constructing the knowledge graph, provided by the invention, knowledge modeling is carried out on the original data based on the background knowledge graph to generate conceptual layer data, and the method comprises the following steps:

extracting concepts from the raw data;

acquiring upper and lower concept information corresponding to the extracted concepts based on a known concept tree in the background knowledge graph to obtain candidate concepts;

distributing corresponding weight to each candidate concept according to the data source of the candidate concept, and sequencing and screening the candidate concepts;

and updating the known concept tree based on the concepts extracted from the original data and the sequencing and screening results of the candidate concepts to generate concept layer data.

According to the method for constructing the knowledge graph provided by the invention, knowledge modeling is carried out on the original data based on the background knowledge graph to generate conceptual layer data, and the method further comprises the following steps:

and expanding the upper and lower concepts of any node in the updated known concept tree by a concept expansion mode, and updating the known concept tree for the second time.

In a second aspect, the present invention further provides a system for constructing a knowledge graph, the system comprising:

the first processing module is used for acquiring original data and a background knowledge graph;

the second processing module is used for carrying out knowledge modeling on the original data based on the background knowledge graph to generate conceptual layer data;

the third processing module is used for performing encyclopedic triple extraction and/or relation triple extraction on the original data based on the background knowledge graph to obtain example layer data;

and the fourth processing module is used for fusing the conceptual layer data and the example layer data to construct a new knowledge graph.

In a third aspect, the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method for constructing a knowledge graph according to any one of the above descriptions.

In a fourth aspect, the present invention also provides a non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the method for constructing a knowledge-graph as described in any one of the above.

In a fifth aspect, the present invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method for constructing a knowledge-graph as described in any one of the above.

According to the method and the system for constructing the knowledge graph provided by the invention, original data and a background knowledge graph are acquired, and knowledge modeling is performed on the original data based on the background knowledge graph to generate concept layer data, so that more complete concept layer data is obtained. In parallel, encyclopedic triple extraction and/or relation triple extraction is performed on the original data based on the background knowledge graph to obtain example layer data; combining encyclopedic triple extraction with open relation extraction yields more complete example layer data. Finally, a new knowledge graph is constructed from the concept layer data and the example layer data, and the background knowledge graph is updated with the new knowledge graph, so that the knowledge graph can be dynamically updated and expanded while in use.

Drawings

In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a schematic flow chart of a method for constructing a knowledge graph according to the present invention;

FIG. 2 is a schematic diagram of a concept tree structure constructed using grain as a known concept;

FIG. 3 is a schematic structural diagram of a system for constructing a knowledge graph according to the present invention;

FIG. 4 is a schematic structural diagram of an electronic device provided in the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 illustrates a method for constructing a knowledge graph according to an embodiment of the present invention, where the method includes:

S110: acquiring original data and a background knowledge graph.

The original data is the data used to construct the knowledge graph; it may be text, video, speech and the like, or a mixture of several formats. The background knowledge graph is a large-scale knowledge graph, monolingual or fused across languages, that covers all fields or one particular field. A knowledge graph is a database that stores knowledge as triples, such as (Zhang San, place of birth, Shanghai), each of which represents a fact. A knowledge graph can also be viewed as a graph: in the triple above, Zhang San and Shanghai are nodes, and place of birth is a directed, labeled edge from Zhang San to Shanghai. The background knowledge graph may be XLORE, Wikipedia (multilingual Wikipedia), or the like.
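The two views of a knowledge graph described above (a store of triples, and a directed graph with labeled edges) can be illustrated with a minimal Python sketch. The `Triple` type and the `build_adjacency` helper are names invented here for illustration; the embodiment does not prescribe a storage format.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One fact: (head entity, relation, tail entity)."""
    head: str
    relation: str
    tail: str


def build_adjacency(triples):
    """View a collection of triples as a directed, edge-labeled graph.

    Returns a mapping: head node -> list of (relation label, tail node).
    """
    graph = defaultdict(list)
    for t in triples:
        graph[t.head].append((t.relation, t.tail))
    return dict(graph)


facts = [Triple("Zhang San", "place of birth", "Shanghai")]
graph = build_adjacency(facts)
```

Here `graph["Zhang San"]` holds the single outgoing edge labeled "place of birth" that points to the node "Shanghai".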

S120: performing knowledge modeling on the original data based on the background knowledge graph to generate conceptual layer data.

Specifically, the generation process of the concept layer data in this embodiment includes:

firstly, extracting concepts from original data; wherein a concept refers to a class of entities in a knowledge graph, such as fruits, pomes, and the like.

And then, acquiring the upper and lower concept information corresponding to the extracted concept based on the known concept tree in the background knowledge graph to obtain the candidate concept.

And then, according to the data source of the candidate concepts, distributing corresponding weight to each candidate concept, and sequencing and screening the candidate concepts.

And finally, updating the known concept tree based on the concepts extracted from the original data and the sequencing and screening results of the candidate concepts to generate concept layer data.

For example, to construct a concept tree about grain, the concept "grain" is input first. Candidate upper concepts of "grain", such as "crop" and "food", and candidate lower concepts, such as "wheat", "cereal" and "leafy vegetable", can then be obtained from the known concept trees in the background knowledge graph, for example the concept trees in XLORE or Wikipedia, a general concept tree produced by the user, or a known concept tree in a domain knowledge graph.

These candidate concepts may come from different data sources, and this embodiment ranks the candidate concepts according to their data source. In the above example, this embodiment considers XLORE to be more credible than Wikipedia, so candidate concepts from XLORE are given a higher weight in the weight calculation. The candidate concepts are then screened according to the ranking results, and the candidate concepts that pass the screening are added to the known concept tree, thereby updating it. The resulting concept tree structure is shown in FIG. 2.
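The source-weighted ranking and screening described above might be sketched as follows. The numeric weights, the rule of summing weights across sources, the cut-off score and all function names are assumptions made for illustration; the embodiment states only that candidates from XLORE receive a higher weight than candidates from Wikipedia.

```python
from collections import defaultdict

# Hypothetical weights per data source (assumed values).
SOURCE_WEIGHTS = {"XLORE": 1.0, "Wikipedia": 0.6}


def rank_candidates(candidates, weights=SOURCE_WEIGHTS, default_weight=0.3):
    """Score each candidate concept by summing the weights of its sources.

    `candidates` is an iterable of (concept, source) pairs.
    Returns (concept, score) pairs sorted by descending score, then by name.
    """
    scores = defaultdict(float)
    for concept, source in candidates:
        scores[concept] += weights.get(source, default_weight)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def screen_candidates(ranked, min_score):
    """Keep only the candidates whose score reaches the assumed cut-off."""
    return [concept for concept, score in ranked if score >= min_score]


candidates = [
    ("wheat", "XLORE"),
    ("wheat", "Wikipedia"),
    ("cereal", "XLORE"),
    ("leafy vegetable", "Wikipedia"),
]
ranked = rank_candidates(candidates)
kept = screen_candidates(ranked, min_score=1.0)

# Update a toy known concept tree (parent -> set of lower concepts).
concept_tree = {"grain": {"rice"}}
concept_tree["grain"].update(kept)
```

With these assumed weights, a candidate found only in Wikipedia falls below the cut-off and is not added to the tree, while candidates backed by XLORE are kept.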

In addition, concept expansion can be used to expand the upper and lower concepts of a given node, so that the known concept tree is updated a second time.

After the above operations are completed, if the updated concept tree is still unsatisfactory, it can be edited manually, for example by adding, deleting or modifying nodes.

S130: performing encyclopedic triple extraction and/or relation triple extraction on the original data based on the background knowledge graph to obtain example layer data.

In this embodiment, the process of obtaining the example layer data includes:

first, the original data is divided into an original entity list and/or unstructured text.

The original entity list refers to a list of some entities, such as "rice", "peanut", "apple", "banana", etc.

The unstructured text is free-running text, for example the following passage introducing "peanut": "Peanut, originally named groundnut, is a nut that is abundant in yield and widely eaten. It is an annual herbaceous plant of the order Rosales, family Leguminosae. The stem is upright or creeping and 30-80 cm long; the petals are separated from the keel; the pod is 2-5 cm long and 1-1.3 cm wide, swollen and thick; the flowering and fruiting period is June to August. It can be used as a raw material for soap making and for cosmetics such as hair-growth oil."

Then, based on the background knowledge graph, the original entity list is queried to obtain a first entity list. In this embodiment, the original entity list is queried in the existing encyclopedia knowledge base to obtain a corresponding first entity list.
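As a rough illustration of this query step, the sketch below looks up each name of the original entity list in an index of the encyclopedic knowledge base and keeps the names that are found. The index structure, the identifiers such as `kb:0001` and the function name are hypothetical; a real knowledge base would typically also handle aliases and ambiguous names.

```python
def query_entity_list(original_entities, kb_index):
    """Return (name, kb_id) pairs for the input names found in the knowledge base."""
    return [(name, kb_index[name]) for name in original_entities if name in kb_index]


# Hypothetical index from surface name to knowledge base identifier.
kb_index = {"rice": "kb:0001", "peanut": "kb:0002", "apple": "kb:0003"}

first_entity_list = query_entity_list(["rice", "peanut", "dragon fruit"], kb_index)
```

Names that have no entry in the index, such as "dragon fruit" here, are simply left out of the first entity list in this sketch.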

Meanwhile, entity linking, keyword extraction and named entity recognition are carried out on the unstructured text to obtain a second entity list.

Preferably, before performing entity linking, keyword extraction and named entity recognition on the unstructured text to obtain the second entity list, this embodiment further includes: judging the importance of the unstructured text, and dividing the unstructured text into important text and/or non-important text.

In practical applications, recently published text may be considered important. A publication-time threshold may therefore be set: for example, text published within the last three years is considered important, and text published more than three years ago is considered non-important. The importance of unstructured text may of course also be measured by other criteria.

Then, when the second entity list is acquired, entity linking, keyword extraction and named entity recognition are performed only on the important text to obtain the corresponding second entity list. However, when open relation extraction is performed later, relation extraction needs to be performed on all of the unstructured text.
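The publication-time criterion can be sketched as a simple partition. The three-year window comes from the example above; approximating it as 3 × 365 days, the `(text, published)` input shape and the function name are assumptions.

```python
from datetime import date, timedelta


def split_by_importance(texts, today, max_age_days=3 * 365):
    """Partition texts into (important, non_important) by publication date.

    `texts` is an iterable of (text, published) pairs, where `published`
    is a datetime.date. A text counts as important when it was published
    no more than `max_age_days` before `today`.
    """
    cutoff = today - timedelta(days=max_age_days)
    important, non_important = [], []
    for text, published in texts:
        (important if published >= cutoff else non_important).append(text)
    return important, non_important


docs = [
    ("Peanut cultivation notes", date(2021, 3, 1)),
    ("Old sorghum survey", date(2015, 6, 1)),
]
important, non_important = split_by_importance(docs, today=date(2021, 9, 29))

# Entity linking, keyword extraction and NER would run on `important` only;
# open relation extraction would run on all texts.
all_texts = important + non_important
```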

Then, a merged entity list is obtained according to the first entity list and/or the second entity list. In this embodiment, the entity lists obtained by all of the above means are merged to obtain the merged entity list.

Then, entity related information is extracted from the background knowledge graph by taking each entity in the merged entity list as a keyword to obtain an encyclopedic triple extraction result.

In this embodiment, the related original data for each entity in the merged entity list is retrieved from an encyclopedic knowledge base (or a known knowledge graph) to obtain the encyclopedic triple extraction result, and this part of the data then goes through the knowledge graph construction process.
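A minimal sketch of the merging and lookup steps, assuming the background knowledge graph is available as a list of (head, relation, tail) tuples. Both function names and the toy data are illustrative only, and matching entities by exact head name is a simplification.

```python
def merge_entity_lists(*entity_lists):
    """Union several entity lists, keeping first-seen order and dropping duplicates."""
    return list(dict.fromkeys(e for lst in entity_lists for e in lst))


def extract_encyclopedic_triples(entities, background_kg):
    """Collect every background triple whose head is one of `entities`."""
    wanted = set(entities)
    return [t for t in background_kg if t[0] in wanted]


# Toy background knowledge graph (illustrative data only).
background_kg = [
    ("peanut", "family", "Leguminosae"),
    ("rice", "type", "cereal"),
    ("banana", "type", "fruit"),
]

first_list = ["rice", "peanut"]
second_list = ["peanut", "groundnut"]
merged = merge_entity_lists(first_list, second_list)
encyclopedic_triples = extract_encyclopedic_triples(merged, background_kg)
```

Entities with no entry in the background knowledge graph, such as "groundnut" in this toy data, contribute no encyclopedic triples.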

Meanwhile, performing relation extraction on all unstructured texts, and mounting the extracted relation triples on entities related to the unstructured texts to obtain a relation triplet extraction result.
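The sketch below shows one possible reading of "mounting": each extracted relation triple is attached to the entity of the text that appears as its head. The regular expression is only a toy stand-in for an open relation extraction model, which the embodiment does not specify, and all names are hypothetical.

```python
import re
from collections import defaultdict

# Toy pattern standing in for an open relation extraction model. It only
# recognizes clauses of the form "<subject> belongs to [the] <object>".
BELONGS_TO = re.compile(r"(\w+) belongs to (?:the )?([\w ]+?)[.,]")


def extract_relations(text):
    """Return (head, relation, tail) triples found by the toy pattern."""
    return [(head.lower(), "belongs to", tail)
            for head, tail in BELONGS_TO.findall(text)]


def mount_triples(triples, text_entities):
    """Attach each triple to the entity of the text that appears as its head."""
    known = set(text_entities)
    mounted = defaultdict(list)
    for head, relation, tail in triples:
        if head in known:
            mounted[head].append((relation, tail))
    return dict(mounted)


text = "Peanut belongs to the family Leguminosae."
triples = extract_relations(text)
mounted = mount_triples(triples, text_entities=["peanut"])
```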

The input original data may fall into different cases: only an entity list; only important text; only non-important text; an entity list and non-important text; important text and non-important text; an entity list, important text and non-important text; and so on. Correspondingly, instance layer data can be extracted by encyclopedic triple extraction, by open relation extraction, or by a combination of the two. The instance layer data construction scheme for each case is shown in Table 1 below:

Table 1. Example layer data construction schemes for the different cases

And finally, constructing and obtaining example layer data according to the encyclopedic triple extraction result and/or the relation triple extraction result.

To ensure the reliability of the data in the merged entity list, in this embodiment, before the encyclopedic triple extraction result is obtained, entities whose similarity is higher than a preset similarity threshold are extracted from the merged entity list according to the similarity relation among the entities, yielding a candidate entity extraction result.

This is mainly because an entity in the merged entity list may have several meanings. For example, "millet" may refer to an internet company, a grain, a character in a drama, and so on. This embodiment selects according to the similarity among all of the entities and keeps the meanings that are most similar to the related entities: if the related entities are "rice", "sorghum", "corn" and the like, this embodiment keeps the "grain" meaning of "millet".

Then, entity related information is extracted from the background knowledge graph by taking each entity in the candidate entity extraction result as a keyword to obtain the encyclopedic triple extraction result.
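The "millet" example can be sketched as follows. The embodiment does not name a similarity measure, so this sketch assumes Jaccard similarity over hand-written feature sets and keeps every meaning whose mean similarity to the other entities exceeds a threshold; the feature sets, the threshold value and the function names are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two feature sets (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_senses(senses, context, threshold):
    """Keep the senses whose mean similarity to the context entities exceeds the threshold.

    `senses` maps a sense label to its feature set; `context` is a list of
    feature sets, one per related entity in the merged entity list.
    """
    if not context:
        return []
    kept = []
    for label, features in senses.items():
        mean = sum(jaccard(features, other) for other in context) / len(context)
        if mean > threshold:
            kept.append(label)
    return kept


# Hypothetical feature sets for the three meanings of "millet".
millet_senses = {
    "grain": {"plant", "crop", "food"},
    "internet company": {"company", "electronics"},
    "drama role": {"fiction", "character"},
}
# Hypothetical feature sets for the related entities rice, sorghum and corn.
context = [
    {"plant", "crop", "food"},
    {"plant", "crop", "food"},
    {"plant", "crop", "food", "vegetable"},
]
kept_senses = select_senses(millet_senses, context, threshold=0.5)
```

In practice the similarity could equally come from embeddings or from shared categories in the background knowledge graph; the thresholding logic would stay the same.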

S140: constructing a new knowledge graph based on the conceptual layer data and the example layer data.

Because the boundary between concepts and entities is not perfectly clear, some items can be regarded either as a concept or as an entity. This embodiment therefore searches the example layer data using the concept names in the concept layer data; if a search result matches an entity in the obtained example layer data, that entity is also treated as a concept.
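One simple reading of this check is an exact-name match between concept names and the head entities of the example layer triples. The function name and the toy data below are assumptions; the embodiment does not specify how the search or the comparison is carried out.

```python
def entities_that_are_concepts(concept_names, instance_triples):
    """Return the names that occur both as a concept and as a head entity."""
    entity_names = {head for head, _, _ in instance_triples}
    return sorted(set(concept_names) & entity_names)


# Toy concept layer names and example layer triples (illustrative data only).
concept_names = ["grain", "wheat", "cereal"]
instance_triples = [
    ("wheat", "family", "Poaceae"),
    ("peanut", "family", "Leguminosae"),
]
dual_role = entities_that_are_concepts(concept_names, instance_triples)
```

In this toy data "wheat" appears in both layers, so it would be treated as a concept as well as an entity when the two layers are combined.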

Therefore, the construction of a new knowledge graph is realized by respectively constructing and combining the data of the concept layer and the data of the example layer, the original background knowledge graph can be updated through the new knowledge graph, and the dynamic update and the expansion of the knowledge graph in the using process are realized.
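Assuming both layers are represented as sets of (head, relation, tail) triples, the combination and the update of the background knowledge graph reduce to set operations, as in the sketch below. The "subclass of" relation name and both function names are illustrative choices, not terms from the embodiment.

```python
def build_new_graph(concept_layer, instance_layer):
    """Combine concept layer and example layer triples into one graph."""
    return set(concept_layer) | set(instance_layer)


def update_background(background, new_graph):
    """Merge the new graph into the background graph.

    Returns (updated background graph, triples that were newly added).
    """
    added = new_graph - background
    return background | new_graph, added


background = {("wheat", "subclass of", "grain")}
concept_layer = [
    ("wheat", "subclass of", "grain"),
    ("cereal", "subclass of", "grain"),
]
instance_layer = [("peanut", "family", "Leguminosae")]

new_graph = build_new_graph(concept_layer, instance_layer)
updated, added = update_background(background, new_graph)
```

Tracking the `added` set separately makes each update incremental: only facts not already in the background knowledge graph need to be reviewed or persisted.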

The system for constructing the knowledge graph provided by the invention is described below, and the system for constructing the knowledge graph described below and the method for constructing the knowledge graph described above can be referred to correspondingly.

FIG. 3 shows a system for constructing a knowledge graph according to an embodiment of the present invention, the system comprising:

a first processing module 310, configured to obtain raw data and a background knowledge graph;

the second processing module 320 is configured to perform knowledge modeling on the original data based on the background knowledge graph to generate conceptual layer data;

the third processing module 330 is configured to perform encyclopedic triple extraction and/or relationship triple extraction on the original data based on the background knowledge graph to obtain instance layer data;

and the fourth processing module 340 is configured to fuse the concept layer data and the instance layer data to construct a new knowledge graph.

Specifically, the third processing module 330 in this embodiment includes:

the dividing unit is used for dividing the original data into an original entity list and/or an unstructured text;

the entity list processing unit is used for querying the original entity list based on the background knowledge graph to obtain a first entity list;

the unstructured text processing unit is used for carrying out entity linking, keyword extraction and named entity recognition on the unstructured text to obtain a second entity list;

the merging unit is used for obtaining a merged entity list according to the first entity list and/or the second entity list;

the encyclopedic triple extraction unit is used for extracting entity related information from the background knowledge graph by taking each entity in the merged entity list as a keyword to obtain an encyclopedic triple extraction result;

the relation triple extraction unit is used for carrying out relation extraction on the unstructured text and mounting the extracted relation triples on entities related to the unstructured text to obtain a relation triple extraction result;

and the example layer data construction unit is used for constructing and obtaining the example layer data according to the encyclopedic triple extraction result and/or the relation triple extraction result.

Preferably, in this embodiment, the third processing module 330 further includes:

the candidate entity extraction unit is used for extracting the entities with the similarity higher than a preset similarity threshold value from the merged entity list according to the similarity relation among the entities to obtain a candidate entity extraction result;

and the importance judging unit is used for judging the importance of the unstructured text and dividing the unstructured text into important texts and/or non-important texts.

Specifically, the second processing module 320 in this embodiment includes:

a concept extraction unit for extracting concepts from the original data;

the candidate concept acquisition unit is used for acquiring superior and inferior concept information corresponding to the extracted concepts based on a known concept tree in the background knowledge graph to obtain candidate concepts;

the concept sorting unit is used for distributing corresponding weight to each candidate concept according to the data source of the candidate concept and sorting and screening the candidate concept;

and the concept layer data generating unit is used for updating the known concept tree based on the concepts extracted from the original data and the sequencing and screening results of the candidate concepts to generate concept layer data.

Preferably, the second processing module 320 in this embodiment further includes:

and the concept extension unit is used for extending the upper and lower concepts of any node in the updated known concept tree in a concept extension mode and carrying out secondary updating on the known concept tree.

Therefore, the system for constructing the knowledge graph provided by the embodiment of the invention generates the concept layer data based on the original data and the background knowledge graph, so as to obtain more complete concept layer data; meanwhile, on the basis of a background knowledge graph, encyclopedic triple extraction and/or relation triple extraction are carried out on original data to obtain example layer data, so that more complete example layer data is obtained by combining the encyclopedic triple extraction and open relation extraction; and finally, constructing a new knowledge graph based on the concept layer data and the example layer data, and updating the background knowledge graph by using the new knowledge graph, thereby realizing the dynamic updating of the knowledge graph and the expansion in the using process.

FIG. 4 illustrates the physical structure of an electronic device. As shown in FIG. 4, the electronic device may include a processor 410, a communication interface 420, a memory 430 and a communication bus 440, wherein the processor 410, the communication interface 420 and the memory 430 communicate with one another via the communication bus 440. The processor 410 may invoke logic instructions in the memory 430 to perform a method for constructing a knowledge graph, the method comprising: acquiring original data and a background knowledge graph; performing knowledge modeling on the original data based on the background knowledge graph to generate conceptual layer data; performing encyclopedic triple extraction and/or relation triple extraction on the original data based on the background knowledge graph to obtain example layer data; and constructing a new knowledge graph based on the conceptual layer data and the example layer data.

In addition, the logic instructions in the memory 430 may be implemented in the form of software functional units and, when sold or used as independent products, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and various other media capable of storing program code.

In another aspect, the present invention also provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, a computer is capable of executing the method for constructing a knowledge graph provided above, the method comprising: acquiring original data and a background knowledge graph; performing knowledge modeling on the original data based on the background knowledge graph to generate conceptual layer data; performing encyclopedic triple extraction and/or relation triple extraction on the original data based on the background knowledge graph to obtain example layer data; and constructing a new knowledge graph based on the conceptual layer data and the example layer data.

In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for constructing a knowledge graph provided above, the method comprising: acquiring original data and a background knowledge graph; performing knowledge modeling on the original data based on the background knowledge graph to generate conceptual layer data; performing encyclopedic triple extraction and/or relation triple extraction on the original data based on the background knowledge graph to obtain example layer data; and constructing a new knowledge graph based on the conceptual layer data and the example layer data.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for constructing a knowledge graph, comprising:

acquiring original data and a background knowledge graph;

2. The method for constructing a knowledge graph according to claim 1, wherein performing encyclopedic triple extraction and/or relationship triple extraction on the raw data based on the background knowledge graph to obtain instance layer data comprises:

3. The method for constructing a knowledge graph according to claim 2, wherein before obtaining the entity-related information from the background knowledge graph and obtaining the encyclopedic triple extraction result, taking each entity in the merged entity list as a keyword, the method further comprises:

4. The method for constructing a knowledge graph according to claim 2, wherein before performing entity linking, keyword extraction and named entity recognition on the unstructured text to obtain a second entity list, the method further comprises:

5. The method of claim 1, wherein knowledge modeling the raw data based on the background knowledge graph to generate concept layer data comprises:

extracting concepts from the raw data;

6. The method of claim 5, wherein the knowledge modeling of the raw data based on the background knowledge graph to generate concept layer data further comprises:

7. A system for constructing a knowledge graph, comprising:

8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method of constructing a knowledge-graph according to any one of claims 1 to 6 when executing the program.

9. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the method of constructing a knowledge-graph according to any one of claims 1 to 6.

10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, carries out the steps of the method of constructing a knowledge-graph according to any one of claims 1 to 6.