CN113127645B

CN113127645B - Automatic extraction method for large-scale knowledge graph ontology, terminal device and storage medium

Info

Publication number: CN113127645B
Application number: CN202110380611.3A
Authority: CN
Inventors: 洪万福; 张林娜
Original assignee: Xiamen Yuanting Information Technology Co ltd
Current assignee: Xiamen Yuanting Information Technology Co ltd
Priority date: 2021-04-09
Filing date: 2021-04-09
Publication date: 2022-09-13
Anticipated expiration: 2041-04-09
Also published as: CN113127645A

Abstract

The invention relates to a method for automatically extracting the ontology of a large-scale knowledge graph, a terminal device and a storage medium. The method comprises the following steps: S1: obtaining entities from a knowledge graph; S2: performing a preliminary classification of the obtained entities using a rule matching algorithm; S3: applying a named entity recognition model to the entities left unclassified by step S2, and determining the type of each recognized named entity; S4: classifying the entities that remain unrecognized after step S3 using a clustering algorithm; S5: merging and adjusting the classification results of steps S2, S3 and S4 to obtain a final classification result. By combining several techniques, the invention automates ontology extraction for large-scale industrial knowledge graphs, and can still extract an ontology from knowledge graph entities that are complex, large in volume and heavy with dirty data, even when no manually labeled data is available.

Description

Automatic extraction method of large-scale knowledge graph ontology, terminal equipment and storage medium

Technical Field

The invention relates to the field of knowledge graphs, in particular to a large-scale knowledge graph ontology automatic extraction method, terminal equipment and a storage medium.

Background

The concept of the Knowledge Graph was formally proposed by Google in 2012 with the aim of building a more intelligent search engine, and began to spread through academia and industry after 2013. With the continued development of intelligent information services, knowledge graphs are now widely applied in fields such as intelligent search, intelligent question answering, personalized recommendation, intelligence analysis and anti-fraud.

A knowledge graph can be constructed top-down or bottom-up. Top-down construction first defines an ontology and then adds entities to the knowledge base. Bottom-up construction extracts entities from publicly available data by some technical means, then selects the entities with higher confidence and adds them to the knowledge base. The bottom-up approach is currently the mainstream, and it requires the ontology to be extracted and constructed after the graph has been built. Ontology construction methods can be divided into manual, semi-automatic and automatic construction according to the degree of manual intervention, but no mature technical system exists at present.

Disclosure of Invention

In order to solve the above problems, the present invention provides an automatic extraction method for a large-scale knowledge-graph ontology, a terminal device and a storage medium.

The specific scheme is as follows:

A large-scale knowledge graph ontology automatic extraction method comprises the following steps:

S1: obtaining entities from a knowledge graph;

S2: performing a preliminary classification of the obtained entities using a rule matching algorithm;

S3: applying a named entity recognition model to the entities left unclassified by step S2, and determining the type of each recognized named entity;

S4: classifying the entities that remain unrecognized after the named entity recognition of step S3 using a clustering algorithm;

S5: merging and adjusting the classification results of steps S2, S3 and S4 to obtain a final classification result.
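The five steps form a cascade in which each stage only sees what the previous stage could not classify. The following is a minimal sketch of that control flow, not the patented implementation: the three classifier callables and the name `cascade_classify` are hypothetical, and the "adjusting" part of S5 (for example, naming clusters or merging equivalent categories) is left out.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical interface: return a category name, or None when undecided.
Classifier = Callable[[str], Optional[str]]


def cascade_classify(
    entities: Iterable[str],
    rule_classifier: Classifier,
    ner_classifier: Classifier,
    cluster_classifier: Callable[[List[str]], Dict[str, str]],
) -> Dict[str, str]:
    """Run S2 -> S3 -> S4 in order and merge the results (S5)."""
    result: Dict[str, str] = {}

    # S2: rule matching classifies what it can.
    after_rules: List[str] = []
    for entity in entities:
        category = rule_classifier(entity)
        if category is None:
            after_rules.append(entity)
        else:
            result[entity] = category

    # S3: named entity recognition on the entities S2 left unclassified.
    after_ner: List[str] = []
    for entity in after_rules:
        category = ner_classifier(entity)
        if category is None:
            after_ner.append(entity)
        else:
            result[entity] = category

    # S4: clustering assigns a group to everything still unclassified.
    if after_ner:
        result.update(cluster_classifier(after_ner))

    # S5: `result` now holds the merged classification.
    return result


# Toy stand-ins for the three stages, for illustration only.
toy_rules = lambda e: "equipment" if e.endswith("tank") else None
toy_ner = lambda e: "people" if e == "dwight eisenhower" else None
toy_cluster = lambda es: {e: "cluster_0" for e in es}

merged = cascade_classify(
    ["t-72 tank", "dwight eisenhower", "field manual"],
    toy_rules, toy_ner, toy_cluster,
)
```

Ordering the stages from cheapest to most expensive means the clustering stage receives the smallest possible input.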

Further, step S1 includes preprocessing the obtained entities, where the preprocessing includes punctuation cleaning, abnormal length entity filtering, and converting capital letters into lowercase letters.

Further, the clustering algorithm in step S4 adopts a Kmeans clustering algorithm.

Further, the classification by the clustering algorithm in step S4 proceeds as follows:

S401: for each entity to be classified, extracting one or more of its attributes, labels and relations from the knowledge graph, concatenating the extracted attributes, labels and relations with the entity name, obtaining a vector representation of each character in the concatenated string using a natural language processing word vector technique, and taking the average of the vector representations of all characters as the word vector of the entity to be classified;

S402: inputting the word vectors of the entities to be classified into a Kmeans model, and determining the number of clusters k using the elbow method;

S403: inputting both the word vectors of the entities to be classified and the number of clusters k into the Kmeans model to obtain a clustering result.

Further, the natural language processing word vector technique adopted in step S401 is the bert-base-multilingual-uncased model, trained on corpora in 102 languages.

Further, if the number of entities in a category of the final classification result is greater than a preset number threshold, steps S2 to S5 are performed again on the entities of that category to classify them further.

The terminal equipment for automatically extracting the large-scale knowledge graph ontology comprises a processor, a memory and a computer program which is stored in the memory and can run on the processor, wherein the processor executes the computer program to realize the steps of the method of the embodiment of the invention.

A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the method as described above for an embodiment of the invention.

The invention adopts the technical scheme and has the following beneficial effects:

1. Strong applicability: the invention can be applied to knowledge graphs in different domains.

2. Good results: combining several techniques ensures the quality of ontology extraction. First, the preliminary classification by rule matching yields high-quality classifications. Second, a named entity recognition model is used, which may be either an open-source model or a self-trained model; in either case it is trained on a large-scale labeled text corpus and therefore recognizes and classifies text well. Third, entity names are concatenated with entity attributes, labels and relations, and a text vector representation is obtained with a natural language word vector technique; this extracts more features than using entity names alone and greatly improves the learning effect of the subsequent Kmeans model.

3. High speed: first, rule matching and named entity recognition both classify quickly. Second, classifying with rule matching and named entity recognition beforehand reduces the number of samples left to classify, which greatly reduces the time spent on the subsequent word vector conversion and on training and prediction with the Kmeans model.

4. Quick implementation: the named entity recognition model and the natural language processing word vector model are both interchangeable, and open-source models can be used, so a first version of a project can be implemented quickly and results can be seen quickly.

5. Strong extensibility: the procedure can be iterated in a loop as desired, so the results are highly extensible.

Drawings

Fig. 1 is a flowchart illustrating a first embodiment of the present invention.

Fig. 2 is a schematic diagram of the line graph used in this embodiment.

Detailed Description

To further illustrate the various embodiments, the invention provides the accompanying drawings. The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the embodiments. Those skilled in the art will appreciate still other possible embodiments and advantages of the present invention with reference to these figures.

The invention will now be further described with reference to the accompanying drawings and detailed description.

The first embodiment is as follows:

This embodiment of the invention provides a method for automatically extracting the ontology of a large-scale knowledge graph. As shown in the flowchart of Fig. 1, the method comprises the following steps:

S1: Obtain entities from a knowledge graph.

In this embodiment, 400,000 entities are obtained from the knowledge graph using Cypher query statements.

Further, since the obtained entities are not uniform in format and include some useless data, they need to be preprocessed. The preprocessing in this embodiment includes punctuation cleaning, filtering out entities of abnormal length, converting uppercase letters to lowercase, and the like; other embodiments may adopt other processing, which is not limited here.
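The three preprocessing operations can be sketched as follows. The length bounds `MIN_LEN` and `MAX_LEN` and the punctuation set are assumptions made for illustration, since the embodiment does not state what counts as an abnormal length or which marks are cleaned.

```python
import string
from typing import Iterable, List

# Assumed bounds; the embodiment does not define "abnormal length".
MIN_LEN = 2
MAX_LEN = 50

# ASCII punctuation plus a few full-width marks (illustrative, not exhaustive).
_PUNCT = string.punctuation + "，。、；：？！“”‘’（）《》【】"
_STRIP_TABLE = str.maketrans("", "", _PUNCT)


def preprocess(entities: Iterable[str]) -> List[str]:
    """Clean punctuation, lowercase, and drop names of abnormal length."""
    cleaned: List[str] = []
    for name in entities:
        name = name.translate(_STRIP_TABLE)  # punctuation cleaning
        name = " ".join(name.split())        # collapse leftover whitespace
        name = name.lower()                  # uppercase -> lowercase
        if MIN_LEN <= len(name) <= MAX_LEN:  # abnormal-length filtering
            cleaned.append(name)
    return cleaned


sample = preprocess(["  T-72 Tank! ", "x", "USS Enterprise (CVN-65)"])
```

Filtering on length after cleaning, rather than before, avoids keeping names that consist mostly of punctuation.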

S2: Perform a preliminary classification of the extracted entities using a rule matching algorithm.

The following rules are employed in this embodiment:

a. entities ending in "ship", "boat", "gun", "radar", "tank" and the like are assigned the category "equipment";

b. entities ending in "army", "brigade", "regiment", "division", "war zone" and the like are assigned the category "organization".

The above are only example rules adopted in this embodiment; in other embodiments, those skilled in the art may set other rules as required, which is not limited here.

This step completes the classification of some of the entities, with high classification quality.
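A suffix rule set of this kind can be sketched as a lookup table plus a single `str.endswith` call per category. The suffix lists below are illustrative English stand-ins, because the embodiment's actual rules match suffixes in the source language, and `SUFFIX_RULES` and `classify_by_suffix` are hypothetical names.

```python
from typing import Dict, Optional, Tuple

# Illustrative English stand-ins for the embodiment's suffix rules.
SUFFIX_RULES: Dict[str, Tuple[str, ...]] = {
    "equipment": ("ship", "boat", "gun", "radar", "tank"),
    "organization": ("army", "division", "war zone"),
}


def classify_by_suffix(name: str) -> Optional[str]:
    """Return the first category whose suffix list matches, else None."""
    for category, suffixes in SUFFIX_RULES.items():
        # str.endswith accepts a tuple and tests every suffix in it.
        if name.endswith(suffixes):
            return category
    return None
```

Returning `None` for unmatched names is what lets the later stages pick up only the entities this step could not decide.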

S3: Apply a named entity recognition model to the entities left unclassified by step S2, and determine the type of each recognized named entity.

The named entity recognition model may be an open-source model, such as HanLP or LTP, or a self-trained model. The categories that the open-source HanLP or LTP can recognize include person names, place names, organization names and so on. The categories recognizable by a self-trained named entity recognition model are defined when that model is trained. Training the named entity recognition model is outside the scope of the present invention and is not described here. This embodiment uses a self-trained named entity recognition model: the entities left unclassified by step S2 are input into the model, and some of them are recognized and classified. For example, when "Dwight Eisenhower" is input into the named entity recognition model, it is recognized and classified as "person".
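Because the model is interchangeable, this step can be written against a plain callable rather than a specific library. The sketch below assumes a hypothetical `recognize` function that returns a tag or `None`; the tag names in `TAG_TO_CATEGORY` are also assumptions, since every model defines its own tag set, and `toy_recognize` is a lookup table standing in for a trained model.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical tag names; a real model defines its own tag set.
TAG_TO_CATEGORY: Dict[str, str] = {
    "PER": "people",
    "LOC": "location",
    "ORG": "organization",
}


def classify_with_ner(
    entities: Iterable[str],
    recognize: Callable[[str], Optional[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Split entities into (recognized -> category, still unrecognized)."""
    classified: Dict[str, str] = {}
    unrecognized: List[str] = []
    for name in entities:
        tag = recognize(name)
        category = TAG_TO_CATEGORY.get(tag) if tag is not None else None
        if category is None:
            unrecognized.append(name)  # handed on to the clustering step
        else:
            classified[name] = category
    return classified, unrecognized


# Toy recognizer backed by a lookup table, standing in for a trained model.
_TOY_TAGS = {"dwight eisenhower": "PER", "xiamen": "LOC"}


def toy_recognize(name: str) -> Optional[str]:
    return _TOY_TAGS.get(name)


recognized, leftover = classify_with_ner(
    ["dwight eisenhower", "xiamen", "field manual 3-0"], toy_recognize
)
```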

S4: Classify the entities that remain unrecognized after the named entity recognition of step S3 using a clustering algorithm.

In this embodiment, a Kmeans clustering algorithm is used for classification, and the specific classification process is as follows:

S401: For each entity to be classified, extract one or more of its attributes, labels and relations from the knowledge graph, concatenate the extracted attributes, labels and relations with the entity name, use a natural language word vector technique to obtain a vector representation of each character in the concatenated string, and take the average of the vector representations of all characters as the word vector of the entity to be classified.

Natural language word vector techniques include BERT (Bidirectional Encoder Representations from Transformers), fastText, Word2vec and the like.

Since the knowledge graph may contain foreign-language entities, this embodiment preferably uses the bert-base-multilingual-uncased model, trained on corpora in 102 languages, to obtain the vector representation of each character in the concatenated string.
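The concatenate-then-average construction of S401 can be sketched independently of the embedding model. In the sketch below `toy_embed` is a deliberately trivial stand-in for a real model such as BERT, the space separator in `build_entity_text` is an assumption, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def build_entity_text(
    name: str,
    attributes: Sequence[str] = (),
    labels: Sequence[str] = (),
    relations: Sequence[str] = (),
) -> str:
    """Concatenate the entity name with its attributes, labels and relations."""
    return " ".join([name, *attributes, *labels, *relations])


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / count for i in range(dim)]


def toy_embed(char: str) -> Vector:
    # Stand-in for a real per-character embedding; NOT a language model.
    code = ord(char)
    return [float(code % 7), float(code % 11), float(code % 13)]


def entity_vector(text: str, embed: Callable[[str], Vector] = toy_embed) -> Vector:
    """Average the per-character vectors of the concatenated string."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        raise ValueError("entity text has no characters to embed")
    return mean_vector([embed(ch) for ch in chars])


text = build_entity_text("t72", ["armor"], ["vehicle"])
vector = entity_vector(text)
```

Swapping `toy_embed` for a function that looks up real model embeddings leaves the rest of the construction unchanged.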

S402: Input the word vectors of the entities to be classified into a Kmeans model, and determine the number of clusters k using the elbow method.

The elbow method proceeds as follows: preset a start value, an end value and a step for k; input the word vectors of the entities to be classified into the Kmeans model; record the SSE (sum of squared errors) for each value of k; plot the values as a line graph; and take the inflection point of the line graph as the final number of clusters k.

In this embodiment, k ranges from 2 to 20 in steps of 2. The resulting line graph is shown in Fig. 2; its inflection point is at 4, so the number of clusters k is 4.
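The elbow search can be sketched with a small pure-Python k-means that reports the SSE for a given k. Two points are assumptions rather than part of the embodiment: the embodiment reads the inflection point off the plotted line graph, whereas `elbow_k` below substitutes a simple numeric heuristic (the largest second difference of the SSE curve), and the initialization by random sampling with a fixed seed is chosen only to keep the sketch reproducible.

```python
import random
from typing import List, Sequence

Point = Sequence[float]


def _sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_sse(points: List[Point], k: int, iters: int = 50, seed: int = 0) -> float:
    """Run plain k-means and return the sum of squared errors (SSE)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: _sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(_sq_dist(p, c) for c in centers) for p in points)


def elbow_k(ks: Sequence[int], sses: Sequence[float]) -> int:
    """Pick the k where the SSE curve bends most sharply.

    Heuristic stand-in for reading the inflection point off the plot:
    the largest second difference of the SSE values.
    """
    bend = lambda i: sses[i - 1] - 2 * sses[i] + sses[i + 1]
    return ks[max(range(1, len(ks) - 1), key=bend)]


# With the embodiment's search range (k from 2 to 20 in steps of 2):
#   ks = list(range(2, 21, 2))
#   sses = [kmeans_sse(vectors, k) for k in ks]
#   k = elbow_k(ks, sses)
```

A numeric heuristic makes the choice repeatable, but it can disagree with a visual reading when the curve has no clear bend, so inspecting the plot remains a reasonable check.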

In this embodiment, the obtained clustering result is: "equipment", "organization", "literature", "location", "people".

In this embodiment, the final classification result of this round is: "equipment", "organization", "people", "location", "literature".

Further, if the number of entities in a category of the final classification result is greater than a preset number threshold and a finer division of the ontology categories is desired, steps S2 to S5 may be performed again on the entities of that category to classify them further.

In this embodiment, 150,000 entities belong to the equipment category, 40,000 to the organization category, 70,000 to the people category, 60,000 to the location category and 80,000 to the literature category. The number of entities in the equipment category is greater than the number threshold, so further subdivision is required.

The Kmeans clustering algorithm of step S4 is applied to the 150,000 entities of the equipment category, and the resulting clusters are "land equipment", "water equipment" and "air equipment". After merging and adjusting the classification results, the final classification result of this round is: "land equipment" (70,000 entities), "water equipment" (40,000 entities), "air equipment" (40,000 entities), "organization" (40,000 entities), "people" (70,000 entities), "location" (60,000 entities) and "literature" (80,000 entities). After this further classification, the number of entities in every category is below the number threshold, which meets the requirements of the engineering project; no further subdivision is needed, and the ontology extraction is complete.

The number threshold may be set by one skilled in the art according to actual requirements, and is not limited herein.
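The threshold-driven iteration can be sketched as a loop that re-classifies only the oversized categories. The name `refine`, the `classify` callable and the `max_rounds` safety cap are all assumptions added for illustration; the cap prevents an endless loop when a category cannot be split any further.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def refine(
    assignments: Dict[str, str],
    classify: Callable[[List[str]], Dict[str, str]],
    threshold: int,
    max_rounds: int = 5,
) -> Dict[str, str]:
    """Re-classify every category holding more than `threshold` entities."""
    result = dict(assignments)
    for _ in range(max_rounds):
        # Group entities by their current category.
        groups: Dict[str, List[str]] = defaultdict(list)
        for entity, category in result.items():
            groups[category].append(entity)
        oversized = [m for m in groups.values() if len(m) > threshold]
        if not oversized:
            break  # every category is within the threshold
        for members in oversized:
            # `classify` stands in for re-running steps S2 to S5.
            result.update(classify(members))
    return result


# Toy example: split an oversized "equipment" category in two.
toy_split = lambda members: {
    m: "water equipment" if m.startswith("ship") else "land equipment"
    for m in members
}
refined = refine(
    {"tank a": "equipment", "tank b": "equipment",
     "ship a": "equipment", "hq": "organization"},
    toy_split,
    threshold=2,
)
```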

The first embodiment of the invention combines rule matching, named entity recognition, natural language word vector techniques and Kmeans clustering. It automates ontology extraction for large-scale industrial knowledge graphs, and can still extract an ontology from knowledge graph entities that are complex, large in volume and heavy with dirty data, even when no manually labeled data is available; if labeled data exists for some entities, the invention can achieve even better results.

When entity names are used alone, the available features are insufficient, so what Kmeans learns is too simple and tends to underfit. Entity nodes in a knowledge graph have not only entity names but usually also entity attributes, entity labels and entity relations. This embodiment extracts the entity attributes, entity labels and entity relations from the graph and concatenates them with the entity name as a string, so that Kmeans can learn richer features of each entity.

The second embodiment is as follows:

The invention also provides a terminal device for automatically extracting the ontology of a large-scale knowledge graph, which comprises a memory, a processor and a computer program that is stored in the memory and can run on the processor, wherein the processor, when executing the computer program, implements the steps of the method of the first embodiment of the invention.

Further, as a practicable scheme, the large-scale knowledge graph ontology automatic extraction terminal device may be a computing device such as a desktop computer, a notebook computer, a palmtop computer or a cloud server. The terminal device may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the structure described above is only an example of the large-scale knowledge graph ontology automatic extraction terminal device and does not limit it; the terminal device may include more or fewer components than those described, combine certain components, or use different components. For example, it may further include input/output devices, network access devices, a bus and the like, which is not limited in this embodiment of the invention.

Further, as a practicable scheme, the processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. The general purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the large-scale knowledge graph ontology automatic extraction terminal device and connects the various parts of the whole terminal device through various interfaces and lines.

The memory can be used to store the computer program and/or module. The processor implements the various functions of the large-scale knowledge graph ontology automatic extraction terminal device by running or executing the computer program and/or module stored in the memory and by calling the data stored in the memory. The memory may mainly comprise a program storage area and a data storage area: the program storage area may store an operating system and the application programs required by at least one function, and the data storage area may store data created through use of the device, and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage device.

The invention also provides a computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the above-mentioned method of an embodiment of the invention.

If the modules/units integrated in the large-scale knowledge graph ontology automatic extraction terminal device are implemented as software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. On this understanding, all or part of the flow of the method of the above embodiments may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments are implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), a software distribution medium, and the like.

While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A large-scale knowledge graph ontology automatic extraction method is characterized by comprising the following steps:

S1: obtaining entities from a knowledge graph;

S3: applying a named entity recognition model to the entities left unclassified by step S2, and determining the type of each recognized named entity; the types include: a person name, a place name, or an organization name;

S4: classifying the entities that remain unrecognized after the named entity recognition of step S3 using a clustering algorithm; the clustering algorithm is a Kmeans clustering algorithm; the specific process of classifying with the clustering algorithm is as follows:

S403: inputting both the word vectors of the entities to be classified and the number of clusters k into a Kmeans model to obtain a clustering result;

2. The large-scale knowledge-graph ontology automatic extraction method according to claim 1, wherein: step S1 further includes preprocessing the acquired entities, the preprocessing including punctuation cleaning, abnormal length entity filtering, and conversion of capital letters into lowercase letters.

3. The large-scale knowledge-graph ontology automatic extraction method according to claim 1, wherein: the natural language word vector processing technique employed in step S401 is the bert-base-multilingual-uncased model, trained on corpora in 102 languages.

4. The large-scale knowledge-graph ontology automatic extraction method according to claim 1, wherein: if the number of entities in a category of the final classification result is greater than a preset number threshold, steps S2 to S5 are performed again on the entities of that category to classify them further.

5. A large-scale knowledge graph ontology automatic extraction terminal device, characterized in that: it comprises a processor, a memory and a computer program stored in the memory and running on the processor, the processor implementing the steps of the method according to any of claims 1 to 4 when executing the computer program.

6. A computer-readable storage medium storing a computer program, characterized in that: the computer program when executed by a processor implementing the steps of the method as claimed in any one of claims 1 to 4.