CN112463974A - Method and device for establishing knowledge graph


Info

Publication number
CN112463974A
CN112463974A (application CN201910849080.0A)
Authority
CN
China
Prior art keywords
entity
entities
seed
class
graph
Prior art date
Legal status
Pending
Application number
CN201910849080.0A
Other languages
Chinese (zh)
Inventor
段戎
胡康兴
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201910849080.0A
Publication of CN112463974A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 Ontology
    • G06F 16/35 Clustering; Classification
    • G06F 16/355 Class or cluster creation or modification

Abstract

The application provides a knowledge graph establishing method in the field of artificial intelligence, which comprises the following steps: obtaining a corpus; clustering the entities in the corpus to obtain a target clustering result; determining, according to the target clustering result, the similarity between a newly added entity in the corpus and the entities in a seed graph; and adding the newly added entity to the seed graph according to the similarity. With this method, the seed graph can be expanded even when the corpus is insufficient, and dependence on manual work is reduced. Compared with building a knowledge graph by relying on experts to comb through the knowledge manually, the method greatly improves efficiency and saves labor cost.

Description

Method and device for establishing knowledge graph
Technical Field
The present application relates to the field of natural language processing, and in particular, to a method and an apparatus for establishing a knowledge graph.
Background
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Research in artificial intelligence covers the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning and decision making.
With the continuous development of artificial intelligence technology, natural language human-computer interaction systems, which enable human-computer interaction through natural language, are becoming more and more important. Human-computer interaction through natural language requires a system capable of recognizing the specific meaning of human natural language. Typically, a system identifies the specific meaning of a sentence by extracting key information from the natural-language sentence.
A knowledge graph is composed of interconnected nodes and edges and provides the ability to analyze problems from a "relationship" perspective. Knowledge graphs are increasingly applied to knowledge management. Each node of the knowledge graph may be an entity. A seed graph containing only a small number of entities can be expanded, thereby establishing a knowledge graph. The establishment of an open-domain knowledge graph generally requires a large amount of corpus data containing a large number of entities. The similarity between a newly added entity in the corpus and the entities in the seed graph is determined by machine learning, and the entities in the corpus are connected to the entities in the seed graph according to the similarity to establish the knowledge graph.
For a specific field with little corpus data, such as the internal or specialized domain of an enterprise, the construction, expansion and verification of the knowledge graph depend heavily on expert experience, so establishing the knowledge graph involves a large workload and high labor cost.
Disclosure of Invention
The application provides a knowledge graph establishing method and device, which can establish a knowledge graph even when corpus data is insufficient.
In a first aspect, a method for establishing a knowledge graph is provided, which includes: obtaining a corpus; clustering the entities in the corpus to obtain a target clustering result; determining, according to the target clustering result, the similarity between a newly added entity in the corpus and the entities in a seed graph; and adding the newly added entity to the seed graph according to the similarity, so as to establish a knowledge graph.
By clustering the entities in the corpus, the similarity between newly added entities in the corpus and the entities in the seed graph can be determined even with a small corpus, and the seed graph is expanded to establish the knowledge graph.
With reference to the first aspect, in some possible implementations, the clustering the entities in the corpus to obtain a target clustering result includes: clustering the entities in the corpus according to M initial cluster numbers to obtain M initial clustering results in one-to-one correspondence with the M initial cluster numbers, where M is a positive integer; and determining the target clustering result from the M initial clustering results according to the connection relationships between entities in the seed graph.
Selecting the target clustering result from the M initial clustering results improves the accuracy of the target clustering result.
With reference to the first aspect, in some possible implementations, the determining the target clustering result from the M initial clustering results according to the connection relationships between entities in the seed graph includes: determining a degree of dispersion and an accessibility of each initial clustering result according to the connection relationships between the entities in the seed graph and each initial clustering result, wherein the degree of dispersion is used for representing how dispersed the neighbor structure of each common entity in the seed graph is over the classes of the initial clustering result, the neighbor structure is composed of one common entity and all adjacent common entities of that common entity in the seed graph, and the accessibility is used for representing the shortest distances in the seed graph between the common entities in each class of the initial clustering result; and determining the target clustering result from the initial clustering results according to the degree of dispersion and the accessibility.
The clustering result is verified against the seed graph, so that a clustering result consistent with the seed graph is determined. The seed graph is designed manually, so expert experience is fully utilized and a target clustering result consistent with that experience is determined.
With reference to the first aspect, in some possible implementations, the absolute value of the difference between the degree of dispersion and the accessibility of the target clustering result is the smallest among the M initial clustering results.
With reference to the first aspect, in some possible implementations, the seed graph includes common entities that are the same as entities in the corpus, and the seed graph also includes out-of-class entities that are outside the corpus, and the method includes: when an out-of-class entity is located, in the seed graph, on the shortest path between two common entities belonging to a first class of the target clustering result, adding the out-of-class entity to the first class; and/or adding the out-of-class entity to one or more classes of the target clustering result whose entity similarity satisfies a preset condition, according to the similarity between the common entities adjacent to the out-of-class entity in the seed graph and the entities in each class of the target clustering result.
By adding the out-of-class entities to the clustering result, newly added entities in the corpus may be connected to the out-of-class entities, so that more factors are considered when expanding the seed graph and the connections between the newly added entities and the seed graph are more accurate.
With reference to the first aspect, in some possible implementations, the determining, according to the target clustering result, the similarity between a newly added entity in the corpus and an entity in the seed graph includes: determining an entity vector of each entity in the target clustering result, wherein the nth bit of the entity vector indicates whether the entity belongs to the nth class of the target clustering result, and n is a positive integer; and determining the distance between the vectors of the newly added entity and of the entity in the seed graph, wherein the distance is used for representing the similarity between the newly added entity and the entity in the seed graph.
The similarity between a newly added entity and an entity in the seed graph is thus determined from how the two entities are distributed over the classes of the clustering result.
With reference to the first aspect, in some possible implementations, the target classification result includes a plurality of classes, a jth class of the plurality of classes includes a new entity, and the method includes: determining a jth target sub-graph most similar to the jth class according to similarities of the entity in the jth class and a plurality of sub-graphs in the seed graph, wherein each sub-graph in the plurality of sub-graphs is composed of one entity in the seed graph and all adjacent entities of the entity in the seed graph; the adding the new entity to the seed map according to the similarity comprises: and adding the newly added entity in the jth class to the jth target subgraph according to the similarity.
And the accuracy of adding the newly added entity is improved by matching the class with the subgraph.
In a second aspect, a knowledge graph establishing apparatus is provided, including: the acquisition module is used for acquiring the corpus; the clustering module is used for clustering the entities in the corpus to obtain a target clustering result; the determining module is used for determining the similarity between the newly-added entity in the corpus and the entity in the seed atlas according to the target clustering result; and the adding module is used for adding the newly added entity to the seed map according to the similarity so as to establish a knowledge map.
With reference to the second aspect, in some possible implementation manners, the clustering module is configured to cluster entities in the corpus according to M initial clustering numbers to obtain M initial clustering results corresponding to the M initial clustering numbers one to one, where M is a positive integer; the determining module is further configured to determine the target clustering result from the M initial clustering results according to a connection relationship between entities in the seed graph.
With reference to the second aspect, in some possible implementations, the seed graph includes common entities that are the same as the entities in the corpus, the determining module is further configured to determine, according to the connection relationship between the entities in the seed graph and each of the initial clustering results, a degree of dispersion and accessibility of each of the initial clustering results, wherein the degree of dispersion is used for representing a degree of dispersion of a neighbor structure of each common entity in the seed graph in the initial clustering results, the neighbor structure is composed of one common entity and all adjacent common entities of the common entity in the seed graph, and the accessibility is used for representing a shortest distance of the common entity in the seed graph in each class of the initial clustering results; the determining module is further configured to determine the target clustering result from the initial clustering results according to the degree of dispersion and the accessibility.
With reference to the second aspect, in some possible implementations, an absolute value of a difference between the degree of dispersion and accessibility of the target clustering result is smallest among the M initial clustering results.
With reference to the second aspect, in some possible implementations, the seed graph includes common entities that are the same as entities in the corpus, the seed graph includes entities outside the corpus, and the adding module is further configured to, when the entities outside the category are located on a shortest path between two common entities in the first category of the target clustering result in the seed graph, add the entities outside the category to the first category; and/or the adding module is further configured to add the out-of-class entities to one or more classes in the target classification result, where the entity similarity satisfies a preset condition, according to the similarity between the common entities adjacent to the out-of-class entities in the seed map and the entities in each class of the target classification result.
With reference to the second aspect, in some possible implementations, the determining module is further configured to determine an entity vector of each entity in the target clustering result, where an nth bit of the entity vector indicates whether the entity belongs to an nth class in the target classification result, and n is a positive integer; the determining module is further configured to determine a distance between the newly added entity and a vector of an entity in the seed graph, where the distance is used to represent a similarity between the newly added entity and the entity in the seed graph.
With reference to the second aspect, in some possible implementations, the target classification result includes a plurality of classes, a jth class in the plurality of classes includes a new entity, and the determining module is further configured to determine a jth target sub-graph most similar to the jth class according to similarities of entities in the jth class and a plurality of sub-graphs in the seed graph, where each sub-graph in the plurality of sub-graphs is composed of one entity in the seed graph and all adjacent entities of the entity in the seed graph; and the adding module is used for adding the newly added entity in the jth class to the jth target subgraph according to the similarity.
In a third aspect, a knowledge graph establishing apparatus is provided, including: a communication interface, a processor; the communication interface is used for acquiring the corpus; the processor is used for clustering the entities in the corpus to obtain a target clustering result; the processor is further used for determining the similarity between the newly added entity in the corpus and the entity in the seed atlas according to the target clustering result; the processor is further configured to add the new entities to the seed graph according to the similarity to establish a knowledge graph.
With reference to the third aspect, in some possible implementation manners, the processor is further configured to cluster entities in the corpus according to M initial clustering numbers to obtain M initial clustering results corresponding to the M initial clustering numbers one to one, where M is a positive integer; the processor is further configured to determine the target clustering result from the M initial clustering results according to a connection relationship between entities in the seed graph.
With reference to the third aspect, in some possible implementations, the seed graph includes common entities that are the same as the entities in the corpus, and the processor is further configured to determine, according to a connection relationship between the entities in the seed graph and each of the initial clustering results, a degree of dispersion and accessibility of each of the initial clustering results, where the degree of dispersion is used to represent a degree of dispersion of a neighbor structure of each common entity in the seed graph in the initial clustering results, the neighbor structure is composed of one common entity and all adjacent common entities of the common entity in the seed graph, and the accessibility is used to represent a shortest distance of the common entity in the seed graph in each class of the initial clustering results; the processor is further configured to determine the target clustering result from the initial clustering results according to the degree of dispersion and the accessibility.
With reference to the third aspect, in some possible implementations, an absolute value of a difference between the degree of dispersion and accessibility of the target clustering result is smallest among the M initial clustering results.
With reference to the third aspect, in some possible implementations, the seed graph includes common entities that are the same as entities in the corpus, the seed graph includes out-of-class entities that are outside of the corpus, and the processor is further configured to add the out-of-class entities to the first class when the out-of-class entities are located on a shortest path between two common entities in the first class of the target clustering result in the seed graph; and/or adding the out-of-class entities to one or more classes of which the entity similarity in the target classification result meets a preset condition according to the similarity between the shared entities adjacent to the out-of-class entities in the seed map and the entities in each class of the target classification result.
With reference to the third aspect, in some possible implementations, the processor is further configured to determine an entity vector of each entity in the target clustering result, where an nth bit of the entity vector indicates whether the entity belongs to an nth class in the target classification result, and n is a positive integer; the processor is further configured to determine a distance between the newly added entity and a vector of entities in the seed graph, where the distance is used to represent a similarity between the newly added entity and the entities in the seed graph.
With reference to the third aspect, in some possible implementations, the target classification result includes a plurality of classes, a jth class in the plurality of classes includes a new entity, and the processor is further configured to determine a jth target sub-graph most similar to the jth class according to similarities of an entity in the jth class and a plurality of sub-graphs in the seed graph, where each sub-graph in the plurality of sub-graphs is composed of one entity in the seed graph and all adjacent entities of the entity in the seed graph; and the processor is further used for adding the newly added entity in the jth class to the jth target subgraph according to the similarity.
In a fourth aspect, a computer storage medium is provided, which stores a computer program that, when run on an electronic device, causes the electronic device to perform the method of the first aspect.
In a fifth aspect, a chip system is provided, the chip system comprising at least one processor which, when executing program instructions, causes the chip system to perform the method of the first aspect.
Drawings
Fig. 1 is a schematic view of an application scenario of natural language processing according to an embodiment of the present application.
Fig. 2 is a schematic view of an application scenario of another natural language processing provided in an embodiment of the present application.
Fig. 3 is a schematic flow chart of a method for establishing a knowledge graph according to an embodiment of the present application.
Fig. 4 is a schematic flow chart of a method for establishing a knowledge graph according to an embodiment of the present application.
Fig. 5 is a schematic flow chart of a method for determining a clustering result according to an embodiment of the present application.
FIG. 6 is a schematic representation of a seed map.
Fig. 7 is a schematic diagram showing the relationship between the degree of dispersion of the clustering result and the number of clusters.
FIG. 8 is a diagram illustrating the accessibility of the clustering results versus the number of clusters.
FIG. 9 is a schematic diagram of an extended knowledge-graph.
FIG. 10 is a schematic flow diagram of a method of corpus pre-processing.
Fig. 11 is a schematic structural diagram of a knowledge-map creating apparatus according to an embodiment of the present application.
Fig. 12 is a schematic structural diagram of a knowledge-map creating apparatus according to an embodiment of the present application.
Detailed Description
The technical solution in the present application will be described below with reference to the accompanying drawings.
As shown in fig. 1, a Natural Language Processing (NLP) system includes a user device and a data processing device.
The user equipment includes intelligent terminals used by a user, such as a mobile phone, a personal computer, or an information processing center. The user equipment is the initiating end of natural language data processing and acts as the initiator of requests such as language question answering or queries, which the user initiates through the user equipment.
The data processing device can be a device or a server with a data processing function, such as a cloud server, a network server, an application server, or a management server. The data processing device receives question sentences such as query statements/voice/text from the intelligent terminal through an interactive interface, and then performs language data processing by means of machine learning, deep learning, searching, reasoning, decision making and the like, using a memory for storing data and a processor for data processing. The memory may be a general term that includes a database storing local and historical data; the database may be on the data processing device or on other network servers.
Fig. 2 shows another application scenario of the natural language processing system. In this scenario, the intelligent terminal directly serves as a data processing device, directly receives an input from a user, and directly processes the input by hardware of the intelligent terminal itself, and the specific process is similar to that shown in fig. 1, and reference may be made to the above description, which is not repeated herein.
For the convenience of understanding, the related terms and related concepts such as neural networks referred to in the embodiments of the present application will be described below.
(1) Knowledge graph
A Knowledge Graph (KG) describes relationships between entities in the form of a graph or network. The knowledge graph is composed of interconnected nodes and edges, where nodes represent entities and edges represent relationships between the nodes. A knowledge graph is essentially a semantic network and a graph-based data structure. Through a knowledge graph, different kinds of information can be connected together to obtain a relationship network. Knowledge graphs provide the ability to analyze problems from a "relationship" perspective. Knowledge graphs can be used to optimize existing search engines. Unlike traditional search engines based on keyword matching, knowledge graphs can be used to better query complex associated information, understand user intent at the semantic level, and improve search quality.
(2) Corpus
A corpus refers to linguistic text. A corpus may include a large amount of text. The corpus may have a given format and markup, usually organized.
(3) Entity
An entity refers to something that is distinguishable and exists independently. Such as a person, a city, a plant, a commodity, etc. An entity may be represented by a word or a phrase. Many entities may be included in a corpus.
(4) Supervised learning
Supervised learning is a form of machine learning. A model learns from training samples with concept labels (classes) so that it can predict labels (classes) for data outside the training sample set as accurately as possible. Here, all labels (classes) are known.
(5) Unsupervised learning
Unsupervised learning is a form of machine learning. A model learns from training samples without concept labels (classes) in order to discover structural knowledge in the training sample set. Here, all labels (classes) are unknown.
(6) Properties
Attributes are used to describe the characteristics or features of an entity or concept.
With the development of the mobile internet, everything is increasingly interconnected, the data generated by this interconnection grows explosively, and such data serves as effective raw material for analyzing relationships. Whereas earlier intelligent analysis focused on individual objects, in the mobile internet era the relationships between individuals, in addition to the individuals themselves, necessarily become an important part of the deeper analysis people require. With the assistance of a knowledge graph, a search engine can return more accurate and structured information according to the semantic information behind the user query.
With the development of information technology, acquiring and storing information becomes easier by the day; not only internet companies but also traditional enterprises accumulate a large amount of information required for their operation, and this information is scattered across different IT systems. How to use this information to help business personnel complete their work efficiently is a problem facing knowledge management in every large organization or enterprise.
The traditional knowledge management system is a classification system for manually combing knowledge by experts. This approach is time consuming, slow to progress, and the expertise is not always comprehensive nor completely error free.
With the development of Artificial Intelligence (AI) technology, knowledge maps are increasingly being applied to knowledge management to enhance the understanding of natural language in question-answering or search systems. The knowledge graph is used as a new technology and widely applied by Internet companies. The construction of open domain knowledge maps, particularly for search engines, has gradually formed a relatively mature set of construction methods.
The construction of open domain knowledge graphs requires a large amount of corpora. In some specific fields, the construction technology of the knowledge graph of the open domain is not completely applicable due to the limited number of the linguistic data, and the extension of the knowledge graph is difficult.
At present, an enterprise knowledge graph is used as a knowledge graph in a specific field, in the construction process of the knowledge graph, a seed graph is generally generated manually by an expert, and the seed graph is expanded according to corpora, so that the construction of the knowledge graph is completed.
Experts summarize a seed graph according to the knowledge of a specific field, and the seed graph needs to be further expanded downwards to support applications. Fully manual expansion of the knowledge graph is a huge effort.
One method for expanding the graph is to judge the similarity of entities according to the attributes of a new entity and the attributes of the existing entities in the graph, and to connect the new entity to the entity with the most similar attributes.
Specifically, a classification model may be obtained by a supervised learning algorithm. Training data is acquired. The training data includes a large amount of labeled corpus, as well as each entity in the corpus and its corresponding label. The label, which may also be referred to as a tag, identifies the attribute of the entity, i.e., the class to which the entity belongs. The corpus is input into the original model to obtain the output of the original model. The output of the original model is the label corresponding to each entity in the corpus. The parameters of the original model are adjusted so that the similarity between the output of the original model and the label of each entity in the training data falls within a preset range. The adjusted original model is taken as the classification model. The classification model is used to classify new entities; classifying an entity means determining the label corresponding to the entity.
After the classification model is obtained, the corpus is input into the classification model, and the label of each entity in the corpus can be obtained. Attributes of each node, i.e., each entity, are recorded in the graph. Based on the similarity of the labels of the entities, a new entity in the corpus may be connected to the entity in the graph with the most similar attributes to the entity.
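For illustration only, the following Python sketch shows one way the supervised approach described above could look. The choice of scikit-learn, the TF-IDF features, the logistic-regression classifier and the example labels are all assumptions made here for clarity; the embodiment does not prescribe a particular model.

# Hypothetical sketch of the supervised prior-art approach: train a classifier
# on manually labeled entity contexts, then predict the label (attribute class)
# of a new entity so it can be linked to graph entities with similar attributes.
# Model, features and data below are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

contexts = [
    "the router forwards packets to the next hop",   # context of entity "router"
    "the database stores user records on disk",      # context of entity "database"
]
labels = ["network-device", "storage-service"]        # manually assigned labels

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(contexts, labels)

# Label of a new entity, inferred from its context in the corpus.
print(classifier.predict(["the firewall inspects incoming packets"]))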
To improve the accuracy of the classification model, a large amount of training data is required. That is, a large corpus is required, along with each entity in the corpus and its corresponding tag.
The labels corresponding to entities in the training data are typically determined manually. Therefore, training of the classification model requires a large amount of labor.
For a specific organization or domain, the training data is limited, and the accuracy of the classification model cannot be guaranteed.
Another way to expand the graph can be implemented with an unsupervised learning algorithm. Vectors for the entities are obtained through machine learning over a large amount of corpus data. The distance between two vectors indicates the similarity between the two entities corresponding to those vectors. According to the similarity between entities, a newly added entity in the corpus is connected to the entity in the graph with the highest similarity, or to entities whose similarity is greater than a preset value.
This method represents the entities in the corpus by vectors learned from the input corpus. The machine learning process requires a large amount of corpus input in order to obtain accurate entity vectors. For a specific organization or domain, the amount of corpus is limited and may not meet the requirement of machine learning, so the accuracy of the vector representation cannot be guaranteed.
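For illustration only, the sketch below shows the embedding-based approach described above, assuming gensim's Word2Vec as the embedding model; the algorithm choice, parameters and toy corpus are assumptions, and with such a small corpus the learned vectors would be unreliable, which is exactly the limitation noted here.

# Hypothetical sketch of the embedding-based prior-art approach: learn entity
# vectors from corpus sentences, then link a newly added entity to its nearest
# existing entities. Library choice and parameters are illustrative assumptions.
from gensim.models import Word2Vec

sentences = [
    ["router", "forwards", "packet", "to", "switch"],
    ["switch", "connects", "router", "and", "firewall"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# Candidate links for the new entity "firewall": entities with the closest vectors.
print(model.wv.most_similar("firewall", topn=3))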
In order to solve the above problem, an embodiment of the present application provides a method for establishing a knowledge graph.
Fig. 3 is a schematic flow chart of a method for establishing a knowledge graph according to an embodiment of the present application.
In step S301, a corpus is acquired.
The obtained corpus may be a corpus that has already been word-segmented, or it may be an original corpus that has not been word-segmented. The original corpus may be word-segmented to determine the entities in the corpus. The word segmentation process is shown in fig. 10. The corpus may also be processed in other ways to determine the entities in the corpus.
The corpus includes a large number of entities. The corpus contains all or part of entities in the seed map. Entities that both the corpus and the seed graph contain may be referred to as common entities. The corpus also contains entities outside the seed map, namely newly added entities.
The obtaining of the corpus may be receiving a corpus sent by another device, or obtaining the corpus from a memory, or obtaining the corpus as a processing result of another processor.
In step S302, the entities in the corpus are clustered to obtain a target clustering result.
The corpus can be clustered by any one of clustering methods such as a topic model and a hierarchical model.
The clustering result is related to the determined number of clusters. The number of clusters can be determined based on expert experience, i.e., the number of clusters can be set manually. Alternatively, the number of clusters may be determined by some characteristic of the data itself.
Optionally, the number of clusters and the clustering result may be verified using the seed graph. The clustering result of the unsupervised clustering algorithm is determined using a seed knowledge graph established from expert experience. At the same time, the points in the seed graph that are inconsistent with the unsupervised clustering algorithm are verified again. By cross-checking the consistency of the two sets of data, the parameters of the clustering algorithm are found while the accuracy of the expert experience is ensured.
The entities in the corpus are clustered according to M initial cluster numbers to obtain M initial clustering results in one-to-one correspondence with the M initial cluster numbers, where M is a positive integer. One initial clustering result is determined from the M initial clustering results as the target clustering result according to the connection relationships between entities in the seed graph.
Specifically, the dispersity and accessibility of each initial clustering result can be determined according to the connection relationship of the entities in the seed graph and each initial clustering result. The neighbor structure consists of one common entity and all neighboring common entities of the common entity in the seed map. The degree of dispersion is used to represent the degree of dispersion of the neighbor structure of each common entity in the seed graph in the initial clustering result. Accessibility is used to represent the shortest distance of the common entities in the seed graph in each class of the initial clustering result. A target clustering result may be determined from the initial clustering results based on the degree of dispersion and the accessibility.
One clustering result whose degree of dispersion and accessibility respectively satisfy certain preset conditions may be selected from the M initial clustering results as the target clustering result. For example, a clustering result whose degree of dispersion is less than or equal to a first preset value and whose accessibility is greater than or equal to a second preset value may be selected as the target clustering result.
Alternatively, an accessibility-versus-cluster-number curve and a dispersion-versus-cluster-number curve can be drawn from the M initial clustering results, and the cluster number corresponding to the intersection of the two curves can be used as the cluster number of the target clustering result, thereby obtaining the target clustering result. That is, the absolute value of the difference between the degree of dispersion and the accessibility of the target clustering result is the smallest among the M initial clustering results. The specific calculation and description of the degree of dispersion and the accessibility can be found in fig. 5.
In step S303, according to the target clustering result, determining the similarity between the newly added entity in the corpus and the entity in the seed atlas.
The seed graph may include entities that are not in the corpus; these may be referred to as out-of-class entities. An out-of-class entity may be added to the clustering result.
The out-of-class entity may be added to the first class of the target clustering result when, in the seed graph, the out-of-class entity is located on the shortest path between two common entities belonging to the first class. That is, when an out-of-class entity lies on the shortest path between two common entities that belong to the same class of the clustering result, the out-of-class entity can be added to the class to which the two common entities belong.
Alternatively, the out-of-class entity is added to the clustering result according to the similarity between the subgraph of the out-of-class entity and the classes in the clustering result. The out-of-class entity may be added to one or more classes of the target clustering result whose entity similarity satisfies a preset condition, according to the similarity between the common entities adjacent to the out-of-class entity in the seed graph and the entities in each class of the target clustering result.
The similarity between a newly added entity and a common entity can be determined from how the two entities are distributed over the classes of the clustering result, i.e., whether they appear in the same classes. If, across the classes of the clustering result, a newly added entity and a common entity always appear together or are always absent together, the similarity of the two entities is high.
Each entity may be represented by a vector. Each bit of the vector corresponds to one class of the clustering result. A "1" or "0" in a bit of the vector indicates whether the entity represented by the vector belongs to that class. That is, the jth bit of the vector indicates whether the entity belongs to the jth class of the target clustering result, where j is a positive integer.
The similarity of the entities corresponding to the two vectors can be represented by the distance of the two vectors. The distance of the vectors of the newly added entity and the entity in the seed map is determined, so that the similarity of the newly added entity and the entity in the seed map can be determined.
The similarity between each newly added entity and each common entity can be determined, and the similarity between one newly added entity and the entity in the whole seed map range can be determined. The new entity can be connected to the entity with the similarity meeting the preset condition in the whole seed map.
Alternatively, the subgraph most similar to the class to which the newly added entity belongs is determined, and the similarity between the newly added entity and the entities within that subgraph is determined. The newly added entity can then be connected to the entities in the subgraph whose similarity satisfies the preset condition.
A jth target subgraph most similar to the jth class is determined according to the similarity between the entities in the jth class and a plurality of subgraphs in the seed graph, where each subgraph is composed of one entity in the seed graph and all entities adjacent to that entity in the seed graph. The newly added entities in the jth class are added to the jth target subgraph according to the similarity.
Determining the subgraph most similar to a class is referred to as matching the class with the subgraph. A word may have different meanings, i.e., the word may be polysemous. Depending on its meaning, the word may be assigned to different classes. Each class corresponds to a topic, and each entity in the subgraph of entity 1 is related to entity 1 and can be considered to correspond to a topic. In this case, the classes of the clustering result can be matched with subgraphs, so that a newly added entity is connected to entities in the subgraph of the topic corresponding to the newly added entity, which to some extent prevents the newly added entity from being connected to entities in a subgraph corresponding to an unrelated topic.
And connecting the newly added entity to the sub-graph corresponding to the class of the newly added entity through matching of the class and the sub-graph, so that the accuracy of seed map expansion is improved.
In step S304, the new entity is added to the seed map according to the similarity.
Through the steps S301-S304, the knowledge graph can be established under the condition of insufficient corpus, and the dependence on manpower is avoided. Compared with a mode of establishing a knowledge graph by depending on expert manual combing, the method greatly improves the efficiency and saves the labor cost. The automatic expansion of the knowledge graph under the condition of insufficient corpus or lack of labels can be realized.
Fig. 4 is a schematic flow chart of a method for establishing a knowledge graph according to an embodiment of the present application.
In the method provided by this embodiment of the application, words in the corpus are divided into different groups using an unsupervised clustering algorithm; the consistency between the seed graph provided by experts and the clustering results of the unsupervised clustering algorithm is measured by defining two indices, dispersion and accessibility, and the optimal number of clusters is found by balancing dispersion against accessibility. According to the clustering result obtained with the optimal number of clusters, each cluster is associated with a certain subgraph, the newly added entities in a class are associated with entities in the seed graph through vector similarity, and the connection of each newly added entity to an entity in the seed graph is completed. According to a defined importance, the connections of important newly added entities are pushed to experts for verification, so that expert time is used efficiently and important newly added entities are quickly added to the seed graph. The connections of the other newly added entities to entities in the seed graph can be verified during application.
In step S401, corpus preprocessing. The process of corpus preprocessing can be seen in fig. 10.
The corpus is preprocessed to decompose the original text corpus and determine the entities in the corpus.
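Figure 10 is not reproduced here. As an illustration only, the sketch below shows one common way such preprocessing could be done, assuming the jieba segmenter, an illustrative stop-word list and a hypothetical corpus file; none of these choices are required by the method.

# Hypothetical corpus-preprocessing sketch: segment each document into words
# and keep candidate entity tokens. Segmenter, stop words and file name are
# illustrative assumptions.
import jieba

# Optional: a user dictionary of domain terms supplied by experts, so that
# multi-character entity names are not split apart.
# jieba.load_userdict("domain_terms.txt")

STOP_WORDS = {"的", "了", "和"}  # illustrative stop-word list

def preprocess(raw_text):
    # Return the candidate entity tokens of one corpus document.
    tokens = jieba.lcut(raw_text)
    return [t for t in tokens if t.strip() and t not in STOP_WORDS]

with open("corpus.txt", encoding="utf-8") as f:   # hypothetical corpus file
    documents = [preprocess(line) for line in f]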
After step S401, the entities in the corpus are clustered.
Unsupervised clustering generally requires presetting the number of classes as a model parameter, and this parameter is usually determined from expert experience or from certain characteristics of the data itself. Determining the number of classes from expert experience has a high labor cost. Determining it from the characteristics of the data cannot guarantee the accuracy of the clustering result, and manual verification is usually still required.
This embodiment of the application uses a seed knowledge graph established from expert experience to determine the parameters of the unsupervised clustering algorithm model. The seed knowledge graph provided by experts is fully utilized to find a cluster number consistent with the expert experience.
Meanwhile, the seed graph can be verified against the unsupervised clustering algorithm. If, within a subgraph of the seed graph, the other entities fall into one class of the clustering result while one or more entities fall into another class, it is likely that the one or more entities are not accurately placed in the seed graph. An expert can then determine whether the connection relationships of the one or more entities are accurate.
By cross-checking the consistency of the two sets of data, the topic-number parameter of the clustering algorithm is found while the accuracy of the expert experience is ensured.
In steps S402-S404, the initial graph built from expert experience and corpus-based unsupervised learning are cross-checked to determine the parameters of the unsupervised learning and to verify the correctness of the expert experience.
In step S402, the entities in the corpus are clustered according to the number of clusters.
The number of clusters k may be preset, for example, by setting k to 1, or other values.
And inputting the preprocessed text corpus into a clustering model through a clustering algorithm for processing to finish the clustering of the words.
There are many ways to cluster the corpus, such as clustering methods using topic models or hierarchical models. A topic model can group words in the text that describe the same content into one class, and such a class can represent a topic. Latent Dirichlet Allocation (LDA) is a document topic generation model. Phrase LDA, a phrase-based improvement of the topic model, can better take word and phrase collocations into account. A topic is a probability distribution whose support set is all the words in the text; it represents how frequently each word appears under the topic, i.e., words highly relevant to the topic have a higher probability of appearing.
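As a concrete illustration of one such clustering method, the sketch below clusters the entity words into k topics with scikit-learn's LDA implementation and assigns each entity to its single most probable topic. The library, the parameters and the single-topic assignment are simplifying assumptions made here; in the embodiment an entity may belong to more than one class.

# Hypothetical topic-model clustering sketch (step S402): cluster entity words
# into k classes with LDA. Library and parameters are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def cluster_entities(documents, k, entities):
    # documents: list of whitespace-joined, pre-segmented texts.
    vectorizer = CountVectorizer(vocabulary=entities)
    counts = vectorizer.fit_transform(documents)
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(counts)
    # lda.components_ has shape (k, vocabulary size): topic-word weights.
    vocabulary = vectorizer.get_feature_names_out()
    best_topic = lda.components_.argmax(axis=0)
    clusters = {c: set() for c in range(k)}
    for word, topic in zip(vocabulary, best_topic):
        clusters[topic].add(word)
    return clusters  # class index -> set of entity words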
In step S403, the degree of dispersion and the accessibility of the clustering result corresponding to the cluster number k are calculated.
The value of the cluster number k is adjusted, and the entities in the corpus are clustered according to the different cluster numbers k.
The seed graph can be designed manually and is an embodiment of expert experience. The clustering result is checked against the seed graph through the degree of dispersion and the accessibility, so as to obtain a clustering result consistent with the expert experience.
In step S404, the final number of clusters is determined according to the degree of dispersion and the accessibility index.
Through steps S402-S404, a final clustering result may be determined. The specific process can be seen in fig. 5.
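A minimal sketch of the selection loop of steps S402-S404 is shown below. The helpers compute_dispersion and compute_accessibility are placeholders for the two indices, whose exact formulas follow fig. 5 and are not reproduced here; the selection criterion is the one stated above, i.e., the cluster number whose dispersion and accessibility are closest.

# Hypothetical sketch of steps S402-S404: cluster with several candidate
# cluster numbers k, score each result against the seed graph, and keep the k
# with the smallest |dispersion - accessibility| gap.
def select_target_clustering(documents, entities, seed_graph, candidate_ks):
    best_k, best_result, best_gap = None, None, float("inf")
    for k in candidate_ks:
        result = cluster_entities(documents, k, entities)   # see sketch above
        d = compute_dispersion(result, seed_graph)           # placeholder (fig. 5)
        a = compute_accessibility(result, seed_graph)        # placeholder (fig. 5)
        gap = abs(d - a)   # intersection of the two index-versus-k curves
        if gap < best_gap:
            best_k, best_result, best_gap = k, result, gap
    return best_k, best_result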
In addition, the correctness of the seed graph can be judged from the clustering result. For an entity, if the other entities in its subgraph are in one class while the entity itself is assigned to another class by the clustering algorithm, the position of the entity in the seed graph may not be accurate, and the connection of that entity can be pushed to an expert for verification. In this way, the correctness of the seed graph can be judged. After the correctness of the seed graph has been checked, the seed graph can be expanded.
Through steps S402-S404, the cluster-number parameter is determined by trial calculation with the unsupervised algorithm. Expert knowledge is applied efficiently, the efficiency of knowledge operation and maintenance is significantly improved, and the operation and maintenance cost is reduced.
In steps S405-S408, the entities are vectorized according to the classes obtained by unsupervised clustering, the connections between newly added entities and entities in the seed graph are determined by matching classes with subgraphs, and the seed graph is expanded.
The expansion of the seed graph mainly includes steps S406 to S408. Step S405 may also be performed before step S406.
The seed graph may contain entities that are not included in the corpus and therefore do not belong to any class; these may be referred to as out-of-class entities. There may be an association between an out-of-class entity and a newly added entity. For example, a newly added entity may represent the same meaning as an out-of-class entity, with only the name being different. To cover this case when expanding the graph, step S405 may be performed.
In step S405, out-of-class entities are added to the clustering result, i.e., each out-of-class entity is made to belong to one or more classes of the clustering result.
For a class of the clustering result, when an out-of-class entity is located, in the seed graph, on the shortest path between two common entities of that class, the out-of-class entity can be added to the class.
Alternatively, according to the neighbor structure of the out-of-class entity, the classes in the clustering result whose similarity to the neighbor structure satisfies a preset condition are determined. The neighbor structure of an out-of-class entity comprises the out-of-class entity and the common entities adjacent to it in the seed graph.
Taking the seed graph shown in fig. 6 as an example, suppose the clustering result of the corpus is: C1 = {A, F, G}, C2 = {A, B, K, H}, C3 = {C, D, H, I}, C4 = {B, F}. E is an out-of-class entity and does not belong to any class.
When computing the accessibility of a class, if E is on the shortest path between two common entities of that class, E may be added to the class. For the above clustering result, E is on the shortest path between C and D, so E should be added to C3.
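A small sketch of this shortest-path rule is shown below, assuming networkx for the seed graph; only the C-E-D fragment of fig. 6 is stated in the text, so the edge list is an assumption limited to that fragment.

# Hypothetical sketch of the shortest-path rule of step S405: if an out-of-class
# entity lies on the shortest path between two common entities of a class, add
# it to that class. The edge list is illustrative.
import itertools
import networkx as nx

seed = nx.Graph()
seed.add_edges_from([("C", "E"), ("E", "D")])   # E lies between C and D

def add_by_shortest_path(out_entity, clusters, seed_graph):
    for members in clusters.values():
        common = [m for m in members if m in seed_graph]
        for u, v in itertools.combinations(common, 2):
            if out_entity in nx.shortest_path(seed_graph, u, v):
                members.add(out_entity)
                break

clusters = {"C3": {"C", "D", "H", "I"}}
add_by_shortest_path("E", clusters, seed)
print(clusters)   # C3 now also contains E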
If E is not on the shortest path of any class, E is added to the most similar class according to the Jaccard similarity coefficient, or to the classes whose Jaccard similarity is greater than a certain value. The Jaccard similarity between a class CK and a subgraph G' can be expressed as:
J(CK, G') = |G' ∩ CK| / |G' ∪ CK|
where |G' ∩ CK| denotes the number of entities contained in both class CK and subgraph G', and |G' ∪ CK| denotes the number of all entities contained in class CK and subgraph G'.
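A minimal sketch of this Jaccard rule follows; the threshold value is an assumption, and the "most similar class" variant would simply take the class with the largest score instead.

# Hypothetical sketch of the Jaccard rule of step S405: compare the out-of-class
# entity's neighbor structure with each class and add the entity to sufficiently
# similar classes. The threshold is illustrative.
def jaccard(class_entities, subgraph_entities):
    union = class_entities | subgraph_entities
    return len(class_entities & subgraph_entities) / len(union) if union else 0.0

def add_by_jaccard(out_entity, neighbor_structure, clusters, threshold=0.3):
    # neighbor_structure: the out-of-class entity plus the common entities
    # adjacent to it in the seed graph.
    for members in clusters.values():
        if jaccard(members, neighbor_structure) >= threshold:
            members.add(out_entity)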
Through step S405, expert knowledge is used to compensate for entities missing from the corpus and to improve the semantic understanding of the clustering.
In step S406, the clustering result is matched with the subgraph.
A word may have different meanings, i.e., the word may be polysemous. Depending on its meaning, the word may be assigned to different classes. Each class corresponds to a topic, and each entity in the subgraph of entity 1 is related to entity 1 and can be considered to correspond to a topic. In this case, the classes of the clustering result can be matched with subgraphs, so that a newly added entity is connected to entities in the subgraph of the topic corresponding to the newly added entity, which to some extent prevents the newly added entity from being connected to entities in a subgraph corresponding to an unrelated topic.
Each entity in the seed graph corresponds to a subgraph. The subgraph of an entity is the first-degree network centered on that entity, i.e., the subgraph of an entity in the seed graph includes the entity and the entities adjacent to it in the seed graph.
Matching the clustering result with subgraphs means determining the subgraph corresponding to each class of the clustering result. For each class of the clustering result, the similarity between the class and the subgraph of each common entity included in that class is calculated.
Suppose the clustering result of the corpus is: C1 = {A, F, G}, C2 = {A, B, K, H}, C3 = {C, D, H, I}, C4 = {B, F}. In step S405, E is added to C3, so the new classes are: C1 = {A, F, G}, C2 = {A, B, K, H}, C3 = {C, D, H, I, E}, C4 = {B, F}. As shown in Table 1, C1 includes the common entity A, so the similarity between C1 and the subgraph of A is calculated. C2 includes the common entities A and B, so the similarities between C2 and the subgraphs of A and B are calculated. C3 includes the common entities C, D and E, so the similarities between C3 and the subgraphs of C, D and E are calculated. C4 includes the common entity B, so the similarity between C4 and the subgraph of B is calculated.
TABLE 1
(similarities between each class and the subgraphs of its common entities; the table content is provided as an image in the original publication)
According to the calculation results, C1 matches the subgraph of A, C2 matches the subgraph of A, C3 matches the subgraph of D, and C4 matches the subgraph of B.
Through step S406, the new entity can be added to the subgraph corresponding to the class of the new entity, so as to avoid connection errors.
In step S407, the similarity between the newly added entity and the entity in the matched sub-graph is determined.
According to the clustering result, each entity is represented by a vector, and the similarity is represented by the distance between the vector of the newly added entity and the vectors of the entities in the matched subgraph.
This is a class-based vector representation, i.e., an entity is represented by a vector according to its distribution over the classes. Each bit of an entity's vector indicates whether the entity belongs to the corresponding class of the new clustering result. After each entity is vectorized according to whether it is included in each class, the cosine similarity between a newly added entity and each entity in the subgraph corresponding to the class to which the newly added entity belongs is calculated, and the newly added entity is connected to the most similar node in the corresponding subgraph. The greater the distance between two vectors, the smaller the cosine similarity of the corresponding entities. For these binary vectors, the cosine similarity can roughly be understood as the proportion of bits on which the two vectors agree.
According to the new classes in Table 1, C1 and C2 include entity A while C3 and C4 do not, so entity A may be represented as A(1100). The other entities may be represented as: B(0101), C(0010), D(0010), E(0010), H(0110), K(0100), I(0010), F(1001), G(1000).
For each newly added entity, its similarity to each entity in the seed graph can be calculated. Or, for each newly added entity, only the similarity of the newly added entity to each entity in the subgraph corresponding to the class to which the newly added entity belongs may be calculated.
Table 2 lists, for each newly added entity, its similarity to the entities in the subgraph(s) corresponding to the class(es) to which the newly added entity belongs; seed entities outside those subgraphs are not scored.
TABLE 2
            A(1100)   B(0101)   C(0010)   D(0010)   E(0010)
H(0110)     0.5       0.5       0.71      0.71      0.71
K(0100)     0.71      0.71      0         -         -
I(0010)     0         -         -         1         1
F(1001)     0.5       0.5       0         0         -
G(1000)     0.71      0         0         -         -
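The similarities in Table 2 can be reproduced with the short sketch below, which computes cosine similarity over the 4-bit class-membership vectors listed above; it is an illustration of this worked example only.

# Hypothetical sketch reproducing Table 2: entities are 4-bit vectors over the
# classes C1..C4, and cosine similarity scores each candidate connection.
import math

vectors = {
    "A": (1, 1, 0, 0), "B": (0, 1, 0, 1), "C": (0, 0, 1, 0), "D": (0, 0, 1, 0),
    "E": (0, 0, 1, 0), "H": (0, 1, 1, 0), "K": (0, 1, 0, 0), "I": (0, 0, 1, 0),
    "F": (1, 0, 0, 1), "G": (1, 0, 0, 0),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# The H row of Table 2: similarity of the new entity H to the seed entities.
for seed_entity in ["A", "B", "C", "D", "E"]:
    print(seed_entity, round(cosine(vectors["H"], vectors[seed_entity]), 2))
# prints: A 0.5, B 0.5, C 0.71, D 0.71, E 0.71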
In step S408, the newly added entity is added to the seed graph according to the similarity between the newly added entity and the entities in the subgraph.
Based on the similarities in Table 2, each newly added entity is connected to its most similar common entities to build the knowledge graph. Referring to fig. 9, H is connected to C, D and E, K is connected to A and B, I is connected to D and E, F is connected to A and B, and G is connected to A.
Through steps S405-S408, an extension to the seed map is achieved.
After step S408, the connection of the newly added entity may be confirmed.
The newly added entities in the expanded seed graph can be put to use in an application. Based on the feedback from the application, it can be determined whether a newly added entity should be kept permanently.
The expanded seed graph is applied in a search system. The search system provides the user with information options based on the expanded knowledge graph. The user enters a first entity and searches. The search system provides the user with information about the first entity and about the entities related to the first entity, according to the entity the user entered. One entity being related to another entity means the two entities are connected by an edge. If the user clicks the information of a second entity related to the first entity, the connection between the first entity and the second entity is endorsed by the user. The more clicks, the more reliable the relationship between that pair of entities. According to the number of clicks, the newly added entity is permanently added to the seed graph. If the information of the second entity receives few clicks, the connection between the first entity and the second entity can be deleted.
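For illustration only, the sketch below keeps a click counter per candidate connection and decides whether to confirm or delete it; the two thresholds and the data structure are assumptions.

# Hypothetical sketch of the application-feedback check: count clicks on a
# related entity and confirm or delete the tentative connection accordingly.
# Thresholds are illustrative.
from collections import Counter

click_counts = Counter()   # (first_entity, second_entity) -> number of clicks

def record_click(first_entity, second_entity):
    click_counts[(first_entity, second_entity)] += 1

def review_connection(first_entity, second_entity, confirm_at=50, delete_below=5):
    clicks = click_counts[(first_entity, second_entity)]
    if clicks >= confirm_at:
        return "confirm"        # permanently add the new entity to the seed graph
    if clicks < delete_below:
        return "delete"         # drop the tentative connection
    return "keep observing"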
Considering the limited amount of information that can be displayed on the user device, the search system may present only a limited number of entities related to the first entity at a time; that is, determining whether a relationship exists between the first entity and the other entities may require a significant amount of time.
However, the effect of the application strongly depends on the size of the seed graph. When the seed graph is too small, the knowledge it covers is not broad enough, the recommended knowledge is not rich enough, and the user experience is poor. In enterprise graph construction, the seed graph manually curated by experts is generally very small, and the number of newly added entities in the corpus may be far greater than the number of entities in the seed graph.
Determining whether to keep a newly added entity according to application feedback takes a certain amount of time. Before the newly added entities are confirmed, the knowledge graph is not fully established, which may result in a poor user experience. To obtain a better user experience, the established knowledge graph can be verified by experts, that is, the experts confirm the connections between the newly added entities and the common entities.
Because the number of newly added entities is large, having experts verify the connections of every newly added entity is costly. Instead, the experts can verify only the connections of the newly added entities with high importance, according to the importance of each newly added entity.
Steps S409-S411 provide an index for judging the importance of the newly added nodes and determine the importance of the newly added entities. The experts verify only the newly added entities with higher importance, which improves efficiency.
In step S409, the importance of each newly added entity is calculated. Importance may be determined by one or more of degree centrality, closeness centrality and betweenness centrality; for example, importance may be the average of the three.
Degree centrality is the most direct measure of node centrality in network analysis. The larger the degree of a node, the more central it is and the more important it is in the network.
Closeness centrality reflects how close a node is to the other nodes in the network. It is the reciprocal of the sum of the shortest-path distances from the node to all other nodes; that is, the closer a node is to the other nodes, the greater its closeness centrality.
Betweenness centrality characterizes the importance of a node by the number of shortest paths that pass through it. It reflects how many times a node lies on the shortest path between two other nodes: the more often a node acts as such an "intermediary", the higher its betweenness centrality. If normalization is considered, the number of shortest paths passing through the node may be divided by the total number of shortest paths, which adjusts the value to lie between 0 and 1.
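These three measures are standard graph metrics; a sketch using the networkx library is shown below, with an illustrative edge list and the average of the three scores taken as the importance, as suggested above.

```python
import networkx as nx

# Illustrative expanded graph: seed entities A-E plus some newly added entities.
graph = nx.Graph()
graph.add_edges_from([
    ("A", "B"), ("B", "C"), ("B", "E"), ("D", "E"),      # seed-graph edges (assumed)
    ("H", "C"), ("H", "D"), ("H", "E"),                  # edges of newly added entities
    ("K", "A"), ("K", "B"), ("G", "A"),
])

deg = nx.degree_centrality(graph)                        # degree centrality
clo = nx.closeness_centrality(graph)                     # closeness centrality
bet = nx.betweenness_centrality(graph, normalized=True)  # betweenness centrality

importance = {n: (deg[n] + clo[n] + bet[n]) / 3 for n in graph.nodes}

# Only the most important newly added entities are sent to expert verification.
new_nodes = ["H", "K", "G"]
to_verify = sorted(new_nodes, key=importance.get, reverse=True)[:2]
print(to_verify)
```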
In step S410, the newly added entity to be verified is determined according to the importance.
The newly added entities to be verified are output, and whether each of them is connected to the entities in the seed graph is verified manually.
When the seed graph is relatively small and there are many newly added entities, this uses expert resources efficiently: the limited expert resources are applied to verifying the important newly added entities.
In step S411, the verification result is acquired, and the knowledge map is determined.
The verification result of whether each important newly added entity is connected to the entities in the seed graph is acquired, and the connections of the important newly added entities are determined accordingly. Newly added entities of lower importance can be added to the seed graph directly to complete the establishment of the knowledge graph, and whether their connections are kept can later be determined according to application feedback.
Through the steps S409-S411, the problem of effective verification of the newly added nodes in the map expansion is solved.
Through steps S401-S411, the knowledge graph is established even when the amount of corpus data is insufficient. Moreover, the entities do not need to be labeled, which reduces labor cost.
Fig. 5 is a method for determining a clustering result according to an embodiment of the present application. By the method, the clustering result consistent with the expert experience can be determined.
The corpus is input, and an initial cluster number is input or preset. The cluster number is the number of classes in the clustering result. The initial cluster number may be a small value i, for example 2; in practical applications, the corpus can be clustered into at least 2 classes.
The cluster number can range from the initial value up to a preset value, and the clustering algorithm is run iteratively for each value. The preset value is less than or equal to the number of entities in the seed graph. The clustering result consistent with expert experience is determined using two indicators: dispersion and accessibility.
Entities in the corpus that also appear in the seed graph are referred to as common entities, and entities in the corpus that are outside the seed graph are referred to as newly added entities. Establishing the knowledge graph means adding the newly added entities to the seed graph, that is, the knowledge graph is established by expanding the seed graph.
And clustering the entities in the corpus according to the clustering number.
And then, calculating the dispersity and accessibility of the clustering result.
The degree of dispersion of a common entity is used to indicate the degree of dispersion of the neighboring structures of the common entity in the clustering result. The structure of the neighborhood of a common entity refers to the structure formed by the common entity and its neighboring common entities in the seed map. The seed map is designed by an expert and represents the expert's experience. Each entity in the neighborhood structure of a common entity is associated with the common entity, and can be considered to have a high probability of belonging to the same topic. And clustering according to the topic model, wherein one class in the clustering result represents one topic. If the adjacent common entities in the seed atlas are concentrated in one class in the clustering result, the consistency of the clustering result and the neighbor structure of the common entities is higher, namely the consistency of the clustering result and the expert experience is higher. The common entity is a point in the seed map, and the degree of dispersion of the common entity can also be referred to as the degree of dispersion of the point.
Judged from the angle of the dispersion of common entities, for a given clustering result the average dispersion of all common entities in the seed graph reflects, on the whole, how consistent the clustering result is with expert experience. This average can be referred to as the dispersion of the clustering result.
Entropy is a measure of the uncertainty of a random variable; it is the expected amount of information over all possible outcomes. The more possible values (states) a random variable has, the larger the entropy and the greater the degree of disorder; likewise, the more uniformly the random variable is distributed over its states, the larger the entropy and the greater the disorder. For a clustering result, the more classes the common entities adjacent to a common entity in the seed graph fall into, the more dispersed the neighbor structure of that common entity is in the clustering result, and the lower the consistency with expert experience. Therefore, the degree of dispersion can be expressed by entropy.
Entropy can be expressed as

$$H(X) = -\sum_{x} p(x)\,\log p(x)$$
where the negative sign ensures that the amount of information is positive or zero. The base of the logarithm is arbitrary and may, for example, be chosen as 2.
The degree of dispersion of a certain common entity V can be expressed as:

$$\zeta_V = \tanh\!\left(\frac{-\sum_{K} p_K \log p_K}{\log\left(|\Gamma(V)| + 1\right)}\right)$$
where

$$p_K = \frac{|\Gamma(V) \cap C_K|}{|\Gamma(V)|},$$

Γ(V) denotes the common entities in the neighbor structure of the common entity V, C_K denotes the entities in the K-th class of the clustering result, and |x| denotes the number of entities in x.
In the neighbor structure of a common entity, the larger the number of common entities, the more likely the entropy is to be large. Therefore, considering the influence of the number of entities in the neighbor structure on the entropy, the entropy is adjusted when calculating the dispersion of the common entity V: the entropy value is divided by a positive function that increases with |Γ(V)|.
For example, the entropy may be normalized. The normalization coefficient increases with the number of entities in the neighbor structure of the common entity and guarantees that there is no penalty when the entity has only one neighbor. The normalization coefficient may be set as:

$$\log\left(|\Gamma(V)| + 1\right)$$
The base of the logarithm in the normalization coefficient may be the same as or different from the base used in the entropy, and may, for example, be chosen as 2.
The entropy is normalized by dividing the entropy value by log(|Γ(V)| + 1). Through this normalization, the influence of the number of entities in the neighbor structure on the dispersion is balanced.
The value range of the dispersion of the common entity V may further be adjusted, for example to lie between 0 and 1, using an adjustment function such as the hyperbolic tangent (tanh). Using tanh keeps the dispersion within [0, 1] so that it can be compared with accessibility. In addition, the tanh function changes noticeably when its argument is small, which helps avoid choosing too small a cluster number.
Taking the seed graph shown in fig. 6 as an example, the neighbor structure of B is Γ(B) = {A, B, C, E}, so |Γ(B)| = 4. G denotes the entities in the seed graph, C_K denotes the entities in the K-th class of the clustering result, and G ∩ C_K denotes the common entities included in both C_K and G. The clustering result is G ∩ C_1 = {A, B, C}, G ∩ C_2 = {E}, G ∩ C_3 = {D}. Since the neighbors of nodes A and C are both in class 1, ζ_A = ζ_C = 0. Of the 4 entities in the neighbor structure of node B, A, B and C are in one class and E is in another class, so the degree of dispersion of B is:

$$\zeta_B = \tanh\!\left(\frac{-\frac{3}{4}\log\frac{3}{4} - \frac{1}{4}\log\frac{1}{4}}{\log(4 + 1)}\right)$$
The dispersion of points D and E are respectively:

$$\zeta_D = \tanh\!\left(\frac{-\frac{1}{2}\log\frac{1}{2} - \frac{1}{2}\log\frac{1}{2}}{\log(2 + 1)}\right)$$

$$\zeta_E = \tanh\!\left(\frac{-3 \cdot \frac{1}{3}\log\frac{1}{3}}{\log(3 + 1)}\right)$$
comparing these five points, neighbors of A and C are in one class, neighbors of B and D are in two classes, but D has only one neighbor, B has three neighbors, ζA=ζcBD
If C_2 and C_3 are merged, then ζ_A = ζ_C = ζ_D = 0, while ζ_B > 0 and ζ_E > 0.
If all points are in one class, the dispersion of all points is 0: ζ_A = ζ_B = ζ_C = ζ_D = ζ_E = 0.
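A sketch of the dispersion computation for this example is shown below. It assumes base-2 logarithms, the log(|Γ(V)|+1) normalization and the tanh adjustment described above, and a seed-graph edge list (including an A-C edge) inferred from the worked examples; these assumptions are not stated explicitly in the application.

```python
import math
from collections import Counter

# Seed-graph adjacency inferred from the Figure 6 example.
neighbors = {
    "A": {"B", "C"}, "B": {"A", "C", "E"}, "C": {"A", "B"},
    "D": {"E"}, "E": {"B", "D"},
}
# Clustering result restricted to the seed entities: G ∩ C_1, G ∩ C_2, G ∩ C_3.
classes = [{"A", "B", "C"}, {"E"}, {"D"}]

def dispersion(v):
    """tanh-adjusted, normalized entropy of v's neighbor structure over the classes."""
    gamma = neighbors[v] | {v}                 # the neighbor structure includes v itself
    counts = Counter()
    for k, cls in enumerate(classes):
        counts[k] = len(gamma & cls)
    total = len(gamma)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values() if c)
    return math.tanh(entropy / math.log2(total + 1))

for v in "ABCDE":
    print(v, round(dispersion(v), 3))   # A and C give 0; B, D and E are increasingly dispersed
```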
In the neighbor structure of a common entity, the fewer the classes its entities fall into, the lower the dispersion of that entity.
When adjacent points in the seed graph are clustered into the same class, the clustering result is consistent with the seed graph manually curated by experts. It can be seen that the point dispersion tends to favor a smaller number of classes; in the extreme case where there is only one class, the dispersion of all points is 0.
Therefore, the dispersion index alone is not enough to measure the consistency between the clustering result and the seed graph.
The fewer the classes, the more entities each class contains, and the larger the average distance in the seed graph between the common entities within a class. To balance the tendency toward fewer classes caused by judging consistency with dispersion alone, the consistency between the clustering result and the seed graph can also be judged by accessibility, in combination with dispersion.
Accessibility is used to represent how far apart, in the seed graph, the common entities of one class of the clustering result are; it may also be called the accessibility of the class. All the common entities in one class of the clustering result are mapped into the seed graph, the shortest path between every two of these common entities is calculated according to the connection relations of the entities in the seed graph, and the sum of these shortest paths is then normalized by the number of pairs of common entities in the class to obtain an average. The accessibility of the K-th class in the clustering result can be expressed as:
$$\rho_K = \frac{\sum_{i \neq j} SP(i, j)}{|G \cap C_K|\,\left(|G \cap C_K| - 1\right)}, \qquad i, j \in G \cap C_K$$
where G denotes the entities in the seed graph, C_K denotes the entities in the K-th class of the clustering result, G ∩ C_K denotes the common entities included in both C_K and G, i and j denote two different common entities in G ∩ C_K, and SP(i, j) denotes the shortest distance between i and j in the seed graph.
The shorter the distances in the seed graph between the entities of a class in the clustering result, the smaller the accessibility. As the number of clusters decreases, each class contains more entities and accessibility increases.
According to the accessibility of each class in the clustering result, the accessibility of the whole clustering result can be determined. As the number of classes in the clustering result decreases, the accessibility of each class increases. Taking into account the effect of the number of common entities in each class, the accessibility of the whole clustering result may be the weighted sum of the accessibility of each class, where the weight is determined by the proportion of the common entities of that class in the total. The accessibility of the whole clustering result can be expressed as:
$$\rho = \sum_{K} \frac{|G \cap C_K|}{|G|}\, \rho_K$$
in a certain clustering result, common entities may belong to different classes at the same time. Considering this case, the accessibility of the clustering result can be expressed as:
$$\rho = \frac{\sum_{K} |G \cap C_K|\, \rho_K}{\sum_{K} |G \cap C_K|}$$
the consistency of the clustering result and the seed map structure can be checked through the accessibility rho of the clustering result. The smaller rho is, the smaller the distance of the common entities of the same class in the clustering result in the seed map is. Accessibility is between [0,1 ].
Taking the seed graph shown in FIG. 6 as an example, if the clustering result is G ∩ C_1 = {A, B, C}, G ∩ C_2 = {E}, G ∩ C_3 = {D}, then

$$\rho_1 = 1, \qquad \rho_2 = \rho_3 = 0.$$
The weights of the classes are respectively

$$w_1 = \tfrac{3}{5}, \qquad w_2 = w_3 = \tfrac{1}{5}.$$
The accessibility ρ of the clustering result is 0.6.
If the clustering result is G ∩ C_1 = {A, B, C}, G ∩ C_2 = {E, D}, then ρ = 1.
If the clustering result is G ∩ C_1 = {A, B, C, D, E}, then ρ = 3.
The optimal case is ρ = 1, in which the common entities in each class of the clustering result are directly adjacent in the seed graph, i.e., pairwise adjacent. However, the optimal cluster number cannot be determined by ρ alone, because minimizing ρ tends to increase the number of clusters.
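A sketch of the accessibility computation for the same example, assuming the pairwise-average form given above, a contribution of 0 from single-entity classes, and the same inferred seed-graph edges; it reproduces ρ = 0.6 and ρ = 1 for the first two clusterings discussed.

```python
import networkx as nx

seed = nx.Graph()
seed.add_edges_from([("A", "B"), ("B", "C"), ("A", "C"), ("B", "E"), ("D", "E")])

def class_accessibility(cls):
    """Average shortest-path distance between the common entities of one class."""
    members = [v for v in cls if v in seed]
    if len(members) < 2:
        return 0.0
    dists = [nx.shortest_path_length(seed, a, b)
             for i, a in enumerate(members) for b in members[i + 1:]]
    return sum(dists) / len(dists)

def clustering_accessibility(classes):
    """Weighted sum of class accessibilities; weights are each class's share of common entities."""
    sizes = [len([v for v in cls if v in seed]) for cls in classes]
    total = sum(sizes)
    return sum(s / total * class_accessibility(c) for s, c in zip(sizes, classes))

print(clustering_accessibility([{"A", "B", "C"}, {"E"}, {"D"}]))   # 0.6
print(clustering_accessibility([{"A", "B", "C"}, {"E", "D"}]))     # 1.0
```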
The dispersion and the accessibility of a clustering result are both ways of measuring the consistency between the seed graph and the result of the unsupervised clustering algorithm. Achieving a lower dispersion pushes the number of classes down, while achieving a lower accessibility pushes the number of classes up; that is, dispersion favors fewer clusters and accessibility favors more clusters.
Therefore, a small initial cluster number i can be set and gradually increased. With i on the x-axis, an accessibility-versus-cluster-number curve and a dispersion-versus-cluster-number curve are drawn, and the intersection of the two metrics is found. The cluster number at the intersection is the cluster number consistent with expert experience, and the corresponding clustering result is the clustering result consistent with expert experience.
Fig. 7 is a schematic diagram showing the relationship between the degree of dispersion of the clustering result and the number of clusters.
The curve in the figure reflects how the dispersion of the clustering result varies with the number of clusters. The number of clusters can also be understood as the number of topics in the clustering result. The abscissa is the number of clusters, and the ordinate is the dispersion of the clustering result, i.e., the average dispersion of the common entities in the clustering result. The smaller the dispersion of the clustering result, the more consistent the clustering result is with the seed graph.
Generally, as the number of clusters increases, the degree of dispersion of entities in the neighborhood of a common entity increases.
FIG. 8 is a diagram illustrating the accessibility of the clustering results versus the number of clusters.
The curves in the figure reflect the case where the accessibility of the clustering results varies with the number of clusters. The number of clusters can also be understood as the number of topics (topic) in the clustering result. The abscissa is the number of clusters, and the ordinate is the accessibility of the clustering results, which is determined by the accessibility of each class in the clustering results. The smaller the accessibility of the clustering result is, the more consistent the clustering result is with the seed map.
Generally, as the number of clusters decreases, the accessibility of the clustering result increases.
The intersection point of the dispersion curve and the accessibility curve identifies the clustering result that is consistent with the seed graph.
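A sketch of selecting the cluster number at the crossing point of the two curves. The routines cluster_corpus, dispersion_of and accessibility_of are hypothetical stand-ins for the clustering and the two metrics described above.

```python
def choose_cluster_number(corpus, k_min=2, k_max=20):
    """Sweep the cluster number and return the first k at which the (rising)
    dispersion curve meets or crosses the (falling) accessibility curve."""
    for k in range(k_min, k_max + 1):
        result = cluster_corpus(corpus, k)          # hypothetical clustering routine
        if dispersion_of(result) >= accessibility_of(result):
            return k, result                        # the curves have crossed at this k
    return k_max, result                            # fall back to the largest k tried
```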
Alternatively, clustering can be performed with several preset values of the topic-number parameter, and the value range of the topic-number parameter can then be narrowed according to the dispersion and accessibility of the corresponding clustering results, so as to determine the cluster number and the clustering result consistent with expert experience.
In the finally determined clustering result, each class includes newly added entities and common entities, and the common entities and newly added entities within the same class are correlated. Because the common entities already exist in the seed graph, the newly added entities can be added to the seed graph according to this correlation, thereby expanding the seed graph and completing the establishment of the knowledge graph.
FIG. 10 is a schematic flow diagram of a method of corpus pre-processing.
The corpus is preprocessed to decompose the original text into the entities it contains; the entities can be represented by words or phrases. The corpus preprocessing may perform word segmentation twice.
Taking a Chinese corpus as an example, before word segmentation the text can be cleaned to remove punctuation and to convert between traditional and simplified Chinese characters, for example converting the traditional characters in the corpus into simplified characters. Stop words may also be removed.
Stop words are words or characters that are automatically filtered out before or after processing natural language data (or text) in information retrieval, in order to save storage space and improve search efficiency. Stop words may be removed according to a stop-word list.
Entities in the seed graph are represented by words, which may also be referred to as entity words.
In the first word segmentation, an initial segmentation is completed according to the entity words in the seed graph. If no entity word of the seed graph appears in a text sentence of the corpus, the sentence is decomposed into atomic words according to linguistic rules. If entity words of the seed graph do appear in a sentence, each such entity word is treated as a fixed phrase and the rest of the sentence is segmented according to linguistic rules, yielding the initial segmentation result.
After the first word segmentation, new words are formed according to statistics on how the atomic words and the seed-graph entity words occur in the corpus text.
An atomic word or an entity word may be referred to as an individual word. If a collocation of individual words appears frequently in the initial segmentation result and the words in the collocation are tightly bound, the phrase formed by this fixed collocation is extracted as a candidate phrase. For example, for phrases formed by fixed collocations, the mutual information at the left and right boundaries, the entropy of the left and right adjacent words, the phrase frequency and so on can be counted, and candidate phrases are determined according to one or more of these statistics.
The higher the mutual information of two consecutive words X and Y, the more correlated X and Y are and the more likely they are to form a phrase; conversely, the lower the mutual information of X and Y, the weaker their correlation and the more likely there is a phrase boundary between X and Y. The degree of internal cohesion of a word string is determined from the mutual information.
Entropy measures the uncertainty of a random variable. Left-right entropy refers to the entropy of the words adjacent to the left boundary and to the right boundary of a multi-word expression; the boundary of a word string is determined from the entropy of its left and right adjacent words.
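A sketch of these statistics for new-word discovery: pointwise mutual information measures the internal cohesion of an adjacent word pair, and the entropy of the words seen to its left and right measures how freely its boundaries vary. The thresholds and the tokenized input are hypothetical.

```python
import math
from collections import Counter, defaultdict

def candidate_phrases(tokens, pmi_threshold=3.0, entropy_threshold=1.0):
    """Score adjacent word pairs by PMI (cohesion) and left/right entropy (boundary freedom)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    left, right = defaultdict(Counter), defaultdict(Counter)
    for i in range(1, len(tokens) - 2):
        pair = (tokens[i], tokens[i + 1])
        left[pair][tokens[i - 1]] += 1
        right[pair][tokens[i + 2]] += 1

    def entropy(counter):
        n = sum(counter.values())
        return -sum(c / n * math.log2(c / n) for c in counter.values()) if n else 0.0

    phrases = []
    for (x, y), c in bigrams.items():
        pmi = math.log2(c * total / (unigrams[x] * unigrams[y]))   # internal cohesion
        if (pmi >= pmi_threshold
                and entropy(left[(x, y)]) >= entropy_threshold
                and entropy(right[(x, y)]) >= entropy_threshold):
            phrases.append(x + y)
    return phrases
```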
The second word segmentation is then performed according to the entity words in the seed graph and the candidate phrases. In this segmentation, the entity words in the seed graph and the candidate phrases are treated as fixed phrases, and the corpus is decomposed according to linguistic rules. Generic words may also be removed after the second word segmentation.
The processing result, i.e., the processed corpus, is obtained through the two rounds of word segmentation.
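The two-pass flow can be sketched as follows, using a simple greedy longest-match segmenter in place of a full word segmenter; the seed entity words are fixed units in the first pass, and the candidate phrases are added as fixed units in the second pass. The function names and the segmenter itself are illustrative only.

```python
def longest_match_segment(text, dictionary, max_len=8):
    """Greedy longest-match segmentation; unmatched characters fall back to single tokens."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in dictionary:
                tokens.append(piece)
                i += length
                break
    return tokens

def preprocess(sentences, seed_entity_words, phrases):
    # First pass: only the seed entity words are treated as fixed units.
    first_pass = [longest_match_segment(s, set(seed_entity_words)) for s in sentences]
    # New-word discovery over `first_pass` yields the candidate phrases (see sketch above).
    # Second pass: seed entity words and candidate phrases are both fixed units.
    vocabulary = set(seed_entity_words) | set(phrases)
    return [longest_match_segment(s, vocabulary) for s in sentences]
```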
For ease of understanding, fig. 10 describes the corpus preprocessing method with two rounds of word segmentation as an example. In a particular application, other processing steps may be added or some of the above steps may be omitted.
Method embodiments of the present application are described above in conjunction with fig. 1-10, and apparatus embodiments of the present application are described below in conjunction with fig. 11-12. It is to be understood that the description of the method embodiments corresponds to the description of the apparatus embodiments, and therefore reference may be made to the preceding method embodiments for parts not described in detail.
Fig. 11 is a schematic structural diagram of a knowledge-map creating apparatus according to an embodiment of the present application. Apparatus 1100, comprising: the system comprises an acquisition module 1101, a clustering module 1102, a determination module 1103 and an adding module 1104.
An obtaining module 1101 is configured to obtain a corpus.
And a clustering module 1102, configured to cluster the entities in the corpus to obtain a target clustering result.
A determining module 1103, configured to determine, according to the target clustering result, a similarity between a newly added entity in the corpus and an entity in the seed graph.
An adding module 1104, configured to add the new entity to the seed graph according to the similarity, so as to establish a knowledge graph.
Optionally, the clustering module 1102 is configured to cluster entities in the corpus according to M initial clustering numbers to obtain M initial clustering results corresponding to the M initial clustering numbers one to one, where M is a positive integer;
the determining module 1103 is further configured to determine the target clustering result from the M initial clustering results according to the connection relationship of the entities in the seed graph.
Optionally, the seed graph includes common entities that are the same as entities in the corpus.
The determining module 1103 is further configured to determine, according to the connection relationship of the entities in the seed graph and each of the initial clustering results, a degree of dispersion and accessibility of each of the initial clustering results, where the degree of dispersion is used to represent a degree of dispersion of a neighbor structure of each common entity in the seed graph in the initial clustering results, the neighbor structure is composed of one common entity and all adjacent common entities of the common entity in the seed graph, and the accessibility is used to represent a shortest distance of the common entity in the seed graph in each class of the initial clustering results.
The determining module 1103 is further configured to determine the target clustering result from the initial clustering results according to the degree of dispersion and the accessibility.
Optionally, the absolute value of the difference between the dispersity and accessibility of the target clustering results is the smallest among the M initial clustering results.
Optionally, the seed graph includes common entities that are the same as the entities in the corpus, and the seed graph also includes out-of-class entities that are not among the entities in the corpus.
The adding module 1104 is further configured to add the out-of-class entity to the first class when the out-of-class entity is located on the shortest path between two common entities in the first class of the target clustering result in the seed graph; and/or the adding module 1104 is further configured to add the out-of-class entity to one or more classes in the target classification result, where the similarity of the entity in the target classification result meets a preset condition, according to the similarity between the shared entity adjacent to the out-of-class entity in the seed map and the entity in each class of the target classification result.
Optionally, the determining module 1103 is further configured to determine an entity vector of each entity in the target clustering result, where an nth bit of the entity vector indicates whether the entity belongs to an nth class in the target classification result, and n is a positive integer.
The determining module 1103 is further configured to determine a distance between the added entity and a vector of an entity in the seed graph, where the distance is used to represent a similarity between the added entity and the entity in the seed graph.
Optionally, the target classification result includes a plurality of classes, and a jth class of the plurality of classes includes a new entity.
The determining module 1103 is further configured to determine a j-th target sub-graph most similar to the j-th class according to similarities of the entity in the j-th class and multiple sub-graphs in the seed graph, where each sub-graph in the multiple sub-graphs is composed of one entity in the seed graph and all adjacent entities of the entity in the seed graph.
The adding module 1104 is configured to add the newly added entity in the jth class to the jth target sub-graph according to the similarity.
Fig. 12 is a schematic structural diagram of a knowledge-map creating apparatus according to an embodiment of the present application. The apparatus 1200 includes: a communication interface 1201 and a processor 1202.
The communication interface 1201 is used to obtain corpora.
The processor 1202 is configured to cluster the entities in the corpus to obtain a target clustering result.
The processor 1202 is further configured to determine, according to the target clustering result, a similarity between a newly added entity in the corpus and an entity in the seed graph.
The processor 1202 is further configured to add the new entity to the seed graph according to the similarity to establish a knowledge graph.
Optionally, the processor 1202 is further configured to cluster the entities in the corpus according to the number of M initial clusters, so as to obtain M initial clustering results corresponding to the number of M initial clusters one to one, where M is a positive integer.
The processor 1202 is further configured to determine the target clustering result from the M initial clustering results according to the connection relationship of the entities in the seed graph.
Optionally, the seed graph includes common entities that are the same as entities in the corpus,
the processor 1202 is further configured to determine, according to the connection relationship of the entities in the seed graph and each of the initial clustering results, a degree of dispersion and accessibility of each of the initial clustering results, wherein the degree of dispersion is used for representing a degree of dispersion of a neighbor structure of each common entity in the seed graph in the initial clustering results, the neighbor structure is composed of one common entity and all adjacent common entities of the common entity in the seed graph, and the accessibility is used for representing a shortest distance of the common entity in the seed graph in each class of the initial clustering results;
the processor 1202 is further configured to determine the target clustering result from the initial clustering results according to the degree of dispersion and the accessibility.
Optionally, the absolute value of the difference between the dispersity and accessibility of the target clustering results is the smallest among the M initial clustering results.
Optionally, the seed graph includes common entities that are the same as the entities in the corpus, and the seed graph also includes out-of-class entities that are not among the entities in the corpus.
The processor 1202 is further configured to add the out-of-class entity to the first class of the target clustering result when the out-of-class entity is located on a shortest path between two common entities in the first class in the seed graph; and/or the processor 1202 is further configured to add the out-of-class entity to one or more classes in the target classification result, where the similarity of the entity in the target classification result satisfies a preset condition, according to the similarity between the common entity adjacent to the out-of-class entity in the seed map and the entity in each class of the target classification result.
Optionally, the processor 1202 is further configured to determine an entity vector of each entity in the target clustering result, where an nth bit of the entity vector indicates whether the entity belongs to an nth class in the target classification result, and n is a positive integer;
the processor 1202 is further configured to determine a distance between the added entity and a vector of entities in the seed graph, where the distance is used to represent a similarity between the added entity and the entities in the seed graph.
Optionally, the target classification result includes a plurality of classes, and a jth class of the plurality of classes includes a new entity.
The processor 1202 is further configured to determine a j-th target sub-graph most similar to the j-th class according to similarities of the entity in the j-th class and a plurality of sub-graphs in the seed graph, where each sub-graph in the plurality of sub-graphs is composed of one entity in the seed graph and all adjacent entities of the entity in the seed graph.
The processor 1202 is further configured to add the newly added entity in the jth class to the jth target subgraph according to the similarity.
The embodiment of the present application further provides a knowledge graph establishing apparatus, including: at least one processor and a communication interface for the apparatus to interact with other apparatus, the program instructions when executed in the at least one processor causing the apparatus to perform the method above.
Embodiments of the present application further provide a computer program storage medium, which is characterized by having program instructions, when the program instructions are directly or indirectly executed, the method in the foregoing is implemented.
An embodiment of the present application further provides a chip system, where the chip system includes at least one processor, and when a program instruction is executed in the at least one processor, the method in the foregoing is implemented.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (17)

1. A method of knowledge graph building, comprising:
obtaining a corpus;
clustering the entities in the corpus to obtain a target clustering result;
determining the similarity between the newly added entity in the corpus and the entity in the seed atlas according to the target clustering result;
and adding the new entity to the seed map according to the similarity so as to establish a knowledge map.
2. The method according to claim 1, wherein said clustering the entities in the corpus to obtain a target clustering result comprises:
clustering the entities in the corpus according to the M initial clustering numbers to obtain M initial clustering results corresponding to the M initial clustering numbers one by one, wherein M is a positive integer;
and determining the target clustering result from the M initial clustering results according to the connection relation between the entities in the seed atlas.
3. The method of claim 2, wherein the seed graph includes common entities that are the same as entities in the corpus,
determining the target clustering result from the M initial clustering results according to the connection relationship of the entities in the seed atlas, wherein the determining the target clustering result comprises the following steps:
determining a degree of dispersion and accessibility of each initial clustering result according to the connection relation of the entities in the seed map and each initial clustering result, wherein the degree of dispersion is used for representing the degree of dispersion of a neighbor structure of each common entity in the seed map in the initial clustering result, the neighbor structure is composed of one common entity and all adjacent common entities of the common entities in the seed map, and the accessibility is used for representing the shortest distance of the common entities in the seed map in each class of the initial clustering results;
and determining the target clustering result from the initial clustering results according to the dispersity and the accessibility.
4. The method of claim 3, wherein the absolute value of the difference between the degree of dispersion and the accessibility of the target clustered results is smallest among the M initial clustered results.
5. The method according to any one of claims 1-4, wherein the seed graph includes common entities that are the same as entities in the corpus, and wherein the seed graph includes out-of-class entities that are outside of the corpus, the method further comprising:
when the out-of-class entity is located on the shortest path of two common entities in the first class of the target clustering result in the seed graph, adding the out-of-class entity to the first class; and/or,
and adding the out-of-class entities to one or more classes of which the entity similarity in the target classification result meets a preset condition according to the similarity between the shared entities adjacent to the out-of-class entities in the seed map and the entities in each class of the target classification result.
6. The method according to any one of claims 1 to 5, wherein the determining the similarity between the new entity in the corpus and the entity in the seed graph according to the target clustering result comprises:
determining an entity vector of each entity in the target clustering result, wherein the nth bit of the entity vector represents whether the entity belongs to the nth class in the target classification result, and n is a positive integer;
and determining the distance of the vector of the newly added entity and the entity in the seed map, wherein the distance is used for representing the similarity of the newly added entity and the entity in the seed map.
7. The method of any of claims 1-6, wherein the target classification result comprises a plurality of classes, a jth class of the plurality of classes comprising a new entity,
the method further comprises the following steps: determining a jth target sub-graph most similar to the jth class according to similarities of the entity in the jth class and a plurality of sub-graphs in the seed graph, wherein each sub-graph in the plurality of sub-graphs is composed of one entity in the seed graph and all adjacent entities of the entity in the seed graph;
the adding the new entity to the seed map according to the similarity comprises: and adding the newly added entity in the jth class to the jth target subgraph according to the similarity.
8. A knowledge-graph building apparatus, comprising:
the acquisition module is used for acquiring the corpus;
the clustering module is used for clustering the entities in the corpus to obtain a target clustering result;
the determining module is used for determining the similarity between the newly-added entity in the corpus and the entity in the seed atlas according to the target clustering result;
and the adding module is used for adding the newly added entity to the seed map according to the similarity so as to establish a knowledge map.
9. The apparatus of claim 8,
the clustering module is used for clustering the entities in the corpus according to the number of the M initial clusters to obtain M initial clustering results corresponding to the number of the M initial clusters one by one, wherein M is a positive integer;
the determining module is further configured to determine the target clustering result from the M initial clustering results according to a connection relationship between entities in the seed graph.
10. The apparatus of claim 9, wherein the seed graph includes common entities that are the same as entities in the corpus,
the determining module is further used for determining the dispersion degree and accessibility of each initial clustering result according to the connection relation of the entities in the seed map and each initial clustering result, wherein the dispersion degree is used for representing the dispersion degree of the neighbor structure of each common entity in the seed map in the initial clustering results, the neighbor structure is composed of one common entity and all adjacent common entities of the common entity in the seed map, and the accessibility is used for representing the shortest distance of the common entity in the seed map in each class of the initial clustering results;
the determining module is further configured to determine the target clustering result from the initial clustering results according to the degree of dispersion and the accessibility.
11. The apparatus of claim 10, wherein the absolute value of the difference between the degree of dispersion and the accessibility of the target clustering result is the smallest among the M initial clustering results.
12. The apparatus of any of claims 8-11, wherein the seed graph includes common entities that are the same as entities in the corpus, and wherein the seed graph includes out-of-class entities that are outside of the corpus,
the adding module is further configured to add the out-of-class entity to the first class when the out-of-class entity is located on a shortest path between two common entities in the first class of the target clustering result in the seed graph; and/or,
the adding module is further configured to add the out-of-class entity to one or more classes in the target classification result, where the entity similarity satisfies a preset condition, according to the similarity between a common entity adjacent to the out-of-class entity in the seed map and an entity in each class of the target classification result.
13. The apparatus according to any one of claims 8-12,
the determining module is further configured to determine an entity vector of each entity in the target clustering result, where an nth bit of the entity vector indicates whether the entity belongs to an nth class in the target classification result, and n is a positive integer;
the determining module is further configured to determine a distance between the newly added entity and a vector of an entity in the seed graph, where the distance is used to represent a similarity between the newly added entity and the entity in the seed graph.
14. The apparatus of any of claims 8-13, wherein the target classification result comprises a plurality of classes, a jth class of the plurality of classes comprising a newly added entity,
the determining module is further configured to determine a jth target sub-graph most similar to the jth class according to similarities of the entity in the jth class and a plurality of sub-graphs in the seed graph, wherein each sub-graph in the plurality of sub-graphs is composed of one entity in the seed graph and all adjacent entities of the entity in the seed graph;
and the adding module is used for adding the newly added entity in the jth class to the jth target subgraph according to the similarity.
15. A knowledge-graph building apparatus, comprising: at least one processor and a communication interface for the knowledge-graph establishing apparatus to interact with other apparatus for information, the knowledge-graph establishing apparatus performing the method of any one of claims 1-7 when program instructions are executed in the at least one processor.
16. A computer program storage medium, characterized in that the computer program storage medium has program instructions which, when executed directly or indirectly, perform the method of any one of claims 1-7.
17. A chip system, comprising at least one processor, wherein when program instructions are executed in the at least one processor, the method of any one of claims 1-7 is performed.
CN201910849080.0A 2019-09-09 2019-09-09 Method and device for establishing knowledge graph Pending CN112463974A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910849080.0A CN112463974A (en) 2019-09-09 2019-09-09 Method and device for establishing knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910849080.0A CN112463974A (en) 2019-09-09 2019-09-09 Method and device for establishing knowledge graph

Publications (1)

Publication Number Publication Date
CN112463974A true CN112463974A (en) 2021-03-09

Family

ID=74807441

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910849080.0A Pending CN112463974A (en) 2019-09-09 2019-09-09 Method and device for establishing knowledge graph

Country Status (1)

Country Link
CN (1) CN112463974A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114647743A (en) * 2022-05-20 2022-06-21 国网浙江省电力有限公司 Power marketing full-service access control rule map generation and processing method and device
CN115051863A (en) * 2022-06-21 2022-09-13 四维创智(北京)科技发展有限公司 Abnormal flow detection method and device, electronic equipment and readable storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination