CN107665252B - Method and device for creating knowledge graph

Info

Publication number: CN107665252B
Application number: CN201710890548.1A
Authority: CN
Inventors: 毛瑞彬; 朱菁; 张俊; 王仁勇; 邓永翠; 赵洪杰
Original assignee: SHENZHEN SECURITIES INFORMATION CO Ltd
Current assignee: SHENZHEN SECURITIES INFORMATION CO Ltd
Priority date: 2017-09-27
Filing date: 2017-09-27
Publication date: 2020-08-25
Anticipated expiration: 2037-09-27
Also published as: CN107665252A

Abstract

A method and a device for creating a knowledge graph are provided. The method is applied to a data analysis device and comprises the following steps: acquiring a data source, wherein the data source comprises a plurality of entities; performing semantic analysis and cluster analysis on the data source, and extracting an entity set and an attribute set from the data source, wherein the attribute set comprises the entity attributes of each entity in the entity set; acquiring the association relationship between each entity in the entity set and its attributes; and creating and outputting a knowledge graph according to the entity set, the attribute set, and the association relationships between the entities and the attributes, wherein the knowledge graph comprises the entities, the entity attributes, the association relationships between the entities and the attributes, and the association relationships between the entities. With this scheme, the knowledge graph can be created accurately, and the relationships between entities and attributes and the association relationships between entities can be presented visually.

Description

Method and device for creating knowledge graph

Technical Field

The application relates to the technical field of big data processing, in particular to a method and a device for creating a knowledge graph.

Background

A knowledge graph is a visual map of a knowledge domain: a series of graphs that display the development process and structural relationships of knowledge. It can be used to mine, analyze, construct, and display knowledge, and to present the association relationships among knowledge resources and their carriers. Knowledge graphs can be used, for example, for intelligent question answering by intelligent robots, and have a wide range of applications.

However, in the existing mechanism, a knowledge graph is constructed by analyzing all entities acquired from a data source and then establishing association relationships between all of those entities and their entity attributes. Although a knowledge graph built this way covers a wide range, it cannot visually present the important structural relationships to users. As a result, effective information cannot be identified quickly when the knowledge graph is managed, its reference value in use is limited, users must spend a long time on analysis, and key structural information cannot be presented in a targeted manner.

Disclosure of Invention

The application provides a method and a device for creating a knowledge graph, which can solve the problem in the prior art that constructed knowledge graphs are poorly targeted.

The application provides a method for creating a knowledge graph, which is applied to a data analysis device and comprises the following steps:

obtaining a data source, the data source comprising a plurality of entities;

performing semantic analysis and cluster analysis on the data source, and extracting an entity set and an attribute set from the data source, wherein the attribute set comprises entity attributes of all entities in the entity set;

acquiring the association relationship between each entity in the entity set and its attributes;

and creating and outputting a knowledge graph according to the entity set, the attribute set and the association relationship between the entities and the attributes, wherein the knowledge graph comprises the entities, the entity attributes, the association relationship between the entities and the attributes and the association relationship between the entities.

In some possible designs, the method further comprises:

and vectorizing each entity in the entity set to obtain a training vector.

In some possible designs, the vectorizing each entity in the entity set to obtain a training vector includes:

using a multilayer neural network to perform named entity recognition on each entity in the entity set to obtain the entity context of each entity;

extracting the association relationships between the entities from the entity context of each entity;

and obtaining the training vector according to the entity context of each entity and the association relationships between the entities.

In some possible designs, after performing named entity recognition on each entity in the entity set using the multilayer neural network to obtain the entity context of each entity, and before extracting the association relationships between the entities from the entity context of each entity, the method further includes:

maximizing the obtained entity context of each entity using a maximum log-likelihood method.

In some possible designs, after extracting the association relationships between the entities from the entity context of each entity, and before obtaining the training vector according to the entity context of each entity and the association relationships between the entities, the method further includes:

maximizing the obtained association relationships between the entities using a maximum log-likelihood method.

In some possible designs, the extracting the association relationship between the entities from the entity context of the entities includes:

labeling the association relationship of each entity in the entity set according to the attribute set, the entity set, and a recurrent neural network model, wherein each labeled association relationship comprises the position of a word within the entity, the association relationship type, and the position of the association relationship;

calculating the weight value of each relationship type by an association relationship embedding method;

screening candidate association relationships from the labeled association relationships according to the closest-distance principle and the association relationship type;

and classifying the screened candidate association relationships according to the keyword pairs of the association relationship types to obtain the association relationships between the entities.

In some possible designs, after the extracting the entity set and the attribute set from the data source and before the obtaining the association relationship between each entity and the attribute in the entity set, the method further includes:

calculating the weight value of each entity in the entity set according to the entity attribute of the entity;

and sorting the attributes of the entities in the entity set according to the weight values of the entities.

In some possible designs, the method further comprises:

and calculating the similarity among the entities through entity attribute embedding, and performing at least one of combination, duplication removal and differentiation on the entities with the same or similar entity types in the knowledge graph.

In some possible designs, the data source includes a first data table and a second data table, the plurality of entities includes at least one first entity and at least one second entity, the first entity belongs to the first data table, the second entity belongs to the second data table, and the knowledge graph includes at least two connected graphs, and an offspring relationship and/or a parent-offspring relationship exist between the at least two connected graphs.

In some possible designs, the at least one of merging, deduplicating, and differentiating entities of the same or similar entity type in the knowledge-graph includes:

if the similarity between the first entity and the second entity is higher than a preset similarity and the first entity and the second entity belong to at least one connected graph, merging the first entity and the second entity, or deleting the first entity or the second entity from the knowledge graph;

if the similarity between the first entity and the second entity is higher than the preset similarity and it is determined that the first entity and the second entity do not belong to any one connected graph, distinguishing the first entity from the second entity in the knowledge graph;

if the similarity between the first entity and the second entity is higher than a preset similarity and the first entity and the second entity belong to at least one connected graph, determining a first entity set directly associated with the first entity and a second entity set directly associated with the second entity;

and when it is determined that the intersection of the first entity set and the second entity set includes at least two entities, merging the first entity and the second entity, or deleting the first entity or the second entity from the knowledge graph.
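The merge condition above can be sketched as follows. This is only an illustration under stated assumptions: the adjacency-dictionary representation, the function names `connected` and `should_merge`, the threshold of 0.9, and the sample entities are all hypothetical and do not come from the source. The similarity score is taken as given.

```python
from collections import deque


def connected(adjacency, a, b):
    """True if a and b lie in the same connected graph (breadth-first search)."""
    seen, queue = {a}, deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            return True
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def should_merge(adjacency, a, b, similarity, threshold=0.9, min_shared=2):
    """Merge only if similar, in one connected graph, and sharing enough neighbors."""
    if similarity <= threshold or not connected(adjacency, a, b):
        return False
    # Entities directly associated with both a and b.
    shared = set(adjacency.get(a, ())) & set(adjacency.get(b, ()))
    return len(shared) >= min_shared


# Hypothetical undirected graph: three employee nodes that share one name.
adjacency = {
    "Zhang Wei #1": {"Company A", "Company B"},
    "Zhang Wei #2": {"Company A", "Company B"},
    "Zhang Wei #3": {"Company C"},
    "Company A": {"Zhang Wei #1", "Zhang Wei #2"},
    "Company B": {"Zhang Wei #1", "Zhang Wei #2"},
    "Company C": {"Zhang Wei #3"},
}
merge_1_2 = should_merge(adjacency, "Zhang Wei #1", "Zhang Wei #2", similarity=0.95)
merge_1_3 = should_merge(adjacency, "Zhang Wei #1", "Zhang Wei #3", similarity=0.95)
```

In this sketch the first two nodes qualify for merging because they share two direct neighbors, while the third node sits in a separate connected graph and is kept distinct.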

In some possible designs, the knowledge graph is based on a time dimension, and the connected graph in each time window of the time dimension is a snapshot of the association relationships between the entities, and of the attributes of the entities, within that time window.

In some possible designs, the knowledge-graph further satisfies at least one of:

in the knowledge graph, entities having an association relationship are displayed with a gradient from strong to weak according to the strength of the association relationship;

a specific entity is highlighted in the knowledge graph, wherein the specific entity is marked with a risk assessment value and is an entity whose risk assessment value is higher than a preset risk assessment value;

when an entity in the knowledge graph is updated, the updated entity is displayed distinctively;

a time axis is added to an entity attribute that is updated over time, and the update time is displayed on the time axis;

and for the entity attributes of the same entity, the attributes are colored from dark to light in descending order of attribute weight value.

In some possible designs, the performing semantic analysis and cluster analysis on the corpus set and extracting an entity set and an attribute set from the corpus set includes:

performing word segmentation and semantic annotation processing on the corpora in the corpus set to obtain the entity set and the attribute set;

labeling the association relationship types between the entities in the entity set;

adjusting the entity set and the attribute set respectively based on a conditional random field model, and predicting each entity in the entity set and each attribute in the attribute set respectively, to obtain the association relationship types between the entities and the mapping between the entities and the attributes.

A second aspect of the present application provides an apparatus for creating a knowledge-graph having functions to implement a method of creating a knowledge-graph corresponding to the first aspect provided above. The functions can be realized by hardware, and the functions can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the above functions, which may be software and/or hardware.

In one possible design, the means for creating a knowledge-graph includes:

a transceiver module for acquiring a data source, the data source comprising a plurality of entities;

the processing module is used for performing semantic analysis and cluster analysis on the data source acquired by the transceiver module, and extracting an entity set and an attribute set from the data source, wherein the attribute set comprises entity attributes of each entity in the entity set; acquiring an incidence relation between each entity and the attribute in the entity set; and creating and outputting a knowledge graph according to the entity set, the attribute set and the association relationship between the entities and the attributes, wherein the knowledge graph comprises the entities, the entity attributes, the association relationship between the entities and the attributes and the association relationship between the entities.

In some possible designs, the processing module is further to:

and vectorizing each entity in the entity set to obtain a training vector.

In some possible designs, the processing module is further to:

In some possible designs, the processing module performs named entity recognition on each entity in the entity set by using a multi-layer neural network, obtains an entity context of each entity, and extracts an association relationship between each entity from the obtained entity context of each entity, and further:

In some possible designs, after the processing module extracts the association relationship between the entities from the entity context of each entity, before the entity training vector is obtained according to the entity context of each entity and the association relationship between the entities, the processing module is further configured to:

In some possible designs, the processing module is specifically configured to:

calculating the weight value of each relationship type by an association relationship embedding method;

In some possible designs, after the processing module extracts the entity set and the attribute set from the data source, and before acquiring the association relationship between each entity and the attribute in the entity set, the processing module is further configured to:

In some possible designs, the processing module is further to:

In some possible designs, the processing module is specifically configured to:

labeling the association relationship types between the entities in the entity set;

Yet another aspect of the present application provides an apparatus for creating a knowledge-graph comprising at least one connected processor, memory, transmitter, and receiver, wherein the memory is configured to store program code, and the processor is configured to invoke the program code in the memory to perform the method of the above-described aspects.

Yet another aspect of the present application provides a computer storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of the above-described aspects.

Compared with the prior art, in the scheme provided by the application, semantic analysis and cluster analysis are performed on the acquired data source, and the entity set and the attribute set are extracted from the data source. The weight value of each entity in the entity set is calculated according to the entity attributes of that entity, and the attributes of the entities in the entity set are sorted according to the weight values of the entities. The association relationships between the entities in the entity set and their attributes are then established, and a knowledge graph is created and output according to the entity set, the attribute set, and the association relationships between the entities and the attributes. The knowledge graph comprises the entities, the entity attributes, the association relationships between the entities and the attributes, and the association relationships between the entities. With this scheme, the knowledge graph can be created accurately, and the relationships between entities and attributes and the association relationships between entities can be presented visually. The scheme also facilitates personalized visual display and unified management.

Drawings

FIG. 1-1 is a schematic flow chart of creating a knowledge-graph in an embodiment of the invention;

FIG. 1-2 is a schematic flow chart of constructing a first function according to an embodiment of the present invention;

FIG. 1-3 is a schematic structural diagram of entities and the association relationships between entities according to an embodiment of the present invention;

FIG. 1-4 is a schematic flow chart of constructing a second function according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart illustrating the creation of an enterprise knowledge graph in accordance with an embodiment of the present invention;

FIG. 3 is a schematic diagram of employees who share the same full name in an embodiment of the present invention;

FIG. 4 is a schematic diagram of a structure of an enterprise knowledge graph in accordance with an embodiment of the present invention;

FIG. 5 is a schematic diagram of an embodiment of an apparatus for creating a knowledge-graph;

FIG. 6 is a schematic diagram of an embodiment of a physical apparatus for performing knowledge graph creation.

Detailed Description

The terms "first," "second," and the like in the description, the claims, and the above-described drawings of the present application are used to distinguish between similar elements and not necessarily to describe a particular sequential or chronological order. It will be appreciated that data so described may be interchanged under appropriate circumstances, such that the embodiments described herein may be practiced in orders other than those illustrated or described herein. Furthermore, the terms "comprise," "include," and "have," and any variations thereof, are intended to cover non-exclusive inclusion: a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to the steps or modules expressly listed, but may include other steps or modules that are not expressly listed or that are inherent to such process, method, article, or apparatus. The division of modules presented herein is merely a logical division and may be implemented differently in a practical application; for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the couplings, direct couplings, or communicative connections shown or discussed may be made through certain interfaces, and the indirect couplings or communicative connections between modules may be electrical or take other similar forms, which is not limited in this application. The modules or sub-modules described as separate components may or may not be physically separate, may or may not be physical modules, or may be distributed across a plurality of circuit modules; some or all of them may be selected according to actual needs to achieve the purpose of the solutions of the present application.

The application provides a method and a device for creating a knowledge graph, which are used for the technical field of big data. The details will be described below.

Referring to FIG. 1-1, a method for creating a knowledge graph according to the present application is described below. The method is applied to a data analysis apparatus, which may be a server or a terminal device, or an application installed on a server or a terminal device; the present application is not limited in this respect. The method mainly comprises the following steps:

101. Acquire a data source.

The data source includes a plurality of entities. The data source may be text data such as news, posts, and popular articles, and the text data may be in the form of a data table or another form; the application is not limited in this respect. The data source may also be referred to as a corpus set. The data source can be obtained by using web crawlers to crawl online news, announcements, legal documents, industrial and commercial websites, official enterprise websites, personal homepages, encyclopedias, and the like. The data source may also be any device for collecting and sending data, such as a terminal device, which may be a smart phone, a tablet computer, a laptop computer, a desktop computer, or a crawler server.

102. Perform semantic analysis and cluster analysis on the data source, and extract an entity set and an attribute set from the data source.

Wherein the attribute set includes entity attributes of each entity in the entity set.

For example, in creating a corporate knowledge graph, entities may refer to employee names and business names, and entity attributes may refer to business attributes and employee attributes.

The entity attributes of an employee may include position, gender, education, award information, employee level, resume, patents, news, events, and the like.

The entity attributes of an enterprise may include announcements, news, legal documents, intellectual property, products, qualifications, official websites, recruitment, administrative penalties, research teams and events, stock codes, shareholder information, investments, senior executives, and the like.

Semantic analysis refers to examining and processing text semantically, according to the grammatical categories recognized by a syntactic analyzer, to obtain the substantive meaning of the text.

Cluster analysis refers to the process of grouping a collection of physical or abstract objects into classes composed of similar objects.

103. Acquire the association relationship between each entity in the entity set and its attributes.

Optionally, in some embodiments, after step 102 and before step 103, the method may further include the following steps:

calculating the weight value of each entity in the entity set according to the entity attributes of the entity, and sorting the attributes of the entities in the entity set according to the weight values of the entities.

The weight value of an entity refers to the importance of the entity in the whole knowledge graph to be created, and can be customized. The weight value may also be obtained by performing an embedding calculation on the association relationships.
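As one possible reading of a customized weight, the sketch below sums user-defined weights over the attribute types an entity has and then ranks the entities. The attribute names, the numeric weights, and the function names are hypothetical; the source does not specify how the weight is computed.

```python
# Hypothetical, user-defined importance of each attribute type.
ATTRIBUTE_WEIGHTS = {"administrative_penalty": 3.0, "announcement": 2.0, "news": 1.0}


def entity_weight(attributes, attribute_weights=ATTRIBUTE_WEIGHTS):
    """Sum the weights of the attribute types an entity has; unknown types count 0."""
    return sum(attribute_weights.get(name, 0.0) for name in attributes)


def rank_entities(entity_attributes):
    """Return entity names sorted from highest to lowest weight (ties by name)."""
    return sorted(
        entity_attributes,
        key=lambda e: (-entity_weight(entity_attributes[e]), e),
    )


# Attribute values are elided; only the attribute types matter in this sketch.
entity_attributes = {
    "Company A": {"announcement": "...", "news": "..."},
    "Company B": {"administrative_penalty": "...", "announcement": "...", "news": "..."},
    "Company C": {"news": "..."},
}
ranking = rank_entities(entity_attributes)
```

With these assumed weights, the entity carrying an administrative penalty ranks first, which is the kind of targeted ordering the weight value is meant to support.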

104. Create and output a knowledge graph according to the entity set, the attribute set, and the association relationships between the entities and the attributes.

The knowledge-graph may include entities, entity attributes, associations between entities and attributes, and associations between entities.
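A minimal sketch of a container holding these four kinds of content is shown below. The class name, field names, relation-type strings, and sample data are assumptions made for illustration; the source does not prescribe any storage layout.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    """Illustrative container; field names are assumptions, not from the source."""
    entities: set = field(default_factory=set)
    # entity -> {attribute name: attribute value}
    attributes: dict = field(default_factory=dict)
    # (entity 1, association type, entity 2) triples
    relations: set = field(default_factory=set)

    def add_entity(self, name, **attrs):
        self.entities.add(name)
        self.attributes.setdefault(name, {}).update(attrs)

    def add_relation(self, head, relation_type, tail):
        # Both endpoints become nodes even if no attributes are known yet.
        self.add_entity(head)
        self.add_entity(tail)
        self.relations.add((head, relation_type, tail))

    def neighbors(self, name):
        """Entities directly associated with `name`, in either direction."""
        out = set()
        for head, _, tail in self.relations:
            if head == name:
                out.add(tail)
            elif tail == name:
                out.add(head)
        return out


kg = KnowledgeGraph()
kg.add_entity("Company A", stock_code="EXAMPLE")
kg.add_entity("Employee X", position="director")
kg.add_relation("Employee X", "holds_post_at", "Company A")
kg.add_relation("Company A", "invests_in", "Company B")
```

Storing entity-to-entity associations as triples keeps them separate from the entity-to-attribute mapping, so either can be updated without touching the other.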

The knowledge graph serves as prior knowledge that provides detailed structured summaries for the entities contained in a user query or in the returned answers. It mainly comprises concepts, concept hierarchies, attributes, attribute value types, relations, and, for each relation, a domain concept set and a range concept set.

The knowledge graph can cover most common-sense knowledge by gathering structured data from encyclopedia sites and various vertical sites; the description of entities is enriched by extracting attribute-value pairs of related entities from various semi-structured data (for example, HTML tables). In addition, new entities or new entity attributes are discovered through search logs (query logs) to continually expand the coverage of the knowledge graph. Compared with common-sense knowledge, the knowledge obtained through data mining and extraction is larger in volume, better reflects the query needs of current users, and allows the latest entities or facts to be discovered in time.

When the knowledge graph is established, the redundancy of the Internet can be exploited: in subsequent mining, the confidence of extracted knowledge is evaluated through voting or other aggregation algorithms, and the knowledge is added to the preset knowledge graph after manual review. Various candidate entities (concepts) required for constructing the knowledge graph, together with their attribute associations, are extracted from various types of data sources, forming isolated extraction graphs.
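The voting idea can be sketched as a simple majority vote over the values that redundant sources report for one attribute of one entity. The function name `vote`, the sample values, and the use of the agreement share as the confidence are assumptions; the source names voting only as one of several possible aggregation algorithms.

```python
from collections import Counter


def vote(candidates):
    """Return (most common value, share of sources agreeing with it).

    Assumes `candidates` is a non-empty list of reported values.
    """
    counts = Counter(candidates)
    value, hits = counts.most_common(1)[0]
    return value, hits / len(candidates)


# Values for one attribute of one entity as reported by five hypothetical sources.
reported = ["Shenzhen", "Shenzhen", "Shanghai", "Shenzhen", "Shenzhen"]
value, confidence = vote(reported)
```

A value whose confidence falls below some chosen threshold could then be routed to manual review rather than added directly.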

For example, when the knowledge graph is an enterprise knowledge graph, it may be created according to the flowchart shown in FIG. 2. The enterprise knowledge graph may include associations between enterprises, associations between employees, enterprise attributes, employee attributes, and associations between enterprises and employees. Employees and enterprises are mapped as nodes on the enterprise knowledge graph.

The association relationships between enterprises include external investment, shareholding, customer, and competition relationships, and the like. The association relationships between enterprises and employees include holding a post, being a shareholder, being a legal representative, and the like. The association relationships between employees include colleague, classmate, relative, and opponent relationships, and the like. The attributes of an enterprise include announcements, news, legal documents, intellectual property, products, qualifications, official websites, recruitment, administrative penalties, research teams and events, and the like. The attributes of an employee include education, experience, patents, news, events, and the like.

Therefore, the knowledge graph in the embodiment of the application can be used for showing the analysis result of the big data and can also be used for a search engine. For example, after receiving a search request of a user at a later stage, the server performs semantic analysis on the search request based on the obtained knowledge graph, and then returns an answer to the user. To a certain extent, the answer with higher accuracy can be output for the user, and the speed of responding to the user can be increased. Moreover, the search based on the knowledge graph can reduce the operation load of the search engine and improve the performance of the search engine.

Compared with the existing mechanism, in the embodiment of the application, the data analysis device performs semantic analysis and cluster analysis on the acquired data source and extracts the entity set and the attribute set from the data source. It calculates the weight value of each entity in the entity set according to the entity attributes of that entity, and sorts the attributes of the entities in the entity set according to the weight values of the entities. It then establishes the association relationships between the entities in the entity set and their attributes, and creates and outputs a knowledge graph according to the entity set, the attribute set, and the association relationships between the entities and the attributes. The knowledge graph comprises the entities, the entity attributes, the association relationships between the entities and the attributes, and the association relationships between the entities. With this scheme, the knowledge graph can be created accurately, and the relationships between entities and attributes and the association relationships between entities can be presented visually. The scheme also facilitates personalized visual display and unified management.

Optionally, in some embodiments of the invention, after step 103 and before step 104, the method further includes:

vectorizing each entity in the entity set to obtain a training vector.

Specifically, the following operations may be included:

(a) Use a multilayer neural network to perform named entity recognition on each entity in the entity set to obtain the entity context of each entity.

The entity context refers to the set of words surrounding the entity; for example, the context of an entity e can be denoted context(e).

In some embodiments, a maximum log-likelihood method may further be used to maximize the obtained entity context of each entity. For example, a first function may be constructed using maximum log-likelihood and then maximized. The first function is as follows:

[Formula not reproduced in the source text.]

where C is the corpus, θ is the set of undetermined parameters, and F(e, context(e), θ) represents the first function, which may be constructed by a multilayer neural network, for example as in the schematic shown in FIG. 1-2.
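The formula itself is not available in this text. Assuming the usual maximum log-likelihood form used for context-based embedding objectives, and assuming that F plays the role of the modeled likelihood of an entity together with its context, the objective might look like the following. This is a plausible reconstruction from the symbols defined above, not the patent's confirmed formula; the name L_1 is introduced here for illustration.

```latex
\mathcal{L}_1(\theta) = \sum_{e \in C} \log F\bigl(e, \mathrm{context}(e), \theta\bigr),
\qquad
\theta^{*} = \arg\max_{\theta} \mathcal{L}_1(\theta)
```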

In this step, the entity context corresponding to each entity is finally obtained by performing maximization processing on the first function.

(b) Extract the association relationships between the entities from the entity context of each entity.

Specifically, the entity contexts obtained in step (a) are used as the input of the system for extracting the association relationships between the entities, and the system finally outputs a set of expressions of the association relationships between the entities. For example, FIG. 1-3 shows the structure of the entities and the association relationships between them. In FIG. 1-3, fe_1 to fe_l, e_0, and be_1 to be_s are all entities, and w_1 to w_l and v_1 to v_s all represent association relationships. For example, w_1 represents the association relationship between fe_1 and e_0.

For an entity e_0, {fe_i; i = 1, ..., l} are the front entities (front NEs) connected to e_0 in all of its association relationships, and {be_j; j = 1, ..., s} are the back entities (back NEs) connected to e_0 in all of its association relationships. For example, for the corpus sentence "Facebook acquired Source3, a content copyright startup," "Facebook" is the front entity of the named entity "Source3" in the "acquisition" relationship, and "Source3" is the back entity of the named entity "Facebook." {ω_i; i = 1, ..., l} and {υ_j; j = 1, ..., s} are the relation weights, which are determined according to the priority level of the financial activity.

(c) Obtain the training vector according to the entity context of each entity and the association relationships between the entities.

In some embodiments, the obtained association relationships between the entities may further be maximized using a maximum log-likelihood method.

Specifically, based on the embodiments shown in FIG. 1-2 and FIG. 1-3, a second function may be constructed by maximum log-likelihood and then maximized. The second function is as follows:

[Formula not reproduced in the source text.]

where C is the corpus, λ is the set of undetermined parameters, and G(e, front(e), back(e), λ) represents the second function, which can likewise be constructed by a multilayer neural network, for example as in the schematic shown in FIG. 1-4.
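The second formula is also missing from this text. Under the same assumption of a standard maximum log-likelihood objective, with G conditioned on the front and back entities of e rather than on its word context, a plausible form would be the following. The name L_2 is introduced here for illustration and the expression is a sketch, not the patent's confirmed formula.

```latex
\mathcal{L}_2(\lambda) = \sum_{e \in C} \log G\bigl(e, \mathrm{front}(e), \mathrm{back}(e), \lambda\bigr),
\qquad
\lambda^{*} = \arg\max_{\lambda} \mathcal{L}_2(\lambda)
```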

In step (c), the association relationships between the entities, obtained by training on the context relationships of the entities, are finally generated; they may be denoted EntityRep_{e;relation}. Finally, the obtained context of each entity and the association relationships between the entities are combined using the Kronecker product to form the final training vector of the entity, which can be expressed by the following expression:

Through vectorization, the entities are converted into a distributed representation in which associated or similar entities lie close to one another. This makes the text computable, so that knowledge extraction and association relationship reasoning can subsequently be performed on the knowledge graph.
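The combining expression is not reproduced in this text. As a minimal sketch of the operation it names, the Kronecker product of two one-dimensional vectors multiplies every component of the first by every component of the second. The vector values and variable names below are hypothetical, and which representation comes first in the product is an assumption.

```python
def kron(u, v):
    """Kronecker product of two 1-D vectors: each u[i] times a full copy of v."""
    return [ui * vj for ui in u for vj in v]


context_vec = [1.0, 2.0]         # hypothetical entity-context representation
relation_vec = [0.5, 0.0, -1.0]  # hypothetical relation representation
training_vec = kron(context_vec, relation_vec)
```

The result has length len(u) * len(v), so every context component is paired with every relation component in the training vector.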

Optionally, in some embodiments of the present invention, the extracting the association relationship between the entities from the entity context of the obtained entities includes:

(a) Taking the attribute set and the entity set as input, labeling the association relationship of each entity in the entity set based on a time-recurrent neural network model, wherein each label comprises the position of the word in the entity, the type of the association relationship, and the position of the association relationship, the latter being the position of the association relationship in the knowledge graph.

(b) Calculating the weight value of each association relationship type by an association relationship embedding method.

(c) Screening candidate association relationships from the labeled association relationships according to the closest-distance principle and the association relationship type.

(d) Classifying the screened candidate association relationships according to the keyword pairs of the association relationship types to obtain the association relationships among the entities.

In this way, the latent semantic relationships among the entities can be identified by extracting the association relationships among the entities from the text data. In some embodiments, the association relationship between entities may be represented by an entity relationship triple (entity 1, association type or association indication information, entity 2). Specifically, the present application can convert the problem of extracting association relationships between entities into a sequence tagging task, and can use a neural network model to carry out that task.

For example, the attribute set and the entity set may be used as input, and then the corpus may be subjected to sequence labeling based on a neural network model. After the training corpus is subjected to sequence labeling based on the neural network model, the output results of the final sequence labeling have two types:

One type of output is a label unrelated to the association relationship to be extracted, which may be represented by 'O'.

The other type of output is a label related to the association relationship to be extracted, which may be represented by a relationship label other than 'O'. The relationship label indicates the position of the word in the entity, the type of the association relationship, and the position of the association relationship.

Finally, the normalized entity tag probability is computed from the label prediction vector T_t by the output layer shown in FIGS. 1-4:

y_t = W_y T_t + b_y,

wherein W_y is a weight matrix, N_t is the total number of tags, F(N_t, y_t) denotes a normalization function, and T_t denotes the vectorized label. Because the neural network model can learn long-term dependencies, this decoding mode can model the interactions between labels.

A third function is then constructed using the log-likelihood function. The third function is as follows:

wherein L_j is the length of the sentence x_j, |D| is the sample size of the training set, the term with subscript t,j is a conditional predicate, Θ is the coefficient space, and α is the bias weight: the larger the value of α, the greater the influence of the corresponding relationship labels on the model. I(O) is a switching function that distinguishes the loss of the 'O' label from the loss of the relationship labels, as follows:

By combining the above method with the relationship type and the closest-distance principle, candidate relationship triples can be extracted preliminarily. The candidate entity relationship triples are then screened using the relationship category keywords to obtain the finally classified entity relationship triples.
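The decoding of a labeled sequence into candidate triples by the closest-distance principle can be sketched as follows. This is only an illustrative sketch: the tag layout `POSITION-TYPE-ROLE` (for example `B-ACQ-1`), the position letters B/I/E/S, the relation type `ACQ`, and both function names are assumptions made for the example, and the keyword-based screening step is not shown.

```python
def extract_spans(tokens, tags):
    """Group tagged tokens into (start_index, text, relation_type, role) spans."""
    spans, current = [], None
    for i, (tok, tag) in enumerate(zip(tokens, tags)):
        if tag == "O":                      # label unrelated to any relation
            current = None
            continue
        pos, rel, role = tag.split("-")     # assumed layout: POSITION-TYPE-ROLE
        if pos in ("B", "S") or current is None or current[2:] != [rel, role]:
            current = [i, tok, rel, role]   # start a new entity span
            spans.append(current)
        else:
            current[1] += " " + tok         # continue the open span
        if pos in ("E", "S"):
            current = None
    return [tuple(s) for s in spans]


def nearest_triples(spans):
    """Pair each role-1 span with the closest role-2 span of the same relation type."""
    triples = []
    for start1, text1, rel1, role1 in spans:
        if role1 != "1":
            continue
        candidates = [s for s in spans if s[3] == "2" and s[2] == rel1]
        if candidates:
            nearest = min(candidates, key=lambda s: abs(s[0] - start1))
            triples.append((text1, rel1, nearest[1]))
    return triples


tokens = ["Company", "A", "acquired", "Company", "B", "yesterday"]
tags = ["B-ACQ-1", "E-ACQ-1", "O", "B-ACQ-2", "E-ACQ-2", "O"]
triples = nearest_triples(extract_spans(tokens, tags))
print(triples)
```

Relation type names containing a hyphen would need a different tag separator in this sketch.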

It should be noted that the formulas corresponding to the first function, the second function and the third function given in the present application are only examples, and modifications may be made on the given formulas, and specific modifications and specific forms of the formulas are not limited in the present application.
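As one possible concrete form of a biased log-likelihood objective of the kind described for the third function, the sketch below weights the log-probability of every non-'O' label by α. It is not the formula of the present application: the use of softmax as the normalization function, the function names, and the toy scores are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Normalize raw tag scores into probabilities (an assumed choice of F)."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def biased_log_likelihood(score_rows, gold_tags, tag_names, alpha):
    """Sum log-probabilities of the gold tags; non-'O' tags are weighted by alpha."""
    total = 0.0
    for scores, gold in zip(score_rows, gold_tags):
        probs = softmax(scores)
        log_p = math.log(probs[tag_names.index(gold)])
        # Switch between the plain term ('O') and the alpha-weighted term (relation tags).
        total += log_p if gold == "O" else alpha * log_p
    return total


tag_names = ["O", "B-ACQ-1"]                 # hypothetical tag inventory
score_rows = [[0.0, 0.0], [0.0, 0.0]]        # one row of raw scores per token
objective = biased_log_likelihood(score_rows, ["O", "B-ACQ-1"], tag_names, alpha=2.0)
print(objective)
```

Training would maximize this quantity over the model parameters; a larger α makes errors on relationship labels cost more than errors on 'O' labels.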

Optionally, in some embodiments of the present invention, because there are many different data sources, entities with the same or similar names may exist in the obtained entity set. To simplify the knowledge graph and improve its accuracy, entities with the same or similar names may be merged, and entities that share a name but are in fact different entities may be distinguished. In an embodiment of the present application, the method further includes:

performing at least one of merging, deduplicating and distinguishing on entities with the same or similar entity types in the knowledge graph.

In some embodiments, the data source may include a first data table and a second data table, the plurality of entities includes at least one first entity and at least one second entity, the first entity belongs to the first data table, the second entity belongs to the second data table, and the knowledge graph includes at least two connected graphs, and an offspring relationship and/or a parent-child relationship exists between the at least two connected graphs.

Specifically, since the entities, association relationships and other data in the finally constructed knowledge graph may come from different data sources, duplicate entities may appear in the knowledge graph, so deduplication is performed before the graph data is used. First, common synonyms are merged. Then merging is performed using the connectivity of the knowledge graph itself: if the distance between same-name entities in the entity network is less than N, they are merged into the same entity, and all of their relationships are merged as well. The value of N may be adjusted according to how rare the entity name is: the rarer the entity name, the larger the value of N, and the more common the entity name, the smaller the value of N. The present application does not limit the specific value.
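The connectivity-based merging of same-name entities can be sketched as follows, assuming an undirected graph stored as a dictionary of neighbor sets without self-loops. The function names, the node identifiers and the toy data are hypothetical, and choosing N from the rarity of a name is left to the caller.

```python
from collections import deque


def hop_distance(adj, src, dst):
    """Breadth-first search hop count from src to dst; None if unreachable."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def merge_same_name(adj, names, n):
    """Merge same-name nodes closer than n hops; the kept node inherits all relations."""
    adj = {k: set(v) for k, v in adj.items()}    # work on a copy
    ids = sorted(adj)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if a not in adj or b not in adj or names[a] != names[b]:
                continue
            d = hop_distance(adj, a, b)
            if d is not None and d < n:
                for nb in adj.pop(b):            # re-attach b's relations to a
                    adj[nb].discard(b)
                    if nb != a:
                        adj[nb].add(a)
                        adj[a].add(nb)
    return adj


names = {"p1": "Zhang Long", "p2": "Zhang Long", "A": "Company A", "B": "Company B"}
adj = {"p1": {"A", "B"}, "p2": {"B"}, "A": {"p1", "B"}, "B": {"p1", "p2", "A"}}
merged = merge_same_name(adj, names, n=3)        # p1 and p2 are 2 hops apart
print(sorted(merged))
```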

The merging, deduplication and differentiation of entities of the same or similar entity type in the knowledge graph are illustrated below, respectively, where the similarity is calculated by entity attribute embedding:

First, if the similarity between the first entity and the second entity is higher than a preset similarity and the first entity and the second entity belong to at least one connected graph, the first entity and the second entity are merged, or the first entity or the second entity is deleted from the knowledge graph.

Taking an enterprise knowledge graph as an example, as shown in part a of FIG. 3, the enterprise knowledge graph includes two persons named "Zhang Long", each with its own employment attributes: one Zhang Long works at company A and also at company B, where company B is a subsidiary of company A, and the other Zhang Long works at company B. The maximum connected graph method can then be used for identification: first, the maximum connected graph is extracted from the enterprise knowledge graph; then persons with the same name within that connected graph are merged, as shown in part b of FIG. 3, so that the two "Zhang Long" entries are finally regarded as the same person. If they are not in the same connected graph, they are regarded as different persons.

Secondly, if the similarity between the first entity and the second entity is higher than the preset similarity and it is determined that the first entity and the second entity do not belong to any one connected graph, the first entity and the second entity are distinguished in the knowledge graph.

Thirdly, if the similarity between the first entity and the second entity is higher than a preset similarity and the first entity and the second entity belong to at least one connected graph, a first entity set directly associated with the first entity and a second entity set directly associated with the second entity are determined. The first entity set is then compared with the second entity set, and when the intersection of the two sets is determined to include at least two entities, the first entity and the second entity are merged, or the first entity or the second entity is deleted from the knowledge graph.
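The rules above can be read as a small decision procedure over a candidate pair of entities. The sketch below is one hedged interpretation: a pair in different connected graphs is distinguished, as in the "Zhang Long" example, and the `min_shared` parameter switches between merging on connectivity alone and requiring at least two shared neighbors. The function names, the return labels, and the "undecided" outcome are assumptions, since the text does not say what happens when too few neighbors are shared.

```python
def component_of(adj, start):
    """Return the set of nodes reachable from start (its connected graph)."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def resolve(adj, a, b, similarity, threshold, min_shared=0):
    """Return 'merge', 'distinguish', 'undecided' or 'unrelated' for a candidate pair."""
    if similarity <= threshold:
        return "unrelated"                   # not similar enough to be a candidate
    if b not in component_of(adj, a):
        return "distinguish"                 # similar, but in different connected graphs
    shared = (adj[a] & adj[b]) - {a, b}      # directly associated entities in common
    if len(shared) >= min_shared:
        return "merge"
    return "undecided"                       # assumption: the text leaves this case open


# Hypothetical data: z1 works at A and B, z2 works at B, z3 works at an unrelated C.
adj = {
    "z1": {"A", "B"}, "z2": {"B"}, "z3": {"C"},
    "A": {"z1", "B"}, "B": {"z1", "z2", "A"}, "C": {"z3"},
}
print(resolve(adj, "z1", "z2", similarity=0.9, threshold=0.8))
```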

Optionally, in some embodiments of the invention, the knowledge graph is based on a time dimension, and the connected graph in each time window of the time dimension is a snapshot of the association relationships between the entities and of the entity attributes within that time window.

Optionally, in some embodiments of the invention, the knowledge graph further satisfies at least one of the following:

When an entity in the knowledge graph is updated, the updated entity is displayed in a distinguishing manner.

A time axis is added to entity attributes that are updated over time, and the time of each update is displayed on the time axis.

For example, when the knowledge-graph in the embodiments of the present application is applied to an enterprise knowledge-graph, the enterprise knowledge-graph can visually present each entity, as well as various types of attributes associated with each entity. In order to facilitate management and quickly identify information such as key entities, changed entities and the like, the enterprise knowledge graph can be further processed as follows:

(1) The importance of the different attributes is ranked; the importance may be customized for different users.

(2) The attributes may be displayed in order of importance and colored from dark to light using colors of the same color family.

(3) When an entity or an association relationship between entities changes in the knowledge graph, the changed entity or association relationship may be displayed in a color different from the normal color.

(4) The closeness of the association relationship between entities can be highlighted through the thickness of the connecting lines.

(5) A risk assessment of the enterprise is obtained from positive and negative sentiment analysis of news, rises and falls of the stock price, financial changes, mergers and acquisitions, and other major events, and each is identified by a corresponding color.

(6) Listed companies (on different boards) and non-listed companies are shown with different shapes or different colors.

After the processes (1) to (6) are performed on the enterprise knowledge graph, the enterprise knowledge graph shown in fig. 4 can be obtained.

Optionally, in some embodiments of the present invention, the performing semantic analysis and cluster analysis on the corpus set, and extracting an entity set and an attribute set from the corpus set includes:

Performing word segmentation and semantic annotation on the corpora in the corpus set to obtain the entity set and the attribute set.

Labeling the association relationship types among the entities in the entity set.

Adjusting the entity set and the attribute set respectively based on a sequence labeling model, and predicting each entity in the entity set and each attribute in the attribute set respectively, to obtain the association relationship types between the entities and the mapping between the entities and the attributes.

Optionally, in some embodiments of the present invention, the present application further provides an optimized hierarchical layout method for the knowledge graph. The main idea is as follows:

A given entity a is taken as the center. Entities in the knowledge graph may be referred to as nodes, so entity a may be referred to as the central node. The neighbor entities of entity a are distributed on concentric rings centered on entity a: nodes with a larger hop count from entity a are placed on outer rings, and nodes at the same level lie on the same ring. To show more nodes per unit of screen area, the nodes on a same-level ring may also be arranged in a multilayer spiral, which prevents the whole graph from growing because the ring radius becomes too large. Meanwhile, so that adjacent nodes can be viewed with the smallest view span, adjacent nodes and all child nodes of the same node are placed as close together as possible. The method mainly comprises the following steps:

(a) Calculating the node distance between each node in the network to be visualized and entity a.

(b) Determining the candidate positions of all nodes in the knowledge graph according to configuration such as node distance, node size, node spacing and interlayer spacing; at this stage the positions are only empty slots in which no node has yet been placed.

(c) Placing the central node on the central slot.

(d) Taking out all neighbor nodes of the central node, and placing each neighbor node on the empty slot nearest to the central node.

(e) Repeating the operations described in (a)-(d) until all nodes have been placed, that is, until all empty slots have been filled.
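A much-simplified sketch of the ring placement in steps (a)-(e) is given below. It assumes one ring per hop count with evenly spaced slots, and it omits the multilayer spiral and the nearest-empty-slot search; the function name `ring_layout` and the `ring_gap` parameter are hypothetical. Nodes that cannot be reached from the center are not placed.

```python
import math
from collections import deque


def ring_layout(adj, center, ring_gap=100.0):
    """Place the center at the origin and every other node on a ring chosen by hop count."""
    # (a) hop distance from the center via breadth-first search
    hops, queue = {center: 0}, deque([center])
    while queue:
        node = queue.popleft()
        for nxt in sorted(adj.get(node, ())):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    # (b) group nodes per ring; BFS order keeps children of one parent next to each other
    rings = {}
    for node, h in hops.items():
        rings.setdefault(h, []).append(node)
    # (c)-(e) the center takes the central slot, then each ring is filled slot by slot
    positions = {center: (0.0, 0.0)}
    for h, nodes in rings.items():
        if h == 0:
            continue
        for k, node in enumerate(nodes):
            angle = 2 * math.pi * k / len(nodes)
            radius = h * ring_gap
            positions[node] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions


adj = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a"}, "d": {"b"}}
pos = ring_layout(adj, "a")
print(pos["a"], pos["d"])
```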

Optionally, in some embodiments of the invention, the association relationship between any nodes in the knowledge graph may be visualized.

Because the association relationship between two entity nodes sometimes needs to be examined in actual business, all shortest paths between any two nodes that have an association relationship can be calculated using a shortest path algorithm. The two nodes to be examined are then placed at the left and right ends of the knowledge graph, and N vertical lines are drawn across the line x connecting the two end nodes so as to divide x into equal parts.
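Enumerating all shortest paths between two nodes of an unweighted graph can be sketched with a breadth-first search that records every predecessor lying on a shortest path. The function name and the toy graph are hypothetical. The index of a node within a path is its hop count from the left end node, which indicates the vertical line on which the node would be placed.

```python
from collections import deque


def all_shortest_paths(adj, src, dst):
    """Enumerate every shortest path from src to dst in an unweighted graph."""
    dist, parents, queue = {src: 0}, {src: []}, deque([src])
    while queue:
        node = queue.popleft()
        for nxt in sorted(adj.get(node, ())):
            if nxt not in dist:                      # first time reached
                dist[nxt] = dist[node] + 1
                parents[nxt] = [node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:        # another equally short route
                parents[nxt].append(node)
    if dst not in dist:
        return []
    paths = []

    def walk(node, suffix):
        # Follow predecessor links back from dst to src, building paths front to back.
        if node == src:
            paths.append([src] + suffix)
        for p in parents[node]:
            walk(p, [node] + suffix)

    walk(dst, [])
    return paths


adj = {"s": {"a", "b"}, "a": {"s", "t"}, "b": {"s", "t"}, "t": {"a", "b"}}
paths = all_shortest_paths(adj, "s", "t")
levels = {node: i for path in paths for i, node in enumerate(path)}
print(paths, levels)
```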

The other nodes on the shortest paths are placed at random positions on the corresponding vertical lines according to their hop counts from the two end nodes, and the positions of the intermediate nodes are then adjusted using a force-directed layout algorithm with constraints: during adjustment, the level of a node cannot be changed, and only the position of the node within its level is adjusted. The specific operations for adjusting the positions of the intermediate nodes are as follows:

(1) Each node is placed on its corresponding level according to its distance from the two end nodes; when several nodes share the same level, their positions within the level are assigned randomly.

(2) Each edge connecting two nodes is treated as a spring. The force between each pair of nodes is calculated according to Hooke's law, and all forces acting on a node are combined to obtain the net force on that node.

(3) The net force on each node is converted into a corresponding displacement, that is, the node is moved. The movement is constrained by level and by position: a node can only move up and down within its own level, and it can only move among a number of preset fixed positions, which keeps the spacing between nodes consistent and makes the interface more attractive.

(4) After the nodes have moved a certain distance, the force on each node is calculated again, and the operation is repeated until the sum of the forces over all nodes of the whole graph converges to a stable value.
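The constrained adjustment in (1)-(4) can be approximated by the sketch below, under several assumptions that are not in the original text: springs have unit stiffness and zero rest length, only the vertical component of the force is used, the displacement is the net force divided by the number of neighbors, positions are integer slot indices, the initial placement is fixed rather than random, and iteration stops when no node moves instead of testing the total force directly. All names are hypothetical.

```python
def adjust_levels(adj, level, slot, fixed, num_slots, max_rounds=100):
    """Move free nodes vertically within their level toward the spring equilibrium."""
    slot = dict(slot)                                # work on a copy
    for _ in range(max_rounds):
        moved = False
        for node in sorted(slot):
            if node in fixed or not adj.get(node):
                continue
            # Net vertical spring force on the node (Hooke's law, unit stiffness).
            force = sum(slot[nb] - slot[node] for nb in adj[node])
            # Convert the force into a displacement and snap to a preset slot.
            target = round(slot[node] + force / len(adj[node]))
            target = max(0, min(num_slots - 1, target))
            taken = any(other != node and level[other] == level[node]
                        and slot[other] == target for other in slot)
            if target != slot[node] and not taken:
                slot[node], moved = target, True
        if not moved:                                # nothing moved: layout is stable
            break
    return slot


adj = {"s": {"a", "b"}, "a": {"s", "t"}, "b": {"s", "t"}, "t": {"a", "b"}}
level = {"s": 0, "a": 1, "b": 1, "t": 2}             # vertical line of each node
start = {"s": 1, "a": 0, "b": 2, "t": 1}             # initial slot within each level
final = adjust_levels(adj, level, start, fixed={"s", "t"}, num_slots=3)
print(final)
```

In this toy run, node a is pulled to the middle slot, while node b stays where it is because the slot it is pulled toward is already occupied in its level.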

Optionally, in some embodiments of the invention, the present application may visualize the change of the graph between two time points, showing the dynamic change of the association relationships between the entities at the two time points with time as a dimension. Specifically, two time points are first selected on the interface of the knowledge graph, for example "time point one" and "time point two", where "time point one" is the earlier time point, "time point two" is the later time point, and both time points fall on the same day.

Correspondingly, the network graphs corresponding to "time point one" and "time point two" can be obtained, and the node and edge information of the two network graphs is then merged, so that the merged network graph contains the node and edge information of both time points. When the knowledge graph is displayed, nodes and association relationships newly added at "time point two" may be marked with a highlight color, and edges and nodes that exist at "time point one" but no longer exist at "time point two" may be marked with dotted lines.
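Merging two snapshots and labeling what changed between them can be sketched with set operations. The function name, the label strings and the toy data are hypothetical, edges are assumed to be undirected, and the choice of which style to draw for each label is left to the rendering layer.

```python
def diff_graphs(nodes1, edges1, nodes2, edges2):
    """Merge two snapshots and label each node and edge as 'added', 'removed' or 'kept'."""
    def label(old, new):
        return {x: "added" if x not in old else "removed" if x not in new else "kept"
                for x in old | new}

    # Undirected edges are stored as frozensets so (a, b) and (b, a) compare equal.
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    return label(set(nodes1), set(nodes2)), label(e1, e2)


# Hypothetical snapshots at the earlier and the later time point.
node_labels, edge_labels = diff_graphs(
    {"A", "B"}, [("A", "B")],
    {"A", "C"}, [("C", "A")],
)
print(node_labels)
```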

The method of creating a knowledge graph in the present application has been described above. An apparatus for performing that method is described below; the apparatus has functions for implementing the method of creating a knowledge graph provided in the above method embodiments. The functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware. The data analysis device in the present application may be a server or a terminal device, or may be an application installed on the server or the terminal device, which is not specifically limited in the present application. When the apparatus is a terminal device, the terminal device may be a device that provides voice and/or data connectivity to a user, a handheld device with a wireless connection function, or another processing device connected to a wireless modem. A wireless terminal may communicate with one or more core networks via a Radio Access Network (RAN). The wireless terminal may be a mobile terminal, such as a mobile phone (or "cellular" phone) or a computer having a mobile terminal, for example a portable, pocket-sized, handheld, computer built-in or vehicle-mounted mobile device. Examples of such devices include Personal Communication Service (PCS) phones, cordless phones, Session Initiation Protocol (SIP) phones, Wireless Local Loop (WLL) stations, and Personal Digital Assistants (PDAs). A wireless terminal may also be referred to as a system, a Subscriber Unit, a Subscriber Station, a Mobile Station, a Remote Station, an Access Point, a Remote Terminal, an Access Terminal, a User Terminal, a Terminal Device, a User Agent, a User Device, or User Equipment.

As shown in FIG. 5, the apparatus 50 for creating a knowledge graph includes:

a transceiver module 501, configured to acquire a data source, where the data source includes multiple entities;

a processing module 502, configured to perform semantic analysis and cluster analysis on the data source obtained by the transceiver module 501, and extract an entity set and an attribute set from the data source, where the attribute set includes entity attributes of each entity in the entity set; calculating the weight value of each entity in the entity set according to the entity attribute of the entity; sorting the attributes of the entities in the entity set according to the weight values of the entities;

creating an incidence relation between each entity and the attribute in the entity set; and creating and outputting a knowledge graph according to the entity set, the attribute set and the association relationship between the entities and the attributes, wherein the knowledge graph comprises the entities, the entity attributes, the association relationship between the entities and the attributes and the association relationship between the entities.

In the embodiment of the present application, the processing module 502 performs semantic analysis and cluster analysis on the data source acquired by the transceiver module 501, extracts an entity set and an attribute set from the data source, and calculates the weight value of each entity in the entity set according to the entity attributes of the entity. It then sorts the attributes of the entities in the entity set according to the weight values of the entities, creates the association relationship between each entity in the entity set and its attributes, and creates and outputs a knowledge graph according to the entity set, the attribute set, and the association relationships between the entities and the attributes. The knowledge graph comprises the entities, the entity attributes, the association relationships between the entities and the attributes, and the association relationships between the entities. With this scheme, the knowledge graph can be created accurately, and the relationships between entities and attributes and the association relationships between entities can be presented visually. The knowledge graph also facilitates personalized visual display and unified management.

Optionally, the processing module 502 is further configured to:

Optionally, the data source includes a first data table and a second data table, the plurality of entities includes at least one first entity and at least one second entity, the first entity belongs to the first data table, the second entity belongs to the second data table, the knowledge graph includes at least two connected graphs, and a descendant relationship and/or a parent-child relationship exists between the at least two connected graphs.

Optionally, the processing module 502 is specifically configured to:

Optionally, the knowledge graph is based on a time dimension, and the link graph in each time window in the time dimension is an association relationship between entities in the time window and a snapshot of an entity attribute.

Optionally, the knowledge-graph further satisfies at least one of:

Optionally, the processing module 502 is specifically configured to:

labeling the incidence relation type among the entities in the entity set;

It should be noted that, in the embodiment corresponding to FIG. 5 of the present application, the physical device corresponding to the transceiver module may be a transceiver, and the physical device corresponding to the processing module may be a processor. Each device shown in FIG. 5 may have the structure shown in FIG. 6. When a device has the structure shown in FIG. 6, the processor, the transmitter and the receiver in FIG. 6 implement functions that are the same as or similar to those of the processing module, the transmitting module and the receiving module provided in the corresponding device embodiment, and the memory in FIG. 6 stores the program code that the processor calls when executing the above method for creating a knowledge graph.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the apparatus and the module described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may be stored in a computer readable storage medium.

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, the implementation may be wholly or partially in the form of a computer program product.

The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, wireless, microwave). The computer readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

The technical solutions provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the present application, and the descriptions of the above embodiments are only intended to help in understanding the method and core ideas of the present application. Meanwhile, a person skilled in the art may, according to the ideas of the present application, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as a limitation on the present application.

Claims

1. A method of creating a knowledge-graph, the method being applied to a data analysis device, the method comprising:

obtaining a data source, the data source comprising a plurality of entities;

creating and outputting a knowledge graph according to the entity set, the attribute set and the association relationship between the entities and the attributes, wherein the knowledge graph comprises the entities, the entity attributes, the association relationship between the entities and the attributes and the association relationship between the entities;

the method further comprises the following steps:

vectorizing each entity in the entity set to obtain a training vector;

the vectorizing each entity in the entity set to obtain a training vector includes:

obtaining the training vector according to the entity context of each entity and the incidence relation between the entities;

the extracting the association relationship among the entities from the entity context of each entity includes:

classifying the screened candidate incidence relations according to the keyword pairs of the incidence relation types to obtain the incidence relations among the entities;

the method further comprises the following steps:

calculating the similarity among the entities through entity attribute embedding, and performing at least one of combination, duplication removal and differentiation on the entities with the same or similar entity types in the knowledge graph;

the method further comprises the following steps:

when the distance between at least two same-name entities in an entity network is smaller than N, merging the at least two same-name entities into a same entity, merging the association relations of the at least two same-name entities into a same association, wherein the value of the N is used for indicating the rarity degree of the entity name;

the knowledge graph is based on a time dimension, and a link graph in each time window in the time dimension is an association relation between entities in the time window and a snapshot of entity attributes;

the knowledge-graph further satisfies at least one of:

adding a time axis to the entity attribute with time update, and displaying the time of the update on the time axis;

2. The method of claim 1, wherein after the multi-layer neural network is used to perform named entity recognition on each entity in the entity set, and after the entity context of each entity is obtained, and before the association relationship between each entity is extracted from the entity context of each entity, the method further comprises:

3. The method of claim 1, wherein after extracting the association between the entities from the entity context of each entity, and before obtaining the entity training vector according to the entity context of each entity and the association between each entity, the method further comprises:

4. The method according to any one of claims 1 to 3, wherein the data source comprises a first data table and a second data table, the plurality of entities comprises at least one first entity and at least one second entity, the first entity belongs to the first data table, the second entity belongs to the second data table, the knowledge graph comprises at least two connection graphs, and an offspring relationship and/or a parent-child relationship exist between the at least two connection graphs.

5. The method of claim 4, wherein at least one of merging, deduplicating and differentiating entities of the same or similar entity type in the knowledge-graph comprises:

if the similarity between the first entity and the second entity is higher than a preset similarity and the first entity and the second entity belong to at least one link graph, combining the first entity and the second entity or deleting the first entity or the second entity from the knowledge graph;

6. The method of claim 4, wherein at least one of merging, deduplicating and differentiating entities of the same or similar entity type in the knowledge-graph comprises:

7. The method according to any one of claims 1, 2, 3, 5 and 6, wherein the performing semantic analysis and cluster analysis on the data source and extracting entity sets and attribute sets from the data source comprises:

performing word segmentation and semantic annotation processing on the linguistic data in the linguistic data set to obtain the entity set and the attribute set;

labeling the incidence relation type among the entities in the entity set;

8. An apparatus for creating a knowledge graph, the apparatus comprising:

at least one processor, memory, receiver, and transmitter;

wherein the memory is configured to store program code and the processor is configured to invoke the program code stored in the memory to perform the method of any of claims 1-7.

9. A computer storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1-7.