CN107665252B - Method and device for creating knowledge graph - Google Patents

Publication number
CN107665252B
Authority
CN
China
Prior art keywords
entity
entities
graph
knowledge
attributes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710890548.1A
Other languages
Chinese (zh)
Other versions
CN107665252A (en
Inventor
毛瑞彬
朱菁
张俊
王仁勇
邓永翠
赵洪杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHENZHEN SECURITIES INFORMATION CO Ltd
Original Assignee
SHENZHEN SECURITIES INFORMATION CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHENZHEN SECURITIES INFORMATION CO Ltd
Priority to CN201710890548.1A
Publication of CN107665252A
Application granted
Publication of CN107665252B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/313Selection or weighting of terms for indexing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and a device for creating a knowledge graph are provided. The method is applied to a data analysis device and comprises the following steps: acquiring a data source, wherein the data source comprises a plurality of entities; performing semantic analysis and cluster analysis on the data source, and extracting an entity set and an attribute set from the data source, wherein the attribute set comprises the entity attributes of each entity in the entity set; acquiring the association relationship between each entity in the entity set and its attributes; and creating and outputting a knowledge graph according to the entity set, the attribute set and the association relationships between the entities and the attributes, wherein the knowledge graph comprises the entities, the entity attributes, the association relationships between the entities and the attributes, and the association relationships between the entities. With this scheme, the knowledge graph can be created accurately, and the relationships between entities and attributes and between entities can be presented visually.

Description

Method and device for creating knowledge graph
Technical Field
The application relates to the technical field of big data processing, in particular to a method and a device for creating a knowledge graph.
Background
A knowledge graph is a visual map of a knowledge domain: a series of graphs that display the development process and structural relationships of knowledge. It can be used to mine, analyze, construct and display knowledge, and to present the association relationships between knowledge resources and their carriers. Knowledge graphs can be used, for example, for intelligent question answering by intelligent robots, and have a wide range of applications.
However, in existing mechanisms, constructing a knowledge graph involves analyzing all entities acquired from a data source and then establishing association relationships between all of those entities and their entity attributes. Although a knowledge graph constructed this way covers a wide range, it cannot present the important structural relationships to users intuitively. As a result, effective information cannot be identified quickly when the knowledge graph is managed, users must spend a long time on analysis, and the reference value of the graph in use is limited because key structural information is not presented in a targeted manner.
Disclosure of Invention
The application provides a method and a device for creating a knowledge graph, which can solve the problem in the prior art that the constructed knowledge graph is poorly targeted.
The application provides a method for creating a knowledge graph, which is applied to a data analysis device and comprises the following steps:
obtaining a data source, the data source comprising a plurality of entities;
performing semantic analysis and cluster analysis on the data source, and extracting an entity set and an attribute set from the data source, wherein the attribute set comprises entity attributes of all entities in the entity set;
acquiring the association relationship between each entity in the entity set and its attributes;
and creating and outputting a knowledge graph according to the entity set, the attribute set and the association relationship between the entities and the attributes, wherein the knowledge graph comprises the entities, the entity attributes, the association relationship between the entities and the attributes and the association relationship between the entities.
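As an illustration only, the four parts of the output described above (entities, entity attributes, entity-attribute associations and entity-entity associations) could be held in a structure like the following. This is a minimal sketch, not the patented implementation; the class name `KnowledgeGraph`, its methods and the sample entities are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    """Hypothetical container for the parts of the graph the method outputs."""
    entities: set = field(default_factory=set)
    # entity -> {attribute name: attribute value}; this mapping doubles as the
    # entity-attribute association
    attributes: dict = field(default_factory=dict)
    # (head entity, relation type, tail entity) triples
    relations: set = field(default_factory=set)

    def add_entity(self, name, **attrs):
        self.entities.add(name)
        self.attributes.setdefault(name, {}).update(attrs)

    def add_relation(self, head, relation, tail):
        # Both endpoints must already be known entities.
        if head not in self.entities or tail not in self.entities:
            raise KeyError("unknown entity in relation")
        self.relations.add((head, relation, tail))

    def neighbors(self, name):
        """Entities directly associated with `name`, in either direction."""
        out = {t for h, _, t in self.relations if h == name}
        out |= {h for h, _, t in self.relations if t == name}
        return out


# Sample data is invented for illustration.
kg = KnowledgeGraph()
kg.add_entity("Company A", product="widgets")
kg.add_entity("Employee X", position="director")
kg.add_relation("Employee X", "works_at", "Company A")
```

Storing relations as triples keeps the entity-entity associations separate from the per-entity attribute mapping, which mirrors the distinction the method draws between the two kinds of association.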
In some possible designs, the method further comprises:
and vectorizing each entity in the entity set to obtain a training vector.
In some possible designs, the vectorizing each entity in the entity set to obtain a training vector includes:
performing named entity recognition on each entity in the entity set by using a multilayer neural network to obtain an entity context of each entity;
extracting the association relationships between the entities from the entity context of each entity;
and obtaining the training vector according to the entity context of each entity and the association relationships between the entities.
In some possible designs, after performing named entity recognition on each entity in the entity set by using the multilayer neural network to obtain the entity context of each entity, and before extracting the association relationships between the entities from the obtained entity contexts, the method further includes:
maximizing the obtained entity context of each entity by using a maximum log-likelihood method.
In some possible designs, after extracting the association relationships between the entities from the entity context of each entity, and before obtaining the training vector according to the entity context of each entity and the association relationships between the entities, the method further includes:
maximizing the obtained association relationships between the entities by using a maximum log-likelihood method.
In some possible designs, the extracting the association relationship between the entities from the entity context of the entities includes:
labeling the association relationships of each entity in the entity set according to the attribute set, the entity set and a recurrent neural network model, wherein each labeled association relationship comprises the position of a word in the entity, the association relationship type and the position of the association relationship;
calculating the weight value of each association relationship type by using an association relationship embedding method;
screening candidate association relationships from the labeled association relationships according to a closest-distance principle and the association relationship types;
and classifying the screened candidate association relationships according to the keyword pairs of the association relationship types to obtain the association relationships between the entities.
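The screening step can be pictured with a small sketch. It assumes, as one possible reading of the closest-distance principle, that each labeled relation carries a token position and a type, and that each type links a fixed pair of entity types. The table `TYPE_PAIRS`, the dictionary fields and the sample mentions are hypothetical, and the neural labeling step is replaced by hand-written input.

```python
# Hypothetical mapping from relation type to the entity-type pair it links.
TYPE_PAIRS = {
    "employment": ("person", "company"),
    "investment": ("company", "company"),
}


def nearest(mentions, position, wanted_type, exclude=None):
    """Return the mention of `wanted_type` closest to `position`, or None."""
    pool = [m for m in mentions if m["type"] == wanted_type and m is not exclude]
    return min(pool, key=lambda m: abs(m["pos"] - position), default=None)


def screen_candidates(mentions, triggers):
    """Pair each labeled relation with its nearest type-compatible entities."""
    candidates = []
    for trig in triggers:
        pair = TYPE_PAIRS.get(trig["type"])
        if pair is None:
            continue  # unknown relation type: no candidate is produced
        head = nearest(mentions, trig["pos"], pair[0])
        tail = nearest(mentions, trig["pos"], pair[1], exclude=head)
        if head is not None and tail is not None:
            candidates.append((head["text"], trig["type"], tail["text"]))
    return candidates


# Token positions below are invented for illustration.
mentions = [
    {"text": "Employee X", "type": "person", "pos": 0},
    {"text": "Company A", "type": "company", "pos": 4},
    {"text": "Company B", "type": "company", "pos": 12},
]
triggers = [{"type": "employment", "pos": 2}, {"type": "investment", "pos": 7}]
relations = screen_candidates(mentions, triggers)
```

Here the employment relation at position 2 is attached to the nearest person and the nearest company, and the investment relation at position 7 to the two nearest distinct companies.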
In some possible designs, after the extracting the entity set and the attribute set from the data source and before the obtaining the association relationship between each entity and the attribute in the entity set, the method further includes:
calculating the weight value of each entity in the entity set according to the entity attribute of the entity;
and sorting the attributes of the entities in the entity set according to the weight values of the entities.
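A minimal sketch of the weighting and sorting steps follows. The text states only that the weight reflects importance and may be user-defined, so the scoring rule here (summing hypothetical per-attribute scores from `ATTRIBUTE_WEIGHTS`, with a default of 0.5 for unlisted attribute types) is an assumption made for illustration.

```python
# Hypothetical per-attribute importance scores chosen for this example.
ATTRIBUTE_WEIGHTS = {"announcement": 3.0, "patent": 2.0, "news": 1.0}


def entity_weight(attributes):
    """Sum the scores of the attribute types an entity carries."""
    return sum(ATTRIBUTE_WEIGHTS.get(a, 0.5) for a in attributes)


def rank_entities(entity_attributes):
    """Return (entity, weight) pairs, heaviest first; ties keep input order."""
    weighted = [(e, entity_weight(attrs)) for e, attrs in entity_attributes.items()]
    return sorted(weighted, key=lambda pair: pair[1], reverse=True)


ranked = rank_entities({
    "Employee X": ["news"],
    "Company A": ["announcement", "patent", "news"],
    "Company B": ["patent", "recruiting"],
})
# ranked == [("Company A", 6.0), ("Company B", 2.5), ("Employee X", 1.0)]
```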
In some possible designs, the method further comprises:
and calculating the similarity between entities through entity attribute embedding, and performing at least one of merging, deduplication and differentiation on entities of the same or similar entity type in the knowledge graph.
In some possible designs, the data source includes a first data table and a second data table, the plurality of entities includes at least one first entity and at least one second entity, the first entity belongs to the first data table, the second entity belongs to the second data table, and the knowledge graph includes at least two connected graphs, and an offspring relationship and/or a parent-offspring relationship exist between the at least two connected graphs.
In some possible designs, the at least one of merging, deduplicating, and differentiating entities of the same or similar entity type in the knowledge-graph includes:
if the similarity between the first entity and the second entity is higher than a preset similarity and the first entity and the second entity belong to at least one connected graph, merging the first entity and the second entity, or deleting the first entity or the second entity from the knowledge graph;
and if the similarity between the first entity and the second entity is higher than the preset similarity and the first entity and the second entity are determined to belong to any one connected graph, distinguishing the first entity from the second entity in the knowledge graph.
In some possible designs, the at least one of merging, deduplicating, and differentiating entities of the same or similar entity type in the knowledge-graph includes:
if the similarity between the first entity and the second entity is higher than a preset similarity and the first entity and the second entity belong to at least one connected graph, determining a first entity set directly associated with the first entity and a second entity set directly associated with the second entity;
when it is determined that the intersection of the first set of entities and the second set of entities includes at least two entities, the first entity and the second entity are merged, or the first entity or the second entity is deleted from the knowledge-graph.
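The merge test just described can be sketched as three checks applied in order: similarity above a preset value, membership in a common connected graph, and at least two shared direct neighbors. In the sketch below, cosine similarity over attribute embedding vectors is an assumed choice of similarity measure, and the threshold 0.9, the vectors and the entity labels are hypothetical.

```python
import math
from collections import deque


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def connected(adj, start, goal):
    """Breadth-first search: are `start` and `goal` in one connected graph?"""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def should_merge(a, b, adj, vectors, threshold=0.9, min_shared=2):
    """Apply the similarity, connectivity and shared-neighbor tests in order."""
    if cosine(vectors[a], vectors[b]) <= threshold:
        return False
    if not connected(adj, a, b):
        return False
    shared = set(adj.get(a, ())) & set(adj.get(b, ()))
    return len(shared) >= min_shared


# Same-named records from three hypothetical data tables.
adj = {
    "X@table1": {"Company A", "Company B"},
    "X@table2": {"Company A", "Company B"},
    "X@table3": {"Company C"},
    "Company A": {"X@table1", "X@table2"},
    "Company B": {"X@table1", "X@table2"},
    "Company C": {"X@table3"},
}
vectors = {
    "X@table1": [1.0, 0.0, 1.0],
    "X@table2": [1.0, 0.1, 1.0],
    "X@table3": [1.0, 0.0, 1.0],
}
```

With this data, `should_merge("X@table1", "X@table2", adj, vectors)` holds because the two records are similar, connected through Company A, and share two neighbors, whereas `X@table3` sits in a separate connected graph and is not merged despite having an identical vector.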
In some possible designs, the knowledge graph is based on a time dimension, and the connected graph in each time window of the time dimension is a snapshot of the association relationships between the entities and of the entity attributes within that time window.
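One simple way to picture the time-windowed snapshots is to bucket dated relations into fixed-width windows. This sketch assumes integer timestamps (years here) and a fixed window width; the record layout and sample records are hypothetical.

```python
from collections import defaultdict


def snapshots(dated_relations, window):
    """Group (time, head, relation, tail) records into fixed-width windows."""
    by_window = defaultdict(set)
    for t, head, relation, tail in dated_relations:
        # Integer division assigns each timestamp to its window index.
        by_window[t // window].add((head, relation, tail))
    return dict(by_window)


records = [
    (2015, "Employee X", "works_at", "Company A"),
    (2016, "Company A", "invests_in", "Company B"),
    (2021, "Employee X", "works_at", "Company B"),
]
by_window = snapshots(records, window=5)
```

Each value in `by_window` is then the set of associations visible in one window, so the graph for a given period can be rendered without touching the others.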
In some possible designs, the knowledge-graph further satisfies at least one of:
in the knowledge graph, the entities with the association relationship are displayed in a gradient manner from strong to weak according to the strength of the association relationship;
highlighting a specific entity in the knowledge graph, wherein the specific entity marks a risk assessment value, and the specific entity is an entity with a risk assessment value higher than a preset risk assessment value;
when an entity in the knowledge-graph is updated, distinguishing the updated entity;
adding a time axis to entity attributes that are updated over time, and displaying the update times on the time axis;
and for the entity attributes of the same entity, shading the attributes from dark to light in descending order of their weight values.
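Two of the display rules above, highlighting entities whose risk assessment value exceeds a preset value and shading attributes by weight, can be sketched as follows. The gray-level mapping and the sample scores are assumptions chosen for illustration; the text does not prescribe a color scheme.

```python
def style_entities(risk_scores, risk_threshold):
    """Mark entities whose risk score exceeds the preset threshold."""
    return {
        e: {"highlight": s > risk_threshold, "risk": s}
        for e, s in risk_scores.items()
    }


def shade_attributes(attr_weights):
    """Map attribute weights to gray levels: heavier weight, darker shade."""
    lo, hi = min(attr_weights.values()), max(attr_weights.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all weights match
    shades = {}
    for name, w in attr_weights.items():
        level = round(255 * (1 - (w - lo) / span))  # 0 is the darkest level
        shades[name] = "#{0:02x}{0:02x}{0:02x}".format(level)
    return shades


styles = style_entities({"Company A": 0.8, "Company B": 0.2}, risk_threshold=0.5)
shades = shade_attributes({"announcement": 3.0, "patent": 2.0, "news": 1.0})
```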
In some possible designs, the performing semantic analysis and cluster analysis on the corpus set and extracting an entity set and an attribute set from the corpus set includes:
performing word segmentation and semantic annotation processing on the corpora in the corpus set to obtain the entity set and the attribute set;
labeling the association relationship types between the entities in the entity set;
adjusting the entity set and the attribute set respectively based on a conditional random field model, and predicting each entity in the entity set and each attribute in the attribute set respectively, to obtain the association relationship types between the entities and a mapping between the entities and the attributes.
A second aspect of the present application provides an apparatus for creating a knowledge-graph having functions to implement a method of creating a knowledge-graph corresponding to the first aspect provided above. The functions can be realized by hardware, and the functions can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the above functions, which may be software and/or hardware.
In one possible design, the means for creating a knowledge-graph includes:
a transceiver module for acquiring a data source, the data source comprising a plurality of entities;
a processing module for performing semantic analysis and cluster analysis on the data source acquired by the transceiver module, and extracting an entity set and an attribute set from the data source, wherein the attribute set comprises the entity attributes of each entity in the entity set; acquiring the association relationship between each entity in the entity set and its attributes; and creating and outputting a knowledge graph according to the entity set, the attribute set and the association relationships between the entities and the attributes, wherein the knowledge graph comprises the entities, the entity attributes, the association relationships between the entities and the attributes, and the association relationships between the entities.
In some possible designs, the processing module is further configured to:
vectorize each entity in the entity set to obtain a training vector.
In some possible designs, the processing module is further configured to:
perform named entity recognition on each entity in the entity set by using a multilayer neural network to obtain an entity context of each entity;
extract the association relationships between the entities from the entity context of each entity;
and obtain the training vector according to the entity context of each entity and the association relationships between the entities.
In some possible designs, after performing named entity recognition on each entity in the entity set by using the multilayer neural network to obtain the entity context of each entity, and before extracting the association relationships between the entities from the obtained entity contexts, the processing module is further configured to:
maximize the obtained entity context of each entity by using a maximum log-likelihood method.
In some possible designs, after extracting the association relationships between the entities from the entity context of each entity, and before obtaining the training vector according to the entity context of each entity and the association relationships between the entities, the processing module is further configured to:
maximize the obtained association relationships between the entities by using a maximum log-likelihood method.
In some possible designs, the processing module is specifically configured to:
label the association relationships of each entity in the entity set according to the attribute set, the entity set and a recurrent neural network model, wherein each labeled association relationship comprises the position of a word in the entity, the association relationship type and the position of the association relationship;
calculate the weight value of each association relationship type by using an association relationship embedding method;
screen candidate association relationships from the labeled association relationships according to a closest-distance principle and the association relationship types;
and classify the screened candidate association relationships according to the keyword pairs of the association relationship types to obtain the association relationships between the entities.
In some possible designs, after the processing module extracts the entity set and the attribute set from the data source, and before acquiring the association relationship between each entity and the attribute in the entity set, the processing module is further configured to:
calculate the weight value of each entity in the entity set according to the entity attributes of the entity;
and sort the attributes of the entities in the entity set according to the weight values of the entities.
In some possible designs, the processing module is further configured to:
calculate the similarity between entities through entity attribute embedding, and perform at least one of merging, deduplication and differentiation on entities of the same or similar entity type in the knowledge graph.
In some possible designs, the data source includes a first data table and a second data table, the plurality of entities includes at least one first entity and at least one second entity, the first entity belongs to the first data table, the second entity belongs to the second data table, and the knowledge graph includes at least two connected graphs, and an offspring relationship and/or a parent-offspring relationship exist between the at least two connected graphs.
In some possible designs, the processing module is specifically configured to:
if the similarity between the first entity and the second entity is higher than a preset similarity and the first entity and the second entity belong to at least one connected graph, merge the first entity and the second entity, or delete the first entity or the second entity from the knowledge graph;
and if the similarity between the first entity and the second entity is higher than the preset similarity and the first entity and the second entity are determined to belong to any one connected graph, distinguish the first entity from the second entity in the knowledge graph.
In some possible designs, the processing module is specifically configured to:
if the similarity between the first entity and the second entity is higher than a preset similarity and the first entity and the second entity belong to at least one connected graph, determine a first entity set directly associated with the first entity and a second entity set directly associated with the second entity;
when it is determined that the intersection of the first set of entities and the second set of entities includes at least two entities, the first entity and the second entity are merged, or the first entity or the second entity is deleted from the knowledge-graph.
In some possible designs, the knowledge graph is based on a time dimension, and the connected graph in each time window of the time dimension is a snapshot of the association relationships between the entities and of the entity attributes within that time window.
In some possible designs, the knowledge-graph further satisfies at least one of:
in the knowledge graph, the entities with the association relationship are displayed in a gradient manner from strong to weak according to the strength of the association relationship;
highlighting a specific entity in the knowledge graph, wherein the specific entity marks a risk assessment value, and the specific entity is an entity with a risk assessment value higher than a preset risk assessment value;
when an entity in the knowledge-graph is updated, distinguishing the updated entity;
adding a time axis to entity attributes that are updated over time, and displaying the update times on the time axis;
and for the entity attributes of the same entity, shading the attributes from dark to light in descending order of their weight values.
In some possible designs, the processing module is specifically configured to:
perform word segmentation and semantic annotation processing on the corpora in the corpus set to obtain the entity set and the attribute set;
label the association relationship types between the entities in the entity set;
adjust the entity set and the attribute set respectively based on a conditional random field model, and predict each entity in the entity set and each attribute in the attribute set respectively, to obtain the association relationship types between the entities and a mapping between the entities and the attributes.
Yet another aspect of the present application provides an apparatus for creating a knowledge-graph comprising at least one connected processor, memory, transmitter, and receiver, wherein the memory is configured to store program code, and the processor is configured to invoke the program code in the memory to perform the method of the above-described aspects.
Yet another aspect of the present application provides a computer storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of the above-described aspects.
Compared with the prior art, in the scheme provided by the application, semantic analysis and cluster analysis are performed on the acquired data source, and the entity set and the attribute set are extracted from the data source. The weight value of each entity in the entity set is calculated according to the entity attributes of the entity, and the attributes of the entities in the entity set are sorted according to the weight values of the entities. The association relationships between the entities in the entity set and the attributes are then established, and a knowledge graph is created and output according to the entity set, the attribute set and the association relationships between the entities and the attributes. The knowledge graph comprises entities, entity attributes, association relationships between the entities and the attributes, and association relationships between the entities. With this scheme, the knowledge graph can be created accurately, and the relationships between entities and attributes and between entities can be presented visually. The scheme also facilitates personalized visual display and unified management.
Drawings
FIG. 1-1 is a schematic flow chart of creating a knowledge-graph in an embodiment of the invention;
FIG. 1-2 is a schematic flow chart of constructing a first function according to an embodiment of the present invention;
FIG. 1-3 is a schematic structural diagram of entities and the association relationships between entities according to an embodiment of the present invention;
FIG. 1-4 is a schematic flow chart of constructing a second function according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart illustrating the creation of an enterprise knowledge graph in accordance with an embodiment of the present invention;
FIG. 3 is a diagram illustrating employees who share the same full name in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a structure of an enterprise knowledge graph in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram of an embodiment of an apparatus for creating a knowledge-graph;
FIG. 6 is a schematic diagram of an embodiment of a physical apparatus for performing knowledge graph creation.
Detailed Description
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprise," "include," and "have," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules expressly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus. The division of modules presented herein is merely a logical division and may be implemented in other ways in a practical application; for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the couplings, direct couplings or communicative connections shown or discussed may be made through some interfaces, and indirect couplings or communicative connections between modules may be electrical or take other similar forms; the present application is not limited in this respect. The modules or sub-modules described as separate components may or may not be physically separate, may or may not be physical modules, or may be distributed across a plurality of circuit modules, and some or all of the modules may be selected according to actual needs to achieve the purpose of the scheme of the present application.
The application provides a method and a device for creating a knowledge graph, which are applicable to the technical field of big data. The details are described below.
Referring to FIG. 1-1, a method for creating a knowledge graph according to the present application is illustrated below. The method is applied to a data analysis apparatus, which may be a server or a terminal device, or an application installed on a server or terminal device; the present application is not limited in this respect. The method mainly comprises the following steps:
101. Acquire a data source.
The data source includes a plurality of entities. The data source may be text data such as news, posts and popular articles, and the text data may take the form of a data table or another form; the present application is not limited in this respect. The data source may also be referred to as a corpus set. The data source can be obtained by using a crawler to crawl online news, announcements, legal documents, industrial and commercial websites, official enterprise websites, personal homepages, encyclopedias and the like. The data source may also be any device that collects and sends data, such as a terminal device, which may be a smart phone, a tablet computer, a laptop computer, a desktop computer, or a crawler server.
102. Perform semantic analysis and cluster analysis on the data source, and extract an entity set and an attribute set from the data source.
The attribute set includes the entity attributes of each entity in the entity set.
For example, when creating an enterprise knowledge graph, entities may be employee names and enterprise names, and entity attributes may be enterprise attributes and employee attributes.
The entity attributes of an employee may include position, gender, educational background, awards, employee level, resume, patents, news, events and the like.
The entity attributes of an enterprise may include announcements, news, legal documents, intellectual property, products, qualifications, official website, recruitment, administrative penalties, research teams and events, stock code, shareholder information, investments, senior executives and the like.
Semantic analysis refers to examining and processing the semantics of the text according to the grammatical categories recognized by a syntax analyzer, in order to obtain the substantive meaning of the text.
Cluster analysis refers to the process of grouping a collection of physical or abstract objects into classes composed of similar objects.
103. Acquire the association relationship between each entity in the entity set and its attributes.
Optionally, in some embodiments, after step 102 and before step 103, the method may further include the following steps:
Calculating the weight value of each entity in the entity set according to the entity attributes of the entity.
The weight value of an entity refers to the importance of the entity in the whole knowledge graph to be created, and can be user-defined. The weight value may also be obtained by computing an embedding of the association relationships.
Sorting the attributes of the entities in the entity set according to the weight values of the entities.
104. Create and output a knowledge graph according to the entity set, the attribute set and the association relationships between the entities and the attributes.
The knowledge-graph may include entities, entity attributes, associations between entities and attributes, and associations between entities.
The knowledge graph serves as prior knowledge that provides a detailed structured summary of the entities contained in a user query or in the returned answers. It mainly comprises concepts, concept hierarchies, attributes, attribute value types, relations, the concept set of each relation's domain (Domain) and the concept set of each relation's range (Range).
The knowledge graph can cover most common-sense knowledge by aggregating structured data from encyclopedia sites and various vertical sites. The description of entities is enriched by extracting attribute-value pairs of related entities from various semi-structured data (in the form of HTML tables). In addition, new entities or new entity attributes are discovered through search logs (query logs) to continually expand the coverage of the knowledge graph. Compared with common-sense knowledge, the knowledge data obtained by data mining and extraction is larger in volume, better reflects the query requirements of current users, and allows the latest entities or facts to be found in time.
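Extraction of attribute-value pairs from an HTML table can be sketched with Python's standard `html.parser` module. The sketch assumes a simple infobox layout in which each useful row has exactly two cells (attribute name, then value); the class name `InfoboxParser` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class InfoboxParser(HTMLParser):
    """Collect two-cell table rows as attribute-value pairs."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            # Only rows with exactly two cells are treated as a pair.
            if len(self._row) == 2:
                self.pairs[self._row[0]] = self._row[1]
            self._row = None


html = """
<table>
  <tr><th>Name</th><td>Company A</td></tr>
  <tr><th>Founded</th><td>1999</td></tr>
  <tr><td colspan="2">footer row, ignored</td></tr>
</table>
"""
parser = InfoboxParser()
parser.feed(html)
```

After feeding the sample markup, `parser.pairs` holds the two attribute-value pairs, and the single-cell footer row is skipped.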
When the knowledge graph is built, the redundancy of the Internet can be exploited: in subsequent mining, the confidence of the extracted knowledge is evaluated by voting or other aggregation algorithms, and the knowledge is added to the preset knowledge graph after manual review. The various candidate entities (concepts) required for constructing the knowledge graph, together with their attribute associations, are extracted from various types of data sources to form isolated extraction graphs.
For example, when the knowledge graph is an enterprise knowledge graph, the process of creating it may follow the flowchart shown in fig. 2. The enterprise knowledge graph may include inter-enterprise associations, inter-employee associations, enterprise attributes, employee attributes, and enterprise-employee associations. Employees and enterprises are mapped to nodes of the enterprise knowledge graph.
The association relationships between enterprises include outbound investment, shareholding, customer, and competition relationships, etc.; the association relationships between enterprises and employees include positions held, shareholding, legal representation, etc.; the association relationships among employees include colleague, classmate, relative, and rival relationships, etc.; the attributes of an enterprise include announcements, news, legal documents, intellectual property, products, qualifications, official website, recruitment, administrative penalties, research teams, events, etc.; the attributes of an employee include education, experience, patents, news, events, etc.
Therefore, the knowledge graph in the embodiment of the application can be used for showing the analysis result of the big data and can also be used for a search engine. For example, after receiving a search request of a user at a later stage, the server performs semantic analysis on the search request based on the obtained knowledge graph, and then returns an answer to the user. To a certain extent, the answer with higher accuracy can be output for the user, and the speed of responding to the user can be increased. Moreover, the search based on the knowledge graph can reduce the operation load of the search engine and improve the performance of the search engine.
Compared with the existing mechanism, in the embodiment of the present application the data analysis apparatus can perform semantic analysis and cluster analysis on the acquired data source, extract the entity set and the attribute set from the data source, and calculate the weight value of each entity in the entity set according to the entity attributes of that entity. It then sorts the attributes of the entities in the entity set according to the weight values of the entities, creates the association relationships between the entities and the attributes in the entity set, and creates and outputs a knowledge graph according to the entity set, the attribute set, and the association relationships between the entities and the attributes. The knowledge graph includes entities, entity attributes, association relationships between entities and attributes, and association relationships between entities. With this scheme, the knowledge graph can be created accurately, and the relationships between entities and attributes and the association relationships between entities can be presented visually, which facilitates both personalized visual display and unified management.
Optionally, in some embodiments of the invention, after step 103 and before step 104, the method further includes:
Vectorizing each entity in the entity set to obtain a training vector.
Specifically, the following operations may be included:
(a) Use a multilayer neural network to perform named entity recognition on each entity in the entity set to obtain the entity context of each entity.
The entity context refers to the set of words around the entity; for example, the context of entity e can be denoted context(e).
In some embodiments, the maximum log-likelihood method may further be used to optimize the entity context obtained for each entity. For example, a first function may be constructed by maximum log-likelihood and then maximized. The first function is as follows:
[Formula image BDA0001421097350000121: the first function]
where C is the corpus, θ is the set of undetermined parameters, and F(e, context(e), θ) denotes the first function, which may be constructed by a multilayer neural network, for example the one shown schematically in fig. 1-2.
In this step, the entity context corresponding to each entity is finally obtained by maximizing the first function.
(b) Extract the association relationships among the entities from the entity context of each entity.
Specifically, the context of each entity obtained in step (a) is used as the input of the system for extracting association relationships between entities, and the system finally outputs a set of expressions of the association relationships between the entities. For example, fig. 1-3 illustrates the structure of the entities and the association relationships between them: in fig. 1-3, fe_1 to fe_l, e_0, and be_1 to be_s are all entities, and w_1 to w_l and v_1 to v_s all represent association relationships. For example, w_1 represents the association relationship between fe_1 and e_0.
For an entity e_0, {fe_i; i = 1, ..., l} are the front entities (front NEs) connected to e_0 in all of its association relationships, and {be_j; j = 1, ..., s} are the back entities (back NEs) connected to e_0 in all of its association relationships. For example, for the corpus sentence "Facebook acquired Source3, a content copyright startup", in the "acquisition" relationship "Facebook" is the front entity of the named entity "Source3", and "Source3" is the back entity of the named entity "Facebook". {ω_i; i = 1, ..., l} and {υ_j; j = 1, ..., s} are relationship weights (relations), determined according to the financial activity priority level.
(c) Obtain the training vector according to the entity context of each entity and the association relationships between the entities.
In some embodiments, the association relationships obtained between the entities may further be optimized by the maximum log-likelihood method.
Specifically, based on the embodiments shown in fig. 1-2 and fig. 1-3, a second function may be constructed by maximum log-likelihood and then maximized. The second function is as follows:
[Formula image BDA0001421097350000131: the second function]
where C is the corpus, λ is the set of undetermined parameters, and G(e, front(e), back(e), λ) denotes the second function, which can likewise be constructed by a multilayer neural network, for example the one shown schematically in fig. 1-4.
In step (c), a representation of the association relationships between the entities (which may be written EntityRep_{e;relation}) is finally generated; it is obtained by training on the context relationships of the entities. Finally, the obtained context of each entity and the association relationships between the entities are combined by a Kronecker product to form the final training vector of the entity, which can be expressed as follows:
[Formula image BDA0001421097350000132: the training vector]
Through vectorization, the entities are converted into a distributed representation. Under this representation, associated or similar entities lie close to each other, which makes the text computable and allows knowledge extraction and association-relationship reasoning to be carried out on the knowledge graph subsequently.
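As a small illustration of the combination step, the sketch below forms the Kronecker product of two vectors with NumPy. The exact expression used in the application is contained in a formula image that is not reproduced here, so the function name `training_vector` and the choice of which two vectors are multiplied are assumptions.

```python
import numpy as np


def training_vector(context_vec, relation_vec) -> np.ndarray:
    """Combine a context representation and a relation representation.

    For 1-D inputs of lengths m and n, np.kron returns a 1-D vector of
    length m * n that contains every pairwise product of the components.
    """
    context = np.asarray(context_vec, dtype=float)
    relation = np.asarray(relation_vec, dtype=float)
    return np.kron(context, relation)


# Toy vectors only; real embeddings would come from the trained networks.
combined = training_vector([1.0, 2.0], [3.0, 4.0, 5.0])
```

Because the output length is the product of the two input lengths, in practice the two representations would need to be kept fairly low-dimensional.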
Optionally, in some embodiments of the present invention, extracting the association relationships between the entities from the entity context of each entity includes:
(a) Taking the attribute set and the entity set as input, label the association relationships of each entity in the entity set based on a temporal recurrent neural network model, where a labeled association relationship includes the position of the word in the entity, the type of the association relationship, and the position of the association relationship; the position of the association relationship refers to its position in the knowledge graph.
(b) Calculate the weight value of each relationship type using an association relationship embedding method.
(c) Screen candidate association relationships from the labeled association relationships according to the closest-distance principle and the association relationship type.
(d) Classify the screened candidate association relationships according to the keyword pairs of the association relationship types to obtain the association relationships between the entities.
Therefore, by extracting the association relationships between entities from text data, the latent semantic relationships among the entities can be identified. In some embodiments, the association relationship between entities may be represented by an entity relationship triple (entity 1, association type or association indication information, entity 2). Specifically, the present application can convert the problem of extracting association relationships between entities into a sequence tagging task, and a neural network model can be used to perform the sequence tagging task.
For example, the attribute set and the entity set may be used as input, and the training corpus may then be sequence-labeled based on the neural network model. The final sequence labeling output has two types of result:
One type of output is a label unrelated to the association relationship to be extracted, which may be represented by 'O'.
The other type of output is a label related to the association relationship to be extracted, which may be represented by a relationship label other than 'O'. A relationship label encodes the position of the word in the entity, the type of the association relationship, and the position of the association relationship.
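The labeling scheme described above can be sketched as follows. The concrete tag format `POSITION-RELATION-ROLE` (for example `B-ACQ-1`), the position letters B/I/E/S, and the function names are assumptions chosen for illustration; the source only states which three pieces of information a relationship label carries. Pairing uses the closest-distance principle mentioned in the text.

```python
from typing import List, Tuple

Span = Tuple[int, str, str, str]  # (start index, text, relation type, role)


def decode_spans(tokens: List[str], tags: List[str]) -> List[Span]:
    """Group consecutive tagged tokens into entity spans; 'O' tokens are skipped."""
    spans: List[list] = []  # each item: [start, text, relation, role, last index]
    for i, (token, tag) in enumerate(zip(tokens, tags)):
        if tag == "O":
            continue
        position, relation, role = tag.split("-")
        continues = (
            position in ("I", "E")
            and spans
            and spans[-1][4] == i - 1
            and spans[-1][2:4] == [relation, role]
        )
        if continues:
            spans[-1][1] += " " + token
            spans[-1][4] = i
        else:
            spans.append([i, token, relation, role, i])
    return [(s[0], s[1], s[2], s[3]) for s in spans]


def extract_triples(tokens: List[str], tags: List[str]) -> List[Tuple[str, str, str]]:
    """Pair each role-1 span with the nearest role-2 span of the same relation type."""
    spans = decode_spans(tokens, tags)
    heads = [s for s in spans if s[3] == "1"]
    tails = [s for s in spans if s[3] == "2"]
    triples = []
    for head in heads:
        candidates = [t for t in tails if t[2] == head[2]]
        if candidates:
            nearest = min(candidates, key=lambda t: abs(t[0] - head[0]))
            triples.append((head[1], head[2], nearest[1]))
    return triples
```

With the tags `["S-ACQ-1", "O", "S-ACQ-2"]` on the tokens `["Facebook", "acquired", "Source3"]`, the sketch yields the single triple `("Facebook", "ACQ", "Source3")`.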
Finally, the output layer shown in fig. 1-4 computes the normalized entity label probability from the label prediction vector T_t:
y_t = W_y T_t + b_y,
[Formula image BDA0001421097350000142: normalization of y_t into label probabilities]
where W_y is a weight matrix, N_t is the total number of labels, F(N_t, y_t) denotes the normalization function, and T_t denotes the vectorized label. In addition, because the neural network model can learn long-term dependencies, this decoding approach can model interactions between labels.
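The output layer can be sketched numerically as below. The normalization formula itself is in an image that is not reproduced here; this sketch assumes the normalization function is a softmax over the N_t labels, which is a common choice but is not confirmed by the text. The function name `tag_probabilities` is hypothetical.

```python
import numpy as np


def tag_probabilities(T_t: np.ndarray, W_y: np.ndarray, b_y: np.ndarray) -> np.ndarray:
    """Compute y_t = W_y T_t + b_y and normalize it over the labels.

    Assumption: the normalization is a softmax. The maximum is subtracted
    before exponentiation for numerical stability; this does not change the result.
    """
    y_t = W_y @ T_t + b_y
    exp = np.exp(y_t - np.max(y_t))
    return exp / exp.sum()


# Toy example with three labels and a three-dimensional prediction vector.
probs = tag_probabilities(np.array([1.0, 2.0, 3.0]), np.eye(3), np.zeros(3))
```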
A third function is then constructed using the log-likelihood function. The third function is as follows:
[Formula image BDA0001421097350000141: the third function]
where L_j is the length of the sentence x_j, |D| is the sample size of the training set, the term indexed by t and j is a conditional predicate, Θ is the coefficient space, and α is the bias weight: the larger its value, the greater the influence of the relationship labels on the model. I(O) is a switching function that distinguishes the loss on the 'O' label from the loss on the relationship labels, as follows:
[Formula image BDA0001421097350000151: the function I(O)]
By combining this method with the relationship type and the closest-distance principle, candidate relationship triples can be extracted preliminarily; the candidate entity relationship triples are then screened by relationship-category keywords to obtain the final classified entity relationship triples.
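The role of the bias weight α can be illustrated with a short sketch. The third function is only available as an image, so the form used here is an assumption based on the surrounding description: the log-probability of each gold label is summed over all words of all sentences, with weight 1 for the 'O' label and weight α for relationship labels. The name `biased_log_likelihood` is hypothetical.

```python
import math
from typing import List, Tuple


def biased_log_likelihood(sentences: List[List[Tuple[str, float]]], alpha: float) -> float:
    """Sum weighted log-probabilities of the gold labels.

    Each sentence is a list of (gold_label, predicted_probability_of_gold_label).
    Assumed weighting: 1.0 for 'O', alpha for any relationship label.
    """
    total = 0.0
    for sentence in sentences:
        for label, probability in sentence:
            weight = 1.0 if label == "O" else alpha
            total += weight * math.log(probability)
    return total
```

Training would maximize this quantity; with α greater than 1, errors on relationship labels cost more than errors on 'O' labels, which matches the stated effect of a larger α.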
It should be noted that the formulas corresponding to the first function, the second function and the third function given in the present application are only examples, and modifications may be made on the given formulas, and specific modifications and specific forms of the formulas are not limited in the present application.
Optionally, in some embodiments of the present invention, because there are many different data sources, entities with the same or similar names may exist in the obtained entity set. To simplify the knowledge graph and improve its accuracy, entities with the same or similar names may be merged, or entities that share a name but are in fact different may be distinguished. In an embodiment of the present application, the method further includes:
performing at least one of merging, deduplication, and distinguishing on entities of the same or similar entity type in the knowledge graph.
In some embodiments, the data source may include a first data table and a second data table, the plurality of entities includes at least one first entity and at least one second entity, the first entity belongs to the first data table, the second entity belongs to the second data table, and the knowledge graph includes at least two connected graphs, and an offspring relationship and/or a parent-child relationship exists between the at least two connected graphs.
Specifically, because the entities, association relationships, and other data in the finally constructed knowledge graph may come from different data sources, entities may end up duplicated in the knowledge graph, so deduplication is performed before the graph data is used. First, common synonyms are merged. Then, merging is performed using the connectivity of the knowledge graph itself: if the distance between same-name entities in the entity network is less than N, they are merged into one entity, and all of their relationships are merged as well. The value of N may be adjusted according to how rare the entity name is: the rarer the name, the larger N; the more common the name, the smaller N. The specific value is not limited in the present application.
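The distance-based merging rule can be sketched with a breadth-first search. The graph representation (an adjacency dictionary of node ids) and the names `hop_distance` and `merge_candidates` are assumptions; the sketch only finds the pairs that qualify for merging and leaves the merge of their relationships to the caller.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple


def hop_distance(adj: Dict[str, Set[str]], start: str, goal: str, limit: int) -> Optional[int]:
    """Return the hop count from start to goal, or None if it exceeds limit."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == limit:
            continue  # do not expand beyond the allowed distance
        for nxt in adj.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def merge_candidates(adj: Dict[str, Set[str]], names: Dict[str, str], n: int) -> List[Tuple[str, str]]:
    """List pairs of distinct nodes that share a name and are fewer than n hops apart."""
    by_name: Dict[str, List[str]] = {}
    for node, name in names.items():
        by_name.setdefault(name, []).append(node)
    pairs = []
    for nodes in by_name.values():
        nodes = sorted(nodes)
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                if hop_distance(adj, a, b, n - 1) is not None:
                    pairs.append((a, b))
    return pairs
```

Choosing `n` per name, larger for rare names and smaller for common ones, would follow the adjustment described in the text.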
The merging, deduplication, and distinguishing of entities of the same or similar entity type in the knowledge graph are illustrated below, where the similarity is calculated by entity attribute embedding:
First, if the similarity between the first entity and the second entity is higher than a preset similarity and the first entity and the second entity both belong to at least one connected graph, the first entity and the second entity are merged, or the first entity or the second entity is deleted from the knowledge graph.
Taking an enterprise knowledge graph as an example, as shown in part a of fig. 3, the enterprise knowledge graph includes two entities named "Zhang Long", each with its own employee attributes: for example, one Zhang Long works at company A and also at company B, where company B is a subsidiary of company A, and the other Zhang Long works at company B. The maximal connected graph method can then be used for identification: first, the maximal connected graph is extracted from the enterprise knowledge graph, and then people with the same name within that connected graph are merged, as shown in part b of fig. 3, so that the two "Zhang Long" entities are finally regarded as the same person. If they are not in the same connected graph, they are regarded as different people.
Second, if the similarity between the first entity and the second entity is higher than the preset similarity and it is determined that the first entity and the second entity do not both belong to any one connected graph, the first entity is distinguished from the second entity in the knowledge graph.
Third, if the similarity between the first entity and the second entity is higher than a preset similarity and the first entity and the second entity both belong to at least one connected graph, a first entity set directly associated with the first entity and a second entity set directly associated with the second entity are determined. The first entity set is then compared with the second entity set, and when the intersection of the first entity set and the second entity set is determined to include at least two entities, the first entity and the second entity are merged, or the first entity or the second entity is deleted from the knowledge graph.
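The three cases above can be combined into one decision function, sketched below. The return labels, the `min_shared` parameter, and the function names are assumptions for illustration; setting `min_shared=0` gives the first case (merge whenever both entities are in the same connected graph), and `min_shared=2` gives the third case.

```python
from typing import Dict, Set


def connected_component(adj: Dict[str, Set[str]], start: str) -> Set[str]:
    """Collect every node reachable from start with a depth-first search."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def resolve(adj: Dict[str, Set[str]], a: str, b: str,
            similarity: float, threshold: float, min_shared: int = 2) -> str:
    """Decide how to treat two similar-looking entities a and b."""
    if similarity <= threshold:
        return "not_similar"
    if b not in connected_component(adj, a):
        return "distinguish"  # similar names, but no connecting path
    shared = adj.get(a, set()) & adj.get(b, set())
    return "merge" if len(shared) >= min_shared else "undecided"
```

The "undecided" outcome is not named in the source; it simply marks pairs that are connected but share too few direct neighbors for the stricter rule.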
Optionally, in some embodiments of the invention, the knowledge graph is based on a time dimension, and the connected graph in each time window of the time dimension is a snapshot of the association relationships between entities and of the entity attributes within that time window.
Optionally, in some embodiments of the invention, the knowledge graph further satisfies at least one of the following:
in the knowledge graph, entities having association relationships are displayed in a gradient from strong to weak according to the strength of the association relationship;
a specific entity in the knowledge graph is highlighted, where the specific entity is labeled with a risk assessment value and is an entity whose risk assessment value is higher than a preset risk assessment value;
when an entity in the knowledge graph is updated, the updated entity is displayed distinctively;
a time axis is added to entity attributes that are updated over time, and the change times are displayed on the time axis;
for the entity attributes of the same entity, coloring runs from dark to light in descending order of the attribute weight values.
For example, when the knowledge-graph in the embodiments of the present application is applied to an enterprise knowledge-graph, the enterprise knowledge-graph can visually present each entity, as well as various types of attributes associated with each entity. In order to facilitate management and quickly identify information such as key entities, changed entities and the like, the enterprise knowledge graph can be further processed as follows:
(1) Rank the importance of the different attributes; the importance can be customized for different users.
(2) The attributes may be displayed in order of importance and colored from dark to light using colors of the same color family.
(3) When an entity or an association relationship between entities changes in the knowledge graph, the entity or association relationship whose state has changed may be displayed in a color different from the regular color.
(4) The closeness of the association relationship between entities can be emphasized through the thickness of the connecting lines.
(5) A risk assessment of the enterprise is obtained from major events such as positive and negative news analysis, stock price rises and falls, financial changes, and mergers and acquisitions, and each risk level is marked with a corresponding color.
(6) Listed companies (on different boards) and unlisted companies are represented with different shapes or different colors.
After the processes (1) to (6) are performed on the enterprise knowledge graph, the enterprise knowledge graph shown in fig. 4 can be obtained.
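Process (2), coloring attributes from dark to light within one color family, can be sketched as a linear interpolation between two RGB colors. The specific blue shades and the name `shade_by_weight` are assumptions; any pair of colors from the same family would serve.

```python
from typing import Dict, Tuple

RGB = Tuple[int, int, int]


def shade_by_weight(weights: Dict[str, float],
                    dark: RGB = (8, 48, 107),
                    light: RGB = (198, 219, 239)) -> Dict[str, str]:
    """Map each attribute to a hex color, darkest for the highest weight."""
    ordered = sorted(weights, key=weights.get, reverse=True)
    steps = max(len(ordered) - 1, 1)
    colors = {}
    for rank, name in enumerate(ordered):
        t = rank / steps  # 0.0 for the heaviest attribute, 1.0 for the lightest
        rgb = tuple(round(d + (l - d) * t) for d, l in zip(dark, light))
        colors[name] = "#{:02x}{:02x}{:02x}".format(*rgb)
    return colors
```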
Optionally, in some embodiments of the present invention, performing semantic analysis and cluster analysis on the corpus set and extracting an entity set and an attribute set from the corpus set includes:
Perform word segmentation and semantic annotation on the corpora in the corpus set to obtain the entity set and the attribute set.
Label the association relationship types between the entities in the entity set.
Adjust the entity set and the attribute set respectively based on a sequence labeling model, and predict each entity in the entity set and each attribute in the attribute set respectively, to obtain the association relationship types between the entities and the mapping between the entities and the attributes.
Optionally, in some embodiments of the present invention, the present application further provides an optimized hierarchical layout method for the knowledge graph. The method is as follows:
The layout is centered on a given entity a. The entities in the knowledge graph may be referred to as nodes; for example, entity a may be referred to as the central node. The neighbor entities of entity a are distributed on concentric rings centered on entity a: nodes with a larger hop count from entity a lie on rings farther out, and nodes at the same level lie on the same ring band. To show more nodes per unit of screen area, the nodes on a ring of the same level are further arranged in a multilayer spiral, which prevents the whole graph from growing because the ring radius is too large. In addition, to make adjacent nodes easy to view with a minimal view span, adjacent nodes and all child nodes of the same node are placed as close together as possible. The method mainly includes the following steps:
(a) Calculate the node distance between each node in the network to be visualized and entity a.
(b) Determine the candidate positions of all nodes in the knowledge graph according to configuration such as node distance, node size, node spacing, and interlayer spacing; at this point the positions are only empty slots on which no node has been placed.
(c) Place the central node on the central slot.
(d) Take all neighbor nodes of the central node and place each neighbor node on the empty slot nearest to the central node.
(e) Repeat the operations described in (a) to (d) until all nodes have been placed, that is, all empty slots have been filled.
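A simplified version of the ring layout can be sketched as follows. It computes hop counts with a breadth-first search and spreads the nodes of each level evenly on a circle whose radius grows with the hop count. It deliberately omits the multilayer spiral and the precomputed empty-slot bookkeeping described above; the names `hop_counts`, `ring_layout`, and `layer_gap` are assumptions.

```python
import math
from collections import deque
from typing import Dict, List, Set, Tuple


def hop_counts(adj: Dict[str, Set[str]], center: str) -> Dict[str, int]:
    """Breadth-first search giving each reachable node its hop count from center."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        for nxt in sorted(adj.get(node, ())):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def ring_layout(adj: Dict[str, Set[str]], center: str,
                layer_gap: float = 100.0) -> Dict[str, Tuple[float, float]]:
    """Place the center at the origin and level-d nodes on a circle of radius d * layer_gap."""
    layers: Dict[int, List[str]] = {}
    for node, d in hop_counts(adj, center).items():
        layers.setdefault(d, []).append(node)
    positions = {center: (0.0, 0.0)}
    for d, nodes in layers.items():
        if d == 0:
            continue
        for k, node in enumerate(nodes):
            angle = 2 * math.pi * k / len(nodes)
            positions[node] = (d * layer_gap * math.cos(angle),
                               d * layer_gap * math.sin(angle))
    return positions
```

Because nodes enter each level in breadth-first order, children of the same parent are adjacent in the level list and therefore end up next to each other on the ring, which loosely approximates the "placed as close together as possible" goal.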
Optionally, in some embodiments of the invention, the association relationship between any two nodes in the knowledge graph may be visualized.
In actual business, the association relationship between two entity nodes sometimes needs to be examined. A shortest path algorithm can therefore be used to calculate all shortest paths between the two nodes of any association relationship; the two nodes to be examined are then placed at the left and right ends of the knowledge graph, and N vertical lines are drawn across the connecting line x between the two end nodes so as to divide it equally.
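The source does not name a particular shortest path algorithm. For an unweighted graph, one possible sketch uses a breadth-first search that records every predecessor lying on some shortest path and then walks back from the target; the name `all_shortest_paths` is hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def all_shortest_paths(adj: Dict[str, Set[str]], source: str, target: str) -> List[List[str]]:
    """Return every shortest path from source to target in an unweighted graph."""
    dist = {source: 0}
    parents: Dict[str, List[str]] = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break  # all predecessors of target have already been recorded
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                parents[nxt] = [node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:
                parents[nxt].append(node)
    if target not in dist:
        return []
    paths: List[List[str]] = []

    def walk(node: str, suffix: List[str]) -> None:
        if node == source:
            paths.append([source] + suffix)
            return
        for parent in parents[node]:
            walk(parent, [node] + suffix)

    walk(target, [])
    return paths
```

The hop count of each intermediate node along these paths determines which vertical line it is placed on.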
The other nodes on the shortest paths are placed at random on the corresponding vertical lines according to their hop counts from the two end nodes, and the positions of the intermediate nodes are then adjusted using a force-directed layout algorithm with constraints: during adjustment the level of a node cannot change, and only the position of the node within its level is adjusted. The positions of the intermediate nodes are adjusted as follows:
(1) Each node is placed on the corresponding level according to its distance from the two end nodes; when several nodes share a level, their positions within the level are chosen at random.
(2) Each edge connecting two nodes is treated as a spring. The force between each pair of nodes can be calculated according to Hooke's law, and all forces acting on a node are summed to obtain the net force on that node.
(3) The net force on each node is converted into a corresponding displacement, that is, the node is moved. The movement is constrained by level and position: a node can only move up and down within its own level, and only among several preset fixed positions, which keeps the spacing between nodes uniform and makes the interface more attractive.
(4) After the nodes have moved a certain distance, the force on each node is calculated again, and the operation is repeated until the sum of the forces on the nodes of the whole graph converges to a stable value.
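The constrained adjustment in steps (1) to (4) can be sketched as below. Each node has a fixed level (its column) and an integer slot (its row among the preset positions); only the slot changes. The spring constant, rest length, gap sizes, one-slot-per-iteration movement, and the stopping rule (stop when no node moves) are simplifying assumptions, as is the name `relax_levels`.

```python
import math
from typing import Dict, Iterable, List, Tuple


def relax_levels(level: Dict[str, int], slot: Dict[str, int],
                 edges: List[Tuple[str, str]], n_slots: int,
                 fixed: Iterable[str] = (), x_gap: float = 100.0,
                 y_gap: float = 50.0, rest: float = 100.0, k: float = 1.0,
                 max_iter: int = 100) -> Dict[str, int]:
    """Move nodes vertically, one preset slot at a time, following spring forces."""
    fixed = set(fixed)

    def pos(n: str) -> Tuple[float, float]:
        return (level[n] * x_gap, slot[n] * y_gap)

    for _ in range(max_iter):
        moved = False
        for node in sorted(slot):
            if node in fixed:
                continue
            fy = 0.0  # net vertical force on this node
            for a, b in edges:
                if node not in (a, b):
                    continue
                other = b if node == a else a
                (x1, y1), (x2, y2) = pos(node), pos(other)
                length = math.hypot(x2 - x1, y2 - y1)
                if length == 0:
                    continue
                # Hooke's law: magnitude k * (length - rest), directed along the spring.
                fy += k * (length - rest) * (y2 - y1) / length
            step = (fy > 0) - (fy < 0)
            target = slot[node] + step
            occupied = any(m != node and level[m] == level[node] and slot[m] == target
                           for m in slot)
            if step and 0 <= target < n_slots and not occupied:
                slot[node] = target
                moved = True
        if not moved:
            break
    return slot
```

In the example of a three-node path whose middle node starts two slots away from its neighbors, the middle node is pulled row by row until it is level with them and the springs are at rest.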
Optionally, in some embodiments of the invention, the application may visualize the change between the graphs at two time points, showing with time as a dimension the dynamic changes of the association relationships between entities between the two time points. Specifically, two time points are first selected on the interface of the knowledge graph, for example "time point one" and "time point two", where "time point one" is the earlier time point and "time point two" is the later time point, and both are time points on the same day.
Correspondingly, the network graphs corresponding to time point one and time point two can be obtained, and the node and edge information of the two network graphs is then merged, so that the merged network graph contains the node and edge information of both time point one and time point two. When the knowledge graph is displayed, nodes and association relationships newly added at time point two can be marked with a highlight color, and edges and nodes that exist at time point one but no longer exist at time point two can be marked with dotted lines.
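Merging the two snapshots amounts to a set comparison, sketched below. The status labels and the name `diff_snapshots` are assumptions; a renderer would map "added" to the highlight color and "removed" to the dotted style. The same function works for nodes and for edges, provided edges are given as hashable tuples.

```python
from typing import Dict, Hashable, Iterable


def diff_snapshots(time_one: Iterable[Hashable], time_two: Iterable[Hashable]) -> Dict[Hashable, str]:
    """Label each node or edge of the merged graph by how it changed."""
    first, second = set(time_one), set(time_two)
    status = {}
    for item in first | second:
        if item in first and item in second:
            status[item] = "unchanged"
        elif item in second:
            status[item] = "added"    # present only at time point two
        else:
            status[item] = "removed"  # present only at time point one
    return status
```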
The method of creating a knowledge graph in the present application is described above. An apparatus for performing this method is described below; the apparatus has functions that implement the method of creating a knowledge graph provided in the above method embodiments. The functions can be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware. The data analysis apparatus in the present application may be a server or a terminal device, or an application installed on a server or terminal device; the present application does not specifically limit this. When the apparatus is a terminal device, the terminal device may be a device that provides voice and/or data connectivity to a user, a handheld device with wireless connection capability, or another processing device connected to a wireless modem. A wireless terminal may communicate with one or more core networks via a Radio Access Network (RAN), and may be a mobile terminal such as a mobile phone (or "cellular" phone) or a computer with a mobile terminal, for example a portable, pocket-sized, handheld, computer-built-in, or vehicle-mounted mobile device. Examples of such devices include Personal Communication Service (PCS) phones, cordless phones, Session Initiation Protocol (SIP) phones, Wireless Local Loop (WLL) stations, and Personal Digital Assistants (PDA).
A wireless Terminal may also be referred to as a system, a Subscriber Unit (Subscriber Unit), a Subscriber Station (Subscriber Station), a Mobile Station (Mobile), a Remote Station (Remote Station), an Access Point (Access Point), a Remote Terminal (Remote Terminal), an Access Terminal (Access Terminal), a User Terminal (User Terminal), a Terminal Device, a User Agent (User Agent), a User Device (User Device), or a User Equipment (User Equipment). As shown in fig. 5, the apparatus 50 for creating a knowledge-graph includes:
a transceiver module 501, configured to acquire a data source, where the data source includes multiple entities;
a processing module 502, configured to perform semantic analysis and cluster analysis on the data source obtained by the transceiver module 501, and extract an entity set and an attribute set from the data source, where the attribute set includes entity attributes of each entity in the entity set; calculating the weight value of each entity in the entity set according to the entity attribute of the entity; sorting the attributes of the entities in the entity set according to the weight values of the entities;
creating an incidence relation between each entity and the attribute in the entity set; and creating and outputting a knowledge graph according to the entity set, the attribute set and the association relationship between the entities and the attributes, wherein the knowledge graph comprises the entities, the entity attributes, the association relationship between the entities and the attributes and the association relationship between the entities.
In the embodiment of the present application, the processing module 502 performs semantic analysis and cluster analysis on the data source acquired by the transceiver module 501, extracts an entity set and an attribute set from the data source, and calculates the weight value of each entity in the entity set according to the entity attributes of that entity. It then sorts the attributes of the entities in the entity set according to the weight values of the entities, creates the association relationships between the entities and the attributes in the entity set, and creates and outputs a knowledge graph according to the entity set, the attribute set, and the association relationships between the entities and the attributes. The knowledge graph includes entities, entity attributes, association relationships between entities and attributes, and association relationships between entities. With this scheme, the knowledge graph can be created accurately, and the relationships between entities and attributes and the association relationships between entities can be presented visually, which facilitates both personalized visual display and unified management.
Optionally, the processing module 502 is further configured to:
perform at least one of merging, deduplication, and distinguishing on entities of the same or similar entity type in the knowledge graph.
Optionally, the data source includes a first data table and a second data table, the plurality of entities includes at least one first entity and at least one second entity, the first entity belongs to the first data table, the second entity belongs to the second data table, the knowledge graph includes at least two connected graphs, and a descendant relationship and/or a parent-child relationship exists between the at least two connected graphs.
Optionally, the processing module 502 is specifically configured to:
if the similarity between the first entity and the second entity is higher than a preset similarity and the first entity and the second entity both belong to at least one connected graph, merge the first entity and the second entity, or delete the first entity or the second entity from the knowledge graph;
if the similarity between the first entity and the second entity is higher than the preset similarity and it is determined that the first entity and the second entity do not both belong to any one connected graph, distinguish the first entity from the second entity in the knowledge graph.
Optionally, the processing module 502 is specifically configured to:
if the similarity between the first entity and the second entity is higher than a preset similarity and the first entity and the second entity both belong to at least one connected graph, determine a first entity set directly associated with the first entity and a second entity set directly associated with the second entity;
when it is determined that the intersection of the first entity set and the second entity set includes at least two entities, merge the first entity and the second entity, or delete the first entity or the second entity from the knowledge graph.
Optionally, the knowledge graph is based on a time dimension, and the connected graph in each time window of the time dimension is a snapshot of the association relationships between entities and of the entity attributes within that time window.
Optionally, the knowledge-graph further satisfies at least one of:
in the knowledge graph, the entities with the association relationship are displayed in a gradient manner from strong to weak according to the strength of the association relationship;
highlighting a specific entity in the knowledge graph, wherein the specific entity marks a risk assessment value, and the specific entity is an entity with a risk assessment value higher than a preset risk assessment value;
when an entity in the knowledge-graph is updated, distinguishing the updated entity;
adding a time axis to entity attributes that are updated over time, and displaying the update time on the time axis;
and for the entity attributes of the same entity, shading the entity attributes from dark to light according to their weight values from high to low.
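Two of the display conditions above, weight-based shading and risk-based highlighting, reduce to simple computations. The sketch below is illustrative only: the linear mapping of weights to an opacity between 0.2 and 1.0 is an assumption, as are the names `shade_by_weight` and `entities_to_highlight`.

```python
from typing import Dict, Set


def shade_by_weight(weights: Dict[str, float]) -> Dict[str, float]:
    """Map each attribute weight of one entity to an opacity in [0.2, 1.0];
    a higher weight gives a darker (more opaque) shade."""
    if not weights:
        return {}
    low, high = min(weights.values()), max(weights.values())
    span = high - low
    if span == 0:
        # All attributes weigh the same, so all get the darkest shade.
        return {name: 1.0 for name in weights}
    return {name: 0.2 + 0.8 * (w - low) / span for name, w in weights.items()}


def entities_to_highlight(risk: Dict[str, float], preset: float) -> Set[str]:
    """Return the entities whose risk assessment value exceeds the preset value."""
    return {entity for entity, value in risk.items() if value > preset}
```

The resulting opacities and highlighted set would then be handed to whatever rendering layer draws the graph, which is outside the scope of this sketch.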
Optionally, the processing module 502 is specifically configured to:
performing word segmentation and semantic annotation processing on the corpora in the corpus set to obtain the entity set and the attribute set;
labeling the association relationship types among the entities in the entity set;
and respectively adjusting the entity set and the attribute set based on a sequence labeling model, and respectively predicting each entity in the entity set and each attribute in the attribute set to obtain the association relationship type between the entities and obtain the mapping between the entities and the attributes.
It should be noted that, in the embodiment corresponding to FIG. 5 of the present application, the physical device corresponding to the transceiver module is a transceiver, and the physical device corresponding to the processing module may be a processor. Each of the devices shown in FIG. 5 may have the structure shown in FIG. 6. When a device has the structure shown in FIG. 6, the processor, the transmitter and the receiver in FIG. 6 implement functions identical or similar to those of the processing module, the transmitting module and the receiving module provided in the corresponding device embodiment, and the memory in FIG. 6 stores the program code that the processor calls when executing the above-mentioned method for creating a knowledge graph.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts that are not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the apparatus and the module described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the division of the modules is merely a logical division, and other divisions are possible in actual implementation: multiple modules or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be electrical, mechanical or in another form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may be stored in a computer readable storage medium.
The above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
The technical solutions provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the present application, and the descriptions of the above examples are intended only to help in understanding the method and the core ideas of the present application. Meanwhile, a person skilled in the art may, following the idea of the present application, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as a limitation on the present application.

Claims (9)

1. A method of creating a knowledge-graph, the method being applied to a data analysis device, the method comprising:
obtaining a data source, the data source comprising a plurality of entities;
performing semantic analysis and cluster analysis on the data source, and extracting an entity set and an attribute set from the data source, wherein the attribute set comprises entity attributes of all entities in the entity set;
acquiring an association relationship between each entity in the entity set and the attributes;
creating and outputting a knowledge graph according to the entity set, the attribute set and the association relationship between the entities and the attributes, wherein the knowledge graph comprises the entities, the entity attributes, the association relationship between the entities and the attributes and the association relationship between the entities;
the method further comprises the following steps:
vectorizing each entity in the entity set to obtain a training vector;
the vectorizing each entity in the entity set to obtain a training vector includes:
adopting a multilayer neural network to carry out named entity identification on each entity in the entity set to obtain an entity context of each entity;
extracting the association relation among the entities from the entity context of each entity;
obtaining the training vector according to the entity context of each entity and the association relationship between the entities;
the extracting the association relationship among the entities from the entity context of each entity includes:
according to the attribute set, the entity set and the time recurrent neural network model, respectively labeling the association relationship of each entity in the entity set, wherein the labeled association relationship comprises the position of a word in the entity, the type of the association relationship and the position of the association relationship;
calculating the weight value of the relationship type by adopting an association relationship embedding method;
screening candidate association relationships from the labeled association relationships according to a closest distance principle and the association relationship types;
classifying the screened candidate association relationships according to the keyword pairs of the association relationship types to obtain the association relationships among the entities;
the method further comprises the following steps:
calculating the similarity among the entities through entity attribute embedding, and performing at least one of merging, deduplication and differentiation on entities with the same or similar entity types in the knowledge graph;
the method further comprises the following steps:
when the distance between at least two same-name entities in an entity network is smaller than N, merging the at least two same-name entities into one entity and merging the association relationships of the at least two same-name entities into one association, wherein the value of N indicates the rarity of the entity name;
the knowledge graph is based on a time dimension, and a link graph in each time window in the time dimension is an association relation between entities in the time window and a snapshot of entity attributes;
the knowledge-graph further satisfies at least one of:
in the knowledge graph, the entities with the association relationship are displayed in a gradient manner from strong to weak according to the strength of the association relationship;
highlighting a specific entity in the knowledge graph, wherein the specific entity marks a risk assessment value, and the specific entity is an entity with a risk assessment value higher than a preset risk assessment value;
when an entity in the knowledge-graph is updated, distinguishing the updated entity;
adding a time axis to the entity attribute with time update, and displaying the time of the update on the time axis;
and for the entity attributes of the same entity, shading the entity attributes from dark to light according to their weight values from high to low.
2. The method of claim 1, wherein after the multi-layer neural network is used to perform named entity recognition on each entity in the entity set, and after the entity context of each entity is obtained, and before the association relationship between each entity is extracted from the entity context of each entity, the method further comprises:
and adopting a maximum log-likelihood method to respectively perform maximization processing on the obtained entity context of each entity.
3. The method of claim 1, wherein after extracting the association between the entities from the entity context of each entity, and before obtaining the entity training vector according to the entity context of each entity and the association between each entity, the method further comprises:
and performing maximization processing on the obtained association relationships between the entities by adopting a maximum log-likelihood method.
4. The method according to any one of claims 1 to 3, wherein the data source comprises a first data table and a second data table, the plurality of entities comprises at least one first entity and at least one second entity, the first entity belongs to the first data table, the second entity belongs to the second data table, the knowledge graph comprises at least two connected graphs, and a descendant relationship and/or a parent-child relationship exists between the at least two connected graphs.
5. The method of claim 4, wherein at least one of merging, deduplicating and differentiating entities of the same or similar entity type in the knowledge-graph comprises:
if the similarity between the first entity and the second entity is higher than a preset similarity and the first entity and the second entity belong to at least one link graph, combining the first entity and the second entity or deleting the first entity or the second entity from the knowledge graph;
and if the similarity between the first entity and the second entity is higher than the preset similarity and the first entity and the second entity are determined to belong to any one link graph, distinguishing the first entity from the second entity in the knowledge graph.
6. The method of claim 4, wherein at least one of merging, deduplicating and differentiating entities of the same or similar entity type in the knowledge-graph comprises:
if the similarity between the first entity and the second entity is higher than a preset similarity and the first entity and the second entity belong to at least one link graph, determining a first entity set directly associated with the first entity and a second entity set directly associated with the second entity;
when it is determined that the intersection of the first set of entities and the second set of entities includes at least two entities, the first entity and the second entity are merged, or the first entity or the second entity is deleted from the knowledge-graph.
7. The method according to any one of claims 1, 2, 3, 5 and 6, wherein the performing semantic analysis and cluster analysis on the data source and extracting entity sets and attribute sets from the data source comprises:
performing word segmentation and semantic annotation processing on the corpora in the corpus set to obtain the entity set and the attribute set;
labeling the association relationship types among the entities in the entity set;
based on a conditional random field model, respectively adjusting the entity set and the attribute set, and respectively predicting each entity in the entity set and each attribute in the attribute set to obtain the association relationship type between the entities and obtain a mapping between the entities and the attributes.
8. An apparatus for creating a knowledge graph, the apparatus comprising:
at least one processor, memory, receiver, and transmitter;
wherein the memory is configured to store program code and the processor is configured to invoke the program code stored in the memory to perform the method of any of claims 1-7.
9. A computer storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1-7.
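The same-name merging condition recited in claim 1 (merge when the distance between same-name entities in the entity network is smaller than N) can be illustrated with a short sketch. It assumes that "distance" means the number of hops on an undirected adjacency map and that N is supplied by the caller; how N is derived from the rarity of a name is not shown. The names `hop_distance` and `merge_if_close` are hypothetical.

```python
from collections import deque
from typing import Dict, Optional, Set


def hop_distance(adjacency: Dict[str, Set[str]], start: str, goal: str) -> Optional[int]:
    """Breadth-first search; returns the hop count, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def merge_if_close(adjacency: Dict[str, Set[str]], names: Dict[str, str],
                   first: str, second: str, n: int) -> Dict[str, Set[str]]:
    """Fold `second` into `first` when both carry the same name and are fewer
    than n hops apart; otherwise return the adjacency map unchanged."""
    if names[first] != names[second]:
        return adjacency
    dist = hop_distance(adjacency, first, second)
    if dist is None or dist >= n:
        return adjacency
    merged: Dict[str, Set[str]] = {}
    for node, neighbors in adjacency.items():
        target = first if node == second else node
        renamed = {first if m == second else m for m in neighbors}
        merged.setdefault(target, set()).update(renamed)
    # Drop any self-loop created by the merge.
    merged.setdefault(first, set()).discard(first)
    return merged
```

The merged map carries the union of both entities' associations under the surviving identifier, which corresponds to merging their association relationships into one.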
CN201710890548.1A 2017-09-27 2017-09-27 Method and device for creating knowledge graph Active CN107665252B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710890548.1A CN107665252B (en) 2017-09-27 2017-09-27 Method and device for creating knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710890548.1A CN107665252B (en) 2017-09-27 2017-09-27 Method and device for creating knowledge graph

Publications (2)

Publication Number Publication Date
CN107665252A (en) 2018-02-06
CN107665252B (en) 2020-08-25

Family

ID=61098564

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710890548.1A Active CN107665252B (en) 2017-09-27 2017-09-27 Method and device for creating knowledge graph

Country Status (1)

Country Link
CN (1) CN107665252B (en)

Also Published As

Publication number Publication date
CN107665252A (en) 2018-02-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant