CN107180024A - Multi-source heterogeneous data entity recognition method and system based on center connected subgraphs - Google Patents

Multi-source heterogeneous data entity recognition method and system based on center connected subgraphs

Info

Publication number
CN107180024A
CN107180024A CN201710188586.2A CN201710188586A CN107180024A CN 107180024 A CN107180024 A CN 107180024A CN 201710188586 A CN201710188586 A CN 201710188586A CN 107180024 A CN107180024 A CN 107180024A
Authority
CN
China
Prior art keywords
center connected
connected subgraph
entity
candidate set
subgraph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710188586.2A
Other languages
Chinese (zh)
Inventor
赵淦森
庄序填
任雪琦
吴杰超
林嘉洺
尹怀英
聂瑞华
汤庸
唐华
马朝辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Normal University
Original Assignee
South China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Normal University
Priority to CN201710188586.2A
Publication of CN107180024A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-source heterogeneous data entity recognition method and system based on center connected subgraphs. The method includes: identifying and discovering all center connected subgraphs in a given graph to obtain an entity candidate set; and, according to the entity candidate set and preset logic rules, filtering and matching the center connected subgraphs in the entity candidate set to obtain the center connected subgraphs that match the business logic. The system includes a recognition unit and a matching unit. The invention can describe isolated multi-source heterogeneous structured data in an intuitive and highly descriptive way, makes the relationships between database tables prominent, and provides an efficient and usable model basis for the fusion of multi-source structured data. In addition, the invention searches for center connected subgraphs to discover the entity candidate set in the fused data and then further identifies entities from that set, so that data entities with more expressive meaning are identified.

Description

Multi-source heterogeneous data entity recognition method and system based on center connected subgraphs
Technical field
The present invention relates to the field of computer technology, and in particular to a multi-source heterogeneous data entity recognition method and system based on center connected subgraphs.
Background
Data fusion is an important step in big data governance. In an isolated, heterogeneous multi-source data environment, a key task of data integration and fusion is to discover whether the same data entity or information individual exists in each data source, as well as any association relationships that may exist between data or information, so as to connect the data sources, resolve data heterogeneity, and make information entities more complete, richer, and unified. An entity, as the basic unit of information carrying, describes all aspects of a concept corresponding to the physical world. In the data fusion scenario described above, the key objective is to identify and discover, in a network of interconnected data, entities whose information is as complete and unified as possible.
Prior art: In the field of text data analysis, most technical solutions target named entity recognition, that is, recognizing named entities such as person names, place names, and organization names in a text corpus. The main methods are: (1) Rule- and dictionary-based methods: domain experts manually construct rule templates and use lexical features, with pattern matching and string matching as the main means of recognition. (2) Statistical machine learning methods: common methods include hidden Markov models (HMM), maximum entropy, support vector machines, and conditional random fields; the main process is to statistically analyze the linguistic information contained in a training corpus, mine features from it, and use those features for named entity recognition. (3) Hybrid methods: rule-based methods combined with statistical analysis. (4) Deep-learning-based recognition methods: in recent years, as deep learning has continued to develop and has shown strong effectiveness and performance in text processing, named entity recognition with deep learning has also become a research hotspot.
In the field of structured data analysis, for relational databases, traditional entity construction and recognition follows a reverse-engineering approach. The main technical methods are: (1) Using the table structure schema of the relational database, with the help of schema diagrams: each attribute (field) of a table is drawn out, then the dependencies between tables are described, and finally the ER model of the whole database is expressed. (2) Using software documentation and database operation statements: the database design documents produced during software design, together with the statements that create and operate on the database and its tables, are used to recover the ER model of the whole database and thereby achieve entity recognition.
Problems and shortcomings of the prior art: the text data analysis field mainly targets named entity recognition, that is, recognizing entity names contained in text, such as person names, place names, and organization names. The entity addressed by the present invention corresponds to a physical concept in the real world and includes information on many aspects, such as attributes and relationships. Existing technical solutions for entity recognition in structured data have the following shortcomings: (1) They are single-source methods: traditional database reverse-engineering methods all target centralized databases and cannot effectively handle the recognition of fused entities from multi-source data in big data application scenarios. (2) In the traditional approach of recovering the entity model from schema diagrams, the attributes that can be associated are limited to fields that can be effectively matched or to identical fields, so it is difficult to discover and identify cross-source data entities with more comprehensive information and richer meaning.
Summary of the invention
To solve the above technical problems, the object of the present invention is to provide a multi-source heterogeneous data entity recognition method and system based on center connected subgraphs that can identify data entities with more expressive meaning.
The technical solution adopted by the present invention is:
A multi-source heterogeneous data entity recognition method based on center connected subgraphs, comprising the following steps:
identifying and discovering all center connected subgraphs in a given graph to obtain an entity candidate set;
according to the entity candidate set and preset logic rules, filtering and matching the center connected subgraphs in the entity candidate set to obtain the center connected subgraphs that match the business logic.
As a further improvement of the method, the step of identifying and discovering all center connected subgraphs in the given graph to obtain the entity candidate set is specifically:
presetting a node as the center node of a connected subgraph, searching for the other nodes connected to that center node, and forming a subgraph from them to obtain a center connected subgraph;
traversing all points of the given graph and presetting each point in turn as the center node, until all center connected subgraphs of the given graph have been identified and discovered.
As a further improvement of the method, the step of filtering and matching the center connected subgraphs in the entity candidate set according to the entity candidate set and the preset logic rules to obtain the center connected subgraphs that match the business logic is specifically:
for the nodes and edges of each center connected subgraph in the entity candidate set, filtering and matching the center connected subgraphs in the entity candidate set according to the preset logic rules to obtain the center connected subgraphs that match the business logic.
Another technical solution of the present invention is:
A multi-source heterogeneous data entity recognition system based on center connected subgraphs, including:
a recognition unit, configured to identify and discover all center connected subgraphs in a given graph to obtain an entity candidate set;
a matching unit, configured to filter and match the center connected subgraphs in the entity candidate set according to the entity candidate set and preset logic rules to obtain the center connected subgraphs that match the business logic.
As a further improvement of the system, the recognition unit specifically includes:
a subgraph composition unit, configured to preset a node as the center node of a connected subgraph, search for the other nodes connected to that center node, and form a subgraph from them to obtain a center connected subgraph;
a traversal unit, configured to traverse all points of the given graph and preset each point in turn as the center node, until all center connected subgraphs of the given graph have been identified and discovered.
As a further improvement of the system, the matching unit is specifically configured to:
for the nodes and edges of each center connected subgraph in the entity candidate set, filter and match the center connected subgraphs in the entity candidate set according to the preset logic rules to obtain the center connected subgraphs that match the business logic.
The beneficial effects of the present invention are:
The multi-source heterogeneous data entity recognition method and system based on center connected subgraphs of the present invention can describe isolated multi-source heterogeneous structured data in an intuitive and highly descriptive way, make the relationships between database tables prominent, and provide an efficient and usable model basis for the fusion of multi-source structured data. In addition, the present invention searches for center connected subgraphs to discover the entity candidate set in the fused data and then further identifies entities from that set, so that data entities with more expressive meaning are identified.
Brief description of the drawings
The embodiments of the present invention are further described below with reference to the accompanying drawings:
Fig. 1 is a flow chart of the steps of the multi-source heterogeneous data entity recognition method based on center connected subgraphs of the present invention;
Fig. 2 is a block diagram of the multi-source heterogeneous data entity recognition system based on center connected subgraphs of the present invention.
Detailed description of the embodiments
With reference to Fig. 1, a multi-source heterogeneous data entity recognition method based on center connected subgraphs of the present invention comprises the following steps:
identifying and discovering all center connected subgraphs in a given graph to obtain an entity candidate set;
according to the entity candidate set and preset logic rules, filtering and matching the center connected subgraphs in the entity candidate set to obtain the center connected subgraphs that match the business logic.
As a further preferred embodiment, the step of identifying and discovering all center connected subgraphs in the given graph to obtain the entity candidate set is specifically:
presetting a node as the center node of a connected subgraph, searching for the other nodes connected to that center node, and forming a subgraph from them to obtain a center connected subgraph;
traversing all points of the given graph and presetting each point in turn as the center node, until all center connected subgraphs of the given graph have been identified and discovered.
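The two sub-steps above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the text does not say how far the search around a center node extends, so the sketch assumes a breadth-first search with an optional `max_depth` hop limit (a hypothetical parameter), and the table names in `adj` are invented for the example.

```python
from collections import deque


def center_connected_subgraph(adj, center, max_depth=None):
    """Collect the nodes reachable from `center` and the edges among them.

    `adj` maps each node to the set of its neighbours (undirected graph).
    `max_depth` is an assumed optional hop limit; None means no limit.
    """
    depth = {center: 0}
    queue = deque([center])
    while queue:
        u = queue.popleft()
        if max_depth is not None and depth[u] >= max_depth:
            continue  # do not expand beyond the hop limit
        for v in adj.get(u, ()):
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    nodes = frozenset(depth)
    # Keep every edge of the original graph whose two endpoints were collected.
    edges = frozenset(
        frozenset((u, v)) for u in nodes for v in adj.get(u, ()) if v in nodes
    )
    return {"center": center, "nodes": nodes, "edges": edges}


def entity_candidate_set(adj, max_depth=None):
    """Preset every node as the center in turn and collect all subgraphs."""
    return [center_connected_subgraph(adj, c, max_depth) for c in sorted(adj)]


# Hypothetical schema graph: nodes are tables, edges are associations.
adj = {
    "customer": {"order"},
    "order": {"customer", "order_item"},
    "order_item": {"order", "product"},
    "product": {"order_item"},
    "audit_log": set(),
}
candidates = entity_candidate_set(adj, max_depth=1)
by_center = {c["center"]: c for c in candidates}
full = center_connected_subgraph(adj, "customer")
```

With `max_depth=1` each candidate is the center table plus its direct neighbours; with no limit, the candidate for `customer` covers the whole connected component, while the isolated `audit_log` table forms a single-node candidate.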
As a further preferred embodiment, the step of filtering and matching the center connected subgraphs in the entity candidate set according to the entity candidate set and the preset logic rules to obtain the center connected subgraphs that match the business logic is specifically:
for the nodes and edges of each center connected subgraph in the entity candidate set, filtering and matching the center connected subgraphs in the entity candidate set according to the preset logic rules to obtain the center connected subgraphs that match the business logic.
With reference to Fig. 2, a multi-source heterogeneous data entity recognition system based on center connected subgraphs of the present invention includes:
a recognition unit, configured to identify and discover all center connected subgraphs in a given graph to obtain an entity candidate set;
a matching unit, configured to filter and match the center connected subgraphs in the entity candidate set according to the entity candidate set and preset logic rules to obtain the center connected subgraphs that match the business logic.
As a further preferred embodiment, the recognition unit specifically includes:
a subgraph composition unit, configured to preset a node as the center node of a connected subgraph, search for the other nodes connected to that center node, and form a subgraph from them to obtain a center connected subgraph;
a traversal unit, configured to traverse all points of the given graph and preset each point in turn as the center node, until all center connected subgraphs of the given graph have been identified and discovered.
As a further preferred embodiment, the matching unit is specifically configured to:
for the nodes and edges of each center connected subgraph in the entity candidate set, filter and match the center connected subgraphs in the entity candidate set according to the preset logic rules to obtain the center connected subgraphs that match the business logic.
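One possible way to arrange the units of Fig. 2 in code is sketched below. The class names mirror the unit names in the text, but the method names, the type aliases, and the example rule are assumptions made for illustration; for brevity a candidate is represented only by its node set, and the search is an unbounded depth-first traversal.

```python
from typing import Callable, Dict, FrozenSet, List, Set

Graph = Dict[str, Set[str]]              # node -> neighbouring nodes (undirected)
Candidates = Dict[str, FrozenSet[str]]   # center node -> nodes of its subgraph
Rule = Callable[[str, FrozenSet[str]], bool]


class SubgraphCompositionUnit:
    """Builds the subgraph around one preset center node."""

    def build(self, graph: Graph, center: str) -> FrozenSet[str]:
        seen, stack = {center}, [center]
        while stack:
            for neighbour in graph.get(stack.pop(), ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        return frozenset(seen)


class TraversalUnit:
    """Presets every node of the graph as the center in turn."""

    def __init__(self, composer: SubgraphCompositionUnit) -> None:
        self.composer = composer

    def run(self, graph: Graph) -> Candidates:
        return {node: self.composer.build(graph, node) for node in graph}


class RecognitionUnit:
    """Combines the composition and traversal units into one step."""

    def __init__(self) -> None:
        self.traversal = TraversalUnit(SubgraphCompositionUnit())

    def recognize(self, graph: Graph) -> Candidates:
        return self.traversal.run(graph)


class MatchingUnit:
    """Keeps only the candidates that satisfy every preset rule."""

    def __init__(self, rules: List[Rule]) -> None:
        self.rules = rules

    def match(self, candidates: Candidates) -> Candidates:
        return {
            center: nodes
            for center, nodes in candidates.items()
            if all(rule(center, nodes) for rule in self.rules)
        }


# Hypothetical tables; the rule "an entity spans at least two tables" is invented.
graph: Graph = {"patient": {"visit"}, "visit": {"patient"}, "tmp_import": set()}
candidates = RecognitionUnit().recognize(graph)
matched = MatchingUnit([lambda center, nodes: len(nodes) >= 2]).match(candidates)
```

Keeping the rules as plain callables passed to `MatchingUnit` is a design choice of this sketch; it lets business rules change without touching the recognition side.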
In the embodiment of the present invention, the structured data schema is converted into nodes on a graph, and the association relationships between data tables are described by connecting edges. If an entity needs only one aspect (one database table) to describe it, that database table can represent the entity. If an entity needs multiple aspects (multiple database tables) to describe it, the database tables corresponding to the entity should be bound and associated together by some means, that is, all of them are directly or indirectly associated with some table.
In multi-source data, the structured data schema of each data source is converted into a graph for modeling; that is, for the database or data tables of each data source, tables are represented by graph nodes and the dependencies between tables are represented by edges. Each database table is therefore mapped to a point and each association to an edge, so that simple entities are mapped onto the graph. The present invention assumes that in the fused data network graph, the data description is correct and complete, and that redundant and ambiguous entities have already been handled accordingly. The present invention identifies and discovers entities through center connected subgraphs in such an ideal data graph.
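The modeling step can be illustrated with a small Python sketch. The source names, table names, foreign keys, and cross-source link below are all hypothetical, and the dictionary layout of `sources` is an assumption; the text only states that tables become nodes and associations become edges.

```python
# Hypothetical schema descriptions for two data sources.
sources = {
    "crm": {
        "tables": ["customer", "address"],
        "foreign_keys": [("address", "customer")],
    },
    "shop": {
        "tables": ["order", "order_item"],
        "foreign_keys": [("order_item", "order")],
    },
}
# Hypothetical cross-source association, assumed to come from data fusion.
cross_source_links = [(("shop", "order"), ("crm", "customer"))]


def build_schema_graph(sources, cross_source_links):
    """Map every table to a node and every association to an undirected edge."""
    adj = {}

    def link(a, b):
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    for source, schema in sources.items():
        for table in schema["tables"]:
            # Tables without any association still become nodes.
            adj.setdefault((source, table), set())
        for child, parent in schema["foreign_keys"]:
            link((source, child), (source, parent))
    for a, b in cross_source_links:
        link(a, b)
    return adj


graph = build_schema_graph(sources, cross_source_links)
```

Nodes are keyed by `(source, table)` pairs so that tables with the same name in different sources stay distinct.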
A specific embodiment of the present invention is as follows:
1. Identify and discover the entity candidate set. An entity can map onto the graph in two ways: to a single point, or to multiple interconnected points. The mapping of an entity onto the graph is therefore a center connected subgraph. Conversely, a center connected subgraph does not necessarily correspond to a unique entity. For a data graph, this process identifies all possible entity types contained in the data and finds an entity candidate set that is as large as possible. Thus, all center connected subgraphs are found by path search on the graph and together form the entity candidate set.
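One way to make this step precise is given below. The notation is not from the patent; in particular the hop bound $k$ is an assumption, since the text only says that the subgraphs are found by path search.

```latex
Let $G = (V, E)$ be the undirected data graph, with tables as nodes and
associations as edges, and let $d_G(c, v)$ be the length of a shortest path
between $c$ and $v$ (taken as $\infty$ if no path exists). For a center
$c \in V$ and a hop bound $k \in \{0, 1, 2, \dots\} \cup \{\infty\}$ define
\[
  N_k(c) = \{\, v \in V : d_G(c, v) \le k \,\}, \qquad
  S_k(c) = G[N_k(c)],
\]
where $G[N]$ is the subgraph of $G$ induced by the node set $N$. The entity
candidate set is then
\[
  \mathcal{C}_k = \{\, (c, S_k(c)) : c \in V \,\}, \qquad
  |\mathcal{C}_k| = |V| .
\]
An entity that maps to a single point corresponds to $N_0(c) = \{c\}$, and for
$k = \infty$ the set $N_\infty(c)$ is the connected component of $G$ that
contains $c$.
```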
2. Analyze and judge. Once all center connected subgraphs have been found by the above step, the entity candidate set is obtained. For the nodes and edges of each center connected subgraph (nodes and edges represent database table information and association information, respectively), the center connected subgraphs are filtered and matched according to the given business scenario, logic rules, or specific domain knowledge, yielding the center connected subgraphs that match and correspond to the business logic, that is, entities that have expressive meaning and can be mapped and described. Because an entity is a thing that physically exists or is conceptually well defined, each entity should have corresponding semantics and information.
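A minimal Python sketch of this filter-and-match step follows. The patent does not specify what form the logic rules take, so everything here is an assumption for illustration: the candidates are hard-coded, the filter rule excludes tables whose names end in `_log`, and the `templates` dictionary stands in for domain knowledge about which tables a business entity needs.

```python
from typing import FrozenSet, NamedTuple, Optional


class Candidate(NamedTuple):
    center: str
    nodes: FrozenSet[str]
    edges: FrozenSet[FrozenSet[str]]


def edge(a: str, b: str) -> FrozenSet[str]:
    return frozenset({a, b})


# Hypothetical entity candidate set (center, tables, associations).
candidates = [
    Candidate("customer", frozenset({"customer", "order"}),
              frozenset({edge("customer", "order")})),
    Candidate("order", frozenset({"customer", "order", "order_item"}),
              frozenset({edge("customer", "order"), edge("order", "order_item")})),
    Candidate("audit_log", frozenset({"audit_log"}), frozenset()),
]

# Hypothetical domain knowledge:
# business entity -> (core table, tables that must be linked to the core table).
templates = {
    "Order": ("order", {"customer", "order_item"}),
    "Customer": ("customer", set()),
}


def is_business_table(name: str) -> bool:
    """Invented filter rule: technical log tables never describe an entity."""
    return not name.endswith("_log")


def filter_candidates(cands):
    return [c for c in cands if all(is_business_table(n) for n in c.nodes)]


def match_entity(c: Candidate) -> Optional[str]:
    """Return the first business entity whose template the candidate satisfies."""
    for entity, (core, required) in templates.items():
        if c.center == core and all(edge(core, t) in c.edges for t in required):
            return entity
    return None


matched = {}
for c in filter_candidates(candidates):
    entity = match_entity(c)
    if entity is not None:
        matched[c.center] = entity
```

Here the `audit_log` candidate is dropped by the filter, and the remaining two candidates are matched to the `Customer` and `Order` templates by checking both their nodes and their edges.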
As can be seen from the above, the multi-source heterogeneous data entity recognition method and system based on center connected subgraphs of the present invention can describe isolated multi-source heterogeneous structured data in an intuitive and highly descriptive way, make the relationships between database tables prominent, and provide an efficient and usable model basis for the fusion of multi-source structured data. In addition, the present invention searches for center connected subgraphs to discover the entity candidate set in the fused data and then further identifies entities from that set, so that data entities with more expressive meaning are identified.
The above describes preferred implementations of the present invention, but the invention is not limited to these embodiments. Those skilled in the art can make various equivalent variations or substitutions without departing from the spirit of the present invention, and such equivalent variations or substitutions are all included within the scope defined by the claims of this application.

Claims (6)

1. A multi-source heterogeneous data entity recognition method based on center connected subgraphs, characterized by comprising the following steps:
identifying and discovering all center connected subgraphs in a given graph to obtain an entity candidate set;
according to the entity candidate set and preset logic rules, filtering and matching the center connected subgraphs in the entity candidate set to obtain the center connected subgraphs that match the business logic.
2. The multi-source heterogeneous data entity recognition method based on center connected subgraphs according to claim 1, characterized in that the step of identifying and discovering all center connected subgraphs in the given graph to obtain the entity candidate set is specifically:
presetting a node as the center node of a connected subgraph, searching for the other nodes connected to that center node, and forming a subgraph from them to obtain a center connected subgraph;
traversing all points of the given graph and presetting each point in turn as the center node, until all center connected subgraphs of the given graph have been identified and discovered.
3. The multi-source heterogeneous data entity recognition method based on center connected subgraphs according to claim 1, characterized in that the step of filtering and matching the center connected subgraphs in the entity candidate set according to the entity candidate set and the preset logic rules to obtain the center connected subgraphs that match the business logic is specifically:
for the nodes and edges of each center connected subgraph in the entity candidate set, filtering and matching the center connected subgraphs in the entity candidate set according to the preset logic rules to obtain the center connected subgraphs that match the business logic.
4. A multi-source heterogeneous data entity recognition system based on center connected subgraphs, characterized by including:
a recognition unit, configured to identify and discover all center connected subgraphs in a given graph to obtain an entity candidate set;
a matching unit, configured to filter and match the center connected subgraphs in the entity candidate set according to the entity candidate set and preset logic rules to obtain the center connected subgraphs that match the business logic.
5. The multi-source heterogeneous data entity recognition system based on center connected subgraphs according to claim 4, characterized in that the recognition unit specifically includes:
a subgraph composition unit, configured to preset a node as the center node of a connected subgraph, search for the other nodes connected to that center node, and form a subgraph from them to obtain a center connected subgraph;
a traversal unit, configured to traverse all points of the given graph and preset each point in turn as the center node, until all center connected subgraphs of the given graph have been identified and discovered.
6. The multi-source heterogeneous data entity recognition system based on center connected subgraphs according to claim 5, characterized in that the matching unit is specifically configured to:
for the nodes and edges of each center connected subgraph in the entity candidate set, filter and match the center connected subgraphs in the entity candidate set according to the preset logic rules to obtain the center connected subgraphs that match the business logic.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710188586.2A CN107180024A (en) 2017-03-27 2017-03-27 Multi-source heterogeneous data entity recognition method and system based on center connected subgraphs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710188586.2A CN107180024A (en) 2017-03-27 2017-03-27 Multi-source heterogeneous data entity recognition method and system based on center connected subgraphs

Publications (1)

Publication Number Publication Date
CN107180024A (en) 2017-09-19

Family

ID=59830209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710188586.2A Pending CN107180024A (en) 2017-03-27 2017-03-27 Multi-source heterogeneous data entity recognition method and system based on center connected subgraphs

Country Status (1)

Country Link
CN (1) CN107180024A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239553A (en) * 2014-09-24 2014-12-24 江苏名通信息科技有限公司 Entity recognition method based on Map-Reduce framework
CN106021306A (en) * 2016-05-05 2016-10-12 上海交通大学 Ontology matching based case search system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王宏志 等: "复杂数据上的实体识别技术研究", 《计算机学报》 *
霍然: "量质融合数据管理系统中实体识别子系统的研究与实现", 《中国优秀硕士学位论文全文数据库-信息科技辑》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107886107A (en) * 2017-09-26 2018-04-06 赵淦森 A kind of fusion method of big data, system and device
CN107886107B (en) * 2017-09-26 2021-03-30 赵淦森 Big data fusion method, system and device
CN108804599A (en) * 2018-05-29 2018-11-13 浙江大学 A kind of fast searching method of similar subgraph
CN108804599B (en) * 2018-05-29 2022-01-04 浙江大学 Rapid searching method for similar transaction modes
CN112052404A (en) * 2020-09-23 2020-12-08 西安交通大学 Group discovery method, system, device and medium for multi-source heterogeneous relation network
CN112052404B (en) * 2020-09-23 2023-08-15 西安交通大学 Group discovery method, system, equipment and medium of multi-source heterogeneous relation network

Similar Documents

Publication Publication Date Title
CN110941612B (en) Autonomous data lake construction system and method based on associated data
CN107609052B (en) A kind of generation method and device of the domain knowledge map based on semantic triangle
CN111026842B (en) Natural language processing method, natural language processing device and intelligent question-answering system
WO2020135048A1 (en) Data merging method and apparatus for knowledge graph
JP6894534B2 (en) Information processing method and terminal, computer storage medium
CN103699689B (en) Method and device for establishing event repository
CN104346377B (en) A kind of data integration and transfer method based on unique mark
CN111159330B (en) Database query statement generation method and device
CN109325040B (en) FAQ question-answer library generalization method, device and equipment
US20150154286A1 (en) Method for disambiguated features in unstructured text
CN107391677A (en) Carry the generation method and device of the Universal Chinese character knowledge mapping of entity-relationship-attribute
CN104239513A (en) Semantic retrieval method oriented to field data
CN111914550B (en) Knowledge graph updating method and system oriented to limited field
CN105808853B (en) A kind of ontological construction management of Engineering Oriented application and ontology data automatic obtaining method
Achichi et al. Automatic key selection for data linking
CN107180024A (en) A kind of multi-source heterogeneous data entity recognition methods of center connected subgraph and system
CN106202167B (en) A kind of oriented label figure adaptive index construction method based on structural outline model
Paulus et al. Gathering and Combining Semantic Concepts from Multiple Knowledge Bases.
CN105989097A (en) Ontology-based knowledge base query method and system
CN110442730A (en) A kind of knowledge mapping construction method based on deepdive
CN110263021B (en) Theme library generation method based on personalized label system
Shcherban et al. Multiclass Classification of Four Types of UML Diagrams from Images Using Deep Learning.
CN114153983A (en) Multi-source construction method of industry knowledge graph
CN111984745A (en) Dynamic expansion method, device, equipment and storage medium for database field
CN115827885A (en) Operation and maintenance knowledge graph construction method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170919
