CN107180024A - Multi-source heterogeneous data entity recognition method and system based on center connected subgraphs - Google Patents
- Publication number
- CN107180024A (application CN201710188586.2A)
- Authority
- CN
- China
- Prior art keywords
- center connected
- connected subgraph
- entity
- candidate set
- subgraph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/288—Entity relationship models
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a multi-source heterogeneous data entity recognition method and system based on center connected subgraphs. The method comprises: identifying and discovering all center connected subgraphs in a given graph to obtain an entity candidate set; and, according to the entity candidate set and preset logic rules, filtering and matching the center connected subgraphs in the candidate set to obtain the center connected subgraphs that match the business logic. The system comprises a recognition unit and a matching unit. The invention can describe isolated multi-source heterogeneous structured data intuitively and expressively, highlights the relationships between database tables, and provides an efficient, usable model basis for fusing multi-source structured data. In addition, the invention searches for center connected subgraphs to discover an entity candidate set in the fused data and then identifies entities from that set, thereby identifying data entities that carry more expressive meaning.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a multi-source heterogeneous data entity recognition method and system based on center connected subgraphs.
Background art
Data fusion is an important step in big data governance. In an environment of heterogeneous, isolated multi-source data, a key task of data integration and fusion is to discover whether the same data entities or information individuals exist across the data sources, together with any association relationships that may exist between data or information, so as to connect the data, resolve data heterogeneity, and make information entities more complete, richer, and unified. An entity, as the basic unit of an information carrier, describes all aspects of a concept that corresponds to something in the physical world. In the data fusion scenario described above, the key goal is to identify and discover, within the interconnected data network, entities whose information is as complete and unified as possible.
Prior art: In the field of text data analysis, most technical solutions target named entity recognition, that is, recognizing named entities such as person names, place names, and organization names in a text corpus. The main methods are: (1) Rule- and dictionary-based methods: domain experts construct rule templates by hand, drawing on lexical features, and recognition relies mainly on pattern and string matching. (2) Statistical machine learning methods: common methods include hidden Markov models (HMM), maximum entropy, support vector machines, and conditional random fields; the main process is to statistically analyze the linguistic information contained in a training corpus, mine features from it, and use those features for named entity recognition. (3) Hybrid methods: rule-based methods combined with statistical analysis. (4) Recognition methods based on deep learning: in recent years, with the continuous development of deep learning, which has shown strong effectiveness and performance in text processing, named entity recognition using deep learning has also become a research hotspot.
In the field of structured data analysis, for relational databases, traditional entity construction and recognition methods follow a reverse-engineering approach. The main techniques are: (1) Using the table schemas in the relational database together with schema diagramming: each attribute (field) of a table is depicted, then the dependencies between tables are described, and finally the ER model of the whole database is expressed. (2) Using software documentation and database operation statements: the database design documents produced during software design, together with the statements that create and operate on the database and its tables, are used to recover the ER model of the whole database and thereby achieve entity recognition.
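As a rough illustration of the schema-based reverse engineering in technique (1), the sketch below reads table columns and declared foreign keys from a single SQLite database. It is only a minimal example under stated assumptions: SQLite and its `PRAGMA table_info` and `PRAGMA foreign_key_list` commands are chosen for convenience, and the function name `recover_er_model` and the `customer`/`orders` tables are hypothetical. The patent does not name a database engine or tool.

```python
import sqlite3


def recover_er_model(conn):
    """Return (tables, dependencies) recovered from a SQLite schema.

    tables maps each table name to its column names; dependencies is a set
    of (child_table, parent_table) pairs taken from declared foreign keys.
    """
    tables = {}
    dependencies = set()
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' "
        "AND name NOT LIKE 'sqlite_%' ORDER BY name")]
    for name in names:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        tables[name] = [col[1] for col in conn.execute(f'PRAGMA table_info("{name}")')]
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
        for fk in conn.execute(f'PRAGMA foreign_key_list("{name}")'):
            dependencies.add((name, fk[2]))
    return tables, dependencies


# Hypothetical two-table schema used only to exercise the function.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id)
    );
""")
tables, dependencies = recover_er_model(conn)
```

Because this reads one centralized database at a time, it also shows the limitation the following paragraph raises: nothing here links tables that live in different sources.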
Problems and shortcomings of the prior art: The text data analysis field mainly targets named entity recognition, that is, recognizing the entity names contained in text, such as person names, place names, and organization names. The entity addressed by the present invention instead corresponds to a physical concept in the real world and includes information on many aspects, such as attributes and relationships. Existing technical solutions for structured data entity recognition have the following shortcomings: (1) Single-source structured data entity recognition: traditional database reverse-engineering methods all target centralized databases and cannot effectively handle the recognition of multi-source fused entities in big data application scenarios. (2) Traditional approaches that restore the entity model from schema diagrams can associate only attributes that are effectively matched fields or identical fields, which makes it difficult to discover and recognize cross-source data entities with more comprehensive information and richer meaning.
Summary of the invention
To solve the above technical problems, the object of the present invention is to provide a multi-source heterogeneous data entity recognition method and system based on center connected subgraphs that can recognize data entities carrying more expressive meaning.
The technical solution adopted by the present invention is:
A multi-source heterogeneous data entity recognition method based on center connected subgraphs, comprising the following steps:
Identifying and discovering all center connected subgraphs in a given graph, thereby obtaining an entity candidate set;
According to the entity candidate set and preset logic rules, filtering and matching the center connected subgraphs in the entity candidate set to obtain the center connected subgraphs that match the business logic.
As a further improvement of the described method, the step of identifying and discovering all center connected subgraphs in a given graph to obtain an entity candidate set is specifically:
Presetting a node as the center node of a connected subgraph, searching for the other nodes connected to that center node, and forming a subgraph from them to obtain a center connected subgraph;
Traversing all vertices of the given graph and presetting each vertex in turn as the center node, until all center connected subgraphs of the given graph have been identified and discovered.
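The two sub-steps above (build a subgraph around a preset center, then repeat for every vertex) might be sketched as follows. This is a minimal sketch, not the patented implementation. It assumes that "connected to the center node" means directly adjacent, which the text does not state explicitly. The function name `center_connected_subgraphs` and the example tables are hypothetical.

```python
from collections import defaultdict


def center_connected_subgraphs(vertices, edges):
    """Return {center: (nodes, edges)} with one candidate subgraph per vertex.

    Assumption: "connected to the center node" means directly adjacent,
    so each subgraph is the center plus its immediate neighbours.
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    candidates = {}
    for center in vertices:  # preset each vertex in turn as the center node
        neighbours = adjacency[center] - {center}
        nodes = frozenset({center} | neighbours)
        # Keep only the edges that join the center to its neighbours.
        star_edges = frozenset(frozenset((center, n)) for n in neighbours)
        candidates[center] = (nodes, star_edges)
    return candidates


# Hypothetical schema graph: vertices are tables, edges are table associations.
tables = ["person", "address", "orders", "product", "audit_log"]
links = [("person", "address"), ("person", "orders"), ("orders", "product")]
candidate_set = center_connected_subgraphs(tables, links)
```

Every vertex yields exactly one candidate here, including an isolated table such as `audit_log`, whose subgraph contains only itself. Deciding which candidates are real entities is left to the later filtering step.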
As a further improvement of the described method, the step of filtering and matching the center connected subgraphs in the entity candidate set according to the entity candidate set and preset logic rules, to obtain the center connected subgraphs that match the business logic, is specifically:
For the nodes and edges of each center connected subgraph in the entity candidate set, filtering and matching the center connected subgraphs according to the preset logic rules to obtain the center connected subgraphs that match the business logic.
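One possible reading of this filtering step treats each preset logic rule as a predicate over a candidate's nodes and edges and keeps the candidates that satisfy all of them. The sketch below follows that reading. The `Candidate` type, the `match_candidates` function, and the three rules are hypothetical illustrations, since the patent does not specify how rules are encoded.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Candidate:
    center: str
    nodes: FrozenSet[str]
    edges: FrozenSet[Tuple[str, str]]  # (table, table) associations


Rule = Callable[[Candidate], bool]


def match_candidates(candidates: Iterable[Candidate], rules: Iterable[Rule]) -> List[Candidate]:
    """Keep only the candidates that satisfy every preset logic rule."""
    rules = list(rules)
    return [c for c in candidates if all(rule(c) for rule in rules)]


# Hypothetical business rules, for illustration only.
MASTER_TABLES = {"customer", "product"}


def centered_on_master_table(c: Candidate) -> bool:
    return c.center in MASTER_TABLES


def has_related_tables(c: Candidate) -> bool:
    return len(c.nodes) > 1


def every_edge_inside_subgraph(c: Candidate) -> bool:
    return all(u in c.nodes and v in c.nodes for u, v in c.edges)


candidates = [
    Candidate("customer", frozenset({"customer", "orders", "address"}),
              frozenset({("customer", "orders"), ("customer", "address")})),
    Candidate("orders", frozenset({"orders", "customer"}),
              frozenset({("orders", "customer")})),
    Candidate("product", frozenset({"product"}), frozenset()),
]
rules = [centered_on_master_table, has_related_tables, every_edge_inside_subgraph]
matched = match_candidates(candidates, rules)
```

In this example only the subgraph centered on `customer` survives: `orders` is not a master table, and `product` has no related tables. Expressing rules as plain functions keeps domain knowledge separate from the graph search, so the rule set can change without touching candidate discovery.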
Another technical solution of the present invention is:
A multi-source heterogeneous data entity recognition system based on center connected subgraphs, comprising:
A recognition unit, configured to identify and discover all center connected subgraphs in a given graph, thereby obtaining an entity candidate set;
A matching unit, configured to filter and match the center connected subgraphs in the entity candidate set according to the entity candidate set and preset logic rules, to obtain the center connected subgraphs that match the business logic.
As a further improvement of the described system, the recognition unit specifically comprises:
A subgraph construction unit, configured to preset a node as the center node of a connected subgraph, search for the other nodes connected to that center node, and form a subgraph from them to obtain a center connected subgraph;
A traversal unit, configured to traverse all vertices of the given graph and preset each vertex in turn as the center node, until all center connected subgraphs of the given graph have been identified and discovered.
As a further improvement of the described system, the matching unit is specifically configured to:
For the nodes and edges of each center connected subgraph in the entity candidate set, filter and match the center connected subgraphs according to the preset logic rules to obtain the center connected subgraphs that match the business logic.
The beneficial effects of the present invention are:
The multi-source heterogeneous data entity recognition method and system based on center connected subgraphs can describe isolated multi-source heterogeneous structured data intuitively and expressively, highlight the relationships between database tables, and provide an efficient, usable model basis for fusing multi-source structured data. In addition, the invention searches for center connected subgraphs to discover an entity candidate set in the fused data and then identifies entities from that set, thereby identifying data entities that carry more expressive meaning.
Brief description of the drawings
Specific embodiments of the present invention are further described below with reference to the accompanying drawings:
Fig. 1 is a flow chart of the steps of the multi-source heterogeneous data entity recognition method based on center connected subgraphs of the present invention;
Fig. 2 is a block diagram of the multi-source heterogeneous data entity recognition system based on center connected subgraphs of the present invention.
Detailed description of the embodiments
With reference to Fig. 1, the multi-source heterogeneous data entity recognition method based on center connected subgraphs of the present invention comprises the following steps:
Identifying and discovering all center connected subgraphs in a given graph, thereby obtaining an entity candidate set;
According to the entity candidate set and preset logic rules, filtering and matching the center connected subgraphs in the entity candidate set to obtain the center connected subgraphs that match the business logic.
As a further preferred embodiment, the step of identifying and discovering all center connected subgraphs in a given graph to obtain an entity candidate set is specifically:
Presetting a node as the center node of a connected subgraph, searching for the other nodes connected to that center node, and forming a subgraph from them to obtain a center connected subgraph;
Traversing all vertices of the given graph and presetting each vertex in turn as the center node, until all center connected subgraphs of the given graph have been identified and discovered.
As a further preferred embodiment, the step of filtering and matching the center connected subgraphs in the entity candidate set according to the entity candidate set and preset logic rules, to obtain the center connected subgraphs that match the business logic, is specifically:
For the nodes and edges of each center connected subgraph in the entity candidate set, filtering and matching the center connected subgraphs according to the preset logic rules to obtain the center connected subgraphs that match the business logic.
With reference to Fig. 2, the multi-source heterogeneous data entity recognition system based on center connected subgraphs of the present invention comprises:
A recognition unit, configured to identify and discover all center connected subgraphs in a given graph, thereby obtaining an entity candidate set;
A matching unit, configured to filter and match the center connected subgraphs in the entity candidate set according to the entity candidate set and preset logic rules, to obtain the center connected subgraphs that match the business logic.
As a further preferred embodiment, the recognition unit specifically comprises:
A subgraph construction unit, configured to preset a node as the center node of a connected subgraph, search for the other nodes connected to that center node, and form a subgraph from them to obtain a center connected subgraph;
A traversal unit, configured to traverse all vertices of the given graph and preset each vertex in turn as the center node, until all center connected subgraphs of the given graph have been identified and discovered.
As a further preferred embodiment, the matching unit is specifically configured to:
For the nodes and edges of each center connected subgraph in the entity candidate set, filter and match the center connected subgraphs according to the preset logic rules to obtain the center connected subgraphs that match the business logic.
In the embodiments of the present invention, the structured data schema is converted into a graph: data tables are expressed as nodes, and the association relationships between data tables are described by connecting edges. If an entity needs only one aspect (one database table) to describe it, that database table can characterize the entity. If an entity needs multiple aspects (multiple database tables) to describe it, the database tables corresponding to the entity should be bound and associated together by some means, that is, all of them are directly or indirectly associated with a certain table.
For multi-source data, the structured data schema of each data source is converted into a graph for modeling. That is, for the database or data tables of each data source, tables are represented by graph nodes and the dependencies between tables are represented by edges. Each database table is thus mapped to a vertex and each association is mapped to an edge, so that a simple entity is mapped onto the graph. The present invention assumes that, in the fused data network graph, the data description is correct and complete and that redundant and ambiguous entities have already been handled. The invention identifies and discovers entities through center connected subgraphs in this ideal data graph.
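The mapping just described (one vertex per table, one edge per association, across several sources) can be illustrated with a small adjacency-set builder. This is a sketch under assumptions: the input layout with `tables` and `relations` keys, the `source.table` naming scheme, and the `cross_source_links` argument are all hypothetical, and the cross-source associations are taken as already produced by an earlier fusion step, as the text assumes.

```python
def _add_edge(graph, u, v):
    """Record an undirected association between two table vertices."""
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)


def build_schema_graph(sources, cross_source_links=()):
    """Map every table to a vertex and every association to an edge.

    sources: {source_name: {"tables": [...], "relations": [(a, b), ...]}}
    cross_source_links: associations between tables of different sources,
    given as fully qualified "source.table" names (assumed precomputed).
    """
    graph = {}
    for source, schema in sources.items():
        for table in schema["tables"]:
            graph.setdefault(f"{source}.{table}", set())
        for a, b in schema["relations"]:
            _add_edge(graph, f"{source}.{a}", f"{source}.{b}")
    for a, b in cross_source_links:
        _add_edge(graph, a, b)
    return graph


# Hypothetical data sources used only for illustration.
sources = {
    "crm": {"tables": ["customer", "contact"],
            "relations": [("contact", "customer")]},
    "billing": {"tables": ["account", "invoice"],
                "relations": [("invoice", "account")]},
}
graph = build_schema_graph(sources, [("crm.customer", "billing.account")])
```

Qualifying each vertex with its source name keeps same-named tables from different sources distinct, which matters once the per-source graphs are joined into one fused graph.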
A specific embodiment of the present invention is as follows:
1. Identifying and discovering the entity candidate set. An entity can map onto the graph in two ways: onto a single vertex, or onto multiple interconnected vertices. The mapped graph of an entity is therefore a center connected subgraph. Conversely, a center connected subgraph does not necessarily correspond to a unique entity. For the data graph, this process identifies all possible entity types contained in the data and finds an entity candidate set that is as large as possible. To that end, all center connected subgraphs are found by path search on the graph and together constitute the entity candidate set.
2. Analysis and judgment. Once all center connected subgraphs have been found by the above step, the entity candidate set is obtained. For the nodes and edges of each center connected subgraph (nodes represent database table information and edges represent association information), the center connected subgraphs are filtered and matched according to the given business scenario, logic rules, or specific domain knowledge, yielding the center connected subgraphs that match and correspond to the business logic, that is, entities that carry expressive meaning and can be mapped and described. Because an entity is something that physically exists or is conceptually well defined, each entity should have corresponding semantics and information.
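The path search mentioned in step 1 is not specified further. One way to realize it, shown here purely as an assumption, is a breadth-first search from each center that collects every table reachable within a bounded number of associations. The `max_hops` parameter, the function names, and the example graph are hypothetical.

```python
from collections import deque


def reachable_within(graph, center, max_hops):
    """Breadth-first search: tables at most max_hops associations from center."""
    distance = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return set(distance)


def entity_candidate_set(graph, max_hops=2):
    """One candidate node set per center; every vertex serves as a center once."""
    return {center: reachable_within(graph, center, max_hops) for center in graph}


# Hypothetical chain of associated tables, given as adjacency sets.
graph = {
    "customer": {"orders"},
    "orders": {"customer", "order_item"},
    "order_item": {"orders", "product"},
    "product": {"order_item"},
}
candidate_set = entity_candidate_set(graph, max_hops=2)
```

A larger `max_hops` produces larger candidates, in line with the stated aim of making the candidate set as large as possible, and leaves more work for the rule-based filtering in step 2.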
As can be seen from the above, the multi-source heterogeneous data entity recognition method and system based on center connected subgraphs of the present invention can describe isolated multi-source heterogeneous structured data intuitively and expressively, highlight the relationships between database tables, and provide an efficient, usable model basis for fusing multi-source structured data. In addition, the invention searches for center connected subgraphs to discover an entity candidate set in the fused data and then identifies entities from that set, thereby identifying data entities that carry more expressive meaning.
The above describes preferred embodiments of the present invention, but the invention is not limited to these embodiments. Those skilled in the art can make various equivalent variations or substitutions without departing from the spirit of the invention, and such equivalent variations or substitutions all fall within the scope defined by the claims of this application.
Claims (6)
1. A multi-source heterogeneous data entity recognition method based on center connected subgraphs, characterized by comprising the following steps:
Identifying and discovering all center connected subgraphs in a given graph, thereby obtaining an entity candidate set;
According to the entity candidate set and preset logic rules, filtering and matching the center connected subgraphs in the entity candidate set to obtain the center connected subgraphs that match the business logic.
2. The multi-source heterogeneous data entity recognition method based on center connected subgraphs according to claim 1, characterized in that the step of identifying and discovering all center connected subgraphs in a given graph to obtain an entity candidate set is specifically:
Presetting a node as the center node of a connected subgraph, searching for the other nodes connected to that center node, and forming a subgraph from them to obtain a center connected subgraph;
Traversing all vertices of the given graph and presetting each vertex in turn as the center node, until all center connected subgraphs of the given graph have been identified and discovered.
3. The multi-source heterogeneous data entity recognition method based on center connected subgraphs according to claim 1, characterized in that the step of filtering and matching the center connected subgraphs in the entity candidate set according to the entity candidate set and preset logic rules, to obtain the center connected subgraphs that match the business logic, is specifically:
For the nodes and edges of each center connected subgraph in the entity candidate set, filtering and matching the center connected subgraphs according to the preset logic rules to obtain the center connected subgraphs that match the business logic.
4. A multi-source heterogeneous data entity recognition system based on center connected subgraphs, characterized by comprising:
A recognition unit, configured to identify and discover all center connected subgraphs in a given graph, thereby obtaining an entity candidate set;
A matching unit, configured to filter and match the center connected subgraphs in the entity candidate set according to the entity candidate set and preset logic rules, to obtain the center connected subgraphs that match the business logic.
5. The multi-source heterogeneous data entity recognition system based on center connected subgraphs according to claim 4, characterized in that the recognition unit specifically comprises:
A subgraph construction unit, configured to preset a node as the center node of a connected subgraph, search for the other nodes connected to that center node, and form a subgraph from them to obtain a center connected subgraph;
A traversal unit, configured to traverse all vertices of the given graph and preset each vertex in turn as the center node, until all center connected subgraphs of the given graph have been identified and discovered.
6. The multi-source heterogeneous data entity recognition system based on center connected subgraphs according to claim 5, characterized in that the matching unit is specifically configured to:
For the nodes and edges of each center connected subgraph in the entity candidate set, filter and match the center connected subgraphs according to the preset logic rules to obtain the center connected subgraphs that match the business logic.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710188586.2A CN107180024A (en) | 2017-03-27 | 2017-03-27 | Multi-source heterogeneous data entity recognition method and system based on center connected subgraphs |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710188586.2A CN107180024A (en) | 2017-03-27 | 2017-03-27 | Multi-source heterogeneous data entity recognition method and system based on center connected subgraphs |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107180024A true CN107180024A (en) | 2017-09-19 |
Family
ID=59830209
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710188586.2A Pending CN107180024A (en) | 2017-03-27 | 2017-03-27 | Multi-source heterogeneous data entity recognition method and system based on center connected subgraphs |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107180024A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107886107A (en) * | 2017-09-26 | 2018-04-06 | 赵淦森 | A kind of fusion method of big data, system and device |
CN108804599A (en) * | 2018-05-29 | 2018-11-13 | 浙江大学 | A kind of fast searching method of similar subgraph |
CN112052404A (en) * | 2020-09-23 | 2020-12-08 | 西安交通大学 | Group discovery method, system, device and medium for multi-source heterogeneous relation network |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104239553A (en) * | 2014-09-24 | 2014-12-24 | 江苏名通信息科技有限公司 | Entity recognition method based on Map-Reduce framework |
CN106021306A (en) * | 2016-05-05 | 2016-10-12 | 上海交通大学 | Ontology matching based case search system |
-
2017
- 2017-03-27 CN CN201710188586.2A patent/CN107180024A/en active Pending
Non-Patent Citations (2)
Title |
---|
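Wang Hongzhi (王宏志) et al.: "Research on entity recognition techniques for complex data" (复杂数据上的实体识别技术研究), Chinese Journal of Computers (《计算机学报》) * |
Huo Ran (霍然): "Research and implementation of the entity recognition subsystem in a quantity-quality fusion data management system" (量质融合数据管理系统中实体识别子系统的研究与实现), China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库-信息科技辑》) * |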
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107886107A (en) * | 2017-09-26 | 2018-04-06 | 赵淦森 | A kind of fusion method of big data, system and device |
CN107886107B (en) * | 2017-09-26 | 2021-03-30 | 赵淦森 | Big data fusion method, system and device |
CN108804599A (en) * | 2018-05-29 | 2018-11-13 | 浙江大学 | A kind of fast searching method of similar subgraph |
CN108804599B (en) * | 2018-05-29 | 2022-01-04 | 浙江大学 | Rapid searching method for similar transaction modes |
CN112052404A (en) * | 2020-09-23 | 2020-12-08 | 西安交通大学 | Group discovery method, system, device and medium for multi-source heterogeneous relation network |
CN112052404B (en) * | 2020-09-23 | 2023-08-15 | 西安交通大学 | Group discovery method, system, equipment and medium of multi-source heterogeneous relation network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110941612B (en) | Autonomous data lake construction system and method based on associated data | |
CN107609052B (en) | A kind of generation method and device of the domain knowledge map based on semantic triangle | |
CN111026842B (en) | Natural language processing method, natural language processing device and intelligent question-answering system | |
WO2020135048A1 (en) | Data merging method and apparatus for knowledge graph | |
JP6894534B2 (en) | Information processing method and terminal, computer storage medium | |
CN103699689B (en) | Method and device for establishing event repository | |
CN104346377B (en) | A kind of data integration and transfer method based on unique mark | |
CN111159330B (en) | Database query statement generation method and device | |
CN109325040B (en) | FAQ question-answer library generalization method, device and equipment | |
US20150154286A1 (en) | Method for disambiguated features in unstructured text | |
CN107391677A (en) | Carry the generation method and device of the Universal Chinese character knowledge mapping of entity-relationship-attribute | |
CN104239513A (en) | Semantic retrieval method oriented to field data | |
CN111914550B (en) | Knowledge graph updating method and system oriented to limited field | |
CN105808853B (en) | A kind of ontological construction management of Engineering Oriented application and ontology data automatic obtaining method | |
Achichi et al. | Automatic key selection for data linking | |
CN107180024A (en) | A kind of multi-source heterogeneous data entity recognition methods of center connected subgraph and system | |
CN106202167B (en) | A kind of oriented label figure adaptive index construction method based on structural outline model | |
Paulus et al. | Gathering and Combining Semantic Concepts from Multiple Knowledge Bases. | |
CN105989097A (en) | Ontology-based knowledge base query method and system | |
CN110442730A (en) | A kind of knowledge mapping construction method based on deepdive | |
CN110263021B (en) | Theme library generation method based on personalized label system | |
Shcherban et al. | Multiclass Classification of Four Types of UML Diagrams from Images Using Deep Learning. | |
CN114153983A (en) | Multi-source construction method of industry knowledge graph | |
CN111984745A (en) | Dynamic expansion method, device, equipment and storage medium for database field | |
CN115827885A (en) | Operation and maintenance knowledge graph construction method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170919 | |