CN106021401A - Extensible entity analysis algorithm based on reverse indices - Google Patents
Extensible entity analysis algorithm based on reverse indices
- Publication number
- CN106021401A
- Authority
- CN
- China
- Prior art keywords
- entity
- record
- algorithm based
- task
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an extensible entity resolution algorithm based on an inverted index. The algorithm involves a data source set, an entity set, a record set and an attribute set, and comprises the following steps: step 1, load the data and preprocess it, the preprocessing result being an RDD; step 2, parse each record into key/value form, where the key is the ID and the value is a character string containing the Title, Description and Manufacturer information. The algorithm treats the measurement of record similarity as a string similarity comparison and can run in parallel on a Spark cluster.
Description
Technical field
The invention discloses an extensible entity resolution algorithm based on an inverted index.
Background technology
Entity resolution (ER) is the task of identifying, linking and grouping the different representations of the same real-world object in structured or unstructured data. It is an important research problem in data management, data integration (data fusion), data cleansing and data mining. Finding matching entities across two data sets is one of its typical applications, and it is a core step in mining heterogeneous data. ER methods generally compute a similarity function between records and compare the result with a threshold to decide whether two records are a matching entity. Because this requires comparing all records pairwise, the process is very time-consuming. Researchers have therefore recently proposed blocking-based ER techniques: the data set is preprocessed according to some feature or rule and divided into smaller blocks, and ER is carried out within these blocks to improve efficiency.
ER is more challenging in the era of big data. First, heterogeneous, unstructured data sets have different schemas and representations and may even have data quality problems. Second, an ER algorithm should be scalable and able to run in parallel on a cluster. Third, finding matching entities in large data sets requires algorithms that are efficient in time, space and communication overhead. Classical ER algorithms focus mainly on the effectiveness of entity identification, that is, how accurately they identify objects describing the same entity, and there has been little research on scalable ER algorithms for big data.
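The similarity-and-threshold matching described above can be illustrated with a minimal sketch. It assumes token-set Jaccard similarity as the similarity function and a threshold of 0.5; the text does not name a specific function or threshold, and the helper names (tokenize, jaccard, match_pairs) and sample records are hypothetical.

```python
from itertools import combinations


def tokenize(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two strings, computed over their token sets."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_pairs(records, threshold):
    """Compare every pair of records and keep pairs scoring at or above the threshold."""
    matches = []
    for (id_a, text_a), (id_b, text_b) in combinations(sorted(records.items()), 2):
        if jaccard(text_a, text_b) >= threshold:
            matches.append((id_a, id_b))
    return matches


# Hypothetical sample records: ID -> descriptive string.
records = {
    "r1": "acme photo editor 5",
    "r2": "acme photo editor 5 deluxe",
    "r3": "zenith tax planner",
}
print(match_pairs(records, threshold=0.5))  # [('r1', 'r2')]
```

Every pair is scored here, so the work grows quadratically with the number of records, which is the cost that blocking is meant to reduce.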
Summary of the invention
To solve the problems of the prior art, the present invention provides an extensible entity resolution algorithm based on an inverted index that treats record similarity measurement as a string similarity comparison and can run in parallel on a Spark cluster.
The specific technical scheme of the invention is as follows. An extensible entity resolution algorithm based on an inverted index, characterized in that it involves a data source set, an entity set, a record set and an attribute set, which are processed through the following steps:
Step 1: load the data and preprocess it; the result is an RDD;
Step 2: parse each record into key/value form, where the key is the ID and the value is a character string containing the Title, Description and Manufacturer information.
As a further limitation of the invention, the algorithm also includes a program: the Spark application written by the user, which is responsible for expressing a Spark scheduling job as a high-level control flow. In the driver program the user can define RDD transformations or perform actions.
As a further limitation of the invention, a SparkContext object is created in the driver program. The SparkContext can connect to various types of cluster manager, and the cluster manager allocates resources. Once the SparkContext is connected to a cluster manager, the cluster starts a Spark executor on each worker node. The driver program passes code and tasks to the executors, which perform the various computations on the RDD to complete the tasks; after a task completes, it writes its data to the file system.
Technical effect of the invention: the extensible entity resolution algorithm based on an inverted index treats record similarity measurement as a string similarity comparison and can run in parallel on a Spark cluster.
Brief description of the drawings
Fig. 1 is a schematic diagram of the extensible entity resolution algorithm based on an inverted index according to an embodiment of the invention.
Detailed description of the invention
The invention is further described below with reference to the accompanying drawing.
As shown in Fig. 1, this embodiment uses Apache Spark, an efficient, general-purpose cluster computing system for large-scale data analysis and processing. Spark uses a cyclic data flow model in which each parallel operation is cached on the Worker nodes of the cluster. Because partitioned data is cached in the memory of the Worker nodes, Spark can analyze big data efficiently and interactively. Spark abstracts distributed data as a Resilient Distributed Dataset (RDD), a read-only collection of distributed data objects. Spark tracks the lineage of each RDD so that data lost through a node failure or crash can be rebuilt efficiently. Spark provides two classes of RDD operation: Transformations and Actions. Transformations (such as map, join and reduceByKey) are lazily evaluated, that is, they are not computed immediately. A transformed RDD is computed only when an Action (such as count, collect or reduce) runs on it. An RDD can also be cached in memory for efficient later computation.
The Spark application written by the user is called the driver program; as shown in Fig. 1, it is responsible for expressing a Spark scheduling job as a high-level control flow. In the driver program the user can define RDD transformations or actions, and these operations run on the worker nodes of the cluster. A SparkContext object is created in the driver program. The SparkContext can connect to various types of cluster manager, such as Spark Standalone or YARN, and these cluster managers allocate resources. Once the SparkContext is connected to a cluster manager, the cluster starts a Spark executor on each worker node. The driver program passes code and tasks to the executors, which perform the various computations on the RDD to complete the tasks; after a task completes, it writes its data to the file system.
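The difference between lazy Transformations and eager Actions can be seen in a toy, pure-Python stand-in. This is not Spark's API or implementation: the ToyRDD class and its methods are hypothetical and exist only to show that transformations merely record lineage, while an action replays that lineage over the source data.

```python
class ToyRDD:
    """Pure-Python stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, lineage=()):
        self._source = source    # the original input data, never modified
        self._lineage = lineage  # ordered transformations still to be applied

    def map(self, fn):
        # Transformation: returns a new ToyRDD and computes nothing yet.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: also lazy.
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def _compute(self):
        # Replay the recorded lineage from the source data.
        items = iter(self._source)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def collect(self):
        # Action: triggers evaluation and returns all results.
        return list(self._compute())

    def count(self):
        # Action: triggers evaluation and returns the number of results.
        return sum(1 for _ in self._compute())


calls = []


def traced_square(x):
    calls.append(x)  # record that the function actually ran
    return x * x


rdd = ToyRDD([1, 2, 3, 4]).map(traced_square).filter(lambda x: x % 2 == 0)
print(len(calls))     # 0: nothing has been computed yet
print(rdd.collect())  # [4, 16]
print(len(calls))     # 4: the action ran the map over every element
```

Because each ToyRDD keeps its source and lineage, any result can be recomputed from scratch, which loosely mirrors how lineage tracking allows lost data to be rebuilt.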
The first step of the ER algorithm is to load the data and preprocess it; the result is an RDD. Each line of the data set is a record that must be parsed into key/value form, where the key is the ID and the value is a character string containing the Title, Description and Manufacturer information. The code snippet for data loading and preprocessing is as follows:
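# sc is the SparkContext created in the driver program
loadedData = (sc.textFile(filename, 4)  # read the file with at least 4 partitions
              .map(parseDatafileLine)   # parse each line into key/value form
              .cache())                 # keep the parsed RDD in memory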
Here the textFile function loads the data into Spark, and the parseDatafileLine function parses each line into key/value form.
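The body of the parsing function is not given in the text. The following is one possible sketch, assuming each line is a comma-separated record whose first four fields are the ID, Title, Description and Manufacturer, optionally quoted; the actual file layout is an assumption, as is the choice to return None for malformed lines.

```python
import csv


def parseDatafileLine(line):
    """Parse one line into (ID, text).

    Assumed layout (not stated in the source): id, title, description, manufacturer.
    """
    fields = next(csv.reader([line]))
    if len(fields) < 4:
        return None  # malformed line; the caller may filter these out
    record_id, title, description, manufacturer = fields[:4]
    # Join the non-empty descriptive fields into a single string value.
    value = " ".join(part for part in (title, description, manufacturer) if part)
    return (record_id, value)


print(parseDatafileLine('"b0001","Photo Editor","Edits photos","Acme"'))
# ('b0001', 'Photo Editor Edits photos Acme')
```

With this sketch a Spark pipeline would need to drop the None results, for example with a filter step after the map.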
The extensible entity resolution algorithm based on an inverted index of the present invention treats record similarity measurement as a string similarity comparison and can run in parallel on a Spark cluster. It should be noted that the preferred embodiment above only illustrates the technical concept and features of the invention; its purpose is to enable those skilled in the art to understand and implement the invention, and it does not limit the scope of protection. All equivalent changes or modifications made according to the spirit of the invention shall fall within its scope of protection.
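The description does not spell out how the inverted index is built or used. One common reading, sketched here as an assumption rather than as the patented method, is that each token maps to the IDs of the records containing it, so that only records sharing at least one token are compared. The function names and sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_inverted_index(records):
    """Map each token to the sorted list of record IDs whose text contains it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for token in set(text.lower().split()):
            index[token].add(record_id)
    return {token: sorted(ids) for token, ids in index.items()}


def candidate_pairs(index):
    """Return the pairs of record IDs that share at least one token."""
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(ids, 2))
    return sorted(pairs)


# Hypothetical key/value records as produced by the parsing step.
records = {
    "r1": "acme photo editor",
    "r2": "acme photo editor deluxe",
    "r3": "zenith tax planner",
    "r4": "zenith tax planner pro",
}
index = build_inverted_index(records)
print(index["acme"])           # ['r1', 'r2']
print(candidate_pairs(index))  # [('r1', 'r2'), ('r3', 'r4')]
```

In this example only 2 of the 6 possible pairs become candidates for the string similarity comparison. On a Spark cluster the same idea would typically be expressed with RDD operations that emit (token, ID) pairs and group them by token, though the text gives no such code.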
Claims (3)
- 1. An extensible entity resolution algorithm based on an inverted index, characterized in that it involves a data source set, an entity set, a record set and an attribute set, which are processed through the following steps: Step 1: load the data and preprocess it, the result being an RDD; Step 2: parse each record into key/value form, where the key is the ID and the value is a character string containing the Title, Description and Manufacturer information.
- 2. The extensible entity resolution algorithm based on an inverted index, characterized in that the algorithm also includes a program, namely the Spark application written by the user, which is responsible for expressing a Spark scheduling job as a high-level control flow; in the driver program the user can define RDD transformations or perform actions.
- 3. The extensible entity resolution algorithm based on an inverted index, characterized in that a SparkContext object is created in the driver program; the SparkContext can connect to various types of cluster manager, and the cluster manager allocates resources; once the SparkContext is connected to a cluster manager, the cluster starts a Spark executor on each worker node; the driver program passes code and tasks to the executors, which perform the various computations on the RDD to complete the tasks; after a task completes, it writes its data to the file system.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610316161.0A CN106021401A (en) | 2016-05-16 | 2016-05-16 | Extensible entity analysis algorithm based on reverse indices |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610316161.0A CN106021401A (en) | 2016-05-16 | 2016-05-16 | Extensible entity analysis algorithm based on reverse indices |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106021401A (en) | 2016-10-12 |
Family
ID=57099386
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610316161.0A Pending CN106021401A (en) | 2016-05-16 | 2016-05-16 | Extensible entity analysis algorithm based on reverse indices |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106021401A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107391704A (en) * | 2017-07-28 | 2017-11-24 | 重庆邮电大学 | A kind of entity Unified Algorithm based on Spark frameworks |
WO2018129787A1 (en) * | 2017-01-10 | 2018-07-19 | 网宿科技股份有限公司 | Data persistence method and system in stream computing |
CN111984257A (en) * | 2020-06-29 | 2020-11-24 | 山东浪潮通软信息科技有限公司 | Solid modeling customized extension method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021103492A1 (en) | Risk prediction method and system for business operations | |
Ho et al. | Online monitoring of metric temporal logic | |
Dijkman et al. | Aligning business process models | |
CN105550268B (en) | Big data process modeling analysis engine | |
Song et al. | Efficient alignment between event logs and process models | |
CN106293648B (en) | Services Composition behavior compliance measure based on Route Dependence figure | |
US9098630B2 (en) | Data selection | |
CN104809244B (en) | Data digging method and device under a kind of big data environment | |
CN105808438B (en) | A kind of Reuse of Test Cases method based on function call path | |
CN106126601A (en) | A kind of social security distributed preprocess method of big data and system | |
CN102053912A (en) | Device and method for automatically testing software based on UML (unified modeling language) graphs | |
Assy et al. | An automated approach for assisting the design of configurable process models | |
CN109799990A (en) | Source code annotates automatic generation method and system | |
CN103729295A (en) | Method for analyzing taint propagation path | |
CN103116574A (en) | Method for mining domain process ontology from natural language text | |
CN112579586A (en) | Data processing method, device, equipment and storage medium | |
CN106021401A (en) | Extensible entity analysis algorithm based on reverse indices | |
Soetens et al. | An initial investigation into change-based reconstruction of floss-refactorings | |
CN106649329A (en) | Safety production big data mining system | |
Hartmann et al. | Model-driven analytics: Connecting data, domain knowledge, and learning | |
CN109783353A (en) | A kind of program analysis method and terminal device | |
CN102880500B (en) | The optimization method of a kind of task tree and device | |
CN106294139B (en) | A kind of Detection and Extraction method of repeated fragment in software code | |
Joishi et al. | Graph or relational databases: A speed comparison for process mining algorithm | |
Petermann et al. | Graph mining for complex data analytics |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20161012 |