CN106021401A - Extensible entity analysis algorithm based on reverse indices - Google Patents

Extensible entity analysis algorithm based on reverse indices


Publication number: CN106021401A
Authority: CN (China)
Prior art keywords: entity, record, algorithm based, task, data
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201610316161.0A
Other languages: Chinese (zh)
Inventor: 陈敏刚
Current Assignee: Individual (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Individual
Priority date: 2016-05-16 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2016-05-16
Publication date: 2016-10-12
Application filed by Individual
Priority to CN201610316161.0A
Publication of CN106021401A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an extensible entity resolution algorithm based on an inverted index. The algorithm involves a data source set, an entity set, a record set and an attribute set, and comprises the following steps: in a first step, the data is loaded and preprocessed, and the preprocessing result is an RDD; in a second step, each record is parsed into key/value form, where the key is the ID and the value is a character string containing the Title, Description and Manufacturer information. The algorithm treats the measurement of record similarity as a string similarity comparison, and it can run in parallel on a Spark cluster.

Description

Extensible entity resolution algorithm based on an inverted index
Technical field
The invention discloses an extensible entity resolution algorithm based on an inverted index.
Background technology
Entity resolution (ER) identifies and links or groups the different representations of the same real-world object in structured or unstructured data. It is an important research problem in fields such as data management, data integration (data fusion), data cleaning and data mining. Finding matching entities across two data sets is one of its typical applications and is a core step in data mining methods for heterogeneous data. ER methods generally compute a similarity function between records and compare the result with a threshold to decide whether two records refer to the same entity. A naive approach has to compare all records pairwise, which is very time-consuming. Researchers have therefore recently proposed blocking-based ER techniques: the data set is preprocessed according to some feature or rule and divided into smaller data blocks, and entity resolution is carried out within those blocks to improve efficiency.
ER is more challenging in the big data era. First, heterogeneous and unstructured data sets have different schemas and representations and may even have data quality problems. Second, an ER algorithm should be scalable and able to run in parallel on a cluster. Third, finding matching entities in large data sets requires algorithms that are efficient in time, space and communication overhead. Classical ER algorithms focus mainly on the effectiveness of entity identification, that is, on accurately identifying the objects that describe the same entity, and there has been little research on scalable ER algorithms for big data.
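The threshold test described in the background can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the patent does not name a similarity function, so Jaccard similarity over word tokens is used here as a stand-in, and the record IDs, sample strings and the names tokens, jaccard and match_pairs are all hypothetical.

```python
from itertools import combinations


def tokens(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match_pairs(records, threshold):
    """Compare every pair of records and keep those at or above the threshold."""
    token_sets = {rid: tokens(text) for rid, text in records.items()}
    matches = []
    for id_a, id_b in combinations(sorted(token_sets), 2):
        score = jaccard(token_sets[id_a], token_sets[id_b])
        if score >= threshold:
            matches.append((id_a, id_b, score))
    return matches


# Hypothetical sample records: ID -> descriptive string.
records = {
    "r1": "acme laser printer 2000",
    "r2": "acme laser printer 2000 series",
    "r3": "globex office chair",
}
print(match_pairs(records, threshold=0.5))  # [('r1', 'r2', 0.8)]
```

The loop over combinations is the all-pairs comparison whose cost grows quadratically with the number of records, which is the cost that blocking is meant to reduce.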
Summary of the invention
To solve the problems of the prior art, the invention provides an extensible entity resolution algorithm based on an inverted index, in which the measurement of record similarity is treated as a string similarity comparison and which can run in parallel on a Spark cluster.
The specific technical solution of the invention is as follows: an extensible entity resolution algorithm based on an inverted index, characterized in that it involves a data source set, an entity set, a record set and an attribute set, and comprises the following steps:
Step 1: load the data and preprocess it; the result is an RDD;
Step 2: parse each record into key/value form, where the key is the ID and the value is a character string containing the Title, Description and Manufacturer information.
As a further limitation of the invention, the algorithm also includes a driver program, that is, the Spark application written by the user, which is responsible for expressing the scheduling of a Spark job as a high-level control flow. In the driver program the user can define RDD transformations or actions.
As a further limitation of the invention, a SparkContext object is created in the driver program. The SparkContext can connect to various types of cluster manager, and the cluster manager allocates resources. Once the SparkContext is connected to a cluster manager, the cluster starts a Spark executor on each worker node. The driver program passes code and tasks to the executors, which perform the computations on the RDDs and complete the tasks; each task writes its data to the file system when it finishes.
Technical effect of the invention: the extensible entity resolution algorithm based on an inverted index treats the measurement of record similarity as a string similarity comparison, and it can run in parallel on a Spark cluster.
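The text does not spell out how the inverted index is built or queried, so the following is only a hedged sketch of one common way an inverted index supports string similarity comparison: each token maps to the records that contain it, and only records that share a token are ever paired. The function names, record IDs and sample strings are hypothetical, and plain Python is used instead of Spark so that the example stays self-contained.

```python
from collections import defaultdict
from itertools import combinations


def build_inverted_index(records):
    """Map each token to the sorted list of record IDs whose text contains it."""
    index = defaultdict(set)
    for rid, text in records.items():
        for token in set(text.lower().split()):
            index[token].add(rid)
    return {token: sorted(ids) for token, ids in index.items()}


def shared_token_counts(index):
    """Count, for every pair of records that co-occur in a posting list,
    how many tokens the two records have in common."""
    counts = defaultdict(int)
    for ids in index.values():
        for pair in combinations(ids, 2):
            counts[pair] += 1
    return dict(counts)


# Hypothetical sample records: ID -> descriptive string.
records = {
    "r1": "acme laser printer 2000",
    "r2": "acme laser printer 2000 series",
    "r3": "globex office chair",
    "r4": "globex laser pointer",
}
index = build_inverted_index(records)
counts = shared_token_counts(index)
for pair, shared in sorted(counts.items()):
    print(pair, shared)
```

In this sample the pairs (r1, r3) and (r2, r3) share no token, so they never appear in a posting list together and are never compared. The shared-token counts can then feed a set-based similarity such as Jaccard without rescanning the record strings.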
Brief description of the drawings
Fig. 1 is a schematic diagram of the extensible entity resolution algorithm based on an inverted index according to an embodiment of the invention.
Detailed description of the invention
The invention is further described below with reference to the accompanying drawing.
As shown in Fig. 1, this embodiment uses Apache Spark, an efficient, general-purpose cluster computing system for large-scale data analysis and processing. Spark uses a cyclic data flow model in which each parallel operation is cached on the worker nodes of the cluster. Caching partitioned data in the memory of the worker nodes lets Spark analyze big data efficiently and interactively. Spark abstracts distributed data as resilient distributed datasets (RDDs); an RDD is a read-only collection of distributed data objects. Spark tracks the lineage of each RDD so that data lost through a node failure or crash can be rebuilt efficiently. Spark provides two classes of RDD operation: transformations and actions. Transformations (such as map, join and reduceByKey) are lazily evaluated, that is, they are not computed immediately. A transformed RDD is computed only when an action (such as count, collect or reduce) is run on it. An RDD can also be cached in memory for efficient reuse in later computations.
The Spark application written by the user is called the driver program. As shown in Fig. 1, it is responsible for expressing the scheduling of a Spark job as a high-level control flow. In the driver program the user can define RDD transformations or actions, and these operations are executed on the worker nodes of the cluster. A SparkContext object is created in the driver program. The SparkContext can connect to various types of cluster manager, such as Spark Standalone or YARN, and these cluster managers allocate resources. Once the SparkContext is connected to a cluster manager, the cluster starts a Spark executor on each worker node. The driver program passes code and tasks to the executors, which perform the computations on the RDDs and complete the tasks; each task writes its data to the file system when it finishes.
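The difference between transformations and actions can be illustrated without a cluster. The class below is a toy, single-machine model written for this illustration only; it is not Spark's API, and the names LazyDataset, double and calls are hypothetical. It records each transformation as a step (a very rough analogue of lineage) and runs the steps only when an action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source  # underlying data, here a list that can be re-read
        self._steps = steps    # recorded transformations, in order

    def map(self, fn):
        # Transformation: return a new dataset with one more recorded step.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        # Transformation: also only recorded, never executed here.
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def _compute(self):
        # Replay the recorded steps over the source, one item at a time.
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def collect(self):
        # Action: triggers the computation and returns all results.
        return list(self._compute())

    def count(self):
        # Action: triggers the computation and returns the number of results.
        return sum(1 for _ in self._compute())


calls = []  # records every invocation of double, to show when work happens


def double(x):
    calls.append(x)
    return x * 2


big = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
print(len(calls))     # 0: no transformation has run yet
print(big.collect())  # [6, 8]
print(len(calls))     # 4: the map ran only when collect() was called
```

Because this toy replays its steps on every action, calling count() after collect() would run double again; in Spark, caching an RDD in memory is what avoids that kind of recomputation.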
The first step of the ER algorithm is to load the data and preprocess it; the result is an RDD. Each line of the data set is parsed into a key/value record, where the key is the ID and the value is a character string containing the Title, Description and Manufacturer information. The code snippet for data loading and preprocessing is as follows:
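LoadedData = (sc.textFile(filename, 4)    # read the file into at least 4 partitions
              .map(parseDatafileLine)     # parse each line into key/value form
              .cache())                   # keep the parsed RDD in memory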
Here the textFile function loads the data into Spark, and the parseDatafileLine function parses each line into a key/value record.
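The body of parseDatafileLine is not given in the text. A minimal sketch is shown below, assuming (this is an assumption, not something the source states) that each input line is a CSV row whose first four fields are the ID, Title, Description and Manufacturer. The sample line is invented for illustration.

```python
import csv


def parseDatafileLine(line):
    """Parse one CSV line into (key, value); return None for malformed lines.

    Assumed layout: id, title, description, manufacturer, then optional extras.
    """
    fields = next(csv.reader([line]), [])
    if len(fields) < 4:
        return None
    record_id, title, description, manufacturer = fields[:4]
    # Join the non-empty descriptive fields into one string for comparison.
    parts = (title, description, manufacturer)
    value = " ".join(part.strip() for part in parts if part.strip())
    return record_id.strip(), value


line = '"b0001","Acme Laser Printer","Prints, scans and copies","Acme Corp"'
print(parseDatafileLine(line))
# ('b0001', 'Acme Laser Printer Prints, scans and copies Acme Corp')
```

Using the csv module rather than a plain split keeps commas inside quoted fields intact. If a function like this were passed to map, the None results for malformed lines would need to be filtered out afterwards.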
The extensible entity resolution algorithm based on an inverted index treats the measurement of record similarity as a string similarity comparison, and it can run in parallel on a Spark cluster. It should be noted that the preferred embodiment above merely illustrates the technical concept and features of the invention; its purpose is to enable those skilled in the art to understand the invention and implement it accordingly, and it does not limit the scope of protection of the invention. All equivalent changes or modifications made in accordance with the spirit of the invention fall within its scope of protection.

Claims (3)

  1. An extensible entity resolution algorithm based on an inverted index, characterized in that it involves a data source set, an entity set, a record set and an attribute set, and comprises the following steps:
    Step 1: load the data and preprocess it; the result is an RDD;
    Step 2: parse each record into key/value form, where the key is the ID and the value is a character string containing the Title, Description and Manufacturer information.
  2. The extensible entity resolution algorithm based on an inverted index, characterized in that the algorithm also includes a driver program, that is, the Spark application written by the user, which is responsible for expressing the scheduling of a Spark job as a high-level control flow; in the driver program the user can define RDD transformations or actions.
  3. The extensible entity resolution algorithm based on an inverted index, characterized in that a SparkContext object is created in the driver program; the SparkContext can connect to various types of cluster manager, and the cluster manager allocates resources; once the SparkContext is connected to a cluster manager, the cluster starts a Spark executor on each worker node; the driver program passes code and tasks to the executors, which perform the computations on the RDDs and complete the tasks; each task writes its data to the file system when it finishes.
CN201610316161.0A 2016-05-16 2016-05-16 Extensible entity analysis algorithm based on reverse indices Pending CN106021401A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610316161.0A CN106021401A (en) 2016-05-16 2016-05-16 Extensible entity analysis algorithm based on reverse indices

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610316161.0A CN106021401A (en) 2016-05-16 2016-05-16 Extensible entity analysis algorithm based on reverse indices

Publications (1)

Publication Number Publication Date
CN106021401A true CN106021401A (en) 2016-10-12

Family

ID=57099386

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610316161.0A Pending CN106021401A (en) 2016-05-16 2016-05-16 Extensible entity analysis algorithm based on reverse indices

Country Status (1)

Country Link
CN (1) CN106021401A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391704A (en) * 2017-07-28 2017-11-24 重庆邮电大学 A kind of entity Unified Algorithm based on Spark frameworks
WO2018129787A1 (en) * 2017-01-10 2018-07-19 网宿科技股份有限公司 Data persistence method and system in stream computing
CN111984257A (en) * 2020-06-29 2020-11-24 山东浪潮通软信息科技有限公司 Solid modeling customized extension method and device

Similar Documents

Publication Publication Date Title
WO2021103492A1 (en) Risk prediction method and system for business operations
Yang et al. A system architecture for manufacturing process analysis based on big data and process mining techniques
Ho et al. Online monitoring of metric temporal logic
CN106294762B (en) Entity identification method based on learning
US9098630B2 (en) Data selection
CN106293648B (en) Services Composition behavior compliance measure based on Route Dependence figure
CN104809244B (en) Data digging method and device under a kind of big data environment
CN105808438B (en) A kind of Reuse of Test Cases method based on function call path
CN106126601A (en) A kind of social security distributed preprocess method of big data and system
CN102053912A (en) Device and method for automatically testing software based on UML (unified modeling language) graphs
Assy et al. An automated approach for assisting the design of configurable process models
CN109799990A (en) Source code annotates automatic generation method and system
CN103729295A (en) Method for analyzing taint propagation path
CN103116574A (en) Method for mining domain process ontology from natural language text
CN106021401A (en) Extensible entity analysis algorithm based on reverse indices
Soetens et al. An initial investigation into change-based reconstruction of floss-refactorings
Hartmann et al. Model-driven analytics: Connecting data, domain knowledge, and learning
CN109783353A (en) A kind of program analysis method and terminal device
CN102880500B (en) The optimization method of a kind of task tree and device
CN106294139B (en) A kind of Detection and Extraction method of repeated fragment in software code
Petermann et al. Graph mining for complex data analytics
CN106682072A (en) Knowledge management based data mining method for digital archives
CN113254517A (en) Service providing method based on internet big data
Xiong et al. ShenZhen transportation system (SZTS): a novel big data benchmark suite
Leśniak et al. Application of the Bayesian networks in construction engineering

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20161012
