CN106021401A

CN106021401A - Extensible entity analysis algorithm based on reverse indices

Info

Publication number: CN106021401A
Application number: CN201610316161.0A
Authority: CN
Inventors: 陈敏刚
Original assignee: Individual
Current assignee: Individual
Priority date: 2016-05-16
Filing date: 2016-05-16
Publication date: 2016-10-12

Abstract

The invention discloses a scalable (extensible) entity resolution algorithm based on an inverted index. The algorithm involves a data source set, an entity set, a record set and an attribute set, which are processed in the following steps: in a first step, the data are loaded and preprocessed, and the preprocessing result is an RDD; in a second step, each record is parsed into key/value form, where the key is the ID and the value is a character string containing the Title, Description and Manufacturer information. The algorithm treats the measurement of record similarity as a string similarity comparison, and it can run in parallel on a Spark cluster.

Description

Scalable entity resolution algorithm based on an inverted index

Technical field

The invention relates to a scalable entity resolution algorithm based on an inverted index.

Background art

Entity resolution (ER) is the task of identifying, linking or grouping the different representations of the same real-world object in structured or unstructured data. It is an important research problem in fields such as data management, data integration (data fusion), data cleaning and data mining. Finding matching entities across two data sets is one of the typical applications of entity resolution, and it is a core step of data mining methods for heterogeneous data. Entity resolution methods generally compute a similarity function between records and compare the result with a threshold to determine whether two records are matching entities. Entity resolution requires a pairwise comparison of all records, and this process is very time-consuming. For this reason, researchers have in recent years proposed partition-based entity resolution techniques: the data set is preprocessed according to some feature or rule and divided into smaller data blocks, and entity resolution is carried out within these blocks to improve the efficiency of the algorithm.

The ER problem is more challenging in the era of big data. First, heterogeneous and unstructured data sets have different data schemas and representations and may even have data quality problems. Second, an ER algorithm should be scalable and able to compute in parallel on a cluster. Third, finding matching entities in large-scale data sets requires algorithms that are efficient in time and space cost and in communication overhead. Classical ER algorithms focus mainly on the effectiveness of entity identification, that is, on how to accurately identify the objects that describe the same entity, while there is still little research on scalable entity resolution algorithms for big data.
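The threshold-based pairwise comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the document does not name a similarity function, so token-level Jaccard similarity is used here as a stand-in, and the function names, the sample records and the threshold of 0.5 are all hypothetical.

```python
import re
from itertools import combinations


def tokenize(text):
    """Lower-case a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two strings, computed over their token sets."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_all_pairs(records, threshold):
    """Compare every pair of records and keep the pairs at or above the threshold."""
    matches = []
    for (id_a, text_a), (id_b, text_b) in combinations(sorted(records.items()), 2):
        if jaccard(text_a, text_b) >= threshold:
            matches.append((id_a, id_b))
    return matches


# Hypothetical records: ID -> string describing the entity.
records = {
    "r1": "Acme Photo Editor 5 by Acme Software",
    "r2": "acme photo editor 5 (Acme Software)",
    "r3": "Globex Tax Planner Deluxe",
}
matches = match_all_pairs(records, threshold=0.5)
```

The loop visits every pair of records, so its cost grows quadratically with the number of records; this is the cost that partition-based techniques aim to reduce.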

Summary of the invention

To solve the problems of the prior art, the present invention provides a scalable entity resolution algorithm based on an inverted index, which treats the measurement of record similarity as a string similarity comparison and which can run in parallel on a Spark cluster.
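The document does not spell out how the inverted index is built, so the following is a minimal sketch of one common approach, not the claimed method: each token is mapped to the records that contain it, and only records that share at least one token are kept as candidate pairs for the string similarity comparison. All names and sample records are hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations


def tokenize(text):
    """Lower-case a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_inverted_index(records):
    """Map each token to the sorted list of record IDs whose string contains it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for token in tokenize(text):
            index[token].add(record_id)
    return {token: sorted(ids) for token, ids in index.items()}


def candidate_pairs(index):
    """Return the pairs of record IDs that share at least one token."""
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(ids, 2))
    return pairs


# Hypothetical records: ID -> string describing the entity.
records = {
    "r1": "Acme Photo Editor 5",
    "r2": "acme photo editor five",
    "r3": "Globex Tax Planner",
    "r4": "Globex Spreadsheet Pro",
}
index = build_inverted_index(records)
pairs = candidate_pairs(index)
```

With these four records there are six possible pairs, but only two share a token and need to be compared. On Spark, the same grouping could plausibly be expressed with RDD operations such as flatMap and groupByKey; that mapping is an assumption and is not taken from the document.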

The specific technical solution of the invention is as follows: a scalable entity resolution algorithm based on an inverted index, characterized in that it involves a data source set, an entity set, a record set and an attribute set, which are processed through the following steps:

Step 1: first, the data are loaded and preprocessed; the result is an RDD;

Step 2: each record is parsed into key/value form, where the key is the ID and the value is a character string containing the Title, Description and Manufacturer information.

As a further limitation of the invention, the algorithm also includes a driver program, that is, the Spark application written by the user, which is responsible for expressing a Spark job as a high-level control flow; in the driver program, the user can define transformations of RDDs or perform actions on them.

As a further limitation of the invention, a SparkContext object is created in the driver program. The SparkContext can connect to various types of cluster managers, which allocate resources. Once the SparkContext is connected to a cluster manager, the cluster starts a Spark executor on each worker node. The driver program passes code and tasks to the executors, which perform the computations on the RDDs and complete the tasks; after a task completes, it writes its data to the file system.

Technical effect of the invention: the scalable entity resolution algorithm based on an inverted index treats the measurement of record similarity as a string similarity comparison, and it can run in parallel on a Spark cluster.

Brief description of the drawings

Fig. 1 is a schematic diagram of the scalable entity resolution algorithm based on an inverted index according to an embodiment of the invention.

Detailed description of the invention

The present invention will be further described below in conjunction with the accompanying drawings.

As shown in Fig. 1, this embodiment uses Apache Spark, an efficient, general-purpose cluster computing system for large-scale data analysis and processing. Spark uses a cyclic data flow model in which each parallel operation has its data cached on the Worker nodes of the cluster. Because partitioned data are cached in the memory of the Worker nodes, Spark can analyze big data efficiently and interactively. Spark abstracts distributed data as a resilient distributed dataset (RDD), which is a read-only collection of distributed data objects. Spark tracks the lineage of each RDD, so that data lost through a node failure or crash can be rebuilt efficiently.

Spark provides two classes of RDD operations: Transformations and Actions. Transformations (such as map, join and reduceByKey) are lazily evaluated, that is, they are not computed immediately. A transformed RDD is computed only when an Action (such as count, collect or reduce) runs on it. An RDD can also be cached in memory for efficient subsequent computation.

The Spark application written by the user is called the driver program; as shown in Fig. 1, it is responsible for expressing a Spark job as a high-level control flow. In the driver program, the user can define transformations of RDDs or perform actions on them, and these operations execute on the worker nodes of the cluster. A SparkContext object is created in the driver program. The SparkContext can connect to various types of cluster managers, such as Spark Standalone or YARN, which allocate resources. Once the SparkContext is connected to a cluster manager, the cluster starts a Spark executor on each worker node. The driver program passes code and tasks to the executors, which perform the computations on the RDDs and complete the tasks; after a task completes, it writes its data to the file system.
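The difference between lazily evaluated Transformations and Actions can be illustrated with a toy model in plain Python. This is not Spark and does not use Spark's API; the class LazyDataset and its methods are hypothetical and only mimic the idea that transformations record lineage while actions replay it.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, lineage=()):
        self._source = source      # the original input data, never modified
        self._lineage = lineage    # tuple of recorded (kind, function) steps

    def map(self, fn):
        # Transformation: returns a new dataset and computes nothing yet.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        # Transformation: also only recorded in the lineage.
        return LazyDataset(self._source, self._lineage + (("filter", fn),))

    def _compute(self):
        # Replay the lineage from the source. Lost results could be
        # rebuilt in the same way, by replaying the recorded steps.
        items = iter(self._source)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    def collect(self):
        # Action: triggers the computation and returns all results.
        return list(self._compute())

    def count(self):
        # Action: triggers the computation and counts the results.
        return sum(1 for _ in self._compute())


calls = []


def length(word):
    calls.append(word)             # record that the function really ran
    return len(word)


lengths = LazyDataset(["spark", "rdd", "executor"]).map(length).filter(lambda n: n > 3)
calls_before_action = len(calls)   # nothing has been computed so far
result = lengths.collect()         # the action runs the recorded steps
```

In this toy model every action replays the whole lineage, which is the situation that caching an RDD in memory is meant to avoid.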

The first step of the ER algorithm is to load the data and preprocess them; the result is an RDD. Each line of the data set is a record that must be parsed into key/value form, where the key is the ID and the value is a character string containing the Title, Description and Manufacturer information. The code snippet for the data loading and preprocessing part is as follows:

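    # sc is the SparkContext created in the driver program
    LoadedData = (sc.textFile(filename, 4)   # read the file with at least 4 partitions
                  .map(parseDatafileLine)    # parse each line into key/value form
                  .cache())                  # keep the resulting RDD in memory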

Here, the textFile function loads the data into Spark, and the parseDatafileLine function parses each line of the data set into key/value form.
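The document does not give the body of parseDatafileLine or the layout of the input file. The following is a minimal sketch of such a parser, assuming that each line is a CSV row with the fields ID, Title, Description and Manufacturer in that order; the function name parse_line and the sample line are hypothetical.

```python
import csv


def parse_line(line):
    """Parse one CSV line into (record_id, text).

    Assumed field order: ID, Title, Description, Manufacturer.
    Returns None for lines with fewer than four fields.
    """
    fields = next(csv.reader([line]))
    if len(fields) < 4:
        return None                # malformed line; the caller can filter it out
    record_id, title, description, manufacturer = fields[:4]
    # The value is one string holding Title, Description and Manufacturer.
    return record_id, " ".join([title, description, manufacturer])


sample = '"b001","Acme Photo Editor","Edits photos, fast","Acme"'
parsed = parse_line(sample)
```

Using the csv module rather than a plain split keeps commas inside quoted fields, such as the one in the sample description, from breaking the record apart.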

The scalable entity resolution algorithm based on an inverted index according to the invention treats the measurement of record similarity as a string similarity comparison, and it can run in parallel on a Spark cluster. It should be noted that the above preferred embodiment merely illustrates the technical concept and features of the invention; its purpose is to enable those skilled in the art to understand the content of the invention and implement it accordingly, and it does not limit the scope of protection of the invention. All equivalent changes or modifications made according to the spirit of the invention shall fall within the scope of protection of the invention.

Claims

1. A scalable entity resolution algorithm based on an inverted index, characterized in that it involves a data source set, an entity set, a record set and an attribute set, which are processed through the following steps:

Step 1: first, the data are loaded and preprocessed; the result is an RDD;

Step 2: each record is parsed into key/value form, where the key is the ID and the value is a character string containing the Title, Description and Manufacturer information.
2. The scalable entity resolution algorithm based on an inverted index, characterized in that the algorithm also includes a driver program, that is, the Spark application written by the user, which is responsible for expressing a Spark job as a high-level control flow; in the driver program, the user can define transformations of RDDs or perform actions on them.
3. The scalable entity resolution algorithm based on an inverted index, characterized in that a SparkContext object is created in the driver program; the SparkContext can connect to various types of cluster managers, which allocate resources; once the SparkContext is connected to a cluster manager, the cluster starts a Spark executor on each worker node; the driver program passes code and tasks to the executors, which perform the computations on the RDDs and complete the tasks; and after a task completes, it writes its data to the file system.