WO2016119508A1 - Large-scale object recognition method based on the Spark system - Google Patents

Large-scale object recognition method based on the Spark system

Info

Publication number
WO2016119508A1
Authority
WO
WIPO (PCT)
Prior art keywords
record
matching
rule
same
data
Prior art date
Application number
PCT/CN2015/094377
Other languages
English (en)
French (fr)
Inventor
王明兴
吴颖徽
马帅
汤南
贾西贝
Original Assignee
深圳市华傲数据技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市华傲数据技术有限公司
Publication of WO2016119508A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Definitions

  • The present invention relates to the field of data processing technologies, and in particular to a large-scale object recognition method based on the Spark system.
  • When consumer behavior needs to be analyzed, the first task is to identify, from the consumer information in each record, which records belong to the same real consumer. Different e-commerce vendors usually record different consumer information, the same real consumer may register different information on each e-commerce website, and some data may be missing or even wrong, so the same consumer cannot be identified by simple deduplication.
  • Object recognition, also known as record matching, aims to identify records that represent the same real-world object across various (unreliable) data sources.
  • Object recognition plays an important role in applications such as data cleaning, data integration, and data analysis.
  • In practical applications, the information of an object usually needs to be associated with information from other data sources.
  • However, information representing the same object in other data sources may be erroneous or represented differently. Object recognition is therefore not simple.
  • With the rapid development of Internet technology, data is growing explosively, and identifying which objects are the same (or similar) in massive data with traditional methods is almost infeasible; the problem urgently needs to be solved. There are two key issues: first, how to identify the same object when data is missing or erroneous; second, how to achieve acceptable matching efficiency on massive data, where traditional strategies are no longer adequate.
  • The Spark system is an open-source, general-purpose parallel distributed computing framework developed by the AMP Lab at the University of California, Berkeley. It is suited to a variety of iterative algorithms and interactive data analysis, is intended to improve the timeliness and accuracy of big data processing, and has gradually gained support from many companies. Spark is an open-source cluster computing environment similar to Hadoop, but it provides in-memory distributed data sets: intermediate results can be kept in memory, so they need not be written to and read back from HDFS, which shortens access latency. In addition to supporting interactive queries, it can optimize iterative workloads. The Spark system is therefore better suited to iterative MapReduce-style algorithms such as those used in data mining and machine learning.
  • The object of the present invention is to provide a large-scale object recognition method based on the Spark system that can improve matching efficiency on massive data.
  • To achieve the above object, the present invention provides a large-scale object recognition method based on the Spark system, including:
  • Step 10: read and parse all matching rules;
  • Step 20: read and parse the records that serve as object description data;
  • Step 30: for each matching rule, if a record has all the attributes required by the matching rule, the matching result is an attribute string composed of the contents of those attributes together with the record id of the record;
  • Step 40: gather the record ids that correspond to the same attribute string into a set of record ids, and use that set to identify one object;
  • Step 50: for each record id of each object, broadcast the object to which it belongs, and perform transitive closure processing on the objects that share a record id to obtain new objects;
  • Step 60: repeat step 50 until the number of objects no longer changes.
  • Step 30 further includes: if the record does not match any matching rule, the matching result includes a special value and the record id of the record.
  • Step 50 includes:
  • Step 501: for each record id of each object, broadcast the object to which it belongs;
  • Step 502: collect the objects to which each record id belongs. If the record id belongs to only one object, mark the status of that object as retained; otherwise merge and deduplicate the record ids of all those objects to generate a new object, mark the status of the new object as new, and mark the status of each old object as deleted;
  • Step 503: merge the status information of each object. If the statuses include new, the object is retained; otherwise, if the statuses include deleted, the object is deleted; otherwise the object is retained;
  • Step 504: output all objects that are to be retained.
  • The attribute string is formed by joining the contents of all those attributes with a connector.
  • Step 10 includes: reading the file that records the matching rules, and obtaining the attribute columns contained in each rule.
  • Step 20 includes: the Spark system reading the source file, and parsing the record data in the source file by splitting each line on a delimiter.
  • The matching rule is defined as follows:
  • the data format of a matching rule includes a rule id and a list of the attribute columns to be compared;
  • the meaning of a matching rule is that, for any two records, if the attributes to be compared are all non-empty and equal, the two records are said to match the rule.
  • For multiple matching rules, any two records that satisfy any one of the rules are said to match.
  • If a first rule determines that a first record and a second record are the same object, and a second rule determines that the second record and a third record are the same object, then the first, second, and third records are the same object.
  • The present invention addresses matching efficiency on massive data by adopting a massively parallel strategy, and avoids the problems of missing and erroneous data through predefined matching rules.
  • FIG. 1 is a flow chart of a preferred embodiment of the large-scale object recognition method based on the Spark system of the present invention.
  • To identify the same object when data is missing or erroneous, the present invention defines several key matching rules in advance; when two consumer records satisfy any one matching rule, they are considered to be the same consumer. For example, the invention may specify that two records with the same consumer name and phone number are considered the same consumer.
  • This approach avoids the problems of missing and erroneous data.
  • To address matching efficiency on massive data, the present invention adopts a massively parallel strategy in which multiple machines process the data in parallel. Specifically, it uses a parallel processing strategy based on the in-memory computing Spark system, which the application states handles object recognition faster than the Hadoop framework.
  • A row of object description data is called a record; the first column, "id", is the unique identifier of the record, and the second and subsequent columns are attributes describing the record.
  • An object may have multiple records or only one. For example, a consumer who has purchase records on several e-commerce websites has multiple records; a consumer who has purchased on only one website has a single record.
  • Matching rules are defined as follows:
  • rule id: list of attribute columns to be compared.
  • For example: rule1:2,3.
  • The meaning of this rule is: for any two records r1 and r2, if the attributes in the second and third columns are all non-empty and equal between the two records, then r1 and r2 are said to match the rule, that is, r1 and r2 are the same object.
  • For multiple matching rules, r1 and r2 match as long as they satisfy any one rule.
  • Transitivity of record matching: if rule a determines that records r1 and r2 are the same object, and rule b determines that records r2 and r3 are the same object, then records r1, r2, and r3 are the same object.
  • The preparatory work for object recognition is to formulate reasonable matching rules for different business data and different requirements.
  • For the consumer example above, the present invention may define the following rules in advance (assuming that column 2 of the data is the name, column 3 the phone number, and column 4 the email address): rule1:2,3; rule2:2,4; rule3:3,4.
  • That is, if two consumers have the same name and phone number, the same name and email address, or the same phone number and email address, they are considered to be the same consumer.
  • The invention adopts a parallel processing strategy based on the in-memory computing Spark system to cope with massive data.
  • Step 10: read and parse all matching rules.
  • The present invention first processes the matching rules.
  • val ruleData = sc.textFile("ruleFileName") // sc is the application's SparkContext instance
  • val rules = ruleData.map(_.split(":")(1).split(",").map(_.toInt)).collect()
  • Step 20: read and parse the records that serve as object description data. Next, the record data is processed. Without loss of generality, the present invention assumes that the data is stored in a text file, one record per line, with column attributes separated by commas.
  • val orgData = sc.textFile("dataFileName")
  • val recorders = orgData.map(_.split(","))
  • The records serving as object description data are input in step 20; the record data format includes the record id and the corresponding attributes. After parsing, the record id and the value of each column attribute are available, for example: 1 Attr1 Attr2 Attr3 Attr4
  • Step 30: for each matching rule, if a record has all the attributes required by the matching rule, the matching result is an attribute string composed of the contents of those attributes together with the record id of the record. Step 30 identifies objects by matching the record data against the matching rules. It first computes, for each rule, which records that rule identifies as representing the same object.
  • The method by which each rule is matched against each record is as follows:
  • In step 30, for each matching rule, the contents of all columns included in the rule are read. If the content of any column is empty, the rule is ignored; otherwise the record is said to match the rule. For example, for the record data above, suppose the rule includes two columns, the second and the fourth. It is then necessary to check whether the contents of the second and fourth columns are empty; if either is empty, the rule is ignored and the next rule is evaluated. Here the contents of the second and fourth columns are "Attr1" and "Attr3", neither of which is empty, so the output is the attribute string "Attr1,Attr3" together with the record id "1".
  • In addition, step 30 may further include: if the record does not match any rule, special content must be output to prevent the record from being lost. For example, the output attribute string may be the record's id value, so that the record id distinguishes it from the column attributes included in the rules.
  • Step 40: gather the record ids that correspond to the same attribute string into a set of record ids, and use that set to identify one object. After the rules have been used to match the data, records that correspond to the same attribute string are the same object. The record ids that correspond to the same attribute string are therefore gathered together and deduplicated, which gives the initial same-object results.
  • In step 40, the set of record ids, that is, the object, may take the following form: all record ids are joined with commas and stored as text, one object per line, such as "1,3,4".
  • Through the above steps, the present invention can compute in parallel which records each matching rule identifies as the same object. For example, rule 1 identifies records 1, 3, and 4 as the same object, and rule 2 identifies records 2 and 4 as the same object. By transitivity, records 1, 2, 3, and 4 all represent the same object, so the results of rule matching need further processing.
  • The present invention calls this step transitive closure; its execution is described in steps 50 and 60. Because several levels of transitivity may exist between objects, the present invention uses an iterative process.
  • Step 50: for each record id of each object, broadcast the object to which it belongs, and perform transitive closure processing on the objects that share a record id to obtain new objects.
  • Specifically, it may include:
  • Step 501: for each record id of each object, broadcast the object to which it belongs;
  • Step 502: collect the objects to which each record id belongs. If the record id belongs to only one object, mark the status of that object as retained; otherwise merge and deduplicate the record ids of all those objects to generate a new object, mark the status of the new object as new, and mark the status of each old object as deleted;
  • Step 503: merge the status information of each object. If the statuses include new, the object is retained; otherwise, if the statuses include deleted, the object is deleted; otherwise the object is retained;
  • Step 504: output all objects that are to be retained.
  • The input of step 50 is the output of step 40 or of the previous iteration, that is, the output of step 504. A text input format may be used.
  • Each line is one object, that is, a set of record ids identifying the same object. For example, the object "1,3,4" produces three outputs: "1"/"1,3,4", "3"/"1,3,4", and "4"/"1,3,4".
  • The purpose of this process is to broadcast which objects each record id belongs to.
  • Each record id of an object adds one piece of status information to that object, and the pieces of status information may be inconsistent.
  • For example, for the object "1,3,4", "1" belongs only to this object, so it adds the status "retained" to the object, whereas "4" belongs to multiple objects, which indicates that "1,3,4" must be merged with other objects and then deleted, keeping the newly created object, so it adds the status "deleted" to the object. All status information of the object must therefore be merged to determine its final status.
  • For example, the first step may produce "1,2", "2,3", and "3,4".
  • Analysis shows that records 1, 2, 3, and 4 all represent the same object. One round of transitive closure yields "1,2,3" and "2,3,4", and a further round is needed to obtain the final result "1,2,3,4". That is, step 60 is performed: step 50 is repeated until the number of objects no longer changes.
  • Steps 50 and 60 are as follows:
  • The method for handling multiple statuses is as follows:
  • At this point, the large-scale object recognition method based on the Spark system of the present invention is complete.
  • In summary, the large-scale object recognition method based on the Spark system adopts a massively parallel strategy, which addresses matching efficiency on massive data, and avoids the problems of missing and erroneous data through predefined matching rules. As is well known, the value of data is such that 1+1>>2.
  • The present invention links data that was originally isolated but highly correlated, and the combined value is much greater than the sum of the individual values.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

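A large-scale object recognition method based on the Spark system. The method includes: step 10, reading and parsing all matching rules (10); step 20, reading and parsing the records that serve as object description data (20); step 30, for each matching rule, if a record has all the attributes required by the matching rule, the matching result is an attribute string composed of the contents of those attributes together with the record id of the record (30); step 40, gathering the record ids that correspond to the same attribute string into a set of record ids, and using that set to identify one object (40); step 50, for each record id of each object, broadcasting the object to which it belongs, and performing transitive closure processing on the objects that share a record id to obtain new objects (50); step 60, repeating step 50 until the number of objects no longer changes (60). A massively parallel strategy addresses matching efficiency on massive data, and predefined matching rules avoid the problems of missing and erroneous data.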

Description

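Large-scale object recognition method based on the Spark system
Technical Field
The present invention relates to the field of data processing technologies, and in particular to a large-scale object recognition method based on the Spark system.
Background Art
With the rapid development of network technology, the use of a large number of network applications and products generates massive amounts of data. When this data needs to be cleaned and integrated, it is necessary to identify which records in the data describe the same real-world object. For example, when e-commerce vendors sell goods they usually record information about the consumer (name, gender, age, phone number, email address, home address, and so on) and about the goods (such as product name, category, unit price, and quantity). When consumer behavior needs to be analyzed, the first task is to identify, from the consumer information in the records, which records belong to the same real consumer. Different e-commerce vendors usually record different consumer information, the same real consumer may register different information on each e-commerce website, and some data may be missing or even wrong, so the same consumer cannot be identified by simple deduplication.
Object recognition, also known as record matching, aims to identify records that represent the same real-world object across various (unreliable) data sources. Object recognition plays an important role in applications such as data cleaning, data integration, and data analysis. In practical applications, the information of an object usually needs to be associated with information from other data sources. However, information representing the same object in other data sources may be erroneous or represented differently. Object recognition is therefore not simple. In particular, with the rapid development of Internet technology, data is growing explosively, and identifying which objects are the same (or similar) in massive data with traditional methods is almost infeasible; the problem urgently needs to be solved. There are two key issues: first, how to identify the same object when data is missing or erroneous; second, how to achieve acceptable matching efficiency on massive data, where traditional strategies are no longer adequate.
On the other hand, the Spark system is an open-source, general-purpose parallel distributed computing framework developed by the AMP Lab at the University of California, Berkeley. It is suited to a variety of iterative algorithms and interactive data analysis, is intended to improve the timeliness and accuracy of big data processing, and has gradually gained support from many companies. Spark is an open-source cluster computing environment similar to Hadoop, but it provides in-memory distributed data sets: intermediate results can be kept in memory, so they need not be written to and read back from HDFS, which shortens access latency. In addition to supporting interactive queries, it can optimize iterative workloads. The Spark system is therefore better suited to iterative MapReduce-style algorithms such as those used in data mining and machine learning.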
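Summary of the Invention
The object of the present invention is to provide a large-scale object recognition method based on the Spark system that can improve matching efficiency on massive data.
To achieve the above object, the present invention provides a large-scale object recognition method based on the Spark system, including:
Step 10: read and parse all matching rules;
Step 20: read and parse the records that serve as object description data;
Step 30: for each matching rule, if a record has all the attributes required by the matching rule, the matching result is an attribute string composed of the contents of those attributes together with the record id of the record;
Step 40: gather the record ids that correspond to the same attribute string into a set of record ids, and use that set to identify one object;
Step 50: for each record id of each object, broadcast the object to which it belongs, and perform transitive closure processing on the objects that share a record id to obtain new objects;
Step 60: repeat step 50 until the number of objects no longer changes.
Step 30 further includes: if the record does not match any matching rule, the matching result includes a special value and the record id of the record.
Step 50 includes:
Step 501: for each record id of each object, broadcast the object to which it belongs;
Step 502: collect the objects to which each record id belongs. If the record id belongs to only one object, mark the status of that object as retained; otherwise merge and deduplicate the record ids of all those objects to generate a new object, mark the status of the new object as new, and mark the status of each old object as deleted;
Step 503: merge the status information of each object. If the statuses include new, the object is retained; otherwise, if the statuses include deleted, the object is deleted; otherwise the object is retained;
Step 504: output all objects that are to be retained.
The attribute string is formed by joining the contents of all those attributes with a connector.
Step 10 includes:
reading the file that records the matching rules;
obtaining the attribute columns contained in each rule.
Step 20 includes:
the Spark system reading the source file;
parsing the record data in the source file by splitting each line on a delimiter.
The matching rule is defined as follows:
the data format of a matching rule includes a rule id and a list of the attribute columns to be compared;
the meaning of a matching rule is that, for any two records, if the attributes to be compared are all non-empty and equal, the two records are said to match the rule.
For multiple matching rules, any two records that satisfy any one of the rules are said to match.
If a first rule determines that a first record and a second record are the same object, and a second rule determines that the second record and a third record are the same object, then the first, second, and third records are the same object.
In summary, the present invention addresses matching efficiency on massive data by adopting a massively parallel strategy, and avoids the problems of missing and erroneous data through predefined matching rules.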
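Brief Description of the Drawings
FIG. 1 is a flow chart of a preferred embodiment of the large-scale object recognition method based on the Spark system of the present invention.
Detailed Description
The technical solutions of the present invention and their beneficial effects will become apparent from the following detailed description of specific embodiments, taken in conjunction with the accompanying drawings.
To identify the same object when data is missing or erroneous, the present invention defines several key matching rules in advance; when two consumer records satisfy any one matching rule, they are considered to be the same consumer. For example, the invention may specify that two records with the same consumer name and phone number are considered the same consumer, which avoids the problems of missing and erroneous data. To address matching efficiency on massive data, the present invention adopts a massively parallel strategy in which multiple machines process the data in parallel. Specifically, it uses a parallel processing strategy based on the in-memory computing Spark system, which the application states handles object recognition faster than the Hadoop framework.
The processing details of the present invention are described below.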
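· Concept definitions
Without loss of generality, a preferred embodiment of the present invention uses the following general object description data format:
id  Name  Gender  Employer
1  王明兴  华傲数据
Record: in the present invention, a row of object description data is called a record; the first column, "id", is the unique identifier of the record, and the second and subsequent columns are attributes describing the record.
Object: in the present invention, the same real-world entity is called an object, for example the same consumer or the same item.
An object may have multiple records or only one. For example, a consumer who has purchase records on several e-commerce websites has multiple records; a consumer who has purchased on only one website has a single record.
Matching rule: a preferred embodiment of the present invention defines matching rules as follows:
rule id: list of attribute columns to be compared.
For example: rule1:2,3.
The meaning of this rule is: for any two records r1 and r2, if the attributes in the second and third columns are all non-empty and equal between the two records, then r1 and r2 are said to match the rule, that is, r1 and r2 are the same object.
For multiple matching rules, r1 and r2 match as long as they satisfy any one rule.
Transitivity of record matching: if rule a determines that records r1 and r2 are the same object, and rule b determines that records r2 and r3 are the same object, then records r1, r2, and r3 are the same object.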
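· Formulating matching rules
The preparatory work for object recognition is to formulate reasonable matching rules for different business data and different requirements. For the consumer example above, the present invention may define the following rules in advance (assuming that column 2 of the data is the name, column 3 the phone number, and column 4 the email address):
rule1:2,3
rule2:2,4
rule3:3,4
That is, if two consumers have the same name and phone number, the same name and email address, or the same phone number and email address, they are considered to be the same consumer.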
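The pairwise meaning of the three rules above can be sketched in plain Scala, without Spark. This is a minimal illustration under stated assumptions: the object name `RuleMatch`, the functions `matches` and `matchesAny`, and the sample consumer values in the usage note are hypothetical and do not come from the application. Column numbers are 1-based, with column 1 holding the record id, as in the rule format above.

```scala
object RuleMatch {
  // The three example rules: each rule is the list of 1-based columns to compare.
  val rules: Seq[Seq[Int]] = Seq(Seq(2, 3), Seq(2, 4), Seq(3, 4))

  // Value of a 1-based column, or "" when the column is absent.
  private def column(record: Seq[String], col: Int): String =
    record.lift(col - 1).getOrElse("")

  // Two records match a rule when every compared column is non-empty and equal.
  def matches(rule: Seq[Int], r1: Seq[String], r2: Seq[String]): Boolean =
    rule.forall { col =>
      val a = column(r1, col)
      a.nonEmpty && a == column(r2, col)
    }

  // With several rules, satisfying any one of them is enough.
  def matchesAny(r1: Seq[String], r2: Seq[String]): Boolean =
    rules.exists(rule => matches(rule, r1, r2))
}
```

For instance, `RuleMatch.matchesAny(Seq("1", "Alice", "555-0100", ""), Seq("2", "Alice", "555-0100", "alice@example.com"))` holds through rule1, even though the first record has no email address. The method itself never compares records pairwise; it groups them by attribute string, which gives the same result without a quadratic comparison.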
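The detailed steps of the present invention are illustrated below with reference to pseudocode and to the flow chart, shown in FIG. 1, of a preferred embodiment of the large-scale object recognition method based on the Spark system.
· Identifying the same object
Once the matching rules have been formulated, the next step is to use them to identify the same object. The present invention adopts a parallel processing strategy based on the in-memory computing Spark system to cope with massive data.
Step 10: read and parse all matching rules. The present invention first processes the matching rules.
The matching rule file is first read and parsed, as follows:
a. Read the file that records the matching rules:
val ruleData = sc.textFile("ruleFileName") // sc is the application's SparkContext instance
b. Parse it, ignoring the rule id, to obtain the attribute columns (rule columns) contained in each matching rule:
val rules = ruleData.map(_.split(":")(1).split(",").map(_.toInt)).collect()
Step 20: read and parse the records that serve as object description data. Next, the record data is processed. Without loss of generality, the present invention assumes that the data is stored in a text file, one record per line, with column attributes separated by commas.
a. The Spark system reads the source file:
val orgData = sc.textFile("dataFileName")
b. Parse the record data in the source file, splitting each line on commas:
val recorders = orgData.map(_.split(","))
The records serving as object description data are input in step 20; the record data format includes the record id and the corresponding attributes. After parsing, the record id and the value of each column attribute are available, for example:
1 Attr1 Attr2 Attr3 Attr4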
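The per-line parsing in steps 10 and 20 can be tried on in-memory strings without a Spark cluster. The sketch below is an illustration only: `ParseInputs`, `parseRule`, and `parseRecord` are hypothetical names. It differs from the snippets above in one labeled respect: it passes a limit of `-1` to `split`, an assumption made here so that trailing empty columns are kept rather than dropped.

```scala
object ParseInputs {
  // "rule1:2,3" -> Seq(2, 3); the rule id before ':' is ignored, as in step 10.
  def parseRule(line: String): Seq[Int] =
    line.split(":")(1).split(",").map(_.trim.toInt).toSeq

  // "1,Attr1,Attr2" -> Seq("1", "Attr1", "Attr2").
  // The limit -1 keeps trailing empty columns, so "1,Attr1,," has four columns.
  def parseRecord(line: String): Seq[String] =
    line.split(",", -1).toSeq
}
```

Keeping the empty columns makes a missing attribute show up as an empty string at a known position, which is what the emptiness check in step 30 relies on.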
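Step 30: for each matching rule, if a record has all the attributes required by the matching rule, the matching result is an attribute string composed of the contents of those attributes together with the record id of the record. Step 30 identifies objects by matching the record data against the matching rules. It first computes, for each rule, which records that rule identifies as representing the same object.
Matching the data with the matching rules:
[Figure PCTCN2015094377-appb-000001]
The method by which each rule is matched against each record is as follows:
[Figure PCTCN2015094377-appb-000002]
In step 30, for each matching rule, the contents of all columns included in the rule are read. If the content of any column is empty, the rule is ignored; otherwise the record is said to match the rule. For example, for the record data above, suppose the rule includes two columns, the second and the fourth. It is then necessary to check whether the contents of the second and fourth columns are empty; if either is empty, the rule is ignored and the next rule is evaluated. Here the contents of the second and fourth columns are "Attr1" and "Attr3", neither of which is empty, so the output is the attribute string "Attr1,Attr3" together with the record id "1".
In addition, step 30 may further include: if the record does not match any rule, special content must be output to prevent the record from being lost. For example, the output attribute string may be the record's id value, so that the record id distinguishes it from the column attributes included in the rules.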
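The listings for step 30 are images that are not reproduced in this text, so the following is only a sketch of the behavior the prose describes, written in plain Scala rather than as a Spark transformation. `Step30` and `matchRecord` are hypothetical names. The key format follows the worked example ("Attr1,Attr3"), and the fallback key is the record id, as the text suggests; both are assumptions about details the text leaves open.

```scala
object Step30 {
  // Returns (attribute string, record id) pairs for one parsed record.
  // Rules list 1-based columns; column 1 of the record is its id.
  def matchRecord(rules: Seq[Seq[Int]], record: Seq[String]): Seq[(String, String)] = {
    val id = record.head
    val hits = rules.flatMap { rule =>
      val values = rule.map(col => record.lift(col - 1).getOrElse(""))
      // A rule is skipped when any of its columns is empty or absent.
      if (values.forall(_.nonEmpty)) Some((values.mkString(","), id)) else None
    }
    // No rule applied: emit the record id as the key so the record is not lost.
    if (hits.nonEmpty) hits else Seq((id, id))
  }
}
```

With the rule columns 2 and 4 and the record `1,Attr1,Attr2,Attr3,Attr4`, this yields the single pair `("Attr1,Attr3", "1")`. Because the key carries no rule id, two different rules could in principle produce the same attribute string; the text does not say how that case is handled, so the sketch does not address it.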
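Step 40: gather the record ids that correspond to the same attribute string into a set of record ids, and use that set to identify one object. After the rules have been used to match the data, records that correspond to the same attribute string are the same object. The record ids that correspond to the same attribute string are therefore gathered together and deduplicated, which gives the initial same-object results:
var sameObject = matchData.groupByKey().map(x => x._2.toSet)
In step 40, the set of record ids, that is, the object, may take the following form: all record ids are joined with commas and stored as text, one object per line, such as "1,3,4".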
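The grouping in step 40 has a direct counterpart on ordinary Scala collections, which makes it easy to check on a small input. In this sketch `Step40` and `groupObjects` are hypothetical names, and `groupBy` on a local `Seq` stands in for the `groupByKey` call on the distributed data set shown above.

```scala
object Step40 {
  // Input: (attribute string, record id) pairs from step 30.
  // Output: one set of record ids per distinct attribute string.
  def groupObjects(matchData: Seq[(String, String)]): Set[Set[String]] =
    matchData
      .groupBy { case (attrString, _) => attrString }
      .values
      .map(pairs => pairs.map { case (_, id) => id }.toSet) // toSet removes duplicate ids
      .toSet
}
```

Collecting the ids into a `Set` covers the deduplication the text asks for; a record id that reaches the same attribute string twice appears once in the resulting object.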
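Through the above steps, the present invention can compute in parallel which records each matching rule identifies as the same object. For example, rule 1 identifies records 1, 3, and 4 as the same object, and rule 2 identifies records 2 and 4 as the same object. By transitivity, records 1, 2, 3, and 4 all represent the same object, so the results of rule matching need further processing. The present invention calls this step transitive closure; its execution is described in steps 50 and 60. Because several levels of transitivity may exist between objects, the present invention uses an iterative process.
Step 50: for each record id of each object, broadcast the object to which it belongs, and perform transitive closure processing on the objects that share a record id to obtain new objects.
Specifically, it may include:
Step 501: for each record id of each object, broadcast the object to which it belongs;
Step 502: collect the objects to which each record id belongs. If the record id belongs to only one object, mark the status of that object as retained; otherwise merge and deduplicate the record ids of all those objects to generate a new object, mark the status of the new object as new, and mark the status of each old object as deleted;
Step 503: merge the status information of each object. If the statuses include new, the object is retained; otherwise, if the statuses include deleted, the object is deleted; otherwise the object is retained;
Step 504: output all objects that are to be retained.
The input of step 50 is the output of step 40 or of the previous iteration, that is, the output of step 504. A text input format may be used, in which each line is one object, that is, a set of record ids identifying the same object. For example, the object "1,3,4" produces three outputs: "1"/"1,3,4", "3"/"1,3,4", and "4"/"1,3,4". The purpose of this process is to broadcast which objects each record id belongs to.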
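The broadcast of step 501 amounts to a flat map from one object line to one pair per record id. The sketch below reproduces the "1,3,4" example on a plain string; `Step501` and `broadcast` are hypothetical names, and the comma-joined text form of an object is the one described for step 40.

```scala
object Step501 {
  // "1,3,4" -> ("1", "1,3,4"), ("3", "1,3,4"), ("4", "1,3,4")
  def broadcast(objectLine: String): Seq[(String, String)] =
    objectLine.split(",").map(_.trim).toSeq.map(id => (id, objectLine))
}
```

Grouping these pairs by their first element then gives, for every record id, the list of objects it belongs to, which is the input step 502 needs.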
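Each record id of an object adds one piece of status information to that object, and the pieces of status information may be inconsistent. For example, for the object "1,3,4", "1" belongs only to this object, so it adds the status "retained" to the object, whereas "4" belongs to multiple objects, which indicates that "1,3,4" must be merged with other objects and then deleted, keeping the newly created object, so it adds the status "deleted" to the object. All status information of the object must therefore be merged to determine its final status. For example, the first step may produce "1,2", "2,3", and "3,4". Analysis shows that records 1, 2, 3, and 4 all represent the same object, yet one round of transitive closure yields "1,2,3" and "2,3,4", and a further round is needed to obtain the final result "1,2,3,4". That is, step 60 is performed: step 50 is repeated until the number of objects no longer changes.
Steps 50 and 60 are as follows:
[Figure PCTCN2015094377-appb-000003]
[Figure PCTCN2015094377-appb-000004]
The method for handling multiple statuses is as follows:
[Figure PCTCN2015094377-appb-000005]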
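Since the listings for steps 50 and 60 are images that are not reproduced here, the following plain-Scala sketch is a reconstruction from the prose of steps 501 to 504, not the application's own code. The names `TransitiveClosure`, `closureRound`, `closure`, and the three status objects are hypothetical. One deliberate difference is labeled in the code: the loop stops when a round returns exactly the same set of objects rather than the same number of objects, because a round can return a different set of equal size (a chain "1,2", "2,3", "3,4", "4,5" gives three objects after the first round and three different objects after the second).

```scala
import scala.annotation.tailrec

object TransitiveClosure {
  type Obj = Set[String] // an object is a set of record ids

  sealed trait Status
  case object Retained extends Status
  case object Added extends Status
  case object Deleted extends Status

  // One pass of steps 501 to 504.
  def closureRound(objects: Set[Obj]): Set[Obj] = {
    // Step 501: broadcast (record id, owning object) pairs.
    val broadcast: Seq[(String, Obj)] =
      objects.toSeq.flatMap(obj => obj.toSeq.map(id => (id, obj)))

    // Step 502: per record id, retain a single owner or merge all owners.
    val statuses: Seq[(Obj, Status)] =
      broadcast.groupBy(_._1).values.toSeq.flatMap { pairs =>
        val owners = pairs.map(_._2).distinct
        if (owners.size == 1) Seq((owners.head, Retained: Status))
        else {
          val merged: Obj = owners.flatten.toSet
          (merged, Added: Status) +: owners.map(o => (o, Deleted: Status))
        }
      }

    // Step 503: "new" wins over "deleted"; an object with neither is retained.
    statuses
      .groupBy(_._1)
      .filter { case (_, entries) =>
        val seen: Set[Status] = entries.map(_._2).toSet
        seen.contains(Added) || !seen.contains(Deleted)
      }
      .keySet // Step 504: the objects that are retained
  }

  // Step 60. Assumption of this sketch: stop on an unchanged set, not an unchanged count.
  @tailrec
  def closure(objects: Set[Obj]): Set[Obj] = {
    val next = closureRound(objects)
    if (next == objects) next else closure(next)
  }
}
```

On the example from the text, `closureRound` turns "1,2", "2,3", "3,4" into "1,2,3" and "2,3,4", and `closure` reaches the single object "1,2,3,4" after one more round. Giving "new" priority in step 503 matters when a merged object coincides with an existing one: that object receives both "new" and "deleted" and must be kept.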
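At this point, the large-scale object recognition method based on the Spark system of the present invention is complete.
In summary, the large-scale object recognition method based on the Spark system of the present invention adopts a massively parallel strategy, which addresses matching efficiency on massive data, and avoids the problems of missing and erroneous data through predefined matching rules. As is well known, the value of data is such that 1+1>>2: the present invention links data that was originally isolated but highly correlated, and the combined value is much greater than the sum of the individual values.
The above is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.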

Claims (9)

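  1. A large-scale object recognition method based on the Spark system, characterized by comprising:
    step 10: reading and parsing all matching rules;
    step 20: reading and parsing the records that serve as object description data;
    step 30: for each matching rule, if a record has all the attributes required by the matching rule, the matching result is an attribute string composed of the contents of those attributes together with the record id of the record;
    step 40: gathering the record ids that correspond to the same attribute string into a set of record ids, and using that set to identify one object;
    step 50: for each record id of each object, broadcasting the object to which it belongs, and performing transitive closure processing on the objects that share a record id to obtain new objects;
    step 60: repeating step 50 until the number of objects no longer changes.
  2. The large-scale object recognition method based on the Spark system according to claim 1, characterized in that step 30 further comprises: if the record does not match any matching rule, the matching result includes a special value and the record id of the record.
  3. The large-scale object recognition method based on the Spark system according to claim 1, characterized in that step 50 comprises:
    step 501: for each record id of each object, broadcasting the object to which it belongs;
    step 502: collecting the objects to which each record id belongs; if the record id belongs to only one object, marking the status of that object as retained; otherwise merging and deduplicating the record ids of all those objects to generate a new object, marking the status of the new object as new, and marking the status of each old object as deleted;
    step 503: merging the status information of each object; if the statuses include new, the object is retained; if the statuses include deleted, the object is deleted; otherwise the object is retained;
    step 504: outputting all objects that are to be retained.
  4. The large-scale object recognition method based on the Spark system according to claim 1, characterized in that the attribute string is formed by joining the contents of all those attributes with a connector.
  5. The large-scale object recognition method based on the Spark system according to claim 1, characterized in that step 10 comprises:
    reading the file that records the matching rules;
    obtaining the attribute columns contained in each rule.
  6. The large-scale object recognition method based on the Spark system according to claim 1, characterized in that step 20 comprises:
    the Spark system reading the source file;
    parsing the record data in the source file by splitting each line on a delimiter.
  7. The large-scale object recognition method based on the Spark system according to claim 1, characterized in that the matching rule comprises:
    the data format of a matching rule includes a rule id and a list of the attribute columns to be compared;
    the meaning of a matching rule is that, for any two records, if the attributes to be compared are all non-empty and equal, the two records are said to match the rule.
  8. The large-scale object recognition method based on the Spark system according to claim 1, characterized in that, for multiple matching rules, any two records that satisfy any one of the rules are said to match.
  9. The large-scale object recognition method based on the Spark system according to claim 1, characterized in that, if a first rule determines that a first record and a second record are the same object, and a second rule determines that the second record and a third record are the same object, then the first record, the second record, and the third record are the same object.
PCT/CN2015/094377 2015-01-30 2015-11-12 Large-scale object recognition method based on the Spark system WO2016119508A1 (zh)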

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510047803.7A CN104573098B (zh) 2015-01-30 2015-01-30 Large-scale object recognition method based on the Spark system
CN2015100478037 2015-01-30

Publications (1)

Publication Number Publication Date
WO2016119508A1 (zh) 2016-08-04

Family

ID=53089160

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/094377 WO2016119508A1 (zh) 2015-01-30 2015-11-12 Large-scale object recognition method based on the Spark system

Country Status (2)

Country Link
CN (1) CN104573098B (zh)
WO (1) WO2016119508A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104573094B (zh) * 2015-01-30 2018-05-29 深圳市华傲数据技术有限公司 网络账号识别匹配方法
CN104573098B (zh) * 2015-01-30 2018-05-29 深圳市华傲数据技术有限公司 基于Spark系统的大规模对象识别方法
CN106294530B (zh) * 2015-06-29 2019-09-13 阿里巴巴集团控股有限公司 规则匹配的方法和系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7035842B2 (en) * 2002-01-17 2006-04-25 International Business Machines Corporation Method, system, and program for defining asset queries in a digital library
CN102122280A (zh) * 2009-12-17 2011-07-13 北大方正集团有限公司 一种智能提取内容对象的方法及系统
CN103020782A (zh) * 2012-12-25 2013-04-03 远光软件股份有限公司 内部关联交易业务的自动识别和提取方法及系统
CN104239501A (zh) * 2014-09-10 2014-12-24 中国电子科技集团公司第二十八研究所 一种基于Spark的海量视频语义标注方法
CN104573098A (zh) * 2015-01-30 2015-04-29 深圳市华傲数据技术有限公司 基于Spark系统的大规模对象识别方法

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103118043B (zh) * 2011-11-16 2015-12-02 阿里巴巴集团控股有限公司 一种用户账号的识别方法及设备
US9639676B2 (en) * 2012-05-31 2017-05-02 Microsoft Technology Licensing, Llc Login interface selection for computing environment user login

Also Published As

Publication number Publication date
CN104573098A (zh) 2015-04-29
CN104573098B (zh) 2018-05-29

Similar Documents

Publication Publication Date Title
CN106649455B (zh) 一种大数据开发的标准化系统归类、命令集系统
CN104298771B (zh) 一种海量web日志数据查询与分析方法
CN103620601B (zh) 在映射缩减过程中汇合表
CN111339427B (zh) 一种图书信息推荐方法、装置、系统及存储介质
CN106970929B (zh) 数据导入方法及装置
Agarwal et al. Approximate incremental big-data harmonization
WO2017096892A1 (zh) 索引构建方法、查询方法及对应装置、设备、计算机存储介质
US20140046899A1 (en) Method and Apparatus of Implementing Navigation of Product Properties
WO2016119508A1 (zh) 基于Spark系统的大规模对象识别方法
JP2019520627A (ja) データベース中にグラフ情報を記憶するためのb木使用
CN107729330B (zh) 获取数据集的方法和装置
WO2016119276A1 (zh) 基于Hadoop框架的大规模对象识别方法
Benny et al. Hadoop framework for entity resolution within high velocity streams
Kim et al. Customer preference analysis based on SNS data
CN110929509B (zh) 一种基于louvain社区发现算法的领域事件触发词聚类方法
JP6438295B2 (ja) ハイパーグラフソルバーのためのグラフ入力の自動編集
US11868362B1 (en) Metadata extraction from big data sources
CN110704635A (zh) 一种知识图谱中三元组数据的转换方法及装置
Lee et al. Hands-On Big Data Modeling: Effective database design techniques for data architects and business intelligence professionals
JP6457290B2 (ja) グラフを剪定する方法、前記グラフを剪定する方法をコンピュータに行なわせる命令を記録している非一時的なコンピュータ可読記憶媒体、及びグラフの剪定を行うためのコンピュータシステム
Turner Hadoop: What it is, how it works, and what it can do
CN112115271A (zh) 知识图谱构建方法及装置
Cao E-Commerce Big Data Mining and Analytics
US11500933B2 (en) Techniques to generate and store graph models from structured and unstructured data in a cloud-based graph database system
CN117033346A (zh) 一种基于企业数据的数仓建模方法、系统、设备及介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15879706

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15879706

Country of ref document: EP

Kind code of ref document: A1