CN103336771A - Data similarity detection method based on sliding window - Google Patents


Info

Publication number
CN103336771A
Authority
CN
China
Prior art keywords
queue
variable
vector
attribute
record
Prior art date
Application number
CN2013101142448A
Other languages
Chinese (zh)
Other versions
CN103336771B (en
Inventor
周莲英
周典瑞
Original Assignee
江苏大学
Priority date
Filing date
Publication date
Application filed by 江苏大学
Priority to CN201310114244.8A
Publication of CN103336771A
Application granted
Publication of CN103336771B


Abstract

The invention discloses a data similarity detection method based on a sliding window, comprising the following steps: S1, an empirical vector G of the attributes is computed by a ranking method; S2, a statistic vector C of the attributes is computed by a mathematical-statistics method; S3, the empirical vector G and the statistic vector C are combined to compute a final weight vector W; S4, the window upper bound of a queue with variable window size is computed; S5, a plurality of threads is created according to the number of attributes; S6, in each thread the record set is scanned sequentially and the similarity between the current record and the records in the variable queue is computed; and S7, the duplicate record sets detected by the threads are merged. Replacing multi-round detection with a detection algorithm based on concurrent threads improves detection efficiency and saves detection time.

Description

Data similarity detection method based on a sliding window

Technical Field

[0001] The present invention relates to the technical field of data cleaning, and in particular to a sliding-window-based data similarity detection method for massive data.

Background Art

[0002] Data similarity detection finds approximately duplicate records in a database so that redundant data can be eliminated. Approximately duplicate records are different representations of the same real-world entity within a data set; because they differ in format, spelling and the like, the database management system cannot identify them correctly, which in turn affects the correct processing of the data. The metrics for approximately-duplicate-record detection include recall, precision and time efficiency, and the three often constrain one another. For massive data, recall and time efficiency are the most pressing concerns. The detection algorithm therefore needs to be optimized in several respects to improve both detection quality and detection efficiency.

[0003] Existing detection algorithms mainly include field matching algorithms, edit distance algorithms, clustering algorithms, and sliding-window-based detection algorithms. Of these, the sliding-window-based algorithms are the most effective. Such an algorithm sorts the record set and, relying on the principle that similar records end up adjacent, limits the number of records each record is compared with to a bounded window, which greatly improves detection efficiency. The advantages of classic sliding-window similarity detection are a simple algorithm and a bounded number of comparisons. It also has problems: (1) it ignores the fact that different attributes of a record influence the detection result differently, and treats all attributes equally; (2) there is no uniform standard for setting the window size. Researchers in China and abroad have therefore proposed optimizations. For example, 《数据仓库中的相似重复记录检测方法[J]》 ("Detection method for approximately duplicate records in a data warehouse") proposes weighting the attributes by a ranking method, since attributes differ in their influence on a record, and uses a simple multi-round detection technique to improve recall. "An adaptive and efficient algorithm for detecting approximately duplicate database records" replaces the fixed-size sliding window with a dynamically formed queue of similar record sets (in effect a variable queue); each entry in the variable queue represents a record set of a different degree of similarity rather than a single record, which improves recall.
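The classic fixed-window approach described above can be sketched roughly as follows. This is a minimal illustration, not code from the patent: the function names, the use of `difflib.SequenceMatcher` as the similarity test, and the 0.8 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical similarity test: share of matching characters."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def sliding_window_duplicates(records, window=3, threshold=0.8):
    """Sort the records, then compare each record only with the
    (window - 1) records that precede it in sorted order."""
    ordered = sorted(records)
    pairs = []
    for i, current in enumerate(ordered):
        for j in range(max(0, i - (window - 1)), i):
            if similar(ordered[j], current, threshold):
                pairs.append((ordered[j], current))
    return pairs


if __name__ == "__main__":
    data = ["mary jones", "jon smith", "zed adams", "john smith"]
    print(sliding_window_duplicates(data))
```

Because every record is compared with at most `window - 1` neighbours, the number of comparisons grows linearly with the record count, which is the efficiency argument made in the paragraph above. It also shows the weakness the patent targets: duplicates that sort far apart are never compared.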

[0004] Building on the above methods, this document further optimizes data similarity detection for massive data. (1) The attribute weight vector is computed in a principled way; a sound weight computation ensures that approximately duplicate records end up adjacent when the record set is sorted on different attributes. (2) A window upper bound is determined for the variable-window-size queue that holds the sets of approximately duplicate records, which greatly reduces the number of record comparisons and shortens comparison time. (3) Multithreading replaces multi-round duplicate detection, which reduces detection time and improves detection efficiency.

Summary of the Invention

[0005] The present invention proposes a sliding-window-based data similarity detection method for massive data. It uses a combined weighting method that integrates user experience with mathematical statistics to compute an attribute weight vector that characterizes the records; it computes the window upper bound of the variable-window-size queue that holds the sets of approximately duplicate records; and it runs the detection algorithm concurrently in multiple threads, one per attribute. During detection, each thread uses the variable-window-size queue to check the record set record by record, and the detection results of the threads are finally merged.

[0006] To achieve the above object, the technical solution provided by the embodiments of the present invention is as follows:

A data similarity detection method based on a sliding window, the method comprising:

S1. computing an empirical vector G of the attributes by a ranking method;

S2. computing a statistic vector C of the attributes by a mathematical-statistics method;

S3. combining the empirical vector G and the statistic vector C to compute a final weight vector W;

S4. computing the window upper bound of the queue with variable window size;

S5. creating a plurality of threads according to the number of attributes;

S6. in each thread, scanning the record set sequentially and computing the similarity between the current record and the records in the variable queue;

S7. merging the duplicate record sets detected by the threads.

[0007] As a further improvement of the present invention, step S1 is specifically:

According to user experience, each attribute is assigned a rank by the ranking method, and the empirical vector G, which represents the attribute characteristics of the records, is then computed by the mean method.

[0008] As a further improvement of the present invention, step S2 is specifically:

A specified number of records is drawn at random several times, and the number of distinct values taken by each attribute is counted as an objective measure of the attribute's importance to the record; the mean method is then used to compute the number of distinct values of each attribute, yielding the attribute statistic vector C.

[0009] As a further improvement of the present invention, the empirical vector G is computed by the formula:

Figure CN103336771AD00051

where G_i denotes the final unified rank of the i-th attribute, m denotes the number of users, and s denotes the s-th operating user.

[0011] As a further improvement of the present invention, the statistic vector C is computed by the formula:

Figure CN103336771AD00052

where C_ij denotes the number of distinct values of the j-th attribute in the i-th sampling, C_j denotes the final number of distinct values of the j-th attribute, and m denotes the number of samplings.

[0012] As a further improvement of the present invention, the weight vector W is computed by the formula:

Figure CN103336771AD00053

where W_i denotes the weight of the i-th attribute, G_i denotes the final unified rank of the i-th attribute, and C_i denotes the final number of distinct values of the i-th attribute.

[0013] As a further improvement of the present invention, before the record set is scanned sequentially in each thread in step S6, the method further comprises:

sorting the data set in each thread according to the attribute value.

[0014] As a further improvement of the present invention, "computing the similarity between the current record and the records in the variable queue" in step S6 is specifically:

the current record is tested for similarity against the first record in the variable queue;

if the current record is similar to the first record in the variable queue, the current record is added to the set of approximately duplicate records and is then added to the first record of the variable queue: the queue is checked to see whether it is full; if it is full, the last record in the queue is removed first and the current record is then added to the queue; if the queue is not full, the record is added to the queue directly; if the current record is not similar to the first record in the variable queue, the comparison continues with the other records in the variable queue.

[0015] The present invention computes the attribute weights from both subjective user experience and objective mathematical statistics, instead of from subjective rank weighting alone. During similarity detection, a queue of variable window size (with a determined window upper bound) replaces the fixed-size sliding window in scanning the data set sequentially and checking every record. In addition, a detection algorithm based on concurrent threads replaces multi-round detection, which improves detection efficiency and saves detection time.

Brief Description of the Drawings

[0016] FIG. 1 is a schematic flowchart of the sliding-window-based data similarity detection method of the present invention.

[0017] FIG. 2 is a detailed flowchart of the sliding-window-based data similarity detection method in a specific embodiment of the present invention.

Detailed Description

[0018] The present invention is described in detail below with reference to the embodiments shown in the drawings. These embodiments do not limit the invention; structural, methodological or functional changes made by those of ordinary skill in the art on the basis of these embodiments all fall within the scope of protection of the invention.

[0019] Referring to FIG. 1, a data similarity detection method based on a sliding window according to the present invention comprises:

S1. computing an empirical vector G of the attributes by a ranking method;

S2. computing a statistic vector C of the attributes by a mathematical-statistics method;

S3. combining the empirical vector G and the statistic vector C to compute a final weight vector W;

S4. computing the window upper bound of the queue with variable window size;

S5. creating a plurality of threads according to the number of attributes;

S6. in each thread, scanning the record set sequentially and computing the similarity between the current record and the records in the variable queue;

S7. merging the duplicate record sets detected by the threads.

[0020] The specific algorithms for the empirical vector G, the statistic vector C and the final weight vector W in steps S1 to S3 are as follows:

Combined weighting method integrating user experience and mathematical statistics: the attributes of a record in the database describe the characteristics of an entity, and the attributes differ in how strongly they determine those characteristics, so a weight needs to be computed for each attribute.

[0021] (1) Subjective aspect: G_st (1 ≤ s ≤ m, 1 ≤ t ≤ k) is the rank that the s-th operating user assigns to attribute A_t on the basis of personal experience (ranks are consecutive positive integers starting from 1; 1 is the highest rank, and the larger the value, the lower the rank). G_t denotes the final unified rank of the t-th attribute; the final unified rank of each attribute is computed by the mean method.

Figure CN103336771AD00061

[0022] where G_i (1 ≤ i ≤ k) denotes the final unified rank of the i-th attribute, and m denotes the number of users.
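The subjective step can be illustrated with a short sketch. The formula image is not reproduced in this text, so the code assumes that the "mean method" is a plain arithmetic mean of the ranks the m users assign to each attribute; the function name and the input layout (one list of k ranks per user) are hypothetical.

```python
def empirical_vector(ranks_by_user):
    """Average the ranks given by each user, attribute by attribute.

    ranks_by_user[s][t] is the rank user s assigns to attribute t
    (1 is the highest rank). Assumes a plain arithmetic mean.
    """
    m = len(ranks_by_user)        # number of users
    k = len(ranks_by_user[0])     # number of attributes
    return [sum(user[t] for user in ranks_by_user) / m for t in range(k)]


if __name__ == "__main__":
    # Two users rank three attributes.
    print(empirical_vector([[1, 2, 3], [1, 3, 2]]))
```

With these inputs both users agree that the first attribute matters most, while the other two attributes tie on average.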

(2) Objective aspect: if the values of an attribute all differ from one another, it is easy to tell records apart by that attribute. On the objective side, a random statistics method (drawing a certain number of records at random several times) is used to count the number of distinct values of each attribute, as an objective measure of the attribute's importance to the record. C_ij (1 ≤ j ≤ k) denotes the number of distinct values of the j-th attribute in the i-th sampling, C_j denotes the final number of distinct values of the j-th attribute, and the mean method is used to compute the final number of distinct values of each attribute.

Figure CN103336771AD00062

[0023] where C_j denotes the final number of distinct values of the j-th attribute, and m denotes the number of samplings.

Combining the two formulas above, the attribute weight vector W = (W_1, W_2, ..., W_k) is computed:
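The objective step can be sketched in the same way. Again the formula image is unavailable, so the code assumes an arithmetic mean over the samplings; the function name, the tuple-based record layout and the optional `seed` parameter are assumptions introduced for the example.

```python
import random


def statistic_vector(records, sample_size, rounds, seed=None):
    """Estimate, per attribute, the mean number of distinct values
    observed across several random samples of the record set.

    records is a list of equal-length tuples; sampling is without
    replacement within each round.
    """
    rng = random.Random(seed)
    k = len(records[0])
    totals = [0] * k
    for _ in range(rounds):
        sample = rng.sample(records, sample_size)
        for j in range(k):
            totals[j] += len({rec[j] for rec in sample})
    return [total / rounds for total in totals]


if __name__ == "__main__":
    # Attribute 0 never varies, attribute 1 is unique per record.
    data = [("x", i) for i in range(10)]
    print(statistic_vector(data, sample_size=5, rounds=4, seed=0))
```

An attribute whose value never changes scores 1 regardless of the sample, while a unique identifier scores the full sample size, which matches the intuition that highly varied attributes discriminate better between records.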

Figure CN103336771AD00071

Computing the window upper bound of the variable-window-size queue: a certain number of records is drawn at random several times; one thread is started per attribute, and each thread sorts the sample on its attribute key and detects the approximately duplicate records; the distance between two approximately duplicate records (the number of records lying between them) is then computed:

Figure CN103336771AD00072

where size denotes the window upper bound of the variable queue, d_i denotes the distance between two approximately duplicate records in the i-th sample, and m denotes the number of samplings.
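Since the formula image is missing, the exact aggregation is unknown. The sketch below makes two labeled assumptions: within each sample the distance is taken as the largest gap between any two similar records, and the bound is the mean of those gaps over the samples, rounded up. The function names and the `is_similar` predicate are hypothetical.

```python
import math


def duplicate_gap(sorted_sample, is_similar):
    """Largest number of records lying between two similar records
    in one already-sorted sample (0 if no similar pair exists)."""
    gap = 0
    n = len(sorted_sample)
    for i in range(n):
        for j in range(i + 1, n):
            if is_similar(sorted_sample[i], sorted_sample[j]):
                gap = max(gap, j - i - 1)
    return gap


def window_upper_bound(sorted_samples, is_similar):
    """Assumed aggregation: mean gap over all samples, rounded up."""
    gaps = [duplicate_gap(sample, is_similar) for sample in sorted_samples]
    return math.ceil(sum(gaps) / len(gaps))


if __name__ == "__main__":
    same_entity = lambda a, b: a[1] == b[1]
    samples = [
        [(1, "x"), (2, "y"), (3, "x")],                      # gap of 1
        [(1, "p"), (2, "q"), (3, "r"), (4, "s"), (5, "p")],  # gap of 3
    ]
    print(window_upper_bound(samples, same_entity))
```

A maximum instead of a mean would give a more conservative bound at the cost of more comparisons; which of the two the patent intends cannot be determined from the text alone.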

[0024] Referring to FIG. 2, in a specific embodiment of the present invention, the sliding-window-based data similarity detection method for massive data specifically comprises:

S1. according to user experience, assigning each attribute a rank by the ranking method, and then computing by the mean method the empirical vector G that represents the attribute characteristics of the records;

S2. drawing a specified number of records at random several times and counting the number of distinct values taken by each attribute, as an objective measure of the attribute's importance to the record; using the mean method to compute the number of distinct values of each attribute, yielding the attribute statistic vector C;

S3. combining the two vectors above to compute the final weight vector W;

S4. computing the window upper bound of the queue with variable window size;

S5. creating a plurality of threads according to the number of attributes;

S6. in each thread, scanning the record set sequentially and computing the similarity between the current record and the records in the variable queue. Specifically: the current record is tested for similarity against the first record in the variable queue;

if the current record is similar to the first record in the variable queue, the current record is added to the set of approximately duplicate records and is then added to the first record of the variable queue: the queue is checked to see whether it is full; if it is full, the last record in the queue is removed first and the current record is then added to the queue; if the queue is not full, the record is added to the queue directly;

if the current record is not similar to the first record in the variable queue, the comparison continues with the other records in the variable queue;

S7. merging the duplicate record sets detected by the threads.

[0025] Preferably, before "in each thread, scanning the record set sequentially" in step S6, the method further comprises:

sorting the data set in each thread according to the attribute value.
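The per-thread work of step S6 (sort on one attribute, then scan with a bounded queue) might look like the sketch below. It is one interpretation rather than the patent's implementation: records are assumed to be tuples, `is_similar` is a hypothetical predicate, the current record is pushed to the front of the queue, and records that match nothing are also queued so that later records can be compared with them, a case the text does not spell out.

```python
from collections import deque


def detect_in_thread(records, attr, is_similar, max_size):
    """Sort on one attribute, then compare each record with the
    queue entries, front first, keeping at most max_size entries."""
    ordered = sorted(records, key=lambda rec: rec[attr])
    queue = deque()
    duplicates = []
    for current in ordered:
        for queued in queue:              # first record, then the others
            if is_similar(queued, current):
                duplicates.append((queued, current))
                break
        if len(queue) >= max_size:        # queue full: drop the last record
            queue.pop()
        queue.appendleft(current)         # assumption: new record goes in front
    return duplicates


if __name__ == "__main__":
    same_name = lambda a, b: a[0].lower() == b[0].lower()
    data = [("bob", "la"), ("ann", "ny"), ("Ann", "ny")]
    print(detect_in_thread(data, attr=0, is_similar=same_name, max_size=2))
```

The `max_size` argument plays the role of the window upper bound from step S4: it caps how many earlier records each new record can be compared with.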

The present invention first computes the attribute weight vector in a principled way: the empirical vector G of the attributes is computed by the ranking method; the statistic vector C of the attributes is computed by the mathematical-statistics method; and the weight vector W is computed from the empirical vector and the statistic vector. It then determines the window upper bound of the variable-window-size queue that holds the sets of approximately duplicate records, by computing, over several random samples, the distance between two approximately duplicate records (the number of records lying between them). Finally, multithreading replaces simple multi-round duplicate detection: threads are created according to the number of attributes, approximately duplicate records are detected in each thread, and the duplicate record sets detected by the threads are merged. The method has the following features:

the attribute vector space is computed by the combined weighting method;

[0026] a queue technique with variable window size is used.

Compared with the prior art, the present invention computes the attribute weights from both subjective user experience and objective mathematical statistics, instead of from subjective rank weighting alone. During similarity detection, a queue of variable window size (with a determined window upper bound) replaces the fixed-size sliding window in scanning the data set sequentially and checking every record. In addition, a detection algorithm based on concurrent threads replaces multi-round detection, which improves detection efficiency and saves detection time.

[0027] It should be understood that although this specification is organized by embodiment, not every embodiment contains only one independent technical solution. The specification is written this way only for clarity; those skilled in the art should treat it as a whole, and the technical solutions of the embodiments may be combined as appropriate to form other embodiments that those skilled in the art can understand.

[0028] The detailed descriptions listed above are only specific descriptions of feasible embodiments of the present invention and are not intended to limit its scope of protection; any equivalent embodiment or modification that does not depart from the spirit of the invention shall fall within its scope of protection.
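The replacement of multi-round detection by concurrent per-attribute passes, followed by a merge, can be sketched as follows. This is an illustrative outline under stated assumptions: records are hashable tuples, `scan` is a simplified stand-in for the per-thread detection, and `is_similar` and `max_size` are hypothetical parameters. Note that in CPython, threads running pure-Python code do not execute in parallel on multiple cores, so this shows the structure rather than a guaranteed speed-up.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def scan(records, attr, is_similar, max_size):
    """One pass: sort on a single attribute, compare with a bounded queue."""
    queue = deque(maxlen=max_size)   # a full deque drops its last entry
    found = set()
    for current in sorted(records, key=lambda rec: rec[attr]):
        for queued in queue:
            if is_similar(queued, current):
                found.add(frozenset((queued, current)))
                break
        queue.appendleft(current)
    return found


def detect_duplicates(records, is_similar, max_size):
    """Run one pass per attribute in its own thread and merge the results."""
    k = len(records[0])
    with ThreadPoolExecutor(max_workers=k) as pool:
        futures = [pool.submit(scan, records, attr, is_similar, max_size)
                   for attr in range(k)]
        merged = set()
        for future in futures:
            merged |= future.result()
    return merged


if __name__ == "__main__":
    same_name = lambda a, b: a[0].lower() == b[0].lower()
    data = [("Ann", "1"), ("ann", "2"), ("Bob", "5"), ("Cat", "6")]
    print(detect_duplicates(data, same_name, max_size=1))
```

In the example, sorting on the first attribute places "Ann" and "ann" three positions apart, so a queue of size 1 misses the pair; sorting on the second attribute makes them neighbours. Merging the per-attribute results recovers the pair, which is the recall benefit the multi-pass design is after.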

Claims (8)

1. A data similarity detection method based on a sliding window, characterized in that the method comprises: S1. computing an empirical vector G of the attributes by a ranking method; S2. computing a statistic vector C of the attributes by a mathematical-statistics method; S3. combining the empirical vector G and the statistic vector C to compute a final weight vector W; S4. computing the window upper bound of the queue with variable window size; S5. creating a plurality of threads according to the number of attributes; S6. in each thread, scanning the record set sequentially and computing the similarity between the current record and the records in the variable queue; S7. merging the duplicate record sets detected by the threads.
2. The method according to claim 1, characterized in that step S1 is specifically: according to user experience, each attribute is assigned a rank by the ranking method, and the rank vector G, which represents the attribute characteristics of the records, is then computed by the mean method.
3. The method according to claim 1, characterized in that step S2 is specifically: a specified number of records is drawn at random several times, and the number of distinct values taken by each attribute is counted as an objective measure of the attribute's importance to the record; the mean method is used to compute the number of distinct values of each attribute, yielding the attribute statistic vector C.
4. The method according to claim 1, characterized in that the empirical vector G is computed by the formula:
Figure CN103336771AC00021
where G_i denotes the final unified rank of the i-th attribute, m denotes the number of users, and s denotes the s-th operating user.
5. The method according to claim 4, characterized in that the statistic vector C is computed by the formula:
Figure CN103336771AC00022
where C_ij denotes the number of distinct values of the j-th attribute in the i-th sampling, C_j denotes the final number of distinct values of the j-th attribute, and m denotes the number of samplings.
6. The method according to claim 4, characterized in that the weight vector W is computed by the formula:
Figure CN103336771AC00023
where W_i denotes the weight of the i-th attribute, G_i denotes the final unified rank of the i-th attribute, and C_i denotes the final number of distinct values of the i-th attribute.
7. The method according to claim 1, characterized in that, before "in each thread, scanning the record set sequentially" in step S6, the method further comprises: sorting the data set in each thread according to the attribute value.
8. The method according to claim 1, characterized in that "computing the similarity between the current record and the records in the variable queue" in step S6 is specifically: the current record is tested for similarity against the first record in the variable queue; if the current record is similar to the first record in the variable queue, the current record is added to the set of approximately duplicate records and is then added to the first record of the variable queue: the queue is checked to see whether it is full; if it is full, the last record in the queue is removed first and the current record is then added to the queue; if the queue is not full, the record is added to the queue directly; if the current record is not similar to the first record in the variable queue, the comparison continues with the other records in the variable queue.
CN201310114244.8A 2013-04-02 2013-04-02 Data similarity detection method based on sliding window CN103336771B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310114244.8A CN103336771B (en) 2013-04-02 2013-04-02 Data similarity detection method based on sliding window

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310114244.8A CN103336771B (en) 2013-04-02 2013-04-02 Data similarity detection method based on sliding window

Publications (2)

Publication Number Publication Date
CN103336771A true CN103336771A (en) 2013-10-02
CN103336771B CN103336771B (en) 2016-12-28

Family

ID=49244938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310114244.8A CN103336771B (en) 2013-04-02 2013-04-02 Data similarity detection method based on sliding window

Country Status (1)

Country Link
CN (1) CN103336771B (en)

Also Published As

Publication number Publication date
CN103336771B (en) 2016-12-28

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C14 Grant of patent or utility model
CF01