CN104809256A - Data deduplication method and system - Google Patents

Data deduplication method and system

Info

Publication number
CN104809256A
CN104809256A (application number CN201510266694A / CN 201510266694)
Authority
CN
Grant status
Application
Patent type
Prior art keywords
data
deduplication
information
method
similarity
Prior art date
Application number
CN 201510266694
Other languages
Chinese (zh)
Inventor
王大亮
杨琪
Original Assignee
数据堂(北京)科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/30 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 17/30286 - Information retrieval in structured data stores
    • G06F 17/30289 - Database design, administration or maintenance
    • G06F 17/30303 - Improving data quality; Data cleansing
    • G06F 17/30386 - Retrieval requests
    • G06F 17/30424 - Query processing
    • G06F 17/30522 - Query processing with adaptation to user needs
    • G06F 17/30525 - Query processing with adaptation to user needs using data annotations (user-defined metadata)
    • G06F 17/30533 - Other types of queries
    • G06F 17/30536 - Approximate and statistical query processing
    • G06F 17/30943 - Details of database functions independent of the retrieved data type
    • G06F 17/30997 - Retrieval based on associated metadata

Abstract

The invention discloses a data deduplication method and a data deduplication system. The method includes the following steps: metadata information of data to be processed is compared with metadata information of data already stored on a data platform to obtain a metadata information similarity; first data description information is compared with second data description information to obtain a data description similarity; a weighted average of the metadata information similarity and the data description similarity is computed to obtain a total similarity; the stored data is sorted according to the total similarity; and the first n items of the sorted stored data are marked as suspected duplicate data. With the disclosed method and system, the range of data to be deduplicated is narrowed, which effectively reduces the workload of manual deduplication and keeps it within an acceptable range.

Description

Data deduplication method and system

TECHNICAL FIELD

[0001] The present invention relates to the field of data analysis, and in particular to a data deduplication method and system.

BACKGROUND

[0002] The present invention is mainly directed at deduplicating data held in a data platform. A data platform is a system that hosts massive amounts of data, such as a data sharing and trading platform. Data deduplication means identifying multiple copies of the same data that exist under different names, authors, sources, or formats, so that the same data is not stored in the data platform in different forms.

[0003] Because data in a data platform can be shared and traded, duplicate data in the platform troubles data users and causes losses to data providers. For example, a piece of data may be uploaded to the data platform by data provider A and then uploaded again by data provider B. Without deduplication, a data user may waste money, time, and effort downloading two copies of the same content. For the data provider, assuming provider A is the legitimate copyright owner of the data, provider A loses the revenue it would have earned from that data user, because the user obtained the same data from provider B. Data deduplication is therefore very important for a data platform.

[0004] 现有技术中的数据去重方法,主要是对待存储的数据建立摘要或指纹。 Data [0004] The prior art method of de-emphasis, the data to be stored primarily to establish a fingerprint or digest. 通常采用计算数据的哈希值(包括md5,crc32,sha256等算法)的方式建立摘要或指纹。 Establishing a fingerprint or digest usually hash value calculation data (including md5, crc32, sha256 algorithms) manner. 然后将待存储数据的哈希值与已存储数据的哈希值进行比对,如果相同,即判定待存储数据与某个已存储数据相同。 The hash value is then a hash value data to be stored with the stored data for comparison, if the same, i.e., the stored data is determined to be identical to the one stored data. 之后,再采取进一步措施删除重复数据。 After that, and then take further steps to delete duplicate data.

[0005] However, this approach is not suitable for a data platform. On the one hand, because the platform stores a great deal of data, computing hash values for so much data is too expensive, and storing the hash values also takes up considerable space. PB-scale data typically generates a TB-scale hash table, which not only occupies a large amount of storage but also reduces the efficiency of hash lookups, lowering deduplication efficiency. On the other hand, because the volume of stored data is large, the probability of hash collisions is also higher, which can cause data that is actually different to be misjudged as duplicate.
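As a rough aside (not from the original text), the birthday bound illustrates how collision risk grows with the number of stored items: for n items hashed to b bits, the probability that at least two collide is approximately

```latex
P_{\mathrm{collision}} \approx 1 - e^{-\,n(n-1)/2^{\,b+1}} \approx \frac{n^{2}}{2^{\,b+1}} \qquad (n \ll 2^{b/2})
```

so the risk rises roughly quadratically with n, which is the concern raised above for platforms holding very large numbers of data items.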

[0006] For these reasons, in the prior art, deduplication of data in a data platform can only be performed manually. Because the volume of data in the platform is so large, however, manual deduplication is very inefficient.

[0007] There is therefore an urgent need for a deduplication method that can effectively narrow the range of data to be deduplicated, so that the workload of manual deduplication is kept within an acceptable range.

SUMMARY

[0008] An object of the present invention is to provide a data deduplication method and system that can effectively narrow the range of data to be deduplicated, so that the workload of manual deduplication is kept within an acceptable range.

[0009] To achieve the above object, the present invention provides the following solutions:

[0010] A data deduplication method, comprising:

[0011] acquiring data to be processed that has been uploaded to a data platform;

[0012] determining metadata information of the data to be processed;

[0013] comparing the metadata information of the data to be processed with metadata information of data already stored on the data platform to obtain a metadata information similarity;

[0014] acquiring first data description information of the data to be processed;

[0015] acquiring second data description information of the stored data;

[0016] comparing the first data description information with the second data description information to obtain a data description similarity;

[0017] computing a weighted average of the metadata information similarity and the data description similarity to obtain a total similarity;

[0018] sorting the stored data according to the total similarity;

[0019] marking the first n items of the sorted stored data as suspected duplicate data.

[0020] Optionally, after the data to be processed is marked as suspected duplicate data, the method further comprises:

[0021] sending a data list containing information on the suspected duplicate data to a manual review client, so that the suspected duplicate data and the stored data can be reviewed manually; the data list is composed of the information of the sorted stored data.

[0022] Optionally, after the data list containing information on the suspected duplicate data is sent to the manual review client, the method further comprises:

[0023] when the suspected duplicate data differs from the data to be processed, saving the data to be processed to the data platform.

[0024] Optionally, comparing the first data description information with the second data description information to obtain the data description similarity specifically comprises:

[0025] using the SimHash algorithm to compute the Hamming distance between the first data description information and the second data description information, and determining the data description similarity between the first data description information and the second data description information according to the Hamming distance.

[0026] A data deduplication system, comprising:

[0027] a to-be-processed data acquisition unit, configured to acquire data to be processed that has been uploaded to a data platform;

[0028] a metadata information determination unit, configured to determine metadata information of the data to be processed;

[0029] a metadata information comparison unit, configured to compare the metadata information of the data to be processed with metadata information of data already stored on the data platform to obtain a metadata information similarity;

[0030] a first data description information acquisition unit, configured to acquire first data description information of the data to be processed;

[0031] a second data description information acquisition unit, configured to acquire second data description information of the stored data;

[0032] a data description information comparison unit, configured to compare the first data description information with the second data description information to obtain a data description similarity;

[0033] a total similarity calculation unit, configured to compute a weighted average of the metadata information similarity and the data description similarity to obtain a total similarity;

[0034] a sorting unit, configured to sort the stored data according to the total similarity;

[0035] a suspected duplicate data marking unit, configured to mark the first n items of the sorted stored data as suspected duplicate data.

[0036] Optionally, the system further comprises:

[0037] a suspected duplicate data sending unit, configured to, after the data to be processed is marked as suspected duplicate data, send a data list containing information on the suspected duplicate data to a manual review client, so that the suspected duplicate data and the stored data can be reviewed manually; the data list is composed of the information of the sorted stored data.

[0038] Optionally, the system further comprises:

[0039] a to-be-processed data saving unit, configured to, after the data list containing information on the suspected duplicate data is sent to the manual review client, save the data to be processed to the data platform when the suspected duplicate data differs from the data to be processed.

[0040] Optionally, the data description information comparison unit specifically comprises:

[0041] a Hamming distance calculation subunit, configured to use the SimHash algorithm to compute the Hamming distance between the first data description information and the second data description information, and to determine the data description similarity between the first data description information and the second data description information according to the Hamming distance.

[0042] According to the specific embodiments provided by the present invention, the invention discloses the following technical effects:

[0043] In the data deduplication method and system of the embodiments of the present invention, the metadata information of the data to be processed is compared with the metadata information of the data stored on the data platform to obtain a metadata information similarity; the first data description information is compared with the second data description information to obtain a data description similarity; a weighted average of the metadata information similarity and the data description similarity yields a total similarity; the stored data is sorted according to the total similarity; and the first n items of the sorted stored data are marked as suspected duplicate data. This narrows the range of data to be deduplicated, effectively reducing the workload of manual deduplication so that it is kept within an acceptable range.

BRIEF DESCRIPTION OF THE DRAWINGS

[0044] To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings required by the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

[0045] FIG. 1 is a flowchart of an embodiment of the data deduplication method of the present invention;

[0046] FIG. 2 is a structural diagram of an embodiment of the data deduplication system of the present invention.

DETAILED DESCRIPTION

[0047] The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

[0048] To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.

[0049] FIG. 1 is a flowchart of an embodiment of the data deduplication method of the present invention. As shown in FIG. 1, the method may comprise:

[0050] Step 101: acquire data to be processed that has been uploaded to a data platform.

[0051] The data to be processed may be of various types, for example text data, image data, and so on.

[0052] Step 102: determine metadata information of the data to be processed.

[0053] The metadata information may be keywords that summarize the data to be processed.

[0054] For example, the metadata information may include the data ID, title, category, format, keywords, source, and similar information.

[0055] Step 103: compare the metadata information of the data to be processed with the metadata information of data already stored on the data platform to obtain a metadata information similarity.

[0056] The data stored on the data platform also has corresponding metadata information, so the metadata information of the data to be processed can be compared with the metadata information of the stored data.

[0057] The metadata information similarity can be determined from the number of metadata fields that are identical between the data to be processed and the stored data. For example, the number of identical metadata fields can be divided by the total number of metadata fields of the data to be processed, giving the proportion of identical fields among all metadata fields; this proportion is taken as the metadata information similarity. Suppose the metadata information consists of six fields in total: data ID, title, category, format, keywords, and source. If four of the metadata fields of the data to be processed are identical to those of the stored data, the similarity is determined to be 66.7%.
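As a minimal sketch of the field-matching ratio just described (for illustration only; the field names follow the example in paragraph [0054], and the dictionary-based record layout is an assumption):

```python
METADATA_FIELDS = ["data_id", "title", "category", "format", "keywords", "source"]

def metadata_similarity(pending: dict, stored: dict) -> float:
    """Proportion of metadata fields whose values match exactly."""
    matches = sum(1 for field in METADATA_FIELDS if pending.get(field) == stored.get(field))
    return matches / len(METADATA_FIELDS)

# Example from the text: 4 of 6 fields identical -> 4 / 6 = 0.667 (66.7%).
```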

[0058] Step 104: acquire first data description information of the data to be processed.

[0059] The data description information is information that describes the content of the data. It is usually written by a human editor.

[0060] Suppose the data to be stored is face-image data of 40 randomly sampled Asian persons. The corresponding first data description information could then be "face image information of 40 randomly sampled Asian persons".

[0061] Step 105: acquire second data description information of the stored data.

[0062] Step 106: compare the first data description information with the second data description information to obtain a data description similarity.

[0063] Specifically, the SimHash algorithm can be used to compute the Hamming distance between the first data description information and the second data description information, and the data description similarity between the first data description information and the second data description information is determined according to the Hamming distance.
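The patent names SimHash and the Hamming distance but gives no implementation, so the following is only a sketch of one common realization: the whitespace tokenization, the 64-bit fingerprint width, the per-token MD5 hash, and the mapping from Hamming distance to a similarity score are all assumptions made for illustration:

```python
import hashlib
from collections import Counter

def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint over whitespace-separated tokens."""
    weights = [0] * bits
    for token, count in Counter(text.split()).items():
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            weights[i] += count if (h >> i) & 1 else -count
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which the two fingerprints differ."""
    return bin(a ^ b).count("1")

def description_similarity(desc_a: str, desc_b: str, bits: int = 64) -> float:
    """One possible distance-to-similarity mapping: 1 - (Hamming distance / fingerprint width)."""
    return 1.0 - hamming_distance(simhash(desc_a, bits), simhash(desc_b, bits)) / bits
```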

[0064] Step 107: compute a weighted average of the metadata information similarity and the data description similarity to obtain a total similarity.

[0065] Specifically, different weights can be assigned to the metadata information similarity and the data description similarity. For example, the data description similarity can be given a first weight and the metadata information similarity a second weight, with the first weight greater than the second. This increases the share of the data description similarity in the total similarity, making the similarity calculation more accurate.
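Written out as a formula (the symbols are chosen here for illustration and do not appear in the patent), the total similarity is a weighted average in which the description weight dominates:

```latex
S_{total} = w_1 \, S_{desc} + w_2 \, S_{meta}, \qquad w_1 + w_2 = 1, \quad w_1 > w_2 > 0
```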

[0066] Step 108: sort the stored data according to the total similarity.

[0067] Specifically, the stored data can be sorted in descending order of total similarity, so that the stored data with the highest total similarity comes first.

[0068] Step 109: mark the first n items of the sorted stored data as suspected duplicate data.

[0069] Here n is a natural number whose value can be set according to actual needs; for example, n may be 8, 9, 10, and so on.
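Putting steps 107 through 109 together as a hedged sketch (the record layout, the 0.6/0.4 weights, and the helper names are assumptions for illustration; metadata_similarity and description_similarity refer to the sketches given earlier):

```python
def rank_suspected_duplicates(pending: dict, stored_records: list, n: int = 10,
                              w_desc: float = 0.6, w_meta: float = 0.4) -> list:
    """Score every stored record against the pending data and return the top n as suspects."""
    scored = []
    for record in stored_records:
        s_meta = metadata_similarity(pending["metadata"], record["metadata"])
        s_desc = description_similarity(pending["description"], record["description"])
        scored.append((w_desc * s_desc + w_meta * s_meta, record))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest total similarity first
    return [record for _, record in scored[:n]]          # first n are marked as suspected duplicates
```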

[0070] After the data to be processed has been marked as suspected duplicate data, it can be handed over for manual review. Suspected duplicate data need not be saved to the data platform.

[0071] In summary, in this embodiment, the metadata information of the data to be processed is compared with the metadata information of the data stored on the data platform to obtain a metadata information similarity; the first data description information is compared with the second data description information to obtain a data description similarity; a weighted average of the metadata information similarity and the data description similarity yields a total similarity; the stored data is sorted according to the total similarity; and the first n items of the sorted stored data are marked as suspected duplicate data. This narrows the range of data to be deduplicated, effectively reducing the workload of manual deduplication so that it is kept within an acceptable range.

[0072] In practice, after the data to be processed is marked as suspected duplicate data, the method may further comprise the following step:

[0073] sending a data list containing information on the suspected duplicate data to a manual review client, so that the suspected duplicate data and the stored data can be reviewed manually; the data list is composed of the information of the sorted stored data.

[0074] Because of the diversity and complexity of data content itself, the comparison of data cannot be handled entirely by a computer. In particular, for a relatively large dataset, whether data newly generated by deleting or modifying parts of the original should still be considered the same dataset as the original can only be decided by manual review. The above step therefore makes the comparison of data more accurate.

[0075] In practice, after the data list containing information on the suspected duplicate data is sent to the manual review client, the method may further comprise the following step:

[0076] when the suspected duplicate data differs from the data to be processed, saving the data to be processed to the data platform.

[0077] When manual review confirms that the data to be stored differs from the stored data, the probability that the two are actually identical is essentially zero. It can therefore be concluded that the data to be stored is not the same as the stored data, and the data to be processed is saved to the data platform.

[0078] The present invention also discloses a data deduplication system. FIG. 2 is a structural diagram of an embodiment of the data deduplication system of the present invention. As shown in FIG. 2, the system may comprise:

[0079] a to-be-processed data acquisition unit 201, configured to acquire data to be processed that has been uploaded to a data platform;

[0080] a metadata information determination unit 202, configured to determine metadata information of the data to be processed;

[0081] a metadata information comparison unit 203, configured to compare the metadata information of the data to be processed with metadata information of data already stored on the data platform to obtain a metadata information similarity;

[0082] a first data description information acquisition unit 204, configured to acquire first data description information of the data to be processed;

[0083] a second data description information acquisition unit 205, configured to acquire second data description information of the stored data;

[0084] a data description information comparison unit 206, configured to compare the first data description information with the second data description information to obtain a data description similarity;

[0085] a total similarity calculation unit 207, configured to compute a weighted average of the metadata information similarity and the data description similarity to obtain a total similarity;

[0086] a sorting unit 208, configured to sort the stored data according to the total similarity;

[0087] a suspected duplicate data marking unit 209, configured to mark the first n items of the sorted stored data as suspected duplicate data.

[0088] In this embodiment, the metadata information of the data to be processed is compared with the metadata information of the data stored on the data platform to obtain a metadata information similarity; the first data description information is compared with the second data description information to obtain a data description similarity; a weighted average of the metadata information similarity and the data description similarity yields a total similarity; the stored data is sorted according to the total similarity; and the first n items of the sorted stored data are marked as suspected duplicate data. This narrows the range of data to be deduplicated, effectively reducing the workload of manual deduplication so that it is kept within an acceptable range.

[0089] In practice, the system may further comprise:

[0090] a suspected duplicate data sending unit, configured to, after the data to be processed is marked as suspected duplicate data, send a data list containing information on the suspected duplicate data to a manual review client, so that the suspected duplicate data and the stored data can be reviewed manually; the data list is composed of the information of the sorted stored data.

[0091] In practice, the system may further comprise:

[0092] a to-be-processed data saving unit, configured to, after the data list containing information on the suspected duplicate data is sent to the manual review client, save the data to be processed to the data platform when the suspected duplicate data differs from the data to be processed.

[0093] In practice, the data description information comparison unit 206 may specifically comprise:

[0094] a Hamming distance calculation subunit, configured to use the SimHash algorithm to compute the Hamming distance between the first data description information and the second data description information, and to determine the data description similarity between the first data description information and the second data description information according to the Hamming distance.

[0095] The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the method.

[0096] Specific examples have been used herein to explain the principles and implementation of the present invention; the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementation and scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (8)

  1. A data deduplication method, characterized by comprising: acquiring data to be processed that has been uploaded to a data platform; determining metadata information of the data to be processed; comparing the metadata information of the data to be processed with metadata information of data already stored on the data platform to obtain a metadata information similarity; acquiring first data description information of the data to be processed; acquiring second data description information of the stored data; comparing the first data description information with the second data description information to obtain a data description similarity; computing a weighted average of the metadata information similarity and the data description similarity to obtain a total similarity; sorting the stored data according to the total similarity; and marking the first n items of the sorted stored data as suspected duplicate data.
  2. The method according to claim 1, characterized in that, after the data to be processed is marked as suspected duplicate data, the method further comprises: sending a data list containing information on the suspected duplicate data to a manual review client, so that the suspected duplicate data and the stored data can be reviewed manually; the data list is composed of the information of the sorted stored data.
  3. The method according to claim 1, characterized in that, after the data list containing information on the suspected duplicate data is sent to the manual review client, the method further comprises: when the suspected duplicate data differs from the data to be processed, saving the data to be processed to the data platform.
  4. The method according to claim 1, characterized in that comparing the first data description information with the second data description information to obtain the data description similarity specifically comprises: using the SimHash algorithm to compute the Hamming distance between the first data description information and the second data description information, and determining the data description similarity between the first data description information and the second data description information according to the Hamming distance.
  5. A data deduplication system, characterized by comprising: a to-be-processed data acquisition unit, configured to acquire data to be processed that has been uploaded to a data platform; a metadata information determination unit, configured to determine metadata information of the data to be processed; a metadata information comparison unit, configured to compare the metadata information of the data to be processed with metadata information of data already stored on the data platform to obtain a metadata information similarity; a first data description information acquisition unit, configured to acquire first data description information of the data to be processed; a second data description information acquisition unit, configured to acquire second data description information of the stored data; a data description information comparison unit, configured to compare the first data description information with the second data description information to obtain a data description similarity; a total similarity calculation unit, configured to compute a weighted average of the metadata information similarity and the data description similarity to obtain a total similarity; a sorting unit, configured to sort the stored data according to the total similarity; and a suspected duplicate data marking unit, configured to mark the first n items of the sorted stored data as suspected duplicate data.
  6. The system according to claim 5, characterized in that it further comprises: a suspected duplicate data sending unit, configured to, after the data to be processed is marked as suspected duplicate data, send a data list containing information on the suspected duplicate data to a manual review client, so that the suspected duplicate data and the stored data can be reviewed manually; the data list is composed of the information of the sorted stored data.
  7. The system according to claim 5, characterized in that it further comprises: a to-be-processed data saving unit, configured to, after the data list containing information on the suspected duplicate data is sent to the manual review client, save the data to be processed to the data platform when the suspected duplicate data differs from the data to be processed.
  8. The system according to claim 5, characterized in that the data description information comparison unit specifically comprises: a Hamming distance calculation subunit, configured to use the SimHash algorithm to compute the Hamming distance between the first data description information and the second data description information, and to determine the data description similarity between the first data description information and the second data description information according to the Hamming distance.
CN 201510266694 2015-05-22 2015-05-22 Data deduplication method and system CN104809256A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201510266694 CN104809256A (en) 2015-05-22 2015-05-22 Data deduplication method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201510266694 CN104809256A (en) 2015-05-22 2015-05-22 Data deduplication method and system

Publications (1)

Publication Number Publication Date
CN104809256A (en) 2015-07-29

Family

ID=53694078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201510266694 CN104809256A (en) 2015-05-22 2015-05-22 Data deduplication method and system

Country Status (1)

Country Link
CN (1) CN104809256A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102622365A (en) * 2011-01-28 2012-08-01 北京百度网讯科技有限公司 Judging system and judging method for web page repeating
US8266115B1 (en) * 2011-01-14 2012-09-11 Google Inc. Identifying duplicate electronic content based on metadata
CN104160712A (en) * 2011-10-30 2014-11-19 谷歌公司 Computing similarity between media programs
CN104216925A (en) * 2013-06-05 2014-12-17 中国科学院声学研究所 Repetition deleting processing method for video content


Similar Documents

Publication Publication Date Title
US20140095439A1 (en) Optimizing data block size for deduplication
US20050216433A1 (en) Identification of input files using reference files associated with nodes of a sparse binary tree
US7707157B1 (en) Document near-duplicate detection
US20060004808A1 (en) System and method for performing compression/encryption on data such that the number of duplicate blocks in the transformed data is increased
US7587401B2 (en) Methods and apparatus to compress datasets using proxies
US7478113B1 (en) Boundaries
US20110099200A1 (en) Data sharing and recovery within a network of untrusted storage devices using data object fingerprinting
US6915344B1 (en) Server stress-testing response verification
US20040210575A1 (en) Systems and methods for eliminating duplicate documents
US20120185434A1 (en) Data synchronization
Garfinkel Digital media triage with bulk data analysis and bulk_extractor
US6904430B1 (en) Method and system for efficiently identifying differences between large files
US20110208744A1 (en) Methods for detecting and removing duplicates in video search results
US20090327505A1 (en) Content Identification for Peer-to-Peer Content Retrieval
CN102156727A (en) Method for deleting repeated data by using double-fingerprint hash check
US20130162902A1 (en) System and method for verification of media content synchronization
US20070097420A1 (en) Method and mechanism for retrieving images
US20080256093A1 (en) Method and System for Detection of Authors
Roussev Hashing and data fingerprinting in digital forensics
Roussev et al. Multi-resolution similarity hashing
US8078642B1 (en) Concurrent traversal of multiple binary trees
US20070098259A1 (en) Method and mechanism for analyzing the texture of a digital image
CN101595459A (en) Methods and systems for quick and efficient data management and/or processing
US20110283085A1 (en) System and method for end-to-end data integrity in a network file system
CN102571709A (en) Method for uploading file, client, server and system

Legal Events

Date Code Title Description
C06 Publication
EXSB Decision made by SIPO to initiate substantive examination