CN106649646A - Method and device for deleting duplicated data - Google Patents

Publication number
CN106649646A
Authority
CN
China
Prior art keywords
data
values
key
pending data
value pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611129751.9A
Other languages
Chinese (zh)
Inventor
孙健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ruian Technology Co Ltd
Original Assignee
Beijing Ruian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ruian Technology Co Ltd
Priority to CN201611129751.9A
Publication of CN106649646A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G06F16/24554 Unary operations; Data partitioning operations
    • G06F16/24556 Aggregation; Duplicate elimination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention discloses a method and device for deleting duplicated data. The method comprises: acquiring an MD5 value of to-be-processed data and a corresponding data identifier; forming a key-value pair of the to-be-processed data from the MD5 value and the data identifier; comparing the MD5 value in the key-value pair of the to-be-processed data with the MD5 value in the key-value pair of existing data; and, if the two MD5 values are the same, deleting the to-be-processed data and determining the data identifier of the existing data that the to-be-processed data duplicates. By forming the key-value pair from the MD5 value and the data identifier, comparing the MD5 value of the to-be-processed data with that of the existing data, and deleting to-be-processed data whose MD5 value matches that of existing data, the embodiment addresses the problem of duplicated data in mass data, removes duplicates before storage, and reduces hard disk usage and cost.

Description

Method and device for data deduplication
Technical field
Embodiments of the present invention relate to data processing technology, and in particular to a method and device for data deduplication.
Background
In the current big data era, as informatization develops, "letting the data speak" has become the philosophy of many business operators. The volume of data that enterprises must process has surged. While big data brings convenience, it also places a burden on technical staff: mass data contains a large amount of duplicate data, which increases system load and degrades data loading and query performance. How to delete large amounts of duplicate junk data and reduce hard disk usage has become an urgent problem of the big data era.
Summary of the invention
The present invention provides a method and device for data deduplication, so as to deduplicate large-scale data and reduce hard disk usage.
In a first aspect, an embodiment of the present invention provides a data deduplication method, the method including:
obtaining the MD5 value of to-be-processed data and a corresponding data identifier;
forming a key-value pair of the to-be-processed data from the MD5 value and the data identifier;
comparing the MD5 value in the key-value pair of the to-be-processed data with the MD5 value in the key-value pair of existing data;
if the MD5 value in the key-value pair of the to-be-processed data is identical to the MD5 value in the key-value pair of existing data, deleting the to-be-processed data and determining the data identifier of the existing data that the to-be-processed data duplicates.
Further, after the to-be-processed data is deleted, the method also includes:
storing the key-value pair of the to-be-processed data and the key-value pair of the existing data that it duplicates in a duplicate statistics store.
Further, the method also includes:
if the MD5 value in the key-value pair of the to-be-processed data differs from the MD5 values in the key-value pairs of existing data, storing the to-be-processed data and the corresponding data identifier in a database;
saving the key-value pair of the to-be-processed data among the key-value pairs of existing data.
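The decision described above for a single data item can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `md5_hex` and `deduplicate`, and the use of a plain dictionary as the set of existing key-value pairs, are assumptions made for the example.

```python
import hashlib


def md5_hex(data: str) -> str:
    """Return the MD5 value of a text item as a hexadecimal string."""
    return hashlib.md5(data.encode("utf-8")).hexdigest()


def deduplicate(item: str, data_id: str, existing: dict):
    """Decide whether one to-be-processed item is a duplicate.

    `existing` is a hypothetical stand-in for the key-value pairs of
    existing data: it maps an MD5 value to a data identifier.
    Returns (kept, existing_id): `kept` is False when the item is a
    duplicate, and `existing_id` is then the identifier of the existing
    data that it duplicates.
    """
    key = md5_hex(item)
    if key in existing:
        # Same MD5 value: the item is dropped and the matching ID reported.
        return False, existing[key]
    # Different MD5 value: save the new key-value pair for later comparisons.
    existing[key] = data_id
    return True, None
```

When `kept` is true, the caller would go on to store the item and its identifier in the database, as the second "Further" clause describes.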
Further, obtaining the MD5 value of the to-be-processed data and the corresponding data identifier includes:
reading the to-be-processed data by row;
calculating the MD5 value of the to-be-processed data;
generating the data identifier of the to-be-processed data according to the read time and/or the thread number used when reading the to-be-processed data.
Further, calculating the MD5 value of the to-be-processed data includes:
if the to-be-processed data contains preset ignored data, removing the preset ignored data from the to-be-processed data;
calculating the MD5 value of the to-be-processed data after the preset ignored data has been removed, and using it as the MD5 value of the to-be-processed data.
In a second aspect, an embodiment of the present invention also provides a data deduplication device, the device including:
a data identifier acquisition module, for obtaining the MD5 value of to-be-processed data and a corresponding data identifier;
a key-value pair composition module, for forming a key-value pair of the to-be-processed data from the MD5 value and the data identifier;
an MD5 value comparison module, for comparing the MD5 value in the key-value pair of the to-be-processed data with the MD5 value in the key-value pair of existing data;
a duplicate data determination module, for deleting the to-be-processed data and determining the data identifier of the existing data that it duplicates, if the MD5 value in the key-value pair of the to-be-processed data is identical to the MD5 value in the key-value pair of existing data.
Further, the device also includes a key-value pair saving module, for storing, after the to-be-processed data is deleted, the key-value pair of the to-be-processed data and the key-value pair of the existing data that it duplicates in a duplicate statistics store.
Further, the device also includes a data storage module, specifically for:
if the MD5 value in the key-value pair of the to-be-processed data differs from the MD5 values in the key-value pairs of existing data, storing the to-be-processed data and the corresponding data identifier in a database;
saving the key-value pair of the to-be-processed data among the key-value pairs of existing data.
Further, the data identifier acquisition module includes:
a data reading unit, for reading the to-be-processed data by row;
an MD5 value calculation unit, for calculating the MD5 value of the to-be-processed data;
a data identifier generation unit, for generating the data identifier of the to-be-processed data according to the read time and/or the thread number used when reading the to-be-processed data.
Further, the MD5 value calculation unit is specifically for:
if the to-be-processed data contains preset ignored data, removing the preset ignored data from the to-be-processed data;
calculating the MD5 value of the to-be-processed data after the preset ignored data has been removed, and using it as the MD5 value of the to-be-processed data.
In the technical solution of this embodiment, the MD5 value and data identifier of the to-be-processed data are formed into a key-value pair, the MD5 value in that key-value pair is compared with the MD5 values in the key-value pairs of existing data, and to-be-processed data whose MD5 value matches that of existing data is deleted. This addresses the problem of duplicate data in mass data, deduplicates the data before it is stored, and reduces hard disk usage and cost.
Description of the drawings
Fig. 1 is a flow chart of a data deduplication method provided by Embodiment one of the present invention;
Fig. 2 is a flow chart of a data deduplication method provided by Embodiment two of the present invention;
Fig. 3 is an overall architecture diagram of the data processing system in a data deduplication method provided by an embodiment of the present invention;
Fig. 4 is a flow chart of a data deduplication method provided by Embodiment three of the present invention;
Fig. 5 is a flow chart of a data deduplication method provided by Embodiment four of the present invention;
Fig. 6 is a schematic diagram of a data deduplication device provided by Embodiment five of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.
Embodiment one
Fig. 1 is a flow chart of a data deduplication method provided by Embodiment one of the present invention. This embodiment is applicable to effective deduplication of mass data. The method may be performed by a data deduplication device and specifically includes the following steps:
S110, obtain the MD5 value of the to-be-processed data and a corresponding data identifier.
MD5 (Message-Digest Algorithm 5) is used to ensure that transmitted information is complete and consistent. It is one of the hash algorithms widely used in computing, and is described here as having compressibility, ease of computation, resistance to modification and strong collision resistance. The to-be-processed data may be of text type; it may be read by row, by column or in another manner, and the corresponding MD5 value is then calculated. The data identifier serves as the mark of each data item and is used to distinguish one data item from another.
S120, form a key-value pair of the to-be-processed data from the MD5 value and the data identifier.
The MD5 values may be stored on a redis cluster. The benefit of a redis cluster is that data in a redis database is generally stored as key-value pairs, which allows efficient comparison operations. The MD5 value and the data identifier may be formed into a key-value pair and stored in the redis cluster database. Duplicate data is usually processed on the Hadoop Distributed File System (HDFS), the distributed file system of the Hadoop distributed system infrastructure, which can store mass data effectively while preventing single points of failure and avoiding unnecessary loss. However, when deduplication is performed on HDFS, the data must first be written to hard disk, so the data has already been put into storage; this wastes hard disk resources, increases hardware cost and consumes a great deal of time. Performing deduplication in a redis database allows large amounts of duplicate junk data to be deleted before the data is stored, reducing hard disk usage and cost.
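One way to combine the comparison and the saving of a new key-value pair in a key-value store is a "set only if the key does not exist" operation. The sketch below is illustrative only: `InMemoryKV` is a hypothetical in-memory stand-in written for this example so that it runs without a server, and `claim_md5` is an invented helper name. The stand-in mimics the `set(name, value, nx=True)` and `get(name)` calls of the redis-py client; note that a real redis-py client returns bytes from `get` unless it is created with `decode_responses=True`.

```python
import hashlib
from typing import Optional


class InMemoryKV:
    """Hypothetical stand-in for a redis client (not a real redis API).

    Only the two calls used below are mimicked: set(..., nx=True) and get().
    """

    def __init__(self) -> None:
        self._store = {}

    def set(self, name: str, value: str, nx: bool = False):
        # With nx=True the value is written only if the key is absent.
        if nx and name in self._store:
            return None
        self._store[name] = value
        return True

    def get(self, name: str) -> Optional[str]:
        return self._store.get(name)


def claim_md5(kv, line: str, data_id: str) -> Optional[str]:
    """Record (MD5 value -> data ID) if the MD5 value is new.

    Returns None when the line is new, otherwise the data identifier of
    the existing data that the line duplicates.
    """
    key = hashlib.md5(line.encode("utf-8")).hexdigest()
    if kv.set(key, data_id, nx=True):
        return None
    return kv.get(key)
```

Using the MD5 value as the key means a duplicate is detected by a single key lookup rather than by scanning stored values, which is the efficiency the paragraph above attributes to key-value storage.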
S130, compare the MD5 value in the key-value pair of the to-be-processed data with the MD5 value in the key-value pair of existing data.
Here, existing data may be data that has already been stored.
The MD5 value in the key-value pair of the to-be-processed data is compared with the MD5 values in the key-value pairs of existing data to determine whether they are identical. Junk duplicate data is deleted through repeated comparison operations.
S140, if the MD5 value in the key-value pair of the to-be-processed data is identical to the MD5 value in the key-value pair of existing data, delete the to-be-processed data and determine the data identifier of the existing data that the to-be-processed data duplicates.
If the MD5 value in the key-value pair of the to-be-processed data is identical to the MD5 value in the key-value pair of existing data, the two data items can be regarded as duplicates of each other. The to-be-processed data is therefore deleted, and the data identifier of the existing data it duplicates is determined, so as to identify that existing data.
Steps S110, S120, S130 and S140 may be performed by a single hardware device or separately by different hardware devices; the specific executing device is not limited here.
In the technical solution of this embodiment, the MD5 value and data identifier of the to-be-processed data are formed into a key-value pair, the MD5 value in that key-value pair is compared with the MD5 values in the key-value pairs of existing data, and to-be-processed data whose MD5 value matches that of existing data is deleted. This addresses the problem of duplicate data in mass data, deduplicates the data before it is stored, and reduces hard disk usage and cost.
On the basis of the above technical solution, after the to-be-processed data is deleted, the method preferably also includes:
storing the key-value pair of the to-be-processed data and the key-value pair of the existing data that it duplicates in a duplicate statistics store.
Storing these key-value pairs in the duplicate statistics store makes it possible to use the saved information to calculate how many times data with the same MD5 value has been repeated. This duplication count can serve as a reference factor when business requirements are considered.
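A duplicate statistics store of this kind might look like the following sketch. The class name `DuplicateStats` and its in-memory layout are assumptions for illustration; the source does not specify how the store is organized.

```python
from collections import defaultdict


class DuplicateStats:
    """Hypothetical in-memory duplicate statistics store."""

    def __init__(self) -> None:
        # MD5 value -> list of (deleted data ID, existing data ID) pairs
        self._pairs = defaultdict(list)

    def record(self, md5_value: str, duplicate_id: str, existing_id: str) -> None:
        """Save the identifiers of a deleted item and the data it duplicated."""
        self._pairs[md5_value].append((duplicate_id, existing_id))

    def duplicate_count(self, md5_value: str) -> int:
        """Number of deleted duplicates recorded for one MD5 value."""
        # .get() avoids creating an empty entry for unknown MD5 values.
        return len(self._pairs.get(md5_value, []))
```

Keeping both identifiers per record means the count for an MD5 value can be derived on demand, and each deleted item can still be traced back to the stored item it matched.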
On the basis of the above embodiment, the method preferably also includes:
if the MD5 value in the key-value pair of the to-be-processed data differs from the MD5 values in the key-value pairs of existing data, storing the to-be-processed data and the corresponding data identifier in a database;
saving the key-value pair of the to-be-processed data among the key-value pairs of existing data.
If the MD5 value in the key-value pair of the to-be-processed data differs from the MD5 values in the key-value pairs of existing data, it can be confirmed that the to-be-processed data does not duplicate any existing data, so the to-be-processed data and the corresponding data identifier are stored in the database. The key-value pair of the to-be-processed data is saved at the same time, as a reference for later comparisons. Storing non-duplicate data ensures the integrity of the data.
Embodiment two
Fig. 2 is a flow chart of a data deduplication method provided by Embodiment two of the present invention. This embodiment further optimizes the above embodiment by refining "obtaining the MD5 value of the to-be-processed data and the corresponding data identifier" into "reading the to-be-processed data by row; calculating the MD5 value of the to-be-processed data; and generating the data identifier of the to-be-processed data according to the read time and/or the thread number used when reading the to-be-processed data." The method specifically includes the following steps:
S210, read the to-be-processed data by row.
Before the data enters the preprocessing stage, a transfer tool may convey it to the preprocessing server, where it waits to be processed. On the preprocessing server, a program reads the data row by row. The transfer programs commonly used work over the network, typically using Transmission Control Protocol (TCP) communication or File Transfer Protocol (FTP) transfer. The preprocessing program provides real-time data access and improves data processing efficiency. Fig. 3 is an overall architecture diagram of the data processing system in a data deduplication method provided by an embodiment of the present invention. As shown in Fig. 3, the data transfer server cluster 310 consists of server 1 to server N and uses a transfer program to convey data to the preprocessing server cluster 320. The preprocessing server cluster consists of server 1' to server N'. The preprocessing program may be installed on one or more servers, and the servers on which it is installed form the preprocessing server cluster 320, which calculates the MD5 values of the data and compares them with the MD5 values of existing data on the redis server cluster 330. The redis server cluster 330 in this embodiment consists of server 1'' to server N'' connected by high-speed communication links. Data that has no duplicate is stored in the database 340, which may be an HBase or Oracle database.
S220, calculate the MD5 value of the to-be-processed data.
The MD5 value is calculated in turn for each row of to-be-processed data that is read.
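Steps S210 and S220 can be illustrated with a short generator that reads a text stream row by row and computes one MD5 value per row. The function name `iter_line_md5`, the UTF-8 encoding and the stripping of line endings before hashing are assumptions made for this sketch.

```python
import hashlib
import io


def iter_line_md5(stream):
    """Yield (row, MD5 value) for each row of a text stream."""
    for raw in stream:
        # Strip the line ending so that "a\n" and "a\r\n" hash identically.
        row = raw.rstrip("\r\n")
        yield row, hashlib.md5(row.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    sample = io.StringIO("first row\nsecond row\nfirst row\n")
    for row, digest in iter_line_md5(sample):
        print(digest, row)
```

Because the stream is consumed one row at a time, the whole input never has to be held in memory, which suits the mass-data setting the embodiment targets.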
S230, generate the data identifier of the to-be-processed data according to the read time and/or the thread number used when reading the to-be-processed data.
A unique data identifier, i.e. a data ID, is generated for the to-be-processed data. The ID may consist of at least one of the read time of the to-be-processed data and the thread number used when reading it. When the to-be-processed data is read by a preprocessing cluster, the ID may also include the device number of the preprocessing server within the cluster. Because there may be multiple preprocessing servers, each server is assigned a unique number, i.e. the preprocessing server device number, so that the servers can be distinguished.
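A data ID built from these components might be generated as in the sketch below. The function name `make_data_id`, the ID layout and the default server number are hypothetical. The trailing per-process sequence number is not in the source; it is added here as an assumption, so that two rows read by the same thread within the same clock tick still receive different IDs.

```python
import itertools
import threading
import time

# Assumption (not in the source): a per-process sequence number that
# disambiguates IDs generated within the same clock tick.
_sequence = itertools.count()


def make_data_id(server_no: str = "srv01") -> str:
    """Build a data ID from server number, read time and thread number."""
    read_time_ns = time.time_ns()       # read time, in nanoseconds
    thread_no = threading.get_ident()   # identifier of the reading thread
    return f"{server_no}-{read_time_ns}-{thread_no}-{next(_sequence)}"
```

Putting the server number first keeps IDs from different preprocessing servers distinct even if their clocks and thread numbers happen to coincide.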
S240, form a key-value pair of the to-be-processed data from the MD5 value and the data identifier;
S250, compare the MD5 value in the key-value pair of the to-be-processed data with the MD5 value in the key-value pair of existing data;
S260, if the MD5 value in the key-value pair of the to-be-processed data is identical to the MD5 value in the key-value pair of existing data, delete the to-be-processed data and determine the data identifier of the existing data that the to-be-processed data duplicates.
In the technical solution of this embodiment, the to-be-processed data is read by row, the MD5 value of each row is calculated, a data identifier is generated, the MD5 value of the row is compared with the MD5 values of existing data, and the to-be-processed data is deleted when they are identical. This addresses the problem of duplicate data in mass data, deduplicates the data before it is stored, and reduces hard disk usage and cost. Reading the to-be-processed data as rows, checking each row for duplication and deleting the rows that are duplicated makes the deduplication operation more fine-grained, and reading by row also makes data processing faster.
Embodiment three
Fig. 4 is a flow chart of a data deduplication method provided by Embodiment three of the present invention. This embodiment further optimizes the above embodiment by refining "calculating the MD5 value of the to-be-processed data" into "if the to-be-processed data contains preset ignored data, removing the preset ignored data from the to-be-processed data; and calculating the MD5 value of the to-be-processed data after the preset ignored data has been removed, and using it as the MD5 value of the to-be-processed data." The method specifically includes the following steps:
S410, read the to-be-processed data by row.
S420, if the to-be-processed data contains preset ignored data, remove the preset ignored data from the to-be-processed data.
Before the to-be-processed data is read, certain data content may be set as preset ignored data according to actual needs, for example port numbers or unneeded time information. Such content is randomly generated by the operating system and has no value as data. To-be-processed data items that differ only in their preset ignored data can therefore be treated as the same item: one of them is stored and the others are deleted, which achieves deduplication. Removing the preset ignored data before the MD5 value is calculated improves the deduplication effect, saves part of the workload and shortens deduplication time.
S430, calculate the MD5 value of the to-be-processed data after the preset ignored data has been removed, and use it as the MD5 value of the to-be-processed data.
That is, the MD5 value of the to-be-processed data is calculated only after the preset ignored data has been removed.
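Steps S420 and S430 can be sketched with regular expressions that strip the ignored fields before hashing. The patterns below (a `port=NNNN` field and a `YYYY-MM-DD hh:mm:ss` timestamp), the sample log layout and the function name `md5_ignoring` are hypothetical; real preset ignored data would be configured according to the actual data format.

```python
import hashlib
import re

# Hypothetical preset ignored data: a port field and a timestamp.
IGNORE_PATTERNS = [
    re.compile(r"port=\d+"),
    re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"),
]


def md5_ignoring(row: str, patterns=IGNORE_PATTERNS) -> str:
    """Remove preset ignored data from a row, then compute its MD5 value."""
    for pattern in patterns:
        row = pattern.sub("", row)
    return hashlib.md5(row.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = "2016-12-09 10:00:01 port=5012 user=alice action=login"
    b = "2016-12-09 10:00:07 port=6120 user=alice action=login"
    # The rows differ only in ignored fields, so their MD5 values match.
    print(md5_ignoring(a) == md5_ignoring(b))
```

Rows that differ only in the ignored fields then map to the same MD5 value and are handled as duplicates by the comparison step.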
S440, generate the data identifier of the to-be-processed data according to the read time and/or the thread number used when reading the to-be-processed data.
S450, form a key-value pair of the to-be-processed data from the MD5 value and the data identifier;
S460, compare the MD5 value in the key-value pair of the to-be-processed data with the MD5 value in the key-value pair of existing data;
S470, if the MD5 value in the key-value pair of the to-be-processed data is identical to the MD5 value in the key-value pair of existing data, delete the to-be-processed data and determine the data identifier of the existing data that the to-be-processed data duplicates.
In the technical solution of this embodiment, a step that removes the preset ignored data is added before the MD5 value of the data is calculated. This removes the invalid data randomly generated by the system, saving workload and deduplication time.
Embodiment four
Fig. 5 is a flow chart of a data deduplication method provided by Embodiment four of the present invention. This embodiment is a preferred embodiment based on the above embodiments, and the method specifically includes the following steps:
S510, the data transfer server cluster conveys the to-be-processed data to the preprocessing servers.
S520, the preprocessing server cluster calculates the MD5 value of the to-be-processed data, obtains its data identifier, forms a key-value pair, and sends the key-value pair to the redis cluster.
S530, the redis cluster compares the MD5 value in the key-value pair with the MD5 values in the key-value pairs already stored in the redis database and determines whether they are identical. If they are identical, S540 is performed; otherwise S550 is performed.
S540, delete the duplicate data.
If the MD5 value of the to-be-processed data is identical to an MD5 value stored in the redis database, the to-be-processed data duplicates existing data; it is therefore confirmed as duplicate data and deleted.
S550, determine that the to-be-processed data is unique data.
If the MD5 value of the to-be-processed data differs from the MD5 values stored in the redis database, the current to-be-processed data can be regarded as having no duplicate, and it is determined to be unique data.
S560, store the to-be-processed data in a data storage device such as HBase or Oracle, and save the key-value pair of the to-be-processed data in the redis database.
In the technical solution of this embodiment, through processing by multiple clusters, the MD5 value of the to-be-processed data is compared with the MD5 values stored in the redis database to determine whether the data is duplicated, and duplicate data is deleted. This achieves effective deduplication of mass data and improves deduplication efficiency.
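The flow S510 to S560 can be condensed into a single-process sketch. It is a simplification under stated assumptions: the three clusters are collapsed into one function, `redis_index` and `database` are plain dictionaries standing in for the redis database and for HBase or Oracle, and `run_pipeline` and `make_id` are hypothetical names.

```python
import hashlib


def run_pipeline(rows, redis_index: dict, database: dict, make_id):
    """Deduplicate rows before storage; return the deleted duplicates.

    redis_index: MD5 value -> data ID (stand-in for the redis database)
    database:    data ID -> row       (stand-in for HBase or Oracle)
    make_id:     callable returning a fresh data identifier
    """
    deleted = []
    for row in rows:                                       # S510: row arrives
        md5_value = hashlib.md5(row.encode("utf-8")).hexdigest()   # S520
        data_id = make_id()
        if md5_value in redis_index:                       # S530: compare
            # S540: duplicate, so it is dropped and never reaches storage.
            deleted.append((data_id, redis_index[md5_value]))
            continue
        # S550 and S560: unique data is stored and its pair is saved.
        database[data_id] = row
        redis_index[md5_value] = data_id
    return deleted
```

Each entry of the returned list pairs the identifier of a deleted row with the identifier of the stored row it duplicated, which is the information a duplicate statistics store would need.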
Embodiment five
Fig. 6 is a schematic diagram of a data deduplication device provided by Embodiment five of the present invention. The device includes:
a data identifier acquisition module 610, for obtaining the MD5 value of to-be-processed data and a corresponding data identifier;
a key-value pair composition module 620, for forming a key-value pair of the to-be-processed data from the MD5 value and the data identifier;
an MD5 value comparison module 630, for comparing the MD5 value in the key-value pair of the to-be-processed data with the MD5 value in the key-value pair of existing data;
a duplicate data determination module 640, for deleting the to-be-processed data and determining the data identifier of the existing data that it duplicates, if the MD5 value in the key-value pair of the to-be-processed data is identical to the MD5 value in the key-value pair of existing data.
Further, the device also includes:
a key-value pair saving module 620, for storing, after the to-be-processed data is deleted, the key-value pair of the to-be-processed data and the key-value pair of the existing data that it duplicates in a duplicate statistics store.
Further, the device also includes a data storage module, specifically for:
if the MD5 value in the key-value pair of the to-be-processed data differs from the MD5 values in the key-value pairs of existing data, storing the to-be-processed data and the corresponding data identifier in a database;
saving the key-value pair of the to-be-processed data among the key-value pairs of existing data.
Further, the data identifier acquisition module 610 includes:
a data reading unit, for reading the to-be-processed data by row;
an MD5 value calculation unit, for calculating the MD5 value of the to-be-processed data;
a data identifier generation unit, for generating the data identifier of the to-be-processed data according to the read time and/or the thread number used when reading the to-be-processed data.
Further, the MD5 value calculation unit is specifically for:
if the to-be-processed data contains preset ignored data, removing the preset ignored data from the to-be-processed data;
calculating the MD5 value of the to-be-processed data after the preset ignored data has been removed, and using it as the MD5 value of the to-be-processed data.
The above data deduplication device can perform the data deduplication method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to performing that method.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them; it may include further equivalent embodiments without departing from the inventive concept, and its scope is determined by the scope of the appended claims.

Claims (10)

1. A data deduplication method, characterized by including:
obtaining the Message-Digest Algorithm 5 (MD5) value of to-be-processed data and a corresponding data identifier;
forming a key-value pair of the to-be-processed data from the MD5 value and the data identifier;
comparing the MD5 value in the key-value pair of the to-be-processed data with the MD5 value in the key-value pair of existing data;
if the MD5 value in the key-value pair of the to-be-processed data is identical to the MD5 value in the key-value pair of existing data, deleting the to-be-processed data and determining the data identifier of the existing data that the to-be-processed data duplicates.
2. The method according to claim 1, characterized in that, after the to-be-processed data is deleted, the method also includes:
storing the key-value pair of the to-be-processed data and the key-value pair of the existing data that it duplicates in a duplicate statistics store.
3. The method according to claim 1, characterized by also including:
if the MD5 value in the key-value pair of the to-be-processed data differs from the MD5 values in the key-value pairs of existing data, storing the to-be-processed data and the corresponding data identifier in a database;
saving the key-value pair of the to-be-processed data among the key-value pairs of existing data.
4. The method according to any one of claims 1-3, characterized in that obtaining the MD5 value of the to-be-processed data and the corresponding data identifier includes:
reading the to-be-processed data by row;
calculating the MD5 value of the to-be-processed data;
generating the data identifier of the to-be-processed data according to the read time and/or the thread number used when reading the to-be-processed data.
5. The method according to claim 4, characterized in that calculating the MD5 value of the to-be-processed data includes:
if the to-be-processed data contains preset ignored data, removing the preset ignored data from the to-be-processed data;
calculating the MD5 value of the to-be-processed data after the preset ignored data has been removed, and using it as the MD5 value of the to-be-processed data.
6. A data deduplication device, characterized in that it comprises:
a data identifier acquisition module, configured to obtain the MD5 value of pending data and a corresponding data identifier;
a key-value pair composition module, configured to combine the MD5 value and the data identifier into the key-value pair of the pending data;
an MD5 value comparison module, configured to compare the MD5 value in the key-value pair of the pending data with the MD5 values in the key-value pairs of existing data;
a duplicate data determination module, configured to, if the MD5 value in the key-value pair of the pending data is identical to an MD5 value in a key-value pair of the existing data, delete the pending data and determine the data identifier of the existing data that the pending data duplicates.
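One way to picture how the four modules fit together is a single class whose methods mirror them. This is a rough sketch only: the class name `DedupDevice`, the method names, and the counter-based identifier are hypothetical, and registering unseen pairs inside `process` is added solely so the demonstration has existing data to compare against.

```python
import hashlib
import itertools


class DedupDevice:
    """Hypothetical arrangement of the four modules as methods of one class."""

    def __init__(self):
        self.existing_pairs = {}             # MD5 value -> data identifier
        self._counter = itertools.count(1)   # placeholder identifier source

    def acquire(self, data):
        """Data identifier acquisition module: MD5 value plus identifier."""
        md5_value = hashlib.md5(data.encode("utf-8")).hexdigest()
        return md5_value, f"rec-{next(self._counter)}"

    def compose(self, md5_value, data_id):
        """Key-value pair composition module."""
        return (md5_value, data_id)

    def compare(self, pair):
        """MD5 value comparison module: is this MD5 value already known?"""
        return pair[0] in self.existing_pairs

    def determine(self, pair):
        """Duplicate data determination module: identifier of the original."""
        return self.existing_pairs[pair[0]]

    def process(self, data):
        """Return the identifier of the duplicated record, or None if new."""
        pair = self.compose(*self.acquire(data))
        if self.compare(pair):
            return self.determine(pair)      # the pending data is dropped
        self.existing_pairs[pair[0]] = pair[1]
        return None


device = DedupDevice()
results = [device.process(row) for row in ["x", "y", "x"]]
```

The third row repeats the first, so the last entry of `results` is the identifier assigned to the first row and the other two entries are `None`.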
7. The device according to claim 6, characterized in that it further comprises:
a key-value pair saving module, configured to, after the pending data is deleted, store the key-value pair of the pending data and the key-value pair of the existing data duplicated by the pending data in a duplicate statistics database.
8. The device according to claim 6, characterized in that it further comprises a data storage module, specifically configured to:
if the MD5 value in the key-value pair of the pending data differs from the MD5 values in the key-value pairs of the existing data, store the pending data and its corresponding data identifier in a database;
save the key-value pair of the pending data among the key-value pairs of the existing data.
9. The device according to any one of claims 6-8, characterized in that the data identifier acquisition module comprises:
a data reading unit, configured to read the pending data row by row;
an MD5 value calculation unit, configured to calculate the MD5 value of the pending data;
a data identifier generation unit, configured to generate the data identifier of the pending data according to the read time and/or the thread number at the time the pending data is read.
10. The device according to claim 9, characterized in that the MD5 value calculation unit is specifically configured to:
if the pending data contains preset ignorable data, remove the preset ignorable data from the pending data;
calculate the MD5 value of the pending data after the preset ignorable data has been removed, and use it as the MD5 value of the pending data.
CN201611129751.9A 2016-12-09 2016-12-09 Method and device for deleting duplicated data Pending CN106649646A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611129751.9A CN106649646A (en) 2016-12-09 2016-12-09 Method and device for deleting duplicated data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611129751.9A CN106649646A (en) 2016-12-09 2016-12-09 Method and device for deleting duplicated data

Publications (1)

Publication Number Publication Date
CN106649646A true CN106649646A (en) 2017-05-10

Family

ID=58824828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611129751.9A Pending CN106649646A (en) 2016-12-09 2016-12-09 Method and device for deleting duplicated data

Country Status (1)

Country Link
CN (1) CN106649646A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101882141A (en) * 2009-05-08 2010-11-10 北京众志和达信息技术有限公司 Method and system for implementing repeated data deletion
CN104636477A (en) * 2015-02-15 2015-05-20 山东卓创资讯集团有限公司 Push list duplicate removal method before information push
CN106130966A (en) * 2016-06-20 2016-11-16 北京奇虎科技有限公司 A kind of bug excavation detection method, server, device and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李凌轩: "《Excel/Access在数据与资料管理中的应用 第1版》", 31 January 2006, 北京:中国青年出版社 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107562555A (en) * 2017-08-02 2018-01-09 网宿科技股份有限公司 The cleaning method and server of duplicate data
CN108376171A (en) * 2018-02-27 2018-08-07 平安科技(深圳)有限公司 Method, apparatus, terminal device and the storage medium that big data quickly introduces
CN108376171B (en) * 2018-02-27 2020-04-03 平安科技(深圳)有限公司 Method and device for quickly importing big data, terminal equipment and storage medium
CN109039803A (en) * 2018-07-10 2018-12-18 武汉斗鱼网络科技有限公司 A kind of method, system and the computer equipment of processing readjustment notification message
CN109522305A (en) * 2018-12-06 2019-03-26 北京千方科技股份有限公司 A kind of big data De-weight method and device
CN112800004A (en) * 2019-10-28 2021-05-14 浙江宇视科技有限公司 Control method, device, equipment and medium for license plate algorithm library
CN112800004B (en) * 2019-10-28 2023-06-16 浙江宇视科技有限公司 License plate algorithm library control method, device, equipment and medium
CN110765756A (en) * 2019-10-29 2020-02-07 北京齐尔布莱特科技有限公司 Text processing method and device, computing equipment and medium
CN110765756B (en) * 2019-10-29 2023-12-01 北京齐尔布莱特科技有限公司 Text processing method, device, computing equipment and medium
CN111258966A (en) * 2020-01-14 2020-06-09 软通动力信息技术有限公司 Data deduplication method, device, equipment and storage medium
CN112422707A (en) * 2020-10-22 2021-02-26 北京安博通科技股份有限公司 Domain name data mining method and device and Redis server
CN113449505A (en) * 2021-07-01 2021-09-28 浪潮天元通信信息系统有限公司 File comparison method

Similar Documents

Publication Publication Date Title
CN106649646A (en) Method and device for deleting duplicated data
CN107294801B (en) Streaming processing method and system based on massive real-time internet DPI data
US8078610B2 (en) Optimization technique for dealing with data skew on foreign key joins
CN104331435B (en) A kind of efficient mass data abstracting method of low influence based on Hadoop big data platforms
CN105701094B (en) A kind of ETL collecting method and device
CN104618304B (en) Data processing method and data handling system
CN111858760B (en) Data processing method and device for heterogeneous database
CN105224690B (en) Generate and select the method and system of the executive plan of the corresponding sentence containing ginseng
CN104361031A (en) Big government data preprocessing system and method
US9405801B2 (en) Processing a data stream
CN103744880B (en) A kind of DNA data managing methods and system based on cloud computing
CN117370314A (en) Distributed database system collaborative optimization and data processing system and method
KR101955376B1 (en) Processing method for a relational query in distributed stream processing engine based on shared-nothing architecture, recording medium and device for performing the method
CN204906437U (en) Big data storage application network framework
Hurst et al. Social streams blog crawler
de Souza Ramos et al. Watershed: A high performance distributed stream processing system
CN103778220A (en) Decision support method and device based on cloud computing
WO2016197858A1 (en) Method and device for message notification
KR20170130178A (en) In-Memory DB Connection Support Type Scheduling Method and System for Real-Time Big Data Analysis in Distributed Computing Environment
Aslam et al. Pre‐filtering based summarization for data partitioning in distributed stream processing
CN107092529B (en) OLAP service method, device and system
CN103106366B (en) A kind of sample database dynamic maintaining method based on cloud
CA2425048C (en) Method and system for resource access
CN104182522B (en) Secondary indexing method and device on basis of circulation bitmap model
CN108133018B (en) Data evidence obtaining recommendation method based on association aggregation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170510