CN106649646A - Method and device for deleting duplicated data - Google Patents
Method and device for deleting duplicated data
- Publication number: CN106649646A
- Application number: CN201611129751.9A
- Authority: CN (China)
- Prior art keywords: data, values, key, pending data, value pair
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
- G06F16/24554—Unary operations; Data partitioning operations
- G06F16/24556—Aggregation; Duplicate elimination
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An embodiment of the invention discloses a method and device for deleting duplicated data. The method comprises: acquiring an MD5 value of pending data and a corresponding data identifier; forming a key-value pair of the pending data from the MD5 value and the data identifier; comparing the MD5 value in the key-value pair of the pending data with the MD5 value in the key-value pair of existing data; and, if the two MD5 values are the same, deleting the pending data and determining the data identifier of the existing data that the pending data duplicates. By forming the key-value pair from the MD5 value and the data identifier, comparing the MD5 value of the pending data with that of the existing data, and deleting pending data whose MD5 value matches that of existing data, the embodiment addresses the problem of duplicated data in mass data, removes duplicates before the data is stored, reduces hard disk usage, and lowers cost.
Description
Technical field
The embodiments of the present invention relate to data processing technology, and in particular to a method and device for data deduplication.
Background
In the current era of big data, as informatization advances, letting the data speak has become the philosophy of many business operators. The volume of data that enterprises must process is growing rapidly. While big data brings convenience, it also burdens technical staff: mass data contains a large amount of duplicate data, which increases system load and degrades data loading and query performance. How to delete large amounts of duplicate junk data and reduce hard disk usage has become a pressing problem of the big data era.
Summary of the invention
The present invention provides a method and device for data deduplication, so as to deduplicate large-scale data and reduce hard disk usage.
In a first aspect, an embodiment of the present invention provides a method for data deduplication, the method including:
Obtaining the MD5 value of pending data and a corresponding data identifier;
Forming the key-value pair of the pending data from the MD5 value and the data identifier;
Comparing the MD5 value in the key-value pair of the pending data with the MD5 value in the key-value pair of existing data;
If the MD5 value in the key-value pair of the pending data is identical to the MD5 value in the key-value pair of the existing data, deleting the pending data and determining the data identifier of the existing data that the pending data duplicates.
Further, after the pending data is deleted, the method also includes:
Storing the key-value pair of the pending data and the key-value pair of the existing data that the pending data duplicates in a duplicate statistics database.
Further, the method also includes:
If the MD5 value in the key-value pair of the pending data differs from the MD5 values in the key-value pairs of existing data, storing the pending data and the corresponding data identifier in a database;
Saving the key-value pair of the pending data among the key-value pairs of existing data.
Further, obtaining the MD5 value and the corresponding data identifier of the pending data includes:
Reading the pending data row by row;
Calculating the MD5 value of the pending data;
Generating the data identifier of the pending data according to the read time and/or the thread number used when reading the pending data.
Further, calculating the MD5 value of the pending data includes:
If the pending data contains preset ignore data, removing the preset ignore data from the pending data;
Calculating the MD5 value of the pending data after the preset ignore data has been removed, and using it as the MD5 value of the pending data.
In a second aspect, an embodiment of the present invention further provides a device for data deduplication, the device including:
A data identifier acquisition module, configured to obtain the MD5 value of pending data and a corresponding data identifier;
A key-value pair composition module, configured to form the key-value pair of the pending data from the MD5 value and the data identifier;
An MD5 value comparison module, configured to compare the MD5 value in the key-value pair of the pending data with the MD5 value in the key-value pair of existing data;
A duplicate data determination module, configured to, if the MD5 value in the key-value pair of the pending data is identical to the MD5 value in the key-value pair of the existing data, delete the pending data and determine the data identifier of the existing data that the pending data duplicates.
Further, the device also includes a key-value pair saving module, configured to, after the pending data is deleted, store the key-value pair of the pending data and the key-value pair of the existing data that the pending data duplicates in a duplicate statistics database.
Further, the device also includes a data storage module, specifically configured to:
If the MD5 value in the key-value pair of the pending data differs from the MD5 values in the key-value pairs of existing data, store the pending data and the corresponding data identifier in a database;
Save the key-value pair of the pending data among the key-value pairs of existing data.
Further, the data identifier acquisition module includes:
A data reading unit, configured to read the pending data row by row;
An MD5 value calculation unit, configured to calculate the MD5 value of the pending data;
A data identifier generation unit, configured to generate the data identifier of the pending data according to the read time and/or the thread number used when reading the pending data.
Further, the MD5 value calculation unit is specifically configured to:
If the pending data contains preset ignore data, remove the preset ignore data from the pending data;
Calculate the MD5 value of the pending data after the preset ignore data has been removed, and use it as the MD5 value of the pending data.
In the technical solution of this embodiment, the MD5 value and the data identifier of the pending data form a key-value pair, the MD5 value in the key-value pair of the pending data is compared with the MD5 value in the key-value pair of existing data, and pending data whose MD5 value matches that of existing data is deleted. This addresses the problem of duplicate data in mass data, deduplicates the data before it is stored in the database, reduces hard disk usage, and lowers cost.
Description of the drawings
Fig. 1 is a flowchart of a data deduplication method provided by Embodiment one of the present invention;
Fig. 2 is a flowchart of a data deduplication method provided by Embodiment two of the present invention;
Fig. 3 is an overall architecture diagram of the data processing system used in a data deduplication method provided by an embodiment of the present invention;
Fig. 4 is a flowchart of a data deduplication method provided by Embodiment three of the present invention;
Fig. 5 is a flowchart of a data deduplication method provided by Embodiment four of the present invention;
Fig. 6 is a schematic diagram of a data deduplication device provided by Embodiment five of the present invention.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.
Embodiment one
Fig. 1 is a flowchart of a data deduplication method provided by Embodiment one of the present invention. This embodiment is applicable to effectively deduplicating mass data. The method can be performed by a data deduplication device and specifically includes the following steps:
S110: Obtain the MD5 value of the pending data and a corresponding data identifier.
MD5 (Message-Digest Algorithm 5) is used to verify that information is transmitted intact and consistent. It is one of the hash algorithms widely used in computing and is characterized by compressibility, ease of computation, resistance to modification, and strong resistance to accidental collisions. The pending data may be of text type; the data can be read row by row, column by column, or in other ways, and the corresponding MD5 value is calculated for the data read. The data identifier serves as the mark of each data record and is used to distinguish records from one another.
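The MD5 calculation in S110 can be illustrated with a minimal Python sketch. It assumes text rows encoded as UTF-8 and assumes that the trailing line terminator is not part of the record; the function name md5_of_row and the sample rows are hypothetical and do not come from the source.

```python
import hashlib


def md5_of_row(row: str, encoding: str = "utf-8") -> str:
    """Return the 32-character hexadecimal MD5 digest of one text row."""
    # Assumption: the line terminator is not part of the record,
    # so "a\n" and "a" must hash identically.
    return hashlib.md5(row.rstrip("\r\n").encode(encoding)).hexdigest()


# Hypothetical sample rows; the first and third are duplicates.
rows = ["alice,42\n", "bob,17\n", "alice,42\n"]
digests = [md5_of_row(r) for r in rows]
```

Identical rows yield identical digests, which is the property the later comparison step relies on.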
S120: Form the key-value pair of the pending data from the MD5 value and the data identifier.
The MD5 values can be stored on a Redis cluster. The benefit of a Redis cluster is that data in a Redis database is generally stored as key-value pairs, which allows efficient comparison operations. The MD5 value and the data identifier can be combined into a key-value pair and stored in the Redis cluster database. Duplicate data is usually processed on the Hadoop Distributed File System (HDFS), the distributed file system of the Hadoop distributed infrastructure, which can store mass data effectively while preventing single points of failure and avoiding unnecessary loss. However, when data is deduplicated on HDFS, the data must first be written to hard disk; that is, the data is stored before it is deduplicated, which wastes hard disk resources, increases hardware cost, and consumes a large amount of time. Deduplicating in the Redis database allows large amounts of duplicate junk data to be deleted before the data is stored, which reduces hard disk usage and lowers cost.
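A minimal sketch of S120 follows. It assumes that the MD5 value acts as the key and the data identifier as the value, which the source implies but does not state explicitly. A plain Python dict stands in for the Redis cluster so that the example runs without a server; make_key_value_pair and the identifier format are hypothetical.

```python
import hashlib
from typing import Dict, Tuple


def make_key_value_pair(row: str, data_id: str) -> Tuple[str, str]:
    """Pair the row's MD5 digest (key) with its data identifier (value)."""
    digest = hashlib.md5(row.encode("utf-8")).hexdigest()
    return digest, data_id


# Assumption: a dict stands in for the Redis key-value store.
store: Dict[str, str] = {}
key, value = make_key_value_pair("alice,42", "srv1-t3-0001")
store[key] = value
```

Keying on the digest makes the later duplicate check a single key lookup rather than a scan over stored records.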
S130: Compare the MD5 value in the key-value pair of the pending data with the MD5 value in the key-value pair of existing data.
The existing data may be data that has already been stored.
The MD5 value in the key-value pair of the pending data is compared with the MD5 value in the key-value pair of the existing data to judge whether the two are identical. Duplicate junk data is deleted through repeated comparison operations.
S140: If the MD5 value in the key-value pair of the pending data is identical to the MD5 value in the key-value pair of the existing data, delete the pending data and determine the data identifier of the existing data that the pending data duplicates.
If the MD5 value in the key-value pair of the pending data is identical to the MD5 value in the key-value pair of the existing data, the two records can be regarded as duplicates of each other. The pending data is therefore deleted, and the data identifier of the duplicated existing data is determined so that the existing data that the pending data duplicates can be identified.
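Steps S130 and S140 can be sketched as a lookup that returns the identifier of the duplicated existing record. This is an illustration under the assumption that existing key-value pairs are held in a mapping from MD5 digest to data identifier; find_duplicate and the sample identifiers are hypothetical.

```python
import hashlib
from typing import Dict, Optional


def find_duplicate(row: str, existing: Dict[str, str]) -> Optional[str]:
    """Return the identifier of the existing record that `row` duplicates, or None."""
    digest = hashlib.md5(row.encode("utf-8")).hexdigest()
    return existing.get(digest)


# Hypothetical existing data: one record already stored under "id-001".
existing = {hashlib.md5(b"alice,42").hexdigest(): "id-001"}
incoming = ["alice,42", "bob,17"]

# "Deleting" pending data is modelled here as simply not keeping the row.
kept = [row for row in incoming if find_duplicate(row, existing) is None]
```

Returning the identifier instead of a boolean keeps the information needed to record which existing record was duplicated.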
Steps S110, S120, S130 and S140 above may all be performed by one hardware device or may be performed separately by different hardware devices; the specific executing device is not limited here.
In the technical solution of this embodiment, the MD5 value and the data identifier of the pending data form a key-value pair, the MD5 value in the key-value pair of the pending data is compared with the MD5 value in the key-value pair of existing data, and pending data whose MD5 value matches that of existing data is deleted. This addresses the problem of duplicate data in mass data, deduplicates the data before it is stored in the database, reduces hard disk usage, and lowers cost.
On the basis of the above technical solution, after the pending data is deleted, the method preferably further includes:
Storing the key-value pair of the pending data and the key-value pair of the existing data that the pending data duplicates in a duplicate statistics database.
With both key-value pairs stored in the duplicate statistics database, the saved information can be used to calculate the duplication count of data with the same MD5 value, and the duplication count can serve as a reference factor when considering business requirements.
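The duplicate statistics database can be pictured with the following sketch. It assumes an in-memory mapping from MD5 digest to a list of (pending identifier, existing identifier) pairs; the source does not specify the storage layout, and record_duplicate, duplicate_count and the sample values are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

# Assumed layout: MD5 digest -> list of (pending_id, existing_id) pairs.
DuplicateStats = DefaultDict[str, List[Tuple[str, str]]]


def record_duplicate(stats: DuplicateStats, digest: str,
                     pending_id: str, existing_id: str) -> None:
    """Save the identifiers of a deleted record and the record it duplicated."""
    stats[digest].append((pending_id, existing_id))


def duplicate_count(stats: DuplicateStats, digest: str) -> int:
    """Return how many deleted duplicates were recorded for one MD5 value."""
    return len(stats.get(digest, []))


stats: DuplicateStats = defaultdict(list)
record_duplicate(stats, "9e107d9d372bb6826bd81d3542a419d6", "id-007", "id-001")
record_duplicate(stats, "9e107d9d372bb6826bd81d3542a419d6", "id-012", "id-001")
```

The count per digest is the duplication count that the text suggests using as a business reference.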
On the basis of the above embodiment, the method preferably further includes:
If the MD5 value in the key-value pair of the pending data differs from the MD5 values in the key-value pairs of existing data, storing the pending data and the corresponding data identifier in a database;
Saving the key-value pair of the pending data among the key-value pairs of existing data.
If the MD5 value in the key-value pair of the pending data differs from the MD5 values in the key-value pairs of existing data, it can be confirmed that the pending data does not duplicate any existing data, so the pending data and its data identifier are stored in the database. The key-value pair of the pending data is saved at the same time as a reference for later comparisons. Storing the non-duplicate data ensures the integrity of the data.
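Both branches, deleting a duplicate or storing a new record and remembering its key-value pair, can be combined in one hedged sketch. The dict named index stands in for the stored key-value pairs and the list named database for the target database; both, along with the function name ingest, are assumptions made for illustration.

```python
import hashlib
from typing import Dict, List, Tuple


def ingest(row: str, data_id: str, index: Dict[str, str],
           database: List[Tuple[str, str]]) -> bool:
    """Store `row` unless its MD5 value is already indexed; return True if stored."""
    digest = hashlib.md5(row.encode("utf-8")).hexdigest()
    if digest in index:
        return False                   # duplicate: the pending data is dropped
    database.append((data_id, row))    # non-duplicate: store data and identifier
    index[digest] = data_id            # keep the pair for later comparisons
    return True


index: Dict[str, str] = {}
database: List[Tuple[str, str]] = []
results = [
    ingest("alice,42", "id-001", index, database),
    ingest("alice,42", "id-002", index, database),
    ingest("bob,17", "id-003", index, database),
]
```

Because the index is updated as soon as a record is stored, a later identical row is recognized as a duplicate of it.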
Embodiment two
Fig. 2 is a flowchart of a data deduplication method provided by Embodiment two of the present invention. This embodiment further optimizes the above embodiment by refining "obtaining the MD5 value and the corresponding data identifier of the pending data" into "reading the pending data row by row; calculating the MD5 value of the pending data; generating the data identifier of the pending data according to the read time and/or the thread number used when reading the pending data." The method specifically includes the following steps:
S210: Read the pending data row by row.
Before the data enters the preprocessing stage, it can be transferred by a transport tool to the preprocessing server, where it waits to be processed. On the preprocessing server, a program reads the data row by row. The transport program generally transmits over the network, typically using Transmission Control Protocol (TCP) communication or File Transfer Protocol (FTP) transfer. The preprocessing program ingests data in real time, which improves data processing efficiency. Fig. 3 is an overall architecture diagram of the data processing system used in a data deduplication method provided by an embodiment of the present invention. As shown in Fig. 3, the data transport server cluster 310 consists of server 1 to server N and uses the transport program to transfer data to the preprocessing server cluster 320. The preprocessing server cluster 320 consists of server 1' to server N'; the preprocessing program can be installed on one or more servers, and the servers on which it is installed form the preprocessing server cluster 320, which calculates the MD5 values of the data and compares them on the Redis server cluster 330 with the MD5 values of existing data. The Redis server cluster 330 in this embodiment consists of server 1'' to server N'' connected by high-speed communication links. Data that is not duplicated is stored in the database 340, which may be an HBase or Oracle database.
S220: Calculate the MD5 value of the pending data.
An MD5 value is calculated in turn for each row of pending data read.
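Steps S210 and S220 can be sketched as a generator that streams rows and yields each row together with its MD5 value, so that the whole input never has to be held in memory. The in-memory stream stands in for a transferred file, and skipping blank rows is an assumption; rows_with_md5 is a hypothetical name.

```python
import hashlib
import io
from typing import Iterator, TextIO, Tuple


def rows_with_md5(stream: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (row, MD5 hex digest) for every non-empty row of a text stream."""
    for line in stream:
        row = line.rstrip("\r\n")
        if not row:
            continue  # assumption: blank rows carry no data and are skipped
        yield row, hashlib.md5(row.encode("utf-8")).hexdigest()


# An in-memory stream stands in for a file delivered by the transport program.
sample = io.StringIO("alice,42\nbob,17\n\nalice,42\n")
pairs = list(rows_with_md5(sample))
```

The same function works unchanged on a file object returned by open(), since both are iterated line by line.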
S230: Generate the data identifier of the pending data according to the read time and/or the thread number used when reading the pending data.
A unique data identifier, that is, a data ID, is generated for the pending data. The ID can be composed of at least one of the read time of the pending data and the thread number used when reading it. When the pending data is read by a preprocessing cluster, the ID can also include the device number of the preprocessing server within the cluster. Because there may be multiple preprocessing servers, each server is assigned a unique number, the preprocessing server device number, to tell the servers apart.
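One possible way to build such an ID is sketched below. The field order and separator are assumptions, and the trailing sequence counter is a hypothetical addition that is not described in the source; it is included only because two rows read by the same thread could otherwise share a timestamp.

```python
import itertools
import threading
import time
from typing import Optional

# Hypothetical tie-breaker, not described in the source.
_sequence = itertools.count()


def make_data_id(server_no: int, read_time_ns: Optional[int] = None,
                 thread_no: Optional[int] = None) -> str:
    """Build an ID from server device number, thread number, read time and a counter."""
    if read_time_ns is None:
        read_time_ns = time.time_ns()
    if thread_no is None:
        thread_no = threading.get_ident()
    return f"{server_no}-{thread_no}-{read_time_ns}-{next(_sequence)}"


first = make_data_id(1, read_time_ns=100, thread_no=7)
second = make_data_id(1, read_time_ns=100, thread_no=7)
```

Without the counter, uniqueness would depend on the clock resolution being finer than the time between two reads on one thread.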
S240: Form the key-value pair of the pending data from the MD5 value and the data identifier.
S250: Compare the MD5 value in the key-value pair of the pending data with the MD5 value in the key-value pair of existing data.
S260: If the MD5 value in the key-value pair of the pending data is identical to the MD5 value in the key-value pair of the existing data, delete the pending data and determine the data identifier of the existing data that the pending data duplicates.
In the technical solution of this embodiment, the pending data is read row by row, the MD5 value of each row is calculated, a data identifier is generated, and the MD5 value of the row is compared with the MD5 values of existing data; the pending data is deleted when the values are identical. This addresses the problem of duplicate data in mass data, deduplicates the data before it is stored, reduces hard disk usage, and lowers cost. Because the pending data is read as rows and each row is checked for duplication, duplicated rows are deleted, which makes the deduplication more fine-grained, and reading row by row also speeds up data processing.
Embodiment three
Fig. 4 is a flowchart of a data deduplication method provided by Embodiment three of the present invention. This embodiment further optimizes the above embodiment by refining "calculating the MD5 value of the pending data" into "if the pending data contains preset ignore data, removing the preset ignore data from the pending data; calculating the MD5 value of the pending data after the preset ignore data has been removed, and using it as the MD5 value of the pending data." The method specifically includes the following steps:
S410: Read the pending data row by row.
S420: If the pending data contains preset ignore data, remove the preset ignore data from the pending data.
Before the pending data is read, certain data content can be set as preset ignore data according to actual requirements, for example port numbers or unneeded time information in the data. Such content is generated randomly by the operating system and has no value in itself. Pending data records whose remaining content is identical and that differ only in the preset ignore data can therefore be treated as the same record: one of them is stored and the others are deleted, which achieves deduplication. Removing the preset ignore data before calculating the MD5 value improves the deduplication effect, saves part of the workload, and shortens the deduplication time.
S430: Calculate the MD5 value of the pending data after the preset ignore data has been removed, and use it as the MD5 value of the pending data.
That is, the preset ignore data is removed first, and the MD5 value of the pending data is then calculated.
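Steps S420 and S430 can be sketched with regular expressions that strip the preset ignore data before hashing. The two patterns (a port field and a timestamp) and the log-like row format are hypothetical examples chosen to match the port number and time information mentioned above; they are not taken from the source.

```python
import hashlib
import re
from typing import Pattern, Sequence

# Hypothetical preset ignore data: a "port=<digits>" field and a timestamp.
IGNORE_PATTERNS: Sequence[Pattern[str]] = (
    re.compile(r"port=\d+"),
    re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"),
)


def md5_without_ignored(row: str,
                        patterns: Sequence[Pattern[str]] = IGNORE_PATTERNS) -> str:
    """Remove the preset ignore data, then return the MD5 hex digest of the rest."""
    for pattern in patterns:
        row = pattern.sub("", row)
    return hashlib.md5(row.encode("utf-8")).hexdigest()


a = "2016-12-09 10:00:01 host=web1 port=51234 GET /index"
b = "2016-12-09 10:00:07 host=web1 port=49877 GET /index"
c = "2016-12-09 10:00:07 host=web2 port=49877 GET /index"
```

Rows a and b differ only in the ignored fields and therefore receive the same MD5 value, while row c differs in its host field and does not.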
S440: Generate the data identifier of the pending data according to the read time and/or the thread number used when reading the pending data.
S450: Form the key-value pair of the pending data from the MD5 value and the data identifier.
S460: Compare the MD5 value in the key-value pair of the pending data with the MD5 value in the key-value pair of existing data.
S470: If the MD5 value in the key-value pair of the pending data is identical to the MD5 value in the key-value pair of the existing data, delete the pending data and determine the data identifier of the existing data that the pending data duplicates.
In the technical solution of this embodiment, a step that removes the preset ignore data is added before the MD5 value of the data is calculated. It removes the meaningless parts of the data that the system generates randomly, which saves workload and shortens the deduplication time.
Embodiment four
Fig. 5 is a flowchart of a data deduplication method provided by Embodiment four of the present invention. This embodiment is a preferred embodiment based on the above embodiments. The method specifically includes the following steps:
S510: The data transport server cluster transfers the pending data to the preprocessing servers.
S520: The preprocessing server cluster calculates the MD5 value of the pending data, obtains the data identifier of the pending data, forms the key-value pair, and sends the key-value pair to the Redis cluster.
S530: The Redis cluster compares the MD5 value in the key-value pair with the MD5 values in the key-value pairs already stored in the Redis database and judges whether they are identical. If they are identical, S540 is performed; otherwise S550 is performed.
S540: Delete the duplicate data.
If the MD5 value of the pending data is identical to an MD5 value stored in the Redis database, the pending data duplicates existing data; the pending data is confirmed as duplicate data and deleted.
S550: Determine that the pending data is unique data.
If the MD5 value of the pending data differs from the MD5 values stored in the Redis database, the current pending data can be regarded as not duplicated, and it is determined to be unique data.
S560: Store the pending data in a data storage device such as HBase or Oracle, and store the key-value pair of the pending data in the Redis database.
In the technical solution of this embodiment, several clusters cooperate: the MD5 value of the pending data is compared with the MD5 values stored in the Redis database to judge whether the data is duplicated, and duplicate data is deleted. This achieves effective deduplication of mass data and improves deduplication efficiency.
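The flow S510 to S560 can be approximated in a single process as follows. Worker threads stand in for preprocessing servers, a lock-protected dict stands in for the Redis cluster, and a list stands in for HBase or Oracle; all of these, and every name in the code, are assumptions. The sketch also merges the comparison in S530 and the key-value write in S560 into one atomic set-if-absent operation, a design variation that is not described in the source; a Redis deployment could obtain similar behavior from the SET command with the NX option.

```python
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


class KeyValueStore:
    """Thread-safe dict standing in for the Redis cluster (assumption)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._lock = threading.Lock()

    def set_if_absent(self, key: str, value: str) -> bool:
        """Store the pair only if the key is new; return True if it was stored."""
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True


def preprocess(server_no: int, rows: List[str], kv: KeyValueStore,
               database: List[Tuple[str, str]]) -> int:
    """Deduplicate one batch and return the number of rows deleted as duplicates."""
    deleted = 0
    for line_no, row in enumerate(rows):
        digest = hashlib.md5(row.encode("utf-8")).hexdigest()
        data_id = f"srv{server_no}-row{line_no}"  # hypothetical identifier format
        if kv.set_if_absent(digest, data_id):
            database.append((data_id, row))       # unique data is stored
        else:
            deleted += 1                          # duplicate data is dropped
    return deleted


# Hypothetical batches, one per simulated preprocessing server.
batches = [["a", "b", "a"], ["b", "c"], ["c", "d", "a"]]
kv = KeyValueStore()
database: List[Tuple[str, str]] = []
with ThreadPoolExecutor(max_workers=len(batches)) as pool:
    futures = [pool.submit(preprocess, n, rows, kv, database)
               for n, rows in enumerate(batches)]
    deleted_total = sum(f.result() for f in futures)
```

Making the check and the write a single operation avoids the case in which two servers both find the same MD5 value absent and both store the row.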
Embodiment five
Fig. 6 is a schematic diagram of a data deduplication device provided by Embodiment five of the present invention. The device includes:
A data identifier acquisition module 610, configured to obtain the MD5 value of pending data and a corresponding data identifier;
A key-value pair composition module 620, configured to form the key-value pair of the pending data from the MD5 value and the data identifier;
An MD5 value comparison module 630, configured to compare the MD5 value in the key-value pair of the pending data with the MD5 value in the key-value pair of existing data;
A duplicate data determination module 640, configured to, if the MD5 value in the key-value pair of the pending data is identical to the MD5 value in the key-value pair of the existing data, delete the pending data and determine the data identifier of the existing data that the pending data duplicates.
Further, the device also includes:
A key-value pair saving module 620, configured to, after the pending data is deleted, store the key-value pair of the pending data and the key-value pair of the existing data that the pending data duplicates in a duplicate statistics database.
Further, the device also includes a data storage module, specifically configured to:
If the MD5 value in the key-value pair of the pending data differs from the MD5 values in the key-value pairs of existing data, store the pending data and the corresponding data identifier in a database;
Save the key-value pair of the pending data among the key-value pairs of existing data.
Further, the data identifier acquisition module 610 includes:
A data reading unit, configured to read the pending data row by row;
An MD5 value calculation unit, configured to calculate the MD5 value of the pending data;
A data identifier generation unit, configured to generate the data identifier of the pending data according to the read time and/or the thread number used when reading the pending data.
Further, the MD5 value calculation unit is specifically configured to:
If the pending data contains preset ignore data, remove the preset ignore data from the pending data;
Calculate the MD5 value of the pending data after the preset ignore data has been removed, and use it as the MD5 value of the pending data.
The data deduplication device described above can perform the data deduplication method provided by any embodiment of the present invention and has the functional modules and beneficial effects corresponding to performing that method.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to them; it may include other equivalent embodiments without departing from the inventive concept, and its scope is determined by the scope of the appended claims.
Claims (10)
1. A method for data deduplication, characterized by comprising:
Obtaining the Message-Digest Algorithm 5 (MD5) value of pending data and a corresponding data identifier;
Forming the key-value pair of the pending data from the MD5 value and the data identifier;
Comparing the MD5 value in the key-value pair of the pending data with the MD5 value in the key-value pair of existing data;
If the MD5 value in the key-value pair of the pending data is identical to the MD5 value in the key-value pair of the existing data, deleting the pending data and determining the data identifier of the existing data that the pending data duplicates.
2. The method according to claim 1, characterized in that, after the pending data is deleted, the method further comprises:
Storing the key-value pair of the pending data and the key-value pair of the existing data that the pending data duplicates in a duplicate statistics database.
3. The method according to claim 1, characterized by further comprising:
If the MD5 value in the key-value pair of the pending data differs from the MD5 values in the key-value pairs of existing data, storing the pending data and the corresponding data identifier in a database;
Saving the key-value pair of the pending data among the key-value pairs of existing data.
4. The method according to any one of claims 1-3, characterized in that obtaining the MD5 value and the corresponding data identifier of the pending data comprises:
Reading the pending data row by row;
Calculating the MD5 value of the pending data;
Generating the data identifier of the pending data according to the read time and/or the thread number used when reading the pending data.
5. The method according to claim 4, characterized in that calculating the MD5 value of the pending data comprises:
If the pending data contains preset ignore data, removing the preset ignore data from the pending data;
Calculating the MD5 value of the pending data after the preset ignore data has been removed, and using it as the MD5 value of the pending data.
6. A device for data deduplication, characterized by comprising:
A data identifier acquisition module, configured to obtain the MD5 value of pending data and a corresponding data identifier;
A key-value pair composition module, configured to form the key-value pair of the pending data from the MD5 value and the data identifier;
An MD5 value comparison module, configured to compare the MD5 value in the key-value pair of the pending data with the MD5 value in the key-value pair of existing data;
A duplicate data determination module, configured to, if the MD5 value in the key-value pair of the pending data is identical to the MD5 value in the key-value pair of the existing data, delete the pending data and determine the data identifier of the existing data that the pending data duplicates.
7. The device according to claim 6, characterized by further comprising:
A key-value pair saving module, configured to, after the pending data is deleted, store the key-value pair of the pending data and the key-value pair of the existing data that the pending data duplicates in a duplicate statistics database.
8. The device according to claim 6, characterized by further comprising a data storage module, specifically configured to:
if the MD5 value in the key-value pair of the pending data differs from the MD5 values in the key-value pairs of the existing data, store the pending data and the corresponding data identifier in a database;
save the key-value pair of the pending data among the key-value pairs of the existing data.
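The non-duplicate branch of claim 8 can be sketched with Python's built-in `sqlite3` module standing in for the database. The table name `records`, its columns, and the helper names `open_store` and `store_if_new` are hypothetical; the claim does not specify a database product or schema.

```python
import sqlite3
from typing import Dict


def open_store() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical `records` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (data_id TEXT PRIMARY KEY, payload TEXT)")
    return conn


def store_if_new(conn: sqlite3.Connection, existing: Dict[str, str],
                 md5_hex: str, data_id: str, row: str) -> bool:
    """Store a row only if its MD5 value is absent from the existing pairs."""
    if md5_hex in existing:
        return False  # duplicate: left to the deletion branch
    conn.execute(
        "INSERT INTO records (data_id, payload) VALUES (?, ?)", (data_id, row)
    )
    conn.commit()
    # Register the new pair so that later rows are compared against it too.
    existing[md5_hex] = data_id
    return True
```

Updating `existing` in the same step as the insert is what lets a later identical row be recognized as a duplicate of this one.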
9. The device according to any one of claims 6-8, characterized in that the data identifier acquisition module comprises:
a data reading unit, configured to read the pending data row by row;
an MD5 value calculation unit, configured to calculate the MD5 value of the pending data;
a data identifier generation unit, configured to generate the data identifier of the pending data according to the read time and/or the number of the thread that read the pending data.
10. The device according to claim 9, characterized in that the MD5 value calculation unit is specifically configured to:
if the pending data contains preset ignored data, remove the preset ignored data from the pending data;
calculate the MD5 value of the pending data after the preset ignored data has been removed, and use it as the MD5 value of the pending data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611129751.9A CN106649646A (en) | 2016-12-09 | 2016-12-09 | Method and device for deleting duplicated data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611129751.9A CN106649646A (en) | 2016-12-09 | 2016-12-09 | Method and device for deleting duplicated data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106649646A (en) | 2017-05-10 |
Family
ID=58824828
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611129751.9A Pending CN106649646A (en) | 2016-12-09 | 2016-12-09 | Method and device for deleting duplicated data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106649646A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107562555A (en) * | 2017-08-02 | 2018-01-09 | 网宿科技股份有限公司 | The cleaning method and server of duplicate data |
CN108376171A (en) * | 2018-02-27 | 2018-08-07 | 平安科技(深圳)有限公司 | Method, apparatus, terminal device and the storage medium that big data quickly introduces |
CN109039803A (en) * | 2018-07-10 | 2018-12-18 | 武汉斗鱼网络科技有限公司 | A kind of method, system and the computer equipment of processing readjustment notification message |
CN109522305A (en) * | 2018-12-06 | 2019-03-26 | 北京千方科技股份有限公司 | A kind of big data De-weight method and device |
CN110765756A (en) * | 2019-10-29 | 2020-02-07 | 北京齐尔布莱特科技有限公司 | Text processing method and device, computing equipment and medium |
CN111258966A (en) * | 2020-01-14 | 2020-06-09 | 软通动力信息技术有限公司 | Data deduplication method, device, equipment and storage medium |
CN112422707A (en) * | 2020-10-22 | 2021-02-26 | 北京安博通科技股份有限公司 | Domain name data mining method and device and Redis server |
CN112800004A (en) * | 2019-10-28 | 2021-05-14 | 浙江宇视科技有限公司 | Control method, device, equipment and medium for license plate algorithm library |
CN113449505A (en) * | 2021-07-01 | 2021-09-28 | 浪潮天元通信信息系统有限公司 | File comparison method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101882141A (en) * | 2009-05-08 | 2010-11-10 | 北京众志和达信息技术有限公司 | Method and system for implementing repeated data deletion |
CN104636477A (en) * | 2015-02-15 | 2015-05-20 | 山东卓创资讯集团有限公司 | Push list duplicate removal method before information push |
CN106130966A (en) * | 2016-06-20 | 2016-11-16 | 北京奇虎科技有限公司 | A kind of bug excavation detection method, server, device and system |
2016-12-09: application CN201611129751.9A filed in CN (published as CN106649646A); status: Pending
Non-Patent Citations (1)
Title |
---|
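Li Lingxuan (李凌轩): "Application of Excel/Access in Data and Document Management, 1st Edition" (《Excel/Access在数据与资料管理中的应用 第1版》), 31 January 2006, Beijing: China Youth Press (中国青年出版社) * |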
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107562555A (en) * | 2017-08-02 | 2018-01-09 | 网宿科技股份有限公司 | The cleaning method and server of duplicate data |
CN108376171A (en) * | 2018-02-27 | 2018-08-07 | 平安科技(深圳)有限公司 | Method, apparatus, terminal device and the storage medium that big data quickly introduces |
CN108376171B (en) * | 2018-02-27 | 2020-04-03 | 平安科技(深圳)有限公司 | Method and device for quickly importing big data, terminal equipment and storage medium |
CN109039803A (en) * | 2018-07-10 | 2018-12-18 | 武汉斗鱼网络科技有限公司 | A kind of method, system and the computer equipment of processing readjustment notification message |
CN109522305A (en) * | 2018-12-06 | 2019-03-26 | 北京千方科技股份有限公司 | A kind of big data De-weight method and device |
CN112800004A (en) * | 2019-10-28 | 2021-05-14 | 浙江宇视科技有限公司 | Control method, device, equipment and medium for license plate algorithm library |
CN112800004B (en) * | 2019-10-28 | 2023-06-16 | 浙江宇视科技有限公司 | License plate algorithm library control method, device, equipment and medium |
CN110765756A (en) * | 2019-10-29 | 2020-02-07 | 北京齐尔布莱特科技有限公司 | Text processing method and device, computing equipment and medium |
CN110765756B (en) * | 2019-10-29 | 2023-12-01 | 北京齐尔布莱特科技有限公司 | Text processing method, device, computing equipment and medium |
CN111258966A (en) * | 2020-01-14 | 2020-06-09 | 软通动力信息技术有限公司 | Data deduplication method, device, equipment and storage medium |
CN112422707A (en) * | 2020-10-22 | 2021-02-26 | 北京安博通科技股份有限公司 | Domain name data mining method and device and Redis server |
CN113449505A (en) * | 2021-07-01 | 2021-09-28 | 浪潮天元通信信息系统有限公司 | File comparison method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106649646A (en) | Method and device for deleting duplicated data | |
CN107294801B (en) | Streaming processing method and system based on massive real-time internet DPI data | |
US8078610B2 (en) | Optimization technique for dealing with data skew on foreign key joins | |
CN104331435B (en) | A kind of efficient mass data abstracting method of low influence based on Hadoop big data platforms | |
CN105701094B (en) | A kind of ETL collecting method and device | |
CN104618304B (en) | Data processing method and data handling system | |
CN111858760B (en) | Data processing method and device for heterogeneous database | |
CN105224690B (en) | Generate and select the method and system of the executive plan of the corresponding sentence containing ginseng | |
CN104361031A (en) | Big government data preprocessing system and method | |
US9405801B2 (en) | Processing a data stream | |
CN103744880B (en) | A kind of DNA data managing methods and system based on cloud computing | |
CN117370314A (en) | Distributed database system collaborative optimization and data processing system and method | |
KR101955376B1 (en) | Processing method for a relational query in distributed stream processing engine based on shared-nothing architecture, recording medium and device for performing the method | |
CN204906437U (en) | Big data storage application network framework | |
Hurst et al. | Social streams blog crawler | |
de Souza Ramos et al. | Watershed: A high performance distributed stream processing system | |
CN103778220A (en) | Decision support method and device based on cloud computing | |
WO2016197858A1 (en) | Method and device for message notification | |
KR20170130178A (en) | In-Memory DB Connection Support Type Scheduling Method and System for Real-Time Big Data Analysis in Distributed Computing Environment | |
Aslam et al. | Pre‐filtering based summarization for data partitioning in distributed stream processing | |
CN107092529B (en) | OLAP service method, device and system | |
CN103106366B (en) | A kind of sample database dynamic maintaining method based on cloud | |
CA2425048C (en) | Method and system for resource access | |
CN104182522B (en) | Secondary indexing method and device on basis of circulation bitmap model | |
CN108133018B (en) | Data evidence obtaining recommendation method based on association aggregation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170510 |