CN104715059A - Data processing method and device - Google Patents

Publication number
CN104715059A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510146263.8A
Other languages
Chinese (zh)
Inventor
张新亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TVMining Beijing Media Technology Co Ltd
Original Assignee
TVMining Beijing Media Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2015-03-30
Filing date
2015-03-30
Publication date
2015-06-17
Application filed by TVMining Beijing Media Technology Co Ltd
Priority to CN201510146263.8A
Publication of CN104715059A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a data processing method and device. The method comprises the following steps: acquiring multiple pieces of data; verifying the acquired pieces of data to obtain at least two identical pieces of data among them; and retaining one of the at least two identical pieces of data. Because the acquired data is verified and only one of the identical pieces is retained after verification, the method deduplicates data: a single copy of identical data is kept and the redundant copies are removed, which makes the data easier for users to use.

Description

Data processing method and device
Technical field
The present invention relates to the technical field of data processing, and more specifically to a data processing method and device.
Background
With the spread and development of Internet technology and intelligent terminals, the data available for people to consult keeps growing; such data includes, for example, pictures, text, video, and music.
Sometimes multiple news media report the same news item, and multiple websites present the same picture. How to let users obtain non-duplicated data is a problem in urgent need of a solution.
Summary of the invention
In view of this, the object of the embodiments of the present invention is to propose a data processing method and device capable of performing data deduplication.
To achieve the above object, an embodiment of the present invention proposes a data processing method, comprising:
Acquiring multiple pieces of data;
Verifying the acquired pieces of data to obtain at least two identical pieces of data among them;
Retaining one of the at least two identical pieces of data.
In an embodiment of the present invention, verifying the acquired pieces of data to obtain at least two identical pieces of data among them comprises:
Calculating the MD5 value of each piece of data among the multiple pieces of data;
Confirming at least two pieces of data with the same MD5 value as identical data.
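As an illustration only, the two sub-steps above could be sketched in Python as follows. The function name `find_identical` and the choice to return groups of list indices are assumptions made for this sketch; they do not appear in the application.

```python
import hashlib
from collections import defaultdict


def find_identical(items: list[bytes]) -> list[list[int]]:
    """Group item indices by MD5 value and return groups of two or more."""
    by_md5: dict[str, list[int]] = defaultdict(list)
    for index, payload in enumerate(items):
        # Sub-step 1: calculate the MD5 value of each piece of data.
        by_md5[hashlib.md5(payload).hexdigest()].append(index)
    # Sub-step 2: pieces sharing an MD5 value are confirmed as identical.
    return [indices for indices in by_md5.values() if len(indices) >= 2]


items = [b"story A", b"story B", b"story A", b"story C", b"story B"]
print(find_identical(items))  # [[0, 2], [1, 4]]
```

Pieces whose MD5 value is unique are simply absent from the result, since they have no identical counterpart.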
In an embodiment of the present invention, retaining one of the at least two identical pieces of data comprises:
Identifying the source of the at least two identical pieces of data;
Looking up whether the at least two identical pieces of data include data from said source;
When the at least two identical pieces of data include data from said source, retaining the piece of data that comes from said source.
In an embodiment of the present invention, retaining one of the at least two identical pieces of data comprises:
Comparing whether the issuing times of the at least two identical pieces of data are the same;
When the issuing times of the at least two identical pieces of data differ, retaining the piece of data with the earlier issuing time.
An embodiment of the present invention also proposes a data processing device, comprising:
An acquisition module, for acquiring multiple pieces of data;
A verification module, for verifying the acquired pieces of data to obtain at least two identical pieces of data among them;
A processing module, for retaining one of the at least two identical pieces of data.
In an embodiment of the present invention, the verification module comprises:
A calculation submodule, for calculating the MD5 value of each piece of data among the multiple pieces of data;
A confirmation submodule, for confirming at least two pieces of data with the same MD5 value as identical data.
In an embodiment of the present invention, the processing module comprises:
An identification submodule, for identifying the source of the at least two identical pieces of data;
A lookup submodule, for looking up whether the at least two identical pieces of data include data from said source;
A first processing submodule, for retaining the piece of data that comes from said source when the at least two identical pieces of data include data from said source.
In an embodiment of the present invention, the processing module comprises:
A comparison submodule, for comparing the issuing times of the at least two identical pieces of data;
A second processing submodule, for retaining the piece of data with the earlier issuing time when the issuing times of the at least two identical pieces of data differ.
The technical solution provided by the embodiments of the present invention may have the following beneficial effects:
Because the present invention verifies the acquired pieces of data and, after verification, retains one of the at least two identical pieces of data, it provides a data processing method with data deduplication: only one copy of identical data is retained and the redundant data is removed, which makes the data easier for users to use.
Other features and advantages of the embodiments of the present invention will be set forth in the following description and will in part become apparent from the specification, or be understood by practicing the invention. The objects and other advantages of the invention can be realized and obtained by the structures particularly pointed out in the written specification, the claims, and the accompanying drawings.
The technical solution of the embodiments of the present invention is described in further detail below through the drawings and embodiments.
Brief description of the drawings
The accompanying drawings provide a further understanding of the embodiments of the present invention and form part of the specification. Together with the embodiments, they serve to explain the invention and do not limit the embodiments. In the drawings:
Fig. 1 is a flowchart of the data processing method in one embodiment of the invention.
Fig. 2 is a flowchart of the data processing method in one embodiment of the invention.
Fig. 3 is a flowchart of the data processing method in one embodiment of the invention.
Fig. 4 is a structural diagram of the data processing device in one embodiment of the invention.
Detailed description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here serve only to illustrate and explain the embodiments of the invention and are not intended to limit them.
When multiple pieces of data (for example, news items) are obtained from, for example, a network, identical data may exist among them, so the acquired data needs to be processed, that is, deduplicated. Fig. 1 shows a flowchart of the data processing method in an embodiment of the invention. The method comprises:
Step S11: acquire multiple pieces of data.
The data may be, for example, text, audio, video, or images.
Step S12: verify the acquired pieces of data to obtain at least two identical pieces of data among them.
The verification may use one of the following algorithms: CRC8, CRC16, or CRC32 (Cyclic Redundancy Check); MD5 (Message-Digest Algorithm 5), which evolved from MD2, MD3, and MD4; SHA (Secure Hash Algorithm), which was specified by the US National Institute of Standards and Technology (NIST), the US body that standardizes cryptographic algorithms, and whose family includes SHA, SHA256, SHA384, and SHA512; RIPEMD, proposed in 1996 by Hans Dobbertin and two co-authors on the basis of an analysis of the weaknesses of MD4 and MD5, with four variants of 128, 160, 256, and 320 bits; and TIGER, proposed in 1995 by Ross Anderson and Eli Biham, which is claimed to be the fastest hash algorithm and is optimized for 64-bit machines.
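A minimal Python sketch of computing several of the listed check values for the same input may make the options concrete. The helper name `digests` is hypothetical, and only algorithms exposed directly by the standard `zlib` and `hashlib` modules are shown; the other listed algorithms are left out of this sketch.

```python
import hashlib
import zlib


def digests(payload: bytes) -> dict[str, str]:
    """Return several check values of the same byte string as hex text."""
    return {
        # CRC32 is a 32-bit checksum, printed here as 8 hex digits.
        "crc32": format(zlib.crc32(payload) & 0xFFFFFFFF, "08x"),
        # MD5 yields a 128-bit digest (32 hex digits).
        "md5": hashlib.md5(payload).hexdigest(),
        # Two members of the SHA family mentioned in the text.
        "sha256": hashlib.sha256(payload).hexdigest(),
        "sha512": hashlib.sha512(payload).hexdigest(),
    }


for name, value in digests(b"abc").items():
    print(name, value)
```

Whichever algorithm is chosen, two pieces of data are treated as identical when their check values match; shorter values such as CRC32 make accidental matches between different data more likely than longer digests do.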
When the MD5 algorithm is used to verify the acquired pieces of data, the MD5 value of each piece of data is calculated first; then at least two pieces of data with the same MD5 value are confirmed as identical data.
Step S13: retain one of the at least two identical pieces of data.
In this embodiment, because the acquired pieces of data are verified and, after verification, one of the at least two identical pieces of data is retained, a data processing method with data deduplication is provided: only one copy of identical data is retained and the redundant data is removed, which makes the data easier for users to use.
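Steps S11 to S13 can be sketched in a few lines of Python. This is one possible reading, not the applicant's implementation: the stub `acquire`, the function name `deduplicate`, and the rule of keeping the first copy encountered are all assumptions of the sketch.

```python
import hashlib
from typing import Iterable


def acquire() -> list[bytes]:
    """Step S11 (stub): stand-in for data fetched from a network."""
    return [b"report on event X", b"photo bytes", b"report on event X"]


def deduplicate(items: Iterable[bytes]) -> list[bytes]:
    """Steps S12 and S13: keep the first item seen for each MD5 value."""
    seen: set[str] = set()
    kept: list[bytes] = []
    for payload in items:
        md5_value = hashlib.md5(payload).hexdigest()  # S12: verify by MD5
        if md5_value not in seen:                     # S13: retain one copy
            seen.add(md5_value)
            kept.append(payload)
    return kept


print(deduplicate(acquire()))
```

As in the text, equal MD5 values are taken to mean equal data. A byte-for-byte comparison of candidates with matching values could be added where MD5 collisions are a concern.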
Fig. 2 shows a flowchart of the data processing method provided by another embodiment of the invention. In this embodiment, when one of the at least two identical pieces of data is retained, the data can additionally be processed according to its source, to improve the reliability of the retained data. The method comprises the following steps:
Step S21: acquire multiple pieces of data.
Step S22: calculate the MD5 value of each piece of data among the multiple pieces of data.
Step S23: determine whether any MD5 values are identical; if so, perform step S24; if not, end.
Step S24: confirm at least two pieces of data with the same MD5 value as identical data.
Step S25: identify the source of the at least two identical pieces of data.
Step S26: look up whether the at least two identical pieces of data include data from said source; if so, perform step S27; if not, perform step S28.
Step S27: retain the piece of data that comes from said source.
Step S28: retain one of the at least two identical pieces of data.
For example, the acquired data are news reports about the same event published by multiple websites. The invention first computes the MD5 value of each news content and compares the values; if identical MD5 values exist, identical news items are considered to exist. Note that the news content is only the body of the report and does not include leading parts such as the headline and date. News published on a website carries a "source" field, and the invention identifies this field. For example, after news 1 published by website 1 and news 2 published by website 2 are judged to be the same news, the source of the news is identified as Beijing News Network. News 1 and news 2 are then searched for a copy that comes from Beijing News Network, that is, whether website 1 or website 2 is Beijing News Network. If so, the copy published by Beijing News Network is retained, and the other copies, being reprints, can be discarded. This embodiment can thus retain the originating copy among identical data.
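The source-based flow of steps S21 to S28 might look like the following Python sketch. The record type `NewsItem`, its field names, and the function `keep_by_source` are hypothetical; the sketch also assumes that the credited source can be compared directly with the name of the publishing website.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class NewsItem:
    site: str    # website that published this copy (hypothetical field)
    source: str  # origin credited in the article (hypothetical field)
    body: str    # report text only, without headline or date


def keep_by_source(items: list[NewsItem]) -> list[NewsItem]:
    """Keep one copy per MD5 value, preferring the copy from the source."""
    groups: dict[str, list[NewsItem]] = defaultdict(list)
    for item in items:
        # S22 to S24: items whose bodies share an MD5 value are identical.
        key = hashlib.md5(item.body.encode("utf-8")).hexdigest()
        groups[key].append(item)
    kept = []
    for group in groups.values():
        credited = group[0].source  # S25: identify the source
        # S26: look for a copy published by the credited source itself.
        original = next((it for it in group if it.site == credited), None)
        # S27 keeps the original; S28 falls back to any one copy.
        kept.append(original if original is not None else group[0])
    return kept


items = [
    NewsItem("Website 1", "Beijing News Network", "Body of the report."),
    NewsItem("Beijing News Network", "Beijing News Network", "Body of the report."),
]
print(keep_by_source(items)[0].site)  # Beijing News Network
```

Hashing only `body` follows the note in the text that the headline and date are excluded, so reprints with a changed headline still match.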
Fig. 3 shows a flowchart of the data processing method provided by another embodiment of the invention. In this embodiment, when two pieces of data are identical, the title, time, and content length of the data can be compared further in order to select the better piece to retain. The method comprises the following steps:
Step S31: acquire multiple pieces of data.
Step S32: calculate the MD5 value of each piece of data among the multiple pieces of data.
Step S33: determine whether any MD5 values are identical; if so, perform step S34; if not, end.
Step S34: confirm at least two pieces of data with the same MD5 value as identical data.
Step S35: compare whether the issuing times of the at least two identical pieces of data are the same; if so, perform step S36; if not, perform step S37.
Step S36: retain one of the at least two identical pieces of data.
Step S37: retain the piece of data with the earlier issuing time among the at least two identical pieces of data.
In this embodiment, when the MD5 values of two pieces of data are the same, the issuing times of the data can be compared further so that the earliest-published data is retained.
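A short Python sketch of steps S31 to S37, under the assumption that each piece of data carries a comparable issuing time. The record type `Item`, its fields, and the function `keep_earliest` are hypothetical names introduced for illustration.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Item:
    body: bytes       # content that is hashed (hypothetical field)
    issued: datetime  # issuing time of this copy (hypothetical field)


def keep_earliest(items: list[Item]) -> list[Item]:
    """Keep, for each MD5 value, the copy with the earliest issuing time."""
    groups: dict[str, list[Item]] = defaultdict(list)
    for item in items:
        # S32 to S34: group pieces of data by MD5 value.
        groups[hashlib.md5(item.body).hexdigest()].append(item)
    # S35 to S37: min() picks the earliest copy; when the times are equal
    # it returns the first one, which corresponds to keeping any one copy.
    return [min(group, key=lambda it: it.issued) for group in groups.values()]


reprint = Item(b"same report", datetime(2015, 3, 30, 9, 0))
first = Item(b"same report", datetime(2015, 3, 29, 9, 0))
print(keep_earliest([reprint, first])[0].issued)  # 2015-03-29 09:00:00
```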
As shown in Fig. 4, an embodiment of the present invention also proposes a data processing device, comprising:
An acquisition module 401, for acquiring multiple pieces of data;
A verification module 402, for verifying the acquired pieces of data to obtain at least two identical pieces of data among them;
A processing module 403, for retaining one of the at least two identical pieces of data.
The verification module 402 comprises:
A calculation submodule, for calculating the MD5 value of each piece of data among the multiple pieces of data;
A confirmation submodule, for confirming at least two pieces of data with the same MD5 value as identical data.
The processing module 403 may comprise:
An identification submodule, for identifying the source of the at least two identical pieces of data;
A lookup submodule, for looking up whether the at least two identical pieces of data include data from said source;
A first processing submodule, for retaining the piece of data that comes from said source when the at least two identical pieces of data include data from said source.
The processing module 403 may also comprise:
A comparison submodule, for comparing the issuing times of the at least two identical pieces of data;
A second processing submodule, for retaining the piece of data with the earlier issuing time when the issuing times of the at least two identical pieces of data differ.
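The module structure of Fig. 4 could be mirrored in software roughly as below. This is a sketch under stated assumptions: the class names, method names, and the injected `fetch` callable are invented for illustration, and the processing module shown implements only the basic rule of keeping one copy per group.

```python
import hashlib
from collections import defaultdict
from typing import Callable, Iterable


class AcquisitionModule:
    """Module 401: acquires multiple pieces of data."""

    def __init__(self, fetch: Callable[[], Iterable[bytes]]):
        self._fetch = fetch  # hypothetical data source supplied by the caller

    def acquire(self) -> list[bytes]:
        return list(self._fetch())


class VerificationModule:
    """Module 402: calculation and confirmation submodules combined."""

    def verify(self, items: list[bytes]) -> list[list[bytes]]:
        groups: dict[str, list[bytes]] = defaultdict(list)
        for payload in items:
            groups[hashlib.md5(payload).hexdigest()].append(payload)
        return list(groups.values())


class ProcessingModule:
    """Module 403: retains one piece of data from each group."""

    def retain(self, groups: list[list[bytes]]) -> list[bytes]:
        return [group[0] for group in groups]


class DataProcessingDevice:
    """Wires the three modules together in the order 401, 402, 403."""

    def __init__(self, acquisition, verification, processing):
        self.acquisition = acquisition
        self.verification = verification
        self.processing = processing

    def run(self) -> list[bytes]:
        items = self.acquisition.acquire()
        return self.processing.retain(self.verification.verify(items))


device = DataProcessingDevice(
    AcquisitionModule(lambda: [b"a", b"a", b"b"]),
    VerificationModule(),
    ProcessingModule(),
)
print(device.run())  # [b'a', b'b']
```

Keeping the retention rule inside its own class means a source-based or time-based variant of module 403 can be swapped in without touching modules 401 and 402.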
Those skilled in the art should understand that embodiments of the invention may be provided as a method, a system, or a computer program product. Accordingly, the invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage and optical storage) containing computer-usable program code.
The invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in them, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. If these changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.

Claims (8)

1. A data processing method, characterized in that it comprises the following steps:
Acquiring multiple pieces of data;
Verifying the acquired pieces of data to obtain at least two identical pieces of data among them;
Retaining one of the at least two identical pieces of data.
2. The method according to claim 1, characterized in that verifying the acquired pieces of data to obtain at least two identical pieces of data among them comprises:
Calculating the MD5 value of each piece of data among the multiple pieces of data;
Confirming at least two pieces of data with the same MD5 value as identical data.
3. The method according to claim 1, characterized in that retaining one of the at least two identical pieces of data comprises:
Identifying the source of the at least two identical pieces of data;
Looking up whether the at least two identical pieces of data include data from said source;
When the at least two identical pieces of data include data from said source, retaining the piece of data that comes from said source.
4. The method according to claim 1, characterized in that retaining one of the at least two identical pieces of data comprises:
Comparing whether the issuing times of the at least two identical pieces of data are the same;
When the issuing times of the at least two identical pieces of data differ, retaining the piece of data with the earlier issuing time.
5. A data processing device, characterized in that it comprises:
An acquisition module, for acquiring multiple pieces of data;
A verification module, for verifying the acquired pieces of data to obtain at least two identical pieces of data among them;
A processing module, for retaining one of the at least two identical pieces of data.
6. The device according to claim 5, characterized in that the verification module comprises:
A calculation submodule, for calculating the MD5 value of each piece of data among the multiple pieces of data;
A confirmation submodule, for confirming at least two pieces of data with the same MD5 value as identical data.
7. The device according to claim 5, characterized in that the processing module comprises:
An identification submodule, for identifying the source of the at least two identical pieces of data;
A lookup submodule, for looking up whether the at least two identical pieces of data include data from said source;
A first processing submodule, for retaining the piece of data that comes from said source when the at least two identical pieces of data include data from said source.
8. The device according to claim 5, characterized in that the processing module comprises:
A comparison submodule, for comparing the issuing times of the at least two identical pieces of data;
A second processing submodule, for retaining the piece of data with the earlier issuing time when the issuing times of the at least two identical pieces of data differ.
CN201510146263.8A 2015-03-30 2015-03-30 Data processing method and device Pending CN104715059A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510146263.8A CN104715059A (en) 2015-03-30 2015-03-30 Data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510146263.8A CN104715059A (en) 2015-03-30 2015-03-30 Data processing method and device

Publications (1)

Publication Number Publication Date
CN104715059A true CN104715059A (en) 2015-06-17

Family

ID=53414385

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510146263.8A Pending CN104715059A (en) 2015-03-30 2015-03-30 Data processing method and device

Country Status (1)

Country Link
CN (1) CN104715059A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105007504A (en) * 2015-07-13 2015-10-28 无锡天脉聚源传媒科技有限公司 Browsing history processing method and device
CN105843963A (en) * 2016-04-19 2016-08-10 北京金山安全软件有限公司 Website selection method and server
CN109344007A (en) * 2018-09-29 2019-02-15 安徽江淮汽车集团股份有限公司 A kind of dual-clutch transmission NVM data method of calibration and module

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070136741A1 (en) * 2005-12-09 2007-06-14 Keith Stattenfield Methods and systems for processing content
CN103942125A (en) * 2014-05-06 2014-07-23 南宁博大全讯科技有限公司 Automatic backup method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
吴泽欣: "《SEO教程:搜索引擎优化入门与进阶》", 31 December 2008 *
董守斌等: "《网络信息检索》", 1 April 2010, 西安电子科技大学出版社 *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150617