CN104410692A - Method and system for uploading duplicated files - Google Patents

Info

Publication number
CN104410692A
Authority
CN
China
Prior art keywords
file
uploaded
check value
service end
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410712783.6A
Other languages
Chinese (zh)
Other versions
CN104410692B (en)
Inventor
蓝夏军
张玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Eisoo Software Co Ltd
Original Assignee
Shanghai Eisoo Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Eisoo Software Co Ltd
Priority to CN201410712783.6A
Publication of CN104410692A
Application granted
Publication of CN104410692B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/06 Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method and a system for uploading duplicate files. In the method, for a file whose size is greater than or equal to a first threshold, the client calculates a check value of a verification section of the file, and the server performs rapid matching on the verification section check value and the file size. If rapid matching succeeds, the client calculates a check value of the whole file, and the server performs exact matching on the file verification information. If exact matching succeeds, the server adds a mapping from the file to be uploaded to the existing file record; otherwise, the server stores the file verification information and generates a file record after the file is uploaded. When rapid matching finds no matching item, calculating the check value of the whole file to be uploaded is avoided. Exact matching guarantees that the file to be uploaded is consistent with the file matched on the server. Pre-storing the file verification information during exact matching spares the server from recalculating it.

Description

A method and system for uploading duplicate files
Technical field
The present invention relates to the technical field of data processing, and in particular to a method and system for uploading duplicate files.
Background
With the spread of Internet technology and mobile devices, the amount of information that people create and share grows day by day. Network storage, such as network disks and cloud disks, offers users an efficient way of handling information: a file is uploaded once and can then be accessed from multiple terminals at any time. However, the mass of data held in network storage inevitably contains a large amount of duplicate data, mainly public resources that different users upload separately. This duplicate data consumes a large amount of network bandwidth during upload and places unnecessary load on the network storage servers.
To reduce the waste caused by duplicate data, the techniques commonly adopted at present are file upload based on duplicate-data verification, and data deduplication. Current file upload schemes based on duplicate-data verification have the following features and limitations:
(1) The file is split into blocks and duplicates are compared block by block, so duplicate data blocks in similar files can be handled. For example, Chinese patent CN102571952A discloses a system and method for transferring files that compares the file to be uploaded block by block. However, when network bandwidth is low, block-by-block comparison generates a large number of requests and responses between client and server, which reduces transmission efficiency. In addition, the server must store and compare the check value of every data block and maintain the mapping relations among a very large number of blocks. This puts heavy load on the server, often requires a dedicated server, and is difficult to implement on small and medium-sized server devices.
(2) If the file is not split into blocks, existing schemes must calculate a check value over the entire content of the file to be uploaded before the server can look up whether the same file exists. For example, Chinese patent CN103929453A discloses a method, apparatus and system for processing uploaded data, in which the index information of the data to be uploaded is used to determine whether that data is already stored on the storage server. When the file is large, this calculation is noticeably time-consuming. Some improved schemes calculate the check value from only part of the file content to shorten the calculation, but they cannot guarantee that the file to be uploaded is identical as a whole to the file found on the server.
Summary of the invention
The object of the present invention is to overcome the above defects of the prior art by providing a method and system for uploading duplicate files that copes with limited network bandwidth, reduces communication between client and server, reduces processing load inside the server, and compares the client's file to be uploaded with the server's existing files both quickly and exactly.
The object of the present invention can be achieved through the following technical solutions:
A method for uploading duplicate files, which transfers a file from a client to a server, comprises the following steps:
(1) the client judges whether the size of the file to be uploaded is not less than a first threshold; if so, step (2) is performed; if not, the client uploads the file and step (8) is performed;
(2) the client extracts the verification section of the file to be uploaded and calculates the verification section check value;
(3) the server performs rapid matching on the verification section check value and the file size and judges whether a matching item exists on the server; if so, step (4) is performed; if not, the client uploads the file and step (8) is performed;
(4) the client calculates the overall check value of the whole file to be uploaded;
(5) the server performs exact matching on the verification section check value, the overall check value and the file description information and judges whether a matching item exists on the server; if so, step (6) is performed; if not, the client uploads the file and step (7) is performed;
(6) the server adds a mapping record from the file to be uploaded to the existing file record;
(7) the server records the verification section check value, the overall check value and the file description information, and forms and saves the file record corresponding to the file to be uploaded;
(8) the server receives the file to be uploaded, calculates its file verification information, and forms and saves the file record corresponding to the file to be uploaded.
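The branching in steps (1) to (8) can be sketched as a small decision function. This is only an illustrative sketch: the names `Outcome` and `decide_upload` are hypothetical, and the two callables stand in for the server round trips of rapid matching and exact matching, which the text does not specify at protocol level.

```python
from enum import Enum


class Outcome(Enum):
    MAPPED = "mapped to an existing file record (step 6)"
    UPLOADED_CHECKS_SAVED = "uploaded; server keeps the client's check values (step 7)"
    UPLOADED_SERVER_COMPUTES = "uploaded; server calculates the check values (step 8)"


def decide_upload(size, first_threshold, rapid_match, exact_match):
    """Walk steps (1)-(8) for one file.

    rapid_match and exact_match are zero-argument callables (hypothetical
    stand-ins for the server requests); each is called only if it is reached.
    """
    if size < first_threshold:      # step (1): small file, upload directly
        return Outcome.UPLOADED_SERVER_COMPUTES
    if not rapid_match():           # steps (2)-(3): no candidate on the server
        return Outcome.UPLOADED_SERVER_COMPUTES
    if exact_match():               # steps (4)-(5): identical file found
        return Outcome.MAPPED
    return Outcome.UPLOADED_CHECKS_SAVED
```

Note that the whole-file check value of step (4) is only needed on the path where `rapid_match()` returns true, which is the saving the method aims for.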
In step (2), the verification section is extracted in one of the following ways:
A. data of the verification section length is extracted from the head of the file to be uploaded and used as the verification section; or
B. the size of the file to be uploaded is used as a parameter to obtain the start position of the verification section by a predefined procedure, and data of the verification section length is extracted from that position as the verification section.
In step (2), the methods for calculating the verification section check value include the MD5 algorithm.
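A minimal sketch of the two extraction modes follows. The text does not define the "predefined procedure" of mode B, so the offset rule below (half the file size, wrapped into the valid range) is purely an assumption for illustration, as are the function names.

```python
import hashlib


def section_start(file_size, section_len, mode="header"):
    """Start offset of the verification section.

    mode "header" is mode A. mode "by_size" is mode B with a made-up rule;
    client and server would have to agree on whatever rule is really used.
    """
    if mode == "header" or file_size <= section_len:
        return 0
    # Hypothetical rule: derive the offset from the file size, kept in range.
    return (file_size // 2) % (file_size - section_len + 1)


def section_check_value(data, section_len, mode="header"):
    """MD5 of the verification section of `data` (bytes)."""
    start = section_start(len(data), section_len, mode)
    return hashlib.md5(data[start:start + section_len]).hexdigest()
```

Because the offset depends only on the file size, both sides can locate the section without exchanging anything beyond the size that is uploaded anyway.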
In step (4), the overall check value of the whole file to be uploaded is calculated with the MD5 algorithm and/or the SHA-1 hash algorithm and/or the CRC32 checksum algorithm.
The file record comprises at least a record number, file verification information and file description information. The file verification information comprises the verification section check value and the overall check value of the file, and the file description information comprises the file name, the client modification time and the file size.
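One possible in-memory shape for such a file record is sketched below. The field names are illustrative assumptions; the text only lists which pieces of information a record contains.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileRecord:
    record_number: int
    # File description information.
    name: str
    client_mtime: float
    size: int
    # File verification information; None until it has been calculated.
    section_check: Optional[str] = None
    overall_check: Optional[str] = None

    def has_verification_info(self) -> bool:
        """True once both check values are present."""
        return self.section_check is not None and self.overall_check is not None
```

Leaving the check values optional reflects the case where the server stores a received file first and calculates its verification information later.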
Step (7) may be replaced with:
the server records the verification section check value, the overall check value and the file description information, and sends a record number and a match-failure indication to the client. The client checks whether the file data has changed between calculating the file verification information and the actual upload. If it has changed, the client uploads the file directly and step (8) is performed. If it has not changed, the client uploads the file with the received record number as one of the upload identifiers, and the server uses the record number to combine the information of the received file with the information already saved into a complete file record.
In step (8), calculating the file verification information of the received file on the server comprises:
judging whether the file size is not less than a second threshold; if so, the file verification information is calculated and forms the file record together with the information of the received file; if not, the file record is formed directly from the information of the received file.
A system for uploading duplicate files, which transfers a file from a client to a server, comprises a check value computing module, a document management module and a first communication module arranged in the client, and a check value matching module, an archive information administration module and a second communication module arranged in the server, wherein:
The check value computing module comprises:
a unit for calculating the verification section check value when the file size is not less than the first threshold; and
a unit for calculating the overall check value of the whole file to be uploaded when rapid matching finds a matching item;
The document management module comprises:
a unit for comparing the file size with the first threshold;
a unit for extracting the verification section from a file to be uploaded whose size is not less than the first threshold and passing it to the check value computing module; and
a unit for passing the file to be uploaded to the first communication module;
The first communication module comprises:
a unit for uploading the check value information and the file description information;
a unit for uploading the file to be uploaded to the server when the file size is less than the first threshold, when rapid matching reports no matching item, or when exact matching reports no matching item; and
a unit for receiving the server's responses;
The check value matching module implements rapid matching and exact matching, which includes matching the verification section check value, the overall check value of the file to be uploaded and the file description information, and passes the matching result to the second communication module;
The archive information administration module comprises:
a unit for saving file records;
a unit for adding a mapping from the file to be uploaded to the corresponding existing file record when exact matching succeeds on the server; and
a unit for calculating and saving the file verification information when the size of a file received by the server is not less than the second threshold and no file verification information exists for it;
The second communication module comprises:
a unit for receiving the check values or the file uploaded by the client; and
a unit for returning the results of rapid matching, exact matching and file upload to the client.
The document management module further comprises:
a unit for monitoring whether the file data changes between calculating the overall check value of the whole file to be uploaded and the actual upload.
The archive information administration module further comprises:
a unit for removing the saved file check values and file description information of a file that failed exact matching and has not been uploaded within a certain time.
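The clean-up unit described last can be sketched as follows. The data layout (a dict from record number to a `(saved_at, fingerprint)` tuple) and the name `purge_stale` are assumptions made for the example; the text leaves the storage and the length of the "certain time" open.

```python
def purge_stale(pending, now, max_age_seconds):
    """Remove pre-saved check information whose upload never arrived.

    pending maps record_number -> (saved_at, fingerprint), where saved_at is
    a timestamp in seconds. Returns the record numbers that were removed.
    """
    stale = [number for number, (saved_at, _) in pending.items()
             if now - saved_at > max_age_seconds]
    for number in stale:
        del pending[number]
    return stale
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the sketch easy to test.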
Compared with the prior art, the beneficial effects of the present invention are:
1. The server saves the verification information of each file, comprising the verification section check value and the overall check value of the whole file, which serves both rapid matching and exact matching. By first uploading the verification section check value to the server for matching, it can be judged quickly whether the server may hold identical content. If there is no match, the file is uploaded directly, which avoids the time that an unnecessary check value calculation over the whole file would take.
2. If exact matching fails and the file is then uploaded without its data having changed in the meantime, the server keeps the overall check value that the client calculated in advance, avoiding the waste of resources caused by calculating it again. If the file is never uploaded, or the file changes during the upload, the server removes the invalid record once it is older than a certain time and recalculates the check value information for a file that was uploaded after changing.
In summary, through rapid matching the present invention avoids the time that calculating a check value over the whole file might otherwise take and speeds up the checking of duplicate files; through exact matching it guarantees that the file to be uploaded is identical as a whole to the file matched on the server; and by pre-storing the overall check value and verifying files without splitting them into blocks, it is better suited to scenarios with limited network bandwidth and reduces server load.
Brief description of the drawings
To set out the details of the embodiments of the present invention more clearly, the flow charts, application scenario diagram and module interaction diagram relevant to the embodiments are listed below. The drawings illustrate the embodiments by way of example and are not limiting. A person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of the first embodiment of the file uploading method of the present invention;
Fig. 2 is a flow chart of the second embodiment of the file uploading method of the present invention;
Fig. 3 is a flow chart of the first embodiment of the server adding the overall check value;
Fig. 4 is a flow chart of the second embodiment of the server adding the overall check value;
Fig. 5 is an application scenario diagram of the file uploading system embodiment of the present invention;
Fig. 6 is a diagram of the modules of the file uploading system embodiment of the present invention and the interaction between them.
Detailed description of the embodiments
The object, design, technical solution and advantages of the present invention are described in detail below with reference to the drawings of the embodiments. It should be understood that the described embodiments serve to illustrate the present invention and do not limit its content in any way. The embodiments help a person of ordinary skill in the art to understand the summary of the invention better. All related embodiments that a person of ordinary skill in the art obtains without creative effort fall within the protection scope of the present invention.
Current file upload methods based on duplicate-file verification need to calculate the check value of the whole file directly, which takes a long time and, for non-duplicate files, is wasted effort. Some improved schemes verify only part of the file content and cannot guarantee that the file to be uploaded is identical as a whole to the file matched on the server.
One feature of the present invention is a two-stage matching process, rapid matching followed by exact matching, which speeds up the checking of duplicate files.
Fig. 1 shows the flow of the first embodiment of the file uploading method of the present invention, which comprises:
Step 101: the client judges whether the size of the file to be uploaded is not less than the first threshold. If it is, go to step 102; otherwise, go to step 111.
The first threshold is the file size, predefined on the client, at or above which duplicate checking is performed. It can be on the order of megabytes, for example 10 MB; it can be estimated and adjusted according to the network conditions of the actual application, or it can be any other threshold, obtained by other means, that is suited to improving the transfer efficiency of duplicate files.
Step 102: the client extracts the verification section from the file to be uploaded, calculates the verification section check value, and uploads the verification section check value and the file size to the server.
The verification section is extracted for the purpose of rapid matching. It should not be too long, or the point of fast calculation on the client and rapid matching on the server is lost. It should not be too short either, or it is not representative of the file, and a large number of different files may share the same verification section. The file size is uploaded as well to further improve the accuracy of rapid matching.
The verification section length can be on the order of kilobytes, for example 200 KB, or another size chosen according to the client's computing capability. The length has the following effect: the longer the verification section, the more accurate the matching result but the longer the client's calculation time; the shorter the verification section, the less accurate the matching result but the shorter the time the client spends calculating the check value.
The verification section can be taken from the head of the file or from another specific position in the file; alternatively, the file size can be used as a parameter to a function that is identical on client and server and yields the start position of the verification section.
The verification section check value can be calculated with the MD5 message-digest algorithm, with the SHA-1 hash algorithm, or in any other way agreed between client and server. Because the extracted verification section is short, even a fairly complex verification method is fast enough for rapid matching.
Note that in the present invention the verification section is not limited to one contiguous data region of the file. Optionally, content whose total length equals the verification section length can be extracted from several different regions of the file, concatenated, and used to calculate the check value.
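The non-contiguous variant can be sketched as below. The choice of equally sized pieces spread evenly from the first to the last byte is an assumption for illustration; the text only requires that the pieces together have the verification section length and that both sides extract them the same way.

```python
import hashlib


def multi_region_check_value(data, section_len, regions=4):
    """MD5 over `regions` equal pieces spread evenly across `data` (bytes).

    Assumes regions >= 2 and section_len divisible by regions, so that the
    pieces add up to exactly section_len bytes.
    """
    if len(data) <= section_len:
        return hashlib.md5(data).hexdigest()
    piece = section_len // regions
    stride = (len(data) - piece) // (regions - 1)
    digest = hashlib.md5()
    for i in range(regions):
        # Feeding the pieces in order is equivalent to hashing their concatenation.
        digest.update(data[i * stride:i * stride + piece])
    return digest.hexdigest()
```

Compared with a header-only section, sampling several regions makes it less likely that two files differing only near the end share a verification section check value, although it still cannot replace the whole-file check.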
Some existing schemes also verify part of the file content, but they use it as the only means of matching. The limitation of verifying only part of the content is evident: however elaborate the extraction method, as long as the extracted content does not cover the whole file, it cannot guarantee that the file to be uploaded is identical as a whole to the file matched on the server.
Step 103: the server judges whether a matching item exists for the verification section check value and the file size. This step performs rapid matching. If no matching item exists, the server does not hold the same file; go to step 111. If a matching item exists, the server may hold the same file; go to step 104.
In the present embodiment, a file that already exists on the server is called an original file, and each original file corresponds to one file record. A file record comprises at least a record number and a file fingerprint. The file fingerprint includes, but is not limited to, file verification information and file description information; the file verification information comprises the verification section check value and the overall check value of the file, and the file description information comprises the file name, the client modification time and the file size.
Note that a matching item in rapid matching only means that the server may hold the same file, because the verification section and the file size cannot establish that the whole file is identical. The reason is given under step 102.
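On the server, rapid matching amounts to a lookup keyed by the pair (verification section check value, file size). The sketch below uses an in-memory dictionary; the class name `RapidIndex` and the storage choice are assumptions, since the text does not say how file records are indexed.

```python
from collections import defaultdict


class RapidIndex:
    """Maps (section check value, file size) to candidate record numbers."""

    def __init__(self):
        self._by_key = defaultdict(list)

    def add(self, section_check, size, record_number):
        self._by_key[(section_check, size)].append(record_number)

    def candidates(self, section_check, size):
        """Record numbers that may hold the same file; empty means no match."""
        return list(self._by_key.get((section_check, size), []))
```

The result is a list rather than a single record because different files can share a verification section and a size, which is exactly why a hit only means "may be identical".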
Step 104: the client calculates the overall check value of the whole file to be uploaded and uploads the overall check value, the verification section check value and the file description information to the server.
The overall check value of the file can be calculated with the MD5 message-digest algorithm and/or the SHA-1 hash algorithm, or in any other way agreed between client and server. The present embodiment uses an MD5 check and a CRC32 check together. Other check algorithms can equally be used, alone or in combination, to calculate the file check value.
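Calculating MD5 and CRC32 together in a single pass over the file might look like the following sketch. Reading in chunks is a design choice of the example (it keeps memory use flat for large files), and the function name and the hexadecimal output format are assumptions.

```python
import hashlib
import io
import zlib


def overall_check_values(stream, chunk_size=1 << 20):
    """Return (md5_hex, crc32_hex) for a binary stream, read in chunks."""
    md5 = hashlib.md5()
    crc = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        md5.update(chunk)
        crc = zlib.crc32(chunk, crc)  # running CRC carried across chunks
    return md5.hexdigest(), format(crc & 0xFFFFFFFF, "08x")


# Usage: any binary file object works, for example open(path, "rb").
md5_hex, crc_hex = overall_check_values(io.BytesIO(b"example content"))
```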
The embodiment uploads the verification section check value and the file description information of the file, where the file description information includes, but is not limited to, the file name, the client modification time and the file size.
Step 105: the server judges whether a matching item exists for the fingerprint information of the file to be uploaded. This step performs exact matching. If no matching item exists, the server does not hold the same file; go to step 107. If a matching item exists, the server holds the same file; go to step 106.
As described under step 103, the file fingerprint matched in the present embodiment includes, but is not limited to, the MD5 check value and CRC32 check value of the file, the verification section check value and the file size.
Step 106: the server adds a mapping from the file to be uploaded to the existing file record, and the upload ends.
In the present embodiment, if a matching item for the file to be uploaded exists on the server, a mapping record from the file to be uploaded to the existing file record is added. The mapping is usually many-to-one: when a duplicate file is uploaded several times, by the same user or by different users, every upload is mapped to the same file record. A mapping record does not contain the file fingerprint. When the server performs rapid matching and exact matching, it matches only against file records.
In general, a user's operations on an uploaded duplicate file divide into operations on the mapping record and, through the mapping, operations on the original file. The following remarks are given only to make the description complete; they are not part of the claims of the present invention and should not bring it into conflict with other inventions. To read a duplicate file, the file record is generally obtained first through the mapping record, and the actual storage location of the original file is then obtained from the file record. When no fewer than 2 mapping records point to a file record, a delete operation deletes only the mapping record.
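The many-to-one mapping and the delete behaviour can be sketched with a reference count per file record. The class and the `(user, path)` key are hypothetical. The text does not say what happens when the last mapping is deleted, so the sketch only reports that case to the caller instead of deciding it.

```python
class MappingStore:
    """Many-to-one mapping records pointing at file records."""

    def __init__(self):
        self._target = {}  # (user, path) -> record number
        self._refs = {}    # record number -> number of mapping records

    def add_mapping(self, user, path, record_number):
        key = (user, path)
        if key in self._target:
            raise KeyError(f"mapping already exists for {key}")
        self._target[key] = record_number
        self._refs[record_number] = self._refs.get(record_number, 0) + 1

    def resolve(self, user, path):
        """Record number behind a mapping; the record holds the storage location."""
        return self._target[(user, path)]

    def delete(self, user, path):
        """Delete one mapping record.

        Returns True when no mapping points at the file record any more;
        what to do with the record then is left to the caller.
        """
        record_number = self._target.pop((user, path))
        self._refs[record_number] -= 1
        return self._refs[record_number] == 0
```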
Step 107: the server records the file fingerprint and returns the record number and a match-failure message.
At this point the server has determined through exact matching that no file identical to the file to be uploaded exists. Because the client may upload the whole file in a subsequent operation, the server in this embodiment saves the file's fingerprint in advance. If the file does not change during the upload, the server combines the uploaded file information with the fingerprint once the upload completes, producing a complete file record. This is another feature of the present invention: it avoids the waste of resources caused by the server recalculating the overall check value. If the file is never uploaded, or its content changes during the upload, the server removes the invalid record once it is older than a certain time.
Step 108: the client checks whether the file data has remained unchanged between calculating the check value of the whole file to be uploaded and the actual upload. If the file data is unchanged, go to step 109; otherwise, go to step 111.
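The text does not say how the client detects a change. One common and cheap heuristic, shown here as an assumption rather than as the method of the embodiment, is to compare the file's size and modification time before and after; a content hash would be stricter but would repeat the very calculation the method tries to avoid.

```python
import os


def snapshot(path):
    """Capture (size, modification time in ns) of a file."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def unchanged_since(path, earlier):
    """True if the file still has the size and mtime captured in `earlier`."""
    return snapshot(path) == earlier
```

A client would take the snapshot just before calculating the overall check value and call `unchanged_since` just before the actual upload.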
Step 109: the client uploads the file and the record number to the server.
The client passes the record number returned by the server as one of the parameters of the file upload.
Step 110: using the record number, the server combines the saved file fingerprint with the information of the uploaded file into a complete file record, and the upload ends.
When the server receives a file accompanied by a record number, the fingerprint of that file already exists on the server. The fingerprint and the file's other information can therefore be combined directly into a complete file record, and the server does not need to calculate the overall check value again.
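Steps 107, 109 and 110 can be sketched from the server's side as follows. The class `PendingServer`, the dictionary-based records and the sequential record numbers are all assumptions made to keep the example self-contained.

```python
import itertools


class PendingServer:
    """Keeps fingerprints saved at step 107 until the file arrives."""

    def __init__(self):
        self._numbers = itertools.count(1)
        self.pending = {}  # record number -> fingerprint saved in advance
        self.records = {}  # record number -> complete file record

    def save_fingerprint(self, fingerprint):
        """Step 107: store the client's fingerprint and return a record number."""
        number = next(self._numbers)
        self.pending[number] = fingerprint
        return number

    def receive_file(self, upload_info, record_number=None):
        """Steps 110 and 111: build the file record for a received file."""
        fingerprint = self.pending.pop(record_number, None)
        if fingerprint is None:
            # No saved fingerprint: check values must be calculated later.
            record_number = next(self._numbers)
        self.records[record_number] = dict(upload_info, fingerprint=fingerprint)
        return record_number
```

An upload that arrives with a known record number reuses the stored fingerprint; any other upload gets a record whose fingerprint is still empty.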
Step 111: the client uploads the file, the server calculates the verification information of the file and forms a file record from the verification information and the uploaded file information, and the upload ends.
The files uploaded in this step comprise files smaller than the first threshold, files for which rapid matching found no matching item, and files for which exact matching found no matching item and whose data changed between calculating the file check value and the actual upload. After receiving the file, the server first generates a file record without verification information, then calculates the overall check value using the flow of Fig. 3 or Fig. 4 and writes the calculated verification information into the file record.
Fig. 2 shows the flow of the second embodiment of the file uploading method of the present invention. It differs from the first embodiment in how the client treats the file during upload. In the first embodiment, the client must monitor whether the file data changes during the upload process and act accordingly in step 108. In the second embodiment, the client locks the file during the upload, so it does not need to monitor whether the file data changes. The substantive explanations and examples concerning the choice of first threshold, the handling of the verification section and the check value calculation are the same as in the first embodiment. The steps of the second embodiment comprise:
Step 201: the client judges whether the size of the file to be uploaded is not less than the first threshold. If it is, go to step 202; otherwise, go to step 208.
Step 202: the client calculates the verification section check value of the file to be uploaded and uploads it to the server together with the file size.
Step 203: the server performs rapid matching, judging whether a matching item exists for the verification section check value and the file size. If no matching item exists, the server does not hold the same file; go to step 208. If a matching item exists, the server may hold the same file; go to step 204.
Step 204: the client calculates the verification information of the whole file to be uploaded and uploads the file check value, the verification section check value and the file description information to the server.
Step 205: the server performs exact matching, looking up a matching item for the fingerprint of the file to be uploaded. If no matching item exists, the server does not hold the same file; go to step 207. If a matching item exists, the server holds the same file; go to step 206.
Step 206: the server adds a mapping from the file to be uploaded to the existing file record, and the upload ends.
Step 207: the server records the overall check value and forms the complete file record once the file has been uploaded, and the upload ends. This step can be divided into 3 sub-steps:
(1) the server records the file fingerprint and returns the record number and a match-failure message;
(2) the client uploads the file and the record number to the server;
(3) using the record number, the server combines the saved file fingerprint with the information of the uploaded file into a complete file record, and the upload ends.
Step 208: the client uploads the file, the server calculates the verification information of the file and forms a file record from the verification information and the uploaded file information, and the upload ends.
The files uploaded in this step comprise files smaller than the first threshold and files for which rapid matching found no matching item. After receiving the file, the server first generates a file record without verification information, then calculates the file verification information using the flow of Fig. 3 or Fig. 4 and writes the calculated verification information into the file record.
Fig. 3 gives the flow process that service end adds file verification information first embodiment.The meaning that service end adds file verification information is: may not there is check information in some file record of service end, as the file uploaded in Fig. 1 step 111 and Fig. 2 step 208, need calculation check section check value and overall check value, to obtain the Rapid matching that carries out file with service end and the consistent file record of exact matching demand.In embodiment illustrated in fig. 3, process the single or multiple file record that there is not check information, this file record entry can come from the message queue being pushed to processing module after service end receives file, and its step comprises:
Step 301: Obtain a file record that lacks verification information.
Step 302: Determine whether the file size is not less than the second threshold. If so, go to step 303; otherwise the flow ends.
The second threshold is the file size, predefined by the server, at or above which check values need to be calculated. There is no necessary relation between the second threshold, the first threshold and the verification segment length, but the following relation generally holds: verification segment length ≤ second threshold ≤ first threshold. This is based on the following considerations. The first threshold set by the client may change; for example, when the server has little storage space and strong processing capacity, the first threshold can be set to a smaller value. Keeping the second threshold no greater than the first threshold ensures that the server finds all possible file records during matching, because every file large enough for the client to request matching is also large enough for the server to have calculated its check values. The first and second thresholds are generally not smaller than the verification segment length, because transmitting small files directly is already very efficient.
In this embodiment the second threshold is set equal to the verification segment length, that is, at the KB level, for example 200 KB. Clearly, the larger the second threshold, the lower the computational load on the server; the second threshold can be chosen according to the needs of the actual application, the computing capacity of the server and similar factors.
Step 303: Calculate the file verification information and save it to the file record; the flow ends.
This step calculates the verification information of the file in the same way as the client, for example the MD5 check value and CRC32 check value of the whole file and the check value of the file verification segment. The way the file verification segment is obtained must also be consistent with the client.
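Steps 301 to 303 can be illustrated by the following minimal sketch. It is not taken from the patent: the `FileRecord` class, the field names, the in-memory `data` field and the choice of taking the verification segment from the head of the file are assumptions made for illustration, and the 200 KB values follow the example given for this embodiment.

```python
import hashlib
import zlib
from dataclasses import dataclass
from typing import Optional

SEGMENT_LENGTH = 200 * 1024         # assumed verification segment length (200 KB)
SECOND_THRESHOLD = SEGMENT_LENGTH   # this embodiment sets the two values equal


@dataclass
class FileRecord:
    """Hypothetical file record; the content is held in memory for brevity."""
    data: bytes
    segment_md5: Optional[str] = None
    overall_md5: Optional[str] = None
    overall_crc32: Optional[int] = None


def add_verification_info(record: FileRecord) -> bool:
    """Fill in the check values if the file is large enough (steps 302-303)."""
    if len(record.data) < SECOND_THRESHOLD:       # step 302: file too small
        return False
    segment = record.data[:SEGMENT_LENGTH]        # assumed: segment is the file head
    record.segment_md5 = hashlib.md5(segment).hexdigest()
    record.overall_md5 = hashlib.md5(record.data).hexdigest()
    record.overall_crc32 = zlib.crc32(record.data)
    return True                                   # step 303 completed
```

A real server would read the file from storage in chunks rather than hold it in memory; the only property the flow relies on is that the segment position and the algorithms match those used by the client.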
Fig. 4 shows the flow of a second embodiment in which the server adds file verification information. In this flow the server automatically and cyclically obtains file records that lack verification information and calculates it. The steps are:
Step 401: Obtain a file record that lacks verification information.
Step 402: Determine whether the file size is not less than the second threshold. If so, go to step 403; otherwise go to step 404. The second threshold is determined in the same way as in the embodiment shown in Fig. 3.
Step 403: Calculate the file verification information and save it to the file record.
Step 404: Determine whether any unprocessed file record remains. If so, go to step 401; if not, the flow ends.
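The loop of steps 401 to 404 might be sketched as follows. The representation of a file record as a dictionary, the `pending` queue and the threshold value are hypothetical; only the overall MD5 check value is calculated here to keep the example short.

```python
import hashlib
from collections import deque

SECOND_THRESHOLD = 200 * 1024  # assumed value, in bytes


def process_pending(pending: deque, threshold: int = SECOND_THRESHOLD) -> int:
    """Process every record that lacks verification information.

    Returns the number of records for which check values were calculated.
    """
    processed = 0
    while pending:                               # step 404: any record left?
        record = pending.popleft()               # step 401: obtain a record
        if len(record["data"]) >= threshold:     # step 402: size check
            # step 403: calculate and save the verification information
            record["overall_md5"] = hashlib.md5(record["data"]).hexdigest()
            processed += 1
    return processed
```

The `while` condition is evaluated before the first record is taken, which differs from the figure only when the queue starts empty.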
Fig. 5 shows an application scenario of an embodiment of the file uploading system of the present invention. As the figure shows, the file uploading method of the present invention is applicable to a general client-server architecture.
There may be one or more clients in Fig. 5, including but not limited to PC clients, Web clients and mobile device clients. There may be one or more servers; multiple servers may be independent of one another or may form a storage service network. The client may be implemented by any communication device, whether existing, under development or developed in the future, that can perform the client functions in Fig. 6. The server may be implemented by any communication device, whether existing, under development or developed in the future, that can provide the server functions in Fig. 6.
Fig. 6 shows the modules of an embodiment of the file uploading system of the present invention and the interactions between them. The modules and their functions are as follows:
The client comprises a file management module 601, a check value calculation module 602 and a first communication module 603. The functions of the client modules are:
File management module 601: maintains the list of files to be processed; stores the first threshold, the verification segment length and the way the verification segment is obtained; determines the relation between the size of the file to be uploaded and the first threshold; extracts the verification segment from a file to be uploaded whose size is not less than the first threshold and passes it to the check value calculation module; monitors whether the file data changes during the period from calculating the check value of the whole file to be uploaded until the file is uploaded, or locks the file during the upload process; passes the file to be sent to the communication module.
Check value calculation module 602: implements check value calculation methods consistent with the server, such as the MD5 check and the CRC32 check; calculates the verification segment check value when the file size is not less than the first threshold; calculates the check value of the whole file to be uploaded when rapid matching finds a match.
First communication module 603: establishes communication with the server, uploads check value information or files, and receives the server's responses; uploads files whose size is less than the first threshold; uploads the file to the server when rapid matching returns no match; uploads the file to the server when exact matching returns no match.
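The client-side calculations described for modules 601 and 602 could look like the sketch below. The function names and threshold values are hypothetical, the verification segment is assumed to be taken from the head of the file, and change monitoring is approximated by comparing file size and modification time, which the text does not prescribe.

```python
import hashlib
import os

FIRST_THRESHOLD = 1024 * 1024   # assumed first threshold (1 MB)
SEGMENT_LENGTH = 200 * 1024     # assumed verification segment length (200 KB)


def needs_matching(path: str, first_threshold: int = FIRST_THRESHOLD) -> bool:
    """Files smaller than the first threshold are uploaded without matching."""
    return os.path.getsize(path) >= first_threshold


def segment_check_value(path: str, length: int = SEGMENT_LENGTH) -> str:
    """MD5 of the verification segment, assumed here to be the file head."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read(length)).hexdigest()


def overall_check_value(path: str, chunk_size: int = 64 * 1024) -> str:
    """MD5 of the whole file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(path: str) -> tuple:
    """Size and modification time, compared before upload to detect changes."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)
```

A client would take a `snapshot` before calling `overall_check_value` and compare it with a second snapshot just before the upload; locking the file during the upload is the alternative the text mentions.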
The server comprises a file information management module 604, a check value matching module 605 and a second communication module 606. The functions of the server modules are:
File information management module 604: maintains the file information of the files already on the server; stores the second threshold, the verification segment length and the way the verification segment is obtained; implements check value calculation methods consistent with the client, such as the MD5 check and the CRC32 check; adds a mapping from the file to be uploaded to the existing file record when the server's exact matching of the file to be uploaded succeeds; calculates the overall check value and saves it to the file record when the size of a file received by the server is not less than the second threshold and no overall check value exists; saves the overall check value in advance when the server performs exact matching. If the corresponding file is ultimately not uploaded, or its content changes during upload, this verification information is invalid, and invalid records older than a certain time are removed.
Check value matching module 605: matches the verification segment check value, the overall check value of the file to be uploaded and the file description information, and passes the matching result to the communication module.
Second communication module 606: establishes communication with the client, receives the check values or files uploaded by the client, and returns the results of rapid matching, exact matching and file upload to the client.
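The two matching operations of module 605 can be sketched as below. The class and field names are hypothetical, and the only piece of file description information compared here is the file size, which is an assumption; the text does not state which description fields take part in exact matching.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical server-side file record."""
    record_no: int
    size: int
    segment_check: str
    overall_check: Optional[str]   # absent until it has been calculated


class CheckValueMatcher:
    def __init__(self, records: List[FileRecord]):
        self._records = records

    def rapid_match(self, segment_check: str, size: int) -> bool:
        """Cheap test: verification segment check value and file size only."""
        return any(r.segment_check == segment_check and r.size == size
                   for r in self._records)

    def exact_match(self, segment_check: str, overall_check: str,
                    size: int) -> Optional[FileRecord]:
        """Full test: segment check value, overall check value and size."""
        for r in self._records:
            if (r.segment_check, r.overall_check, r.size) == (
                    segment_check, overall_check, size):
                return r
        return None
```

Rapid matching exists so that the client computes the overall check value of a large file only when a duplicate is plausible; a linear scan is used here for clarity, whereas a real server would index records by check value.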
The interactions between the modules include:
File management module 601 and check value calculation module 602: when rapid matching is needed, the file management module passes the extracted verification segment to the check value calculation module; when exact matching is needed, the file management module passes the file content to the check value calculation module.
Clearly, because modules 601 and 602 are both on the client, passing the file content only requires module 601 to pass information such as the address and length of the file or of the file verification segment to module 602. For interactions between other modules that are both on the client or both on the server, file content is passed in a similar way.
File management module 601 and first communication module 603: the file management module passes the content of the file to be uploaded to the communication module.
Check value calculation module 602 and first communication module 603: the check value calculation module passes the verification segment check value and the file check value to the communication module.
File information management module 604 and check value matching module 605: the check value matching module requests file records from the file information management module; the file information management module returns the file records to the check value matching module.
File information management module 604 and second communication module 606: when the communication module receives a file without verification information and the file size is not less than the second threshold, it notifies the file information management module to calculate the verification information and save the file record.
Check value matching module 605 and second communication module 606: the communication module notifies the check value matching module to perform rapid matching; the communication module notifies the check value matching module to perform exact matching; the check value matching module returns the matching result to the communication module.
First communication module 603 and second communication module 606: module 603 sends requests to module 606, including verification information matching requests and file upload requests; module 606 returns responses to module 603, including rapid matching results, exact matching results and file upload results.

Claims (10)

1. A method for uploading duplicate files, realizing the transmission of a file from a client to a server, characterized by comprising the following steps:
(1) the client determines whether the size of the file to be uploaded is not less than a first threshold; if so, step (2) is performed; if not, the client uploads the file and step (8) is performed;
(2) the client extracts the verification segment of the file to be uploaded and calculates the verification segment check value;
(3) the server performs rapid matching according to the verification segment check value and the file size, and determines whether a match exists on the server; if so, step (4) is performed; if not, the client uploads the file and step (8) is performed;
(4) the client calculates the overall check value of the whole file to be uploaded;
(5) the server performs exact matching according to the verification segment check value, the overall check value and the file description information, and determines whether a match exists on the server; if so, step (6) is performed; if not, the client uploads the file and step (7) is performed;
(6) the server adds a mapping record from the file to be uploaded to the existing file record;
(7) the server records the verification segment check value, the overall check value and the file description information, and forms and saves the file record corresponding to the file to be uploaded;
(8) the server receives the file to be uploaded, calculates its file verification information, and forms and saves the file record corresponding to the file to be uploaded.
2. The method for uploading duplicate files according to claim 1, characterized in that, in said step (2), the ways of extracting the verification segment include:
A. extracting, from the head of the file to be uploaded, data content whose size is the verification segment length as the verification segment; or
B. taking the size of the file to be uploaded as a parameter, obtaining the start position of the verification segment by a predefined processing method, and extracting data content whose size is the verification segment length as the verification segment.
3. The method for uploading duplicate files according to claim 1, characterized in that, in said step (2), the methods for calculating the verification segment check value include the MD5 algorithm.
4. The method for uploading duplicate files according to claim 1, characterized in that, in said step (4), the methods used to calculate the overall check value of the whole file to be uploaded include the MD5 algorithm and/or the SHA-1 hash algorithm and/or the CRC32 check algorithm.
5. The method for uploading duplicate files according to claim 1, characterized in that said file record comprises at least a record number, file verification information and file description information; said file verification information comprises the file verification segment check value and the overall check value; and said file description information comprises the file name, the client modification time and the file size.
6. The method for uploading duplicate files according to claim 1, characterized in that said step (7) is replaced with:
the server records the verification segment check value, the overall check value and the file description information, and sends a record number and a match-failure indication to the client; the client detects whether the file data changes during the period from calculating the file verification information to the formal upload of the file; if so, the client uploads the file directly and step (8) is performed; if not, the client uploads the file using the received record number as one of the upload identifiers, and the server combines the information of the received file with the saved information according to the record number to form a complete file record.
7. The method for uploading duplicate files according to claim 1, characterized in that, in said step (8), calculating the file verification information of the received file on the server comprises:
determining whether the file size is not less than a second threshold; if so, calculating the file verification information and forming the file record from the file verification information and the information of the received file; if not, directly saving the information of the received file to form the file record.
8. A system for uploading duplicate files, realizing the transmission of a file from a client to a server, characterized by comprising a check value calculation module, a file management module and a first communication module arranged in the client, and a check value matching module, a file information management module and a second communication module arranged in the server, wherein
said check value calculation module comprises:
a unit for calculating the verification segment check value when the file size is not less than a first threshold; and
a unit for calculating the overall check value of the whole file to be uploaded when rapid matching finds a match;
the file management module comprises:
a unit for determining the relation between the file size and the first threshold;
a unit for extracting the verification segment from a file to be uploaded whose size is not less than the first threshold and passing it to the check value calculation module; and
a unit for passing the file to be uploaded to the first communication module;
the first communication module comprises:
a unit for uploading check value information and file description information;
a unit for uploading the file to be uploaded to the server when the file size is less than the first threshold, when rapid matching returns no match, or when exact matching returns no match; and
a unit for receiving the server's responses;
the check value matching module is used to implement rapid matching and exact matching, including matching the verification segment check value, the overall check value of the file to be uploaded and the file description information, and passing the matching result to the second communication module;
the file information management module comprises:
a unit for saving file records;
a unit for adding a mapping from the file to be uploaded to the corresponding existing file record when exact matching on the server succeeds; and
a unit for calculating and saving the file verification information when the size of a file received by the server is not less than a second threshold and no file verification information exists;
the second communication module comprises:
a unit for receiving the check values or files uploaded by the client; and
a unit for returning the results of rapid matching, exact matching and file upload to the client.
9. The system for uploading duplicate files according to claim 8, characterized in that said file management module further comprises:
a unit for monitoring whether the file data changes during the period from calculating the overall check value of the whole file to be uploaded to the formal upload of the file.
10. The system for uploading duplicate files according to claim 8, characterized in that said file information management module further comprises:
a unit for removing the saved file check values and file description information of a file whose exact matching failed and which has not been uploaded within a certain time.
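The method of claim 1 can be illustrated end to end by the following minimal sketch, which is not part of the claims. The `Server` class, the `upload` function, the dictionary-based records and the tiny byte-sized thresholds are all hypothetical; the verification segment is taken from the head of the file (mode A of claim 2), steps (7) and (8) are merged into a single `receive` call, the second threshold is ignored, and file size is the only description information compared.

```python
import hashlib
from typing import Dict, List

FIRST_THRESHOLD = 8    # tiny illustrative values, in bytes
SEGMENT_LENGTH = 4


def md5(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


class Server:
    def __init__(self) -> None:
        self.records: List[dict] = []        # one dict per stored file
        self.mappings: Dict[str, int] = {}   # uploaded name -> existing record index

    def rapid_match(self, seg: str, size: int) -> bool:            # step (3)
        return any(r["seg"] == seg and r["size"] == size for r in self.records)

    def exact_match(self, name: str, seg: str, overall: str,
                    size: int) -> bool:                            # steps (5)-(6)
        for i, r in enumerate(self.records):
            if (r["seg"], r["overall"], r["size"]) == (seg, overall, size):
                self.mappings[name] = i      # map the new name to the stored file
                return True
        return False

    def receive(self, name: str, data: bytes) -> None:             # steps (7)/(8)
        self.records.append({"name": name, "size": len(data),
                             "seg": md5(data[:SEGMENT_LENGTH]),
                             "overall": md5(data)})


def upload(server: Server, name: str, data: bytes) -> str:
    """Return 'deduplicated' if no file content had to be sent, else 'uploaded'."""
    if len(data) >= FIRST_THRESHOLD:                     # step (1)
        seg = md5(data[:SEGMENT_LENGTH])                 # step (2)
        if server.rapid_match(seg, len(data)):           # step (3)
            overall = md5(data)                          # step (4)
            if server.exact_match(name, seg, overall, len(data)):
                return "deduplicated"
    server.receive(name, data)                           # content is transmitted
    return "uploaded"
```

For example, uploading the same ten bytes twice under different names transmits the content once and creates a mapping for the second name, while a file that shares only its first four bytes and its size with a stored file passes rapid matching, fails exact matching and is uploaded in full.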
CN201410712783.6A 2014-11-28 2014-11-28 A kind of method and system uploaded for duplicate file Active CN104410692B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410712783.6A CN104410692B (en) 2014-11-28 2014-11-28 A kind of method and system uploaded for duplicate file

Publications (2)

Publication Number Publication Date
CN104410692A true CN104410692A (en) 2015-03-11
CN104410692B CN104410692B (en) 2019-03-22

Family

ID=52648287

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410712783.6A Active CN104410692B (en) 2014-11-28 2014-11-28 A kind of method and system uploaded for duplicate file

Country Status (1)

Country Link
CN (1) CN104410692B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105787041A (en) * 2016-02-26 2016-07-20 中国银联股份有限公司 Large file comparison method and comparison system based on data characteristic codes
CN106101274A (en) * 2016-08-10 2016-11-09 玉环看知信息科技有限公司 A kind of document transmission method, Apparatus and system
CN106649676A (en) * 2016-12-15 2017-05-10 北京锐安科技有限公司 Duplication eliminating method and device based on HDFS storage file
CN106845278A (en) * 2016-12-26 2017-06-13 武汉斗鱼网络科技有限公司 A kind of file verification method and system
CN108205632A (en) * 2016-12-20 2018-06-26 北京小米移动软件有限公司 System area method of calibration and device
CN108229162A (en) * 2016-12-15 2018-06-29 中标软件有限公司 A kind of implementation method of cloud platform virtual machine completeness check
CN109309651A (en) * 2017-07-28 2019-02-05 阿里巴巴集团控股有限公司 A kind of document transmission method, device, equipment and storage medium
CN110457278A (en) * 2018-05-07 2019-11-15 百度在线网络技术(北京)有限公司 A kind of document copying method, device, equipment and storage medium
CN110704439A (en) * 2019-09-27 2020-01-17 北京智道合创科技有限公司 Data storage method and device
CN110995679A (en) * 2019-11-22 2020-04-10 杭州迪普科技股份有限公司 File data flow control method, device, equipment and storage medium
CN111314314A (en) * 2020-01-20 2020-06-19 苏州浪潮智能科技有限公司 Method and system for verifying integrity of website download file
CN112631514A (en) * 2020-12-17 2021-04-09 龙存科技(北京)股份有限公司 File duplicate removal method and system applied to cloud disk system
CN114168537A (en) * 2021-11-27 2022-03-11 深圳市连用科技有限公司 Method for uploading file and terminal equipment
CN114401147A (en) * 2022-01-20 2022-04-26 山西晟视汇智科技有限公司 New energy power station communication message comparison method and system based on abstract algorithm
CN114422503A (en) * 2022-01-24 2022-04-29 深圳市云语科技有限公司 Method for intelligently selecting file transmission mode of multi-node file transmission system
CN114615258A (en) * 2022-03-28 2022-06-10 重庆长安汽车股份有限公司 Method and device for uploading large files to file server in fragmented manner
CN116527539A (en) * 2023-05-15 2023-08-01 合芯科技(苏州)有限公司 Data consistency verification method and device and computer equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103248711A (en) * 2013-05-23 2013-08-14 华为技术有限公司 File uploading method and server
CN103581230A (en) * 2012-07-26 2014-02-12 深圳市腾讯计算机系统有限公司 File transmission system and method, receiving end and sending end
CN103714123A (en) * 2013-12-06 2014-04-09 西安工程大学 Methods for deleting duplicated data and controlling reassembly versions of cloud storage segmented objects of enterprise

Also Published As

Publication number Publication date
CN104410692B (en) 2019-03-22

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Unit A-1, 2nd Floor, Building 8, No. 1188 Lianhang Road, Minhang District, Shanghai 201112

Applicant after: SHANGHAI EISOO INFORMATION TECHNOLOGY CO., LTD.

Address before: Unit A-1, 2nd Floor, Building 8, No. 1188 Lianhang Road, Minhang District, Shanghai 201112

Applicant before: Shanghai Eisoo Software Co.,Ltd.

COR Change of bibliographic data
GR01 Patent grant