CN110096483A

CN110096483A - Duplicate file detection method, terminal and server

Info

Publication number: CN110096483A
Application number: CN201910380465.7A
Authority: CN
Inventors: 李春平; 杨鹏飞
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2019-05-08
Filing date: 2019-05-08
Publication date: 2019-08-06
Anticipated expiration: 2039-05-08
Also published as: CN110096483B

Abstract

Embodiments of the invention provide a duplicate file detection method, a terminal and a server. The method includes: when sending to a server a file to be processed that a user needs to upload, the terminal obtains the size of the file to be processed, detects the target numerical interval to which that size belongs, calculates the hash value of the file to be processed according to the file hash value calculation method corresponding to the target numerical interval, and sends sending information containing the hash value of the file to be processed to the server; the server determines, according to the sending information, whether the file to be processed is a duplicate file and sends a response result to the terminal, the response result containing information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file. With this processing, the server does not need to wait until the file to be processed has been transmitted in full before obtaining its hash value, so the server can determine earlier whether the file to be processed is a duplicate file.

Description

Duplicate file detection method, terminal and server

Technical field

The present invention relates to the technical field of computer networks, and in particular to a duplicate file detection method, a terminal and a server.

Background art

With the rapid development of computer network technology, users can not only conveniently watch videos they like online through a video terminal, but can also upload videos that they have shot themselves or obtained through other channels to a video server, so as to share the uploaded videos with other users. As the server receives more and more files, such as videos, uploaded by users, some of these files will inevitably be duplicates. To avoid storing duplicate files, the server needs to check each file uploaded by a user to determine whether it is a duplicate file.

Therefore, to avoid storing duplicate files, the prior art calculates the hash value of an uploaded file after the upload has completed and compares that hash value with the hash values of stored files, so as to judge whether the uploaded file is a duplicate file.

However, in the course of implementing the present invention, the inventors found that the prior art has at least the following problem: because the prior art judges whether a file is a duplicate by calculating the hash value of a file that has already been uploaded, it cannot detect in a timely manner whether a file uploaded by a user is a duplicate file.

Summary of the invention

The purpose of the embodiments of the present invention is to provide a duplicate file detection method, a terminal and a server that can detect in a timely manner whether a file uploaded by a user is a duplicate file. The specific technical solutions are as follows:

In a first aspect, to achieve the above purpose, an embodiment of the present invention discloses a duplicate file detection method, the method including:

A terminal obtains a file to be processed that a user needs to upload to a server;

When sending the file to be processed to the server, the terminal obtains the size of the file to be processed;

The terminal detects the target numerical interval to which the size of the file to be processed belongs, wherein different numerical intervals correspond to different file hash value calculation methods;

The terminal calculates the hash value of the file to be processed according to the file hash value calculation method corresponding to the target numerical interval;

The terminal sends, to the server, sending information containing the hash value of the file to be processed;

The terminal receives a response result from the server for the sending information, wherein the response result contains information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

Optionally, the terminal calculating the hash value of the file to be processed according to the file hash value calculation method corresponding to the target numerical interval includes:

The terminal processes the data contained in the file to be processed according to the file hash value calculation method corresponding to the target numerical interval, to obtain a candidate hash value;

The terminal calculates the hash value of data containing the candidate hash value and the size of the file to be processed, and uses the calculated hash value as the hash value of the file to be processed.

Optionally, the target numerical interval is (0, A); the terminal processing the data contained in the file to be processed according to the file hash value calculation method corresponding to the target numerical interval, to obtain a candidate hash value, includes:

The terminal calculates the full hash value of the file to be processed and uses the full hash value as the candidate hash value.

Optionally, the target numerical interval is [A, B), where B > A; the terminal processing the data contained in the file to be processed according to the file hash value calculation method corresponding to the target numerical interval, to obtain a candidate hash value, includes:

The terminal calculates the hash value of data containing the preset head and the preset tail of the file to be processed, and uses the calculated hash value as the candidate hash value.

Optionally, the target numerical interval is [B, +∞); the terminal processing the data contained in the file to be processed according to the file hash value calculation method corresponding to the target numerical interval, to obtain a candidate hash value, includes:

The terminal calculates the hash value of data containing the preset head, the preset tail and the preset middle of the file to be processed, and uses the calculated hash value as the candidate hash value.

Optionally, the sending information further includes the size of the file to be processed.

In a second aspect, to achieve the above purpose, an embodiment of the present invention discloses a duplicate file detection method, the method including:

A server receives sending information, sent by a terminal, containing the hash value of a file to be processed, wherein the sending information is sent by the terminal to the server while the terminal is sending the file to be processed to the server;

The server determines, according to the sending information, whether the file to be processed is a duplicate file;

The server sends a response result to the terminal, wherein the response result contains information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

Optionally, the server determining, according to the sending information, whether the file to be processed is a duplicate file includes:

The server detects whether the hash values of the locally stored files include a hash value identical to the hash value of the file to be processed;

If the hash values of the locally stored files include a hash value identical to the hash value of the file to be processed, the server determines that the file to be processed is a duplicate file;

If the hash values of the locally stored files do not include a hash value identical to the hash value of the file to be processed, the server determines that the file to be processed is a non-duplicate file.

Optionally, the sending information further includes the size of the file to be processed;

The server determining, according to the sending information, whether the file to be processed is a duplicate file includes:

The server determines, according to the size of the file to be processed, the target numerical interval to which the file to be processed belongs;

The server detects whether the hash values of the stored files corresponding to the target numerical interval include a hash value identical to the hash value of the file to be processed;

If the hash values of the stored files corresponding to the target numerical interval include a hash value identical to the hash value of the file to be processed, the server determines that the file to be processed is a duplicate file;

If the hash values of the stored files corresponding to the target numerical interval do not include a hash value identical to the hash value of the file to be processed, the server determines that the file to be processed is a non-duplicate file.

In a third aspect, to achieve the above purpose, an embodiment of the present invention discloses a terminal, the terminal including a transceiver and a processor;

The transceiver is configured to obtain a file to be processed that a user needs to upload to a server, and to obtain the size of the file to be processed when sending the file to be processed to the server;

The processor is configured to detect the target numerical interval to which the size of the file to be processed belongs, wherein different numerical intervals correspond to different file hash value calculation methods, and to calculate the hash value of the file to be processed according to the file hash value calculation method corresponding to the target numerical interval;

The transceiver is further configured to send, to the server, sending information containing the hash value of the file to be processed, and to receive a response result from the server for the sending information, wherein the response result contains information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

Optionally, the processor is specifically configured to process the data contained in the file to be processed according to the file hash value calculation method corresponding to the target numerical interval, to obtain a candidate hash value; and to calculate the hash value of data containing the candidate hash value and the size of the file to be processed, using the calculated hash value as the hash value of the file to be processed.

Optionally, the target numerical interval is (0, A);

The processor is specifically configured to calculate the full hash value of the file to be processed and to use the full hash value as the candidate hash value.

Optionally, the target numerical interval is [A, B), where B > A;

The processor is specifically configured to calculate the hash value of data containing the preset head and the preset tail of the file to be processed, and to use the calculated hash value as the candidate hash value.

Optionally, the target numerical interval is [B, +∞);

The processor is specifically configured to calculate the hash value of data containing the preset head, the preset tail and the preset middle of the file to be processed, and to use the calculated hash value as the candidate hash value.

In a fourth aspect, to achieve the above purpose, an embodiment of the present invention discloses a server, the server including a transceiver and a processor;

The transceiver is configured to receive sending information, sent by a terminal, containing the hash value of a file to be processed, wherein the sending information is sent by the terminal to the server while the terminal is sending the file to be processed to the server;

The processor is configured to determine, according to the sending information, whether the file to be processed is a duplicate file;

The transceiver is further configured to send a response result to the terminal, wherein the response result contains information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

Optionally, the processor is specifically configured to detect whether the hash values of the locally stored files include a hash value identical to the hash value of the file to be processed; if they do, to determine that the file to be processed is a duplicate file; and if they do not, to determine that the file to be processed is a non-duplicate file.

Optionally, the sending information further includes the size of the file to be processed, and the processor is specifically configured to determine, according to the size of the file to be processed, the target numerical interval to which the file to be processed belongs; to detect whether the hash values of the stored files corresponding to the target numerical interval include a hash value identical to the hash value of the file to be processed; if they do, to determine that the file to be processed is a duplicate file; and if they do not, to determine that the file to be processed is a non-duplicate file.

In another aspect of the present invention, a duplicate file detection system is further provided, the system including a terminal and a server;

The terminal is configured to obtain a file to be processed that a user needs to upload to the server; to obtain the size of the file to be processed when sending the file to be processed to the server; to detect the target numerical interval to which the size of the file to be processed belongs, wherein different numerical intervals correspond to different file hash value calculation methods; to calculate the hash value of the file to be processed according to the file hash value calculation method corresponding to the target numerical interval; and to send, to the server, sending information containing the hash value of the file to be processed;

The server is configured to receive the sending information, sent by the terminal, containing the hash value of the file to be processed; to determine, according to the sending information, whether the file to be processed is a duplicate file; and to send a response result to the terminal, wherein the response result contains information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file;

The terminal is further configured to receive the response result from the server for the sending information.

In another aspect of the present invention, a computer-readable storage medium is further provided. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute any of the duplicate file detection methods described in the first aspect above.

In another aspect of the present invention, a computer-readable storage medium is further provided. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute any of the duplicate file detection methods described in the second aspect above.

In another aspect of the present invention, an embodiment of the present invention further provides a computer program product containing instructions that, when run on a computer, cause the computer to execute any of the duplicate file detection methods described in the first aspect above.

In another aspect of the present invention, an embodiment of the present invention further provides a computer program product containing instructions that, when run on a computer, cause the computer to execute any of the duplicate file detection methods described in the second aspect above.

Embodiments of the present invention provide a duplicate file detection method. When sending to a server a file to be processed that a user needs to upload, the terminal can obtain the size of the file to be processed, detect the target numerical interval to which that size belongs, calculate the hash value of the file to be processed according to the file hash value calculation method corresponding to the target numerical interval, and send sending information containing the hash value of the file to be processed to the server. The server determines, according to the sending information, whether the file to be processed is a duplicate file and sends a response result to the terminal, the response result containing information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file. With this processing, the server does not need to wait until the file to be processed has been transmitted in full before obtaining its hash value, so the server can determine earlier whether the file to be processed is a duplicate file.

Of course, a product or method implementing the present invention does not necessarily need to achieve all of the advantages described above at the same time.

Brief description of the drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below.

Fig. 1 is a flow chart of a duplicate file detection method provided by an embodiment of the present invention;

Fig. 2 is a flow chart of a duplicate file detection method provided by an embodiment of the present invention;

Fig. 3 is a structural diagram of a terminal provided by an embodiment of the present invention;

Fig. 4 is a structural diagram of a server provided by an embodiment of the present invention;

Fig. 5 is a structural diagram of a duplicate file detection system provided by an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described below with reference to the drawings in the embodiments of the present invention.

Because the prior art judges whether a file is a duplicate by calculating the hash value of a file that has already been uploaded, it cannot detect in a timely manner whether a file uploaded by a user is a duplicate file.

To solve this problem, the present invention provides a duplicate file detection method that can be applied to a terminal and to a server respectively. The terminal and the server can communicate with each other over a network, and the terminal may be a browser or another kind of terminal.

The terminal can obtain a file to be processed that a user needs to upload to the server and send the file to be processed to the server. While sending the file to be processed to the server, the terminal can also obtain the size of the file to be processed and detect the target numerical interval to which that size belongs. The terminal can then calculate the hash value of the file to be processed according to the file hash value calculation method corresponding to the target numerical interval, and send sending information containing the hash value of the file to be processed to the server.

The server can receive the sending information, sent by the terminal, containing the hash value of the file to be processed, and determine, according to the sending information, whether the file to be processed is a duplicate file. The server can then send a response result to the terminal, wherein the response result contains information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

With this processing, the terminal can send the hash value of the file to be processed to the server while it is sending the file itself, so the server can determine earlier whether the file to be processed is a duplicate file.
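The server-side check can be illustrated with a minimal sketch. It assumes the optional variant in which the sending information also carries the file size, so that stored hash values can be grouped by numerical interval. The class `DuplicateIndex`, the dictionary keys `"hash"` and `"size"`, the in-memory storage and the binary interpretation of "M" are all assumptions made for illustration; they are not taken from the patent text.

```python
from collections import defaultdict

MB = 1024 * 1024
A, B = 40 * MB, 128 * MB  # example thresholds A = 40M, B = 128M; binary units assumed


def interval_of(size):
    """Map a file size in bytes to an interval index: (0, A), [A, B) or [B, +inf)."""
    return 0 if size < A else 1 if size < B else 2


class DuplicateIndex:
    """Hypothetical in-memory stand-in for the server's record of stored-file hash values."""

    def __init__(self):
        self._hashes = defaultdict(set)  # interval index -> set of hash values

    def add(self, file_hash, size):
        """Record the hash value of a file that the server has stored."""
        self._hashes[interval_of(size)].add(file_hash)

    def respond(self, sending_info):
        """Build a response result for sending information {"hash": ..., "size": ...}."""
        bucket = self._hashes[interval_of(sending_info["size"])]
        return {"duplicate": sending_info["hash"] in bucket}
```

Grouping by interval means the server only compares against hash values that were produced with the same calculation method as the incoming one.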

The present invention is described in detail below with specific embodiments.

Referring to Fig. 1, Fig. 1 is a flow chart of a duplicate file detection method provided by an embodiment of the present invention. The method can be applied to a terminal and may include the following steps:

S101: the terminal obtains a file to be processed that a user needs to upload to a server.

The file to be processed may be a network resource in any format. For example, the file to be processed may be a video file, an audio file, or a file such as the installation package of an application. There may be one file to be processed or several. If there are several, the terminal can process each file to be processed in turn according to the duplicate file detection method of the present invention.

The terminal can obtain the file that the user needs to upload to the server (that is, the file to be processed) in order to upload it.

In one implementation, if the terminal is a browser, an "Upload" button can be provided in the display interface of the terminal. When the user clicks the "Upload" button, the terminal can display a list of files available for upload, the files in the list being files local to the terminal. The user can select the file to be processed from the files local to the terminal, and the terminal can accordingly obtain the file to be processed.

S102: when sending the file to be processed to the server, the terminal obtains the size of the file to be processed.

The size of the file to be processed is the amount of storage space that the file occupies. For example, the size of the file to be processed may be 556MB, or it may be 1000MB.

While sending the file to be processed to the server, the terminal can also obtain the size of the file to be processed, so that it can perform the processing corresponding to that size.

S103: the terminal detects the target numerical interval to which the size of the file to be processed belongs.

Different numerical intervals correspond to different file hash value calculation methods.

The way the numerical intervals are divided can be configured by technical personnel based on experience. For example, file sizes greater than 0 and less than a first threshold can be placed in one numerical interval; file sizes greater than or equal to the first threshold and less than a second threshold can be placed in another numerical interval, where the second threshold is greater than the first threshold; and file sizes greater than or equal to a third threshold can be placed in a further numerical interval, where the third threshold is greater than the second threshold. The first threshold, the second threshold and the third threshold are all positive numbers.

After determining the size of the file to be processed, the terminal can determine the numerical interval to which that size belongs (that is, the target numerical interval) and then process the file to be processed according to the file hash value calculation method corresponding to the target numerical interval.
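As a minimal sketch of step S103, the function below maps a file size to one of three numerical intervals (0, A), [A, B) and [B, +∞). The thresholds use the example values A = 40M and B = 128M that appear elsewhere in this description; the function name, the returned labels and the binary interpretation of "M" are assumptions made for illustration.

```python
MB = 1024 * 1024
A = 40 * MB    # example value of A from the description; binary units assumed
B = 128 * MB   # example value of B from the description


def target_interval(size):
    """Return a label for the numerical interval that contains `size` (in bytes)."""
    if size <= 0:
        raise ValueError("file size must be positive")
    if size < A:
        return "full"            # (0, A): hash all data in the file
    if size < B:
        return "head_tail"       # [A, B): hash the preset head and preset tail
    return "head_mid_tail"       # [B, +inf): hash the preset head, middle and tail
```

Because each interval is closed on the left and open on the right, a file whose size is exactly A or exactly B falls into the interval that starts at that threshold.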

S104: the terminal calculates the hash value of the file to be processed according to the file hash value calculation method corresponding to the target numerical interval.

The terminal can calculate the hash value of the file to be processed using a preset algorithm, which may be SHA-1 (Secure Hash Algorithm 1) or another algorithm.

In one implementation, the terminal can process the data contained in the file to be processed according to the file hash value calculation method corresponding to the target numerical interval, and use the processing result as the hash value of the file to be processed.

In another implementation, so that the calculated hash value reflects the uniqueness of the file to be processed more effectively, the method by which the terminal calculates the hash value of the file to be processed may include the following steps:

Step 1: process the data contained in the file to be processed according to the file hash value calculation method corresponding to the target numerical interval, to obtain a candidate hash value.

Depending on the size of the file to be processed, the method by which the terminal calculates the candidate hash value may cover the following cases:

Case 1: when the target numerical interval is (0, A), the terminal calculates the full hash value of the file to be processed and uses the full hash value as the candidate hash value.

The value of A can be configured by technical personnel based on experience. For example, A may be 40M.

In one implementation, when the terminal determines that the size of the file to be processed belongs to (0, 40M), the file to be processed is small, so the terminal can perform a hash operation on all of the data contained in the file. That is, the terminal can calculate the full hash value of the file to be processed and use the full hash value as the candidate hash value.

Case 2: when the target numerical interval is [A, B), where B > A, the terminal calculates the hash value of data containing the preset head and the preset tail of the file to be processed, and uses the calculated hash value as the candidate hash value.

The value of B and the sizes of the preset head and the preset tail can be configured by technical personnel based on experience. For example, B may be 128M, and the preset head and the preset tail may each be 20M.

In one implementation, when the terminal determines that the size of the file to be processed belongs to [40M, 128M), calculating the full hash value of the file would consume considerable computing resources and computing time.

Therefore, the terminal can sample the file to be processed. That is, the terminal can obtain the data of the preset head and the data of the preset tail of the file to be processed, splice the preset head data and the preset tail data together, perform a hash operation on the spliced data, and use the result of the operation as the candidate hash value.

Case 3: when the target numerical interval is [B, +∞), the terminal calculates the hash value of data containing the preset head, the preset tail and the preset middle of the file to be processed, and uses the calculated hash value as the candidate hash value.

In one implementation, when the terminal determines that the size of the file to be processed belongs to [128M, +∞), the file to be processed is large, so calculating its full hash value would consume considerable computing resources and computing time.

In addition, if the terminal processed only the preset head and preset tail data of the file to be processed, the resulting candidate hash value would be less effective.

Therefore, the terminal can obtain the preset head data, the preset tail data and the preset middle data of the file to be processed, splice the preset head, preset tail and preset middle data together, perform a hash operation on the spliced data, and use the result of the operation as the candidate hash value.

It can be seen that, in cases 1 to 3 above, the terminal processes files to be processed of different sizes in different ways to obtain different sampled data blocks, and then obtains the hash value of the file to be processed from the sampled data blocks.
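The three cases can be sketched as a single function that returns the candidate hash value (the "hash value to be selected"). This is a minimal illustration under stated assumptions rather than the patented implementation: the function name `candidate_hash`, its parameters, the splice order (head, middle, tail), the centring of the preset middle on the file midpoint and the use of the same 20M size for the middle block are all assumptions. SHA-1 is used because the description names it as one possible preset algorithm.

```python
import hashlib
import os

MB = 1024 * 1024
A, B = 40 * MB, 128 * MB   # example thresholds from the description; binary units assumed
CHUNK = 20 * MB            # example preset head/tail size; reused for the middle (assumption)


def _read(f, offset, length):
    """Read `length` bytes starting at `offset`."""
    f.seek(offset)
    return f.read(length)


def candidate_hash(path, a_thr=A, b_thr=B, chunk=CHUNK):
    """Return the SHA-1 hex digest of the whole file or of its sampled blocks."""
    size = os.path.getsize(path)
    h = hashlib.sha1()
    with open(path, "rb") as f:
        if size < a_thr:
            # Case 1, (0, A): full hash over all data, read in 1 MiB pieces.
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        else:
            # Cases 2 and 3: successive update() calls hash the spliced data.
            h.update(_read(f, 0, chunk))                      # preset head
            if size >= b_thr:
                mid = max(0, size // 2 - chunk // 2)
                h.update(_read(f, mid, chunk))                # preset middle (case 3 only)
            h.update(_read(f, max(0, size - chunk), chunk))   # preset tail
    return h.hexdigest()
```

Feeding the blocks to `update()` one after another gives the same digest as hashing their concatenation, so the spliced data never has to be held in memory as a single buffer.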

In one implementation, a correspondence between file size and number of samples can be stored in the terminal. The terminal can determine, according to this correspondence, the target number of samples corresponding to the size of the file to be processed. The terminal can then obtain that number of data blocks from the file to be processed and perform a hash operation on them to obtain the candidate hash value.

The correspondence between file size and number of samples is shown in Table (1).

Table (1)

File size (D)	Number of samples (S)
D < 40M	1
40M ≤ D < 128M	2
128M ≤ D < 512M	3
512M ≤ D < 1G	4
1G ≤ D < 4G	5
4G ≤ D	6

As Table (1) shows, when the file to be processed is smaller than 40M, the number of samples is 1. In this case the terminal does not need to sample the file to be processed: the sampled data block is the file to be processed itself, and the terminal can perform a hash operation directly on all of the data contained in the file and use the result as the candidate hash value.

When the file to be processed is greater than or equal to 40M and smaller than 128M, the number of samples is 2. That is, the terminal can obtain 2 data blocks of a preset size from the data contained in the file to be processed as the sampled data blocks.

When the file to be processed is greater than or equal to 128M and smaller than 512M, the number of samples is 3. That is, the terminal can obtain 3 data blocks of a preset size from the data contained in the file to be processed as the sampled data blocks.

When the file to be processed is greater than or equal to 512M and smaller than 1G, the number of samples is 4. That is, the terminal can obtain 4 data blocks of a preset size from the data contained in the file to be processed as the sampled data blocks.

When the file to be processed is greater than or equal to 1G and smaller than 4G, the number of samples is 5. That is, the terminal can obtain 5 data blocks of a preset size from the data contained in the file to be processed as the sampled data blocks.

When the file to be processed is greater than or equal to 4G, the number of samples is 6. That is, the terminal can obtain 6 data blocks of a preset size from the data contained in the file to be processed as the sampled data blocks.

The preset size may be 20M. When the number of sampled data blocks is greater than or equal to 2, the sampled data blocks may include the data of the preset head and the preset tail of the file to be processed.
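The lookup in Table (1) can be sketched with a sorted list of boundaries and a binary search. The names `BOUNDS` and `sample_count` and the binary interpretation of "M" and "G" are assumptions made for illustration.

```python
import bisect

MB = 1024 ** 2
GB = 1024 ** 3

# Lower bounds of rows 2 to 6 of Table (1); binary units are an assumption.
BOUNDS = [40 * MB, 128 * MB, 512 * MB, 1 * GB, 4 * GB]


def sample_count(size):
    """Return the number of samples S for a file of `size` bytes, per Table (1)."""
    # bisect_right counts how many boundaries are <= size, so a size that
    # equals a boundary lands in the row that starts at that boundary.
    return bisect.bisect_right(BOUNDS, size) + 1
```

For example, under these assumptions `sample_count(40 * MB)` is 2, because the second row of the table is closed at its lower bound.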

For two files of different-format, the difference of the header data of two files is larger, the tail of two files The difference of portion's data is larger, and therefore, terminal is when sampling file to be processed, if destination sample number is more than or equal to 2, The sampled data block that then terminal obtains may include the data that file to be processed presets head and default tail portion, so that sampling Data block can more accurately embody the uniqueness of file to be processed.

In addition, remove the default head of file to be processed and the data of default tail portion, terminal can also according to preset rules, Determine other sampled data blocks (data block at i.e. default middle part, be properly termed as middle part sampled data block).

In a kind of implementation, if destination sample number is odd number, and destination sample number is greater than 2, then terminal can be with The data block (being properly termed as midpoint sample data block) that the data midpoint that file to be processed includes presets size is obtained, remaining Middle part sampled data block is then evenly distributed on the two sides for the data midpoint that file to be processed includes according to preset interval.

If destination sample number is even number, and destination sample number is greater than 2, then the data midpoint that file to be processed includes Place is evenly distributed on the data midpoint that file to be processed includes according to preset interval without sampling, middle part sampled data root tuber Two sides.

For example, if the file to be processed is greater than or equal to 128M and less than 512M, there is 1 middle sampled data block: the midpoint sampled data block is obtained at the midpoint of the data contained in the file to be processed and used as the middle sampled data block.

If the file to be processed is greater than or equal to 512M and less than 1G, the preset interval may be 128M and there are 2 middle sampled data blocks; the distance between each of these two middle sampled data blocks and the midpoint of the data contained in the file to be processed is 128M.

If the file to be processed is greater than or equal to 1G and less than 4G, the preset interval may be 256M and there are 3 middle sampled data blocks. One of them is the midpoint sampled data block at the midpoint of the data contained in the file to be processed; the other two lie on either side of that midpoint, each at a distance of 256M from it.

If the file to be processed is greater than or equal to 4G, the preset interval may be 512M and there are 4 middle sampled data blocks. They lie on both sides of the midpoint of the data contained in the file to be processed, at distances of 512M and 1024M from it respectively.
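The sampling layout in the examples above can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function names are hypothetical, "M" and "G" are taken to be binary units, each middle block is assumed to be centered on its sampling position (the text only gives the distance from the midpoint), and sizes below 128M are not handled because the examples above do not cover them.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB
BLOCK = 20 * MIB  # preset sampled-block size given in the text


def middle_centers(size: int) -> list[int]:
    """Sampling positions of the middle sampled data blocks (assumed to be block centers)."""
    if size < 128 * MIB:
        raise ValueError("sizes below 128M are outside the examples covered by this sketch")
    mid = size // 2
    if size < 512 * MIB:   # 1 middle block: the midpoint block
        return [mid]
    if size < GIB:         # 2 middle blocks, preset interval 128M
        return [mid - 128 * MIB, mid + 128 * MIB]
    if size < 4 * GIB:     # 3 middle blocks, preset interval 256M
        return [mid - 256 * MIB, mid, mid + 256 * MIB]
    # 4 middle blocks, preset interval 512M: distances of 512M and 1024M from the midpoint
    return [mid - 1024 * MIB, mid - 512 * MIB, mid + 512 * MIB, mid + 1024 * MIB]


def sample_blocks(size: int) -> list[tuple[int, int]]:
    """Return (offset, length) for the head block, the middle blocks and the tail block."""
    starts = [0] + [c - BLOCK // 2 for c in middle_centers(size)] + [size - BLOCK]
    return [(start, BLOCK) for start in starts]
```

For a 512M file, for instance, this yields four blocks: the head, two middle blocks centered 128M on either side of the midpoint, and the tail.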

Step 2: calculate the hash value of the data containing the candidate hash value and the size of the file to be processed, and use the calculated hash value as the hash value of the file to be processed.

After obtaining the candidate hash value, the terminal may splice the candidate hash value with the size of the file to be processed. The terminal may then perform a hash operation on the spliced data and use the result of the operation as the hash value of the file to be processed.

It can be seen that the hash value of the file to be processed obtained by the method of this embodiment reflects not only the data contained in the file to be processed but also its size, and can therefore effectively reflect the uniqueness of the file to be processed.
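Step 2 can be illustrated with the short sketch below. The text does not name a hash algorithm or a splicing format, so SHA-256, the `:` separator and the function name `final_hash` are all assumptions made purely for illustration.

```python
import hashlib


def final_hash(candidate_hash: str, file_size: int) -> str:
    """Splice the candidate hash with the file size, then hash the spliced data."""
    # Assumed splicing format: "<candidate hash>:<size in bytes>".
    spliced = f"{candidate_hash}:{file_size}".encode("utf-8")
    # Assumed algorithm: SHA-256 (the embodiment does not specify one).
    return hashlib.sha256(spliced).hexdigest()
```

Under these assumptions, two files whose sampled data happen to produce the same candidate hash value but whose sizes differ still obtain different final hash values.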

S105: the terminal sends, to the server, transmission information containing the hash value of the file to be processed.

After the terminal obtains the hash value of the file to be processed, the terminal may send the transmission information containing that hash value to the server.

Correspondingly, after receiving the transmission information, the server may determine from it whether the file to be processed is a duplicate file and return a response result for the transmission information to the terminal. The response result contains either information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file. The processing steps of the server are described in detail in a subsequent embodiment.

The terminal may then obtain the response result sent by the server.

In one implementation, if the terminal is a browser, the main process in the terminal may be used to upload the file to be processed. The main process slices the file to be processed by means of a File object and XMLHttpRequest (Extensible Markup Language Hypertext Transfer Protocol Request), and sends the sliced file to be processed to the server by asynchronous upload.

Meanwhile, the terminal may also send the transmission information containing the hash value of the file to be processed to the server through an independent Web Worker thread.

It can be seen that, with the duplicate file detection method of the embodiment of the present invention, the terminal may also send the hash value of the file to be processed to the server while sending the file itself. The server does not need to wait until the whole file to be processed has been transmitted before obtaining its hash value, so the server can determine earlier whether the file to be processed is a duplicate file. In addition, with this method the terminal is responsible for calculating the hash value of the file to be processed, which reduces the computational load on the server.

Optionally, the transmission information may also include the size of the file to be processed; that is, when sending the hash value of the file to be processed to the server, the terminal may also send the size of the file to be processed to the server.

Correspondingly, the server may determine whether the file to be processed is a duplicate file by combining the hash value of the file to be processed with its size, which can improve the efficiency of the duplicate file detection method.
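One possible shape of the transmission information is sketched below. The text does not define a wire format, so the JSON encoding, the class name `TransmissionInfo` and the field names `file_hash` and `file_size` are hypothetical and serve only to show that the size is an optional addition to the hash value.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TransmissionInfo:
    file_hash: str                   # hash value of the file to be processed
    file_size: Optional[int] = None  # optional size of the file, in bytes

    def to_json(self) -> str:
        # Omit the size when the terminal does not send it.
        payload = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(payload, sort_keys=True)
```

A server receiving such a message could fall back to a lookup over all stored hash values whenever `file_size` is absent.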

Referring to Fig. 2, Fig. 2 is a flowchart of a duplicate file detection method provided by an embodiment of the present invention. The method may be applied to a server and may include the following steps:

S201: the server receives the transmission information, sent by the terminal, containing the hash value of the file to be processed.

The transmission information may be sent by the terminal to the server when the terminal sends the file to be processed to the server. The file to be processed may be a network resource of any format; for example, it may be a video file, an audio file, or a file such as the installation package of an application program.

The terminal may obtain the file that the user needs to upload to the server (that is, the file to be processed), and may then send the file to be processed to the server.

When sending the file to be processed to the server, the terminal may also obtain the size of the file to be processed, detect the target numerical interval to which that size belongs, calculate the hash value of the file to be processed according to the file hash value calculation method corresponding to the target numerical interval, and send the transmission information containing the hash value of the file to be processed to the server. For the processing performed by the terminal, refer to the detailed description in the above embodiment.

Correspondingly, the server may receive the transmission information containing the hash value of the file to be processed.

S202: the server determines, according to the transmission information, whether the file to be processed is a duplicate file.

After obtaining the transmission information, the server may extract the hash value of the file to be processed. Correspondingly, S202 may include the following step:

The server detects whether a hash value identical to the hash value of the file to be processed exists among the hash values of the locally stored files; if it exists, the server determines that the file to be processed is a duplicate file, and if it does not exist, the server determines that the file to be processed is a non-duplicate file.

In one implementation, after obtaining the hash value of the file to be processed, the server may query the hash values of all locally stored files and judge whether a hash value identical to the hash value of the file to be processed exists. If it exists, the server may determine that a file identical to the file to be processed has already been stored, that is, the file to be processed is a duplicate file. If it does not exist, the server may determine that no file identical to the file to be processed has been stored, that is, the file to be processed is a non-duplicate file.

It can be seen that, with the duplicate file detection method of the embodiment of the present invention, the terminal may also send the hash value of the file to be processed to the server while sending the file itself. The server does not need to wait until the whole file to be processed has been transmitted before obtaining its hash value, so the server can determine earlier whether the file to be processed is a duplicate file.

S203: the server sends a response result to the terminal.

The response result contains either information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

After determining whether the file to be processed is a duplicate file, the server may send the response result to the terminal to inform the terminal whether the file to be processed is a duplicate file.

In addition, to improve the efficiency of the duplicate file detection method, the transmission information may also include the size of the file to be processed. Correspondingly, S202 may include the following step:

The server determines, according to the size of the file to be processed, the target numerical interval to which the file to be processed belongs, and detects whether a hash value identical to the hash value of the file to be processed exists among the hash values of the stored files corresponding to the target numerical interval; if it exists, the server determines that the file to be processed is a duplicate file, and if it does not exist, the server determines that the file to be processed is a non-duplicate file.

In one implementation, after extracting the size and the hash value of the file to be processed, the server may determine the numerical interval to which the size of the file to be processed belongs (that is, the target numerical interval). For the numerical intervals, refer to the detailed description in the above embodiment.

The server may then query the hash values of the stored files corresponding to the target numerical interval and judge whether a hash value identical to the hash value of the file to be processed exists. If it exists, the server may determine that a file identical to the file to be processed has already been stored, that is, the file to be processed is a duplicate file. If it does not exist, the server may determine that no file identical to the file to be processed has been stored, that is, the file to be processed is a non-duplicate file.

Based on the above processing, the server only needs to query the hash values of the stored files corresponding to the target numerical interval, rather than the hash values of all locally stored files. This saves query time and improves the efficiency of the duplicate file detection method.

In addition, when the server determines that the file to be processed is a non-duplicate file, the server may also store the file to be processed and record the correspondence between the file to be processed and its hash value. Then, when a terminal uploads the same file again, the server can determine that the uploaded file is a duplicate file.
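The record-keeping described above can be illustrated with a minimal in-memory sketch. The class name `DedupStore`, the method `handle` and the notion of a `storage_key` identifying the stored file are assumptions; a real server would persist both the file and the correspondence.

```python
class DedupStore:
    """Keeps the correspondence between hash values and stored files."""

    def __init__(self) -> None:
        # Maps hash value -> hypothetical key identifying the stored file.
        self._by_hash: dict[str, str] = {}

    def handle(self, file_hash: str, storage_key: str) -> bool:
        """Return True for a duplicate; otherwise record the file and return False."""
        if file_hash in self._by_hash:
            return True
        self._by_hash[file_hash] = storage_key
        return False
```

The first upload of a file is recorded and reported as non-duplicate; any later upload with the same hash value is reported as a duplicate.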

Corresponding to the method embodiment of Fig. 1, and referring to Fig. 3, Fig. 3 is a structural diagram of a terminal provided by an embodiment of the present invention. The terminal may include a transceiver 301 and a processor 302;

The transceiver 301 is configured to obtain the file to be processed that the user needs to upload to the server, and to obtain the size of the file to be processed when the file to be processed is sent to the server;

The processor 302 is configured to detect the target numerical interval to which the size of the file to be processed belongs, wherein different numerical intervals correspond to different file hash value calculation methods, and to calculate the hash value of the file to be processed according to the file hash value calculation method corresponding to the target numerical interval;

The transceiver 301 is further configured to send, to the server, the transmission information containing the hash value of the file to be processed, and to receive the response result of the server for the transmission information, wherein the response result contains either information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

Optionally, the processor 302 is specifically configured to process the data contained in the file to be processed according to the file hash value calculation method corresponding to the target numerical interval to obtain a candidate hash value, to calculate the hash value of the data containing the candidate hash value and the size of the file to be processed, and to use the calculated hash value as the hash value of the file to be processed.

Optionally, the target numerical interval is (0, A);

The processor 302 is specifically configured to calculate the full hash value of the file to be processed and to use the full hash value as the candidate hash value.

Optionally, the target numerical interval is [A, B), where B > A;

The processor 302 is specifically configured to calculate the hash value of the data containing the preset head and the preset tail of the file to be processed, and to use the calculated hash value as the candidate hash value.

Optionally, the target numerical interval is [B, +∞);

The processor 302 is specifically configured to calculate the hash value of the data containing the preset head, the preset tail and the preset middle part of the file to be processed, and to use the calculated hash value as the candidate hash value.
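How the candidate hash value depends on the target numerical interval can be sketched as below. This is a simplified illustration under several assumptions: the file is held in memory as `bytes`, the boundaries `A` and `B` and the preset head, tail and middle sizes are tiny hypothetical values, a single middle block centered on the midpoint stands in for the preset middle part, and SHA-256 is used because no algorithm is named in the text.

```python
import hashlib

A = 64    # hypothetical boundary, in bytes (kept tiny for illustration)
B = 256   # hypothetical boundary, in bytes
HEAD = TAIL = MIDDLE = 16  # hypothetical preset head, tail and middle sizes


def candidate_hash(data: bytes) -> str:
    """Pick the hash calculation method from the interval containing len(data)."""
    size = len(data)
    if size < A:
        # (0, A): full hash over all data.
        sampled = data
    elif size < B:
        # [A, B): preset head and preset tail only.
        sampled = data[:HEAD] + data[-TAIL:]
    else:
        # [B, +inf): preset head, one block around the midpoint, preset tail.
        start = size // 2 - MIDDLE // 2
        sampled = data[:HEAD] + data[start:start + MIDDLE] + data[-TAIL:]
    return hashlib.sha256(sampled).hexdigest()
```

The trade-off is visible in the sketch: for files in [A, B) a change outside the head and tail does not alter the candidate hash value, which is why larger files additionally sample the middle part and why the size is later spliced into the final hash value.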

Corresponding to the method embodiment of Fig. 2, and referring to Fig. 4, Fig. 4 is a structural diagram of a server provided by an embodiment of the present invention. The server may include a transceiver 401 and a processor 402;

The transceiver 401 is configured to receive the transmission information, sent by the terminal, containing the hash value of the file to be processed, wherein the transmission information is sent by the terminal to the server when the terminal sends the file to be processed to the server;

The processor 402 is configured to determine, according to the transmission information, whether the file to be processed is a duplicate file;

The transceiver 401 is further configured to send a response result to the terminal, wherein the response result contains either information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

Optionally, the processor 402 is specifically configured to detect whether a hash value identical to the hash value of the file to be processed exists among the hash values of the locally stored files; if such a hash value exists among the hash values of the locally stored files, to determine that the file to be processed is a duplicate file; and if no such hash value exists among the hash values of the locally stored files, to determine that the file to be processed is a non-duplicate file.

The processor 402 is configured to determine, according to the size of the file to be processed, the target numerical interval to which the file to be processed belongs; to detect whether a hash value identical to the hash value of the file to be processed exists among the hash values of the stored files corresponding to the target numerical interval; if such a hash value exists among the hash values of the stored files corresponding to the target numerical interval, to determine that the file to be processed is a duplicate file; and if no such hash value exists among the hash values of the stored files corresponding to the target numerical interval, to determine that the file to be processed is a non-duplicate file.

Referring to Fig. 5, Fig. 5 is a structural diagram of a duplicate file detection system provided by an embodiment of the present invention. The system may include a terminal 501 and a server 502;

The terminal 501 is configured to obtain the file to be processed that the user needs to upload to the server 502; to obtain the size of the file to be processed when the file to be processed is sent to the server 502; to detect the target numerical interval to which the size of the file to be processed belongs, wherein different numerical intervals correspond to different file hash value calculation methods; to calculate the hash value of the file to be processed according to the file hash value calculation method corresponding to the target numerical interval; and to send, to the server 502, the transmission information containing the hash value of the file to be processed;

The server 502 is configured to receive the transmission information, sent by the terminal 501, containing the hash value of the file to be processed; to determine, according to the transmission information, whether the file to be processed is a duplicate file; and to send a response result to the terminal 501, wherein the response result contains either information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

The terminal 501 is further configured to receive the response result of the server 502 for the transmission information.

An embodiment of the present invention also provides a computer-readable storage medium in which instructions are stored; when the instructions are run on a computer, they cause the computer to execute the duplicate file detection method provided by the embodiments of the present invention.

Specifically, the above duplicate file detection method includes:

Obtaining the file to be processed that the user needs to upload to the server;

When sending the file to be processed to the server, obtaining the size of the file to be processed;

Detecting the target numerical interval to which the size of the file to be processed belongs, wherein different numerical intervals correspond to different file hash value calculation methods;

Calculating the hash value of the file to be processed according to the file hash value calculation method corresponding to the target numerical interval;

Sending, to the server, the transmission information containing the hash value of the file to be processed;

Receiving the response result of the server for the transmission information, wherein the response result contains either information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

It should be noted that the other implementations of the above duplicate file detection method are the same as those in the foregoing method embodiments and are not repeated here.

By running the instructions stored in the computer-readable storage medium provided by the embodiment of the present invention, the hash value of the file to be processed can also be sent to the server while the file itself is being sent. The server does not need to wait until the whole file to be processed has been transmitted before obtaining its hash value, so the server can determine earlier whether the file to be processed is a duplicate file.

Specifically, the above duplicate file detection method includes:

Receiving the transmission information, sent by the terminal, containing the hash value of the file to be processed, wherein the transmission information is sent by the terminal to the server when the terminal sends the file to be processed to the server;

Determining, according to the transmission information, whether the file to be processed is a duplicate file;

Sending a response result to the terminal, wherein the response result contains either information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

By running the instructions stored in the computer-readable storage medium provided by the embodiment of the present invention, the hash value of the file to be processed can be obtained without waiting until the whole file to be processed has been transmitted, so whether the file to be processed is a duplicate file can be determined earlier.

An embodiment of the present invention also provides a computer program product containing instructions; when it is run on a computer, it causes the computer to execute the duplicate file detection method provided by the embodiments of the present invention.

Specifically, the above duplicate file detection method includes:

Obtaining the file to be processed that the user needs to upload to the server;

By running the computer program product provided by the embodiment of the present invention, the hash value of the file to be processed can also be sent to the server while the file itself is being sent. The server does not need to wait until the whole file to be processed has been transmitted before obtaining its hash value, so the server can determine earlier whether the file to be processed is a duplicate file.

Specifically, the above duplicate file detection method includes:

By running the computer program product provided by the embodiment of the present invention, the hash value of the file to be processed can be obtained without waiting until the whole file to be processed has been transmitted, so whether the file to be processed is a duplicate file can be determined earlier.

The above embodiments may be implemented wholly or partly by software, hardware, firmware or any combination thereof. When implemented in software, they may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are generated wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired means (such as coaxial cable, optical fiber or digital subscriber line (DSL)) or wireless means (such as infrared, radio or microwave). The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device such as a server or data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk or magnetic tape), an optical medium (for example, a DVD) or a semiconductor medium (for example, a solid state disk (SSD)).

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to that process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.

The embodiments in this specification are described in a related manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the terminal, server, system, computer-readable storage medium and computer program product embodiments are substantially similar to the method embodiments, they are described relatively briefly; for the relevant parts, refer to the description of the method embodiments.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the protection scope of the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention falls within the protection scope of the present invention.

Claims

1. A duplicate file detection method, characterized in that the method comprises:

The terminal detects the target numerical interval to which the size of the file to be processed belongs, wherein different numerical intervals correspond to different file hash value calculation methods;

The terminal calculates the hash value of the file to be processed according to the file hash value calculation method corresponding to the target numerical interval;

The terminal receives the response result of the server for the transmission information, wherein the response result contains either information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

2. The method according to claim 1, characterized in that the terminal calculating the hash value of the file to be processed according to the file hash value calculation method corresponding to the target numerical interval comprises:

The terminal processes the data contained in the file to be processed according to the file hash value calculation method corresponding to the target numerical interval to obtain a candidate hash value;

Calculating the hash value of the data containing the candidate hash value and the size of the file to be processed, and using the calculated hash value as the hash value of the file to be processed.

3. The method according to claim 2, characterized in that the target numerical interval is (0, A); the terminal processing the data contained in the file to be processed according to the file hash value calculation method corresponding to the target numerical interval to obtain a candidate hash value comprises:

The terminal calculates the full hash value of the file to be processed and uses the full hash value as the candidate hash value.

4. The method according to claim 2 or 3, characterized in that the target numerical interval is [A, B), where B > A; the terminal processing the data contained in the file to be processed according to the file hash value calculation method corresponding to the target numerical interval to obtain a candidate hash value comprises:

The terminal calculates the hash value of the data containing the preset head and the preset tail of the file to be processed, and uses the calculated hash value as the candidate hash value.

5. The method according to claim 4, characterized in that the target numerical interval is [B, +∞); the terminal processing the data contained in the file to be processed according to the file hash value calculation method corresponding to the target numerical interval to obtain a candidate hash value comprises:

The terminal calculates the hash value of the data containing the preset head, the preset tail and the preset middle part of the file to be processed, and uses the calculated hash value as the candidate hash value.

6. The method according to claim 1, characterized in that the transmission information further includes the size of the file to be processed.

7. A duplicate file detection method, characterized in that the method comprises:

The server receives the transmission information, sent by the terminal, containing the hash value of the file to be processed, wherein the transmission information is sent by the terminal to the server when the terminal sends the file to be processed to the server;

The server sends a response result to the terminal, wherein the response result contains either information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

8. The method according to claim 7, characterized in that the server determining, according to the transmission information, whether the file to be processed is a duplicate file comprises:

The server detects whether a hash value identical to the hash value of the file to be processed exists among the hash values of the locally stored files;

If a hash value identical to the hash value of the file to be processed exists among the hash values of the locally stored files, determining that the file to be processed is a duplicate file;

If no hash value identical to the hash value of the file to be processed exists among the hash values of the locally stored files, determining that the file to be processed is a non-duplicate file.

9. The method according to claim 7, characterized in that the transmission information further includes the size of the file to be processed;

The server determines, according to the size of the file to be processed, the target numerical interval to which the file to be processed belongs;

The server detects whether a hash value identical to the hash value of the file to be processed exists among the hash values of the stored files corresponding to the target numerical interval;

If a hash value identical to the hash value of the file to be processed exists among the hash values of the stored files corresponding to the target numerical interval, determining that the file to be processed is a duplicate file;

If no hash value identical to the hash value of the file to be processed exists among the hash values of the stored files corresponding to the target numerical interval, determining that the file to be processed is a non-duplicate file.

10. a kind of terminal, which is characterized in that the terminal includes: transceiver and processor；

The transceiver needs to be uploaded to the file to be processed of server for obtaining user；Institute is being sent to the server When stating file to be processed, the size of the file to be processed is obtained；

The processor, for detecting target value section belonging to the size of the file to be processed, wherein different numerical value Section respectively corresponds different file Hash hash value calculations；According to the corresponding file hash value in the target value section Calculation calculates the hash value of the file to be processed；

The transceiver is also used to send the transmission information of the hash value comprising the file to be processed to the server；It connects The server is received for the response results for sending information, wherein the response results include that the file to be processed is The information of duplicate file or the file to be processed are the information of non-duplicate file.

11. terminal according to claim 10, which is characterized in that the processor is specifically used for according to the number of targets It is worth the corresponding file hash value calculation in section, the data for including to the file to be processed are handled, obtained to be selected Hash value；The hash value for calculating the data of the size comprising the hash value to be selected and the file to be processed, will be calculated Hash value of the hash value as the file to be processed.

12. terminal according to claim 11, which is characterized in that the target value section is (0, A)；

The processor, specifically for calculating the full dose hash value of the file to be processed, and using the full dose hash value as Hash value to be selected.

13. terminal according to claim 11 or 12, which is characterized in that the target value section be [A, B), wherein B >A；

The processor presets the hash of the data of head and default tail portion specifically for calculating comprising the file to be processed Value, and using the hash value being calculated as the hash value to be selected.

14. The terminal according to claim 13, characterized in that the target value interval is [B, +∞);

The processor is specifically configured to calculate a hash value of data comprising a preset head, a preset tail and a preset middle part of the file to be processed, and to take the calculated hash value as the candidate hash value.
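The three size-dependent calculation methods of claims 12 to 14 can be sketched roughly as below. This is a minimal illustration, not the claimed implementation: the thresholds `A` and `B`, the segment length `SEG`, the choice of SHA-256, the position of the middle segment and the order in which segments are fed to the hash are all assumptions, since the claims leave them open.

```python
import hashlib
import io

A = 1024      # hypothetical threshold A, in bytes
B = 4096      # hypothetical threshold B, in bytes (B > A)
SEG = 256     # hypothetical length of each preset segment, in bytes


def read_at(f, offset, length):
    """Read `length` bytes starting at `offset` from a seekable binary file."""
    f.seek(offset)
    return f.read(length)


def candidate_hash(f, size):
    """Return the candidate hash for a seekable file of `size` bytes."""
    h = hashlib.sha256()  # assumption: the claims do not name a hash function
    if size < A:
        # Interval (0, A): hash the whole file (full hash value).
        h.update(read_at(f, 0, size))
    elif size < B:
        # Interval [A, B): hash only the preset head and preset tail.
        h.update(read_at(f, 0, SEG))
        h.update(read_at(f, size - SEG, SEG))
    else:
        # Interval [B, +inf): hash the preset head, middle part and tail.
        h.update(read_at(f, 0, SEG))
        h.update(read_at(f, size // 2 - SEG // 2, SEG))
        h.update(read_at(f, size - SEG, SEG))
    return h.hexdigest()


# Usage: any seekable binary file object works, e.g. open(path, "rb").
data = bytes(5000)
digest = candidate_hash(io.BytesIO(data), len(data))
```

For files of size A or larger, only a fixed number of bytes is read regardless of file size, which is what allows the hash to be available before the upload finishes. The trade-off in this sketch is that bytes outside the sampled segments do not influence the result.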

15. The terminal according to claim 10, characterized in that the transmission information further includes the size of the file to be processed.

16. A server, characterized in that the server comprises a transceiver and a processor;

The transceiver is configured to receive transmission information, sent by a terminal, containing the hash value of a file to be processed, wherein the transmission information is sent to the server by the terminal when the terminal sends the file to be processed to the server;

The transceiver is further configured to send a response result to the terminal, wherein the response result includes information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

17. The server according to claim 16, characterized in that the processor is specifically configured to detect whether, among the hash values of the locally stored files, there is a hash value identical to the hash value of the file to be processed; if such an identical hash value exists among the hash values of the locally stored files, to determine that the file to be processed is a duplicate file; and if no such identical hash value exists among the hash values of the locally stored files, to determine that the file to be processed is a non-duplicate file.

18. The server according to claim 16, characterized in that the transmission information further includes the size of the file to be processed;

The processor is specifically configured to determine, according to the size of the file to be processed, the target value interval to which the file to be processed belongs; to detect whether, among the hash values of the stored files corresponding to the target value interval, there is a hash value identical to the hash value of the file to be processed; if such an identical hash value exists among the hash values of the stored files corresponding to the target value interval, to determine that the file to be processed is a duplicate file; and if no such identical hash value exists among the hash values of the stored files corresponding to the target value interval, to determine that the file to be processed is a non-duplicate file.
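One possible way to picture the server-side check of claim 18 is an index that groups stored hash values by size interval, so that a lookup only touches the group matching the reported file size. The sketch below assumes an in-memory index; the class name `HashIndex`, the helper `interval_index` and the threshold values are hypothetical and not taken from the patent.

```python
import bisect

BOUNDS = [1024, 4096]  # hypothetical thresholds A and B, in bytes


def interval_index(size):
    """Map a size to 0 for (0, A), 1 for [A, B), 2 for [B, +inf)."""
    return bisect.bisect_right(BOUNDS, size)


class HashIndex:
    """Hash values of stored files, grouped by size interval."""

    def __init__(self):
        self._buckets = {}

    def add(self, file_hash, size):
        # Record the hash of a stored file under its size interval.
        self._buckets.setdefault(interval_index(size), set()).add(file_hash)

    def is_duplicate(self, file_hash, size):
        # Only the bucket for the target value interval is searched.
        return file_hash in self._buckets.get(interval_index(size), set())
```

With a single bucket covering all sizes, the same lookup reduces to the check described in claim 17.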