CN110096483B

CN110096483B - Duplicate file detection method, terminal and server

Info

Publication number: CN110096483B
Application number: CN201910380465.7A
Authority: CN
Inventors: 李春平; 杨鹏飞
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2019-05-08
Filing date: 2019-05-08
Publication date: 2021-04-30
Anticipated expiration: 2039-05-08
Also published as: CN110096483A

Abstract

The embodiment of the invention provides a duplicate file detection method, a terminal and a server, wherein the method comprises the following steps: when a file to be processed, which is required to be uploaded to a server by a user, is sent to the server, a terminal acquires the size of the file to be processed, detects a target value interval to which the size of the file to be processed belongs, calculates a hash value of the file to be processed according to a file hash value calculation mode corresponding to the target value interval, sends sending information containing the hash value of the file to be processed to the server, the server determines whether the file to be processed is a duplicate file according to the sending information, and sends a response result to the terminal, wherein the response result contains information that the file to be processed is the duplicate file or information that the file to be processed is a non-duplicate file. Based on the processing, the server can obtain the hash value of the file to be processed without waiting for the completion of the transmission of the file to be processed, and further, the server can determine whether the file to be processed is a duplicate file earlier.

Description

Duplicate file detection method, terminal and server

Technical Field

The invention relates to the technical field of computer networks, in particular to a duplicate file detection method, a terminal and a server.

Background

With the rapid development of computer network technology, users can not only watch favorite videos online through a video terminal conveniently, but also upload videos shot by themselves or obtained through other ways to a video server, so that the videos uploaded by themselves can be shared by other users for watching. As more and more users upload files, such as videos, to the server, the files are inevitably duplicated. In order to avoid storing duplicate files, the server needs to check the files uploaded by the user one by one to determine whether the files are duplicate files.

Therefore, in order to avoid storing duplicate files, in the prior art, after a file is uploaded, whether the uploaded file is a duplicate file is determined by calculating a hash value of the uploaded file and comparing the hash value with a hash value of the stored file.

However, the inventor finds that the prior art has at least the following problems in the process of implementing the invention: in the prior art, the process of judging whether the uploaded file is a duplicate file by calculating the hash value of the uploaded file cannot detect whether the file uploaded by a user is the duplicate file in time.

Disclosure of Invention

The embodiment of the invention aims to provide a duplicate file detection method, a terminal and a server, which can detect whether a file uploaded by a user is a duplicate file in time. The specific technical scheme is as follows:

In a first aspect, to achieve the above object, an embodiment of the present invention discloses a duplicate file detection method, where the method includes:

a terminal acquires a file to be processed, which is required to be uploaded to a server by a user;

the terminal acquires the size of the file to be processed when sending the file to be processed to the server;

the terminal detects a target numerical interval to which the size of the file to be processed belongs, wherein different numerical intervals respectively correspond to different file hash value calculation modes;

the terminal calculates the hash value of the file to be processed according to a file hash value calculation mode corresponding to the target numerical value interval;

the terminal sends sending information containing the hash value of the file to be processed to the server;

and the terminal receives a response result of the server for the sending information, wherein the response result comprises information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

Optionally, the calculating, by the terminal, the hash value of the to-be-processed file according to the file hash value calculation manner corresponding to the target value interval includes:

the terminal processes data contained in the file to be processed according to a file hash value calculation mode corresponding to the target numerical value interval to obtain a hash value to be selected;

and calculating a hash value of the data containing the hash value to be selected and the size of the file to be processed, and taking the calculated hash value as the hash value of the file to be processed.

Optionally, the target value interval is (0, A); the step in which the terminal processes the data contained in the file to be processed according to the file hash value calculation mode corresponding to the target value interval to obtain a hash value to be selected comprises:

and the terminal calculates the full hash value of the file to be processed and takes the full hash value as the hash value to be selected.

Optionally, the target value interval is [A, B), where B > A; the step in which the terminal processes the data contained in the file to be processed according to the file hash value calculation mode corresponding to the target value interval to obtain a hash value to be selected comprises:

and the terminal calculates the hash value of the data comprising the preset head and the preset tail of the file to be processed, and takes the calculated hash value as the hash value to be selected.

Optionally, the target value interval is [B, +∞); the step in which the terminal processes the data contained in the file to be processed according to the file hash value calculation mode corresponding to the target value interval to obtain a hash value to be selected comprises:

and the terminal calculates the hash value of the data comprising the preset head, the preset tail and the preset middle of the file to be processed, and takes the calculated hash value as the hash value to be selected.

Optionally, the sending information further includes a size of the file to be processed.

In a second aspect, in order to achieve the above object, an embodiment of the present invention discloses a duplicate file detection method, where the method includes:

a server receives sending information that is sent by a terminal and contains a hash value of a file to be processed, wherein the sending information is sent to the server when the terminal sends the file to be processed to the server;

the server determines whether the file to be processed is a duplicate file according to the sending information;

and the server sends a response result to the terminal, wherein the response result comprises information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

Optionally, the determining, by the server, whether the file to be processed is a duplicate file according to the sending information includes:

the server detects whether a hash value identical to that of the file to be processed exists in the hash values of the local storage files;

if a hash value identical to that of the file to be processed exists among the hash values of the local storage files, determining that the file to be processed is a duplicate file;

and if no hash value identical to that of the file to be processed exists among the hash values of the local storage files, determining that the file to be processed is a non-duplicate file.

Optionally, the sending information further includes a size of the file to be processed;

the server determines whether the file to be processed is a duplicate file according to the sending information, and the determining includes:

the server determines a target numerical value interval to which the file to be processed belongs according to the size of the file to be processed;

the server detects whether a hash value identical to that of the file to be processed exists in the hash values of the storage files corresponding to the target value interval;

if a hash value identical to that of the file to be processed exists among the hash values of the storage files corresponding to the target value interval, determining that the file to be processed is a duplicate file;

and if no hash value identical to that of the file to be processed exists among the hash values of the storage files corresponding to the target value interval, determining that the file to be processed is a non-duplicate file.

In a third aspect, in order to achieve the above object, an embodiment of the present invention discloses a terminal, where the terminal includes: a transceiver and a processor;

the transceiver is used for acquiring a file to be processed, which is required to be uploaded to a server by a user; when the file to be processed is sent to the server, the size of the file to be processed is obtained;

the processor is used for detecting a target numerical interval to which the size of the file to be processed belongs, wherein different numerical intervals respectively correspond to different file hash value calculation modes; calculating the hash value of the file to be processed according to a file hash value calculation mode corresponding to the target numerical value interval;

the transceiver is further configured to send, to the server, sending information including a hash value of the file to be processed; and receiving a response result of the server for the sent information, wherein the response result comprises information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

Optionally, the processor is specifically configured to process data included in the file to be processed according to a file hash value calculation manner corresponding to the target value interval, so as to obtain a hash value to be selected; and calculating a hash value of the data containing the hash value to be selected and the size of the file to be processed, and taking the calculated hash value as the hash value of the file to be processed.

Optionally, the target value interval is (0, A);

the processor is specifically configured to calculate a full hash value of the file to be processed, and use the full hash value as a hash value to be selected.

Optionally, the target value interval is [A, B), where B > A;

the processor is specifically configured to calculate a hash value of data including a preset head and a preset tail of the file to be processed, and use the calculated hash value as the hash value to be selected.

Optionally, the target value interval is [B, +∞);

the processor is specifically configured to calculate a hash value of data including a preset head, a preset tail and a preset middle of the file to be processed, and use the calculated hash value as the hash value to be selected.

In a fourth aspect, to achieve the above object, an embodiment of the present invention discloses a server, where the server includes: a transceiver and a processor;

the transceiver is configured to receive sending information including a hash value of a to-be-processed file, where the sending information is sent to the server by the terminal when the to-be-processed file is sent to the server;

the processor is used for determining whether the file to be processed is a duplicate file according to the sending information;

the transceiver is further configured to send a response result to the terminal, where the response result includes information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

Optionally, the processor is specifically configured to detect whether a hash value identical to that of the file to be processed exists among the hash values of the local storage files; if such a hash value exists, determine that the file to be processed is a duplicate file; and if no such hash value exists, determine that the file to be processed is a non-duplicate file.

Optionally, the sending information further includes a size of the file to be processed;

the processor is specifically configured to determine a target value interval to which the file to be processed belongs according to the size of the file to be processed; detect whether a hash value identical to that of the file to be processed exists among the hash values of the storage files corresponding to the target value interval; if such a hash value exists, determine that the file to be processed is a duplicate file; and if no such hash value exists, determine that the file to be processed is a non-duplicate file.

In another aspect of the present invention, there is also provided a duplicate file detection system, including a terminal and a server;

the terminal is used for acquiring a file to be processed, which is required to be uploaded to the server by a user; when the file to be processed is sent to the server, the size of the file to be processed is obtained; detecting a target value interval to which the size of the file to be processed belongs, wherein different value intervals respectively correspond to different file hash value calculation modes; calculating the hash value of the file to be processed according to a file hash value calculation mode corresponding to the target numerical value interval; sending information containing the hash value of the file to be processed to the server;

the server is used for receiving sending information which is sent by the terminal and contains a hash value of the file to be processed; determining whether the file to be processed is a repeated file or not according to the sending information; and sending a response result to the terminal, wherein the response result comprises information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

The terminal is further used for receiving a response result of the server for the sending information.

In yet another aspect of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to execute any of the duplicate file detection methods described in the above first aspect.

In yet another aspect of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which when run on a computer, cause the computer to execute any of the duplicate file detection methods described in the second aspect above.

In another aspect of the present invention, there is also provided a computer program product including instructions, which when run on a computer, causes the computer to perform any of the duplicate file detection methods described in the first aspect above.

In another aspect of the present invention, there is also provided a computer program product including instructions, which when run on a computer, causes the computer to execute any of the duplicate file detection methods described in the second aspect above.

The embodiment of the invention provides a duplicate file detection method. When a file to be processed, which a user needs to upload to a server, is sent to the server, the terminal obtains the size of the file to be processed, detects the target value interval to which that size belongs, calculates the hash value of the file to be processed according to the file hash value calculation mode corresponding to the target value interval, and sends sending information containing the hash value of the file to be processed to the server. The server determines whether the file to be processed is a duplicate file according to the sending information and sends a response result to the terminal, wherein the response result contains information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file. Based on this processing, the server can obtain the hash value of the file to be processed without waiting for the transmission of the file to be processed to complete, and can therefore determine earlier whether the file to be processed is a duplicate file.

Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.

Fig. 1 is a flowchart of a duplicate file detection method according to an embodiment of the present invention;

Fig. 2 is a flowchart of a duplicate file detection method according to an embodiment of the present invention;

Fig. 3 is a structural diagram of a terminal according to an embodiment of the present invention;

Fig. 4 is a structural diagram of a server according to an embodiment of the present invention;

Fig. 5 is a structural diagram of a duplicate file detection system according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.

In the prior art, the process of judging whether the uploaded file is a duplicate file by calculating the hash value of the uploaded file cannot detect whether the file uploaded by a user is the duplicate file in time.

In order to solve the above problems, the present invention provides a duplicate file detection method, which can be applied to a terminal and a server, respectively, where the terminal and the server are in network communication, and the terminal can be a browser or other terminals.

The terminal can acquire the file to be processed which is required to be uploaded to the server by the user, and sends the file to be processed to the server. When the terminal sends the file to be processed to the server, the terminal can also obtain the size of the file to be processed and detect a target numerical value interval to which the size of the file to be processed belongs, then the terminal can calculate the hash value of the file to be processed according to a file hash value calculation mode corresponding to the target numerical value interval and send sending information containing the hash value of the file to be processed to the server.

The server may receive transmission information including a hash value of the file to be processed, which is transmitted by the terminal, determine whether the file to be processed is a duplicate file according to the transmission information, and then transmit a response result to the terminal, where the response result includes information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

Based on the processing, when the to-be-processed file is sent to the server, the terminal can also send the hash value of the to-be-processed file to the server, and then the server can determine whether the to-be-processed file is a duplicate file or not earlier.
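The server-side decision described above can be sketched as follows. This is only a minimal illustration in Python, not the implementation of the invention: the function name `handle_sending_info`, the dictionary keys `hash` and `duplicate`, and the use of an in-memory set for the stored hash values are all assumptions made for the example.

```python
def handle_sending_info(info: dict, stored_hashes: set) -> dict:
    """Build a response result for the sending information from a terminal.

    `info` is assumed to carry the terminal-computed hash under the
    hypothetical key "hash"; `stored_hashes` stands in for the hash values
    of the files already stored on the server.
    """
    is_duplicate = info["hash"] in stored_hashes
    return {"duplicate": is_duplicate}


stored = {"3f2a", "9bc1"}  # placeholder hash values, not real digests
print(handle_sending_info({"hash": "9bc1"}, stored))
print(handle_sending_info({"hash": "0000"}, stored))
```

Because the lookup needs only the hash value, the server can answer before the file body has finished uploading.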

The present invention will be described in detail with reference to specific examples.

Referring to fig. 1, fig. 1 is a flowchart of a duplicate file detection method provided in an embodiment of the present invention, where the method may be applied to a terminal, and the method may include the following steps:

S101: the terminal acquires a file to be processed, which is required to be uploaded to the server by a user.

The file to be processed may be a network resource in any format, for example, the file to be processed may be a video file, an audio file, an installation package of an application program, or the like. The number of the files to be processed can be one or more. If the number of the files to be processed is multiple, the terminal can process each file to be processed in turn according to the duplicate file detection method of the invention.

The terminal can acquire the file (i.e. the file to be processed) which the user needs to upload to the server, so as to upload the file to be processed.

In one implementation manner, if the terminal is a browser, an "upload" button may be set in a display interface of the terminal, and when a user clicks the "upload" button, the terminal may display a list of files to be uploaded, where the files in the list are local files of the terminal, the user may select a file to be processed from the local files of the terminal, and accordingly, the terminal may obtain the file to be processed.

S102: when sending the file to be processed to the server, the terminal acquires the size of the file to be processed.

The size of the file to be processed is the size of the storage space occupied by the file to be processed; for example, the size of the file to be processed may be 556 MB or 1000 MB.

When the terminal sends the file to be processed to the server, the terminal can also obtain the size of the file to be processed so as to perform corresponding processing according to different numerical values of the size of the file to be processed.

S103: the terminal detects a target value interval to which the size of the file to be processed belongs.

Wherein, different value intervals respectively correspond to different file hash value calculation modes.

The value intervals may be set by a technician based on experience. For example, file sizes greater than 0 and less than a first threshold may form one value interval; file sizes greater than or equal to the first threshold and less than a second threshold may form another value interval, where the second threshold is greater than the first threshold; and file sizes greater than or equal to a third threshold may form a further value interval, where the third threshold is greater than the second threshold. The first threshold, the second threshold and the third threshold are all positive numbers.

After the terminal determines the size of the file to be processed, the terminal can determine a value interval (namely a target value interval) to which the size of the file to be processed belongs, and further, the file to be processed can be processed according to a file hash value calculation mode corresponding to the target value interval.
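As a rough sketch of S103, the interval lookup can be written as a chain of comparisons. The threshold values (40 MiB and 128 MiB), the interpretation of "M" as 2^20 bytes, the function name `target_interval`, and the returned mode labels are all illustrative assumptions, not values fixed by the method.

```python
MB = 1024 * 1024

# Assumed example thresholds; in practice these are chosen empirically.
A = 40 * MB
B = 128 * MB


def target_interval(size: int) -> str:
    """Map a file size in bytes to a hypothetical hash-mode label.

    (0, A)   -> "full"              hash all data
    [A, B)   -> "head_tail"         hash a head block and a tail block
    [B, +oo) -> "head_tail_middle"  hash head, tail and middle blocks
    """
    if size <= 0:
        raise ValueError("file size must be positive")
    if size < A:
        return "full"
    if size < B:
        return "head_tail"
    return "head_tail_middle"


print(target_interval(556 * MB))
```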

S104: the terminal calculates the hash value of the file to be processed according to the file hash value calculation mode corresponding to the target value interval.

The terminal may calculate the hash value of the file to be processed according to a preset algorithm, where the preset algorithm may be SHA-1 (Secure Hash Algorithm 1) or another algorithm.

In an implementation manner, the terminal may process data included in the file to be processed according to a file hash value calculation manner corresponding to the target value interval, and use a processing result as a hash value of the file to be processed.

In another mode, in order to enable the calculated hash value of the file to be processed to more effectively reflect the uniqueness of the file to be processed, the method for the terminal to calculate the hash value of the file to be processed may include the following steps:

Step one: processing the data contained in the file to be processed according to the file hash value calculation mode corresponding to the target value interval to obtain a hash value to be selected.

According to the size of the file to be processed, the method for the terminal to calculate the hash value to be selected can comprise the following conditions:

Case one: when the target value interval is (0, A), the terminal calculates the full hash value of the file to be processed and takes the full hash value as the hash value to be selected.

The value of A may be set empirically by a technician; for example, A may be 40M.

In one implementation, in a case that the terminal determines that the size of the file to be processed belongs to (0, 40M), since the file to be processed is small, the terminal may perform hash operation on all data included in the file to be processed, that is, the terminal may calculate a full hash value of the file to be processed, and use the full hash value as the hash value to be selected.

Case two: when the target value interval is [A, B), where B > A, the terminal calculates the hash value of data comprising the preset head and the preset tail of the file to be processed, and takes the calculated hash value as the hash value to be selected.

The value of B, the size of the preset head and the preset tail may be set by a technician according to experience, for example, B may be 128M, and the size of the preset head and the size of the preset tail may be 20M.

In one implementation, in the case that the terminal determines that the size of the file to be processed belongs to [40M, 128M), if the terminal calculates the full hash value of the file to be processed, more calculation resources are consumed, and more calculation time is wasted.

Therefore, the terminal can sample the file to be processed, that is, the terminal can acquire the data of the preset head and the data of the preset tail of the file to be processed, then, the terminal can splice the data of the preset head and the preset tail, perform hash operation on the spliced data, and take the operation result as a hash value to be selected.
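A minimal sketch of this head-and-tail sampling is shown below. It assumes SHA-1 as the preset algorithm, a 20 MiB block size by default, and concatenation in head-then-tail order; the function name `head_tail_candidate_hash` and its parameters are hypothetical.

```python
import hashlib
import io

MB = 1024 * 1024


def head_tail_candidate_hash(f, size: int, part: int = 20 * MB) -> str:
    """Hash the first and last `part` bytes of a seekable binary file.

    `f` is any seekable binary file object and `size` its length in bytes.
    The head-then-tail concatenation order is an assumption.
    """
    f.seek(0)
    head = f.read(part)
    f.seek(max(0, size - part))
    tail = f.read(part)
    return hashlib.sha1(head + tail).hexdigest()


# Tiny in-memory stand-in for a file, with a 4-byte block size.
data = b"0123456789"
print(head_tail_candidate_hash(io.BytesIO(data), len(data), part=4))
```

For sizes in [40M, 128M) with 20M blocks, the head and tail blocks never overlap, so each byte is read at most once.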

Case three: when the target value interval is [B, +∞), the terminal calculates the hash value of data comprising the preset head, the preset tail and the preset middle of the file to be processed, and takes the calculated hash value as the hash value to be selected.

In one implementation, in the case that the terminal determines that the size of the file to be processed belongs to [128M, + ∞), since the file to be processed is large, if the terminal calculates the full hash value of the file to be processed, more computing resources are consumed and more computing time is wasted.

In addition, if the terminal processes only the data of the preset head and the preset tail of such a large file, the resulting hash value to be selected covers a small fraction of the file and therefore reflects the uniqueness of the file less reliably.

Therefore, the terminal can acquire data of a preset head, data of a preset tail and data of a preset middle of the file to be processed, then the terminal can splice the data of the preset head, the preset tail and the data of the preset middle, hash operation is carried out on the spliced data, and the operation result is used as a hash value to be selected.
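The head, tail and middle sampling can be sketched with a small helper that hashes blocks at given offsets. The text does not fix where the preset middle lies or the order in which the blocks are spliced; the sketch below assumes one middle block centered in the file and the order head, tail, middle. The names `hash_sampled_blocks` and `head_tail_middle_offsets` are hypothetical, and SHA-1 is assumed as the preset algorithm.

```python
import hashlib
import io


def hash_sampled_blocks(f, offsets, block: int) -> str:
    """SHA-1 over the blocks of `block` bytes starting at each offset.

    Feeding blocks to update() one by one is equivalent to hashing their
    concatenation, so the spliced data never has to be held in memory.
    """
    h = hashlib.sha1()
    for off in offsets:
        f.seek(off)
        h.update(f.read(block))
    return h.hexdigest()


def head_tail_middle_offsets(size: int, block: int) -> list:
    """Start offsets for head, tail and a centered middle block (assumed)."""
    return [0, size - block, (size - block) // 2]


data = bytes(range(100))  # in-memory stand-in for a large file
offsets = head_tail_middle_offsets(len(data), block=10)
print(offsets, hash_sampled_blocks(io.BytesIO(data), offsets, block=10))
```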

As can be seen, in the first to third cases, for the files to be processed with different sizes, the terminal may perform different processing on the files to be processed to obtain different sample data blocks, and then obtain the hash value of the files to be processed according to the sample data blocks.

In an implementation manner, a corresponding relationship between the file size and the sampling number may be stored in the terminal, and the terminal may determine a target sampling number corresponding to the size of the file to be processed according to the corresponding relationship, and then may obtain a target sampling number of data blocks from the file to be processed, and perform hash operation on the target sampling number of data blocks to obtain a hash value to be selected.

The correspondence between the file size and the number of samples can be referred to table (1).

Table (1)

File size (D)	Number of samples (S)
D < 40M	1
40M ≤ D < 128M	2
128M ≤ D < 512M	3
512M ≤ D < 1G	4
1G ≤ D < 4G	5
4G ≤ D	6

As can be seen from table (1), when the to-be-processed file is smaller than 40M, the sampling number is 1, at this time, the terminal may not sample the to-be-processed file, the sampling data block is the to-be-processed file itself, and the terminal may directly perform hash operation on all data included in the to-be-processed file, and use the operation result as the to-be-selected hash value.

When the file to be processed is greater than or equal to 40M and less than 128M, the sampling number is 2, that is, the terminal may acquire 2 data blocks with a preset size from the data included in the file to be processed as the sampling data blocks.

When the file to be processed is greater than or equal to 128M and less than 512M, the sampling number is 3, that is, the terminal may obtain 3 data blocks with preset sizes from the data contained in the file to be processed as the sampling data blocks.

When the file to be processed is greater than or equal to 512M and less than 1G, the sampling number is 4, that is, the terminal may obtain 4 data blocks with a preset size from the data included in the file to be processed as the sampling data blocks.

When the file to be processed is greater than or equal to 1G and less than 4G, the sampling number is 5, that is, the terminal may obtain 5 data blocks with a preset size from the data included in the file to be processed as the sampling data blocks.

When the file to be processed is greater than or equal to 4G, the sampling number is 6, that is, the terminal may obtain 6 data blocks with a preset size from data included in the file to be processed as sampling data blocks.
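The correspondence in Table (1) can be expressed as a sorted list of lower bounds and a binary search. This is a sketch only: it assumes "M" and "G" denote 2^20 and 2^30 bytes, and the names `BOUNDS` and `sample_count` are hypothetical.

```python
from bisect import bisect_right

MB = 1024 ** 2
GB = 1024 ** 3

# Lower bounds of the intervals with 2, 3, 4, 5 and 6 samples in Table (1).
BOUNDS = [40 * MB, 128 * MB, 512 * MB, 1 * GB, 4 * GB]


def sample_count(size: int) -> int:
    """Return the number of samples S for a file of `size` bytes."""
    # bisect_right counts how many bounds are <= size, so a size equal to
    # a bound falls into the higher interval, matching the "≤" in the table.
    return bisect_right(BOUNDS, size) + 1


print(sample_count(556 * MB))
```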

The preset size may be 20M, and when the number of the sample data blocks is greater than or equal to 2, the sample data blocks may include data of a preset head and a preset tail of the file to be processed.

For two files with different formats, the difference between the head data of the two files is larger, and the difference between the tail data of the two files is larger, so that when the terminal samples the file to be processed, if the target sampling number is more than or equal to 2, the sampled data block acquired by the terminal can include the data of the preset head and the preset tail of the file to be processed, and the sampled data block can more accurately represent the uniqueness of the file to be processed.

In addition, the terminal may select the other sampling data blocks according to a preset rule from the data that remains after excluding the preset head and the preset tail of the file to be processed. These blocks lie in the preset middle and may be referred to as middle sampling data blocks.

In one implementation, if the target sampling number is an odd number and the target sampling number is greater than 2, the terminal may obtain a data block (which may be referred to as a midpoint sampling data block) with a preset size at a midpoint of data included in the file to be processed, and the remaining middle sampling data blocks are uniformly distributed on two sides of the midpoint of the data included in the file to be processed according to a preset interval.

If the target sampling number is an even number and is more than 2, sampling is not performed at the midpoint of the data contained in the file to be processed, and the middle sampling data blocks are uniformly distributed on two sides of the midpoint of the data contained in the file to be processed according to the preset interval.

For example, if the file to be processed is greater than or equal to 128M and less than 512M, and the number of middle sample data blocks is 1, a midpoint sample data block is obtained at a midpoint of data included in the file to be processed, and the midpoint sample data block is used as the middle sample data block.

If the file to be processed is greater than or equal to 512M and smaller than 1G, the preset interval may be 128M, the number of middle sample data blocks is 2, and the distances between the two middle sample data blocks and the data midpoint included in the file to be processed are both 128M.

If the file to be processed is greater than or equal to 1G and smaller than 4G, the preset interval can be 256M, the number of the middle sampling data blocks is 3, one middle sampling data block is a midpoint sampling data block at a data midpoint contained in the file to be processed, the other two middle sampling data blocks are arranged on two sides of the data midpoint contained in the file to be processed, and the distance between each middle sampling data block and the data midpoint contained in the file to be processed is 256M.

If the file to be processed is greater than or equal to 4G, the preset interval can be 512M, the number of middle sampling data blocks is 4, the middle sampling data blocks are respectively located on two sides of the midpoint of the data contained in the file to be processed, and the distances between the middle sampling data blocks and the midpoint of the data contained in the file to be processed are 512M and 1024M respectively.
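The block layout described in the four examples above can be sketched as follows. This is an illustrative Python sketch rather than the embodiment's implementation. It assumes that "M" and "G" are binary units, that each distance is measured from the file midpoint to the start of a middle block, and that the head block starts at offset 0 while the tail block ends at the end of the file. The names `middle_distances` and `sample_offsets` are hypothetical.

```python
M = 1024 * 1024   # assumption: "M" means 2**20 bytes
G = 1024 * M      # assumption: "G" means 2**30 bytes
BLOCK = 20 * M    # preset size of one sampling data block


def middle_distances(size):
    """Signed distances (in bytes) of the middle blocks from the midpoint."""
    if size >= 4 * G:
        return [-1024 * M, -512 * M, 512 * M, 1024 * M]
    if size >= 1 * G:
        return [-256 * M, 0, 256 * M]
    if size >= 512 * M:
        return [-128 * M, 128 * M]
    if size >= 128 * M:
        return [0]  # only the midpoint sampling data block
    raise ValueError("files below 128M are outside this sketch")


def sample_offsets(size):
    """Start offsets of all sampling data blocks: head, middle blocks, tail."""
    mid = size // 2
    head = 0
    tail = size - BLOCK
    return [head] + [mid + d for d in middle_distances(size)] + [tail]
```

Under these assumptions, a 512M file is sampled at offsets 0, 128M, 384M and 492M, which gives the four blocks of the 512M to 1G range.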

Step two: calculate a hash value of data comprising the hash value to be selected and the size of the file to be processed, and take the calculated hash value as the hash value of the file to be processed.

After the hash value to be selected is obtained, the terminal can splice the hash value to be selected and the size of the file to be processed, then the terminal can perform hash operation on the spliced data, and the operation result is used as the hash value of the file to be processed.

Therefore, the hash value of the file to be processed obtained by the method of the embodiment can not only reflect the data contained in the file to be processed, but also reflect the size of the file to be processed, and can effectively reflect the uniqueness of the file to be processed.
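Step two can be illustrated with a short Python sketch. The text does not name a hash algorithm or a splicing format here, so the use of SHA-256, the ":" separator, and the decimal encoding of the size are all assumptions made for illustration; `final_hash` is a hypothetical name.

```python
import hashlib


def final_hash(candidate_hash_hex, file_size):
    """Splice the hash value to be selected with the file size, then hash again.

    Assumptions: SHA-256 as the hash function, and "<hash>:<size>" as the
    spliced form. Neither is specified by the text.
    """
    spliced = f"{candidate_hash_hex}:{file_size}".encode("ascii")
    return hashlib.sha256(spliced).hexdigest()
```

Because the size is part of the hashed input, two files whose sampled data happen to produce the same hash value to be selected still receive different final hash values if their sizes differ.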

S105: The terminal sends, to the server, sending information containing the hash value of the file to be processed.

After the terminal obtains the hash value of the file to be processed, the terminal can send sending information containing the hash value of the file to be processed to the server.

Correspondingly, after receiving the sending information, the server can determine whether the file to be processed is a duplicate file according to the sending information, and return to the terminal a response result for the sending information, wherein the response result includes information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file. The processing steps of the server are described in detail in the following embodiments.

Further, the terminal can acquire a response result transmitted by the server.

In one implementation, if the terminal is a browser, a host process in the terminal may be used to upload the file to be processed: the host process slices the file to be processed through a File object and XMLHttpRequest (Extensible Markup Language Hypertext Transfer Protocol Request), and sends the slices of the file to be processed to the server by asynchronous uploading.

Meanwhile, the terminal can also send the sending information containing the hash value of the file to be processed to the server through an independent Web Worker thread.
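The browser mechanism itself relies on XMLHttpRequest and Web Workers. The sketch below is only a language-neutral analogue using Python threads, showing the idea that the hash is sent on a separate thread while the slices are still being uploaded. The callables `send_slice` and `send_info`, the function name, and the dictionary key `"hash"` are hypothetical stand-ins for the real transport.

```python
import threading


def upload_with_early_hash(slices, file_hash, send_slice, send_info):
    """Upload slices on the calling thread while a helper thread sends the hash.

    `send_slice(index, chunk)` and `send_info(info)` are hypothetical
    callables that stand in for the network requests.
    """
    info_thread = threading.Thread(target=send_info, args=({"hash": file_hash},))
    info_thread.start()                      # hash goes out independently
    for index, chunk in enumerate(slices):   # sliced upload proceeds meanwhile
        send_slice(index, chunk)
    info_thread.join()
```

In this arrangement the server can receive the hash before the last slice arrives, which is the property the embodiment relies on.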

Therefore, based on the duplicate file detection method provided by the embodiment of the invention, when the file to be processed is sent to the server, the terminal can also send the hash value of the file to be processed to the server, and the server can obtain the hash value of the file to be processed without waiting for the transmission of the entire file to be processed to complete, so that the server can determine earlier whether the file to be processed is a duplicate file. In addition, in the duplicate file detection method of the embodiment of the invention, the hash value of the file to be processed is calculated by the terminal, which reduces the computational load on the server.

Optionally, the sending information may further include the size of the file to be processed, that is, when the terminal sends the hash value of the file to be processed to the server, the terminal may also send the size of the file to be processed to the server.

Correspondingly, the server can determine whether the file to be processed is a duplicate file or not by combining the hash value of the file to be processed and the size of the file to be processed, so that the efficiency of the duplicate file detection method can be improved.
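As an illustration of the sending information with and without the optional size, the following Python sketch builds a small message. The field names `file_hash` and `file_size`, the function name, and the JSON encoding are assumptions made for this example; the text does not specify a message format.

```python
import json


def build_sending_info(file_hash, file_size=None):
    """Build the sending information as a JSON string.

    The size is included only when it is provided, mirroring the optional
    nature of the size in the text. Field names are hypothetical.
    """
    info = {"file_hash": file_hash}
    if file_size is not None:
        info["file_size"] = file_size
    return json.dumps(info, sort_keys=True)
```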

Referring to fig. 2, fig. 2 is a flowchart of a duplicate file detection method according to an embodiment of the present invention, where the method may be applied to a server, and the method may include the following steps:

S201: The server receives the sending information that is sent by the terminal and contains the hash value of the file to be processed.

The sending information may be sent to the server by the terminal when the terminal sends the file to be processed to the server. The file to be processed may be a network resource in any format, for example, the file to be processed may be a video file, an audio file, an installation package of an application program, or the like.

The terminal can acquire a file (i.e., a file to be processed) that the user needs to upload to the server, and then the terminal can send the file to be processed to the server.

When the file to be processed is sent to the server, the terminal can also obtain the size of the file to be processed, detect a target numerical value interval to which the size of the file to be processed belongs, calculate the hash value of the file to be processed according to a file hash value calculation mode corresponding to the target numerical value interval, and send sending information containing the hash value of the file to be processed to the server. The processing method of the terminal can refer to the detailed description of the above embodiments.

Correspondingly, the server may receive the sending information containing the hash value of the file to be processed.

S202: The server determines, according to the sending information, whether the file to be processed is a duplicate file.

After the server obtains the sending information, the server may extract a hash value of the file to be processed, and accordingly, S202 may include the following steps:

the server detects whether a hash value identical to that of the file to be processed exists among the hash values of the locally stored files; if so, the file to be processed is determined to be a duplicate file, and if not, the file to be processed is determined to be a non-duplicate file.

In one implementation, after the server obtains the hash value of the file to be processed, the server may query the hash values of all files stored locally, determine whether a hash value identical to the hash value of the file to be processed exists, if so, the server may determine that the file identical to the file to be processed is stored, that is, the file to be processed is a duplicate file, and if not, the server may determine that the file identical to the file to be processed is not stored, that is, the file to be processed is a non-duplicate file.
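The lookup just described amounts to a membership test. A minimal Python sketch, assuming the server keeps the hash values of its stored files in an in-memory set (a real deployment would more likely use a database index; `is_duplicate` and `stored_hashes` are hypothetical names):

```python
def is_duplicate(file_hash, stored_hashes):
    """Return True if `file_hash` matches the hash of any stored file."""
    return file_hash in stored_hashes
```

With a hash set, the check takes roughly constant time on average regardless of how many files are stored.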

Therefore, based on the duplicate file detection method provided by the embodiment of the invention, when the file to be processed is sent to the server, the terminal can also send the hash value of the file to be processed to the server, and the server can obtain the hash value of the file to be processed without waiting for the transmission of the entire file to be processed to complete, so that the server can determine earlier whether the file to be processed is a duplicate file.

S203: The server sends a response result to the terminal.

The response result includes information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

After determining whether the file to be processed is a duplicate file, the server may send a response result to the terminal to notify the terminal whether the file to be processed is a duplicate file.

In addition, in order to improve the efficiency of the duplicate file detection method, the sending information may further include the size of the file to be processed, and accordingly, S202 may include the following steps:

the server determines, according to the size of the file to be processed, the target value interval to which the file to be processed belongs, and detects whether a hash value identical to that of the file to be processed exists among the hash values of the stored files corresponding to the target value interval; if so, the file to be processed is determined to be a duplicate file, and if not, the file to be processed is determined to be a non-duplicate file.

In one implementation, after the server extracts the size and the hash value of the file to be processed, the server may determine a value interval (i.e., a target value interval) to which the size of the file to be processed belongs. With regard to the numerical intervals, reference may be made to the detailed description in the above-mentioned embodiments.

Then, the server can query the hash value of each storage file corresponding to the target value interval, and determine whether a hash value identical to the hash value of the file to be processed exists, if so, the server can determine that the file identical to the file to be processed is stored, that is, the file to be processed is a duplicate file, and if not, the server can determine that the file identical to the file to be processed is not stored, that is, the file to be processed is a non-duplicate file.

Based on the processing, the server only needs to query the hash value of each storage file corresponding to the target value interval, and does not need to query the hash values of all files stored locally, so that the query time can be saved, and the efficiency of the duplicate file detection method can be improved.

In addition, when the server judges that the file to be processed is a non-duplicate file, the server can also store the file to be processed and record the corresponding relation between the file to be processed and the hash value of the file to be processed, and further, when the terminal uploads the same file again, the server can determine that the file uploaded by the terminal is a duplicate file.
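The interval-restricted lookup and the recording of new files can be sketched together. This is a minimal Python illustration under stated assumptions: the interval boundaries are passed in as parameters (their actual values are defined in the earlier embodiments), half-open intervals of the form [lower, upper) are used, and stored files are represented by a hypothetical `file_id`. The class and method names are also hypothetical.

```python
import bisect


class DuplicateIndex:
    """Hash index partitioned by file-size interval."""

    def __init__(self, boundaries):
        # boundaries [A, B] produce the buckets (0, A), [A, B), [B, +inf)
        self.boundaries = sorted(boundaries)
        self.buckets = [dict() for _ in range(len(self.boundaries) + 1)]

    def _bucket(self, size):
        # bisect_right places a size equal to a boundary in the upper bucket
        return self.buckets[bisect.bisect_right(self.boundaries, size)]

    def is_duplicate(self, file_hash, size):
        """Search only the bucket of the target value interval."""
        return file_hash in self._bucket(size)

    def register(self, file_hash, size, file_id):
        """Record the hash of a newly stored, non-duplicate file."""
        self._bucket(size)[file_hash] = file_id
```

A later upload of the same file then hits the recorded entry, while hashes stored for other size intervals are never scanned.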

Corresponding to the method embodiment of fig. 1, referring to fig. 3, fig. 3 is a structural diagram of a terminal according to an embodiment of the present invention, where the terminal may include: a transceiver 301 and a processor 302;

the transceiver 301 is configured to acquire a file to be processed, which needs to be uploaded to a server by a user; when the file to be processed is sent to the server, the size of the file to be processed is obtained;

the processor 302 is configured to detect a target value interval to which the size of the file to be processed belongs, where different value intervals respectively correspond to different file hash value calculation manners; calculating the hash value of the file to be processed according to a file hash value calculation mode corresponding to the target numerical value interval;

the transceiver 301 is further configured to send, to the server, sending information including a hash value of the file to be processed; and receiving a response result of the server for the sent information, wherein the response result comprises information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

Optionally, the processor 302 is specifically configured to process data included in the file to be processed according to a file hash value calculation manner corresponding to the target value interval, so as to obtain a hash value to be selected; and calculating a hash value of the data containing the hash value to be selected and the size of the file to be processed, and taking the calculated hash value as the hash value of the file to be processed.

Optionally, the target value interval is (0, a);

the processor 302 is specifically configured to calculate a full hash value of the file to be processed, and use the full hash value as a hash value to be selected.

Optionally, the target value interval is [a, B), where B > a;

the processor 302 is specifically configured to calculate a hash value of data including a preset head and a preset tail of the file to be processed, and use the calculated hash value as the hash value to be selected.

Optionally, the target value interval is [B, +∞);

the processor 302 is specifically configured to calculate a hash value of data including a preset head, a preset tail, and a preset middle of the file to be processed, and use the calculated hash value as the hash value to be selected.
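The three calculation modes handled by the processor 302 can be summarized in one Python sketch. It is a simplification made under several assumptions: SHA-256 stands in for the unspecified hash function, the boundaries `a` and `b` and the block size are passed as parameters, the file is held in memory as bytes, the preset middle is reduced to a single block starting at the midpoint, and the blocks are hashed in head, middle, tail order. The name `candidate_hash` is hypothetical.

```python
import hashlib


def candidate_hash(data, a, b, block):
    """Compute the hash value to be selected according to the size interval.

    (0, a):    full hash of the file
    [a, b):    hash of preset head + preset tail
    [b, +inf): hash of preset head + preset middle + preset tail
    """
    size = len(data)
    h = hashlib.sha256()  # assumption: the text does not name the algorithm
    if size < a:
        h.update(data)
    elif size < b:
        h.update(data[:block])
        h.update(data[-block:])
    else:
        mid = size // 2
        h.update(data[:block])
        h.update(data[mid:mid + block])  # simplified: one midpoint block
        h.update(data[-block:])
    return h.hexdigest()
```

Since only sampled regions are hashed for larger files, two files that differ solely in unsampled bytes yield the same hash value to be selected; combining this value with the file size, as in step two of the method, narrows but does not remove that possibility.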

Corresponding to the embodiment of the method in fig. 2, referring to fig. 4, fig. 4 is a structural diagram of a server according to an embodiment of the present invention, where the server may include: a transceiver 401 and a processor 402;

the transceiver 401 is configured to receive sending information that includes a hash value of a to-be-processed file and is sent by a terminal, where the sending information is sent to the server when the terminal sends the to-be-processed file to the server;

the processor 402 is configured to determine whether the file to be processed is a duplicate file according to the sending information;

the transceiver 401 is further configured to send a response result to the terminal, where the response result includes information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

Optionally, the processor 402 is specifically configured to detect whether a hash value identical to that of the file to be processed exists among the hash values of the locally stored files; if such a hash value exists, determine that the file to be processed is a duplicate file; and if no such hash value exists, determine that the file to be processed is a non-duplicate file.

Optionally, the sending information further includes the size of the file to be processed;

the processor 402 is specifically configured to determine, according to the size of the file to be processed, the target value interval to which the file to be processed belongs; detect whether a hash value identical to that of the file to be processed exists among the hash values of the stored files corresponding to the target value interval; if such a hash value exists, determine that the file to be processed is a duplicate file; and if no such hash value exists, determine that the file to be processed is a non-duplicate file.

Referring to fig. 5, fig. 5 is a structural diagram of a duplicate file detection system according to an embodiment of the present invention, where the system may include a terminal 501 and a server 502;

the terminal 501 is configured to acquire a file to be processed, which needs to be uploaded to the server 502 by a user; when the file to be processed is sent to the server 502, the size of the file to be processed is obtained; detecting a target value interval to which the size of the file to be processed belongs, wherein different value intervals respectively correspond to different file hash value calculation modes; calculating the hash value of the file to be processed according to a file hash value calculation mode corresponding to the target numerical value interval; sending information containing the hash value of the file to be processed to the server 502;

the server 502 is configured to receive sending information that includes a hash value of a file to be processed and is sent by the terminal 501; determining whether the file to be processed is a repeated file or not according to the sending information; and sending a response result to the terminal 501, wherein the response result includes information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

The terminal 501 is further configured to receive a response result of the server 502 for the sending information.

The embodiment of the invention also provides a computer-readable storage medium in which instructions are stored; when the instructions are run on a computer, the computer is caused to execute the duplicate file detection method provided by the embodiment of the invention.

Specifically, the duplicate file detection method includes:

acquiring a file to be processed, which is required to be uploaded to a server by a user;

when the file to be processed is sent to the server, the size of the file to be processed is obtained;

detecting a target value interval to which the size of the file to be processed belongs, wherein different value intervals respectively correspond to different file hash value calculation modes;

calculating the hash value of the file to be processed according to a file hash value calculation mode corresponding to the target numerical value interval;

sending information containing the hash value of the file to be processed to the server;

and receiving a response result of the server for the sent information, wherein the response result comprises information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

It should be noted that other implementation manners of the above duplicate file detection method are the same as those of the foregoing method embodiment, and are not described herein again.

By running the instructions stored in the computer-readable storage medium provided by the embodiment of the invention, when the file to be processed is sent to the server, the hash value of the file to be processed can also be sent to the server, and the server can obtain the hash value of the file to be processed without waiting for the transmission of the entire file to be processed to complete, so that the server can determine earlier whether the file to be processed is a duplicate file.

Specifically, the duplicate file detection method includes:

receiving sending information which is sent by a terminal and contains a hash value of a file to be processed, wherein the sending information is sent to a server when the terminal sends the file to be processed to the server;

determining whether the file to be processed is a repeated file or not according to the sending information;

and sending a response result to the terminal, wherein the response result comprises information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

By running the instructions stored in the computer-readable storage medium provided by the embodiment of the invention, the hash value of the file to be processed can be obtained without waiting for the transmission of the file to be processed to complete, so that whether the file to be processed is a duplicate file can be determined earlier.

Embodiments of the present invention further provide a computer program product including instructions, which when run on a computer, causes the computer to execute the duplicate file detection method provided by the embodiments of the present invention.

Specifically, the duplicate file detection method includes the same terminal-side steps as those listed above for the computer-readable storage medium.

By running the computer program product provided by the embodiment of the invention, when the file to be processed is sent to the server, the hash value of the file to be processed can also be sent to the server, and the server can obtain the hash value of the file to be processed without waiting for the transmission of the file to be processed to complete, so that the server can determine earlier whether the file to be processed is a duplicate file.

Specifically, the duplicate file detection method includes the same server-side steps as those listed above for the computer-readable storage medium.

By running the computer program product provided by the embodiment of the invention, the hash value of the file to be processed can be obtained without waiting for the transmission of the file to be processed to complete, so that whether the file to be processed is a duplicate file can be determined earlier.

In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

All the embodiments in this specification are described in a related manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the embodiments of the terminal, the server, the system, the computer-readable storage medium, and the computer program product are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the corresponding parts of the description of the method embodiments.

The above are merely preferred embodiments of the present invention and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A duplicate file detection method, the method comprising:

the terminal sends sending information containing the hash value of the file to be processed and the size of the file to be processed to the server;

2. The method according to claim 1, wherein the calculating, by the terminal, the hash value of the to-be-processed file according to the file hash value calculation manner corresponding to the target value interval includes:

3. The method of claim 2, wherein the target value interval is (0, a); the terminal processes the data contained in the file to be processed according to the file hash value calculation mode corresponding to the target numerical value interval to obtain a hash value to be selected, and the method comprises the following steps:

4. The method according to claim 2 or 3, wherein the target value interval is [a, B), where B > a; the terminal processes the data contained in the file to be processed according to the file hash value calculation mode corresponding to the target value interval to obtain a hash value to be selected, and the method comprises the following steps:

5. The method according to claim 4, wherein the target value interval is [B, +∞); the terminal processes the data contained in the file to be processed according to the file hash value calculation mode corresponding to the target value interval to obtain a hash value to be selected, and the method comprises the following steps:

6. The method of claim 1, wherein the sending information further comprises the size of the file to be processed.

7. A duplicate file detection method, the method comprising:

the method comprises the steps that a server receives sending information which is sent by a terminal and contains a hash value of a file to be processed and the size of the file to be processed, wherein the sending information is sent to the server when the terminal sends the file to be processed to the server; the hash value is obtained according to the following steps: a terminal acquires a file to be processed, which is required to be uploaded to a server by a user; the terminal acquires the size of the file to be processed when sending the file to be processed to the server; the terminal detects a target numerical interval to which the size of the file to be processed belongs, wherein different numerical intervals respectively correspond to different file hash value calculation modes; the terminal calculates the hash value of the file to be processed according to a file hash value calculation mode corresponding to the target numerical value interval;

8. The method of claim 7, wherein the server determining whether the file to be processed is a duplicate file according to the sending information comprises:

9. The method of claim 7, wherein the sending information further comprises the size of the file to be processed;

10. A terminal, characterized in that the terminal comprises: a transceiver and a processor;

the transceiver is further configured to send, to the server, sending information including a hash value of the to-be-processed file and a size of the to-be-processed file; and receiving a response result of the server for the sent information, wherein the response result comprises information that the file to be processed is a duplicate file or information that the file to be processed is a non-duplicate file.

11. The terminal according to claim 10, wherein the processor is specifically configured to process data included in the file to be processed according to a file hash value calculation manner corresponding to the target value interval, so as to obtain a hash value to be selected; and calculating a hash value of the data containing the hash value to be selected and the size of the file to be processed, and taking the calculated hash value as the hash value of the file to be processed.

12. The terminal of claim 11, wherein the target value interval is (0, a);

13. The terminal according to claim 11 or 12, wherein the target value interval is [a, B), where B > a;

14. The terminal according to claim 13, wherein the target value interval is [B, +∞);

15. The terminal of claim 10, wherein the sending information further comprises the size of the file to be processed.

16. A server, characterized in that the server comprises: a transceiver and a processor;

the transceiver is configured to receive sending information, which is sent by a terminal and includes a hash value of a to-be-processed file and a size of the to-be-processed file, where the sending information is sent to the server by the terminal when the to-be-processed file is sent to the server; the hash value is obtained according to the following steps: a terminal acquires a file to be processed, which is required to be uploaded to a server by a user; the terminal acquires the size of the file to be processed when sending the file to be processed to the server; the terminal detects a target numerical interval to which the size of the file to be processed belongs, wherein different numerical intervals respectively correspond to different file hash value calculation modes; the terminal calculates the hash value of the file to be processed according to a file hash value calculation mode corresponding to the target numerical value interval;

17. The server according to claim 16, wherein the processor is specifically configured to detect whether a hash value identical to that of the file to be processed exists among the hash values of the locally stored files; if such a hash value exists, determine that the file to be processed is a duplicate file; and if no such hash value exists, determine that the file to be processed is a non-duplicate file.

18. The server according to claim 16, wherein the sending information further includes a size of the file to be processed;