WO2021109850A1

WO2021109850A1 - Method and system for deduplicating and storing pdf files

Info

Publication number: WO2021109850A1
Application number: PCT/CN2020/129125
Authority: WO
Inventors: 鲍建涛
Original assignee: 世强先进(深圳)科技股份有限公司
Priority date: 2019-12-03
Filing date: 2020-11-16
Publication date: 2021-06-10
Also published as: CN111177082B; CN111177082A

Abstract

A method and system for deduplicating and storing PDF files. The method comprises: reading a feature value to be stored of a PDF file to be stored (S1); determining, stage by stage, whether there is a stored feature value matching the feature value (S2); and if not, storing the PDF file and updating a record containing the stored feature value (S3). In the method, a feature value to be stored of a PDF file to be stored is read, and then a comparison operation is performed to determine whether the feature value matches a stored feature value so as to determine whether the PDF file is the same as a stored PDF file. If the PDF file is not the same as a stored PDF file, the PDF file is stored. The invention ensures that only non-duplicate PDF files are stored, thereby saving file storage resources, preventing users from browsing duplicate files, and improving user experience.

Description

A method and system for deduplication and storage of PDF files

Technical field

The invention relates to the field of data processing, and more specifically, to a method and system for deduplication and storage of PDF files.

Background art

With the continuous development of the information age, people increasingly use electronic files to learn and to exchange information. Among the many types of electronic files, PDF files are chosen by more and more users because they are not easily modified and can be scaled with high fidelity and without distortion.

As the number of PDF files keeps growing, situations arise in which multiple PDF files are stored whose file names differ but whose content is the same, or whose file names are the same but whose content differs. This causes trouble and inconvenience for learning and information exchange, and it also wastes storage resources.

Technical problem

The technical problem to be solved by the present invention is to provide a method and system for deduplication and storage of PDF files in view of the above-mentioned defect in the prior art that it is difficult to distinguish whether the stored PDF files are the same.

Technical solutions

The technical solution adopted by the present invention to solve its technical problems is: constructing a method for deduplication and storage of PDF files, including:

S1: Read the to-be-saved feature value of the PDF file to be saved;

S2: Judge step by step whether there is a stored feature value matching the feature value to be stored, if not, execute step S3;

S3: Store the to-be-saved PDF file and update the record of the stored feature value.

Preferably, the feature value to be saved includes the MD5 code of the PDF file stream to be saved;

The step-by-step judgment in step S2 includes:

S21: Determine whether a stored feature value identical to the MD5 code of the PDF file stream to be saved has been recorded, and if so, execute step S29;

S29: Delete the to-be-saved PDF file.

Preferably, the feature value to be saved further includes the MD5 code of the text content in the PDF file to be saved;

In the step S21, when the record of the stored feature value that is the same as the MD5 code of the PDF file stream to be saved is not found, the stepwise judgment in the step S2 further includes:

S22: Determine whether a stored feature value identical to the MD5 code of the text content in the PDF file to be saved has been recorded, and if so, execute step S23;

S23: Determine whether other content in the file corresponding to the stored feature value is the same as other content in the to-be-saved PDF file, and if they are the same, execute the step S29.

Preferably, the feature value to be saved further includes the SIMHASH code of the text content in the PDF file to be saved and the number of pages of the PDF file to be saved;

When no stored feature value identical to the MD5 code of the text content in the PDF file to be saved is found in the step S22, or when it is determined in the step S23 that other content in the file corresponding to the stored feature value is different from other content in the to-be-saved PDF file, the step-by-step judgment in step S2 further includes:

S24: Determine whether a stored feature value whose Hamming distance from the SIMHASH code of the text content in the PDF file to be saved is within a preset range has been recorded, and if so, execute step S25;

S25: Determine whether the number of pages of the file corresponding to the stored feature value is the same as the number of pages of the PDF file to be saved, and if so, execute step S26 and make a further judgment;

S26: Store the corresponding stored feature value in the suspected duplicate temporary area; wherein the corresponding stored feature values are all the stored feature values whose Hamming distance from the SIMHASH code of the text content in the PDF file to be saved is within the preset range.

Preferably, in the step S23, when it is determined that other content in the file corresponding to the stored feature value is different from the other content in the to-be-saved PDF file, the method further includes:

Perform the step S26, and make a further judgment;

Wherein, the corresponding stored feature value is the stored feature value that is the same as the MD5 code of the text content in the PDF file to be saved.

Preferably, the further judgment specifically includes:

S27: Determine whether there are stored feature values in the suspected duplicate temporary area, if so, execute step S28;

S28: Manually compare whether the file corresponding to the stored feature value is the same as the PDF file to be saved, if they are the same, execute the step S29, otherwise, execute the step S3.

Preferably, the preset range is 3.

Preferably, the step S3 further includes:

Generate and record the file number and file storage path of the PDF file to be saved.

The present invention also constructs a PDF file deduplication storage system, including:

Information reading module, used to read the to-be-saved feature value of the PDF file to be saved;

The content comparison module is used to determine step by step whether there is a stored feature value matching the feature value to be stored;

A storage module for storing the to-be-saved PDF file when there is no stored feature value matching the to-be-saved feature value;

The database is used to update the record of the stored feature value when the storage module stores the to-be-saved PDF file.

Preferably, the feature value to be stored includes:

The MD5 code of the PDF file stream to be saved, the MD5 code and SIMHASH code of the text content in the PDF file to be saved, and the number of pages of the PDF file to be saved.

Beneficial effect

Implementation of the method and system for deduplication and storage of PDF files of the present invention has the following beneficial effects:

By reading the to-be-saved feature value of the to-be-saved PDF file and checking whether it matches a stored feature value, the method determines whether the to-be-saved PDF file is the same as a stored PDF file, and stores the to-be-saved PDF file only when the two differ. As a result, only non-duplicate PDF files are stored, which saves file storage resources, keeps users from browsing duplicate files, and improves the user experience.

Description of the drawings

The present invention will be further described below in conjunction with the accompanying drawings and embodiments. In the accompanying drawings:

FIG. 1 is a flowchart of a first embodiment of a method for deduplicating and storing PDF files of the present invention;

FIG. 2 is a flowchart of a second embodiment of the method for deduplicating and storing PDF files of the present invention;

FIG. 3 is a flowchart of a third embodiment of a method for deduplicating and storing PDF files of the present invention;

FIG. 4 is a flowchart of a fourth embodiment of a method for deduplication and storage of PDF files of the present invention;

FIG. 5 is a flowchart of a fifth embodiment of a method for deduplicating and storing PDF files of the present invention;

FIG. 6 is a schematic diagram of the structure of the PDF file deduplication storage system of the present invention.

Embodiments of the present invention

The technical solutions in the embodiments of the present invention will be clearly and completely described below in conjunction with the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, rather than all the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.

FIG. 1 is a flowchart of the first embodiment of the method for deduplication and storage of PDF files of the present invention. The method of this embodiment can be applied to data processing equipment, such as mobile phones, computers, servers, and other electronic devices with data processing capability. As shown in FIG. 1, in this embodiment, the method for deduplication and storage of a PDF file mainly includes the following steps:

Step S1: Read the to-be-saved feature value of the PDF file to-be-saved.

In practice, when a user stores a newly added PDF file through a data processing device, the device treats the new file as the to-be-saved PDF file and reads its to-be-saved feature value. By judging whether the to-be-saved feature value matches the stored feature value of a stored PDF file, the device determines whether the to-be-saved PDF file is the same as a stored PDF file, and thus whether to store it.

Understandably, before receiving the to-be-saved PDF file, the data processing device may already have stored PDF files as stored PDF files. When a PDF file is stored, the stored feature value corresponding to it is also recorded, where the stored feature value includes one or more of the MD5 code of the stored PDF file stream, the MD5 code and the SIMHASH code of the text content in the stored PDF file, and the number of pages of the stored PDF file. The file number and file storage path of the stored PDF file are also recorded.

Correspondingly, when receiving the to-be-saved PDF file, the data processing device reads one or more to-be-saved feature values of that file, corresponding to the stored feature values. That is, when the stored feature value includes the MD5 code of the stored PDF file stream, the MD5 code of the to-be-saved PDF file stream is read; when the stored feature value includes both the MD5 code of the stored PDF file stream and the MD5 code of the text content in the stored PDF file, the MD5 code of the to-be-saved PDF file stream and the MD5 code of the text content in the to-be-saved PDF file are read; and so on. The to-be-saved feature values read in this way correspond to the stored feature values, which makes it possible to determine subsequently whether the data processing device has recorded a stored feature value that matches a to-be-saved feature value.

Understandably, when the data processing device records multiple stored feature values for each stored PDF file, it can, on receiving the to-be-saved PDF file, read all the corresponding to-be-saved feature values at once and cache them, and then fetch the required value from the cache during the subsequent judgment. Alternatively, it can read only the to-be-saved feature value required by the current judgment. For example, when it must determine whether a stored feature value matching the MD5 code of the to-be-saved PDF file stream has been recorded, it reads only the MD5 code of the PDF file stream of the to-be-saved PDF file.
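The two reading strategies described above can be illustrated with a minimal Python sketch. The class `PendingPdf`, its attribute names, and the assumption that the text content and page count have already been extracted by some PDF library are all hypothetical and do not come from the source.

```python
import hashlib
from functools import cached_property


class PendingPdf:
    """Hypothetical wrapper around a to-be-saved PDF file.

    `raw` is the file stream; `text` and `pages` are assumed to have been
    extracted already by a PDF library that is not shown here.
    """

    def __init__(self, raw: bytes, text: str, pages: int):
        self.raw = raw
        self.text = text
        self.pages = pages

    @cached_property
    def stream_md5(self) -> str:
        # Computed on first access only, then served from the instance cache.
        return hashlib.md5(self.raw).hexdigest()

    @cached_property
    def text_md5(self) -> str:
        return hashlib.md5(self.text.encode("utf-8")).hexdigest()

    def read_all(self) -> dict:
        """Eager variant: read every feature value at once."""
        return {
            "stream_md5": self.stream_md5,
            "text_md5": self.text_md5,
            "pages": self.pages,
        }
```

Accessing `pdf.stream_md5` alone corresponds to reading only the value needed by the current judgment; calling `pdf.read_all()` corresponds to reading and caching everything up front.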

Step S2: It is judged step by step whether there is a stored feature value matching the above-mentioned feature value to be stored. If not, step S3 is executed.

When the data processing device records multiple stored feature values for each stored PDF file, then on receiving the to-be-saved PDF file it reads the multiple to-be-saved feature values corresponding to the stored feature values, and judges step by step whether it has recorded a stored feature value that matches a to-be-saved feature value.

Understandably, when it is determined that the data processing device has recorded a stored feature value that matches the to-be-saved feature value, it is determined that the device already holds a stored PDF file identical to the to-be-saved PDF file, and the to-be-saved PDF file is deleted to avoid repeated storage. Otherwise, the to-be-saved PDF file is determined to differ from the stored PDF files and is stored.

Specifically, the step-by-step judgment includes two or more levels of judgment, and each level compares one or more to-be-saved feature values with the corresponding stored feature values. Depending on the nature of the feature value, the levels can progress from comparing the overall content of the to-be-saved PDF file with that of a stored PDF file to comparing partial content, where partial content includes text content, image content, table content, and so on.

In this embodiment, the step-by-step judgment means that once the overall content of the to-be-saved PDF file is judged to be the same as that of a stored PDF file, there is no need to compare their partial content, which speeds up the judgment. In addition, each level of judgment uses a different feature value, so the two files are compared by several methods, which improves the reliability of the result and prevents repeated storage.

Step S3: Store the to-be-saved PDF file and update the record of the stored feature value.

Specifically, when it is judged that the data processing device has recorded a stored feature value that matches the to-be-saved feature value, the to-be-saved PDF file is deemed the same as a stored PDF file and is deleted. Otherwise, the to-be-saved PDF file is deemed different from the stored PDF files: it is stored, the feature values read from it are recorded, and the file and its feature values then serve as a stored PDF file and its stored feature values, updating the data stored and recorded in the data processing device.

Understandably, when it is determined that the to-be-saved PDF file is not the same as any stored PDF file, the file number and file storage path of the to-be-saved PDF file are generated and recorded at the same time, which makes it easier to trace and look up the file later.

Fig. 2 is a flowchart of a second embodiment of a method for deduplication and storage of PDF files according to the present invention. As shown in Fig. 2, in this embodiment, the method for deduplication and storage of PDF files mainly includes the following steps:

Step 11: Read the MD5 code of the PDF file stream of the to-be-saved PDF file.

Specifically, when a PDF file to be saved is received, the PDF file stream of the PDF file to be saved is read, and the read PDF file stream is converted into an MD5 code to obtain the MD5 code of the PDF file stream to be saved.

Step 12: Determine whether there is a stored feature value that is the same as the MD5 code of the above-mentioned PDF file stream to be saved.

Understandably, before receiving the to-be-saved PDF file, the data processing device may have stored one or more PDF files and recorded the MD5 code of each stored PDF file stream. After the MD5 code of the to-be-saved PDF file stream is obtained, the device queries whether any recorded MD5 code of a stored PDF file stream is the same as it. If so, the to-be-saved PDF file is judged to be the same as the stored PDF file corresponding to that MD5 code, and step 13 is executed; otherwise, the to-be-saved PDF file is judged to differ from the stored PDF files in the data processing device, and step 14 is executed.

Step 13: Delete the to-be-saved PDF file.

Step 14: Store the to-be-saved PDF file and update the record of the stored feature value.

Specifically, when it is determined that the to-be-saved PDF file is not the same as any stored PDF file, the to-be-saved PDF file is stored in the designated path and its file number is generated at the same time; the MD5 code of the to-be-saved PDF file stream, the file storage path, and the file number are then recorded.

In this embodiment, the MD5 code of the PDF file stream is used as the object of judgment, so that whether the to-be-saved PDF file is the same as a stored PDF file is determined from the overall content. The judgment method is simple and fast.
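As an illustration of steps 11 to 14, the following Python sketch hashes a file stream in chunks and keeps a record keyed by the stream MD5. The `records` dictionary, the sequential file numbering, and the `/data/pdf` directory are assumptions made for the example; the source does not say how file numbers or paths are generated, and writing the bytes to disk is left out.

```python
import hashlib
import io
from typing import BinaryIO, Optional

# Hypothetical record: stream MD5 -> (file number, file storage path).
records: dict = {}


def stream_md5(stream: BinaryIO, chunk_size: int = 65536) -> str:
    """Hash the stream in chunks so large PDFs need not fit in memory."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def save_or_discard(stream: BinaryIO, storage_dir: str = "/data/pdf") -> Optional[int]:
    """Return the new file number, or None when the file is a duplicate."""
    md5 = stream_md5(stream)                   # step 11
    if md5 in records:                         # step 12
        return None                            # step 13: discard the file
    file_no = len(records) + 1                 # step 14: assumed numbering
    path = f"{storage_dir}/{file_no}.pdf"      # assumed path layout
    # Writing the bytes to `path` is omitted in this sketch.
    records[md5] = (file_no, path)
    return file_no
```

For example, calling `save_or_discard(io.BytesIO(data))` twice with the same bytes returns a file number the first time and `None` the second time.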

FIG. 3 is a flowchart of a third embodiment of a method for deduplication and storage of a PDF file of the present invention. The difference between this embodiment and the previous embodiment is that this embodiment compares partial content. As shown in Figure 3, in this embodiment, the method for deduplication and storage of PDF files mainly includes the following steps:

Step 21: Read the MD5 code of the text content in the PDF file to be saved.

Specifically, when a PDF file to be saved is received, the text content of the PDF file to be saved is read and converted into an MD5 code to obtain the MD5 code of the text content in the PDF file to be saved. Understandably, other content in the to-be-saved PDF file, such as pictures, table content, and other objects, can be read at the same time.

Step 22: Determine whether there is a stored feature value that is the same as the MD5 code of the text content in the PDF file to be saved.

Understandably, when the data processing device has recorded an MD5 code of the text content in a stored PDF file that is the same as the MD5 code of the text content in the to-be-saved PDF file, the text content of the to-be-saved PDF file is judged to be the same as the text content of the stored PDF file corresponding to that MD5 code, and step 23 is executed for further judgment. Otherwise, the text content of the to-be-saved PDF file is judged to be different from the text content of the stored PDF files in the data processing device, and step 24 is executed.

Step 23: Determine whether other content in the file corresponding to the stored feature value is the same as other content in the to-be-saved PDF file.

Understandably, it is further judged whether other content in the saved PDF file corresponding to the MD5 code of the text content in the searched saved PDF file is the same as other content in the to-be-saved PDF file. Among them, the MD5 code of the text content in the saved PDF file can correspond to one or more saved PDF files. Correspondingly compare the to-be-saved PDF file with other content except text content of the one or more saved PDF files, where the other content includes pictures, table content, and other objects.

Understandably, the other content can be read in step 21, or before this step is performed. When the other content of the two files is completely the same, they are judged to be the same; otherwise, they are judged to be different. For example, if the pictures in the two files differ only in zoom ratio, the files are still judged to be different.

If the other content of the two files is also the same, the overall content of the to-be-saved PDF file is judged to be the same as that of the stored PDF file, and step 25 is executed. Understandably, as soon as the to-be-saved PDF file is judged to be the same as one stored PDF file, only step 25 needs to be performed, and the process ends. Otherwise, step 26 is executed; that is, step 26 is executed when the other content of every such stored PDF file differs from the other content of the to-be-saved PDF file.

Step 24: Store the to-be-saved PDF file and update the record of the stored feature value.

Step 25: Delete the to-be-saved PDF file.

Step 26: Store the stored feature value that is the same as the MD5 code of the text content in the to-be-saved PDF file in the suspected duplicate temporary area, and make a further judgment.

Understandably, when it is judged that the text content of the to-be-saved PDF file is the same as that of the saved PDF file but the other content is different, it is considered that the to-be-saved PDF file is suspected to be the same as the saved PDF file, and further judgment is required.

In this embodiment, several partial-content comparisons together determine whether the overall content of the two files is the same. The partial content includes text content and other content, and the other content includes pictures, table content, and other objects. Comparing each kind of content against its counterpart improves the accuracy of the judgment.
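Steps 21 to 26 can be sketched in Python as follows. The `ParsedPdf` type is hypothetical: it assumes the text content and the other objects (pictures, tables, and so on) have already been extracted, and it models "other content" as a tuple of raw byte strings compared for exact equality, which is only one possible reading of the source.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ParsedPdf:
    text: str                  # extracted text content
    other: Tuple[bytes, ...]   # pictures, tables and other objects, as raw bytes


def text_md5(pdf: ParsedPdf) -> str:
    return hashlib.md5(pdf.text.encode("utf-8")).hexdigest()


def judge(pending: ParsedPdf, saved: List[ParsedPdf]) -> Tuple[str, List[ParsedPdf]]:
    """Return (decision, suspects) for one to-be-saved file."""
    key = text_md5(pending)                                  # step 21
    same_text = [s for s in saved if text_md5(s) == key]     # step 22
    if not same_text:
        return "store", []                                   # step 24
    if any(s.other == pending.other for s in same_text):     # step 23
        return "delete", []                                  # step 25
    return "suspect", same_text                              # step 26
```

A file whose text matches but whose picture bytes differ, for instance because the picture was rescaled, ends up in the `"suspect"` branch rather than being deleted or stored outright.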

FIG. 4 is a flowchart of a fourth embodiment of a method for deduplication and storage of a PDF file of the present invention. The difference from the previous embodiment is that this embodiment compares the overall content with the partial content. As shown in FIG. 4, in this embodiment, the method for deduplication and storage of PDF files mainly includes the following steps:

Step 31: Read the SIMHASH code of the text content of the PDF file to be saved and the number of pages of the PDF file to be saved.

Specifically, the text content of the to-be-saved PDF file is read and converted into a SIMHASH code to obtain the SIMHASH code of the text content in the to-be-saved PDF file, and the PDF file stream of the to-be-saved PDF file is read to obtain its number of pages.
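The source does not state how the SIMHASH code is computed. The sketch below shows one common formulation, given only as an assumption: the text is split on whitespace, each token is hashed to 64 bits (here using the first 8 bytes of its MD5 digest), and each output bit is set when more tokens have a 1 than a 0 in that position. Tokenization, token weighting, and the 64-bit width are all choices made for this example.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint of `text` (assumed formulation)."""
    weights = [0] * bits
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Vote +1 for a set bit and -1 for a cleared bit.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint
```

Because the votes are summed per bit, the fingerprint does not depend on token order, and texts that share most of their tokens tend to produce fingerprints that differ in only a few bits.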

Step 32: Determine whether there is a stored feature value whose Hamming distance from the SIMHASH code of the text content in the PDF file to be saved is within a preset range.

Understandably, in information coding, the number of positions at which the corresponding bits of two legal codewords differ is called the code distance, also known as the Hamming distance. Texts whose SIMHASH codes have a Hamming distance within 3 are generally considered highly similar. In this embodiment, the preset range is 3; it can also be set as needed.

Specifically, when the data processing device has recorded a stored feature value within a Hamming distance of 3 from the SIMHASH code of the text content in the to-be-saved PDF file, it is considered that the device holds a stored PDF file whose text content is highly similar to that of the to-be-saved PDF file, and step 33 is executed for further judgment; otherwise, step 34 is executed.
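The Hamming-distance check of step 32 can be written compactly in Python. In this sketch the SIMHASH codes are plain integers, the `saved` mapping from file number to SIMHASH code is a hypothetical layout, and "within 3" is read as a distance of at most 3.

```python
from typing import Dict, List


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def near_duplicates(pending: int, saved: Dict[int, int],
                    preset_range: int = 3) -> List[int]:
    """File numbers whose SIMHASH code lies within the preset range.

    `saved` maps file number -> SIMHASH code (assumed layout).
    """
    return [file_no for file_no, code in saved.items()
            if hamming_distance(pending, code) <= preset_range]
```

For instance, `hamming_distance(0b1011, 0b0001)` is 2, because the two values differ in the second and fourth bits. A linear scan like this is enough to show the logic; a large record store would likely need an index, which the source does not discuss.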

Step 33: Determine whether the number of pages of the file corresponding to the above-mentioned stored feature value is the same as the number of pages of the above-mentioned PDF file to be saved.

Understandably, it is further determined whether the number of pages of each stored PDF file found through its SIMHASH code is the same as the number of pages of the to-be-saved PDF file. The query may return one or more stored PDF files. When it returns several, there are three cases. First, if the number of pages of every such stored PDF file differs from the number of pages of the to-be-saved PDF file, the to-be-saved PDF file is considered different from the stored PDF files, and step 34 is executed. Second, if the number of pages of every such stored PDF file is the same as the number of pages of the to-be-saved PDF file, step 35 is executed. Third, if only some of the stored PDF files have the same number of pages as the to-be-saved PDF file, the records of the stored PDF files whose number of pages differs are discarded, the stored PDF files with the same number of pages are treated as suspected identical files, and step 35 is executed.
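The three page-count cases reduce to a single filter, as the following Python sketch suggests. The `candidates` mapping from file number to page count is a hypothetical layout for the stored files found in step 32.

```python
from typing import Dict, List, Tuple


def filter_by_pages(pending_pages: int,
                    candidates: Dict[int, int]) -> Tuple[str, List[int]]:
    """Step 33: keep only candidates with the same page count.

    `candidates` maps file number -> number of pages (assumed layout).
    """
    suspects = [file_no for file_no, pages in candidates.items()
                if pages == pending_pages]
    if not suspects:
        return "store", []         # no page count matches: step 34
    return "suspect", suspects     # all or some match: step 35
```

When only some candidates match, the non-matching ones simply drop out of `suspects`, which corresponds to discarding their records.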

Step 34: Store the to-be-saved PDF file and update the record of the stored feature value.

Step 35: Store the stored feature value whose Hamming distance from the SIMHASH code of the text content in the PDF file to be saved is within a preset range to the suspected duplicate temporary area, and make a further judgment.

In this embodiment, overall content is compared in combination with partial content to determine whether the to-be-saved PDF file is the same as a stored PDF file, which improves the accuracy of the judgment.

Further, in the foregoing third and fourth embodiments, the further judgment mainly includes the following steps:

Step 41: Determine whether there are stored feature values in the suspected duplicate temporary area, and if so, proceed to step 42.

Step 42: Manually compare whether the file corresponding to the above-mentioned stored feature value is the same as the to-be-saved PDF file. If they are the same, delete the to-be-saved PDF file; if they are not the same, store the to-be-saved PDF file and update the record of the stored feature value.

Specifically, according to the stored feature values held in the suspected duplicate temporary area, the corresponding stored PDF files are read from the storage of the data processing device, and a person judges whether the to-be-saved PDF file and each stored PDF file are the same. Manual judgment eliminates the defect of the preceding judgments, in which content is judged to be different merely because of the zoom level or sharpness of pictures, tables, and other objects, and so improves the accuracy of the judgment.
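The hand-off to a human reviewer in steps 41 and 42 might be modelled as a callback, as in the Python sketch below. The `reviewer_says_same` callback and the use of file numbers to identify suspects are assumptions; the source does not describe any review interface. An empty temporary area falls through to storing the file, which matches step S27 of the fifth embodiment.

```python
from typing import Callable, Iterable


def further_judgment(suspect_area: Iterable[int],
                     reviewer_says_same: Callable[[int], bool]) -> str:
    """Decide the fate of the to-be-saved file after manual review.

    `suspect_area` holds the file numbers of suspected duplicates.
    `reviewer_says_same` is a hypothetical callback that presents the stored
    file next to the to-be-saved one and returns the reviewer's verdict.
    """
    for file_no in suspect_area:           # step 41: anything to review?
        if reviewer_says_same(file_no):    # step 42: manual comparison
            return "delete"
    return "store"
```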

FIG. 5 is a flowchart of a fifth embodiment of a method for deduplicating and storing PDF files of the present invention. This embodiment is a step-by-step judgment scheme formed by combining the second, third, and fourth embodiments described above, so the steps that repeat those embodiments are not described in detail again.

As shown in Figure 5, in this embodiment, the method for deduplication and storage of PDF files mainly includes the following steps:

Step S1: Read the to-be-saved feature value of the PDF file to-be-saved.

Specifically, the MD5 code of the PDF file stream to be saved, the MD5 and SIMHASH codes of the text content in the PDF file to be saved, and the number of pages of the PDF file to be saved are read.

Step S21: It is judged whether there is a stored feature value that is the same as the MD5 code of the above-mentioned PDF file stream to be saved. If there is, step S29 is executed; otherwise, step S22 is executed.

Step S29: Delete the above-mentioned pending PDF file, and end the process.

Step S22: Determine whether a stored feature value that is the same as the MD5 code of the text content in the PDF file to be saved has been recorded. If so, step S23 is executed; otherwise, step S24 is executed.

Step S23: Determine whether other content in the file corresponding to the stored feature value is the same as other content in the pending PDF file, if they are the same, perform step S29, otherwise, perform step S26 and step S24.

Understandably, in this step, when a stored file with the same other content exists, step S29 is executed directly, no other steps are executed, and the process ends. When the other content differs, step S26 is executed first, so that the corresponding stored feature value is stored in the suspected duplicate temporary area, and step S24 is then executed to improve the accuracy of the judgment.

Step S24: Determine whether a stored feature value whose Hamming distance from the SIMHASH code of the text content in the PDF file to be saved is within a preset range has been recorded. If so, step S25 is executed; otherwise, step S27 is executed.

In this embodiment, the preset range is 3; it can also be set as needed.

Step S25: Determine whether the number of pages of the file corresponding to the stored feature value is the same as the number of pages of the PDF file to be saved, if they are the same, step S26 is executed, otherwise, step S27 is executed.

Understandably, in this step, when none of the page counts are the same, the process jumps to step S27; when all of them are the same, it jumps to step S26; when some are the same and some are not, the differing ones are discarded and the process goes to step S26.

Step S26: Store the corresponding stored feature value in the suspected duplicate temporary area.

Understandably, when jumping from step S23 to step S26, this means storing the stored feature value that is the same as the MD5 code of the text content in the PDF file to be saved in the suspected duplicate temporary area; when jumping from step S25 to step S26, it means storing the stored feature values whose Hamming distance from the SIMHASH code of the text content in the PDF file to be saved is within the preset range in the suspected duplicate temporary area.

Step S27: Determine whether any stored feature value is held in the suspected duplicate temporary area. If so, step S28 is executed; otherwise, step S3 is executed.

Understandably, when no stored feature value is held in the suspected duplicate temporary area, the to-be-saved PDF file is considered different from the stored PDF files.

Step S28: Manually compare whether the file corresponding to the above-mentioned stored feature value is the same as the above-mentioned PDF file to be saved. If they are the same, perform step S29; otherwise, perform step S3.

Step S3: Store the above-mentioned pending PDF file, update the record of the above-mentioned stored feature value, and end the process.

The method of this embodiment judges step by step whether the to-be-saved PDF file is the same as a stored PDF file. The successive levels judge by overall content, by partial content, and by overall content combined with partial content, which improves the accuracy of the judgment.
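The whole cascade from step S21 to step S3 can be summarized in one function. The Python sketch below is an illustration under stated assumptions, not the patented implementation: the `Features` record, the `other_md5` field standing in for the comparison of pictures, tables, and other objects, and the `manual_same` callback for step S28 are all hypothetical, and "within the preset range" is read as a Hamming distance of at most 3.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Features:
    stream_md5: str   # MD5 code of the PDF file stream
    text_md5: str     # MD5 code of the text content
    simhash: int      # SIMHASH code of the text content
    pages: int        # number of pages
    other_md5: str    # assumption: digest standing in for the other content


def decide(pending: Features, records: List[Features],
           manual_same: Callable[[Features, Features], bool],
           preset_range: int = 3) -> str:
    """Return "delete" (step S29) or "store" (step S3)."""
    # S21: identical file stream means an exact duplicate.
    if any(r.stream_md5 == pending.stream_md5 for r in records):
        return "delete"
    suspects: List[Features] = []   # suspected duplicate temporary area
    # S22: stored files with identical text content.
    same_text = [r for r in records if r.text_md5 == pending.text_md5]
    if same_text:
        # S23: identical other content as well means a duplicate.
        if any(r.other_md5 == pending.other_md5 for r in same_text):
            return "delete"
        suspects.extend(same_text)  # S26, reached from S23
    # S24: stored files whose SIMHASH code is within the preset range.
    near = [r for r in records
            if bin(r.simhash ^ pending.simhash).count("1") <= preset_range]
    # S25 and S26: keep only near matches with the same page count.
    for r in near:
        if r.pages == pending.pages and r not in suspects:
            suspects.append(r)
    # S27 and S28: manual comparison of whatever is in the temporary area.
    if any(manual_same(pending, r) for r in suspects):
        return "delete"
    return "store"
```

When the result is `"store"`, the caller would append `pending` to `records`, which corresponds to updating the record of stored feature values in step S3.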

FIG. 6 is a schematic structural diagram of the first embodiment of the PDF file deduplication storage system of the present invention. The system can be applied to data processing equipment, such as mobile phones, computers, servers, and other electronic equipment with data processing capabilities.

As shown in FIG. 6, the PDF file deduplication storage system 100 includes an information reading module 101, a content comparison module 102, a storage module 103, and a database 104. Understandably, each module in the PDF file deduplication storage system corresponds to the PDF file deduplication storage method in the first to fifth embodiments described above, so the specific steps are not described again in detail.

The information reading module 101 is used to read the to-be-saved feature value of the PDF file to be saved.

Understandably, when the data processing device receives the to-be-saved PDF file, the information reading module 101 reads one or more to-be-saved feature values of the to-be-saved PDF file, corresponding to the stored feature values. The information reading module 101 reads the PDF file stream of the PDF file to be saved and converts the read PDF file stream into an MD5 code to obtain the MD5 code of the PDF file stream to be saved; reads the text content of the PDF file and converts the read text content into an MD5 code and a SIMHASH code to obtain the MD5 code and SIMHASH code of the text content in the PDF file to be saved; reads the number of pages of the PDF file to be saved; and reads other content in the to-be-saved PDF file, where the other content includes pictures, tables, and other objects.
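As a rough sketch of what the information reading module computes, the following uses Python's standard `hashlib` for the MD5 codes and a common 64-bit SimHash construction over whitespace-separated tokens. The text does not specify the SIMHASH variant, the tokenization, or the PDF parsing library, so those are assumptions: `simhash64` and `read_features` are hypothetical names, and the extracted text and page count are passed in as arguments rather than parsed from the PDF.

```python
import hashlib
from typing import Dict, Union


def md5_hex(data: bytes) -> str:
    """MD5 code of a byte string, as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


def simhash64(text: str) -> int:
    """64-bit SimHash: each token votes +1 or -1 on every bit position."""
    weights = [0] * 64
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (token_hash >> bit) & 1 else -1
    value = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            value |= 1 << bit
    return value


def read_features(pdf_bytes: bytes, text: str, page_count: int) -> Dict[str, Union[str, int]]:
    """Collect the to-be-saved feature values of one PDF file."""
    return {
        "stream_md5": md5_hex(pdf_bytes),
        "text_md5": md5_hex(text.encode("utf-8")),
        "text_simhash": simhash64(text),
        "page_count": page_count,
    }
```

Two files whose text differs in only a few tokens tend to produce SIMHASH codes that differ in only a few bits, which is what makes the later Hamming distance comparison meaningful.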

The content comparison module 102 is used to determine step by step whether there is a stored feature value matching the above-mentioned feature value to be stored.

Understandably, when the content comparison module 102 determines that the database 104 has recorded a stored feature value that matches the to-be-saved feature value, it determines that the storage module 103 has already stored a PDF file identical to the to-be-saved PDF file and deletes the to-be-saved PDF file to avoid repeated storage; otherwise, it determines that the to-be-saved PDF file is different from the stored PDF files, notifies the storage module 103 to store the to-be-saved PDF file, and notifies the database 104 to record the to-be-saved feature value corresponding to that file.

Understandably, the overall content judgment includes judging whether the MD5 code of the PDF file stream to be saved is the same as the MD5 code of a saved PDF file stream, and whether the number of pages of the PDF file to be saved is the same as the number of pages of the saved PDF file. The partial content judgment includes judging whether the MD5 code of the text content in the PDF file to be saved is the same as the MD5 code of the text content in the saved PDF file; whether the Hamming distance between the SIMHASH code of the text content in the PDF file to be saved and the SIMHASH code of the text content in the saved PDF file is within a range of 3; and whether other content in the to-be-saved PDF file is the same as other content in the saved PDF file, where the other content includes pictures, tables, and other objects.
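The Hamming distance between two SIMHASH codes is the number of bit positions in which they differ, which can be computed by XOR-ing the codes and counting the set bits. The short sketch below assumes the codes are held as integers and reads "within a range of 3" as a distance of at most 3; both function names are hypothetical.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two SIMHASH codes differ."""
    return bin(a ^ b).count("1")


def within_preset_range(a: int, b: int, preset_range: int = 3) -> bool:
    """True when the two codes are close enough to be treated as suspected duplicates."""
    return hamming_distance(a, b) <= preset_range
```

For example, 0b10110100 and 0b10010001 differ in three bit positions, so they fall within the range, whereas 0b1111 and 0b0000 differ in four and do not.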

The storage module 103 is configured to store the PDF file to be stored when there is no stored feature value matching the feature value to be stored.

The database 104 is used to update the record of the stored feature value when the storage module 103 stores the PDF file to be stored.

Understandably, before the to-be-saved PDF file is received, the storage module 103 already holds the stored PDF files, and the database 104 records the stored feature values corresponding to them. The stored feature values include one or more of the MD5 code of the stored PDF file stream, the MD5 code and the SIMHASH code of the text content of the stored PDF file, and the number of pages of the stored PDF file. The file number and file storage path of each stored PDF file are also recorded.

Specifically, after the storage module 103 receives the notification from the content comparison module 102, it stores the to-be-saved PDF file in a designated path. After receiving the notification from the content comparison module 102, the database 104 records the to-be-saved feature value corresponding to the PDF file to be saved.
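The cooperation of the storage module and the database can be illustrated as follows. This is a sketch under assumptions, not the described system: a plain dictionary stands in for database 104, file numbers are assigned sequentially, and `Record` and `store_pdf` are hypothetical names.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict


@dataclass
class Record:
    """One database entry: file number, storage path, and feature values."""
    file_number: int
    storage_path: str
    features: Dict[str, object]


def store_pdf(pdf_bytes: bytes, features: Dict[str, object],
              directory: str, database: Dict[int, Record]) -> Record:
    """Write the file to the designated path and record its feature values."""
    file_number = max(database, default=0) + 1  # sequential numbering (assumed)
    path = Path(directory) / f"{file_number}.pdf"
    path.write_bytes(pdf_bytes)
    record = Record(file_number, str(path), dict(features))
    database[file_number] = record
    return record
```

Keeping the file number and storage path in the same record as the feature values lets a later match on any feature value lead straight back to the stored file.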

Industrial applicability

In the present invention, the to-be-saved feature value of the to-be-saved PDF file is read and compared with the stored feature values to determine whether the to-be-saved PDF file is the same as a stored PDF file, and the to-be-saved PDF file is stored only when it differs from the stored PDF files. As a result, only non-duplicate PDF files are stored, which saves file storage resources, spares users from browsing duplicate files, and improves the user experience.

It is understandable that the above examples express only the preferred embodiments of the present invention. Although their descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art may, without departing from the concept of the present invention, freely combine the above technical features and make several modifications and improvements, all of which belong to the scope of protection of the present invention. Therefore, all equivalent changes and modifications made within the scope of the claims of the present invention shall fall within the scope of the claims of the present invention.

Claims

1. A method for deduplication and storage of PDF files, characterized in that it comprises:

S1: Read the to-be-saved feature value of the PDF file to be saved;

S2: Judge step by step whether there is a stored feature value matching the feature value to be stored, if not, execute step S3;

S3: Store the to-be-saved PDF file and update the record of the stored feature value.

2. The method for deduplication and storage of PDF files according to claim 1, characterized in that:

The feature value to be saved includes the MD5 code of the PDF file stream to be saved;

The step-by-step judgment in step S2 includes:

S21: Determine whether the stored feature value that is the same as the MD5 code of the PDF file stream to be saved is recorded, and if so, execute step S29;

S29: Delete the to-be-saved PDF file.

3. The method for deduplication and storage of PDF files according to claim 2, characterized in that:

The feature value to be saved also includes the MD5 code of the text content in the PDF file to be saved;

When, in step S21, no stored feature value identical to the MD5 code of the PDF file stream to be saved is found, the step-by-step judgment in step S2 further includes:

S22: Determine whether the stored feature value that is the same as the MD5 code of the text content in the PDF file to be saved is recorded, and if so, execute step S23;

S23: Determine whether other content in the file corresponding to the stored feature value is the same as other content in the to-be-saved PDF file, and if they are the same, execute step S29.

4. The method for deduplication and storage of PDF files according to claim 3, characterized in that:

The feature value to be saved also includes the SIMHASH code of the text content in the PDF file to be saved and the number of pages of the PDF file to be saved;

When, in step S22, no stored feature value identical to the MD5 code of the text content in the PDF file to be saved is found, or when, in step S23, it is determined that other content in the file corresponding to the stored feature value is different from other content in the to-be-saved PDF file, the step-by-step judgment in step S2 further includes:

S24: Determine whether a stored feature value whose Hamming distance from the SIMHASH code of the text content in the PDF file to be saved is within a preset range is recorded, and if so, execute step S25;

S25: Determine whether the number of pages of the file corresponding to the stored feature value is the same as the number of pages of the PDF file to be saved, and if they are the same, perform step S26 and make a further judgment;

S26: Store the corresponding stored feature value in the suspected duplicate temporary area; wherein the corresponding stored feature values are all of the stored feature values whose Hamming distance from the SIMHASH code of the text content in the PDF file to be saved is within the preset range.

5. The method for deduplication and storage of PDF files according to claim 4, characterized in that:

In the step S23, when it is determined that other content in the file corresponding to the stored feature value is different from the other content in the to-be-saved PDF file, the method further includes:

Perform step S26 and make a further judgment;

wherein the corresponding stored feature value is the stored feature value that is the same as the MD5 code of the text content in the PDF file to be saved.

6. The method for deduplication and storage of PDF files according to any one of claims 4-5, characterized in that:

The further judgment specifically includes:

S27: Determine whether there are stored feature values in the suspected duplicate temporary area, if so, execute step S28;

S28: Manually compare whether the file corresponding to the stored feature value is the same as the PDF file to be saved; if they are the same, execute step S29; otherwise, execute step S3.

7. The method for deduplication and storage of PDF files according to claim 5, characterized in that:

The preset range is 3.

8. The method for deduplication and storage of PDF files according to claim 1, characterized in that:

The step S3 also includes:

Generate and record the file number and file storage path of the PDF file to be saved.

9. A PDF file deduplication storage system, characterized in that it comprises:

An information reading module, used to read the to-be-saved feature value of the PDF file to be saved;

A content comparison module, used to determine step by step whether there is a stored feature value matching the feature value to be stored;

A storage module, used to store the to-be-saved PDF file when there is no stored feature value matching the to-be-saved feature value;

A database, used to update the record of the stored feature value when the storage module stores the to-be-saved PDF file.

10. The PDF file deduplication storage system according to claim 9, characterized in that:

The feature value to be stored includes:

The MD5 code of the PDF file stream to be saved, the MD5 code and SIMHASH code of the text content in the PDF file to be saved, and the number of pages of the PDF file to be saved.