WO2021109850A1 - Method and system for deduplicating and storing pdf files - Google Patents
- Publication number
- WO2021109850A1 (PCT/CN2020/129125)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- saved
- pdf file
- feature value
- stored
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
- G06F16/162—Delete operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/11—File system administration, e.g. details of archiving or snapshots
- G06F16/113—Details of archiving
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- the invention relates to the field of data processing, and more specifically, to a method and system for deduplication and storage of PDF files.
- the technical problem to be solved by the present invention is to provide a method and system for deduplication and storage of PDF files in view of the above-mentioned defect in the prior art that it is difficult to distinguish whether the stored PDF files are the same.
- the technical solution adopted by the present invention to solve its technical problems is: constructing a method for deduplication and storage of PDF files, including:
- S1: Read the to-be-saved feature value of the PDF file to be saved;
- S2: Judge step by step whether a stored feature value matching the to-be-saved feature value is recorded; if not, execute step S3;
- S3: Store the PDF file to be saved and update the record of the stored feature values.
- the feature value to be saved includes the MD5 code of the PDF file stream to be saved;
- step S2 includes:
- the feature value to be saved further includes the MD5 code of the text content in the PDF file to be saved;
- the stepwise judgment in the step S2 further includes:
- the feature value to be saved further includes the SIMHASH code of the text content in the PDF file to be saved and the number of pages of the PDF file to be saved;
- step S22 when the stored feature value that is the same as the MD5 code of the text content in the PDF file to be saved is not found, or in the step S23, when it is determined that other content in the file corresponding to the stored feature value
- the step-by-step judgment in step S2 further includes:
- S25: Determine whether the number of pages of the file corresponding to the stored feature value is the same as the number of pages of the PDF file to be saved; if they are the same, perform step S26 and make a further judgment;
- S26: Store the corresponding stored feature values in the suspected duplicate temporary area, where the corresponding stored feature values are all stored feature values whose Hamming distance from the SIMHASH code of the text content in the PDF file to be saved is within the preset range.
- when it is determined that other content in the file corresponding to the stored feature value is different from the other content in the PDF file to be saved, the method further includes:
- here, the corresponding stored feature value is the stored feature value that is the same as the MD5 code of the text content in the PDF file to be saved.
- the further judgment specifically includes:
- the preset range is 3.
- the step S3 further includes:
- the present invention also constructs a PDF file deduplication storage system, including:
- an information reading module, used to read the to-be-saved feature value of the PDF file to be saved;
- a content comparison module, used to judge step by step whether a stored feature value matching the to-be-saved feature value is recorded;
- a storage module, used to store the PDF file to be saved when no stored feature value matches the to-be-saved feature value;
- a database, used to update the record of the stored feature values when the storage module stores the PDF file to be saved.
- the to-be-saved feature value includes:
- the MD5 code of the PDF file stream to be saved, the MD5 code and SIMHASH code of the text content in the PDF file to be saved, and the number of pages of the PDF file to be saved.
- the PDF file to be saved is stored. As a result, only non-duplicate PDF files are stored, which saves file storage resources, spares users from browsing duplicate files, and improves the user experience.
- FIG. 1 is a flowchart of a first embodiment of the method for deduplicating and storing PDF files of the present invention.
- FIG. 2 is a flowchart of a second embodiment of the method for deduplicating and storing PDF files of the present invention.
- FIG. 3 is a flowchart of a third embodiment of the method for deduplicating and storing PDF files of the present invention.
- FIG. 4 is a flowchart of a fourth embodiment of the method for deduplicating and storing PDF files of the present invention.
- FIG. 5 is a flowchart of a fifth embodiment of the method for deduplicating and storing PDF files of the present invention.
- FIG. 6 is a schematic diagram of the structure of the PDF file deduplication storage system of the present invention.
- Fig. 1 is a flowchart of the first embodiment of the method for deduplication and storage of PDF files of the present invention.
- the method for deduplication and storage of PDF files in this embodiment can be applied to data processing equipment, such as mobile phones, computers, servers, and other electronic equipment with data processing capabilities.
- the method for deduplication and storage of a PDF file mainly includes the following steps:
- Step S1: Read the to-be-saved feature value of the PDF file to be saved.
- the data processing device treats a newly added PDF file as the PDF file to be saved and reads its to-be-saved feature value. By judging whether the to-be-saved feature value matches the stored feature value of a stored PDF file, it determines whether the PDF file to be saved is the same as a stored PDF file, and therefore whether to store it.
- the data processing device may already hold PDF files as stored PDF files. When it stores a PDF file, it also records the stored feature value corresponding to that file, where the stored feature value includes one or more of the MD5 code of the stored PDF file stream, the MD5 code and the SIMHASH code of the text content in the stored PDF file, and the number of pages of the stored PDF file.
- the file number and file storage path of the stored PDF file are also recorded.
- when receiving the PDF file to be saved, the data processing device reads the one or more to-be-saved feature values that correspond to the kinds of stored feature values it records. That is, when the stored feature value includes the MD5 code of the stored PDF file stream, the device reads the MD5 code of the file stream of the PDF file to be saved; when the stored feature value includes both the MD5 code of the stored PDF file stream and the MD5 code of the text content in the stored PDF file, the device reads the MD5 code of the file stream and the MD5 code of the text content of the PDF file to be saved; and so on. The to-be-saved feature values read therefore correspond to the stored feature values, which makes it easier to determine afterwards whether the data processing device has recorded a stored feature value that matches the to-be-saved feature value.
- when the data processing device records multiple stored feature values for each stored PDF file, it can, on receiving the PDF file to be saved, read all of the corresponding to-be-saved feature values at once and cache them. In each subsequent judgment, the corresponding to-be-saved feature value is then read from the cache. Understandably, it is also possible to read only the to-be-saved feature value required for the current judgment; for example, when it is necessary to determine whether the data processing device has recorded a stored feature value that matches the MD5 code of the PDF file stream to be saved, only the MD5 code of that file stream is read.
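The "read only what the current judgment needs" option can be sketched in Python as lazily computed, cached attributes. This is a minimal illustration, not the patent's implementation: the class name `PendingPdf` and its fields are hypothetical, and the text content and page count are assumed to have been extracted already.

```python
import hashlib
from functools import cached_property


class PendingPdf:
    """Hypothetical holder for a PDF file to be saved and its feature values."""

    def __init__(self, stream: bytes, text: str, pages: int):
        self.stream = stream   # raw bytes of the PDF file stream
        self.text = text       # text content, assumed already extracted
        self.pages = pages     # number of pages, assumed already known

    @cached_property
    def stream_md5(self) -> str:
        # Computed on first access only, then served from the instance cache.
        return hashlib.md5(self.stream).hexdigest()

    @cached_property
    def text_md5(self) -> str:
        # Not computed at all unless a later judgment level asks for it.
        return hashlib.md5(self.text.encode("utf-8")).hexdigest()
```

If the first level (stream MD5) already finds a duplicate, `text_md5` is never computed, which is the saving that reading on demand is meant to give.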
- Step S2: Judge step by step whether a stored feature value matching the to-be-saved feature value is recorded. If not, execute step S3.
- when the data processing device has recorded multiple stored feature values for a stored PDF file, then on receiving the PDF file to be saved it reads the multiple to-be-saved feature values corresponding to those stored feature values and judges step by step whether it has recorded a stored feature value that matches them.
- when it is determined that a matching stored feature value has been recorded, the data processing device concludes that it already holds a stored PDF file identical to the PDF file to be saved and deletes the PDF file to be saved to avoid repeated storage; otherwise, it concludes that the PDF file to be saved differs from the stored PDF files and stores it.
- the step-by-step judgment includes two or more levels of judgment, and each level compares one or more to-be-saved feature values with the corresponding one or more stored feature values.
- the comparison with the stored PDF files can proceed from the overall content of the PDF file to be saved to its partial content, where the partial content includes text content, image content, table content, and so on.
- Step S3: Store the PDF file to be saved and update the record of the stored feature values.
- if the PDF file to be saved is judged to be the same as a stored PDF file, it is deleted; otherwise, it is stored and the feature values read from it are recorded against it. The PDF file and its feature values then become a stored PDF file and its stored feature values, which updates the data stored and recorded in the data processing device.
- the file number and file storage path of the PDF file to be saved are generated and recorded at the same time, which makes it easier to trace and look up the file later.
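The overall S1 to S3 flow of this first embodiment can be sketched as a list of judgment levels run in order. This is a simplified illustration under stated assumptions: feature values are plain dictionaries, the helper names `same_value` and `deduplicate` are hypothetical, and every level is treated as conclusive when it matches, which the later embodiments refine.

```python
from typing import Callable, Dict, List

# A level receives the pending feature values and the stored records and
# returns True when it finds a matching stored feature value.
Level = Callable[[Dict[str, object], List[Dict[str, object]]], bool]


def same_value(key: str) -> Level:
    """Build a level that looks for a stored record with an equal value for key."""
    def level(pending: Dict[str, object], records: List[Dict[str, object]]) -> bool:
        return any(r.get(key) == pending.get(key) for r in records)
    return level


def deduplicate(pending: Dict[str, object],
                records: List[Dict[str, object]],
                levels: List[Level]) -> str:
    """Step S2: run the levels in order. Step S3: store when none matches."""
    for level in levels:
        if level(pending, records):
            return "deleted"      # a matching stored feature value exists
    records.append(pending)       # update the record of stored feature values
    return "stored"
```

A caller would pass, for example, `[same_value("stream_md5")]` as the levels and keep `records` as the running record of stored feature values.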
- Fig. 2 is a flowchart of a second embodiment of a method for deduplication and storage of PDF files according to the present invention. As shown in Fig. 2, in this embodiment, the method for deduplication and storage of PDF files mainly includes the following steps:
- Step 11: Read the MD5 code of the file stream of the PDF file to be saved.
- the PDF file stream of the PDF file to be saved is read and converted into an MD5 code, giving the MD5 code of the PDF file stream to be saved.
- Step 12: Determine whether a stored feature value identical to the MD5 code of the PDF file stream to be saved is recorded.
- the data processing device may have stored one or more PDF files and recorded the MD5 code of each stored PDF file stream.
- if such a stored feature value is found, step 13 is executed; otherwise, the PDF file to be saved is determined to differ from the stored PDF files in the data processing device, and step 14 is executed.
- Step 13: Delete the PDF file to be saved.
- Step 14: Store the PDF file to be saved and update the record of the stored feature values.
- the PDF file to be saved is stored in the designated path and its file number is generated at the same time; the MD5 code of its file stream, its file storage path, and its file number are recorded.
- in this embodiment, the MD5 code of the PDF file stream is the object of judgment, so whether the PDF file to be saved is the same as a stored PDF file is judged from the overall content. This judgment method is simple and fast.
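Steps 11 to 14 can be sketched with Python's standard `hashlib`. The function names, the in-memory `records` dictionary, and the numbering scheme for stored files are assumptions made for illustration; the patent does not specify them.

```python
import hashlib
import shutil
from pathlib import Path


def stream_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Step 11: return the MD5 hex digest of the file's raw bytes, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_or_delete(pending: Path, store_dir: Path, records: dict) -> str:
    """Steps 12 to 14: delete the pending file on an MD5 match, otherwise store it."""
    code = stream_md5(pending)
    if code in records:                       # step 12: identical stream MD5 recorded
        pending.unlink()                      # step 13: delete the duplicate
        return "deleted"
    file_no = len(records) + 1                # step 14: generate a file number
    target = store_dir / f"{file_no}.pdf"     # hypothetical designated path
    shutil.move(str(pending), str(target))
    records[code] = {"file_no": file_no, "path": str(target)}
    return "stored"
```

Because the digest covers every byte of the file, two PDFs with identical visible content but different metadata or object ordering hash differently, which is why the later embodiments add content-level comparisons.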
- FIG. 3 is a flowchart of a third embodiment of a method for deduplication and storage of a PDF file of the present invention. The difference between this embodiment and the previous embodiment is that this embodiment compares partial content. As shown in Figure 3, in this embodiment, the method for deduplication and storage of PDF files mainly includes the following steps:
- Step 21: Read the MD5 code of the text content in the PDF file to be saved.
- the text content of the PDF file to be saved is read and converted into an MD5 code, giving the MD5 code of the text content in the PDF file to be saved.
- other content in the PDF file to be saved, such as pictures, table content, and other objects, can be read at the same time.
- Step 22: Determine whether a stored feature value identical to the MD5 code of the text content in the PDF file to be saved is recorded.
- if the data processing device has recorded an MD5 code of the text content of a stored PDF file that is identical to the MD5 code of the text content in the PDF file to be saved, the text content of the PDF file to be saved is judged to be the same as the text content of the corresponding stored PDF file, and step 23 is executed for further judgment. Otherwise, the text content of the PDF file to be saved is judged to differ from the text content of every stored PDF file in the data processing device, and step 24 is executed.
- Step 23: Determine whether other content in the file corresponding to the stored feature value is the same as other content in the PDF file to be saved.
- one MD5 code of stored text content can correspond to one or more stored PDF files.
- the other content (everything except the text content) of the PDF file to be saved is compared with the other content of the one or more stored PDF files, where the other content includes pictures, table content, and other objects.
- the other content can be read in step 21 or just before this step is performed. Understandably, the two are judged to be the same only when their other content is completely identical; otherwise, they are judged to be different. For example, if the pictures in the two files differ only in zoom ratio, the files are still judged to be different.
- if the other content is the same, step 25 is executed. Understandably, as soon as the PDF file to be saved is judged to be the same as one stored PDF file, it is only necessary to perform step 25 and end. Otherwise, step 26 is executed; that is, step 26 is executed when the other content of every candidate stored PDF file differs from the other content of the PDF file to be saved.
- Step 24: Store the PDF file to be saved and update the record of the stored feature values.
- Step 25: Delete the PDF file to be saved.
- Step 26: Store the stored feature value that is identical to the MD5 code of the text content in the PDF file to be saved in the suspected duplicate temporary area, and make a further judgment.
- in this embodiment, several partial-content comparisons are combined to determine whether the overall content of the two files is the same. The partial content includes text content and other content, where the other content includes pictures, table content, and other objects. Comparing each kind of content with its counterpart improves the accuracy of the judgment.
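The decision logic of steps 22 to 26 can be sketched as follows. The sketch assumes the text and the other objects have already been extracted from each PDF (extraction itself needs a PDF library and is out of scope here); the dictionary keys and the function names are hypothetical.

```python
import hashlib


def text_md5(text: str) -> str:
    """Step 21: MD5 code of the extracted text content."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def judge_by_text(pending: dict, records: list) -> tuple:
    """Return (decision, suspects) following steps 22 to 26.

    pending: {"text": str, "others": list of bytes}
    records: list of {"text_md5": str, "others": list of bytes}
    """
    code = text_md5(pending["text"])
    matches = [r for r in records if r["text_md5"] == code]       # step 22
    if not matches:
        return "store", []                                         # step 24
    if any(r["others"] == pending["others"] for r in matches):    # step 23
        return "delete", []                                        # step 25
    return "further_judgment", matches                             # step 26
```

The `others` lists are compared for exact equality, mirroring the rule that a picture differing only in zoom ratio still counts as different; the returned `matches` play the role of the suspected duplicate temporary area.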
- FIG. 4 is a flowchart of a fourth embodiment of a method for deduplication and storage of a PDF file of the present invention. The difference from the previous embodiment is that this embodiment compares the overall content with the partial content. As shown in FIG. 4, in this embodiment, the method for deduplication and storage of PDF files mainly includes the following steps:
- Step 31: Read the SIMHASH code of the text content of the PDF file to be saved and the number of pages of the PDF file to be saved.
- Step 32: Determine whether a stored feature value is recorded whose Hamming distance from the SIMHASH code of the text content in the PDF file to be saved is within a preset range.
- the code distance, also known as the Hamming distance, is the number of bit positions at which two codes differ. Texts whose SIMHASH codes have a Hamming distance within 3 are generally considered highly similar.
- in this embodiment the preset range is 3, although it can also be set as needed.
- if such a stored feature value is found, step 33 is executed for further judgment; otherwise, step 34 is executed.
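A minimal sketch of the SIMHASH comparison follows. The patent does not say how the SIMHASH code is computed, so the `simhash` function below is one common unweighted variant over whitespace-separated tokens, chosen only for illustration; treating "within 3" as inclusive (distance of at most 3) is also an assumption.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Unweighted SimHash: each token votes +1 or -1 on every bit position."""
    counts = [0] * bits
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (value >> i) & 1 else -1
    # A bit is set in the fingerprint when the positive votes win.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which the two codes differ."""
    return bin(a ^ b).count("1")


def is_similar(a: int, b: int, preset_range: int = 3) -> bool:
    """Step 32 for one pair of codes: distance within the preset range."""
    return hamming_distance(a, b) <= preset_range
```

Unlike MD5, where a one-character change yields an unrelated digest, a small edit to the text flips only a few SimHash bits, which is what makes a distance threshold meaningful.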
- Step 33: Determine whether the number of pages of the file corresponding to the stored feature value is the same as the number of pages of the PDF file to be saved.
- the query in step 32 may return one or more stored PDF files.
- there are three cases. First, if the page counts of all the returned stored PDF files differ from the page count of the PDF file to be saved, the PDF file to be saved is considered different from the stored PDF files, and step 34 is executed. Second, if the page counts of all the returned stored PDF files equal the page count of the PDF file to be saved, step 35 is executed. Third, if only some of the returned stored PDF files have the same page count, the records of those with a different page count are discarded, the stored PDF files with the same page count are treated as suspected identical files, and step 35 is executed.
- Step 34 Store the to-be-saved PDF file and update the record of the stored feature value.
- Step 35 Store the stored feature value whose Hamming distance from the SIMHASH code of the text content in the PDF file to be saved is within a preset range to the suspected duplicate temporary area, and make a further judgment.
- in this embodiment, overall content and partial content are compared in combination to determine whether the PDF file to be saved is the same as a stored PDF file, which improves the accuracy of the judgment.
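Steps 32 to 35 reduce to two successive filters over the stored records. In this sketch the record layout (`simhash`, `pages`) and the function name are hypothetical, and the returned list stands in for the suspected duplicate temporary area.

```python
def judge_by_simhash(pending: dict, records: list, preset_range: int = 3) -> tuple:
    """Return (decision, suspects) following steps 32 to 35.

    pending and each record: {"simhash": int, "pages": int}
    """
    # Step 32: keep records whose SIMHASH code is within the preset range.
    near = [r for r in records
            if bin(r["simhash"] ^ pending["simhash"]).count("1") <= preset_range]
    # Step 33: of those, keep only records with the same page count;
    # records with a different page count are discarded.
    suspects = [r for r in near if r["pages"] == pending["pages"]]
    if not suspects:
        return "store", []                 # step 34
    return "further_judgment", suspects    # step 35
```

The single list filter covers all three page-count cases: no match gives an empty list, a full match keeps every near record, and a partial match drops only the records whose page count differs.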
- the further judgment mainly includes the following steps:
- Step 41: Determine whether any stored feature value is held in the suspected duplicate temporary area; if so, proceed to step 42.
- Step 42: Manually compare the file corresponding to that stored feature value with the PDF file to be saved. If they are the same, delete the PDF file to be saved; if they are not the same, store the PDF file to be saved and update the record of the stored feature values.
- the stored PDF file corresponding to the stored feature value is read from the storage of the data processing device, and a person judges whether the PDF file to be saved and the stored PDF file are the same. Manual judgment removes the defect of the earlier comparisons, in which content is judged to be different merely because pictures, tables, and other objects differ in zoom level or sharpness, and so improves the accuracy of the judgment.
- FIG. 5 is a flowchart of a fifth embodiment of a method for deduplicating and storing PDF files of the present invention.
- This embodiment is a step-by-step judgment scheme formed by combining the second, third, and fourth embodiments described above. Steps that repeat the foregoing embodiments are therefore not described in detail again.
- the method for deduplication and storage of PDF files mainly includes the following steps:
- Step S1: Read the to-be-saved feature value of the PDF file to be saved.
- the MD5 code of the PDF file stream to be saved, the MD5 and SIMHASH codes of the text content in the PDF file to be saved, and the number of pages of the PDF file to be saved are read.
- Step S21: Determine whether a stored feature value identical to the MD5 code of the PDF file stream to be saved is recorded. If so, step S29 is executed; otherwise, step S22 is executed.
- Step S29: Delete the PDF file to be saved and end the process.
- Step S22: Determine whether a stored feature value identical to the MD5 code of the text content in the PDF file to be saved is recorded. If so, step S23 is executed; otherwise, step S24 is executed.
- Step S23: Determine whether other content in the file corresponding to the stored feature value is the same as other content in the PDF file to be saved. If they are the same, step S29 is executed; otherwise, step S26 and then step S24 are executed.
- when a stored file with the same other content exists, step S29 is executed directly, no other steps are executed, and the process ends. When the other content differs, step S26 is executed first, so that the corresponding stored feature value is placed in the suspected duplicate temporary area, and then step S24 is executed. Understandably, step S24 is still needed in this case to improve the accuracy of the judgment.
- Step S24: Determine whether a stored feature value is recorded whose Hamming distance from the SIMHASH code of the text content in the PDF file to be saved is within a preset range. If so, step S25 is executed; otherwise, step S27 is executed.
- the preset range is 3; of course, it can also be set as needed.
- Step S25: Determine whether the number of pages of the file corresponding to the stored feature value is the same as the number of pages of the PDF file to be saved. If they are the same, step S26 is executed; otherwise, step S27 is executed.
- when none of the page counts are the same, the process jumps to step S27; when all of them are the same, it jumps to step S26; when only some are the same, the records with a different page count are discarded and the process goes to step S26.
- Step S26: Store the corresponding stored feature value in the suspected duplicate temporary area.
- when the process jumps from step S23 to step S26, this means storing the stored feature value that is identical to the MD5 code of the text content in the PDF file to be saved; when it jumps from step S25 to step S26, this means storing the stored feature values whose Hamming distance from the SIMHASH code of the text content in the PDF file to be saved is within the preset range.
- Step S27: Determine whether any stored feature value is held in the suspected duplicate temporary area. If so, step S28 is executed; otherwise, step S3 is executed.
- if the suspected duplicate temporary area is empty, the PDF file to be saved is considered different from the stored PDF files.
- Step S28: Manually compare the file corresponding to the stored feature value with the PDF file to be saved. If they are the same, step S29 is executed; otherwise, step S3 is executed.
- Step S3: Store the PDF file to be saved, update the record of the stored feature values, and end the process.
- the method for deduplication and storage of PDF files in this embodiment judges step by step whether the PDF file to be saved is the same as a stored PDF file. The levels compare overall content, partial content, and overall content combined with partial content in turn, which improves the accuracy of the judgment.
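The full cascade of the fifth embodiment (S21 to S29, then S3) can be sketched in one function. Everything here is illustrative: the record layout is hypothetical, the feature values are assumed to be precomputed, and the manual comparison of step S28 is modelled as a caller-supplied callback `manual_same`, since the patent leaves that step to a person.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bit positions between two SIMHASH codes."""
    return bin(a ^ b).count("1")


def deduplicate(pending: dict, records: list, manual_same, preset_range: int = 3) -> str:
    """pending and each record hold stream_md5, text_md5, others, simhash, pages."""
    # S21: an identical stream MD5 means a duplicate (S29).
    if any(r["stream_md5"] == pending["stream_md5"] for r in records):
        return "delete"
    suspects = []  # stands in for the suspected duplicate temporary area
    # S22: records with an identical text MD5.
    text_matches = [r for r in records if r["text_md5"] == pending["text_md5"]]
    if text_matches:
        # S23: identical other content as well means a duplicate (S29).
        if any(r["others"] == pending["others"] for r in text_matches):
            return "delete"
        suspects.extend(text_matches)  # S26, reached from S23
    # S24: records whose SIMHASH code is within the preset range.
    near = [r for r in records
            if hamming(r["simhash"], pending["simhash"]) <= preset_range]
    # S25: keep only those with the same page count (S26, reached from S25).
    for r in near:
        if r["pages"] == pending["pages"] and r not in suspects:
            suspects.append(r)
    # S27 and S28: manual comparison against every suspect, if any.
    if any(manual_same(pending, r) for r in suspects):
        return "delete"                # S29
    records.append(pending)            # S3: store and update the record
    return "store"
```

The cheap exact checks run first and end the process early, so the manual comparison is only reached for files that are near matches but not byte-identical.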
- FIG. 6 is a schematic structural diagram of the first embodiment of the PDF file deduplication storage system of the present invention.
- the system can be applied to data processing equipment, such as mobile phones, computers, servers, and other electronic equipment with data processing capabilities.
- the PDF file deduplication storage system 100 includes an information reading module 101, a content comparison module 102, a storage module 103, and a database 104. Understandably, the modules of the system correspond to the steps of the PDF file deduplication storage method in the first to fifth embodiments described above, so the specific steps are not described again in detail.
- the information reading module 101 is used to read the to-be-saved feature value of the PDF file to be saved.
- the information reading module 101 reads one or more to-be-saved feature values corresponding to the stored feature values of the to-be-saved PDF file.
- the information reading module 101 reads the PDF file stream of the PDF file to be saved and converts the read PDF file stream into MD5 code to obtain the MD5 code of the PDF file stream to be saved;
- the content comparison module 102 is used to determine step by step whether there is a stored feature value matching the above-mentioned feature value to be stored.
- when the content comparison module 102 determines that the database 104 has recorded a stored feature value that matches the to-be-saved feature value, it concludes that the storage module 103 already holds a stored PDF file identical to the PDF file to be saved and deletes the PDF file to be saved to avoid repeated storage. Otherwise, it concludes that the PDF file to be saved differs from the stored PDF files, notifies the storage module 103 to store the PDF file to be saved, and notifies the database 104 to record the corresponding to-be-saved feature value.
- the stepwise judgment includes two or more levels of judgment, and each level of judgment includes comparing one or more pending feature values with the corresponding one or more existing feature values.
- the comparison with the stored PDF files can proceed from the overall content of the PDF file to be saved to its partial content, where the partial content includes text content, image content, table content, and so on.
- the overall content judgment includes judging whether the MD5 code of the PDF file stream to be saved is the same as the MD5 code of a stored PDF file stream, and whether the page count of the PDF file to be saved is the same as the page count of a stored PDF file. The partial content judgment includes judging whether the MD5 code of the text content in the PDF file to be saved is the same as the MD5 code of the text content in a stored PDF file, whether the Hamming distance between the SIMHASH code of the text content in the PDF file to be saved and the SIMHASH code of the text content in a stored PDF file is within 3, and whether other content in the PDF file to be saved is the same as other content in a stored PDF file, where the other content includes pictures, tables, and other objects.
- the storage module 103 is configured to store the PDF file to be stored when there is no stored feature value matching the feature value to be stored.
- the database 104 is used to update the record of the stored feature value when the storage module 103 stores the PDF file to be stored.
- the storage module 103 stores the stored PDF file, and at the same time, the database 104 records the stored feature value corresponding to the stored PDF file.
- the stored feature value includes one or more of the MD5 code of the stored PDF file stream, the MD5 code and the SIMHASH code of the text content of the stored PDF file, and the number of pages of the stored PDF file.
- after the storage module 103 receives the notification from the content comparison module 102, it stores the PDF file to be saved in a designated path. After the database 104 receives the notification from the content comparison module 102, it records the to-be-saved feature value corresponding to the PDF file to be saved.
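The division of work among modules 101 to 104 can be sketched as four small classes. This is only an in-memory illustration with the stream MD5 as the single feature value; the class names, method names, and the `/store/` path are hypothetical and are not taken from the patent.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Database:
    """Database 104: records the stored feature values."""
    records: dict = field(default_factory=dict)   # stream MD5 -> storage path

    def has(self, code: str) -> bool:
        return code in self.records

    def update(self, code: str, path: str) -> None:
        self.records[code] = path


@dataclass
class StorageModule:
    """Storage module 103: keeps the stored PDF files."""
    files: dict = field(default_factory=dict)     # storage path -> file bytes

    def store(self, data: bytes) -> str:
        path = f"/store/{len(self.files) + 1}.pdf"   # hypothetical designated path
        self.files[path] = data
        return path


class InformationReader:
    """Information reading module 101: reads the to-be-saved feature value."""
    def read(self, data: bytes) -> str:
        return hashlib.md5(data).hexdigest()


class ContentComparator:
    """Content comparison module 102: compares, then notifies 103 and 104."""
    def __init__(self, db: Database, storage: StorageModule):
        self.db, self.storage = db, storage

    def handle(self, data: bytes, code: str) -> str:
        if self.db.has(code):
            return "deleted"             # duplicate: the pending file is dropped
        path = self.storage.store(data)  # notify the storage module
        self.db.update(code, path)       # notify the database
        return "stored"
```

Keeping the feature records in a separate `Database` object mirrors the description, where the comparison consults only recorded feature values and never has to open the stored files themselves.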
- the present invention reads the to-be-saved feature value of the PDF file to be saved and compares it with the stored feature values to determine whether the PDF file to be saved is the same as a stored PDF file; when it is not, the PDF file to be saved is stored. As a result, only non-duplicate PDF files are stored, which saves file storage resources, spares users from browsing duplicate files, and improves the user experience.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A method and system for deduplicating and storing PDF files. The method comprises: reading a feature value to be stored of a PDF file to be stored (S1); determining, stage by stage, whether there is a stored feature value matching the feature value (S2); and if not, storing the PDF file and updating a record containing the stored feature value (S3). In the method, a feature value to be stored of a PDF file to be stored is read, and then a comparison operation is performed to determine whether the feature value matches a stored feature value so as to determine whether the PDF file is the same as a stored PDF file. If the PDF file is not the same as a stored PDF file, the PDF file is stored. The invention ensures that only non-duplicate PDF files are stored, thereby saving file storage resources, preventing users from browsing duplicate files, and improving user experience.
Description
The invention relates to the field of data processing, and more specifically, to a method and system for deduplication and storage of PDF files.
With the continuous development of the information age, people increasingly use electronic files when learning and exchanging information. Among the many electronic file formats, PDF is chosen by more and more users because its content is not easily modified and it keeps high fidelity, without distortion, when scaled.
As the number of PDF files keeps growing, stored collections of PDF files come to contain pairs of files with different file names but the same content, or with the same file name but different content. This causes confusion and inconvenience in learning and information exchange, and also wastes storage resources.
The technical problem to be solved by the present invention is to provide a method and system for deduplication and storage of PDF files, addressing the defect in the prior art that it is difficult to tell whether stored PDF files are the same.
The technical solution adopted by the present invention to solve this technical problem is to construct a method for deduplicating and storing PDF files, comprising:
S1: reading a to-be-stored feature value of a to-be-stored PDF file;
S2: determining, stage by stage, whether a stored feature value matching the to-be-stored feature value is recorded, and if not, executing step S3;
S3: storing the to-be-stored PDF file and updating the record of stored feature values.
Preferably, the to-be-stored feature value includes the MD5 code of the to-be-stored PDF file stream;
the stage-by-stage determination in step S2 includes:
S21: determining whether a stored feature value identical to the MD5 code of the to-be-stored PDF file stream is recorded, and if so, executing step S29;
S29: deleting the to-be-stored PDF file.
Preferably, the to-be-stored feature value further includes the MD5 code of the text content of the to-be-stored PDF file;
when, in step S21, no record of a stored feature value identical to the MD5 code of the to-be-stored PDF file stream is found, the stage-by-stage determination in step S2 further includes:
S22: determining whether a stored feature value identical to the MD5 code of the text content of the to-be-stored PDF file is recorded, and if so, executing step S23;
S23: determining whether the other content of the file corresponding to that stored feature value is the same as the other content of the to-be-stored PDF file, and if so, executing step S29.
Preferably, the to-be-stored feature value further includes the SIMHASH code of the text content of the to-be-stored PDF file and the page count of the to-be-stored PDF file;
when, in step S22, no stored feature value identical to the MD5 code of the text content of the to-be-stored PDF file is found, or when, in step S23, the other content of the file corresponding to the stored feature value is determined to differ from the other content of the to-be-stored PDF file, the stage-by-stage determination in step S2 further includes:
S24: determining whether a stored feature value whose Hamming distance from the SIMHASH code of the text content of the to-be-stored PDF file is within a preset range is recorded, and if so, executing step S25;
S25: determining whether the page count of the file corresponding to that stored feature value is the same as the page count of the to-be-stored PDF file, and if so, executing step S26 and making a further determination;
S26: storing the corresponding stored feature value in a suspected-duplicate area, wherein the corresponding stored feature value is the stored feature value whose Hamming distance from the SIMHASH code of the text content of the to-be-stored PDF file is within the preset range.
Preferably, when, in step S23, the other content of the file corresponding to the stored feature value is determined to differ from the other content of the to-be-stored PDF file, the method further includes:
executing step S26 and making a further determination;
wherein the corresponding stored feature value is the stored feature value identical to the MD5 code of the text content of the to-be-stored PDF file.
Preferably, the further determination specifically includes:
S27: determining whether any stored feature value is held in the suspected-duplicate temporary area, and if so, executing step S28;
S28: manually comparing the file corresponding to the stored feature value with the to-be-stored PDF file; if they are the same, executing step S29, and otherwise executing step S3.
Preferably, the preset range is 3.
Preferably, step S3 further includes:
generating and recording a file number and a file storage path for the to-be-stored PDF file.
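The stage-by-stage determination S21 to S29 summarized above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation described by the invention: the `Features` type, the in-memory list of stored records, and the `same_other_content` and `reviewer_says_same` callbacks are hypothetical names introduced here, feature extraction from the PDF is assumed to have happened already, and "within the preset range" is read as a Hamming distance of at most 3.

```python
from dataclasses import dataclass
from typing import Callable, List

PRESET_RANGE = 3  # preset Hamming-distance range stated in the text


@dataclass(frozen=True)
class Features:
    """Feature values of one PDF file (hypothetical container)."""
    stream_md5: str
    text_md5: str
    text_simhash: int
    page_count: int


def hamming(a: int, b: int) -> int:
    """Number of bit positions at which two codes differ."""
    return bin(a ^ b).count("1")


def decide(
    pending: Features,
    stored: List[Features],
    same_other_content: Callable[[Features, Features], bool],
    reviewer_says_same: Callable[[Features, Features], bool],
) -> str:
    """Return "delete" (S29) or "store" (S3) for the pending file."""
    # S21: an identical file-stream MD5 means the whole file is a duplicate.
    if any(r.stream_md5 == pending.stream_md5 for r in stored):
        return "delete"

    suspects: List[Features] = []  # stands in for the suspected-duplicate area

    # S22/S23: same text MD5, then compare the non-text content.
    text_matches = [r for r in stored if r.text_md5 == pending.text_md5]
    if any(same_other_content(r, pending) for r in text_matches):
        return "delete"
    suspects.extend(text_matches)  # S26 for text-identical records

    # S24/S25/S26: near-identical text by SIMHASH, screened by page count.
    for r in stored:
        close = hamming(r.text_simhash, pending.text_simhash) <= PRESET_RANGE
        if close and r.page_count == pending.page_count and r not in suspects:
            suspects.append(r)

    # S27/S28: manual comparison of every suspect (modelled as a callback).
    if any(reviewer_says_same(r, pending) for r in suspects):
        return "delete"
    return "store"
```

The cheap exact check runs first and the manual check runs last, so most duplicates are rejected without any human involvement.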
The present invention also constructs a system for deduplicating and storing PDF files, comprising:
an information reading module, configured to read a to-be-stored feature value of a to-be-stored PDF file;
a content comparison module, configured to determine, stage by stage, whether a stored feature value matching the to-be-stored feature value is recorded;
a storage module, configured to store the to-be-stored PDF file when no stored feature value matching the to-be-stored feature value is recorded;
a database, configured to update the record of stored feature values when the storage module stores the to-be-stored PDF file.
Preferably, the to-be-stored feature value includes:
the MD5 code of the to-be-stored PDF file stream, the MD5 code and the SIMHASH code of the text content of the to-be-stored PDF file, and the page count of the to-be-stored PDF file.
Implementing the PDF file deduplication and storage method and system of the present invention has the following beneficial effects:
The to-be-stored feature value of a to-be-stored PDF file is read and compared against the stored feature values to determine whether the to-be-stored PDF file is the same as a stored PDF file, and the to-be-stored PDF file is stored only when it differs from the stored PDF files. As a result, only non-duplicate PDF files are stored, which saves file storage resources, spares users from browsing duplicate files, and improves the user experience.
The present invention is further described below with reference to the accompanying drawings and embodiments. In the drawings:
FIG. 1 is a flowchart of a first embodiment of the PDF file deduplication and storage method of the present invention;
FIG. 2 is a flowchart of a second embodiment of the PDF file deduplication and storage method of the present invention;
FIG. 3 is a flowchart of a third embodiment of the PDF file deduplication and storage method of the present invention;
FIG. 4 is a flowchart of a fourth embodiment of the PDF file deduplication and storage method of the present invention;
FIG. 5 is a flowchart of a fifth embodiment of the PDF file deduplication and storage method of the present invention;
FIG. 6 is a schematic structural diagram of the PDF file deduplication and storage system of the present invention.
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
FIG. 1 is a flowchart of the first embodiment of the PDF file deduplication and storage method of the present invention. The method of this embodiment can be applied in a data processing device, for example an electronic device with data processing capability such as a mobile phone, a computer, or a server. As shown in FIG. 1, in this embodiment the method mainly includes the following steps:
Step S1: Read the to-be-stored feature value of the to-be-stored PDF file.
In practice, when a user stores a newly added PDF file through the data processing device, the device treats the new file as the to-be-stored PDF file and reads its to-be-stored feature value. By determining whether that feature value matches the stored feature value of a stored PDF file, the device determines whether the to-be-stored PDF file is the same as a stored PDF file, and thus whether to store it.
It will be understood that, before receiving the to-be-stored PDF file, the data processing device may already hold PDF files, referred to as stored PDF files, and that for each stored PDF file it records the corresponding stored feature values. The stored feature values include one or more of: the MD5 code of the stored PDF file stream, the MD5 code and the SIMHASH code of the text content of the stored PDF file, and the page count of the stored PDF file. The file number and file storage path of each stored PDF file are also recorded.
Correspondingly, when a to-be-stored PDF file is received, the data processing device reads the one or more to-be-stored feature values that correspond to the stored feature values. That is, when the stored feature values include the MD5 code of the stored PDF file stream, the device reads the MD5 code of the to-be-stored PDF file stream; when the stored feature values include both the MD5 code of the stored PDF file stream and the MD5 code of the text content of the stored PDF file, the device reads the MD5 code of the to-be-stored PDF file stream and the MD5 code of its text content; and so on. The to-be-stored feature values read therefore correspond to the stored feature values, which facilitates the later determination of whether the device has recorded a matching stored feature value.
It will be understood that, where the data processing device holds a stored PDF file with several stored feature values recorded for it, the device may, on receiving a to-be-stored PDF file, read all of the corresponding to-be-stored feature values at once and cache them, then fetch each value from the cache during the later determinations. Alternatively, it may read only the feature value needed for the current determination. For example, when it needs to determine whether a stored feature value matching the MD5 code of the to-be-stored PDF file stream is recorded, it need only read the MD5 code of the to-be-stored PDF file stream.
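The two reading strategies just described, reading everything at once into a cache versus reading only the value needed for the current determination, might look like this in Python. The sketch is illustrative only: the `PendingPdf` class and its attribute names are hypothetical, and the text content is assumed to have been extracted from the PDF by some other component.

```python
import hashlib
from functools import cached_property


class PendingPdf:
    """Feature values of a to-be-stored PDF, computed on first use.

    `data` is the raw file stream; `text` is the already extracted text
    content (the extraction itself is outside this sketch).
    """

    def __init__(self, data: bytes, text: str) -> None:
        self.data = data
        self.text = text

    @cached_property
    def stream_md5(self) -> str:
        # Computed only when a determination first asks for it, then cached.
        return hashlib.md5(self.data).hexdigest()

    @cached_property
    def text_md5(self) -> str:
        return hashlib.md5(self.text.encode("utf-8")).hexdigest()

    def read_all(self) -> dict:
        """Eager variant: compute every feature value at once and cache it."""
        return {"stream_md5": self.stream_md5, "text_md5": self.text_md5}
```

With the lazy properties, a file rejected at the first stage never pays for the text hash; calling `read_all()` up front gives the eager behaviour instead.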
Step S2: Determine, stage by stage, whether a stored feature value matching the to-be-stored feature value is recorded; if not, execute step S3.
Where the data processing device holds a stored PDF file with several stored feature values recorded for it, the device, on receiving a to-be-stored PDF file, reads the several to-be-stored feature values that correspond to those stored feature values, and then determines stage by stage whether it has recorded a stored feature value matching the to-be-stored feature value.
It will be understood that, when the device determines that it has recorded a stored feature value matching the to-be-stored feature value, it concludes that it already holds a stored PDF file identical to the to-be-stored PDF file and deletes the to-be-stored PDF file to avoid storing it twice; otherwise, it concludes that the to-be-stored PDF file differs from the stored PDF files and stores it.
Specifically, the stage-by-stage determination comprises two or more stages, and each stage compares one or more to-be-stored feature values with the corresponding one or more stored feature values. Depending on the nature of each feature value, the stages may compare the to-be-stored PDF file with the stored PDF files first on overall content and then on partial content, where partial content includes text content, image content, table content, and the like.
In this embodiment, the stage-by-stage approach means that when the overall content of the to-be-stored PDF file is found to be the same as that of a stored PDF file, no comparison of partial content is needed, which speeds up the determination. In addition, each stage compares a different feature value, so the two files are compared by several different methods, which makes the result more reliable and prevents duplicate storage.
Step S3: Store the to-be-stored PDF file and update the record of stored feature values.
Specifically, when the data processing device is determined to have recorded a stored feature value matching the to-be-stored feature value, the to-be-stored PDF file is regarded as identical to a stored PDF file and is deleted. Otherwise, the to-be-stored PDF file is regarded as different from the stored PDF files; it is stored, the to-be-stored feature values read for it are recorded, and the file and its feature values are thereafter treated as a stored PDF file and its stored feature values, thereby updating the data stored and recorded in the data processing device.
It will be understood that, when the to-be-stored PDF file is determined to differ from the stored PDF files, a file number and a file storage path are also generated and recorded for it, which makes later tracing and lookup of the file easier.
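As a rough illustration of step S3, the following Python sketch writes a non-duplicate file and records its feature value, file number, and storage path. The text does not say how numbers or paths are generated, so the sequential numbering, the `NNNNNNNN.pdf` naming, the `Record` type, and the dictionary used as the record store are all assumptions made for this example.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Dict


@dataclass
class Record:
    """One entry in the record of stored feature values (hypothetical layout)."""
    file_number: int
    storage_path: Path
    stream_md5: str


def store_pdf(data: bytes, root: Path, index: Dict[str, Record]) -> Record:
    """Write a non-duplicate PDF under `root` and record its feature value."""
    digest = hashlib.md5(data).hexdigest()
    number = len(index) + 1              # assumed numbering scheme
    path = root / f"{number:08d}.pdf"    # assumed path layout
    path.write_bytes(data)
    record = Record(number, path, digest)
    index[digest] = record               # the pending value becomes a stored value
    return record
```

Keying the index by the stream MD5 makes the next file's first-stage lookup a single dictionary access.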
FIG. 2 is a flowchart of the second embodiment of the PDF file deduplication and storage method of the present invention. As shown in FIG. 2, in this embodiment the method mainly includes the following steps:
Step 11: Read the MD5 code of the to-be-stored PDF file stream.
Specifically, when a to-be-stored PDF file is received, its PDF file stream is read and an MD5 code is computed from that stream, giving the MD5 code of the to-be-stored PDF file stream.
Step 12: Determine whether a stored feature value identical to the MD5 code of the to-be-stored PDF file stream is recorded.
It will be understood that, before receiving the to-be-stored PDF file, the data processing device may already hold one or more stored PDF files and have recorded the MD5 code of each stored PDF file stream. After reading the MD5 code of the to-be-stored PDF file stream, the device checks whether any recorded MD5 code of a stored PDF file stream is identical to it. If so, the to-be-stored PDF file is determined to be the same as the stored PDF file to which that MD5 code corresponds, and step 13 is executed; otherwise, the to-be-stored PDF file is determined to differ from the stored PDF files in the device, and step 14 is executed.
Step 13: Delete the to-be-stored PDF file.
Step 14: Store the to-be-stored PDF file and update the record of stored feature values.
Specifically, when the to-be-stored PDF file is determined to differ from the stored PDF files, it is stored at a designated path, a file number is generated for it, and the MD5 code of its file stream, its file storage path, and its file number are recorded.
This embodiment exploits the properties of the MD5 code of the PDF file stream, taking it as the object of comparison to determine from the overall content whether the to-be-stored PDF file is the same as a stored PDF file. The determination is simple and fast.
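Steps 11 and 12 can be sketched with Python's standard `hashlib` module. The function names and the use of a plain set for the recorded MD5 codes are assumptions for illustration; the stream is hashed in chunks so that large files need not be held in memory.

```python
import hashlib
import io
from typing import BinaryIO, Set


def stream_md5(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Step 11: MD5 code of a PDF file stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def is_duplicate_stream(stream: BinaryIO, stored_md5s: Set[str]) -> bool:
    """Step 12: is an identical stream MD5 already recorded?"""
    return stream_md5(stream) in stored_md5s


# Usage: an in-memory stream stands in for an uploaded file.
known = {stream_md5(io.BytesIO(b"%PDF-1.4 first"))}
print(is_duplicate_stream(io.BytesIO(b"%PDF-1.4 first"), known))   # True
print(is_duplicate_stream(io.BytesIO(b"%PDF-1.4 second"), known))  # False
```

Because the code covers every byte of the stream, two files that differ in any byte produce different values under this check even if their visible content is the same, which is the gap the later embodiments address.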
FIG. 3 is a flowchart of the third embodiment of the PDF file deduplication and storage method of the present invention. This embodiment differs from the previous one in that it compares partial content. As shown in FIG. 3, in this embodiment the method mainly includes the following steps:
Step 21: Read the MD5 code of the text content of the to-be-stored PDF file.
Specifically, when a to-be-stored PDF file is received, its text content is read and an MD5 code is computed from that text, giving the MD5 code of the text content of the to-be-stored PDF file. It will be understood that the other content of the to-be-stored PDF file, such as images, table content, and other objects, may be read at the same time.
Step 22: Determine whether a stored feature value identical to the MD5 code of the text content of the to-be-stored PDF file is recorded.
It will be understood that, if the data processing device is found to have recorded an MD5 code of the text content of a stored PDF file that is identical to the MD5 code of the text content of the to-be-stored PDF file, the text content of the to-be-stored PDF file is determined to be the same as that of the corresponding stored PDF file, and step 23 is executed for a further determination. Otherwise, the text content of the to-be-stored PDF file is determined to differ from that of the stored PDF files in the device, and step 24 is executed.
Step 23: Determine whether the other content of the file corresponding to the stored feature value is the same as the other content of the to-be-stored PDF file.
It will be understood that this step further determines whether the other content of the stored PDF file corresponding to the matched text-content MD5 code is the same as the other content of the to-be-stored PDF file. One text-content MD5 code may correspond to one or more stored PDF files, so the to-be-stored PDF file is compared with each of those stored PDF files on content other than text, where the other content includes images, table content, and other objects.
It will be understood that the other content may be read in step 21, or read just before this step is executed. The two files are determined to be the same only when their other content is exactly the same; otherwise they are determined to be different. For example, if the images in the two files differ only in scaling, the files are still determined to be different.
If the other content of the two files is also the same, the overall content of the to-be-stored PDF file is determined to be the same as that of the stored PDF file, and step 25 is executed. It will be understood that as soon as the to-be-stored PDF file is determined to be the same as any one stored PDF file, only step 25 needs to be executed, and the process ends. Otherwise, step 26 is executed; it will be understood that step 26 is executed only when the other content of every one of the stored PDF files differs from that of the to-be-stored PDF file.
Step 24: Store the to-be-stored PDF file and update the record of stored feature values.
Step 25: Delete the to-be-stored PDF file.
Step 26: Store the stored feature value that is identical to the MD5 code of the text content of the to-be-stored PDF file in the suspected-duplicate temporary area, and make a further determination.
It will be understood that, when the to-be-stored PDF file has the same text content as a stored PDF file but different other content, the two files are regarded as suspected duplicates and a further determination is needed.
This embodiment determines whether the overall content of the two files is the same by comparing several kinds of partial content in turn. The partial content comprises text content and other content, the other content including images, table content, and other objects. Comparing each kind of partial content against its counterpart improves the accuracy of the determination.
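The branching of steps 22 to 26 can be illustrated with a small Python sketch. It assumes the PDF has already been parsed into its text and a list of non-text objects; the `ParsedPdf` type, the byte-for-byte comparison of objects, and the returned labels are hypothetical choices for this example rather than details given in the text.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


@dataclass
class ParsedPdf:
    """Already extracted content of one PDF (extraction is out of scope)."""
    text: str
    other_objects: List[bytes] = field(default_factory=list)  # images, tables, ...

    @property
    def text_md5(self) -> str:
        return hashlib.md5(self.text.encode("utf-8")).hexdigest()


def compare_partial(pending: ParsedPdf, stored: List[ParsedPdf]) -> str:
    """Return "store" (step 24), "delete" (step 25) or "suspect" (step 26)."""
    # Step 22: look for stored files whose text-content MD5 is identical.
    matches = [s for s in stored if s.text_md5 == pending.text_md5]
    if not matches:
        return "store"
    # Step 23: one exact match on the other content is enough to delete.
    if any(s.other_objects == pending.other_objects for s in matches):
        return "delete"
    # Text is identical but every candidate differs in its other content.
    return "suspect"
```

A rescaled image yields different bytes, so such a pair lands in the "suspect" branch, matching the scaling example given above.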
FIG. 4 is a flowchart of the fourth embodiment of the PDF file deduplication and storage method of the present invention. It differs from the previous embodiment in that it compares overall content in combination with partial content. As shown in FIG. 4, in this embodiment the method mainly includes the following steps:
Step 31: Read the SIMHASH code of the text content of the to-be-stored PDF file and the page count of the to-be-stored PDF file.
Specifically, the text content of the to-be-stored PDF file is read and a SIMHASH code is computed from it, giving the SIMHASH code of the text content; and the PDF file stream of the to-be-stored PDF file is read to obtain its page count.
Step 32: Determine whether a stored feature value whose Hamming distance from the SIMHASH code of the text content of the to-be-stored PDF file is within a preset range is recorded.
It will be understood that, in information coding, the number of bit positions at which two valid codes differ is called the code distance, also known as the Hamming distance. Texts whose SIMHASH codes are within a Hamming distance of 3 are generally regarded as highly similar. In this embodiment the preset range is 3, although it may of course be set to another value as needed.
Specifically, when the data processing device is found to have recorded a stored feature value whose Hamming distance from the SIMHASH code of the text content of the to-be-stored PDF file is within 3, the device is regarded as holding a stored PDF file whose text content is highly similar to that of the to-be-stored PDF file, and step 33 is executed for a further determination. Otherwise, step 34 is executed.
Step 33: Determine whether the page count of the file corresponding to the stored feature value is the same as the page count of the to-be-stored PDF file.
It will be understood that this step further determines whether the page count of each file found through its text-content SIMHASH code is the same as the page count of the to-be-stored PDF file. The query may return one or more stored PDF files. When it returns several, there are three cases. First, if every such stored PDF file has a page count different from that of the to-be-stored PDF file, the to-be-stored PDF file is regarded as different from the stored PDF files, and step 34 is executed. Second, if every such stored PDF file has the same page count as the to-be-stored PDF file, step 35 is executed. Third, if some have the same page count and others do not, the records of those with a different page count are discarded, the stored PDF files with the same page count are treated as suspected duplicates, and step 35 is executed.
Step 34: Store the to-be-stored PDF file and update the record of stored feature values.
Step 35: Store the stored feature value whose Hamming distance from the SIMHASH code of the text content of the to-be-stored PDF file is within the preset range in the suspected-duplicate temporary area, and make a further determination.
This embodiment determines whether the to-be-stored PDF file is the same as a stored PDF file by comparing overall content in combination with partial content, which improves the accuracy of the determination.
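Steps 31 to 35 can be illustrated with the Python sketch below. The text does not specify how the SIMHASH code is computed, so `simhash64` is only one common construction (unweighted whitespace tokens, 64 bits, token hashes taken from MD5) chosen as an assumption here; the tuple layout of the stored records is likewise hypothetical, and "within the preset range" is read as a distance of at most 3.

```python
import hashlib
from typing import List, Tuple


def simhash64(text: str) -> int:
    """A basic 64-bit SimHash over whitespace-separated tokens."""
    votes = [0] * 64
    for token in text.split():
        # Use the first 8 bytes of the token's MD5 as its 64-bit hash.
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for bit in range(64):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    # A bit is set in the result when more tokens voted 1 than 0.
    return sum(1 << bit for bit in range(64) if votes[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions at which two codes differ."""
    return bin(a ^ b).count("1")


def find_suspects(
    pending_hash: int,
    pending_pages: int,
    stored: List[Tuple[str, int, int]],  # (file id, simhash, page count)
    preset_range: int = 3,
) -> List[str]:
    """Steps 32 and 33: keep near-identical texts with the same page count."""
    return [
        file_id
        for file_id, h, pages in stored
        if hamming(h, pending_hash) <= preset_range and pages == pending_pages
    ]
```

An empty result corresponds to step 34 (store the file); a non-empty result corresponds to step 35, with the returned identifiers placed in the suspected-duplicate temporary area. Filtering on page count inside the same pass covers all three cases listed for step 33.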
Further, in the third and fourth embodiments above, the further determination mainly includes the following steps:
Step 41: Determine whether any stored feature value is held in the suspected-duplicate temporary area; if so, execute step 42.
Step 42: Manually compare the file corresponding to the stored feature value with the to-be-stored PDF file. If they are the same, delete the to-be-stored PDF file; if not, store the to-be-stored PDF file and update the record of stored feature values.
Specifically, the stored PDF file corresponding to each stored feature value held in the suspected-duplicate temporary area is read from the storage of the data processing device, and a person judges whether the to-be-stored PDF file is the same as that stored PDF file. Manual judgment overcomes the shortcoming of the preceding determinations, in which files are judged to differ merely because objects such as images and tables differ in scaling or sharpness, and so improves the accuracy of the determination.
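In software, the manual comparison of steps 41 and 42 would typically appear as a callback or review queue. The sketch below models the human decision as a function argument; the function name, the use of file identifiers, and the choice to store the file when the temporary area is empty are assumptions for illustration.

```python
from typing import Callable, Iterable


def resolve_suspects(
    suspect_ids: Iterable[str],
    reviewer_says_same: Callable[[str], bool],
) -> str:
    """Return "delete" if the reviewer confirms any suspect, else "store"."""
    # Step 41: an empty temporary area yields no iterations (assumed: store).
    for file_id in suspect_ids:
        # Step 42: a person compares this stored file with the pending one.
        if reviewer_says_same(file_id):
            return "delete"
    return "store"
```

For example, `resolve_suspects(["0007", "0012"], lambda file_id: file_id == "0012")` returns `"delete"`, because the simulated reviewer confirms the second suspect.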
Figure 5 is a flowchart of the fifth embodiment of the PDF file deduplication and storage method of the present invention. This embodiment is a step-by-step judgment scheme formed by combining the second, third, and fourth embodiments above, so steps that repeat those embodiments are not described in detail again.
As shown in Figure 5, in this embodiment the PDF file deduplication and storage method mainly includes the following steps:
Step S1: Read the feature values to be saved of the PDF file to be saved.
Specifically, the following are read: the MD5 code of the file stream of the PDF file to be saved, the MD5 code and the SIMHASH code of its text content, and its page count.
Step S21: Determine whether a stored feature value identical to the MD5 code of the file stream of the PDF file to be saved has been recorded. If so, perform step S29; otherwise, perform step S22.
Step S29: Delete the PDF file to be saved and end the process.
Step S22: Determine whether a stored feature value identical to the MD5 code of the text content of the PDF file to be saved has been recorded. If so, perform step S23; otherwise, perform step S24.
Step S23: Determine whether the other content of the file corresponding to the stored feature value is the same as the other content of the PDF file to be saved. If so, perform step S29; otherwise, perform step S26 and then step S24.
Understandably, in this step, when a file with identical other content exists, step S29 is performed directly, no other steps are performed, and the process ends. When none is identical, step S26 is performed first, so that the corresponding stored feature value is placed in the suspected-duplicate area, and step S24 is then performed to improve the accuracy of the judgment.
Step S24: Determine whether a stored feature value has been recorded whose Hamming distance from the SIMHASH code of the text content of the PDF file to be saved is within the preset range. If so, perform step S25; otherwise, perform step S27.
In this embodiment the preset range is 3; it can of course be set to another value as needed.
Step S25: Determine whether the page count of the file corresponding to the stored feature value is the same as the page count of the PDF file to be saved. If so, perform step S26; otherwise, perform step S27.
Understandably, in this step, when none of the page counts match, the flow jumps to step S27; when all of them match, it jumps to step S26; when some match and others do not, the non-matching ones are discarded and the flow jumps to step S26.
Step S26: Store the corresponding stored feature value in the suspected-duplicate area.
Understandably, when step S26 is reached from step S23, the stored feature value identical to the MD5 code of the text content of the PDF file to be saved is stored in the suspected-duplicate temporary area; when step S26 is reached from step S25, the stored feature values whose Hamming distance from the SIMHASH code of the text content of the PDF file to be saved is within the preset range are stored in the suspected-duplicate temporary area.
Step S27: Determine whether the suspected-duplicate temporary area holds any stored feature values. If so, perform step S28; otherwise, perform step S3.
Understandably, when the suspected-duplicate temporary area holds no stored feature value, the PDF file to be saved is considered different from every stored PDF file.
Step S28: Manually compare whether the file corresponding to the stored feature value is the same as the PDF file to be saved. If so, perform step S29; otherwise, perform step S3.
Step S3: Store the PDF file to be saved, update the record of stored feature values, and end the process.
The PDF file deduplication and storage method of this embodiment judges level by level whether the PDF file to be saved is the same as a stored PDF file. The individual levels compare the overall content, the partial content, and the overall content combined with the partial content, which improves the accuracy of the judgment.
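The control flow of steps S21 to S3 can be sketched as below. The sketch makes several assumptions that the text does not state: the `Record` layout, the `other_content` field (a stand-in for whatever comparison of pictures, tables, and other objects is used), the `manual_same` callback that models the human reviewer of step S28, and the inclusive distance threshold.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Record:
    file_no: int
    stream_md5: str
    text_md5: str
    text_simhash: int
    pages: int
    other_content: tuple = ()   # hypothetical digests of pictures, tables, other objects


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def deduplicate(pending: Record, stored: List[Record],
                manual_same: Callable[[Record, Record], bool],
                max_distance: int = 3) -> str:
    """Return "deleted" (step S29) or "stored" (step S3) for the pending file."""
    suspects: List[Record] = []   # suspected-duplicate temporary area

    # S21: an identical file stream means the file is a duplicate.
    if any(r.stream_md5 == pending.stream_md5 for r in stored):
        return "deleted"          # S29

    # S22/S23: identical text MD5, then compare the other content.
    same_text = [r for r in stored if r.text_md5 == pending.text_md5]
    if any(r.other_content == pending.other_content for r in same_text):
        return "deleted"          # S29
    suspects.extend(same_text)    # S26, reached from S23

    # S24/S25: near-identical text by SimHash, keeping only equal page counts.
    for r in stored:
        if (hamming(r.text_simhash, pending.text_simhash) <= max_distance
                and r.pages == pending.pages and r not in suspects):
            suspects.append(r)    # S26, reached from S25

    # S27/S28: a reviewer compares the pending file with every suspect.
    if any(manual_same(pending, r) for r in suspects):
        return "deleted"          # S29

    stored.append(pending)        # S3: store the file and update the records
    return "stored"
```

Passing the reviewer as a callback keeps the automatic levels testable on their own; in a real deployment step S28 would more likely be an asynchronous review queue than a blocking function call.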
Figure 6 is a schematic structural diagram of the first embodiment of the PDF file deduplication storage system of the present invention. The system can be applied in a data processing device, for example a mobile phone, computer, server, or other electronic device with data processing capability.
As shown in Figure 6, the PDF file deduplication storage system 100 includes an information reading module 101, a content comparison module 102, a storage module 103, and a database 104. The modules of the system correspond to the PDF file deduplication and storage methods of the first to fifth embodiments above, so the specific steps are not described again.
The information reading module 101 is used to read the feature values to be saved of the PDF file to be saved.
Understandably, when the data processing device receives a PDF file to be saved, the information reading module 101 reads one or more feature values to be saved that correspond to the stored feature values. Specifically, the information reading module 101 reads the file stream of the PDF file to be saved and converts it into an MD5 code, giving the MD5 code of the file stream; it reads the text content of the PDF file to be saved and converts it into an MD5 code and a SIMHASH code, giving the MD5 code and the SIMHASH code of the text content; and it reads the page count and the other content of the PDF file to be saved, where the other content includes pictures, tables, and other objects.
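A minimal sketch of what the information reading module computes is shown below. It assumes the text and page count have already been extracted from the PDF by some other component, and the SimHash construction (whitespace tokens, 64 bits, MD5 as the per-token hash) is one common variant chosen for illustration; the text does not specify how the SIMHASH code is built, and whitespace tokenization would need replacing for Chinese text.

```python
import hashlib
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Features:
    stream_md5: str    # MD5 of the raw PDF byte stream
    text_md5: str      # MD5 of the extracted text content
    text_simhash: int  # 64-bit SimHash of the extracted text content
    pages: int         # page count


def simhash64(text: str) -> int:
    """One common SimHash construction (an assumption, not the patent's)."""
    weights = [0] * 64
    for token, count in Counter(text.split()).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for bit in range(64):
            # Each token votes +count or -count on every bit position.
            weights[bit] += count if (h >> bit) & 1 else -count
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)


def read_features(pdf_bytes: bytes, text: str, pages: int) -> Features:
    """Build the feature values to be saved from already-extracted inputs."""
    return Features(
        stream_md5=hashlib.md5(pdf_bytes).hexdigest(),
        text_md5=hashlib.md5(text.encode("utf-8")).hexdigest(),
        text_simhash=simhash64(text),
        pages=pages,
    )
```

The two MD5 codes only detect exact matches, whereas the SimHash of two texts that share most tokens tends to differ in few bits, which is what makes the Hamming-distance comparison meaningful.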
The content comparison module 102 is used to judge level by level whether a stored feature value matching the feature value to be saved has been recorded.
Understandably, when the content comparison module 102 determines that the database 104 has recorded a stored feature value matching the feature value to be saved, it concludes that the storage module 103 already holds a stored PDF file identical to the PDF file to be saved, and the PDF file to be saved is deleted to avoid duplicate storage. Otherwise, it concludes that the PDF file to be saved differs from the stored PDF files, notifies the storage module 103 to store the PDF file to be saved, and notifies the database 104 to store the corresponding feature values to be saved.
Specifically, the level-by-level judgment comprises two or more levels, and each level compares one or more feature values to be saved with the corresponding one or more stored feature values. Depending on the characteristics of each feature value, the levels compare the PDF file to be saved with the stored PDF files from the overall content down to the partial content, where the partial content includes text content, picture content, table content, and so on.
Understandably, the overall-content judgment checks whether the MD5 code of the file stream of the PDF file to be saved is the same as that of a stored PDF file, and whether the two page counts are the same. The partial-content judgment checks whether the MD5 code of the text content of the PDF file to be saved is the same as that of a stored PDF file, whether the Hamming distance between the SIMHASH codes of the two text contents is within 3, and whether the other content of the two files is the same, where the other content includes pictures, tables, and other objects.
The storage module 103 is used to store the PDF file to be saved when no stored feature value matching the feature value to be saved has been recorded.
The database 104 is used to update the record of stored feature values when the storage module 103 stores the PDF file to be saved.
Understandably, before a PDF file to be saved is received, the storage module 103 holds the stored PDF files and the database 104 records the stored feature values corresponding to them. The stored feature values include one or more of: the MD5 code of the file stream of the stored PDF file, the MD5 code and the SIMHASH code of its text content, and its page count. The file number and file storage path of each stored PDF file are also recorded.
Specifically, on receiving the notification from the content comparison module 102, the storage module 103 stores the PDF file to be saved at the designated path, and the database 104 records the feature values to be saved that correspond to it.
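The cooperation between the storage module and the database can be illustrated with the following sketch. Everything concrete in it is an assumption made for the example: the use of SQLite, the table and column names, the `<file_no>.pdf` naming scheme, and storing the SIMHASH code as a hexadecimal string (chosen because an unsigned 64-bit value can exceed SQLite's signed integer range).

```python
import sqlite3
import tempfile
from pathlib import Path
from typing import Optional


def open_database() -> sqlite3.Connection:
    """Create an in-memory stand-in for database 104."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE stored_features ("
        " file_no INTEGER PRIMARY KEY AUTOINCREMENT,"
        " path TEXT NOT NULL,"
        " stream_md5 TEXT NOT NULL,"
        " text_md5 TEXT NOT NULL,"
        " text_simhash TEXT NOT NULL,"
        " pages INTEGER NOT NULL)"
    )
    return db


def store_pdf(db: sqlite3.Connection, root: Path, pdf_bytes: bytes,
              stream_md5: str, text_md5: str, text_simhash: int,
              pages: int) -> int:
    """Write the file under `root` and record its feature values."""
    cur = db.execute(
        "INSERT INTO stored_features"
        " (path, stream_md5, text_md5, text_simhash, pages)"
        " VALUES (?, ?, ?, ?, ?)",
        ("", stream_md5, text_md5, format(text_simhash, "016x"), pages),
    )
    file_no = cur.lastrowid                 # generated file number
    path = root / f"{file_no}.pdf"          # designated storage path
    path.write_bytes(pdf_bytes)
    db.execute("UPDATE stored_features SET path = ? WHERE file_no = ?",
               (str(path), file_no))
    db.commit()
    return file_no


def find_by_stream_md5(db: sqlite3.Connection, stream_md5: str) -> Optional[str]:
    """Return the storage path of a file with this stream MD5, if any."""
    row = db.execute("SELECT path FROM stored_features WHERE stream_md5 = ?",
                     (stream_md5,)).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    db = open_database()
    root = Path(tempfile.mkdtemp())
    number = store_pdf(db, root, b"%PDF-1.4 demo", "aa", "bb", 0x1F, 3)
    print(number, find_by_stream_md5(db, "aa"))
```

Keeping the feature values in a database separate from the files means the level-by-level comparison only has to open a stored PDF when a candidate has already been found.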
In the present invention, the feature values to be saved of the PDF file to be saved are read and checked against the stored feature values to determine whether the PDF file to be saved is the same as a stored PDF file, and the PDF file to be saved is stored only when it differs from the stored PDF files. As a result, only non-duplicate PDF files are stored, which saves file storage resources, spares users from browsing duplicate files, and improves the user experience.
Understandably, the above embodiments express only preferred implementations of the present invention. Their description is relatively specific and detailed, but it should not be construed as limiting the patent scope of the present invention. It should be noted that a person of ordinary skill in the art may, without departing from the concept of the present invention, freely combine the above technical features and make further variations and improvements, all of which fall within the protection scope of the present invention. Therefore, all equivalent changes and modifications made within the scope of the claims of the present invention shall be covered by those claims.
Claims (10)
- A method for deduplicating and storing PDF files, characterized by comprising:
  S1: reading the feature value to be saved of a PDF file to be saved;
  S2: judging level by level whether a stored feature value matching the feature value to be saved has been recorded, and if not, performing step S3;
  S3: storing the PDF file to be saved and updating the record of stored feature values.
- The method for deduplicating and storing PDF files according to claim 1, characterized in that:
  the feature value to be saved includes the MD5 code of the file stream of the PDF file to be saved;
  the level-by-level judgment in step S2 includes:
  S21: determining whether a stored feature value identical to the MD5 code of the file stream of the PDF file to be saved has been recorded, and if so, performing step S29;
  S29: deleting the PDF file to be saved.
- The method for deduplicating and storing PDF files according to claim 2, characterized in that:
  the feature value to be saved further includes the MD5 code of the text content of the PDF file to be saved;
  when, in step S21, no record of a stored feature value identical to the MD5 code of the file stream of the PDF file to be saved is found, the level-by-level judgment in step S2 further includes:
  S22: determining whether a stored feature value identical to the MD5 code of the text content of the PDF file to be saved has been recorded, and if so, performing step S23;
  S23: determining whether the other content of the file corresponding to the stored feature value is the same as the other content of the PDF file to be saved, and if so, performing step S29.
- The method for deduplicating and storing PDF files according to claim 3, characterized in that:
  the feature value to be saved further includes the SIMHASH code of the text content of the PDF file to be saved and the page count of the PDF file to be saved;
  when, in step S22, no stored feature value identical to the MD5 code of the text content of the PDF file to be saved is found, or when, in step S23, the other content of the file corresponding to the stored feature value is judged to differ from the other content of the PDF file to be saved, the level-by-level judgment in step S2 further includes:
  S24: determining whether a stored feature value has been recorded whose Hamming distance from the SIMHASH code of the text content of the PDF file to be saved is within a preset range, and if so, performing step S25;
  S25: determining whether the page count of the file corresponding to the stored feature value is the same as the page count of the PDF file to be saved, and if so, performing step S26 and making a further judgment;
  S26: storing the corresponding stored feature value in a suspected-duplicate area, wherein the corresponding stored feature value is the stored feature value whose Hamming distance from the SIMHASH code of the text content of the PDF file to be saved is within the preset range.
- The method for deduplicating and storing PDF files according to claim 4, characterized in that:
  when, in step S23, the other content of the file corresponding to the stored feature value is judged to differ from the other content of the PDF file to be saved, the method further includes:
  performing step S26 and making a further judgment;
  wherein the corresponding stored feature value is the stored feature value identical to the MD5 code of the text content of the PDF file to be saved.
- The method for deduplicating and storing PDF files according to any one of claims 4-5, characterized in that:
  the further judgment specifically includes:
  S27: determining whether the suspected-duplicate temporary area holds any stored feature values, and if so, performing step S28;
  S28: manually comparing whether the file corresponding to the stored feature value is the same as the PDF file to be saved; if so, performing step S29; otherwise, performing step S3.
- The method for deduplicating and storing PDF files according to claim 5, characterized in that:
  the preset range is 3.
- The method for deduplicating and storing PDF files according to claim 1, characterized in that:
  step S3 further includes:
  generating and recording a file number and a file storage path for the PDF file to be saved.
- an information reading module, used to read the feature value to be saved of a PDF file to be saved;
  a content comparison module, used to judge level by level whether a stored feature value matching the feature value to be saved has been recorded;
  a storage module, used to store the PDF file to be saved when no stored feature value matching the feature value to be saved has been recorded;
  a database, used to update the record of stored feature values when the storage module stores the PDF file to be saved.
- The PDF file deduplication storage system according to claim 9, characterized in that:
  the feature value to be saved includes:
  the MD5 code of the file stream of the PDF file to be saved, the MD5 code and the SIMHASH code of the text content of the PDF file to be saved, and the page count of the PDF file to be saved.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911221955.9 | 2019-12-03 | ||
CN201911221955.9A CN111177082B (en) | 2019-12-03 | 2019-12-03 | PDF file duplicate removal storage method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021109850A1 true WO2021109850A1 (en) | 2021-06-10 |
Family
ID=70650096
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/129125 WO2021109850A1 (en) | 2019-12-03 | 2020-11-16 | Method and system for deduplicating and storing pdf files |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111177082B (en) |
WO (1) | WO2021109850A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113961549A (en) * | 2021-09-22 | 2022-01-21 | 李凤杰 | Medical data integration method and system based on data warehouse |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111177082B (en) * | 2019-12-03 | 2023-06-09 | 世强先进(深圳)科技股份有限公司 | PDF file duplicate removal storage method and system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103136243A (en) * | 2011-11-29 | 2013-06-05 | 中国电信股份有限公司 | File system duplicate removal method and device based on cloud storage |
CN103970722A (en) * | 2014-05-07 | 2014-08-06 | 江苏金智教育信息技术有限公司 | Text content duplicate removal method |
CN106569989A (en) * | 2016-10-20 | 2017-04-19 | 北京智能管家科技有限公司 | De-weighting method and apparatus for short text |
US20180365262A1 (en) * | 2014-12-10 | 2018-12-20 | International Business Machines Corporation | Method and apparatus for data deduplication |
CN109241505A (en) * | 2018-10-09 | 2019-01-18 | 北京奔影网络科技有限公司 | Text De-weight method and device |
CN111177082A (en) * | 2019-12-03 | 2020-05-19 | 世强先进(深圳)科技股份有限公司 | PDF file duplicate removal storage method and system |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101141476A (en) * | 2007-10-09 | 2008-03-12 | 创新科存储技术(深圳)有限公司 | File storing, downloading method and device |
US9058298B2 (en) * | 2009-07-16 | 2015-06-16 | International Business Machines Corporation | Integrated approach for deduplicating data in a distributed environment that involves a source and a target |
US9239843B2 (en) * | 2009-12-15 | 2016-01-19 | Symantec Corporation | Scalable de-duplication for storage systems |
US20150006475A1 (en) * | 2013-06-26 | 2015-01-01 | Katherine H. Guo | Data deduplication in a file system |
CN108038124B (en) * | 2017-11-06 | 2020-08-28 | 广东广业开元科技有限公司 | PDF document acquisition and processing method, system and device based on big data |
CN109213738B (en) * | 2018-11-20 | 2022-01-25 | 武汉理工光科股份有限公司 | Cloud storage file-level repeated data deletion retrieval system and method |
CN110413589A (en) * | 2019-07-30 | 2019-11-05 | 中国联合网络通信集团有限公司 | Approaches to IM and platform based on interspace file system |
- 2019-12-03: CN application CN201911221955.9A, patent CN111177082B, status Active
- 2020-11-16: WO application PCT/CN2020/129125, publication WO2021109850A1, status Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN111177082B (en) | 2023-06-09 |
CN111177082A (en) | 2020-05-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20895244 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 09/11/2022) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20895244 Country of ref document: EP Kind code of ref document: A1 |