CN116010362A - File storage and file reading method, device and system - Google Patents

File storage and file reading method, device and system


Publication number
CN116010362A
Authority
CN
China
Prior art keywords
file
hash
index
data
stored
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310314661.0A
Other languages
Chinese (zh)
Inventor
纪智辉
张青辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
4u Beijing Technology Co ltd
Original Assignee
4u Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 4u Beijing Technology Co ltd filed Critical 4u Beijing Technology Co ltd
Priority to CN202310314661.0A priority Critical patent/CN116010362A/en
Publication of CN116010362A publication Critical patent/CN116010362A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method, a device and a system for storing and reading files. The method comprises: obtaining a file to be stored, and performing a hash operation on the file to be stored according to a preset hash rule to obtain a hash index corresponding to the content of the file to be stored; traversing a local index file, comparing the hash index with the local index file to obtain difference data, and storing the file to be stored based on the difference data. The method and the device solve the technical problem in the prior art that similar files are stored repeatedly and therefore occupy a large amount of storage space.

Description

File storage and file reading method, device and system
Technical Field
The present invention relates to the field of file storage technologies, and in particular, to a method, an apparatus, and a system for file storage and file reading.
Background
When an APP on a mobile terminal, such as a mobile phone, receives and downloads a file sent by another person, the download is not recorded, so the same file may be stored repeatedly in subsequent downloads. When the user downloads the same file multiple times, each download generates a new file in the storage space of the mobile terminal, and no check or deduplication is performed against the files already downloaded. As a result, multiple copies of the same file exist on the device and take up unnecessary storage space.
To solve the above problem, the prior art identifies a downloaded file by its file name. If a file with the same file name is already stored in the mobile terminal, the APP either overwrites it directly or asks the user whether to overwrite it, instead of generating a new file, so that repeated storage is avoided. However, if two files downloaded by the user have the same file name but different content, the existing file is overwritten and the original file is lost. In addition, the prior art generally stores similar files in full, which occupies a large amount of storage space.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the application provides a method, a device and a system for storing and reading files, which are used for solving the problem that in the prior art, similar files are repeatedly stored to cause large occupied storage space.
According to one aspect of the embodiments of the present application, there is provided a method for storing a file, the method including: obtaining a file to be stored, and carrying out hash operation on the file to be stored according to a preset hash rule to obtain a hash index corresponding to the content of the file to be stored; traversing a local index file, comparing the hash index with the index file, obtaining difference data, and storing the file to be stored based on the difference data.
According to another aspect of the embodiments of the present application, there is also provided a method for reading a file, including: receiving a reading instruction; and searching for and reading the file corresponding to the reading instruction from among the files stored using the file storage method described above.
According to another aspect of the embodiments of the present application, there is also provided an apparatus for storing a file, including: the hash module is configured to acquire a file to be stored, and perform hash operation on the file to be stored according to a preset hash rule to obtain a hash index corresponding to the content of the file to be stored; the storage module is configured to traverse a local index file, compare the hash index with the index file, acquire difference data and store the file to be stored based on the difference data.
According to another aspect of the embodiments of the present application, there is also provided a device for reading a file, including: a receiving module configured to receive a read instruction; and the reading module is configured to search and read the file corresponding to the reading instruction from the file to be stored, which is stored by using the file storage method.
According to another aspect of the embodiments of the present application, there is also provided a system for storing and reading a file, including: a file storage device as described above and a file reading device as described above.
According to the embodiment of the application, hash operation is carried out on the file to be stored according to a preset hash rule, and a hash index corresponding to the content of the file to be stored is obtained; comparing the hash index with local index files to obtain difference data, and storing files to be stored based on the difference data, so that the technical problem that storage space occupation is large due to repeated storage of similar files in the prior art is solved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a flow chart of a method of file storage according to an embodiment of the present application;
FIG. 2 is a flow chart of another method of file storage according to an embodiment of the present application;
FIG. 3 is a flow chart of a method of acquiring discrepancy data according to an embodiment of the present application;
FIG. 4 is a flow chart of a method of constructing an index that avoids hash collisions according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a system for file storage and file reading according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments in accordance with the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present application unless it is specifically stated otherwise. Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description. Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but should be considered part of the specification where appropriate. In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of the exemplary embodiments may have different values. It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
Example 1
According to an embodiment of the present application, there is provided a method for storing a file, as shown in fig. 1, including the steps of:
Step S102, obtaining a file to be stored, and performing a hash operation on the file to be stored according to a preset hash rule to obtain a hash index corresponding to the content of the file to be stored.
First, a first hash function and a second hash function are constructed using different seed values. For example, a fixed seed value is combined with each of two randomly generated seed values to generate two different seed values; the first hash function and the second hash function are then constructed from two different original hash functions, each using one of the two different seed values.
Specifically, the two different original hash functions are varied using two mutually coprime variation parameters, and the two varied original hash functions are combined to obtain a new hash function; the new hash function is adjusted using one of the two different seed values and the length of the hash table to obtain the first hash function; likewise, the new hash function is adjusted using the other of the two different seed values and the length of the hash table to obtain the second hash function.
Next, based on the quadratic probing (square detection) method, a correction factor is determined using the load density of the hash table and the length of the hash table. First, the ratio of the load density of the hash table to the length of the hash table is calculated to determine the relation between the two; the correction factor is then determined based on that relation. For example, when the load density of the hash table is smaller than the length of the hash table, the correction factor is set to increase as the load density increases; when the load density of the hash table is greater than the length of the hash table, the correction factor is set to decrease as the load density increases.
Finally, a hash operation is performed on the file to be stored using the correction factor, the first hash function and the second hash function to obtain a hash index corresponding to the content of the file to be stored. For example, the second hash function is corrected by the correction factor, and the corrected second hash function is used to perform a hash operation on the file to be stored, while the first hash function is also used to perform a hash operation on the file to be stored; the hash result corresponding to the second hash function and the hash result corresponding to the first hash function are combined to obtain a combined hash value; and the hash index corresponding to the content of the file to be stored is generated based on the combined hash value and the length of the hash table.
Step S104, traversing the local index file, comparing the hash index with the index file, obtaining difference data, and storing the file to be stored based on the difference data.
First, the local index file is traversed. Idle task processes or task threads take directories to be scanned from the to-be-scanned directory queue in parallel and lock each directory taken; multi-process or multi-thread parallel traversal scanning is then performed on the directories to be scanned, each scanned index file is recorded in the task process or task thread, and each scanned file directory is added to the tail of the to-be-scanned directory queue.
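The queue-driven traversal described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `scan_index_files`, the `.idx` suffix and the worker count are hypothetical, and handing each directory to exactly one thread through `queue.Queue` stands in for the explicit directory lock.

```python
import os
import queue
import threading

def scan_index_files(root, suffix=".idx", workers=4):
    """Traverse `root` with several threads that share one directory queue."""
    pending = queue.Queue()            # directories waiting to be scanned
    pending.put(root)
    found = []                         # index files discovered so far
    found_lock = threading.Lock()

    def worker():
        while True:
            directory = pending.get()  # each directory goes to exactly one thread
            if directory is None:      # sentinel: no more work
                pending.task_done()
                return
            try:
                with os.scandir(directory) as entries:
                    for entry in entries:
                        if entry.is_dir(follow_symlinks=False):
                            pending.put(entry.path)   # append to the tail of the queue
                        elif entry.name.endswith(suffix):
                            with found_lock:
                                found.append(entry.path)
            except OSError:
                pass                   # unreadable directory: skip it
            finally:
                pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    pending.join()                     # returns once every queued directory is processed
    for _ in threads:
        pending.put(None)              # tell each worker to stop
    for t in threads:
        t.join()
    return sorted(found)
```

Subdirectories are queued before the parent directory is marked done, so `pending.join()` cannot return while work is still outstanding.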
Then, the hash index is compared with the scanned index file to determine the difference data: different data fragments are found between the two, and the difference data is determined from them. For example, the hash index is sliced according to a preset slicing algorithm to obtain a plurality of data fragments; each data fragment is taken in turn as the current data fragment, its characteristic value is calculated, and it is judged whether the index file contains a data fragment identical to the current data fragment; the data fragments that have no identical counterpart are accumulated, and the accumulated different data fragments are taken as the difference data.
When the difference data is zero, the file to be stored is associated with the index file; otherwise, the percentage of the difference data in the hash index is compared with a preset difference threshold, and the file to be stored is stored according to the comparison result. For example, if the percentage of the difference data in the hash index is greater than the preset difference threshold, the hash index is stored directly; if the percentage is less than or equal to the preset difference threshold, the file to be stored is stored based on the difference data and the index file.
Specifically, when the percentage of the difference data in the hash index is less than or equal to the preset difference threshold: if the index file contains a data fragment identical to the current data fragment, the current data fragment is not written into the difference data, and the pointer of the current data fragment in the metadata of the difference data simply points to that identical data fragment; if no identical data fragment exists, the current data fragment is written into the difference data, and the pointer of the current data fragment in the metadata of the difference data points to the current data fragment.
The embodiment of the application has the following beneficial effects:
The efficiency of file storage is improved and storage space is saved. By performing a hash operation on the file to be stored, a unique hash index is obtained, so that files with the same content are not stored repeatedly. In addition, by comparing the hash index with the existing index file, the difference data can be found and only the differing part is stored, so the whole content of the file is not stored again.
In addition, parallel processing improves the efficiency of file traversal. Because idle task processes or task threads take directories from the to-be-scanned directory queue in parallel and lock them, directories can be traversed in parallel without blocking; each scanned index file is recorded in the task process or task thread, and each scanned file directory is added to the to-be-scanned directory queue.
Moreover, the slicing algorithm reduces the time for data comparison. Slicing according to a preset slicing algorithm splits a large file into a plurality of data fragments; a characteristic value is calculated for each data fragment, and only the characteristic values need to be compared to identify the different data fragments.
Finally, different storage schemes are chosen according to the difference data, which optimizes storage efficiency and storage space. When the difference data is zero, the file to be stored is associated with the index file; otherwise, the percentage of the difference data in the hash index is compared with a preset difference threshold to determine the storage scheme. When the percentage is greater than the preset difference threshold, the hash index is stored directly; when the percentage is less than or equal to the preset difference threshold, storage is performed based on the difference data and the index file. In this way, a suitable storage scheme is selected according to the actual situation.
Example 2
According to an embodiment of the present application, there is provided another method for storing a file, as shown in fig. 2, including the steps of:
Step S202, performing a hash operation on the file to be stored to obtain a hash index.
The file to be stored is acquired, for example a file received and downloaded through an APP on the mobile terminal device. A hash operation is performed on the file to be stored to generate a hash index. The hash index is a unique identifier generated by a hash algorithm, and its length is typically fixed.
The hash operation takes the file to be stored as input and computes a hash index using a specific hash algorithm, which may be MD5, SHA-1, SHA-2 or another hash function. The resulting hash index is a fixed-length number or character string. The same input always generates the same hash index, so the hash index can be used as an identifier that represents the content of the original input.
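As an illustration of a content-derived, fixed-length identifier, the sketch below hashes a file with SHA-256 (a member of the SHA-2 family) from Python's standard `hashlib`. The function name `content_hash` and the chunk size are assumptions made for this example; this is a plain cryptographic digest, not the seeded two-function scheme of Example 3.

```python
import hashlib

def content_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()   # always 64 hexadecimal characters
```

Two files with identical bytes produce the same digest regardless of their file names, which is the property that name-based deduplication lacks.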
The hash index may be used as a unique identifier for a file and thus may be used to find a particular file in a storage system. The hash index may also be used to compare whether two files are identical, since they will only be considered identical if their hash indexes are identical. The method of constructing the hash index will be described in detail below, and will not be described in detail here.
After the hash index is generated, the file to be stored can be associated with the hash index, so that when the file is opened in the APP on the mobile phone it is resolved through the hash index. The file to be stored may be any type of file, such as a text file, an image file or a video file. The storage system may be a local disk, cloud storage, a distributed storage system, or the like.
Step S204, comparing the hash index with the index file in the storage system to obtain difference data.
Comparing the hash index with the index file, finding different data fragments between the hash index and the index file, and determining difference data based on the different data fragments. The hash index is sliced according to a preset slicing algorithm to obtain a plurality of data slices, and the characteristic value of each data slice is calculated. And then taking each data fragment as the current data fragment, calculating the characteristic value of the current data fragment, judging whether the data fragment which is the same as the current data fragment exists in the index file, accumulating the data fragments which are different from the current data fragment, and taking the accumulated different data fragments as difference data.
Specifically, as shown in fig. 3, the method for acquiring the difference data may include the following steps:
Step S2042, performing data slicing.
The hash index is first data sliced using a slicing algorithm. For example, using a hash function based slicing algorithm, a hash index may be split into multiple data slices, each containing a portion of the hash index's data. In addition to the hash function-based slicing algorithm, other slicing algorithms may be used, such as a slicing algorithm that performs range division according to a key of data, a slicing algorithm that performs dynamic division according to the heat of data, and so on.
Step S2044, performing characteristic value calculation.
For each data fragment, a characteristic value is calculated so that the fragment can be compared with the data fragments in the index file to determine whether they are the same. The characteristic value may be a hash value of the data fragment or any other value capable of uniquely identifying the data fragment.
For example, a hash function may be used to calculate the hash value of a data fragment as its characteristic value. If the data fragment is a hash table containing multiple key-value pairs, the entire hash table may be passed as input to the hash function to obtain a single hash value.
In this embodiment, factors such as the uniqueness of the characteristic value and the calculation efficiency are considered when calculating the characteristic value of a data fragment, and a hash function that provides uniqueness is selected, which reduces the time and resources required to calculate the characteristic value.
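Steps S2042 and S2044 can be sketched together as below. The example assumes fixed-size slicing of a byte string, which is only one possible slicing algorithm (the text also mentions hash-based, range-based and heat-based slicing), and uses SHA-256 as the characteristic value; the names `slice_data` and `characteristic_value` are hypothetical.

```python
import hashlib

def slice_data(data: bytes, slice_size: int = 4):
    """Split `data` into consecutive fragments of at most `slice_size` bytes."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    return [data[i:i + slice_size] for i in range(0, len(data), slice_size)]

def characteristic_value(fragment: bytes) -> str:
    """Characteristic value of one fragment: here, its SHA-256 hex digest."""
    return hashlib.sha256(fragment).hexdigest()

fragments = slice_data(b"abcdefghij", slice_size=4)
values = [characteristic_value(f) for f in fragments]
```

With the inputs shown, `fragments` is `[b"abcd", b"efgh", b"ij"]`; the last fragment is shorter because the data length is not a multiple of the slice size.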
Step S2046, index file comparison is performed.
For each data fragment, it is determined whether the same data fragment exists in the index file. The index file may be divided into a plurality of data fragments using the same slicing algorithm and characteristic value calculation method as for the hash index, and the characteristic value of each data fragment is calculated. A fast search structure, such as a hash table or a binary search tree, may then be used to find a data fragment identical to the current data fragment.
When a hash table is used for searching, the characteristic value of each data fragment serves as the key and the information of the data fragment is stored in the hash table as the value. To find a data fragment identical to the current data fragment, the characteristic value of the current data fragment is used as the key to look up the corresponding value in the hash table. If the hash table contains a data fragment with the same characteristic value, the two data fragments are identical.
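A minimal sketch of the hash-table lookup, using a Python `dict` keyed by the characteristic value. The stored "information" is assumed here to be the fragment's position in the index file; the function names are illustrative.

```python
import hashlib

def build_fragment_table(fragments):
    """Map characteristic value -> position of the first fragment with that value."""
    table = {}
    for position, fragment in enumerate(fragments):
        table.setdefault(hashlib.sha256(fragment).hexdigest(), position)
    return table

def find_same_fragment(table, fragment):
    """Return the stored position of an identical fragment, or None if absent."""
    return table.get(hashlib.sha256(fragment).hexdigest())

table = build_fragment_table([b"aa", b"bb", b"aa"])
```

Only digests and positions are held in the table, so the fragments themselves do not need to be kept in memory for the comparison.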
When a binary search tree is used for searching, the characteristic value of each data fragment serves as the key and the information of the data fragment is stored in the tree as the value. In a binary search tree, the keys of all nodes in the left subtree of a node are smaller than the key of that node, and the keys of all nodes in its right subtree are greater. To find a data fragment identical to the current data fragment, the characteristic value of the current data fragment is compared with the key of each node starting from the root node, descending left or right according to the comparison; if a node with an equal key is found, the two data fragments are identical.
This embodiment uses a fast search algorithm to quickly find data fragments identical to the current data fragment, which greatly improves the efficiency of data processing. Storage space is also reduced: when searching with a hash table or binary search tree, only the characteristic values and corresponding information of the data fragments need to be stored, not the data fragments themselves. Query accuracy is improved as well, because the search is an exact match on the characteristic value.
Step S2048, determining the difference data.
All accumulated different data fragments are taken as the difference data.
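The accumulation in step S2048 can be sketched as below, assuming fragments are byte strings and that two fragments are treated as identical when their SHA-256 digests match. The function name `difference_data` is hypothetical.

```python
import hashlib

def difference_data(new_fragments, indexed_fragments):
    """Return the fragments of the new data that have no identical indexed fragment."""
    known = {hashlib.sha256(f).digest() for f in indexed_fragments}
    return [f for f in new_fragments if hashlib.sha256(f).digest() not in known]

diff = difference_data([b"a", b"b", b"c"], [b"a", b"c"])
```

An empty result corresponds to the "difference data is zero" case, in which the new file only needs to be associated with the existing index file.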
Step S206, storing based on the difference data.
If the storage system already contains a file exactly the same as the file to be stored, that is, the difference data is zero, the file to be stored is associated with the index file. When the file is then clicked in the mobile phone APP, the file reached through the index file is opened directly, and repeated storage of the same file is avoided.
However, if there is a case where the difference data is not zero, that is, there is no file that is exactly the same as the file to be stored in the storage system, it is necessary to further determine whether there is a file that has a high similarity with the file to be stored. This may be achieved by comparing the percentage of the difference data to the hash index with a preset difference threshold. If the percentage of the difference data in the hash index is larger than a preset difference threshold, the fact that similar files do not exist in the storage system is indicated, and the hash index can be directly stored. If the percentage of the difference data in the hash index is smaller than or equal to a preset difference threshold, the fact that similar files exist in the storage system is indicated, and storage is needed based on the difference data and the index files.
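The three-way decision can be sketched as follows. The sketch assumes the "percentage of the difference data" is measured as the share of differing fragments among all fragments, and the default threshold of 0.3 is an arbitrary illustrative value; neither is specified in the text.

```python
def choose_storage(total_fragments, differing_fragments, threshold=0.3):
    """Return 'link', 'full' or 'delta' for the cases described above."""
    if total_fragments <= 0:
        raise ValueError("total_fragments must be positive")
    if differing_fragments == 0:
        return "link"                  # identical file exists: associate with the index file
    ratio = differing_fragments / total_fragments
    if ratio > threshold:
        return "full"                  # no sufficiently similar file: store directly
    return "delta"                     # similar file exists: store difference data only
```

For example, with ten fragments and the default threshold, two differing fragments select `"delta"` and five select `"full"`.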
When the similarity is high, processing may be further performed at the level of data fragments. If the index file contains a data fragment identical to the current data fragment of the hash index, the current data fragment is not written into the difference data, and the pointer of the current data fragment in the metadata of the difference data simply points to that identical data fragment. If no identical data fragment exists, the current data fragment is written into the difference data, and the pointer of the current data fragment in the metadata of the difference data points to the current data fragment.
According to the embodiment of the application, the storage mode based on the difference data is adopted, so that the cost of storage space and storage bandwidth can be effectively reduced. In the difference storage, the amount of change of data is stored, not the entire data itself. If the data fragment to be stored already exists in the index file, it is not necessary to store it in the difference data, but only to point it in the metadata to the data fragment of the stored index file. Thus, the repeated storage of the same data can be avoided, and the storage space and the storage bandwidth are saved. If the data fragment to be stored does not exist in the index file, it needs to be stored into the difference data and pointed to the current data fragment in the metadata. This ensures that all new data fragments are stored for subsequent lookup and retrieval operations.
Thus, when the difference data needs to be retrieved, different pieces of data can be assembled into complete data blocks according to pointer information in the metadata. If the whole original data needs to be restored, all the data blocks are assembled in sequence.
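The pointer-based storage and the later reassembly can be sketched together. The metadata layout used here, a list of `(source, position)` pairs where the source is `"base"` (the already indexed fragments) or `"delta"` (the difference data), is an assumption made for illustration, as are the function names.

```python
import hashlib

def store_delta(new_fragments, base_fragments):
    """Return (delta, metadata) describing `new_fragments` relative to `base_fragments`."""
    base_pos = {}
    for i, fragment in enumerate(base_fragments):
        base_pos.setdefault(hashlib.sha256(fragment).digest(), i)
    delta, metadata = [], []
    for fragment in new_fragments:
        i = base_pos.get(hashlib.sha256(fragment).digest())
        if i is not None:
            metadata.append(("base", i))            # point at the existing fragment
        else:
            metadata.append(("delta", len(delta)))  # point at the newly written fragment
            delta.append(fragment)
    return delta, metadata

def restore(delta, metadata, base_fragments):
    """Reassemble the original data by following the pointers in order."""
    sources = {"base": base_fragments, "delta": delta}
    return b"".join(sources[where][i] for where, i in metadata)

base = [b"AAAA", b"BBBB", b"CCCC"]
new = [b"AAAA", b"XXXX", b"CCCC"]
delta, metadata = store_delta(new, base)
```

In this example only `b"XXXX"` is written to the difference data; the other two fragments are recorded as pointers into the base.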
Example 3
A hash index is an index value obtained by converting a file, and each different file typically has its own unique hash index. However, different files may sometimes be hashed to the same hash index, which is referred to as a hash collision. Although this is rare, collision handling must be performed during the hash operation to ensure the integrity and correctness of the data.
There are many methods of hash collision processing, and an open addressing method or a chained method is generally used in the prior art. In the open addressing method, when a hash collision occurs, the next available hash position is found. In the chained approach, each hash index contains a linked list to which all files hashed to the index are added. When a specific file needs to be retrieved, only the linked list of the index needs to be traversed.
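For reference, the two conventional strategies can be sketched side by side. Both classes are simplified illustrations with hypothetical names: the tables have a fixed size, deletion is omitted, and linear probing is used as the simplest form of open addressing.

```python
class ChainedTable:
    """Chaining: each slot holds a list of all entries hashed to it."""
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        return None

class LinearProbingTable:
    """Open addressing: on a collision, move to the next free slot."""
    def __init__(self, size=8):
        self.slots = [None] * size

    def put(self, key, value):
        n = len(self.slots)
        start = hash(key) % n
        for step in range(n):
            i = (start + step) % n
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table is full")

    def get(self, key):
        n = len(self.slots)
        start = hash(key) % n
        for step in range(n):
            i = (start + step) % n
            if self.slots[i] is None:
                return None
            if self.slots[i][0] == key:
                return self.slots[i][1]
        return None
```

With a table size of 8, the integer keys 1 and 9 collide: the chained table keeps both in bucket 1, while the probing table places the second key in slot 2.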
However, the open addressing method and the chained method have a number of disadvantages. The open addressing method requires reserving enough unused hash slots to handle hash collisions, which occupies additional memory space. When the hash table is nearly full, the open addressing method becomes slow because an empty slot must be searched for before a new element can be inserted. The open addressing method also has to maintain the continuity of the hash table when elements are deleted, which may require a large number of move operations and reduce performance. The chained method requires a pointer to the head of the linked list to be maintained for each hash slot, which takes up additional memory space. The chained method may also be cache-inefficient for large hash tables, because a linked list is not a contiguous block of memory. When deleting elements, the chained method needs to reorganize the linked list, which may degrade performance.
In order to solve the above-described problems, the present embodiment provides a method of constructing an index, as shown in fig. 4, including the steps of:
Step S402, constructing a first hash function and a second hash function using different seed values.
The seed value is used in the hash function to determine the manner in which the hash code is generated. If the same seed value is used to generate the hash code, the same input value will always generate the same hash code. Therefore, in order to obtain better hash table performance, it is necessary to ensure the uniqueness and dispersibility of the seed value.
In this embodiment, a plurality of seed values are used to generate a new seed value for the hash code. For example, a fixed seed value and a randomly generated seed value may be combined to generate the hash code. This has several benefits: 1) Uniqueness is increased. Using multiple seed values helps give each key-value pair in the hash table a unique hash code, which helps to avoid hash collisions and improves the performance of the hash table. 2) Dispersion is increased. Different seed values may produce different hash code distributions, which spreads entries more evenly across the hash table and reduces the probability of hash collisions. 3) The hash code is harder to guess. Using multiple seed values makes the hash code more difficult to predict, which helps to improve the security of the hash table and to resist hash collision attacks.
Let the original hash function be h(key), where key is the key value to be hashed. The original hash value h(key) is split into two parts, which are multiplied by the variation parameters m and n respectively and then added to obtain a new hash function:
new_hash = hs1(key) * m + hs2(key) * n
where hs1 and hs2 are the original hash functions, and m and n are additionally introduced variation parameters that can be adjusted according to the actual situation to minimize the occurrence of hash collisions. It should be noted that the variation parameters m and n are coprime, which ensures that the hash function can cover all positions in the hash table.
The new hash function in this embodiment is a variation of the original hash function, so that different files are hashed to different index positions. By varying the original hash function in this way, different key values are hashed to different index positions, which gives each key value its own hash index as far as possible and reduces hash collisions.
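The combination above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the embodiment's actual implementation: the base hash `_seeded_hash` (a keyed 64-bit BLAKE2b digest), the seeds `seed1` and `seed2`, and the default values of `m` and `n` are all hypothetical, and the sketch assumes that m and n are meant to be coprime.

```python
import hashlib
from math import gcd


def _seeded_hash(key: bytes, seed: int) -> int:
    """Hypothetical base hash: a 64-bit BLAKE2b digest keyed by the seed."""
    digest = hashlib.blake2b(
        key, digest_size=8, key=seed.to_bytes(8, "big")
    ).digest()
    return int.from_bytes(digest, "big")


def new_hash(key: bytes, m: int = 31, n: int = 37,
             seed1: int = 1, seed2: int = 2) -> int:
    """new_hash = hs1(key) * m + hs2(key) * n, with hs1/hs2 as seeded hashes."""
    # Assumption of this sketch: m and n must be coprime.
    if gcd(m, n) != 1:
        raise ValueError("m and n are expected to be coprime")
    hs1 = _seeded_hash(key, seed1)
    hs2 = _seeded_hash(key, seed2)
    return hs1 * m + hs2 * n
```

The result is an unbounded non-negative integer; it is only reduced to a table position later, when it is taken modulo the hash table length.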
Two new hash functions are defined on the basis of new_hash, namely a first hash function and a second hash function:
h1(key) = (new_hash(key) + a) % M
h2(key) = (new_hash(key) + b) % M
where a and b are two different seed values generated using a fixed seed value and a random seed value, and M is the length of the hash table.
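A short Python sketch of the two definitions follows. It is illustrative only: `FIXED_SEED`, `make_seed`, and `make_hash_pair` are hypothetical names, combining the fixed and random seeds with XOR is an assumption (the text does not say how they are combined), and a truncated SHA-256 digest stands in for new_hash.

```python
import hashlib
import secrets
from typing import Callable, Tuple

FIXED_SEED = 0x9E3779B9  # hypothetical fixed seed value


def make_seed(fixed: int, random_part: int) -> int:
    """Combine a fixed and a random seed. XOR is an assumption of this sketch."""
    return fixed ^ random_part


def new_hash(key: bytes) -> int:
    """Stand-in for new_hash(key): the first 8 bytes of a SHA-256 digest."""
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")


def make_hash_pair(M: int, a: int, b: int) -> Tuple[Callable[[bytes], int],
                                                    Callable[[bytes], int]]:
    """Build h1 and h2 for a table of length M with seed values a and b."""
    def h1(key: bytes) -> int:
        return (new_hash(key) + a) % M

    def h2(key: bytes) -> int:
        return (new_hash(key) + b) % M

    return h1, h2


if __name__ == "__main__":
    a = make_seed(FIXED_SEED, secrets.randbits(32))
    b = make_seed(FIXED_SEED, secrets.randbits(32))
    h1, h2 = make_hash_pair(1021, a, b)
    print(h1(b"example.txt"), h2(b"example.txt"))
```

Under these definitions, h2(key) - h1(key) is congruent to b - a modulo M for every key, so the two functions differ only by a constant offset.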
In this embodiment, two different hash functions are used to calculate the hash index to further reduce the likelihood of hash collisions. The two hash functions may be different algorithms or the same algorithm, but using different parameters or seed values. For each key value, a first hash function is used to calculate the hash value, and then a second hash function is used to calculate the hash value again, so as to obtain a final hash index. The use of multiple hash functions may improve the performance and reliability of the hash table.
Step S404, calculating a correction factor based on the quadratic probing (square detection) method by using the load density of the hash table and the length of the hash table.
In this embodiment, the correction factor is calculated based on the quadratic probing method. The size of the correction factor is set in relation to the square of the hash table length, so as to better avoid hash collisions. The specific formula is as follows:
i = (1+ sqrt(k) )/ 2 - ((sqrt(k) - 1) / 2) * (lf/ M)
where sqrt() denotes the square root; lf denotes the load density, that is, the ratio of the number of elements already stored in the hash table to the length of the hash table array; and M denotes the hash table length. In the formula, k is a randomly chosen integer, so sqrt(k) is a constant once k is fixed, which helps avoid hash collisions.
According to this formula, when the load density of the hash table is equal to the hash table length (lf = M), the correction factor is equal to 1; the hash table is completely filled at this point and, according to this embodiment, further hash collisions are unlikely. When the load density is smaller than the hash table length, the correction factor is greater than 1 (for k > 1) and falls toward 1 as the load density increases, starting from its largest value (1 + sqrt(k)) / 2 at lf = 0; this is intended to reduce the probability of hash collisions. When the load density is larger than the hash table length, the correction factor continues to decrease below 1 as the load density increases, so that the performance of the hash table is not affected.
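The formula can be evaluated directly, as in the following Python sketch. The function name `correction_factor` and the argument checks are assumptions. The text describes lf as a ratio but also compares it with M, so the sketch simply uses whatever value of lf is passed in and does not pick an interpretation.

```python
from math import sqrt


def correction_factor(k: int, lf: float, M: int) -> float:
    """i = (1 + sqrt(k)) / 2 - ((sqrt(k) - 1) / 2) * (lf / M)."""
    if k < 0:
        raise ValueError("k must be non-negative")
    if M <= 0:
        raise ValueError("M must be positive")
    s = sqrt(k)
    return (1 + s) / 2 - ((s - 1) / 2) * (lf / M)


if __name__ == "__main__":
    # Evaluate the formula at a few load values for k = 9 and M = 16.
    for lf in (0, 8, 16):
        print(lf, correction_factor(9, lf, 16))
```

With k = 9 and M = 16, the formula as written gives 2.0 at lf = 0, 1.5 at lf = 8, and 1.0 at lf = 16.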
Step S406, performing a hash operation on the file to be stored by using the correction factor, the first hash function and the second hash function to obtain a hash index corresponding to the content of the file to be stored.
For each key, a first hash function h1 (key) is used to calculate the hash value of the key, and then a second hash function h2 (key) is used to calculate the hash value again, so as to obtain a final hash index:
hash_index = (h1(key) + i * h2(key)) % M
With this method, hash collisions can be better avoided. Because the correction factor formula takes into account the relation between the load density and the hash table length, it adapts to different load densities and reduces the occurrence of hash collisions. In addition, lookup efficiency can be improved: the more elements a hash table holds, the greater the probability of hash collisions and the lower the lookup efficiency, and the correction factor formula reduces hash collisions and thus improves lookup efficiency. Finally, reasonable space utilization can be maintained. Storing a large number of elements in the hash table may lead to low space utilization, and the correction factor formula can improve the space utilization of the hash table while maintaining its performance. Because the formula adjusts to the load density of the hash table, it adapts to hash tables with different load densities, which improves the flexibility and adaptability of the hash table.
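The final index computation can be sketched in Python as below. Because the correction factor i is generally not an integer, the formula yields a fractional value; truncating it to an integer slot with `int()` is an assumption of this sketch, since the text does not state how the result is rounded. The function name `hash_index` and its argument checks are likewise hypothetical.

```python
def hash_index(h1_value: int, h2_value: int, i: float, M: int) -> int:
    """hash_index = (h1(key) + i * h2(key)) % M, truncated to an integer slot.

    h1_value and h2_value are the already computed results of h1(key) and
    h2(key); i is the correction factor; M is the hash table length.
    """
    if M <= 0:
        raise ValueError("M must be positive")
    if h1_value < 0 or h2_value < 0 or i < 0:
        raise ValueError("inputs are expected to be non-negative")
    # Truncation toward zero is an assumption; the source gives no rounding rule.
    return int((h1_value + i * h2_value) % M)


if __name__ == "__main__":
    # Example: h1(key) = 3, h2(key) = 5, i = 1.5, M = 16 -> (3 + 7.5) % 16 = 10.5
    print(hash_index(3, 5, 1.5, 16))
```

When i equals 1, the expression reduces to (h1(key) + h2(key)) % M, which is already an integer.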
Example 4
According to an embodiment of the application, a method for reading a file is provided. The method comprises: receiving a reading instruction; and searching for and reading the file corresponding to the reading instruction from among the files stored by using the method in embodiment 1 or 2.
Example 5
According to an embodiment of the present application, a system for storing and reading files is provided. As shown in fig. 5, the system includes a device 52 for storing files and a device 54 for reading files.
The device 52 for storing files includes a hash module 522 and a storage module 524. The hash module 522 is configured to obtain a file to be stored, and perform a hash operation on the file to be stored according to a preset hash rule to obtain a hash index corresponding to the content of the file to be stored; the storage module 524 is configured to traverse the local index file, compare the hash index with the index file, obtain difference data, and store the file to be stored based on the difference data.
The device 54 for reading files includes a receiving module 542 and a reading module 544. The receiving module 542 is configured to receive a reading instruction; the reading module 544 is configured to search for and read a file corresponding to the reading instruction from among the files stored using the method described in embodiment 1 or 2.
The device 52 for storing files can implement the file storage method in the above embodiments, and the device 54 for reading files can implement the file reading method in the above embodiments, so details are not repeated here.
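The comparison step performed by the storage module, as set out in claims 2 to 5, might look roughly like the following Python sketch. Everything concrete here is an assumption made for illustration: fixed-size fragmenting, SHA-256 as the characteristic value, a dictionary standing in for the index file, and the names `fragment`, `feature`, and `plan_storage` with their action labels.

```python
import hashlib
from typing import Dict, List, Tuple


def fragment(data: bytes, size: int = 4) -> List[bytes]:
    """Hypothetical fragmenting algorithm: fixed-size slices."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def feature(frag: bytes) -> str:
    """Hypothetical characteristic value: SHA-256 of the fragment."""
    return hashlib.sha256(frag).hexdigest()


def plan_storage(hash_index: bytes, index_store: Dict[str, bytes],
                 threshold: float = 0.5, size: int = 4
                 ) -> Tuple[str, List[str], Dict[str, bytes]]:
    """Return (action, pointers, fragments_to_write).

    pointers lists one characteristic value per fragment, in order; it plays
    the role of the metadata that points at stored fragments.
    """
    frags = fragment(hash_index, size)
    pointers: List[str] = []
    missing: Dict[str, bytes] = {}
    diff_count = 0
    for frag in frags:
        fv = feature(frag)
        pointers.append(fv)
        if fv not in index_store:      # no identical fragment in the index
            diff_count += 1
            missing[fv] = frag
    if diff_count == 0:
        # Difference data is zero: only associate the file with the index.
        return "associate", pointers, {}
    if diff_count / len(frags) > threshold:
        # Too different: store the whole hash index directly.
        return "store_full", pointers, {feature(f): f for f in frags}
    # Similar enough: write only the missing fragments, point at the rest.
    return "store_delta", pointers, missing
```

In the `store_delta` case, fragments already present in the index are referenced through `pointers` only, and just the missing fragments are written.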
Example 6
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 6, the electronic device includes:
a processor 291 and a memory 292; the electronic device may further include a communication interface 293 and a bus 294. The processor 291, the memory 292, and the communication interface 293 may communicate with each other via the bus 294. The communication interface 293 may be used for information transfer. The processor 291 may call logic instructions in the memory 292 to perform the methods of the above embodiments.
Further, the logic instructions in memory 292 described above may be implemented in the form of software functional units and stored in a computer-readable storage medium when sold or used as a stand-alone product.
The memory 292 is a computer readable storage medium, and may be used to store a software program, a computer executable program, and program instructions/modules corresponding to the methods in the embodiments of the present application. The processor 291 executes functional applications and data processing by running software programs, instructions and modules stored in the memory 292, i.e., implements the methods of the method embodiments described above.
Memory 292 may include a program storage area and a data storage area. The program storage area may store an operating system and an application program required for at least one function; the data storage area may store data created according to the use of the terminal device, etc. Further, memory 292 may include high-speed random access memory, and may also include non-volatile memory.
Embodiments of the present application also provide a computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, are configured to implement the method described in any of the embodiments.
Embodiments of the present application also provide a computer program product comprising a computer program for implementing the method described in any of the embodiments when executed by a processor.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing one or more computer devices (which may be personal computers, servers or network devices, etc.) to perform all or part of the steps of the method described in the embodiments of the present invention.
In the foregoing embodiments of the present invention, each embodiment is described with its own emphasis; for a portion that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described apparatus embodiments are merely exemplary. For example, the division of the units is merely a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in electrical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to fall within the scope of the present invention.

Claims (10)

1. A method of file storage, comprising:
obtaining a file to be stored, and carrying out hash operation on the file to be stored according to a preset hash rule to obtain a hash index corresponding to the content of the file to be stored;
traversing a local index file, comparing the hash index with the index file, obtaining difference data, and storing the file to be stored based on the difference data.
2. The method of claim 1, wherein comparing the hash index to the index file, obtaining difference data, and storing the file to be stored based on the difference data, comprises:
comparing the hash index with the index file, finding different data fragments between the hash index and the index file, and determining the difference data based on the different data fragments;
and under the condition that the difference data is zero, associating the file to be stored with the index file, otherwise, comparing the percentage of the difference data in the hash index with a preset difference threshold value, and storing the file to be stored according to a comparison result.
3. The method of claim 2, wherein comparing the hash index with the index file, finding different data fragments between the hash index and the index file, and determining the difference data based on the different data fragments, comprises:
fragmenting the hash index according to a preset fragmenting algorithm to obtain a plurality of data fragments, and calculating the characteristic value of each data fragment;
and taking each data fragment in turn as a current data fragment, judging, based on the characteristic value of the current data fragment, whether a data fragment identical to the current data fragment exists in the index file, accumulating the data fragments for which no identical data fragment exists, and taking the accumulated data fragments as the difference data.
4. The method of claim 3, wherein comparing the percentage of the difference data in the hash index with a preset difference threshold and storing the file to be stored according to the comparison result comprises:
directly storing the hash index in the case where the percentage of the difference data in the hash index is larger than the preset difference threshold;
and storing based on the difference data and the index file in the case where the percentage of the difference data in the hash index is smaller than or equal to the preset difference threshold.
5. The method of claim 4, wherein storing based on the difference data and the index file comprises:
in the case where a data fragment identical to the current data fragment exists, not writing the current data fragment into the difference data, but only pointing a pointer of the current data fragment to the identical data fragment in metadata of the difference data;
and in the case where no data fragment identical to the current data fragment exists, writing the current data fragment into the difference data, and pointing a pointer of the current data fragment to the current data fragment in metadata of the difference data.
6. The method of claim 1, wherein traversing the local index file comprises:
taking out, in parallel, a directory to be scanned from a queue of directories to be scanned by using an idle task process or task thread, and locking the directory to be scanned;
and performing multi-process or multi-thread parallel traversal scanning based on the directory to be scanned, recording the scanned index files in the task process or task thread, and adding the scanned file directories to the tail of the queue of directories to be scanned.
7. A method of reading a file, comprising:
receiving a reading instruction;
searching for and reading a file corresponding to the reading instruction from among files stored using the method of any one of claims 1 to 6.
8. A device for storing files, comprising:
the hash module is configured to acquire a file to be stored, and perform hash operation on the file to be stored according to a preset hash rule to obtain a hash index corresponding to the content of the file to be stored;
the storage module is configured to traverse a local index file, compare the hash index with the index file, acquire difference data and store the file to be stored based on the difference data.
9. A file reading device, comprising:
a receiving module configured to receive a reading instruction;
a reading module configured to search for and read a file corresponding to the reading instruction from among files stored using the method of any one of claims 1 to 6.
10. A system for file storage and file reading, comprising: a file storage device according to claim 8 and a file reading device according to claim 9.
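The queue-based parallel traversal recited in claim 6 can be illustrated with the following Python sketch. It is a minimal multi-thread example under stated assumptions, not the claimed implementation: the name `scan_tree` and the worker count are hypothetical, removing a directory from a thread-safe `queue.Queue` stands in for locking it (only one worker can obtain a given entry), and every regular file is recorded as a scanned file.

```python
import os
import queue
import threading
from typing import List, Optional


def scan_tree(root: str, workers: int = 4) -> List[str]:
    """Scan root in parallel; return the paths of all regular files found."""
    pending: "queue.Queue[Optional[str]]" = queue.Queue()
    found: List[str] = []
    found_lock = threading.Lock()
    pending.put(root)

    def worker() -> None:
        while True:
            directory = pending.get()   # an idle thread takes the next directory
            if directory is None:       # sentinel: no more work
                pending.task_done()
                return
            try:
                local: List[str] = []
                with os.scandir(directory) as entries:
                    for entry in entries:
                        if entry.is_dir(follow_symlinks=False):
                            pending.put(entry.path)   # append to the queue tail
                        elif entry.is_file(follow_symlinks=False):
                            local.append(entry.path)
                with found_lock:
                    found.extend(local)
            except OSError:
                pass                    # unreadable directory: skip it
            finally:
                pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    pending.join()                      # wait until every queued directory is done
    for _ in threads:
        pending.put(None)               # tell each worker to stop
    for t in threads:
        t.join()
    return found
```

Subdirectories are queued before the parent directory is marked done, so `pending.join()` cannot return while unscanned directories remain.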
CN202310314661.0A 2023-03-29 2023-03-29 File storage and file reading method, device and system Pending CN116010362A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310314661.0A CN116010362A (en) 2023-03-29 2023-03-29 File storage and file reading method, device and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310314661.0A CN116010362A (en) 2023-03-29 2023-03-29 File storage and file reading method, device and system

Publications (1)

Publication Number Publication Date
CN116010362A true CN116010362A (en) 2023-04-25

Family

ID=86023321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310314661.0A Pending CN116010362A (en) 2023-03-29 2023-03-29 File storage and file reading method, device and system

Country Status (1)

Country Link
CN (1) CN116010362A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117667517A (en) * 2023-12-11 2024-03-08 合芯科技有限公司 Distributed file processing method, device, server, equipment and storage medium
CN117891414A (en) * 2024-03-14 2024-04-16 支付宝(杭州)信息技术有限公司 Data storage method based on perfect hash and related equipment

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160335285A1 (en) * 2004-09-15 2016-11-17 International Business Machines Corporation Systems and methods for efficient data searching, storage and reduction
CN109324998A (en) * 2018-09-18 2019-02-12 郑州云海信息技术有限公司 A kind of document handling method, apparatus and system
CN111966649A (en) * 2020-10-21 2020-11-20 中国人民解放军国防科技大学 Lightweight online file storage method and device capable of efficiently removing weight
CN112559452A (en) * 2020-12-11 2021-03-26 北京云宽志业网络技术有限公司 Data deduplication processing method, device, equipment and storage medium
JP2021179717A (en) * 2020-05-12 2021-11-18 日本電気株式会社 File server, deduplication system, processing method, and program
CN114064572A (en) * 2021-11-12 2022-02-18 苏州慧工云信息科技有限公司 Object storage method and system based on Hash algorithm
WO2022048475A1 (en) * 2020-09-03 2022-03-10 中兴通讯股份有限公司 Data deduplication method, node, and computer readable storage medium
CN114564446A (en) * 2022-03-01 2022-05-31 清华大学 File storage method, device, system and storage medium
CN114860677A (en) * 2022-04-24 2022-08-05 Oppo广东移动通信有限公司 File redundancy removal method for terminal equipment, terminal equipment and storage medium
CN115576899A (en) * 2022-12-09 2023-01-06 深圳市木浪云科技有限公司 Index construction method and device and file searching method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王祖俪 等: "数据结构", 西安电子科技大学出版社, pages: 274 - 280 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20230425