CN112667858A - Method for storing data by adopting HASH chain and data writing and reading methods

Info

Publication number: CN112667858A
Application number: CN202011566140.7A
Authority: CN
Inventors: 蔡云霞
Original assignee: Shenzhen Innovation Technology Co ltd
Current assignee: Shenzhen Innovation Technology Co ltd
Priority date: 2020-12-25
Filing date: 2020-12-25
Publication date: 2021-04-16

Abstract

The invention relates to a method for storing data using a HASH chain, together with a data writing method and a data reading method. The storage method comprises enumerating all data blocks of a fixed length, calculating the HASH values of all of these standard data blocks with a HASH algorithm to form an index table of HASH values, and storing each data file in the storage system as the HASH index chain of the standard data blocks it contains. When a data file is read or written, only its HASH index chain is operated on, which reduces the actual storage capacity the file occupies in the system and improves read and write efficiency. Likewise, data transmission within a system or between systems only needs to transfer the HASH index chain records, which reduces the bandwidth required. In addition, the method is easy to deploy flexibly in a distributed system and lends itself to building distributed, large-scale storage systems.

Description

Method for storing data by adopting HASH chain and data writing and reading methods

Technical Field

The invention relates to the technical field of computer storage, in particular to a method for storing data by adopting a HASH chain, a data writing method and a data reading method.

Background

Computer storage technology is built on the memory unit of a computer hardware system, which generally comprises an arithmetic unit, a controller, a memory, an input device and an output device. Storage performance depends on several factors, including the capacity of the memory unit and the storage mode it uses. With increasingly complex big data applications and rapidly growing data volumes, the performance, capacity and even data transmission bandwidth of storage devices lag behind in modern computer systems and are gradually becoming the development bottleneck of the whole computer system.

Storage capacity bears directly on the performance of the storage system, for two main reasons. On the one hand, as CPU performance keeps improving, it exceeds what the storage system can keep up with. On the other hand, storage demand is growing rapidly with the large-scale use of multimedia, object-oriented databases, Web servers and the like. A storage system that has not kept pace with the rapid development of the processor already limits many application fields, and this restriction keeps growing. If the performance bottleneck of the storage system cannot be solved, simply continuing to raise the clock frequency and arithmetic logic capability of the machine may make its input-output ratio (cost relative to benefit) larger and larger. For this reason, for some research projects and for the research departments of major hardware manufacturers, storage system research is becoming a new hot spot in place of CPU design.

The technical means developed by the designers of the present solution focus on the other key factor mentioned above, namely the storage mode. This also answers a research need arising from the rapid development of integrated circuit technology: the performance of microprocessors, which integrate the functions of the arithmetic unit and the controller, has been doubling roughly every eighteen months while the price halves, whereas improvements in data read and write efficiency are limited by the mechanical seek speed of hard disks and lag far behind the growth in the computing capability of computer systems.

With the advent of the big data era, the data that a computer system must process keeps growing, and the performance gap between the CPU and the storage device becomes more prominent. Although the industry has adopted measures such as external storage devices, dedicated storage networks and higher-performance storage media, mass data storage requires a large number of storage devices, and their energy consumption and space occupation are increasingly prominent problems. According to research by Google, the power consumption of storage devices in its data centers reaches 30% of that of the whole data center, and if the power consumption of memory is added, the power consumption of the storage system equals or even exceeds that of the CPU. Improving data read and write efficiency and the utilization of data storage space has clearly become a major challenge for the development of computer systems.

After summarizing the shortcomings of the storage modes of computer systems, the designers of the present solution found that existing storage modes cannot store a large number of diverse data files in an orderly, concise and characteristic manner, which puts great pressure on the storage system. The shortcomings exposed in practice are common ones, including:

existing computer storage systems are often affected by the large number of ever-changing data files, which are stored in a disorderly way, so the data to be stored cannot be compressed substantially and the proportion of storage space actually occupied is very high;

the actual data blocks must be written or read whenever data is written or read, which directly lowers storage efficiency; in addition, because the arrangement of data blocks lacks any reference, data reading efficiency is very low;

existing storage systems must perform comparison and error correction when writing data blocks, which hinders building a large-scale storage system for a large number of data files;

in addition, conventional data storage modes hinder data transmission both within a system and between systems: transmission within a system is under heavy pressure, and when data is transmitted between two communicating computer systems, the complex data files being transferred mean that the bandwidth requirement between the systems clearly cannot be met.

The present solution addresses these problems by making proper use of HASH indexing. It provides a method for storing data using a HASH chain, which generates and stores in the computer system an index table of standard data blocks and their HASH values, while each of the ever-changing data files is represented by a HASH index chain giving the sequence of its data blocks. Based on this storage method, a corresponding data writing method and data reading method are also provided, so that the system operates only on the HASH chain of a data file when reading or writing it. Likewise, when data files are transferred within a system, or between systems that use the same storage method, only the HASH chain of the data file is transferred. The actual data storage capacity in the system is thereby greatly reduced, data processing efficiency is improved, and the data transmission bandwidth requirement of the computer system is greatly reduced. In practical application, the technical solution provided by the invention can alleviate, partially solve or completely solve the problems in the prior art.

Disclosure of Invention

In order to overcome the above problems or at least partially solve the above problems, the present invention provides a method for storing data using a HASH chain, a data writing method, and a data reading method, wherein an index table of standard data blocks and HASH values thereof is generated and stored in a computer system, and data files are stored in the system in the form of HASH index chains of the standard data blocks included in the data files, so that the HASH index chains of the files are only operated during the read/write processing of the data files, thereby reducing the actual storage capacity of the data files in the system and improving the data read/write efficiency.

In order to achieve the purpose, the invention adopts the following technical scheme:

a method for storing data by adopting HASH chain, which firstly generates and stores a standard data block in a computer system at the beginning of data storage, and then introduces HASH algorithm to perform corresponding calculation for the standard data block, the method comprises the following steps:

enumerating all data blocks of a fixed length, calculating the HASH values of all of these standard data blocks with a HASH algorithm to form an index table of HASH values, and storing each data file in the storage system as a HASH index chain of the standard data blocks contained in the data file;

if the calculated HASH values conflict, the data blocks with conflicting HASH values are numbered, stored and recorded.
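As an illustration only, the initialization described above can be sketched in Python. The sketch assumes a toy fixed length of 1 byte, so that all 256 standard data blocks can be enumerated quickly, and a deliberately weakened HASH (SHA-256 truncated to one hexadecimal digit) so that conflicts actually occur. The names `BLOCK_LEN`, `toy_hash` and `build_index_table` are hypothetical and are not part of the invention.

```python
import hashlib
from itertools import product

BLOCK_LEN = 1  # assumed toy fixed length in bytes (the text's example is 4K)


def toy_hash(block: bytes) -> str:
    # Assumption: SHA-256 truncated to one hex digit, so that conflicts occur.
    return hashlib.sha256(block).hexdigest()[:1]


def build_index_table(block_len: int = BLOCK_LEN) -> dict:
    """Enumerate every block of block_len bytes and index it by HASH value.

    Blocks that share a HASH value are kept in one list; the position of a
    block in that list serves as its conflict number.
    """
    table = {}
    for values in product(range(256), repeat=block_len):
        block = bytes(values)
        table.setdefault(toy_hash(block), []).append(block)
    return table


index_table = build_index_table()
```

In this sketch a standard data block is identified by the pair (HASH value, number), where the number is its position in the record for that HASH value.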

The above method can be further refined by the following technical means:

the data blocks with conflicting HASH values are numbered and stored, and are recorded in the record indexed by the HASH value;

if the calculated HASH values conflict, another, orthogonal HASH algorithm can be applied to the data block, so that two HASH values are calculated to distinguish it;

wherein the data file only stores the HASH index chain representing the sequence of the data blocks and the necessary algorithm information in the storage system.

In a specific application of the above technical solution, for example data transfer within a system or between systems, only the data HASH index chain records need to be transferred.

In another specific application, for example storing data in a system with high concurrency requirements, multiple standard data block HASH index tables can be generated on one node or deployed across multiple nodes, and the standard data block HASH index table of each node needs no verification operation.

The invention also comprises other technical schemes based on the same conception of the method for storing data by adopting the HASH chain:

a data writing method, which is suitable for a storage system adopting a HASH chain to store data, comprising the following steps:

first, the storage system cuts the data file into standard data blocks of the fixed length and calculates the HASH value of each standard data block with the HASH algorithm; if the last data block is shorter than the standard fixed length, it is padded with 0 or 1 before its HASH value is calculated, and its actual length is recorded;

second, the record corresponding to each HASH value is looked up in the HASH index table of the standard data blocks; if the record contains data blocks with a HASH conflict, the data block is compared with the conflicting standard data blocks to obtain its number;

third, a HASH index chain record of the data file is generated;

fourth, the HASH index chain record of the data file is written to the storage medium, write success is returned to the user, and the flow ends.

Further, the content of the HASH chain record is at least one of the following information: the fixed length value of the standard data block, the HASH algorithm, the number of the data blocks, the HASH value chain of each data block and the actual length information of the last data block.

a data reading method, which is suitable for a storage system adopting a HASH chain to store data, comprising the following steps:

first, the storage system accesses the HASH index chain record corresponding to the data file and, using the HASH value of each standard data block in the record, directly indexes the corresponding standard data block in the HASH index table of the standard data blocks; the last data block is truncated from its standard data block according to the actual length in the record;

second, the data file is generated from the standard data blocks in the order recorded in the HASH index chain;

third, the generated data file is returned to the application system.

Further, if there is a conflicting HASH value, the indexing continues with the second HASH value or corresponding number.

The invention creates and stores standard data blocks in the computer system, calculates their HASH values to build a HASH index table, and stores each data file in the system as a HASH index chain of standard data blocks. Only the HASH index chain of a file is operated on when the file is read or written, which reduces the actual storage capacity the file occupies in the system, improves data read and write efficiency, and allows a large number of ever-changing data files to be stored in an orderly, concise and characteristic manner.

Meanwhile, the data transmission is carried out in the system or between the systems by adopting the method, and only the data HASH index chain records need to be transmitted, so that the bandwidth requirements of data transmission in the system and between the systems are reduced, and the transmission efficiency is improved.

In addition, because the HASH index table of the standard data blocks is generated by calculation, it needs no verification or error correction, is easy to deploy flexibly in a distributed system, and makes it easy to build a distributed, large-scale storage system.

Drawings

The invention is explained in further detail below with reference to the drawings.

FIG. 1 is a schematic diagram of a HASH index table of standard data blocks of a fixed length, with 4K as an example, in accordance with an embodiment of the present invention;

fig. 2 is a schematic diagram of the data writing process of a system that stores data based on a HASH chain, with a fixed length of 4K as an example;

fig. 3 is a schematic diagram of the data reading process of a system that stores data based on a HASH chain, with a fixed length of 4K as an example;

fig. 4 is a schematic flow diagram of the method for storing data using a HASH chain according to the present invention;

fig. 5 is a schematic flow diagram of the data writing method of a system that stores data based on a HASH chain according to the present invention;

fig. 6 is a schematic flow diagram of the data reading method of a system that stores data based on a HASH chain according to the present invention.

Detailed Description

The method for storing data using a HASH chain disclosed by the invention aims to solve the following technical problem: because a large number of ever-changing data files cannot be stored in an orderly, compact and characteristic manner in a computer storage system, the volume of stored data is huge, data processing efficiency is low, and the data transmission bandwidth requirement of the computer system cannot be met.

The designers of the present solution mainly adopt the approach that a data file is stored in the system as a HASH index chain of the standard data blocks it contains, and that reading and writing a data file operate only on that HASH index chain. The solution is not limited by technical details such as how the HASH index table is created, and no unnecessary limitation is placed on such aspects; where a practitioner needs to refine them for a particular application, conventional technical means can be used. Likewise, because different storage systems differ in capacity and in the way they are integrated into a computer system, the configuration of the storage system, the choice of algorithm and function for calculating HASH values, the way the HASH index table is built, the way data blocks are generated and so on are not fully prescribed; a practitioner need only design a corresponding storage method according to the application requirements.

Therefore, the implemented solution is a data storage method that those skilled in the art can consult and carry out in combination with conventional technical means. Practitioners can construct a final HASH chain storage method according to different application and integration requirements, and in application obtain the advantages it brings, such as a substantial reduction in the actual data storage capacity in the system, improved data processing efficiency and a reduced data transmission bandwidth requirement. These advantages are brought out step by step in the analysis of the method steps below.

As shown in fig. 1 and 4, the steps of the method for storing data using a HASH chain are set out below, taking a fixed length of 4K as an example for ease of explanation:

in the system initialization stage, the standard data blocks and their HASH index table are established, in the following links:

firstly, all data blocks of a fixed length are enumerated, the fixed length being set according to application characteristics, for example a data block length commonly used by storage systems such as 4K; the HASH values of all standard data blocks are then calculated with a HASH algorithm, so that, for example, the HASH values calculated for standard data blocks arranged in sequence are H1, H2, H3, ...

For this step, the HASH values of all standard data blocks can be calculated with any existing conventional HASH algorithm. A HASH algorithm determines the storage address of a node from the node's key value: the key value K is taken as the argument, a function value is calculated through some functional relation h(K) (called the HASH function), and this value is interpreted as the storage address of the node, at which the node is stored. During retrieval, the address is calculated in the same way and the node sought is fetched from the corresponding unit. Because nodes can be retrieved quickly in this way, hashing is both an important storage mode and a common retrieval method.

A storage structure built in this way is called a HASH table (hash table), a position in the table is called a slot, and the core of the technique is the HASH function (hash function). For any given dynamic lookup table DL, if an "ideal" hash function h and a corresponding hash table HT are chosen, then for each data element X in DL the function value h(X.key) is the storage location of X in HT: X is placed there when it is inserted (or when the table is built) and is looked up there when it is retrieved. The storage location determined by the hash function is called the hash address. Generally, the storage space of the hash table is a one-dimensional array HT[M] and the hash address is a subscript of that array. The aim in designing a hashing method is to find a hash function h with $0 \le h(K) < M$, so that for a key value K, $HT[i] = K$ is obtained, where $i = h(K)$. The core of hashing is therefore that the hash function determines the correspondence between the key value (X.key) and the hash address h(X.key), and organization, storage and retrieval are all realized through this correspondence.

secondly, if the HASH values calculated for standard data blocks conflict, either a second HASH value is calculated for those data blocks with another, orthogonal HASH algorithm, or the data blocks with conflicting HASH values are numbered and stored; in either case they are recorded in the record indexed by the HASH value.

When a hash table is built, if key values and hash addresses are in one-to-one correspondence, the storage location of the node sought can be obtained during retrieval simply by applying the hash function to the given value. However, a hash function may compute the same hash address for unequal key values; this is called a collision, and two key values that collide are called synonyms of the hash function. A suitable method must therefore be adopted to resolve HASH value conflicts.
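One way to picture conflict handling with a second HASH value is the following Python sketch. It is only a sketch under stated assumptions: the primary HASH `h1` is deliberately weak (one byte of SHA-256 reduced modulo 16) so that conflicts occur, BLAKE2b stands in for the "orthogonal" second algorithm `h2`, and the blocks are 1 byte long so that all of them can be enumerated. None of these choices comes from the invention itself.

```python
import hashlib


def h1(block: bytes) -> int:
    # Assumed weak primary HASH with only 16 possible values, so conflicts occur.
    return hashlib.sha256(block).digest()[0] % 16


def h2(block: bytes) -> str:
    # Assumed second HASH from a different algorithm, used to tell apart
    # blocks that share the same primary HASH value.
    return hashlib.blake2b(block).hexdigest()


# Index table: primary HASH value -> {second HASH value -> standard data block}
table = {}
for value in range(256):
    block = bytes([value])
    table.setdefault(h1(block), {})[h2(block)] = block


def lookup(first: int, second: str) -> bytes:
    """Index a standard data block by its two HASH values."""
    return table[first][second]
```

Each record indexed by the primary HASH value holds several blocks, and the second HASH value selects one of them, for example `lookup(h1(b"A"), h2(b"A"))` returns `b"A"`.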

In view of the above two links, an index table of the standard data blocks and the HASH values thereof can be finally generated, and the whole data file is stored in the system in the form of the HASH index chain of the standard data blocks contained therein.

A hash function should be chosen so that it is as simple as possible to compute, its range lies within the hash table, and different keys yield different hash values as far as possible. Several factors therefore need to be considered: key length, hash table size, key distribution, retrieval frequency of records, and so on. A common hash function can be adopted by conventional technical means. One example is the division-remainder method, in which the key x is divided by M (often the length of the hash table) and the remainder is taken as the hash address: $h(x) = x \bmod M$. Another is the multiply-remainder rounding method, in which the fractional part of a × key is multiplied by n and rounded down: $\mathrm{hash}(key) = \lfloor n \times (a \times key \bmod 1) \rfloor$.
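The two hash functions named above can be written out as a short Python sketch. The table length of 11 and the multiplier (√5 − 1)/2, a commonly used constant for the multiplication method, are assumptions chosen for illustration; the function names are hypothetical.

```python
import math

M = 11                       # assumed hash table length
A = (math.sqrt(5) - 1) / 2   # assumed multiplier; a commonly used constant


def h_division(key: int, m: int = M) -> int:
    """Division-remainder method: the remainder of key / m is the hash address."""
    return key % m


def h_multiplicative(key: int, n: int = M, a: float = A) -> int:
    """Multiplication method: scale the fractional part of a * key by n, round down."""
    return math.floor(n * ((a * key) % 1))
```

Both functions return an address in the range 0 to M − 1, which is the range required of a hash address for a table HT[M].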

For example, suppose the hash table length m is 11 and the hash function is $H(key) = key \bmod 11$, so that $H(47) = 3$, $H(26) = 4$ and $H(60) = 5$. If the next key is 69, then $H(69) = 3$, which conflicts with 47. If the collision is handled by linear probing, the next hash address is $H_1 = (3 + 1) \bmod 11 = 4$, which still collides; the next is $H_2 = (3 + 2) \bmod 11 = 5$, which still collides; the next is $H_3 = (3 + 3) \bmod 11 = 6$, which is free, so 69 is placed in cell No. 6.
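The worked example can be reproduced with a short Python sketch of linear probing. It assumes the division-remainder function key mod 11 and inserts the keys 47, 26, 60 and 69 in that order; `insert_linear_probe` is a hypothetical helper name.

```python
def insert_linear_probe(table: list, key: int) -> int:
    """Insert key with linear probing and return the slot it ends up in."""
    m = len(table)
    home = key % m                  # division-remainder hash address
    for i in range(m):
        slot = (home + i) % m       # probe home, home + 1, home + 2, ...
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


table = [None] * 11
slots = [insert_linear_probe(table, key) for key in (47, 26, 60, 69)]
print(slots)  # [3, 4, 5, 6]
```

Key 69 is tried at slots 3, 4 and 5, which are occupied by 47, 26 and 60, and lands in slot (3 + 3) mod 11 = 6.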

The common functions and calculation examples above are given only for the reference of practitioners; the functions and calculation methods they use are not part of the technical solution of the invention, and others are not described further.

As shown in fig. 2 and fig. 5, a data writing method based on HASH chain storage can be further implemented on the basis of the storage method above. For ease of explanation, the steps are again set out with a fixed length of 4K as an example. When an application system requests to write data to the storage system:

first, the storage system cuts the data file into data blocks of the fixed length and calculates the HASH value of each data block; if the last data block is shorter than the standard fixed length, it is padded with 0 or 1 before its HASH value is calculated, and its actual length is recorded;

second, the record corresponding to each HASH value is looked up in the standard data block HASH index table; if the record contains data blocks with a HASH conflict, either a second HASH value of the data block is calculated with another, orthogonal HASH algorithm, or the data block is compared with the conflicting standard data blocks to obtain its number (according to the HASH conflict handling method adopted at system initialization);

third, a HASH chain record of the data file is generated, whose content may include: the fixed length value of the standard data block (such as 4K) and the HASH algorithm (these two fields mainly facilitate cross-system transmission), the number of data blocks, the HASH value chain of the data blocks, the actual length of the last data block, and so on;

fourth, the HASH chain record of the data file is written to the storage medium, write success is returned to the user, and the flow ends.
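The write steps above can be sketched in Python as follows. This is a minimal sketch, not the implementation of the invention: it assumes SHA-256 as the HASH algorithm, pads the last block with zero bytes, and omits the index table lookup and conflict handling of the second step. The function name `write_file` and the field names of the record are hypothetical.

```python
import hashlib

BLOCK_LEN = 4096  # fixed length of a standard data block (4K, as in the text)


def write_file(data: bytes, block_len: int = BLOCK_LEN) -> dict:
    """Build a HASH chain record for data (conflict handling omitted)."""
    chain = []
    last_len = block_len if data else 0
    for offset in range(0, len(data), block_len):
        block = data[offset:offset + block_len]
        if len(block) < block_len:
            last_len = len(block)                    # record the actual length
            block = block.ljust(block_len, b"\x00")  # pad the last block with 0
        chain.append(hashlib.sha256(block).hexdigest())
    return {
        "block_len": block_len,
        "hash_algorithm": "sha256",
        "block_count": len(chain),
        "hash_chain": chain,
        "last_block_len": last_len,
    }


record = write_file(b"a" * 5000)
```

For the 5000-byte input, the record holds two HASH values, and the actual length of the last block is 5000 − 4096 = 904 bytes.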

It is clear from the write flow above that, compared with the conventional scheme of storing all data blocks of a data file on the storage medium, the present solution only needs to write the HASH chain record representing the data file, which greatly compresses both the write operations and the storage capacity required. Taking a fixed length of 4K bytes and the HASH algorithm SHA-256 as an example, each data block is represented in the HASH chain record of the data file by a 256-bit HASH value only, so the written data is compressed by a factor of 128 (4096 × 8 / 256 = 128), which greatly improves write efficiency and saves a large amount of storage capacity.
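The factor of 128 follows directly from the block length and the HASH length, as the following Python arithmetic shows. The helper `chain_size` is a hypothetical name, and it counts only the HASH values, not the other fields of the record.

```python
BLOCK_BYTES = 4 * 1024   # fixed length of a standard data block: 4K bytes
HASH_BYTES = 256 // 8    # a SHA-256 value is 256 bits, i.e. 32 bytes

ratio = BLOCK_BYTES // HASH_BYTES


def chain_size(file_bytes: int) -> int:
    """Bytes of HASH values needed to represent a file of file_bytes bytes."""
    block_count = (file_bytes + BLOCK_BYTES - 1) // BLOCK_BYTES  # round up
    return block_count * HASH_BYTES


print(ratio)                # 128
print(chain_size(2 ** 30))  # 8388608
```

Under these assumptions a 1 GiB file is represented by 262,144 HASH values occupying 8 MiB.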

As shown in fig. 3 and fig. 6, a data reading method based on HASH chain storage can be further implemented on the basis of the storage method above. For ease of explanation, the steps are again set out with a fixed length of 4K as an example. When an application system requests to read a data file from the storage system:

first, the storage system accesses the HASH chain record corresponding to the data file and, using the HASH value of each data block in the record, directly indexes the corresponding standard data block in the HASH index table of the standard data blocks (if there is a conflicting HASH value, indexing continues with the second HASH value or the corresponding number); the last data block is truncated from its standard data block according to the actual length in the record;

second, the data file is generated from the standard data blocks in the order recorded in the HASH chain;

third, the generated data file is returned to the application system, and the flow ends.
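The read steps can be sketched in Python as a round trip. The sketch assumes a toy fixed length of 2 bytes, so that all 65,536 standard data blocks can be enumerated into an in-memory index table, and assumes that full SHA-256 values do not conflict among them; `INDEX`, `make_record` and `read_file` are hypothetical names.

```python
import hashlib
from itertools import product

BLOCK_LEN = 2  # assumed toy fixed length so that all 65,536 blocks can be enumerated


def h(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()


# HASH index table of the standard data blocks: HASH value -> standard data block
INDEX = {h(bytes(v)): bytes(v) for v in product(range(256), repeat=BLOCK_LEN)}


def make_record(data: bytes) -> dict:
    """Write side, reduced to what the reader needs: HASH chain and last length."""
    blocks = [data[i:i + BLOCK_LEN] for i in range(0, len(data), BLOCK_LEN)]
    return {
        "hash_chain": [h(b.ljust(BLOCK_LEN, b"\x00")) for b in blocks],
        "last_block_len": len(blocks[-1]) if blocks else 0,
    }


def read_file(record: dict) -> bytes:
    """Index each standard block by its HASH value, truncate the last one, join."""
    blocks = [INDEX[value] for value in record["hash_chain"]]
    if blocks:
        blocks[-1] = blocks[-1][:record["last_block_len"]]
    return b"".join(blocks)


original = b"hello"
restored = read_file(make_record(original))
```

The 5-byte input is split into three blocks, the last of which is padded to 2 bytes when written and truncated back to 1 byte when read, so `restored` equals `original`.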

It can be seen from the data reading flow that, although storing data as a HASH chain cannot avoid reading data blocks, the blocks read are standard data blocks. Once the HASH index table of the standard data blocks has been generated at system initialization, it never needs to be written again, so it can be deployed entirely on the fastest storage medium, or even loaded into memory while the system is running, which greatly improves data reading efficiency.

Given the design of the solution and the actual operating flow of the system, when data is stored using a HASH chain the storage system keeps, for each data file, only the HASH chain representing the sequence of its data blocks and the necessary algorithm information, so storage capacity is greatly compressed and data read and write efficiency is improved. In addition, when data is transmitted within the system, only the HASH chain records need to be transferred, which reduces the transmission pressure inside the system. If two communicating computer systems both store and process data with this method, only the HASH chain records of data files need to be transferred between them, which greatly reduces the bandwidth required between the systems. Data storage applications such as data backup and disaster recovery also become easy to implement because the data volume is so heavily compressed.
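To make the transmission saving concrete, the following Python sketch compares the size of a file with the size of its HASH chain record as it might be sent between two systems. JSON as the wire format, the field names and the 1 MiB sample file are assumptions made for illustration; the receiver is assumed to hold the same standard data block index table, which is what allows it to rebuild the file from the record.

```python
import hashlib
import json

BLOCK_LEN = 4096  # fixed length of a standard data block (4K)


def chain_record(data: bytes) -> dict:
    """HASH chain record of data, with the last block padded with 0."""
    blocks = [data[i:i + BLOCK_LEN] for i in range(0, len(data), BLOCK_LEN)]
    return {
        "block_len": BLOCK_LEN,
        "hash_algorithm": "sha256",
        "hash_chain": [
            hashlib.sha256(b.ljust(BLOCK_LEN, b"\x00")).hexdigest() for b in blocks
        ],
        "last_block_len": len(blocks[-1]) if blocks else 0,
    }


data = bytes(1024 * 1024)                                  # sample 1 MiB file
payload = json.dumps(chain_record(data)).encode("utf-8")   # bytes actually sent
received = json.loads(payload)                             # record at the receiver
```

Here 256 HASH values are sent instead of 1,048,576 bytes of file content. The hexadecimal text encoding roughly doubles the size of each HASH value, so a binary encoding would come closer to the factor of 128 discussed for the write flow.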

In addition, when the method is applied to a system with high concurrency requirement, because the HASH index table of the standard data block is generated through system calculation, a plurality of HASH index tables of the standard data block can be generated in one node or distributed and deployed on a plurality of nodes, the HASH index table of the standard data block of each node does not need to perform any operation such as comparison, and the complexity and the system consumption of distributed deployment are greatly reduced.
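The reason no comparison between nodes is needed can be illustrated with a short Python sketch: because the index table is computed deterministically from the enumeration, every node that builds it independently obtains the same table. The sketch assumes a toy fixed length of 1 byte and uses worker threads as stand-ins for nodes; `build_index_table` and `node_id` are hypothetical names.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def build_index_table(node_id: int) -> dict:
    """Build the standard data block index table on one (simulated) node.

    node_id is only a label; the result does not depend on it.
    """
    return {hashlib.sha256(bytes([v])).hexdigest(): bytes([v]) for v in range(256)}


# Three threads stand in for three nodes building their tables independently.
with ThreadPoolExecutor(max_workers=3) as pool:
    tables = list(pool.map(build_index_table, range(3)))
```

All three tables are equal without any data having been exchanged between the simulated nodes.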

In this specification, phrases such as "embodiment one," "this embodiment," and "specific implementation" do not necessarily all refer to the same embodiment or example; furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

In this specification, terms such as "connect", "mount", "fix", "set" and "have" are used in a broad sense. For example, a "connection" may be a fixed connection or an indirect connection through intermediate components, provided the relationship and technical effect of the components are not affected, and it may be an integral connection or a partial connection. A person skilled in the art can understand the specific meaning of these terms in the present invention according to the specific situation.

The foregoing description of the embodiments is provided to enable any person skilled in the art to make and use them. Various modifications will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without inventive effort. The present disclosure is therefore not limited to the above embodiments, and modifications of the following kinds should fall within its scope: first, a new technical solution implemented on the basis of the technical solution of the invention in combination with existing common knowledge, for example one in which data is stored using a HASH chain and a data file keeps only the HASH chain representing the sequence of its data blocks and the necessary algorithm information, where the technical effect produced does not go beyond that of the invention; second, equivalent replacement of some features of the technical solution of the invention by known techniques, producing the same technical effect as the invention; third, extension on the basis of the technical solution of the invention, where the substance of the extended solution does not go beyond the technical solution of the invention; fourth, application to other related technical fields of technical means obtained by equivalent transformation of the content recorded in this text.

Claims

1. A method for storing data by adopting a HASH chain is characterized in that a standard data block is firstly generated and stored in a computer system at the beginning of data storage, and then HASH algorithm is introduced to carry out corresponding calculation on the standard data block, and the method for storing data by adopting the HASH chain comprises the following steps:

2. The method of claim 1, wherein the data is stored using a HASH chain, comprising: after the data blocks with the HASH value conflict are numbered and stored, the data blocks are recorded in the record indexed by the HASH value.

3. The method of claim 1, wherein the data is stored using a HASH chain, comprising: if the calculated HASH values conflict, the orthogonal HASH algorithm can be adopted again for the data block to calculate two HASH values for distinguishing.

4. A method of using a HASH chain to store data according to any of claims 1 to 3, wherein: the data file only holds the HASH index chain representing the order of the data blocks and the necessary algorithm information in the storage system.

5. The method of claim 4, wherein the HASH chain is used to store data, and wherein: when the method is adopted to carry out data transmission in a system or between systems, only data HASH index chain records need to be transmitted.

6. A method of storing data using a HASH chain according to any of claims 1 to 3, wherein the step of storing data in a system with high concurrency requirements comprises generating multiple standard data block HASH index tables in one node or deploying them in a distributed manner on multiple nodes, without any verification operation on the standard data block HASH index table of each node.

7. A data writing method, which is suitable for a storage system that stores data by adopting a HASH chain, comprising the steps of:

generating a HASH index chain record of the data file;

8. The data writing method of claim 7, wherein the HASH chain record content is at least one of the following information:

the fixed length value of the standard data block, the HASH algorithm, the number of the data blocks, the HASH value chain of each data block and the actual length information of the last data block.

9. A data reading method, which is suitable for a storage system adopting a HASH chain to store data, and is characterized by comprising the following steps:

and thirdly, returning the generated data file to the application system.

10. A data reading method according to claim 9, wherein: if there is a conflicting HASH value, continue indexing the second HASH value or corresponding number.