CN115686382A

CN115686382A - Data storage and reading method

Info

Publication number: CN115686382A
Application number: CN202211713250.0A
Authority: CN
Inventors: 王晓强; 林振仪; 古妍
Original assignee: Nanjing Whale Shark Data Technology Co ltd
Current assignee: Nanjing Whale Shark Data Technology Co ltd
Priority date: 2022-12-30
Filing date: 2022-12-30
Publication date: 2023-02-03
Anticipated expiration: 2042-12-30
Also published as: CN115686382B

Abstract

The invention provides a data storage and reading method comprising the following steps: receiving storage data and obtaining its type; setting an initial frequency identifier for the storage data; if the identifier is high frequency, dividing the storage data into a plurality of data blocks and storing a plurality of copies of each data block on different nodes of a first storage device; if the identifier is intermediate frequency, dividing the storage data into a plurality of data blocks, generating a first check block, and storing the data blocks and the first check block on different nodes of a second storage device; if the identifier is low frequency, dividing the storage data into a plurality of data blocks, compressing them to obtain a plurality of compressed data blocks, generating a second check block, and storing the compressed data blocks and the second check block on different nodes of the second storage device; and monitoring the read frequency of the storage data. The invention selects the most appropriate storage mode for data with different read-frequency requirements and greatly reduces storage cost without reducing access efficiency.

Description

Data storage and reading method

Technical Field

The invention relates to the technical field of data access, in particular to a data storage and reading method.

Background

As information technology has entered the data age, the dramatic growth in data volume has driven a continual increase in the number and capacity of storage systems, and large distributed storage systems have emerged and developed rapidly. A distributed storage system adopts a scalable architecture, uses multiple storage servers to share the storage load, and uses a location server to locate stored information. It offers high storage reliability, availability, and scalability, and has become the predominant way of storing data.

At present, data in a distributed storage system is generally distributed at random to a flash hard disk region or a magnetic hard disk region of a storage server. However, data comes in many types, such as audio, video, pictures, and text; each type requires a different read-write speed and a different number of reads and writes, and the read frequency of each type also differs and changes dynamically over time. Random placement therefore does not help improve the read-write speed or the storage space utilization of the storage server, and cannot match storage data to storage resources in a reasonable way.

Disclosure of Invention

In view of the above problems, the present invention provides a data storage and reading method, which solves the problems that the storage server has a slow read-write speed and a low utilization rate of storage space, and cannot reasonably match the storage data with the storage resources.

To solve the above technical problems, the invention adopts the following technical scheme: a data storage method comprising the following steps: receiving storage data and acquiring the type of the storage data, wherein the type comprises audio and video, pictures, and text; setting an initial frequency identifier for the storage data according to the type, wherein the frequency identifier is one of high frequency, intermediate frequency, and low frequency; if the frequency identifier is high frequency, dividing the storage data into a plurality of data blocks, storing a plurality of copies of each data block on different nodes of a first storage device, and storing a first mapping table in memory, wherein the first mapping table is a mapping relation between the data blocks and the nodes of the first storage device; if the frequency identifier is intermediate frequency, dividing the storage data into a plurality of data blocks, generating a first check block from the data blocks, storing the plurality of data blocks and the first check block on different nodes of a second storage device, and storing a second mapping table in memory, wherein the second mapping table is a mapping relation between the data blocks and the first check block, on the one hand, and the nodes of the second storage device, on the other; if the frequency identifier is low frequency, dividing the storage data into a plurality of data blocks, compressing the plurality of data blocks to obtain a plurality of compressed data blocks, generating a second check block from the compressed data blocks, storing the plurality of compressed data blocks and the second check block on different nodes of the second storage device, and generating a third mapping table in memory, wherein the third mapping table is a mapping relation between the compressed data blocks and the second check block, on the one hand, and the nodes of the second storage device, on the other; and monitoring the read frequency of the storage data, changing the frequency identifier of the storage data according to a set frequency threshold, and switching the data storage mode accordingly.

Preferably, storing the data blocks and/or the compressed data blocks in the storage device further includes: caching the data to be stored in the storage device, and sending the data when the accumulated amount exceeds a set length threshold.

Preferably, the calculation formula of the reading frequency is as follows:

[Formula not reproduced in the source text.]

In the above formula, the terms denote, in order: the read frequency of the data; the initial frequency of the data; the time of the last scan of the data; the time of the last access of the data; and, for the remaining terms (of which only the symbol m survives in the text), the current scan time of the data and the number of data accesses.

Preferably, compressing the plurality of data blocks to obtain a plurality of compressed data blocks includes: counting the number of occurrences of each data value in each data block, and recording the value with the highest probability of occurrence together with that probability; determining whether the highest probability exceeds a set frequency threshold and, if so, replacing the value with a marker before output, and otherwise outputting the data unchanged; and repeating in turn until all the data blocks have been compressed.

Preferably, generating the first check block from the data blocks includes: dividing k data blocks into a groups, each group containing k/a data blocks; calculating one local check block within each group based on an encoding equation; and calculating r global check blocks from all the data blocks, wherein the encoding equation uses a Vandermonde matrix or a Cauchy matrix.

Preferably, generating the second check block from the compressed data blocks includes: calculating b redundant blocks from all the compressed data blocks based on an encoding equation, wherein the encoding equation uses a Vandermonde matrix or a Cauchy matrix.

Preferably, the first storage device is a flash hard disk region, and the second storage device is a magnetic hard disk region.

The invention also provides a data reading method, applied to storage data stored on the first storage device or the second storage device by the above data storage method, comprising the following steps: acquiring the storage data according to a read instruction and determining the frequency identifier of the storage data; if the frequency identifier is high frequency, reading the data blocks or data block copies stored on the nodes of the first storage device according to the first mapping table, and outputting the read result once reading is complete; if the frequency identifier is intermediate frequency, reading the data blocks stored on the nodes of the second storage device according to the second mapping table, and outputting the read result once reading is complete; and if the frequency identifier is low frequency, reading the compressed data blocks stored on the nodes of the second storage device according to the third mapping table, decompressing them to obtain the data blocks, and splicing the data blocks to output the read result.

Preferably, the read operation further includes deciding, according to the load degree of read requests on the node, whether to read a reconstructed file or to read the source file directly: if the load degree exceeds a set load threshold, the reconstructed file is selected; otherwise, the source file is read directly.

Preferably, the calculation formula of the load degree of the node read request is as follows:

[Formulas not reproduced in the source text.]

In the above formulas, the terms denote, in order: the load degree of the jth data node; the load degree of the ith data block of the jth data node in time period t, where n is the number of data blocks; a quantity associated with the ith data block (its symbol and name are missing from the text), where O is a given value; the time interval since the load was last calculated; the load degree of the ith data block of the jth data node during a time period (the period symbol is missing from the text); and the length of the ith data block.

Compared with the prior art, the invention has the following beneficial effects. By distinguishing the types of storage data and assigning frequency identifiers according to read frequency, the storage mode and storage location of the data can be adjusted dynamically, so that the storage server adapts to the actual data access pattern, the effective utilization of its storage space improves, and storage data is reasonably matched to storage resources. Different data protection modes are adopted for data with different read frequencies. For data read at high frequency, multiple copies are used for backup; no check calculation is needed, reconstruction performance is good, and read-write speed is high, which meets the read-write requirements of high-frequency data. For data read at intermediate frequency, local check blocks and global check blocks are generated with the encoding equation, so lost data can be reconstructed quickly while the storage cost is lower than in the multi-copy mode, which meets the read-write requirements of intermediate-frequency data. For data read at low frequency, redundant blocks are generated with the encoding equation and used to reconstruct lost data; although reconstruction speed is sacrificed, storage cost is reduced, and compressing the data blocks reduces it further, which suits the read-write requirements of low-frequency data.

Drawings

The disclosure of the present invention is illustrated with reference to the accompanying drawings. It is to be understood that the drawings are designed solely for the purposes of illustration and not as a definition of the limits of the invention. In the drawings, like reference numerals are used to refer to like parts. Wherein:

FIG. 1 is a flow chart of a data storage method according to an embodiment of the present invention;

FIG. 2 is a flow chart of a data reading method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a process of generating a first check block according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a process of generating a second check block according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a process of reconstructing data based on a second check block according to an embodiment of the present invention.

Detailed Description

It will be readily understood that, based on the technical solution of the present invention, a person skilled in the art can propose various alternative structures and implementations without departing from the spirit of the invention. Therefore, the following detailed description and the accompanying drawings merely illustrate the technical solution of the present invention and should not be construed as the whole of the invention or as limiting its technical solution.

An embodiment of the present invention is described with reference to FIG. 1. A data storage method comprises the following steps:

S11: Receive the storage data and acquire its type. For example, the storage data types include audio and video, pictures, and text.

S12: Set an initial frequency identifier for the storage data according to the type, wherein the frequency identifier is one of high frequency, intermediate frequency, and low frequency. By default, audio and video is low frequency, pictures are intermediate frequency, and text is high frequency; these defaults can be adjusted manually according to actual requirements.
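The default assignment in S12 can be sketched as a small lookup table. This is a minimal illustration only: the names `Frequency`, `DEFAULT_FREQUENCY`, and `initial_frequency`, and the type keys, are hypothetical and do not come from the patent text.

```python
from enum import Enum


class Frequency(Enum):
    HIGH = "high"
    INTERMEDIATE = "intermediate"
    LOW = "low"


# Defaults stated in S12; the key names are illustrative assumptions.
DEFAULT_FREQUENCY = {
    "audio_video": Frequency.LOW,
    "picture": Frequency.INTERMEDIATE,
    "text": Frequency.HIGH,
}


def initial_frequency(data_type, overrides=None):
    """Return the initial frequency identifier for a data type.

    `overrides` models the manual adjustment mentioned in S12.
    """
    table = {**DEFAULT_FREQUENCY, **(overrides or {})}
    return table[data_type]
```

Passing `overrides={"picture": Frequency.HIGH}` would model an operator raising pictures to high frequency for a particular deployment.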

S13: If the frequency identifier is high frequency, divide the storage data into a plurality of data blocks, store a plurality of copies of each data block on different nodes of the first storage device, and store a first mapping table in memory, wherein the first mapping table is a mapping relation between the data blocks and the nodes of the first storage device. The first storage device is a flash hard disk region.

For example, a piece of storage data is divided by default into data blocks of 1 MB, each data block is replicated into 3 copies, and the copies are then stored on different nodes of the first storage device according to a distributed storage algorithm (such as consistent hashing, hash-modulo, or hash slots). The copies are distributed at random across different nodes to achieve automatic data balancing and horizontal scaling. When a disk or node fails, the system can automatically re-create a new copy of the data according to the mapping table, which ensures data reliability.
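The example above can be sketched as follows, assuming consistent hashing is the chosen placement algorithm. The function names, the SHA-256 hash, and the use of 64 virtual nodes per storage node are illustrative assumptions; the sketch only builds the first mapping table and does not transfer any data.

```python
import hashlib
from bisect import bisect_right

BLOCK_SIZE = 1024 * 1024  # 1 MB default block size from the example
REPLICAS = 3              # 3 copies per block from the example


def split_blocks(data, block_size=BLOCK_SIZE):
    """Divide the storage data into fixed-size data blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def _hash(key):
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


def build_ring(nodes, vnodes=64):
    """Consistent-hash ring: each node appears at `vnodes` points."""
    return sorted((_hash(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))


def place_replicas(block_id, ring, replicas=REPLICAS):
    """Walk clockwise from the block's hash, picking distinct nodes."""
    if replicas > len({n for _, n in ring}):
        raise ValueError("not enough nodes for the requested replica count")
    i = bisect_right(ring, (_hash(block_id), ""))
    chosen = []
    while len(chosen) < replicas:
        node = ring[i % len(ring)][1]
        if node not in chosen:
            chosen.append(node)
        i += 1
    return chosen


def store_high_frequency(name, data, nodes):
    """Return a first mapping table: block id -> nodes holding its copies."""
    ring = build_ring(nodes)
    return {
        f"{name}:{idx}": place_replicas(f"{name}:{idx}", ring)
        for idx, _ in enumerate(split_blocks(data))
    }
```

Because every block always maps to distinct nodes, losing one node leaves at least two copies of each block, and the mapping table identifies which blocks need a new copy.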

S14: If the frequency identifier is intermediate frequency, divide the storage data into a plurality of data blocks, generate a first check block from the data blocks, store the plurality of data blocks and the first check block on different nodes of the second storage device, and store a second mapping table in memory, wherein the second mapping table is a mapping relation between the data blocks and the first check block, on the one hand, and the nodes of the second storage device, on the other. The second storage device is a magnetic hard disk region.

In this embodiment, the first check block includes local check blocks and global check blocks. Given k data blocks, a local check blocks, and r global check blocks, generating the first check block from the data blocks includes: dividing the k data blocks into a groups, each containing k/a data blocks; calculating one local check block for each group based on an encoding equation; and calculating r global check blocks from all the data blocks, wherein the encoding equation uses a Vandermonde matrix or a Cauchy matrix.

For example, as shown in FIG. 3, the storage data is divided into 6 data blocks, which form two groups, (X0, X1, X2) and (Y0, Y1, Y2). The first group generates one local check block PX from its three data blocks (X0, X1, X2), the second group generates one local check block PY from its three data blocks (Y0, Y1, Y2), and two global check blocks P0 and P1 are then generated from all 6 data blocks. If data block X1 is lost, it can be recovered from X0, X2, and PX alone, so reconstruction is fast.

Compared with traditional erasure coding (EC), this recovery method greatly reduces the cost of data reconstruction and improves reconstruction speed. When any single data block is lost, it can be recovered by reading only the nearby (a + r - 1) blocks, which reduces reconstruction latency and the consumption of bandwidth, disk I/O, and other resources, and improves the overall reliability of the system.
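The local repair in the FIG. 3 example can be sketched as below. For brevity the sketch uses bytewise XOR as the local check block, which is an assumption: the text specifies an encoding equation based on a Vandermonde or Cauchy matrix, and the global check blocks P0 and P1 are omitted here. The function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def local_check_blocks(blocks, groups):
    """Split k blocks into `groups` groups and compute one check block each."""
    if len(blocks) % groups:
        raise ValueError("k must be divisible by the number of groups")
    size = len(blocks) // groups
    return [xor_blocks(blocks[g * size:(g + 1) * size]) for g in range(groups)]


def repair_in_group(surviving_blocks, check_block):
    """Recover the single lost block of a group from its survivors."""
    return xor_blocks(list(surviving_blocks) + [check_block])
```

With two groups of three blocks, repairing X1 reads only X0, X2, and PX, which mirrors the three-block read described in the example; blocks of the other group and the global check blocks are not touched.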

S15: If the frequency identifier is low frequency, divide the storage data into a plurality of data blocks, compress the plurality of data blocks to obtain a plurality of compressed data blocks, generate a second check block from the compressed data blocks, store the plurality of compressed data blocks and the second check block on different nodes of the second storage device, and generate a third mapping table in memory, wherein the third mapping table is a mapping relation between the compressed data blocks and the second check block, on the one hand, and the nodes of the second storage device, on the other. The second storage device is a magnetic hard disk region.

Compressing the plurality of data blocks to obtain a plurality of compressed data blocks includes: counting the number of occurrences of each data value in each data block, and recording the value with the highest probability of occurrence together with that probability; determining whether the highest probability exceeds a set frequency threshold and, if so, replacing the value with a marker before output, and otherwise outputting the data unchanged; and repeating in turn until all the data blocks have been compressed. With this compression mode, the compressed data is smaller, or much smaller, than the data before compression. When the probability of occurrence of a data value in a data block is higher than the set frequency threshold, marker replacement is used for encoding; otherwise, normal encoding is used, and the data size is the same before and after compression. This guarantees that the compressed data of each data block occupies no more storage space than the data before compression, so that an overall compression effect is obtained.
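One possible reading of this marker-replacement scheme is sketched below. The text does not specify the marker format, so the sketch makes its own assumptions: the dominant byte value is recorded once in a header, a bitmap marks the positions where it occurs, and only the other bytes are stored. The names `compress_block`, `decompress_block`, `RAW`, and `MARKED` are hypothetical.

```python
from collections import Counter

RAW, MARKED = 0, 1  # one-byte mode flag; an assumption of this sketch


def compress_block(block, threshold=0.5):
    """Replace the dominant byte with bitmap marks if it is frequent enough."""
    if not block:
        return bytes([RAW])
    value, count = Counter(block).most_common(1)[0]
    if count / len(block) <= threshold:
        return bytes([RAW]) + block  # normal output, data unchanged
    bitmap = bytearray((len(block) + 7) // 8)
    rest = bytearray()
    for i, byte in enumerate(block):
        if byte == value:
            bitmap[i // 8] |= 1 << (i % 8)  # mark a dominant-value position
        else:
            rest.append(byte)
    header = bytes([MARKED, value]) + len(block).to_bytes(4, "big")
    return header + bytes(bitmap) + bytes(rest)


def decompress_block(payload):
    """Invert compress_block."""
    if payload[0] == RAW:
        return payload[1:]
    value = payload[1]
    length = int.from_bytes(payload[2:6], "big")
    bitmap_len = (length + 7) // 8
    bitmap = payload[6:6 + bitmap_len]
    rest = iter(payload[6 + bitmap_len:])
    out = bytearray()
    for i in range(length):
        marked = (bitmap[i // 8] >> (i % 8)) & 1
        out.append(value if marked else next(rest))
    return bytes(out)
```

Unlike the scheme described in the text, this sketch spends one extra flag byte on blocks that are output unchanged, so it only approximates the "never larger" guarantee. With a threshold of 0.5 or higher, marked blocks of more than a few dozen bytes come out smaller than the input.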

In this embodiment, the second check block consists of redundant blocks. Given j compressed data blocks and b redundant blocks, generating the second check block from the compressed data blocks includes: calculating b redundant blocks from all the compressed data blocks based on an encoding equation, wherein the encoding equation uses a Vandermonde matrix or a Cauchy matrix. This mode tolerates the loss of at most b blocks, which improves the reliability of data storage.

For example, referring to FIG. 4, the matrix G is a Cauchy matrix, the matrix D contains 5 compressed data blocks (D1, D2, D3, D4, D5), and the matrix E contains the matrix D together with 2 redundant blocks (C1, C2). Referring to FIG. 5, if data blocks D2 and D5 are lost, the row vectors corresponding to the surviving blocks are selected from the matrix G, the inverse of the resulting matrix is computed, and this inverse is combined with the remaining blocks to reconstruct the original data D.
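The encode and reconstruct steps of FIG. 4 and FIG. 5 can be sketched with a Cauchy matrix as follows. The sketch works on single symbols over the prime field of integers modulo 257, which is an assumption made for readability; production erasure codes typically work over GF(2^8) on whole blocks. All function names are hypothetical, and solving the linear system stands in for the explicit matrix inversion in the text.

```python
P = 257  # small prime field; an assumption of this sketch


def inv(a):
    """Modular inverse (Python 3.8+)."""
    return pow(a, -1, P)


def cauchy_rows(k, b):
    """b x k Cauchy matrix with entries 1 / (x_i - y_j) mod P."""
    xs = range(b)          # 0 .. b-1
    ys = range(b, b + k)   # b .. b+k-1, disjoint from xs
    return [[inv((x - y) % P) for y in ys] for x in xs]


def encode(data, b):
    """Return the k data symbols followed by b redundant symbols."""
    rows = cauchy_rows(len(data), b)
    redundant = [sum(c * d for c, d in zip(row, data)) % P for row in rows]
    return list(data) + redundant


def solve(matrix, rhs):
    """Gauss-Jordan elimination mod P for a square, invertible matrix."""
    n = len(matrix)
    aug = [row[:] + [v] for row, v in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] % P)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        factor = inv(aug[col][col])
        aug[col] = [v * factor % P for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                m = aug[r][col]
                aug[r] = [(a - m * c) % P for a, c in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def reconstruct(stored, k, b):
    """Rebuild the k data symbols; lost positions in `stored` are None."""
    identity = [[int(i == j) for j in range(k)] for i in range(k)]
    generator = identity + cauchy_rows(k, b)
    alive = [i for i, v in enumerate(stored) if v is not None][:k]
    if len(alive) < k:
        raise ValueError("more than b blocks lost")
    return solve([generator[i] for i in alive], [stored[i] for i in alive])
```

Any k of the k + b stored symbols suffice, because every square submatrix of a Cauchy matrix is invertible; losing D2 and D5 therefore leaves a solvable 5 x 5 system built from three identity rows and the two Cauchy rows.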

S16: Monitor the read frequency of the storage data, change the frequency identifier of the storage data according to a set frequency threshold, and switch the data storage mode accordingly.

Specifically, the calculation formula of the reading frequency is as follows:

[Formula not reproduced in the source text.]

In the above formula, the terms denote, in order: the read frequency of the data; the initial frequency of the data; the time of the last scan of the data; the time of the last access of the data; and, for the remaining terms (of which only the symbol m survives in the text), the current scan time of the data and the number of data accesses.
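The threshold comparison in S16 can be sketched independently of how the read frequency is computed. In the sketch below the read frequency is simply an input number, and the two threshold values and the function names are hypothetical; the text only states that set frequency thresholds are used.

```python
def classify(read_frequency, low_threshold, high_threshold):
    """Map a read frequency to a frequency identifier."""
    if read_frequency >= high_threshold:
        return "high"
    if read_frequency >= low_threshold:
        return "intermediate"
    return "low"


def update_identifier(current, read_frequency,
                      low_threshold=10.0, high_threshold=100.0):
    """Return (new_identifier, switch_needed).

    The default thresholds are placeholders, not values from the source.
    """
    new = classify(read_frequency, low_threshold, high_threshold)
    return new, new != current
```

When `switch_needed` is true, the data would be re-stored using the mode of the new identifier, for example moving from compressed erasure-coded blocks to multiple copies.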

Further, storing the data blocks and/or compressed data blocks in the storage device also includes: caching the data to be stored in the storage device, and sending the data when the accumulated amount exceeds a set length threshold. When the size of the locally created temporary cache file exceeds one block (usually 64 MB), the node of the storage device is contacted and a storage location on a data storage node is allocated to complete the file write. Caching the data to be stored greatly reduces the load on the server side.
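A minimal sketch of this write-side caching is shown below. It buffers in memory rather than in a local temporary file, which is a simplification, and the class name `WriteBuffer` and the `send` callback are hypothetical.

```python
class WriteBuffer:
    """Accumulate writes and send them once a length threshold is exceeded."""

    def __init__(self, send, threshold=64 * 1024 * 1024):
        self._send = send            # callable that ships bytes to a node
        self._threshold = threshold  # 64 MB, the usual block size in the text
        self._chunks = []
        self._size = 0

    def write(self, data):
        self._chunks.append(data)
        self._size += len(data)
        if self._size > self._threshold:
            self.flush()

    def flush(self):
        """Send whatever has accumulated, if anything."""
        if self._chunks:
            self._send(b"".join(self._chunks))
            self._chunks, self._size = [], 0
```

Callers would invoke `flush()` once at the end of a file so that a final partial block is not left in the buffer.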

The invention selects the most appropriate storage mode for data with different read-frequency requirements, greatly reduces storage cost without reducing access efficiency, and provides a better solution for long-term storage of and instant access to large amounts of data.

Referring to FIG. 2, the present invention further provides a data reading method, applied to storage data stored on the first storage device or the second storage device by any of the data storage methods described above, comprising the following steps:

S21: Acquire the storage data according to the read instruction and determine its frequency identifier. The frequency identifier is one of high frequency, intermediate frequency, and low frequency.

S22: If the frequency identifier is high frequency, read the data blocks or data block copies stored on the nodes of the first storage device according to the first mapping table, and output the read result once reading is complete.

S23: If the frequency identifier is intermediate frequency, read the data blocks stored on the nodes of the second storage device according to the second mapping table, and output the read result once reading is complete.

S24: If the frequency identifier is low frequency, read the compressed data blocks stored on the nodes of the second storage device according to the third mapping table, decompress them to obtain the data blocks, and splice the data blocks to output the read result.
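Steps S21 to S24 amount to a dispatch on the frequency identifier, sketched below. The layout of the mapping tables, the `fetch` callback, and the use of `zlib` as a stand-in for the marker-based compression are all assumptions of this sketch, and replica selection for high-frequency data is reduced to one location per block.

```python
import zlib


def read_data(name, frequency, mapping_tables, fetch,
              decompress=zlib.decompress):
    """Read and splice the blocks of `name`.

    mapping_tables: {"high": first table, "intermediate": second table,
                     "low": third table}; each table maps a data name to
                     an ordered list of (node, block_id) locations.
    fetch(node, block_id) -> bytes stored at that location.
    """
    locations = mapping_tables[frequency][name]
    blocks = [fetch(node, block_id) for node, block_id in locations]
    if frequency == "low":
        # Low-frequency blocks are stored compressed (S24).
        blocks = [decompress(block) for block in blocks]
    return b"".join(blocks)
```

A fuller version would fall back to another copy (high frequency) or to reconstruction from check blocks (intermediate and low frequency) when a `fetch` fails.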

The read operation further includes deciding, according to the load degree of read requests on the node, whether to read a reconstructed file or to read the source file directly: if the load degree exceeds a set load threshold, the reconstructed file is selected; otherwise, the source file is read directly.

The calculation formula of the load degree of the node reading request is as follows:

[Formulas not reproduced in the source text.]

In the above formulas, the terms denote, in order: the load degree of the jth data node; a quantity associated with the ith data block (its symbol and name are missing from the text), where O is a given value; the time interval since the load was last calculated; the load degree of the ith data block of the jth data node during a time period (the period symbol is missing from the text); and the length of the ith data block.
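The load-based choice between a direct read and a reconstructed read can be sketched as below, taking the load degree of each node as an already computed input. The function name `plan_read` and the dictionary of loads are hypothetical.

```python
def plan_read(block_node, peer_nodes, node_load, load_threshold):
    """Decide how to serve a read for a block held by `block_node`.

    Returns ("direct", [block_node]) when the node is not overloaded,
    otherwise ("reconstruct", peers), meaning the block is rebuilt from
    blocks held by `peer_nodes` instead of being read from the busy node.
    """
    if node_load[block_node] > load_threshold:
        return "reconstruct", list(peer_nodes)
    return "direct", [block_node]
```

The sketch does not check whether the peers themselves are overloaded; a practical scheduler would likely compare both options before choosing.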

In summary, the beneficial effects of the invention are as follows. By distinguishing the types of storage data and assigning frequency identifiers according to read frequency, the storage mode and storage location of the data can be adjusted dynamically, so that the storage server adapts to the actual data access pattern, the effective utilization of its storage space improves, and storage data is reasonably matched to storage resources. Different data protection modes are adopted for data with different read frequencies. For data read at high frequency, multiple copies are used for backup; no check calculation is needed, reconstruction performance is good, and read-write speed is high, which meets the read-write requirements of high-frequency data. For data read at intermediate frequency, local check blocks and global check blocks are generated with the encoding equation, so lost data can be reconstructed quickly while the storage cost is lower than in the multi-copy mode, which meets the read-write requirements of intermediate-frequency data. For data read at low frequency, redundant blocks are generated with the encoding equation and used to reconstruct lost data; although reconstruction speed is sacrificed, storage cost is reduced, and compressing the data blocks reduces it further, which suits the read-write requirements of low-frequency data.

It should be understood that the integrated unit, if implemented as a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.

The technical scope of the present invention is not limited to the above description, and those skilled in the art can make various changes and modifications to the above-described embodiments without departing from the technical spirit of the present invention, and such changes and modifications should fall within the protective scope of the present invention.

Claims

1. A method of storing data, comprising the steps of:

receiving storage data, and acquiring the type of the storage data, wherein the type comprises audio and video, pictures, and text;

setting an initial frequency identifier for the storage data according to the type, wherein the frequency identifier is one of high frequency, intermediate frequency, and low frequency;

if the frequency identifier is high frequency, dividing the storage data into a plurality of data blocks, storing a plurality of copies of each data block on different nodes of a first storage device, and storing a first mapping table in memory, wherein the first mapping table is a mapping relation between the data blocks and the nodes of the first storage device;

if the frequency identifier is intermediate frequency, dividing the storage data into a plurality of data blocks, generating a first check block from the data blocks, storing the plurality of data blocks and the first check block on different nodes of a second storage device, and storing a second mapping table in memory, wherein the second mapping table is a mapping relation between the data blocks and the first check block, on the one hand, and the nodes of the second storage device, on the other;

if the frequency identifier is low frequency, dividing the storage data into a plurality of data blocks, compressing the plurality of data blocks to obtain a plurality of compressed data blocks, generating a second check block from the compressed data blocks, storing the plurality of compressed data blocks and the second check block on different nodes of the second storage device, and generating a third mapping table in memory, wherein the third mapping table is a mapping relation between the compressed data blocks and the second check block, on the one hand, and the nodes of the second storage device, on the other;

and monitoring the read frequency of the storage data, changing the frequency identifier of the storage data according to a set frequency threshold, and switching the data storage mode accordingly.

2. The data storage method according to claim 1, wherein storing the data blocks and/or the compressed data blocks in the storage device further comprises: caching the data to be stored in the storage device, and sending the data when the accumulated amount exceeds a set length threshold.

3. The data storage method of claim 1, wherein the reading frequency is calculated as follows:

[Formula not reproduced in the source text.]

in the above formula, the terms denote, in order: the read frequency of the data; the initial frequency of the data; the time of the last scan of the data; the time of the last access of the data; and, for the remaining terms (of which only the symbol m survives in the text), the current scan time of the data and the number of data accesses.

4. The data storage method of claim 1, wherein compressing the plurality of data blocks to obtain a plurality of compressed data blocks comprises:

counting the number of occurrences of each data value in each data block, and recording the value with the highest probability of occurrence together with that probability;

determining whether the highest probability exceeds a set frequency threshold and, if so, replacing the value with a marker before output, and otherwise outputting the data unchanged;

and repeating in turn until all the data blocks have been compressed.

5. The data storage method according to claim 1, wherein the first check block includes local check blocks and global check blocks, and, given k data blocks, a local check blocks, and r global check blocks, generating the first check block from the data blocks includes: dividing the k data blocks into a groups, each group comprising k/a data blocks; calculating one local check block within each group based on an encoding equation; and calculating r global check blocks from all the data blocks, wherein the encoding equation uses a Vandermonde matrix or a Cauchy matrix.

6. The data storage method of claim 1, wherein the second check block comprises redundant blocks, and, given j compressed data blocks and b redundant blocks, generating the second check block from the compressed data blocks comprises: calculating b redundant blocks from all the compressed data blocks based on an encoding equation, wherein the encoding equation uses a Vandermonde matrix or a Cauchy matrix.

7. The data storage method of claim 1, wherein the first storage device is a flash hard disk region and the second storage device is a magnetic hard disk region.

8. A data reading method applied to storage data stored on the first storage device or the second storage device by the data storage method of any one of claims 1 to 7, comprising the steps of:

acquiring the storage data according to a read instruction, and determining the frequency identifier of the storage data;

if the frequency identifier is high frequency, reading the data blocks or data block copies stored on the nodes of the first storage device according to the first mapping table, and outputting the read result once reading is complete;

if the frequency identifier is intermediate frequency, reading the data blocks stored on the nodes of the second storage device according to the second mapping table, and outputting the read result once reading is complete;

and if the frequency identifier is low frequency, reading the compressed data blocks stored on the nodes of the second storage device according to the third mapping table, decompressing them to obtain the data blocks, and splicing the data blocks to output the read result.

9. The data reading method of claim 8, wherein the read operation further comprises deciding, according to the load degree of read requests on the node, whether to read a reconstructed file or to read the source file directly, the reconstructed file being selected if the load degree exceeds a set load threshold, and the source file being read directly otherwise.

10. The data reading method according to claim 9, wherein the calculation formula of the load degree of the node read request is as follows: