CN114371810A - Data storage method and device of HDFS

Info

Publication number: CN114371810A
Application number: CN202011101718.1A
Authority: CN
Inventors: 高宗宝; 陈燕雷; 李晓; 周波; 李光锴; 吴兴耀; 耿禄博
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Design Institute Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Design Institute Co Ltd
Priority date: 2020-10-15
Filing date: 2020-10-15
Publication date: 2022-04-19
Anticipated expiration: 2040-10-15
Also published as: CN114371810B

Abstract

The invention relates to the technical field of data storage, in particular to a data storage method and device of an HDFS. The method comprises the following steps: acquiring the data record number of a current data buffer after data to be stored is stored in the current data buffer; if the data record number is not less than a preset upper limit value and not greater than the data record number upper limit of the data block, storing the data to be stored into the current data buffer; performing HDFS writing on the data cached in the current data cache; wherein the preset upper limit value is the product of the upper limit of the data record number of the data block and a preset coefficient. The data storage method and the data storage device of the HDFS, provided by the embodiment of the invention, can combine small-scale data to the greatest extent under the condition of keeping the original characteristics of the data to be stored, so that the storage of the data in the HDFS can approach the block size, and the number of small data blocks in the HDFS is reduced.

Description

Data storage method and device of HDFS

Technical Field

The invention relates to the technical field of data storage, in particular to a data storage method and device of an HDFS.

Background

For the storage of data in HDFS (Hadoop Distributed File System), the prior art mainly adopts the following scheme:

Scheme 1: data is written directly to the HDFS from the client, for example by uploading it with the HDFS shell or with the copyFromLocalFile method of FileSystem. The scale of the uploaded data blocks is not considered, and the data has to be organized in the actual application scenario.

This scheme is the basic method for uploading files to the HDFS. It does not consider the scale of the uploaded data and leaves the organization of the data to the operating scenario, which costs time and labor.

Scheme 2: file content is appended through the API provided by Hadoop. The client obtains an append stream through the append method of the FileSystem class and writes further data into the stream to complete the append.

This scheme mainly appends data to an existing HDFS file. The client cannot control the size of the file blocks well, so the block sizes are unevenly distributed.

Therefore, providing an HDFS data storage method that fully considers the scale of the data, so that data stored in the HDFS approaches the block size, is of great significance.

Disclosure of Invention

Aiming at the defects in the prior art, the embodiment of the invention provides a data storage method of an HDFS, which comprises the following steps:

acquiring the data record number of a current data buffer after data to be stored is stored in the current data buffer;

if the data record number is not less than a preset upper limit value and not greater than the data record number upper limit of the data block, storing the data to be stored into the current data buffer;

performing HDFS writing on the data cached in the current data buffer;

wherein the preset upper limit value is the product of the upper limit of the data record number of the data block and a preset coefficient.

In one embodiment, the method further comprises:

and if the data record number is greater than the data record number upper limit of the data block, counting, and acquiring the data record number of a next data buffer after the data to be stored is stored in the next data buffer.

In one embodiment, the method further comprises:

if the data record number is smaller than the preset upper limit value, continuing the storage operation on the current data buffer;

wherein continuing the storage operation comprises:

storing the data to be stored into the current data buffer, and acquiring the data record number of the current data buffer after the next data to be stored is stored into the current data buffer.

In one embodiment, if the count value is greater than a preset threshold, HDFS writing is performed on the data to be stored.

In one embodiment, after the data to be stored is stored in the next data buffer and before the data record number of the next data buffer is obtained, the method further includes:

and storing the data to be stored into a waiting queue buffer.

In an embodiment, if the time consumed by the continued storage operation reaches a preset duration, HDFS writing is performed on the data cached in the current data buffer.

In one embodiment, the preset coefficient has a value in a range of 0.8 to 1.

On the other hand, an embodiment of the present invention further provides a data storage device for an HDFS, including:

the acquisition module is used for acquiring the data record number of the current data buffer after the data to be stored is stored in the current data buffer;

the judging module is used for storing the data to be stored into the current data buffer when the data record number is not less than a preset upper limit value and not more than the data record number upper limit of a data block;

the writing module is used for writing the data cached in the current data cache into the HDFS;

On the other hand, an embodiment of the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of any one of the above-mentioned data storage methods for the HDFS when executing the program.

In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the data storage method of the HDFS described above.

According to the data storage method and device of the HDFS, provided by the embodiment of the invention, the data stored in the data cache is written in the HDFS only when the data record number of the data cache is close to the upper limit of the data record number of the data block, so that small-scale data can be combined to the maximum extent under the condition of keeping the original characteristics of the data to be stored, the storage of the data in the HDFS can approach to the block size, and the number of the small data blocks in the HDFS is reduced.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flow chart illustrating a data storage method of an HDFS according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a data storage device of an HDFS according to an embodiment of the present invention;

FIG. 3 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The embodiments of the present invention will be described in further detail with reference to the drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.

Fig. 1 is a schematic flow chart of a data storage method of an HDFS according to an embodiment of the present invention, and referring to fig. 1, an embodiment of the present invention provides a data storage method of an HDFS, including:

S110, acquiring the data record number of the current data buffer after the data to be stored is stored in the current data buffer;

S120, if the data record number is not less than the preset upper limit value and not greater than the data record number upper limit of the data block, storing the data to be stored into the current data buffer;

S130, performing HDFS writing on the data cached in the current data buffer;

wherein the preset upper limit value is the product of the upper limit of the data record number of the data block and a preset coefficient.

The data storage method of the HDFS provided by the embodiment of the present invention may be executed by a computing device, such as a smart phone, a portable computer, a tablet computer, a personal computer, a wearable device, and the like.

Note that the upper limit of the number of data records is determined by the default size of the data block. For example, a data block with a default size of 128 MB may have an upper limit of about 300,000 data records.
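As a rough illustration of how the record-count upper limit could follow from the block size, the sketch below divides the block size by an average record size. The helper name `record_upper_limit` and the 447-byte average record size are assumptions chosen only to reproduce the order of magnitude quoted above; the source does not state a record size.

```python
def record_upper_limit(block_bytes: int, avg_record_bytes: int) -> int:
    """Estimate how many records fit in one HDFS block (hypothetical helper)."""
    if block_bytes <= 0 or avg_record_bytes <= 0:
        raise ValueError("sizes must be positive")
    return block_bytes // avg_record_bytes


BLOCK_BYTES = 128 * 1024 * 1024  # default block size of 128 MB
AVG_RECORD_BYTES = 447           # assumed average record size, not from the source

l0 = record_upper_limit(BLOCK_BYTES, AVG_RECORD_BYTES)
print(l0)  # about 300,000 records per block under these assumptions
```

In practice the upper limit would presumably be calibrated per data type, since record sizes vary between business scenarios.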

The data buffer is a buffer for structured data records. The name of the data buffer may be determined according to the specific service scenario. For example, for ordinary data the name may be the file name, and for mobile MRO data it may be "provide-city-enhanced", and the like. A data row counter is provided together with the data buffer and records the number of data records currently in the data buffer.

Specifically, when there is data to be stored, the data record number cu of the current data buffer after the data to be stored is stored in the current data buffer may be first obtained, and it may be determined, according to the data record number cu, what manner the data to be stored is stored in.

After the data record number cu of the current data buffer after the data to be stored is stored in the current data buffer is obtained, cu is compared with the preset upper limit value l₁ and the upper limit l₀ of the data record number of the data block. If cu is not less than l₁ and not greater than l₀ (i.e. l₁ ≤ cu ≤ l₀, where l₁ = p × l₀ and p is the preset coefficient), the data to be stored is stored in the current data buffer.

The value range of the preset coefficient p may be 0.8 to 1, in which case the preset upper limit value l₁ lies in the range 0.8 l₀ to 1 l₀. The specific value range of the preset coefficient p can be adjusted according to actual needs, which is not limited in the embodiment of the present invention.
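The three-way comparison of cu against l₁ and l₀ can be sketched as follows. This is a minimal illustration rather than the patented implementation: the names `Action` and `decide` and the default p = 0.9 are assumptions introduced here, and only the thresholds l₁ = p × l₀ and l₀ come from the description.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "keep buffering"          # cu < l1
    STORE_AND_WRITE = "store and write"  # l1 <= cu <= l0
    TRY_NEXT = "try next buffer"         # cu > l0


def decide(buffered: int, incoming: int, l0: int, p: float = 0.9) -> Action:
    """Classify the incoming data by the record count cu the buffer would reach."""
    if l0 <= 0 or not 0 < p <= 1:
        raise ValueError("l0 must be positive and p must lie in (0, 1]")
    l1 = p * l0               # preset upper limit value
    cu = buffered + incoming  # record count after storing the incoming data
    if cu > l0:
        return Action.TRY_NEXT
    if cu >= l1:
        return Action.STORE_AND_WRITE
    return Action.CONTINUE


print(decide(200_000, 80_000, l0=300_000))
```

With l₀ = 300,000 and p = 0.9, a buffer that would reach 280,000 records falls between l₁ = 270,000 and l₀, so the data is stored and the buffer is written out.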

When the data to be stored is stored in the current data buffer, the metadata of the data to be stored can be recorded, including the client number, the file name (depending on the data input source), the start line number, the end line number, and the file location (i.e. the file name in the HDFS, which must be unique within the cluster so that it can be used for addressing in subsequent processing). Recording this metadata makes it possible to verify the stored structured data, extract lines, and so on.
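The metadata fields listed above could be held in a small record type such as the one below. This is only a sketch: the class name `RecordMetadata`, the field names, the sample values, and the assumption that start and end line numbers are inclusive are all introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordMetadata:
    """Metadata kept for one piece of data merged into a buffer (hypothetical layout)."""
    client_id: str    # client number
    source_file: str  # file name, depends on the data input source
    start_line: int   # first line of this data in the merged file (assumed inclusive)
    end_line: int     # last line of this data in the merged file (assumed inclusive)
    hdfs_path: str    # file name in the HDFS, must be unique within the cluster

    def __post_init__(self) -> None:
        if self.start_line > self.end_line:
            raise ValueError("start_line must not exceed end_line")

    @property
    def line_count(self) -> int:
        return self.end_line - self.start_line + 1


meta = RecordMetadata("client-01", "input.csv", 1, 1200, "/data/merged/part-0001")
print(meta.line_count)
```

Keeping the line range per source file is what would allow a later job to extract exactly the lines that came from one input out of the merged HDFS file.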

After the data to be stored is stored in the current data buffer, HDFS writing and metadata writing can be carried out on the data cached in the current data buffer.

It can be understood that, if the data already in the data buffer has not reached the upper limit of the data record number of the data block but would exceed that upper limit once the data to be stored is added, then writing the result to the HDFS causes the HDFS to split it: a small data block appears after the split, and the data to be stored is itself cut across blocks.

In the data storage method of the HDFS provided in the embodiment of the present invention, since the HDFS is only written into the data stored in the data buffer when the data record number of the data buffer is close to the upper limit of the data record number of the data block, small-scale data can be combined to the maximum extent under the condition of keeping the original characteristics of the data to be stored, so that the data stored in the HDFS can approach the block size, and thus the number of small data blocks in the HDFS is reduced.

When the number of small data blocks in the HDFS is reduced, the processing efficiency of a task executed based on the data blocks in the HDFS can be obviously improved.

Further, in an embodiment, the data storage method of the HDFS provided by the embodiment of the present invention may further include:

if the data record number cu is greater than the upper limit l₀ of the data record number of the data block, counting, and acquiring the data record number of the next data buffer after the data to be stored is stored in the next data buffer.

When cu > l₀, storing the data to be stored in the current data buffer would make the buffered data exceed the storage upper limit of the data block. In this case a count is made, the count value is updated, and the data record number cu' of the next data buffer after the data to be stored is stored in the next data buffer is obtained.

It can be understood that, after the data record number cu' of the next data buffer is obtained, cu' can be compared with the preset upper limit value l₁ and the upper limit l₀ of the data record number of the data block. If cu' ≤ l₀, the data to be stored is stored in the next data buffer.

If cu' > l₀, counting continues, the count value is updated, and the data record number cu'' of the following data buffer after the data to be stored is stored in that buffer is obtained, and so on, until a subsequent data buffer can store the data to be stored or the count value reaches a preset threshold.

The data to be stored can be stored in the data buffer by multiple attempts, so that the probability of storing the data to be stored in the proper data buffer can be improved, the probability of splitting and generating small data blocks after the data to be stored is written in the HDFS is reduced, and the storage of the data in the HDFS can be further ensured to approach to the block size.

When the count value reaches the preset threshold, for example 9, it indicates that storing the data to be stored has been attempted 9 times, but no suitable data buffer could store it. Then, on the 10th attempt (i.e., when the count value is greater than the preset threshold), the data to be stored may be written to the HDFS directly. The specific value of the preset threshold may be adjusted according to actual needs, which is not limited in the embodiments of the present invention.

By writing the data to be stored directly to the HDFS when the count value is greater than the preset threshold, excessive resource waste can be avoided, and the data storage efficiency of the HDFS is improved.
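The counting procedure can be sketched as a loop over candidate buffers. The function name `place` and the representation of each buffer by its current record count are assumptions made for this illustration; returning `None` stands for "write the data to the HDFS directly". What happens when the candidate buffers run out before the threshold is exceeded is not specified in the description, so treating that case as a direct write is also an assumption.

```python
from typing import List, Optional


def place(incoming: int, buffer_counts: List[int], l0: int, threshold: int) -> Optional[int]:
    """Return the index of the first buffer that can take the incoming records.

    Returns None when the failure count exceeds the threshold, meaning the
    data should be written to the HDFS directly.
    """
    count = 0
    for index, buffered in enumerate(buffer_counts):
        if buffered + incoming <= l0:
            return index       # this buffer can store the data
        count += 1             # cu > l0: count the failed attempt
        if count > threshold:
            return None        # too many attempts: direct HDFS write
    return None                # assumption: no candidates left, write directly


print(place(50, [280, 290, 100], l0=300, threshold=9))
```

With a threshold of 9, nine failed attempts are tolerated and the tenth failure triggers the direct write, which matches the example given above.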

In an embodiment, after the data to be stored is stored in the next data buffer and before the data record number of the next data buffer is obtained, the data storage method of the HDFS provided in the embodiment of the present invention further includes:

and storing the data to be stored into the waiting queue buffer.

The structure of the waiting queue buffer is identical to the structure of the data buffer.

When the further storage judgment is carried out, the data to be stored is stored into the waiting queue buffer, so that the delay of storing the next data to be stored into the current data buffer can be avoided, and the operation efficiency of the data storage method of the HDFS provided by the embodiment of the invention is improved.

In an embodiment, the data storage method of the HDFS provided in the embodiment of the present invention may further include:

if the data record number cu is less than the preset upper limit value l₁, continuing the storage operation on the current data buffer;

wherein continuing the storage operation includes:

storing the data to be stored into the current data buffer, and acquiring the data record number cu₁ of the current data buffer after the next data to be stored is stored into the current data buffer.

It will be understood that when cu < l₁, the buffered data is still smaller than the data block size, so the current data buffer can store the data to be stored and can continue to accept subsequent data to be stored.

In an embodiment, if the time consumed by the continued storage operation reaches a preset duration, HDFS writing is performed on the data cached in the current data buffer.

The preset duration may be, for example, 10 ms, and the specific value may be adjusted according to actual needs, which is not limited in the embodiment of the present invention.

It can be understood that, by writing the data cached in the current data buffer to the HDFS when the time consumed by the continued storage operation reaches the preset duration, the current data buffer is prevented from being delayed for a long time while waiting for subsequent data to be stored, thereby ensuring the efficient operation of the data storage method of the HDFS provided by the embodiment of the present invention.
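The time-based flush can be sketched as a buffer that writes out either when it reaches l₁ or when it has been accumulating for the preset duration. The class `TimedBuffer`, its `poll` method, and the `write` callback standing in for the actual HDFS write are assumptions for this illustration; handling of data that would exceed l₀ is deliberately left out.

```python
import time
from typing import Callable, List, Optional


class TimedBuffer:
    """Buffer that flushes when nearly full or after a maximum wait (sketch)."""

    def __init__(self, l0: int, p: float, max_wait_seconds: float,
                 write: Callable[[List[str]], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._l1 = p * l0                  # preset upper limit value
        self._max_wait = max_wait_seconds  # preset duration
        self._write = write                # placeholder for the HDFS write
        self._clock = clock
        self._records: List[str] = []
        self._started: Optional[float] = None

    def add(self, batch: List[str]) -> None:
        if self._started is None:
            self._started = self._clock()  # start timing at the first record
        self._records.extend(batch)
        self._flush_if_due()

    def poll(self) -> None:
        """Call periodically so an idle buffer is still flushed on timeout."""
        self._flush_if_due()

    def _flush_if_due(self) -> None:
        if not self._records or self._started is None:
            return
        waited = self._clock() - self._started
        if len(self._records) >= self._l1 or waited >= self._max_wait:
            self._write(self._records)
            self._records = []
            self._started = None
```

Passing the clock in as a parameter is a design choice made here so that the timeout can be exercised deterministically, for example with a fake clock in a test, rather than by sleeping for the real duration.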

In summary, the data storage method of the HDFS provided in the embodiments of the present invention can combine small-scale data to the greatest extent while preserving the original features of the data, so as to reduce the number of small data blocks in the HDFS, thereby making the distribution of the data blocks in the HDFS more balanced.

Fig. 2 is a schematic structural diagram of a data storage device of an HDFS according to an embodiment of the present invention, and referring to fig. 2, an embodiment of the present invention further provides a data storage device of an HDFS, including:

an obtaining module 210, configured to obtain a data record number of a current data buffer after data to be stored is stored in the current data buffer;

the judging module 220 is configured to store the data to be stored in the current data buffer when the data record number is not less than the preset upper limit and is not greater than the data record number upper limit of the data block;

a write-in module 230, configured to perform HDFS write-in on data cached in the current data cache;

According to the data storage device of the HDFS provided by the embodiment of the invention, the data stored in the data buffer is written in the HDFS only when the data record number of the data buffer is close to the upper limit of the data record number of the data block, so that small-scale data can be combined to the maximum extent under the condition of keeping the original characteristics of the data to be stored, the storage of the data in the HDFS can be close to the block size, and the number of the small data blocks in the HDFS is reduced.

Fig. 3 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 3: a processor (processor)310, a communication interface (communication interface)320, a memory (memory)330 and a bus (bus)340, wherein the processor 310, the communication interface 320 and the memory 330 are communicated with each other via the bus 340. The processor 310 may call logic instructions in the memory 330 to perform the following method:

if the data record number is not less than the preset upper limit value and not greater than the data record number upper limit of the data block, storing the data to be stored into the current data buffer;

and performing HDFS writing on the data cached in the current data cache.

In addition, the logic instructions in the memory may be implemented in the form of software functional units and may be stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like.

Further, an embodiment of the present invention discloses a computer program product, the computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions, which when executed by a computer, the computer is capable of performing the method provided by the above-mentioned method embodiments, for example, including:

and performing HDFS writing on the data cached in the current data cache.

In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program is implemented by a processor to perform the method provided by the foregoing embodiments, for example, including:

and performing HDFS writing on the data cached in the current data cache.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A data storage method of an HDFS (Hadoop distributed File System), which is characterized by comprising the following steps:

performing HDFS writing on the data cached in the current data cache;

2. The HDFS data storage method according to claim 1, further comprising:

3. The HDFS data storage method according to claim 1, further comprising:

the continuing the storing operation comprises:

4. The HDFS data storage method according to claim 2, wherein if the count value is greater than a preset threshold, the HDFS writing is performed on the data to be stored.

5. The HDFS data storage method according to claim 2, wherein after the data to be stored is stored in the next data buffer and before the data record number of the next data buffer is obtained, the method further comprises:

and storing the data to be stored into a waiting queue buffer.

6. The HDFS data storage method according to claim 3, wherein if the time consumed by the continued storage operation reaches a preset duration, HDFS writing is performed on the data cached in the current data buffer.

7. The HDFS data storage method according to any one of claims 1 to 6, wherein the predetermined coefficient has a value in a range of 0.8 to 1.

8. A data storage device of an HDFS, comprising:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the data storage method of the HDFS according to any one of claims 1 to 7.

10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the data storage method of the HDFS according to any one of claims 1 to 7.