CN111309697A

CN111309697A - Method and device for storing data in distributed file system

Info

Publication number: CN111309697A
Application number: CN201811516697.2A
Authority: CN
Inventors: 温帮; 彭兴勃
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2018-12-12
Filing date: 2018-12-12
Publication date: 2020-06-19

Abstract

The invention discloses a method and a device for storing data in a distributed file system, and relates to the technical field of computers. One embodiment of the method comprises: independently configuring and differentially storing a data file written into a distributed file system by a data cluster and a corresponding file of the distributed file system, wherein the corresponding file comprises a system file and a log file; setting the number of the storage copies of the data file to be 2; and starting rack sensing to store the data files and the corresponding files across racks respectively. The embodiment can reduce about one third of the storage space, ensure the data reliability and the availability of the HBase cluster, realize the data storage with high availability and low cost, and greatly save the storage cost.

Description

Method and device for storing data in distributed file system

Technical Field

The invention relates to the technical field of computers, in particular to a method and a device for storing data in a distributed file system.

Background

Currently, big-data HBase clusters have become the storage solution used by many companies. To provide low-latency service, better hardware such as large memory, solid-state drives (SSD), and Non-Volatile Memory Express (NVMe) disks is often selected, which drives storage cost higher and higher.

The HBase cluster bottom layer data is stored in an HDFS (Hadoop distributed file system), and in order to guarantee data reliability, a 3-copy storage scheme is adopted by default. Meanwhile, in order to ensure high availability of the HBase clusters across the data center, main and standby HBase clusters across the data center are also built.

HDFS adopts a 3-copy storage scheme that ensures data reliability by introducing redundant data. In the scenario of a main HBase cluster and a standby HBase cluster, the data is stored as 6 copies in total across 2 HDFS clusters, which means that one piece of data expands 6-fold; the data redundancy is severe and storage resources are heavily wasted.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method and an apparatus for data storage in a distributed file system, which can reduce about one third of a storage space, ensure data reliability and availability of an HBase cluster, implement data storage with high availability and low cost, and greatly save storage cost.

To achieve the above object, according to an aspect of the embodiments of the present invention, a method for data storage in a distributed file system is provided.

A method of data storage in a distributed file system, comprising: independently configuring and differentially storing a data file written into a distributed file system by a data cluster and a corresponding file of the distributed file system, wherein the corresponding file comprises a system file and a log file; setting the number of the stored copies of the data file to be 2; and starting rack sensing to store the data files and the corresponding files across racks respectively.

Optionally, the method further comprises: by configuring logical rack sensing, merging racks whose number of deployed servers is smaller than a preset threshold into one logical rack, according to the number of servers deployed on different racks.

Optionally, the method further comprises: reducing the heartbeat interval time of the data nodes so as to reduce the timeout time of the distributed file system data node response.

Optionally, the method further comprises: keeping the Hedged Read function of the distributed file system turned off.

According to another aspect of the embodiments of the present invention, an apparatus for data storage in a distributed file system is provided.

An apparatus for data storage in a distributed file system, comprising: the file configuration module is used for independently configuring and differentially storing a data file written into the distributed file system by the data cluster and a corresponding file of the distributed file system, wherein the corresponding file comprises a system file and a log file; the copy setting module is used for setting the number of the stored copies of the data file to be 2; and the perception opening module is used for opening the rack perception so as to respectively store the data file and the corresponding file across racks.

Optionally, the apparatus further comprises a rack merging module, configured to: by configuring logical rack sensing, merge racks whose number of deployed servers is smaller than a preset threshold into one logical rack, according to the number of servers deployed on different racks.

Optionally, the apparatus further comprises a timeout setting module, configured to: reduce the heartbeat interval time of the data nodes so as to reduce the timeout time of the distributed file system data node response.

Optionally, the apparatus further comprises a function saving module, configured to: keep the Hedged Read function of the distributed file system turned off.

According to another aspect of the embodiments of the present invention, an electronic device for data storage in a distributed file system is provided.

An electronic device for data storage in a distributed file system, comprising: one or more processors; the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors implement the method for storing data in the distributed file system provided by the embodiment of the invention.

According to yet another aspect of embodiments of the present invention, a computer-readable medium is provided.

A computer readable medium, on which a computer program is stored, which when executed by a processor implements a method of data storage in a distributed file system as provided by an embodiment of the invention.

One embodiment of the above invention has the following advantages or benefits: the data files and the corresponding files (including system files, log files and the like) are independently configured and stored in a distinguishing mode, then the number of storage copies of the data files is set to be 2, and rack sensing is started to store the data files and the corresponding files in a cross-rack mode, so that the number of the data copies written into the HDFS by the main HBase cluster and the standby HBase cluster is reduced to 4 from the original 6, the reliability and the usability of the data of the HBase cluster are guaranteed while about one third of the storage space is reduced, the data storage with high availability and low cost is achieved, and the storage cost can be greatly saved.

Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic diagram of the main steps of a method of data storage in a distributed file system according to an embodiment of the present invention;

FIG. 2 is a schematic architecture diagram of a data storage method of an embodiment of the present invention;

FIG. 3 is a schematic diagram of the main modules of an apparatus for data storage in a distributed file system according to an embodiment of the present invention;

FIG. 4 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;

FIG. 5 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

To solve the problem of wasted storage resources in the prior art, in the scenario of an active HBase cluster and a standby HBase cluster, the invention stores separately the data files and the corresponding files (including system files, log files and the like) contained in the data that HBase writes into the Hadoop Distributed File System (HDFS). The number of copies of the data files is set to 2, so that the number of data copies written into HDFS by the main and standby HBase clusters is reduced from the original 6 to 4. Meanwhile, the data files are stored across racks by means of a rack sensing strategy, so that the storage space is reduced by about one third while the reliability and availability of the HBase cluster data are ensured, achieving highly available, low-cost data storage and greatly saving cost. Because system files and log files account for an extremely small proportion of the total storage space, their number of copies can be left unadjusted at 3 or adjusted to 2, and they are likewise stored across racks.

The rack here refers to a server rack. When a Hadoop cluster is large (Hadoop is a distributed system infrastructure developed by the Apache Foundation), rack sensing (also called rack awareness) is configured to tell Hadoop which machine belongs to which rack, so that the performance of Hadoop is exploited to the greatest extent. When rack sensing is configured, factors such as reliability, availability and bandwidth consumption need to be weighed, so that (1) communication between different nodes occurs within the same rack as far as possible instead of across racks; and (2) to improve fault tolerance, the name node places the copies of a data block on multiple racks as far as possible.

Fig. 1 is a schematic diagram of main steps of a method for storing data in a distributed file system according to an embodiment of the present invention. As shown in fig. 1, the method for storing data in a distributed file system according to an embodiment of the present invention mainly includes the following steps S101 to S103.

Step S101: independently configuring and differentially storing the data files written into the distributed file system by the data cluster and the corresponding files of the distributed file system, wherein the corresponding files comprise system files and log files.

In the prior art, when data is stored, the data files, system files, log files and the like included in the data are stored as one data block on one data node of HDFS. When all servers of the rack corresponding to that data node go down, the data files, system files, log files and the like all become unavailable. The system files include metadata, namespace tables and the like and hold the metadata information of the cluster; they occupy very few storage resources, but once the metadata is lost the cluster is severely affected. Log files, such as WAL (Write-Ahead Log, a highly concurrent, persistent log saving and replay mechanism) files, allow operations that had not completed before an HBase server went down to be replayed and executed from the WAL file, thereby ensuring the reliability of data.
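The replay role of the log file described above can be illustrated with a toy model. This is only a minimal sketch of the general write-ahead-log idea, not HBase's actual WAL implementation; the class `TinyStore` and its methods are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TinyStore:
    """Toy key-value store with a write-ahead log (illustrative only)."""
    wal: list = field(default_factory=list)   # log that survives a crash
    data: dict = field(default_factory=dict)  # in-memory state, lost on a crash

    def put(self, key, value):
        self.wal.append((key, value))  # 1. record the operation in the log first
        self.data[key] = value         # 2. then apply it to the in-memory state

    def crash(self):
        self.data = {}                 # memory is lost, the log is kept

    def replay(self):
        for key, value in self.wal:    # redo every logged operation in order
            self.data[key] = value


store = TinyStore()
store.put("row1", "a")
store.put("row2", "b")
store.crash()
store.replay()
print(store.data)
```

Because every operation is logged before it is applied, losing the in-memory state does not lose the operation, which is why the text treats the availability of the log file as critical.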

In order to ensure the reliability and safety of data storage, the data file can be separated from the corresponding file (including a system file, a log file and the like) and independently configured and distinguished for storage. Thus, the availability of system files and log files can be guaranteed even if the server that holds the data files goes down.

Step S102: the number of stored copies of the data file is set to 2.

The HDFS configuration file is modified so that data files are written into HDFS with 2 copies by default. In an HBase cluster, the data files account for most of the total storage space. Therefore, after the number of copies written into HDFS is reduced from 3 to 2, about one third of the storage space can be saved.
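The arithmetic behind the "about one third" figure can be checked with a short sketch. It assumes, as the text does, that data files dominate the storage footprint and that the main and standby clusters each keep the same number of copies; the function names are introduced here for illustration only.

```python
from fractions import Fraction


def total_copies(replicas_per_cluster: int, clusters: int = 2) -> int:
    """Copies of one piece of data across all HBase/HDFS clusters."""
    return replicas_per_cluster * clusters


def fraction_saved(old_replicas: int, new_replicas: int) -> Fraction:
    """Share of storage saved when the replica count is lowered."""
    return Fraction(old_replicas - new_replicas, old_replicas)


before = total_copies(3)        # main + standby with 3 copies each
after = total_copies(2)         # main + standby with 2 copies each
saving = fraction_saved(3, 2)   # per-cluster saving, same ratio overall
print(before, after, saving)
```

The saving is "about" one third rather than exactly one third because system files and log files, which may stay at 3 copies, still occupy a small share of the space.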

In addition, because the storage resources occupied by the system files are extremely small, the system files can still be stored as 3 copies (default values) in the HDFS, so that the risk of losing the system files is reduced, the cluster reliability is improved, and a lot of storage space resources are not occupied.

Similarly, the data volume of log files (such as WAL files) is small, and an automatic cleaning mechanism is provided, so that they do not occupy many storage resources. By setting hbase.regionserver.hlog.replication to 3, the WAL log files can be configured to still be stored as 3 copies (the default value) in HDFS, thereby ensuring the reliability of data and reducing the data recovery time after a node goes down.

According to the embodiment of the invention, because the system files and log files are grouped and configured independently of the data files, they can be stored on rack servers different from those holding the data files. In a specific implementation, the number of copies of the system files and log files stored in HDFS is at least 2, but should not be too large: the more copies are written, the greater the loss of write performance. Since 3 copies ensure data reliability in most cases, there is no need to sacrifice write performance by writing more copies.
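As a rough illustration of the copy-count settings in step S102, the sketch below renders Hadoop-style property entries. It rests on two assumptions: that the default copy count for data files is controlled by the standard HDFS property `dfs.replication` (the text does not name the property), and that the WAL property discussed above reads `hbase.regionserver.hlog.replication`. The helper `hadoop_properties` is a hypothetical name, and in a real deployment the two properties would live in different files (hdfs-site.xml and hbase-site.xml).

```python
import xml.etree.ElementTree as ET


def hadoop_properties(props: dict) -> str:
    """Render a dict as a Hadoop-style <configuration> XML fragment."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Assumed property names, shown together only for brevity.
xml_text = hadoop_properties({
    "dfs.replication": 2,                      # data files: 2 copies
    "hbase.regionserver.hlog.replication": 3,  # WAL files: keep 3 copies
})
print(xml_text)
```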

Step S103: starting rack sensing to store the data files and the corresponding files across racks.

By default, rack sensing in Hadoop is not enabled; it can be started through configuration. Rack awareness is a policy for HDFS copy placement. Generally, a data file is stored as 3 copies on HDFS by default: the 1st copy is placed on the local data node, the 2nd copy is placed on a node of a rack different from that of the 1st copy, and the 3rd copy is placed on a different node of the same rack as the 2nd copy.

After rack sensing is started, the data copies of HDFS can be stored across racks; that is, multiple copies of the same data in HDFS are stored on different racks, so that even if all servers of one rack go down, the data is still backed up on other racks and is not lost. Likewise, the data files and the corresponding files (system files, log files and the like) can be stored across racks; that is, multiple copies of a data file, system file, or log file in HDFS are stored on different racks.

Among the file copies of the data files, system files and log files, copies of different files may be stored on the same rack or on different racks. One possible storage result is, for example: the 2 copies of the data file are stored on rack 1 and rack 2 respectively, the 3 copies of the system file are stored on data nodes of rack 3 and rack 4, and the 3 copies of the log file are stored on data nodes of rack 1 and rack 3.
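The placement rule described above (1st copy on the local node, 2nd copy on a different rack, 3rd copy on another node of the 2nd copy's rack) can be sketched as follows. This is a simplified, deterministic model for illustration only: real HDFS chooses among eligible nodes with additional criteria, and the topology, node names and function name here are hypothetical.

```python
def place_replicas(topology: dict, writer_node: str, replicas: int) -> list:
    """Pick data nodes for the copies of one block.

    topology maps a rack name to the list of data nodes on that rack.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer_node]
    placement = [writer_node]                      # 1st copy: local data node
    if replicas >= 2:
        remote = sorted(n for n, r in rack_of.items() if r != local_rack)
        if not remote:
            raise ValueError("cross-rack storage needs at least two racks")
        placement.append(remote[0])                # 2nd copy: a different rack
    if replicas >= 3:
        second_rack = rack_of[placement[1]]
        peers = sorted(n for n in topology[second_rack] if n != placement[1])
        if not peers:
            raise ValueError("the second rack needs another data node")
        placement.append(peers[0])                 # 3rd copy: same rack as the 2nd
    return placement


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
two = place_replicas(topology, "dn1", 2)    # data file with 2 copies
three = place_replicas(topology, "dn1", 3)  # system or log file with 3 copies
print(two, three)
```

With 2 copies, the two replicas always land on different racks in this model, which is what lets the data survive the loss of one whole rack.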

According to the steps S101 to S103, the data files and the corresponding files (including the system files, the log files and the like) are separated and stored independently, then the number of the storage copies of the data files is set to be 2, and the rack sensing is started to store the data files and the corresponding files across racks, so that the number of the data copies written into the HDFS by the main and standby HBase clusters is reduced from 6 to 4, the reliability and the availability of the data of the HBase clusters are ensured while the storage space is reduced by about one third, the data storage with high availability and low cost is realized, and the storage cost can be greatly saved.

In addition, according to another embodiment of the present invention, when rack sensing is turned on, logical rack sensing may also be configured, so that racks whose number of deployed servers is smaller than a preset threshold are merged into one logical rack according to the number of servers deployed on different racks. Logical rack sensing is a division method, distinct from physical rack division, that merges several small racks with few data nodes into one logical rack according to the number of servers on each rack, so that the numbers of machines on different racks are logically similar. Specifically, by setting logical rack sensing, several racks with few data nodes can be merged into one logical rack, which solves the problem of skewed data storage when the numbers of nodes on the physical racks are unbalanced. For example, assume that rack 1 includes 10 data nodes, rack 2 includes 3 data nodes, rack 3 includes 4 data nodes, rack 4 includes 3 data nodes, and so on. When data is stored on these racks respectively, data storage may become skewed because rack 1 has many data nodes while the other racks have few. In this case, the skew can be resolved by merging racks 2, 3 and 4 into one logical rack.
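The rack-merging example above can be worked through in a few lines. This is a minimal sketch: the threshold value of 5 and the name `logical-rack-1` are assumptions chosen for illustration, since the text only speaks of "a preset threshold".

```python
def merge_small_racks(servers_per_rack: dict, threshold: int,
                      logical_name: str = "logical-rack-1") -> dict:
    """Map each physical rack to itself or to a shared logical rack."""
    return {
        rack: (logical_name if count < threshold else rack)
        for rack, count in servers_per_rack.items()
    }


def logical_sizes(servers_per_rack: dict, mapping: dict) -> dict:
    """Count the data nodes per logical rack after merging."""
    sizes = {}
    for rack, count in servers_per_rack.items():
        sizes[mapping[rack]] = sizes.get(mapping[rack], 0) + count
    return sizes


racks = {"rack1": 10, "rack2": 3, "rack3": 4, "rack4": 3}
mapping = merge_small_racks(racks, threshold=5)   # threshold is an assumed value
sizes = logical_sizes(racks, mapping)
print(mapping)
print(sizes)
```

After merging, the logical rack built from racks 2, 3 and 4 holds 3 + 4 + 3 = 10 data nodes, the same as rack 1, so copies spread across the two logical racks are no longer skewed toward the small racks.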

According to another embodiment of the invention, the timeout time of the response of the data nodes of the distributed file system can be reduced by reducing the heartbeat interval time of the data nodes, so that the data nodes can be timed out more quickly, the HDFS can sense the downtime of the data nodes more quickly, and data recovery is performed earlier, thereby reducing the risk of data loss.

HDFS uses a heartbeat mechanism to detect whether a data node is down. The data node (DataNode) sends heartbeats to the management node (NameNode) at a fixed period, and if the NameNode does not receive heartbeat information from a DataNode within a certain period of time, the NameNode marks that DataNode as down. The timeout for an HDFS data node response is calculated as: timeout = 2 × heartbeat.recheck.interval + 10 × dfs.heartbeat.interval. Here, heartbeat.recheck.interval is the interval at which the NameNode checks the DataNodes, that is, the heartbeat interval time of the DataNode referred to above; dfs.heartbeat.interval is the interval at which a DataNode sends heartbeats to the NameNode. heartbeat.recheck.interval defaults to 5 minutes and dfs.heartbeat.interval defaults to 3 seconds, so the HDFS timeout defaults to 10 minutes 30 seconds. According to the embodiment of the invention, the heartbeat interval time of the DataNode can be reduced by setting heartbeat.recheck.interval to 3 minutes, thereby reducing the timeout of the distributed file system. In a specific implementation, the heartbeat interval time of the DataNode may be set to other durations according to the requirements of the service application, weighing timeout efficiency against stability. Since dfs.heartbeat.interval, the interval at which a DataNode sends heartbeats to the NameNode, is already small, it is generally not adjusted.
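The timeout calculation can be verified numerically. The sketch below reads the formula above as timeout = 2 × heartbeat.recheck.interval + 10 × dfs.heartbeat.interval, using the stated defaults of 5 minutes and 3 seconds; the function and parameter names are introduced here for illustration.

```python
def datanode_timeout_seconds(recheck_interval_s: int,
                             heartbeat_interval_s: int = 3) -> int:
    """Time after which the NameNode treats a silent DataNode as down."""
    return 2 * recheck_interval_s + 10 * heartbeat_interval_s


default_timeout = datanode_timeout_seconds(5 * 60)  # default recheck interval
tuned_timeout = datanode_timeout_seconds(3 * 60)    # recheck interval of 3 minutes

for label, seconds in (("default", default_timeout), ("tuned", tuned_timeout)):
    minutes, rest = divmod(seconds, 60)
    print(f"{label}: {seconds} s = {minutes} min {rest} s")
```

Under these values the timeout drops from 630 seconds (10 minutes 30 seconds) to 390 seconds (6 minutes 30 seconds), so a failed data node is detected, and re-replication can begin, four minutes earlier.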

According to a further embodiment of the invention, the Hedged Read function of the distributed file system HDFS may also be kept turned off. Hedged Read is a read optimization of HDFS that relies on multiple copies; once the number of copies is reduced it is no longer suitable and needs to be turned off. Otherwise, in a scenario with few copies (2 copies of a data file), enabling the Hedged Read function easily causes data read failures.

Hedged Read is a feature introduced in Hadoop 2.4.0. If the response to a client's request to read a data block from a data node is slow, the client sends a read request for the same data block to another data node, takes whichever response returns first as the result of the request, and cancels the other outstanding requests. Hedged Read helps to control outliers and can resolve short-lived local read failures caused by disk or network problems. However, when a data file has few copies, a read may be judged to have failed because of a long response time; therefore the Hedged Read function of the distributed file system HDFS needs to be kept turned off.
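The hedged-read mechanism described above can be modeled without any real I/O. The following is a simplified simulation under stated assumptions, not the HDFS client's implementation: each replica has a fixed response latency, a further request is sent to the next replica every `hedge_threshold_ms` as long as no response has arrived, and the earliest response wins. All names and numbers are hypothetical.

```python
def hedged_read(latencies_ms: list, hedge_threshold_ms: int) -> tuple:
    """Return (winning replica index, time in ms at which its response arrives)."""
    if not latencies_ms:
        raise ValueError("at least one replica is required")
    best = None        # (replica index, arrival time of its response)
    launch_time = 0    # time at which the next request would be sent
    for replica, latency in enumerate(latencies_ms):
        if best is not None and best[1] <= launch_time:
            break      # a response already arrived, no further hedge is sent
        arrival = launch_time + latency
        if best is None or arrival < best[1]:
            best = (replica, arrival)
        launch_time += hedge_threshold_ms
    return best


slow_first = hedged_read([500, 20], hedge_threshold_ms=100)  # hedge wins
fast_first = hedged_read([30, 20], hedge_threshold_ms=100)   # no hedge needed
both_slow = hedged_read([500, 600], hedge_threshold_ms=100)  # only 2 copies
print(slow_first, fast_first, both_slow)
```

The last case hints at the limitation discussed in the text: with only 2 copies there is a single alternative replica, so when both respond slowly the hedge brings no benefit and the read still waits on a long response.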

According to the technical scheme of the invention, in the scenario where a main HBase cluster and a standby HBase cluster are deployed, when nodes of the main cluster fail and go down in batches, the standby cluster can be used for disaster recovery switching, and the availability of the HBase cluster is ensured by switching the service traffic to the standby cluster. If part of the data of the main cluster is lost, that data can be copied from the standby cluster back to the main cluster for data recovery, so that the integrity of the data is ensured.

FIG. 2 is a schematic architecture diagram of a data storage method of an embodiment of the present invention. As shown in fig. 2, the data storage method of the embodiment of the present invention mainly includes the following 7 aspects:

1. Data file copies = 2: the number of copies of the data files of the main and standby clusters is set to 2;

2. Logical rack sensing: racks with few data nodes are merged into one logical rack;

3. System file copies = 3: the system files are stored separately from the data files, and the number of copies of the system files of the main and standby clusters is set to 3;

4. Log file copies = 3: the log files are stored separately from the data files, and the number of copies of the log files of the main and standby clusters is set to 3;

5. HDFS optimization: the heartbeat interval time of the data nodes is reduced, and the Hedged Read function is turned off;

6. Disaster recovery switching: disaster recovery switching is realized by switching between the main HBase cluster and the standby HBase cluster;

7. Data recovery: the data of the standby cluster is copied to the main cluster for data recovery.

Through the configuration operations in 7 aspects as shown in fig. 2, it is possible to reduce the storage space in the context of the active and standby HBase clusters, and at the same time, ensure the data reliability and high availability of the HBase clusters, achieve high-availability low-cost data storage, and greatly save the storage cost.

Fig. 3 is a schematic diagram of main modules of an apparatus for data storage in a distributed file system according to an embodiment of the present invention. As shown in fig. 3, an apparatus 300 for storing data in a distributed file system according to an embodiment of the present invention mainly includes a file configuration module 301, a copy setting module 302, and a perception opening module 303.

The file configuration module 301 is configured to perform independent configuration and distinct storage on a data file written in the distributed file system by the data cluster and a corresponding file of the distributed file system, where the corresponding file includes a system file and a log file;

a copy setting module 302, configured to set the number of stored copies of the data file to 2;

and a perception opening module 303, configured to start rack sensing, so as to store the data files and the corresponding files across racks respectively.

According to an embodiment of the present invention, the apparatus 300 for data storage in a distributed file system may further include a rack merging module (not shown in the figure), configured to:

by configuring logical rack sensing, merge racks whose number of deployed servers is smaller than a preset threshold into one logical rack, according to the number of servers deployed on different racks.

According to another embodiment of the present invention, the apparatus 300 for data storage in a distributed file system may further include a timeout setting module (not shown in the figure), configured to:

reduce the heartbeat interval time of the data nodes so as to reduce the timeout time of the distributed file system data node response.

According to another embodiment of the present invention, the apparatus 300 for storing data in a distributed file system may further include a function saving module (not shown in the figure), configured to:

keep the Hedged Read function of the distributed file system turned off.

According to the technical scheme of the embodiment of the invention, the data files and the corresponding files (including system files, log files and the like) are independently configured and stored in a distinguishing way, then the number of the storage copies of the data files is set to be 2, and the rack sensing is started to store the data files and the corresponding files across racks, so that the number of the data copies written into the HDFS by the main and standby HBase clusters is reduced to 4 from the original 6, the data reliability and the availability of the HBase clusters are ensured while the storage space is reduced by about one third, the data storage with high availability and low cost is realized, and the storage cost can be greatly saved.

Fig. 4 illustrates an exemplary system architecture 400 of a method of data storage in a distributed file system or an apparatus of data storage in a distributed file system to which embodiments of the present invention may be applied.

As shown in fig. 4, the system architecture 400 may include terminal devices 401, 402, 403, a network 404, and a server 405. The network 404 serves as a medium for providing communication links between the terminal devices 401, 402, 403 and the server 405. Network 404 may include various types of connections, such as wire, wireless communication links, or fiber optic cables, to name a few.

A user may use terminal devices 401, 402, 403 to interact with a server 405 over a network 404 to receive or send messages or the like. The terminal devices 401, 402, 403 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).

The terminal devices 401, 402, 403 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.

The server 405 may be a server providing various services, such as a background management server (for example only) providing support for shopping websites browsed by users using the terminal devices 401, 402, 403. The backend management server may analyze and perform other processing on the received data such as the product information query request, and feed back a processing result (for example, target push information, product information, just an example) to the terminal device.

It should be noted that the method for storing data in the distributed file system provided by the embodiment of the present invention is generally executed by the server 405, and accordingly, the apparatus for storing data in the distributed file system is generally disposed in the server 405.

It should be understood that the number of terminal devices, networks, and servers in fig. 4 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

Referring now to FIG. 5, a block diagram of a computer system 500 suitable for use with a terminal device or server implementing an embodiment of the invention is shown. The terminal device or the server shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.

The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. A drive 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is installed into the storage section 508 as necessary.

In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 501.

It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units or modules described in the embodiments of the present invention may be implemented by software or by hardware. The described units or modules may also be provided in a processor, which may be described as: a processor comprising a file configuration module, a copy setting module and a perception opening module. The names of these units or modules do not, in some cases, limit the units or modules themselves; for example, the file configuration module may also be described as a "module for independently configuring and differentially storing a data file written into a distributed file system by a data cluster and a corresponding file of the distributed file system".

As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform the following: independently configuring and differentially storing a data file written into a distributed file system by a data cluster and a corresponding file of the distributed file system, wherein the corresponding file comprises a system file and a log file; setting the number of stored copies of the data file to 2; and enabling rack awareness so that the data file and the corresponding file are each stored across racks.
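The two-copy, cross-rack storage step described above can be illustrated with a minimal Python sketch. It is not the placement code of the distributed file system itself: the function names `place_two_replicas` and `raw_storage`, the node names, and the rack paths are hypothetical, and the "first node in sorted order" rule is an assumption made only so that the result is deterministic. The sketch places the first copy on the writing node and the second copy on a node of a different rack, and compares raw storage for 2 copies against 3 copies.

```python
from typing import Dict, List


def place_two_replicas(writer: str, node_to_rack: Dict[str, str]) -> List[str]:
    """Pick two data nodes on different racks for a block stored with 2 copies.

    Assumption for illustration: the first copy stays on the writer's node and
    the second copy goes to the first node (sorted by name) on any other rack.
    """
    local_rack = node_to_rack[writer]
    remote_nodes = sorted(n for n, r in node_to_rack.items() if r != local_rack)
    if not remote_nodes:
        raise ValueError("cross-rack placement needs at least two racks")
    return [writer, remote_nodes[0]]


def raw_storage(logical_bytes: int, copies: int) -> int:
    """Raw bytes consumed when every logical byte is stored `copies` times."""
    return logical_bytes * copies


# Hypothetical three-node topology spread over two racks.
topology = {"dn1": "/rack-a", "dn2": "/rack-a", "dn3": "/rack-b"}
replicas = place_two_replicas("dn1", topology)

# Fraction of raw storage saved by keeping 2 copies instead of 3.
saving = 1 - raw_storage(100, 2) / raw_storage(100, 3)
```

With 2 copies instead of 3, the raw footprint of the data files drops by one third, while the cross-rack constraint keeps one copy reachable if a whole rack fails.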

The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for data storage in a distributed file system, comprising:

independently configuring and differentially storing a data file written into a distributed file system by a data cluster and a corresponding file of the distributed file system, wherein the corresponding file comprises a system file and a log file;

setting the number of the stored copies of the data file to be 2;

and enabling rack awareness so that the data file and the corresponding file are each stored across racks.

2. The method of claim 1, further comprising: configuring logical rack awareness so that, according to the number of servers deployed on each rack, racks whose number of deployed servers is smaller than a preset threshold are merged into one logical rack.

3. The method of claim 1, further comprising: reducing the heartbeat interval of the data nodes so as to shorten the timeout for data node responses in the distributed file system.

4. The method of claim 1, further comprising: keeping the hedged read function of the distributed file system disabled.

5. An apparatus for data storage in a distributed file system, comprising:

the file configuration module is used for independently configuring and differentially storing a data file written into the distributed file system by the data cluster and a corresponding file of the distributed file system, wherein the corresponding file comprises a system file and a log file;

the copy setting module is used for setting the number of the stored copies of the data file to be 2;

and the perception opening module is used for enabling rack awareness so that the data file and the corresponding file are each stored across racks.

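6. The apparatus of claim 5, further comprising a rack merge module to:

configure logical rack awareness so that, according to the number of servers deployed on each rack, racks whose number of deployed servers is smaller than a preset threshold are merged into one logical rack.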

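7. The apparatus of claim 5, further comprising a timeout setting module to:

reduce the heartbeat interval of the data nodes so as to shorten the timeout for data node responses in the distributed file system.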

8. The apparatus of claim 5, further comprising a function preservation module to:

keep the hedged read function of the distributed file system disabled.

9. An electronic device for data storage in a distributed file system, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-4.

10. A computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-4.
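For illustration only, and forming no part of the claims, the rack merging of claim 2 and the timeout reduction of claim 3 can be sketched in Python. The names `merge_small_racks`, `dead_node_timeout_s` and `MERGED_RACK`, the rack paths, and the threshold value are hypothetical. The timeout formula (twice the recheck interval plus ten heartbeat intervals) is the one commonly documented for HDFS, with defaults of 3 s for `dfs.heartbeat.interval` and 300000 ms for `dfs.namenode.heartbeat.recheck-interval`; treat it as an assumption for other versions or systems.

```python
from typing import Dict, List

MERGED_RACK = "/logical-rack-merged"  # hypothetical name for the merged logical rack


def merge_small_racks(rack_to_nodes: Dict[str, List[str]], threshold: int) -> Dict[str, str]:
    """Map every server to a logical rack.

    Racks holding fewer than `threshold` servers are merged into one logical
    rack; all other racks keep their physical name.
    """
    node_to_logical_rack: Dict[str, str] = {}
    for rack, nodes in rack_to_nodes.items():
        logical = MERGED_RACK if len(nodes) < threshold else rack
        for node in nodes:
            node_to_logical_rack[node] = logical
    return node_to_logical_rack


def dead_node_timeout_s(heartbeat_interval_s: float, recheck_interval_ms: int) -> float:
    """Seconds before a silent data node is treated as dead.

    Assumed formula: 2 * recheck interval + 10 * heartbeat interval.
    """
    return 2 * recheck_interval_ms / 1000 + 10 * heartbeat_interval_s


# Hypothetical deployment: one large rack and two single-server racks.
racks = {"/r1": ["a", "b", "c"], "/r2": ["d"], "/r3": ["e"]}
logical = merge_small_racks(racks, threshold=2)

default_timeout = dead_node_timeout_s(3, 300000)  # documented HDFS defaults
reduced_timeout = dead_node_timeout_s(1, 300000)  # shorter heartbeat interval
```

In an HDFS deployment, a mapping like `logical` would typically be served by the topology script named in `net.topology.script.file.name`. For claim 4, hedged reads remain off as long as `dfs.client.hedged.read.threadpool.size` stays at its default of 0.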