CN113192558A - Reading and writing method for third-generation gene sequencing data and distributed file system

Info

Publication number: CN113192558A
Application number: CN202110578909.5A
Authority: CN
Inventors: 宁建峰; 宁建强; 戈素梅; 李宁宁; 刘政委
Original assignee: Beijing Lexun Technology Co ltd; Beijing Free Cat Technology Co ltd
Current assignee: Beijing Lexun Technology Co ltd; Beijing Free Cat Technology Co ltd
Priority date: 2021-05-26
Filing date: 2021-05-26
Publication date: 2021-07-30

Abstract

The application provides a reading and writing method for third-generation gene sequencing data and a distributed file system. The method comprises the following steps: dividing a hard disk storage space into a plurality of data storage pools; and, after receiving a data write request, storing the write data into the data storage pools in sequence, switching to the next data storage pool once the current one is fully written. The scheme exploits two characteristics of gene sequencing application data: the data is written once and thereafter only read, never modified; and data writing is performed asynchronously, so its real-time requirement is not high.

Description

Reading and writing method for third-generation gene sequencing data and distributed file system

Technical Field

The application relates to the technical field of information processing, in particular to a reading and writing method for third-generation gene sequencing data and a distributed file system.

Background

Gene sequencing is a typical application of high-performance computing, and third-generation gene sequencing is gradually becoming the mainstream sequencing technology. A gene sequencing system is a standard high-performance computing cluster whose architecture is shown in FIG. 1. The whole system comprises a computing node cluster and a distributed file system: the computing node cluster comprises n computing nodes j1, j2, …, jn; the distributed file system comprises i storage servers f1, f2, …, fi; and the computing nodes and the storage servers are connected over a network through m switches h1, …, hm.

A third-generation gene sequencing system places several main requirements on a distributed file system. First, third-generation gene sequencing performs a large number of random data extractions from gene files while it runs, so the distributed file system must offer low random read latency. Second, a large amount of new data is written while sequencing is running, so the distributed file system must provide high write bandwidth while still keeping random read latency low. Third, all computing nodes work in parallel during sequencing, so the distributed file system must provide consistent performance to every computing node; it is not acceptable for some nodes to be fast while others are slow.

Current distributed file systems are essentially positioned as general-purpose file systems and generally manage the data on each hard disk through a local file system. This causes several problems for third-generation gene sequencing. First, the local file system is built on a mechanical hard disk, and because accessing its metadata requires several hard disk accesses, random read performance is low. Second, under mixed reading and writing, written data is first placed in a cache, accumulated to a certain amount and then flushed to the hard disk in bulk, so a large volume of writes in a short time strongly affects reads from the hard disk and makes read latency uncontrollable. Third, when there are many computing nodes, their combined capability far exceeds that of the storage cluster. The storage service applies a simple flow control mechanism that directly rejects the requests it cannot accept, and the computing node retries until the request succeeds; because a retry may be rejected again, access latency becomes uncontrollable.

To this end, improvements to existing distributed file systems are needed.

Disclosure of Invention

The embodiments of the application aim to provide a reading and writing method for third-generation gene sequencing data and a distributed file system, so as to solve the problem in the prior art of low efficiency when reading and writing third-generation gene sequencing data.

In order to achieve the above objects, some embodiments of the present application provide a method for reading and writing third generation gene sequencing data, comprising the steps of:

dividing a hard disk storage space into a plurality of data storage pools;

and after receiving a data write request, sequentially storing write data into the data storage pools, wherein one data storage pool is switched to the next data storage pool after being fully written.

In some embodiments of the present application, the method for reading and writing third generation gene sequencing data further comprises the following steps:

and after receiving a data reading request, directly reading the requested data if the data storage pool in which the requested data is located is in a full write state.

after receiving a data repair request, determining an updated hard disk storage space;

and storing the written repair data into the updated hard disk storage space.

after a system mounting signal is detected, scanning data retrieval information in a hard disk storage space, wherein the data retrieval information comprises a data storage directory entry and a data index node; caching the data retrieval information into a system memory;

and after receiving a random data reading request, determining a target directory and a target index node where the requested data is located according to the data retrieval information in the system memory.

setting a flow window;

and after receiving a data writing request or a data reading request, adjusting the data writing flow or the data reading flow according to the flow window.

In some embodiments of the present application, the step of adjusting the data write flow or the data read flow according to the flow window includes:

acquiring the delay time of the data writing request or the data reading request;

if the delay time length is less than the expected time length, increasing the flow window according to a set proportion; and if the delay time length is greater than the expected time length, reducing the flow window according to a set proportion.

Based on the same inventive concept, some embodiments of the present application further provide a storage server for third generation gene sequencing data, comprising:

at least one hard disk;

the data management module is used for dividing the hard disk storage space of the hard disk into a plurality of data storage pools; and after receiving a data write request, sequentially storing write data into the data storage pools, wherein one data storage pool is switched to the next data storage pool after being fully written.

The storage server for third generation gene sequencing data in some embodiments of the present application, further comprising:

the data management module is further configured to, after receiving a data read request, directly read the requested data if the data storage pool in which the requested data is located is in a fully written state; and/or,

after receiving a data repair request, determine an updated hard disk storage space, and store the written repair data into the updated hard disk storage space; and/or,

a QoS management module, configured to set a flow window and, after receiving a data write request or a data read request, adjust the data write flow or the data read flow according to the flow window, specifically by:

acquiring the delay duration of the data write request or the data read request; if the delay duration is less than the expected duration, increasing the flow window by a set proportion; and if the delay duration is greater than the expected duration, reducing the flow window by a set proportion.

Some embodiments of the present application further provide a distributed file system, comprising a plurality of storage servers for third generation gene sequencing data as described in any of the above aspects, further comprising:

and the global data distribution manager is used for performing differentiated processing of writing operation and reading operation on all hard disk storage spaces in the plurality of storage servers.

Compared with the prior art, the technical scheme provided by the application has at least the following beneficial effects. The hard disk storage space is divided into a plurality of data storage pools, and after a data write request is received the write data is stored into the data storage pools in sequence, switching to the next data storage pool once the current one is fully written. This changes the traditional data writing mode: once a data storage pool is fully written it receives no further writes, so any subsequent read from a fully written data storage pool is unaffected by data write operations. The scheme is designed around the characteristics of gene sequencing application data, namely that the data is written once and thereafter only read, never modified, and that data writing is performed asynchronously with a low real-time requirement. On this basis, the scheme in the embodiments of the application can at least guarantee the read performance of gene sequencing application data.

Drawings

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.

FIG. 1 is a diagram of a system architecture of a computer cluster used in a prior art gene sequencing system;

FIG. 2 is a flow chart of a method for reading and writing third generation gene sequencing data according to an embodiment of the present application;

FIG. 3 is a flow chart of a method for reading and writing third generation gene sequencing data according to another embodiment of the present application;

FIG. 4 is a flow chart of a method for reading and writing third generation gene sequencing data according to yet another embodiment of the present application;

FIG. 5 is a block diagram of a storage server for third generation gene sequencing data according to one embodiment of the present application;

FIG. 6 is a block diagram of a storage server for third generation gene sequencing data according to another embodiment of the present application;

FIG. 7 is a block diagram of a distributed file system according to an embodiment of the present application.

Detailed Description

In this section, reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.

Some embodiments of the present application provide a method for reading and writing third generation gene sequencing data, which can be used in a computer system for performing the reading and writing operations of the third generation gene sequencing data, as shown in fig. 2, and can include the following steps:

S101: The hard disk storage space is divided into a plurality of data storage pools. In this step, the storage space of each hard disk may be divided according to a set space size, so that all hard disk storage space is divided into a plurality of data storage pools; the space sizes of different data storage pools may be the same or different.

S102: After a data write request is received, the write data is stored into the data storage pools in sequence, and writing switches to the next data storage pool once the current one is fully written. The data storage pools may be numbered according to a set rule, so that the system can determine the number of each data storage pool. When writing data, the system can obtain the size of the write data in real time and determine the space size of each data storage pool. For example, when the write data is 50 GB, space can be allocated for it in advance according to the space sizes of the data storage pools: the write data is first written into the data storage pool numbered "10", which has 30 GB of space, and once that pool is full the remainder is written into the data storage pool numbered "11", which has 20 GB of space. In some application scenarios, the data storage pools may instead be selected based on the size of the write data and the storage space of each data storage pool. The order in this step may therefore follow a predetermined numbering order, or may result from a selection made according to the size of the write data.
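The sequential filling described in S102 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `StoragePool` and `PoolWriter` names are hypothetical, sizes are plain integers standing for GB, and the pools are assumed to be filled in numbering order.

```python
from dataclasses import dataclass


@dataclass
class StoragePool:
    """One fixed-size region of hard disk space (hypothetical model)."""
    number: int
    capacity: int      # pool size, in GB for this sketch
    used: int = 0

    @property
    def full(self) -> bool:
        return self.used >= self.capacity


class PoolWriter:
    """Fills pools strictly in numbering order, one pool at a time."""

    def __init__(self, pools):
        self.pools = sorted(pools, key=lambda p: p.number)
        self.current = 0   # index of the only pool that accepts writes

    def write(self, size):
        """Store `size` GB and return a list of (pool number, GB placed)."""
        placements = []
        while size > 0:
            if self.current >= len(self.pools):
                raise RuntimeError("all data storage pools are full")
            pool = self.pools[self.current]
            chunk = min(size, pool.capacity - pool.used)
            if chunk > 0:
                pool.used += chunk
                size -= chunk
                placements.append((pool.number, chunk))
            if pool.full:
                self.current += 1   # switch to the next pool once this one is full
        return placements


# The example from the text: 50 GB lands in pool 10 (30 GB) and then pool 11 (20 GB).
writer = PoolWriter([StoragePool(10, 30), StoragePool(11, 20)])
print(writer.write(50))
```

Because only the pool at index `current` ever receives writes, every pool before it is fully written and can be served to readers without interference from writes.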

The scheme provided by this embodiment changes the conventional data writing mode. Some of the data storage pools are in a fully written state; once a data storage pool is fully written, no further write operation is performed on it, so any subsequent read from that pool is completely unaffected by data write operations. The scheme is designed around the characteristics of gene sequencing application data, namely that the data is written once and thereafter only read, never modified, and that data writing is performed asynchronously with a low real-time requirement. On this basis, the scheme in the embodiments of the application can at least guarantee the read performance of gene sequencing application data.

The reading and writing method for third generation gene sequencing data provided in some embodiments of the present application, as shown in fig. 3, may further include the following steps:

S103: After a data repair request is received, the updated hard disk storage space is determined. When a hard disk fails, the failed hard disk is replaced with a new (updated) hard disk, which provides the updated hard disk storage space; the data of the failed hard disk then needs to be restored.

S104: The written repair data is stored into the updated hard disk storage space.

This scheme departs from the common data repair strategy of distributed file systems: the repaired data is written only to the newly replaced hard disk. Repair traffic therefore does not interfere with the other hard disks, and repair writes after a hard disk failure do not affect data reads.
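The difference between the two repair strategies can be illustrated with a small sketch. The function names, block identifiers and disk names are hypothetical, and the round-robin baseline is only an assumed stand-in for a conventional strategy; the text does not specify how existing systems place repaired data.

```python
def conventional_targets(blocks, surviving_disks):
    """Assumed baseline: scatter rebuilt blocks round-robin over surviving disks."""
    return {block: surviving_disks[i % len(surviving_disks)]
            for i, block in enumerate(blocks)}


def replacement_only_targets(blocks, replacement_disk):
    """Scheme in the text: every rebuilt block goes to the replacement disk."""
    return {block: replacement_disk for block in blocks}


def disks_written(targets):
    """Set of disks that receive repair writes under a given plan."""
    return set(targets.values())


lost = ["blk-%d" % i for i in range(6)]          # blocks that were on the failed disk
baseline = conventional_targets(lost, ["disk1", "disk2", "disk3"])
scheme = replacement_only_targets(lost, "disk0-new")

print(sorted(disks_written(baseline)))   # disks whose reads compete with repair writes
print(sorted(disks_written(scheme)))     # only the replacement disk is written
```

Under the replacement-only plan the disks that are still serving reads see no repair writes at all, which is the property the embodiment relies on.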

The reading and writing method for third generation gene sequencing data provided in some embodiments of the present application, as shown in fig. 4, may further include the following steps:

S105: After a data read request is received, the requested data is read directly if the data storage pool in which it is located is in a fully written state. As mentioned above, a read operation can be performed directly on a fully written data storage pool: owing to the characteristics of third-generation gene sequencing data, no data in a fully written pool is ever changed, so no write operation can affect the read.

Further, the method may further include:

S106: After a system mount signal is detected, the data retrieval information in the hard disk storage space is scanned, wherein the data retrieval information comprises data storage directory entries and data index nodes; the data retrieval information is then cached in system memory.

S107: After a random data read request is received, the target directory and target index node where the requested data is located are determined from the data retrieval information in system memory.

For a random read of a large file, the metadata path of a local file system such as EXT4 (Fourth Extended File System) currently needs to access the hard disk at least 3 times: finding the corresponding directory entry in the parent directory, reading the file's inode (index node) information, and reading the file's layout extent (contiguous storage space) information. On a mechanical hard disk each random access takes about 7 ms, so metadata access alone costs about 21 ms per random read, and existing random read performance can hardly meet the requirements of third-generation gene sequencing applications. To accelerate EXT4 metadata access, this scheme improves the kernel VFS (Virtual File System) cache of the Linux system. When an EXT4 file system is mounted (mounting is the process by which an operating system makes the files and directories on a storage device such as a hard disk, a CD-ROM or a shared resource accessible to users through the computer's file system), a background scanning service is started. It scans the files on the EXT4 file system and applies a special mark to the corresponding directory entries and inode information so that the kernel's reclaim mechanism skips them; the data retrieval information comprising the directory entries and inode information therefore stays cached in memory when random reads are performed. Because third-generation gene sequencing data consists of large files, the directory entries and inode information occupy little memory, and they can still be forcibly reclaimed when memory must be actively reclaimed. By adding this caching mechanism to the operating system, the scheme keeps the directory entries and index node information of the EXT4 metadata cached for a long time and thereby improves metadata access performance.
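The effect of keeping directory entries and inode information in memory can be illustrated with a user-space toy model. It is only an analogy: the scheme in the text modifies the Linux kernel VFS cache, whereas the `ToyDisk` and `PinnedMetadataCache` classes and the file path below are hypothetical. Since the text names only directory entries and inode information as cached, the sketch assumes the extent lookup still costs one disk access.

```python
SEEK_MS = 7  # approximate cost of one random access on a mechanical disk (from the text)


class ToyDisk:
    """Hypothetical stand-in for on-disk metadata; counts random accesses."""

    def __init__(self, files):
        self._files = files          # path -> (inode number, list of extents)
        self.accesses = 0

    def read_dirent(self, path):
        self.accesses += 1
        return self._files[path][0]

    def read_inode(self, path):
        self.accesses += 1
        return {"ino": self._files[path][0]}

    def read_extents(self, path):
        self.accesses += 1
        return self._files[path][1]

    def paths(self):
        return list(self._files)


class PinnedMetadataCache:
    """Keeps directory entries and inode info in memory after a mount-time scan."""

    def __init__(self, disk):
        self.disk = disk
        self.pinned = {}             # path -> (directory entry, inode info)

    def scan_on_mount(self):
        for path in self.disk.paths():
            self.pinned[path] = (self.disk.read_dirent(path),
                                 self.disk.read_inode(path))

    def lookup(self, path):
        """Return (inode info, extents, disk accesses used by this lookup)."""
        before = self.disk.accesses
        if path in self.pinned:
            _, inode = self.pinned[path]          # served from memory
        else:
            self.disk.read_dirent(path)
            inode = self.disk.read_inode(path)
        extents = self.disk.read_extents(path)    # assumed not to be cached
        return inode, extents, self.disk.accesses - before


disk = ToyDisk({"/data/sample.fastq": (12, [(0, 4096)])})
cache = PinnedMetadataCache(disk)
_, _, cold = cache.lookup("/data/sample.fastq")   # before the scan: 3 accesses
cache.scan_on_mount()
_, _, warm = cache.lookup("/data/sample.fastq")   # after the scan: 1 access
print(cold * SEEK_MS, warm * SEEK_MS)             # metadata latency in ms
```

In this model the metadata cost of a random read drops from three disk accesses to one once the scan has run, at the price of a one-time scan when the file system is mounted.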

The reading and writing method for third generation gene sequencing data provided in some embodiments of the present application may further include the following steps:

S201: A flow window is set. The flow window may be established using a mechanism similar to the TCP sliding window.

S202: After a data write request or a data read request is received, the data write flow or the data read flow is adjusted according to the flow window.

The scheme in this embodiment applies flow control to data writing and data reading through a mechanism similar to the TCP sliding window, which avoids large bursts of requests and retransmissions and thus avoids wasting network bandwidth and storage resources.

In addition, in order to optimize the size of the flow window and achieve the most effective flow control, in some embodiments, the step of adjusting the data write flow or the data read flow according to the flow window in step S202 includes:

S2021: The delay duration of the data write request or the data read request is acquired.

S2022: If the delay duration is less than the expected duration, the flow window is increased by a set proportion; if the delay duration is greater than the expected duration, the flow window is reduced by a set proportion.

The expected duration is the duration that meets the read-write efficiency requirement and can be set from empirical values. In a specific implementation, a flow window of an initial size may first be applied for from the system control center in an initial stage, and this window is used to control the number of requests sent to each node and the number of requests received from other nodes. If the delay duration of each request is within the expected duration, the flow window may be expanded by a set proportion, for example at a growth rate of 10%. If a node that sends or receives requests runs short of resources or becomes congested, the delay duration may exceed the expected duration, and the expansion of the flow window then stops. If a node that receives or sends requests continues to report resource shortage or congestion, the flow window for that node is reduced proportionally, for example at a reduction rate of 10%, to relieve the pressure on that node until it no longer reports resource shortage or congestion. In this embodiment the data transmission rate is thus controlled by a flow window whose size is regulated automatically, so that nodes are not overloaded and request retransmission is avoided, which improves data read-write performance.
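The window adjustment of steps S2021 and S2022 can be sketched as follows. The `FlowWindow` class, its parameter names and the sample delays are illustrative assumptions. The 10% proportion comes from the example in the text, and the rule implemented is the simple one from S2022 (grow when the delay is below the expected duration, shrink when it is above).

```python
class FlowWindow:
    """Adaptive request window, loosely modelled on a TCP sliding window."""

    def __init__(self, initial_size, expected_ms, proportion=0.10, minimum=1.0):
        self.size = float(initial_size)   # max outstanding requests (may be fractional)
        self.expected_ms = expected_ms    # expected delay, set from experience
        self.proportion = proportion      # the "set proportion" (10% here)
        self.minimum = minimum            # never shrink below one request

    def record_delay(self, delay_ms):
        """Adjust the window after one completed request (S2021 and S2022)."""
        if delay_ms < self.expected_ms:
            self.size *= 1 + self.proportion
        elif delay_ms > self.expected_ms:
            self.size = max(self.minimum, self.size * (1 - self.proportion))
        return self.size

    def may_send(self, in_flight):
        """True if another request fits inside the current window."""
        return in_flight < int(self.size)


window = FlowWindow(initial_size=100, expected_ms=20)
for delay in (12, 15, 35):          # two fast requests, then one slow one
    window.record_delay(delay)
print(round(window.size, 2))
```

A sender would call `may_send` with its current number of outstanding requests before issuing a new one. Holding requests back locally in this way, rather than sending them and having them rejected, is what avoids the retry storms described in the Background.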

In some embodiments, a storage server 10 for third-generation gene sequencing data is provided. As shown in FIG. 5, it comprises a data management module 101 and at least one hard disk 102. The data management module 101 divides the hard disk storage space of the hard disk into a plurality of data storage pools and, after receiving a data write request, stores the write data into the data storage pools in sequence, switching to the next data storage pool once the current one is fully written. This design relies on the characteristics of gene sequencing application data: the data is written once and thereafter only read, never modified, and data writing is performed asynchronously with a low real-time requirement.

In some embodiments of the present application, the data management module 101 is further configured to, after receiving a data read request, directly read the requested data if the data storage pool in which the requested data is located is in a fully written state. Because gene sequencing application data is written once and never modified afterwards, a fully written data storage pool receives no further writes, so reads from it are not disturbed by write operations.

In some embodiments of the present application, the data management module 101 is further configured to determine an updated hard disk storage space after receiving a data repair request, and to store the written repair data into the updated hard disk storage space. This departs from the common data repair strategy of distributed file systems: the repaired data is written only to the newly replaced hard disk, so repair traffic does not interfere with the other hard disks, and repair writes after a hard disk failure do not affect data reads.

In some embodiments of the present application, after detecting a system mount signal, the data management module 101 scans the data retrieval information in the hard disk storage space, where the data retrieval information includes data storage directory entries and data index nodes, and caches the data retrieval information in system memory. After receiving a random data read request, it determines the target directory and target index node where the requested data is located from the data retrieval information in system memory. By adding this caching mechanism to the operating system, the scheme keeps the directory entries and index node information of the EXT4 metadata cached for a long time and thereby improves metadata access performance.

As shown in FIG. 6, the storage server for third-generation gene sequencing data in some embodiments of the present application further includes a QoS management module 103, configured to set a flow window and, after receiving a data write request or a data read request, to adjust the data write flow or the data read flow according to the flow window. By applying flow control to data writing and data reading through a mechanism similar to the TCP sliding window, it avoids large bursts of requests and retransmissions and thus avoids wasting network bandwidth and storage resources.

The QoS management module 103 in some embodiments of the present application is further configured to acquire the delay duration of the data write request or the data read request; if the delay duration is less than the expected duration, to increase the flow window by a set proportion; and if the delay duration is greater than the expected duration, to reduce the flow window by a set proportion. This optimizes the size of the flow window and makes the flow control as effective as possible.

In some embodiments, a distributed file system is further provided, whose architecture is shown in FIG. 7. It includes a plurality of storage servers 10 for third-generation gene sequencing data as described in any of the above, and further includes a global data distribution manager 20, which performs differentiated processing of write operations and read operations across all hard disk storage spaces in the plurality of storage servers 10. In this system, the global data distribution manager 20 may separate the distribution of read data and write data across the hard disks of different storage servers 10, and the data management module 101 in each storage server 10 may also divide the hard disk storage space in that server into data storage pools, ensuring that writing to the next data storage pool begins only after the current one is fully written. Furthermore, the QoS management module 103 in each storage server 10 controls the write and read traffic of that server, which prevents one storage server 10 from blocking other storage servers 10 by receiving or issuing too many requests, and prevents large numbers of requests from being rejected because a storage server 10 lacks sufficient service capacity. In each storage server 10, the reading of EXT4 metadata is accelerated by VFS caching. With these improvements, the system provided by this embodiment is better suited to the data read and write operations of third-generation gene sequencing applications.
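One way to picture the read/write separation performed by the global data distribution manager is sketched below. The text does not describe the manager's internal algorithm, so everything here is an assumption made for illustration: the `GlobalDistributionManager` class, the disk identifiers, and the policy of designating a single disk as the write target while all earlier disks serve reads only.

```python
class GlobalDistributionManager:
    """Hypothetical manager: one active write disk, earlier disks are read-only."""

    def __init__(self, disks):
        # disks: list of (disk id, capacity) in the order they will be filled
        self.capacity = dict(disks)
        self.order = [disk_id for disk_id, _ in disks]
        self.used = {disk_id: 0 for disk_id in self.order}
        self.active = 0          # index of the disk currently taking writes
        self.location = {}       # object name -> disk id

    def route_write(self, name, size):
        """Place an object on the active write disk, advancing when it is full."""
        while self.active < len(self.order):
            disk = self.order[self.active]
            if self.used[disk] + size <= self.capacity[disk]:
                self.used[disk] += size
                self.location[name] = disk
                return disk
            self.active += 1     # treat this disk as full; it now serves reads only
        raise RuntimeError("no disk can accept the write")

    def route_read(self, name):
        """Reads go to wherever the object was written."""
        return self.location[name]

    def read_only_disks(self):
        return self.order[:self.active]


manager = GlobalDistributionManager([("f1/disk0", 100), ("f2/disk0", 100)])
print(manager.route_write("sample-a", 60))   # lands on the first disk
print(manager.route_write("sample-b", 60))   # does not fit, so the second disk takes over
print(manager.read_only_disks(), manager.route_read("sample-a"))
```

In this sketch a read of `sample-a` is served by a disk that no longer receives writes, mirroring at cluster level what the data storage pools achieve within a single server.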

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims

1. A reading and writing method for third generation gene sequencing data is characterized by comprising the following steps:

dividing a hard disk storage space into a plurality of data storage pools;

2. The method of claim 1, further comprising the steps of:

3. The method of claim 1, further comprising the steps of:

and storing the written repair data into the updated hard disk storage space.

4. A method of reading and writing third generation gene sequencing data according to any one of claims 1 to 3, further comprising the steps of:

5. The method of claim 4, further comprising the steps of:

setting a flow window;

6. The method of claim 5, wherein the step of adjusting the data write flow or the data read flow according to the flow window comprises:

7. A storage server for third generation gene sequencing data, comprising:

at least one hard disk;

8. The storage server for third generation gene sequencing data of claim 7, wherein:

9. The storage server for third generation gene sequencing data of claim 8, further comprising:

10. A distributed file system comprising a plurality of storage servers for third generation genetic sequencing data as claimed in any of claims 7 to 9, further comprising: