CN108268208B - RDMA (Remote Direct Memory Access)-based distributed memory file system - Google Patents


Info

Publication number
CN108268208B
CN108268208B (application CN201611261722.8A)
Authority
CN
China
Prior art keywords
metadata
file
client
memory
rdma
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611261722.8A
Other languages
Chinese (zh)
Other versions
CN108268208A (en)
Inventor
陆游游
舒继武
陈游旻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201611261722.8A
Publication of CN108268208A
Application granted
Publication of CN108268208B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • G06F3/0611Improving I/O performance in relation to response time
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638Organizing or formatting or addressing of data
    • G06F3/064Management of blocks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an RDMA-based distributed memory file system. In the initialization stage, the memory that the cluster devotes to file storage is uniformly partitioned and registered with the network card so that remote nodes can access it directly, thereby constructing a distributed shared memory pool. On this pool, files and file data blocks are indexed through two levels of hash indexing, which provides the query service of the file system; client requests are processed through a self-identifying remote procedure call method and the processing results are returned. The invention has the following advantages: data copying during file reads and writes is reduced, response latency is lowered, and the overall efficiency of file access is improved at the software level.

Description

RDMA (Remote Direct Memory Access)-based distributed memory file system
Technical Field
The invention relates to the field of distributed storage systems, in particular to a RDMA-based distributed memory file system.
Background
Remote Direct Memory Access (RDMA) allows one machine to access another machine's memory directly, without the direct participation of the host operating systems on either side, thereby providing high bandwidth and low latency.
Data transmission largely determines the overall I/O performance of a distributed system, so RDMA is widely applicable to distributed file systems and database systems. Most traditional distributed systems use magnetic disks as the storage medium and transfer data through a remote procedure call module built on TCP/IP; because the disk has low bandwidth and high latency, the network transmission module does not itself become the bottleneck.
At present there is no general data transmission module that efficiently handles different network I/O characteristics, and concurrency control still adopts a centralized synchronization model, which severely limits the scalability of the system.
Disclosure of Invention
The present invention is directed to solving at least one of the above problems.
Therefore, the invention aims to provide an RDMA-based distributed memory file system that reduces data copying during file reads and writes, lowers response latency, and improves the overall efficiency of file access at the software level.
In order to achieve the above object, an embodiment of the present invention discloses an RDMA-based distributed memory file system, in which the memories of the nodes are interconnected through an RDMA network. The file system includes a client and a server: the client provides a file access interface for upper-layer applications to call, and the server provides the metadata service and the data service. The distributed memory file system performs the following operations. S1: in the initialization stage of the distributed memory file system, the cluster memory used for file storage is uniformly partitioned and registered with the network card so that remote nodes can access it directly, thereby constructing a distributed shared memory pool. S2: on the distributed shared memory pool, files and file data blocks are indexed through two-level hash indexing, providing the query service for the file system. S3: client requests are processed through a self-identifying remote procedure call method, and the processing results are returned.
According to the RDMA-based distributed memory file system of the embodiment of the invention, data copying during file reads and writes is reduced, response latency is lowered, and the overall efficiency of file access is improved at the software level.
In addition, the RDMA-based distributed memory file system according to the above embodiment of the present invention may further have the following additional technical features:
further, the distributed shared memory pool sequentially stores the super block, the message pool, the chained hash index table, the metadata storage block and the data storage block.
Further, the data layout area accepts direct access from remote nodes; in the chained hash index table and the metadata storage area, the service node responds to concurrent client requests and performs queries and updates on the metadata; the metadata is hashed across the whole cluster by file path name, and each node independently maintains the metadata and data of its files.
Furthermore, the super block stores the number and size of the metadata blocks and the number and size of the data blocks, and is read remotely by each node when the file system starts.
Further, the message pool comprises a plurality of message areas allocated to the different clients connected to the system, so that when a client has a new request, it remotely writes the request into its message area on the service node; after the server-side receiving process detects it, the message is quickly located through the self-identification method, processed, and the result returned.
Further, the chained hash index table is used as follows: when querying the metadata of a file, the hash value of the file's full path name is computed and used as the index into the table; the entries under that index are scanned and the file name matched; if the match succeeds, the metadata address is obtained and the metadata is accessed at that address; if the file names do not match, the next entry is examined until the match succeeds.
Further, the metadata storage block and the data storage block store metadata and data respectively; the metadata area and the data area are divided into fixed-size metadata blocks and data blocks, and the header of each area stores the free-block information describing the memory usage of that area.
Further, the two-level hash indexing comprises: when a client initiates a file access request, it computes, from the file's full path name, the ID of the metadata server that stores the file's metadata, where the mapping is determined by the system configuration file; the client sends the request to the metadata server corresponding to that ID, and after detecting the new message the metadata server parses the request content, computes the second hash value from the file path name, accesses the chained hash index table according to that value, obtains the metadata, performs the corresponding logic processing, and returns the result.
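The two-level lookup just described can be sketched in Python. The hash function and all parameters here are illustrative stand-ins; the patent does not specify which hash function is used:

```python
def fnv1a(s: str) -> int:
    # FNV-1a, a simple stand-in for the unspecified hash function.
    h = 2166136261
    for b in s.encode():
        h = ((h ^ b) * 16777619) & 0xFFFFFFFF
    return h

def locate(path: str, num_servers: int, table_size: int):
    """First-level hash picks the metadata server for the full path name;
    the second hash (computed server-side in the patent) picks the bucket
    in that server's chained hash index table."""
    server_id = fnv1a(path) % num_servers
    bucket = fnv1a(path) % table_size
    return server_id, bucket
```

The same path always maps to the same server and bucket, which is what lets every client route a request without consulting a central directory.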
Further, the self-identifying remote procedure call method comprises the following steps: when the client sends a message to the server, the RDMA_WRITE_WITH_IMM primitive carries the message content, with the client metadata stored at the head of the message; when the server returns the request result, it writes the result directly back to the memory area designated by the client through an RDMA primitive, and the client polls the memory area reserved for the returned result until the data has been successfully returned.
Further, the client stores its own ID and a timestamp in the message header; the client ID is assigned by the server master node when the connection is established and is globally unique.
Further, after an RDMA_WRITE_WITH_IMM message is successfully delivered, the server obtains the client metadata from the message header according to the completion information, parses out the client ID, and directly queries the fixed offset in the local message pool given by that ID to obtain the new request.
Further, when the number of clients exceeds the number of message areas the server allocated in advance, the server looks for disconnected clients and transfers the message areas they occupied to the current client; if all clients remain connected, a new message area must be allocated, registered with the network card, and announced to the current client.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flowchart illustrating the operation of an RDMA-based distributed memory file system according to an embodiment of the present invention;
FIG. 2 is a diagram of an RDMA data transfer of one embodiment of the invention;
FIG. 3 is a layout diagram of a service node shared memory according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating statistics of copy number during data transmission and reception according to an embodiment of the present invention;
FIG. 5 is a diagram of a self-identifying remote procedure call, in accordance with an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
These and other aspects of embodiments of the invention will be apparent with reference to the following description and attached drawings. In the description and drawings, particular embodiments of the invention have been disclosed in detail as being indicative of some of the ways in which the principles of the embodiments of the invention may be practiced, but it is understood that the scope of the embodiments of the invention is not limited correspondingly. On the contrary, the embodiments of the invention include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.
The invention is described below with reference to the accompanying drawings. Before the embodiments are described, the relevant terminology is explained.
Direct Memory Access (DMA) allows certain hardware devices to read and write memory independently and directly, with little CPU involvement.
Remote Direct Memory Access (RDMA) is a network communication technology that accesses remote memory directly, without the direct participation of the operating system on either side, achieving high throughput and low latency. RDMA realizes zero-copy data transmission by having the network adapter write data directly into the peer's memory, eliminating the involvement of the CPU and cache and avoiding redundant context switches. The network protocol stacks supporting RDMA currently include InfiniBand, RoCE (RDMA over Converged Ethernet), and iWARP, all supported by Mellanox hardware; RoCE and iWARP in particular are fully compatible with Ethernet because they use the data link layer of ordinary Ethernet. Fig. 2 shows the specific flow of RDMA communication: the local CPU first issues a communication command to the network card via MMIO; after the local network card detects the new command, it reads the data to be transmitted from memory via DMA, packages it, and transmits it over the RDMA network; after the peer network card receives the data, it writes the data directly into the corresponding memory address area via DMA and writes the corresponding completion information into the completion queue. The peer CPU is not involved in the whole process, the kernels on both sides are bypassed, and zero-copy data transmission is achieved.
Before communication is established, the two parties must go through the following steps: opening the network card device; creating a protection domain, which is bound to the objects created later to guarantee the safety of data transmission, since any cross-domain operation causes a communication error; registering memory, in which the memory used for communication is registered by building a mapping between the user-space addresses and the physical memory addresses of the segment, storing the mapping table in the network card cache, and generating a key pair (lkey and rkey) for the memory segment, where the network card must present the corresponding key for identity confirmation when accessing the memory locally or remotely; creating a CQ (Completion Queue), into which the corresponding completion information is placed after a sender successfully sends a message or a receiver successfully receives one, and which the user repeatedly polls to verify whether the message completed; creating a QP (Queue Pair), which is the RDMA analogue of a TCP/IP socket and consists of a Send Queue and a Receive Queue, where the sender places messages to be sent in the send queue, the receiver places receive requests in the receive queue, and the two sides communicate in this way; and initializing the QP state, in which, after both sides create QPs in one-to-one correspondence, a series of handshake state transitions is performed until the communication link is successfully established.
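The memory-registration step and its lkey/rkey keys can be illustrated with a toy model. The class and method names below are invented for illustration and do not correspond to the real ibverbs API; the point is only that a remote access succeeds when and only when it presents the rkey of a registered region covering the target address:

```python
import secrets

class RegisteredRegion:
    """A registered memory segment with its local/remote access keys."""
    def __init__(self, addr: int, length: int):
        self.addr, self.length = addr, length
        self.lkey = secrets.randbits(32)   # key for local access
        self.rkey = secrets.randbits(32)   # key the remote side must present

class Nic:
    """Toy NIC: caches registered regions and rejects any remote access
    whose rkey does not match the region covering the target address."""
    def __init__(self):
        self.regions = []

    def register(self, addr: int, length: int) -> RegisteredRegion:
        mr = RegisteredRegion(addr, length)
        self.regions.append(mr)
        return mr

    def remote_access(self, addr: int, rkey: int) -> bool:
        for mr in self.regions:
            if mr.addr <= addr < mr.addr + mr.length:
                return mr.rkey == rkey
        return False  # address not registered at all
```

Unregistered addresses and wrong keys both fail, which is the "identity confirmation" role the key pair plays in the step list above.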
A QP can be established with different connection types: RC (Reliable Connection), UC (Unreliable Connection), and UD (Unreliable Datagram). In RC mode a QP performs one-to-one reliable transmission, and acknowledgement feedback is generated after a data packet is successfully delivered; in UC mode a QP performs one-to-one transmission without acknowledgement feedback; in UD mode there is neither a one-to-one relationship nor acknowledgement feedback. The three transmission modes have different characteristics and support the communication primitives to different degrees.
Memory computing refers to a processing mode in which, facing the demands of massive data and real-time processing that a conventional disk-based storage system can hardly meet because of its slow access speed, the data is moved into memory for real-time processing. Memory storage systems mainly fall into two types: memory database systems and memory file systems. The invention rebuilds the memory file system around RDMA network communication. The mainstream memory file systems currently include Alluxio, IGFS, and others. Alluxio is mainly used to address existing problems of the Spark computing framework, accelerating data processing and achieving single-copy storage and reliable recovery of data through lineage. IGFS is a cache file system between the computing framework and HDFS, providing an HDFS-compatible interface to the upper layer; unlike HDFS, however, IGFS has no separate metadata server and instead distributes data by hashing.
Remote Procedure Call (RPC) is a remote communication protocol that enables a program running on one computer to call a function on another computer without the user having to care about the underlying communication details. RPC is widely applied in distributed systems and adopts a client-server model in which the call is always initiated by the client: the client packages and sends the called function's serial number, its parameters, and other information to the server; the server receives and executes the request and, after finishing, returns the execution result to the client.
Fig. 1 is a flowchart of the operations performed by an RDMA-based distributed memory file system according to an embodiment of the present invention. As shown in Fig. 1, the memories of the nodes are first interconnected via RDMA; the distributed memory file system includes a client and a server, where the client provides a file access interface for upper-layer applications to call and the server provides the metadata service and the data service, and the system executes the following actions:
s1: in the initialization stage of the distributed memory file system, the memory of the cluster for file storage is uniformly divided and registered to the network card to support the remote node to directly access the memory, so that a distributed shared memory pool is constructed.
S2: on a distributed memory sharing pool, file indexing and file data block indexing are respectively carried out through two-stage Hash indexing, and query service is provided for a file system;
s3: and processing the request of the client by the self-identification remote procedure call method, and returning a processing result.
It should be noted that the distributed shared memory pool is composed of the shared memory of each node, and each node's shared memory has the same data layout: it stores, in order, the super block, the message pool, the chained hash index table, the metadata storage blocks, and the data storage blocks (as shown in Fig. 3). The shared memory pool serves both file storage and message transmission, changing both the storage medium and the communication mode of the file system; through this unified management, the whole software stack is thinned and processing becomes faster.
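The fixed region order can be sketched as a simple offset calculation over the shared memory. All sizes below are hypothetical; the patent specifies the order of the regions but not their sizes:

```python
def layout(msg_pool: int, index_tbl: int,
           meta_blocks: int, meta_sz: int,
           data_blocks: int, data_sz: int,
           super_sz: int = 4096):
    """Return the byte offset of each region in one node's shared memory,
    in the order the patent describes: super block, message pool,
    chained hash index table, metadata blocks, data blocks."""
    offsets, off = {}, 0
    for name, size in [("superblock", super_sz),
                       ("message_pool", msg_pool),
                       ("index_table", index_tbl),
                       ("metadata", meta_blocks * meta_sz),
                       ("data", data_blocks * data_sz)]:
        offsets[name] = off
        off += size
    return offsets, off  # per-region offsets and total size
```

Because every node uses the same layout, a remote node that has read the super block can compute the address of any region on any peer.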
In one embodiment of the invention, within the data layout the super block, the message pool, and the data area are registered with the network card, so that these areas can be accessed directly by remote nodes, reducing memory copies and improving efficiency. The chained hash index table and the metadata storage area are maintained exclusively by local service threads: the service node responds to all metadata requests and performs the corresponding index and metadata queries and updates. The metadata is hashed across the whole cluster by file path name, and each node independently maintains the metadata and data of its files, improving the overall performance of the file system.
In an embodiment of the present invention, the super block stores the core data structures of the file system, specifically the metadata block count, the metadata block size, the data block count, the data block size, and so on. This area is read remotely by each node at file system startup for initial identification and location.
In one implementation of the present invention, the message pool is used for communication between clients and the server. The message pool is divided into message areas of equal size, each occupied exclusively by one client; that is, each client is bound to a fixed offset in the service node's message pool. When a client has a new request, it remotely writes the request into its message area on the service node; after receiving it, the server queries the corresponding message area using the client's unique ID, identifies the message type, processes it, and returns the result.
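The fixed-offset binding can be modeled as a toy in Python. The region size and the id-to-offset rule (offset = id * region size) are assumptions for illustration, not taken from the patent:

```python
class MessagePool:
    """Toy message pool: equal-size regions, one per client; the server
    finds a client's message at the fixed offset id * region_size."""
    def __init__(self, num_regions: int, region_size: int):
        self.region_size = region_size
        self.buf = bytearray(num_regions * region_size)

    def offset_of(self, client_id: int) -> int:
        return client_id * self.region_size

    def remote_write(self, client_id: int, payload: bytes):
        # Stands in for the client's RDMA write into its own region.
        off = self.offset_of(client_id)
        self.buf[off:off + len(payload)] = payload

    def read(self, client_id: int, length: int) -> bytes:
        # Stands in for the server reading the region after a new request.
        off = self.offset_of(client_id)
        return bytes(self.buf[off:off + length])
```

Because the offset is a pure function of the client ID, the server never searches for a message; it jumps straight to the region, which is what makes the self-identification cheap.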
In an embodiment of the present invention, the chained hash index table is used for local metadata indexing. The table provides a globally unified entry point arranged as a linear table that indexes the individual chains; each entry contains three fields: the file name, the metadata address, and the address of the next entry. The lookup proceeds as follows: when querying file metadata, the hash value of the file's full path name is computed first and used as the index into the table; the file name of the corresponding entry is read and matched, and if the match succeeds, the metadata address is obtained and the metadata is accessed at that address; if the file names do not match, the search continues at the next entry address until the match succeeds.
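A minimal sketch of such a chained index, using Python's built-in `hash` as a stand-in for the unspecified hash function and tuples in place of raw memory addresses:

```python
class ChainedIndex:
    """Each entry holds (file name, metadata address, next entry);
    lookup hashes the full path to a bucket, then walks the chain
    matching file names, exactly as the patent describes."""
    def __init__(self, size: int):
        self.buckets = [None] * size

    def insert(self, path: str, meta_addr: int):
        i = hash(path) % len(self.buckets)
        # Prepend: the new entry's "next" pointer is the old chain head.
        self.buckets[i] = (path, meta_addr, self.buckets[i])

    def lookup(self, path: str):
        entry = self.buckets[hash(path) % len(self.buckets)]
        while entry is not None:
            name, addr, nxt = entry
            if name == path:
                return addr      # matched: return the metadata address
            entry = nxt          # mismatch: follow the next-entry pointer
        return None              # chain exhausted: file not present
```

In the real system the chain links are offsets inside the shared-memory index region rather than Python references, but the walk-and-match logic is the same.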
In one embodiment of the invention, the metadata storage block and the data storage block store metadata and data, respectively. The header of each area stores a bitmap of the corresponding area, indicating which blocks are occupied.
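The free-block bitmap at the head of each area might work as in this sketch. Block-granularity allocation and a lowest-free-bit policy are assumptions; the patent only says a bitmap records occupancy:

```python
class BitmapAllocator:
    """Toy free-block bitmap kept at the head of a region:
    bit i set means block i is in use."""
    def __init__(self, nblocks: int):
        self.bits = bytearray((nblocks + 7) // 8)
        self.nblocks = nblocks

    def alloc(self):
        for i in range(self.nblocks):
            byte, bit = divmod(i, 8)
            if not self.bits[byte] & (1 << bit):
                self.bits[byte] |= 1 << bit   # mark block i in use
                return i
        return None  # region full

    def free(self, i: int):
        byte, bit = divmod(i, 8)
        self.bits[byte] &= ~(1 << bit)        # mark block i free again
```

One byte of bitmap tracks eight fixed-size blocks, so the occupancy metadata stays a tiny fraction of the region it describes.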
According to this method of rebuilding the memory file system over an RDMA network, the overall performance of the system is greatly improved. A traditional distributed file system uses slow disks as the storage medium and communicates over gigabit Ethernet; because both the disk and the gigabit network have high (millisecond-level) latency, the performance loss introduced by the file system software is comparatively small. When the file system instead uses memory as the storage medium, the software stack occupies a large share of the latency of the whole data path and introduces many data copies (as shown in Fig. 4) and redundant context switches, so the overall performance does not improve linearly.
Fig. 5 illustrates the self-identifying remote procedure call technique according to an embodiment of the present invention. The method assumes a large-memory cluster interconnected by RDMA-capable hardware, where RDMA means that nodes can directly read and write remote memory without the direct participation of the remote CPU, and a large-memory cluster means that each node is equipped with large-capacity memory and has spare memory for constructing the distributed memory file system. The method comprises:
when the client sends a message to the server, the RDMA_WRITE_WITH_IMM primitive is used for data transmission and self-identification; when the server returns the request result, an RDMA primitive is used to write the returned data back. Specifically:
when the client sends a message to the server, the RDMA_WRITE_WITH_IMM primitive allows the request to carry client metadata; in particular, the client stores its ID and a timestamp in this area so that the server can quickly identify and locate the message;
when the server returns the request result, it writes the result directly back to the memory area designated by the client through an RDMA primitive; meanwhile, the client polls the memory area reserved for the returned result until the data has been successfully returned.
In one embodiment of the invention, the client ID is assigned by the server master node when the connection is established and is globally unique, so that each client automatically occupies one message area in the server's message pool through its ID.
In an embodiment of the invention, after an RDMA_WRITE_WITH_IMM message is successfully delivered, the receiving network card places the completion information into a completion queue. The server polls the completion queue with a dedicated thread to detect new requests; upon finding a new message, it first obtains the auxiliary information carried by the message from the completion information and parses out the client ID, directly queries the fixed offset in the local message pool given by that ID to obtain the new request, then parses the request content, executes the corresponding function on the server side, and returns the execution result.
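The self-identifying message header (client ID plus timestamp) can be sketched with a fixed binary layout. The field widths and byte order here are assumptions; the patent says only that the ID and timestamp sit at the head of the message:

```python
import struct
import time

# Assumed header layout: little-endian u32 client ID + u64 timestamp (ns).
HEADER = struct.Struct("<IQ")

def pack_request(client_id: int, payload: bytes) -> bytes:
    """Prepend the self-identifying header so the server can jump
    to the client's message region by ID alone."""
    return HEADER.pack(client_id, time.time_ns()) + payload

def unpack_request(msg: bytes):
    """Server side: split the header off and recover (id, timestamp, body)."""
    client_id, ts = HEADER.unpack_from(msg)
    return client_id, ts, msg[HEADER.size:]
```

A fixed-width binary header keeps the parse to a single `unpack`, which matters when the server thread is busy-polling the completion queue.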
In one embodiment of the invention, when the number of clients exceeds the number of message areas the server allocated in advance, the server looks for disconnected clients and transfers the message areas they occupied to the current client; if all clients remain active, a new message area must be allocated, registered with the network card, and announced to the client.
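The reclamation policy just described can be sketched as follows; the data structures are illustrative only, and the network card registration and client notification of the real system are reduced to a boolean flag:

```python
def assign_region(regions: list, status: dict, client: str):
    """Toy policy from the text: reuse a disconnected client's region
    if one exists; otherwise a new region is required (which, in the
    real system, must also be registered with the NIC and announced).
    Returns (region id, whether a fresh region had to be created)."""
    for rid, owner in enumerate(regions):
        if owner is not None and not status[owner]:
            regions[rid] = client        # take over a dead client's region
            return rid, False
    regions.append(client)               # everyone alive: grow the pool
    return len(regions) - 1, True
```

Reusing dead regions first avoids the expensive path (allocation plus NIC registration) whenever possible.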
According to the self-identifying remote procedure call technique of the invention, remote requests are answered in time. The technique has the following advantages. The RDMA_WRITE_WITH_IMM primitive is chosen to send messages, so that, while keeping latency low, the carried auxiliary information lets the server detect, identify, and process requests quickly. The server uses an RDMA primitive to write back the request result, which has extremely low latency, so the whole round trip is short. Moreover, the client can allocate the memory area for the returned result in advance, before issuing the remote request, and attach the corresponding address to the request, so that the server can write remotely to the given address; concurrency control over this client-side memory area is straightforward, so the technique adapts well to highly concurrent scenarios.
In addition, other configurations and functions of the RDMA-based distributed memory file system according to the embodiment of the present invention are known to those skilled in the art, and are not described in detail in order to reduce redundancy.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (10)

1. An RDMA-based distributed memory file system, wherein the distributed memory file system interconnects the memory of each node through RDMA; the file system comprises a client and a server, the client providing a file access interface for upper-layer applications to call, and the server providing metadata services and data services; and the distributed memory file system performs the following actions:
S1: in an initialization stage of the distributed memory file system, uniformly partitioning the memory of the cluster for file storage and registering the memory with the network card so that remote nodes can access it directly, thereby constructing a distributed shared memory pool;
S2: on the distributed shared memory pool, performing file indexing and file data block indexing, respectively, through a two-level hash index, providing query services for the file system;
S3: processing client requests through a self-recognition remote procedure call method and returning the processing results;
wherein the two-level hash index comprises: when a client initiates a file access request, calculating, from the full path name of the file, the ID of the metadata server that stores the file's metadata, the mapping being determined by a system configuration file; the client sends the request to the metadata server corresponding to that ID; after detecting the new message, the metadata server parses the request content, calculates a second hash value from the file path name in the request, accesses the chained hash index table according to the second hash value, obtains the metadata, performs the corresponding logic, and returns the request result;
the self-recognition remote procedure call method comprises: when the client sends a message to the server, carrying the message content with the RDMA_WRITE_WITH_IMM primitive and storing client metadata in the message header; when the server returns the request result, writing the result directly back to a memory region designated by the client through an RDMA write primitive, the client monitoring the memory region holding the returned result by polling until the data has been successfully returned.
2. The RDMA-based distributed memory file system of claim 1, wherein the distributed shared memory pool stores, in order, a superblock, a message pool, a chained hash index table, a metadata storage area, and a data storage area.
3. The RDMA-based distributed memory file system of claim 2, wherein the data storage area accepts direct access by remote nodes; in the chained hash index table and the metadata storage area, the service node responds to concurrent client requests and performs queries and updates on the metadata; the metadata is hashed and distributed across the whole cluster according to file path names, and each node independently maintains the metadata and data of its files.
4. The RDMA-based distributed memory file system of claim 2, wherein the superblock is used to record the number of metadata blocks, the metadata block size, the number of data blocks, and the data block size, the superblock being read remotely by the nodes at startup of the file system.
5. The RDMA-based distributed memory file system of claim 2, wherein the message pool comprises a plurality of message areas allocated to the different clients connected to the system; when a client issues a new request, it remotely writes the request into its message area on the service node; after the server's receiving thread detects the new request, it quickly locates the message using the self-recognition method, processes it, and writes the result back.
6. The RDMA-based distributed memory file system of claim 2, wherein the chained hash index table is used for metadata indexing; when querying file metadata, the hash value of the file's full path name is first calculated;
the hash value is used as the index number into the index table, the entries under that index number are queried and file names are matched; if a file name matches successfully, the metadata address is obtained and the metadata is accessed at that address;
if the file names do not match, the next entry is examined, continuing until a match succeeds.
7. The RDMA-based distributed memory file system of claim 2, wherein the metadata storage area and the data storage area store metadata and data, respectively; the two areas are divided into fixed-size metadata blocks and data blocks, and free-block bitmaps for the corresponding areas are stored at the head of the metadata area and the data area to record memory usage.
8. The RDMA-based distributed memory file system of claim 1, wherein the client stores its ID and a timestamp in the message header, the client ID being assigned by a server master node at connection setup and being globally unique.
9. The RDMA-based distributed memory file system of claim 8, wherein after an RDMA_WRITE_WITH_IMM message is successfully delivered, the server obtains the client metadata from the message header according to the completion information, parses out the client ID, and directly queries the fixed offset location in its local message pool according to the client ID to obtain the new request information.
10. The RDMA-based distributed memory file system of claim 8, wherein when the number of clients exceeds the number of message areas pre-allocated by the server, the server queries for disconnected clients and transfers the message areas they occupied to the current client; and if all the clients remain connected, a new message area is allocated, registered with the network card, and the current client is notified.
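The two-level hash index of claims 1 and 6 can be sketched as follows. Python is used purely for illustration; the constants (`NUM_SERVERS`, `TABLE_SIZE`), the MD5-based hash, and all class and function names are hypothetical choices, and a real implementation would place the chained table in RDMA-registered memory rather than Python lists:

```python
import hashlib

NUM_SERVERS = 4    # from the (assumed) system configuration file
TABLE_SIZE = 1024  # buckets in each server's chained hash index table

def h(path: str) -> int:
    # Any stable hash works; MD5 is used here only for illustration.
    return int.from_bytes(hashlib.md5(path.encode()).digest()[:8], "big")

# First level: full path name -> metadata server ID (claim 1).
def metadata_server_id(path: str) -> int:
    return h(path) % NUM_SERVERS

# Second level: chained hash index table on the chosen server (claim 6).
class MetadataServer:
    def __init__(self):
        self.table = [[] for _ in range(TABLE_SIZE)]  # bucket -> [(name, addr)]

    def insert(self, path, meta_addr):
        self.table[h(path) % TABLE_SIZE].append((path, meta_addr))

    def lookup(self, path):
        # Walk the chain under the second hash value, matching file
        # names until one succeeds; None models a lookup miss.
        for name, meta_addr in self.table[h(path) % TABLE_SIZE]:
            if name == path:
                return meta_addr
        return None

servers = [MetadataServer() for _ in range(NUM_SERVERS)]

def create(path, meta_addr):
    servers[metadata_server_id(path)].insert(path, meta_addr)

def stat(path):
    return servers[metadata_server_id(path)].lookup(path)
```

Because the first hash is computed from the full path name alone, any client can locate the responsible metadata server without a directory-traversal round trip, which is what lets the metadata be hashed and dispersed across the whole cluster.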
CN201611261722.8A 2016-12-30 2016-12-30 RDMA (remote direct memory Access) -based distributed memory file system Active CN108268208B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611261722.8A CN108268208B (en) 2016-12-30 2016-12-30 RDMA (remote direct memory Access) -based distributed memory file system


Publications (2)

Publication Number Publication Date
CN108268208A CN108268208A (en) 2018-07-10
CN108268208B true CN108268208B (en) 2020-01-17

Family

ID=62754948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611261722.8A Active CN108268208B (en) 2016-12-30 2016-12-30 RDMA (remote direct memory Access) -based distributed memory file system

Country Status (1)

Country Link
CN (1) CN108268208B (en)

Families Citing this family (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109062929B (en) * 2018-06-11 2020-11-06 上海交通大学 Query task communication method and system
CN109063103A (en) * 2018-07-27 2018-12-21 郑州云海信息技术有限公司 A kind of non-volatile file system of distribution
CN109407977B (en) * 2018-09-25 2021-08-31 佛山科学技术学院 Big data distributed storage management method and system
CN109446160A (en) * 2018-11-06 2019-03-08 郑州云海信息技术有限公司 A kind of file reading, system, device and computer readable storage medium
CN111277616B (en) * 2018-12-04 2023-11-03 中兴通讯股份有限公司 RDMA-based data transmission method and distributed shared memory system
WO2020155417A1 (en) * 2019-01-30 2020-08-06 Huawei Technologies Co., Ltd. Input/output processing in a distributed storage node with rdma
CN110018914B (en) * 2019-03-26 2021-08-13 中国人民银行清算总中心 Shared memory based message acquisition method and device
CN110109763A (en) * 2019-04-12 2019-08-09 厦门亿联网络技术股份有限公司 A kind of shared-memory management method and device
CN111858418B (en) * 2019-04-30 2023-04-07 华为技术有限公司 Memory communication method and device based on remote direct memory access RDMA
CN110221779B (en) * 2019-05-29 2020-06-19 清华大学 Construction method of distributed persistent memory storage system
CN110445848B (en) * 2019-07-22 2023-02-24 创新先进技术有限公司 Method and apparatus for transaction processing
CN110543367B (en) * 2019-08-30 2022-07-26 联想(北京)有限公司 Resource processing method and device, electronic device and medium
CN110837650B (en) * 2019-10-25 2021-08-31 华中科技大学 Cloud storage ORAM access system and method under untrusted network environment
CN111104548B (en) * 2019-12-18 2021-09-14 腾讯科技(深圳)有限公司 Data feedback method, system and storage medium
CN111125049B (en) * 2019-12-24 2023-06-23 上海交通大学 RDMA and nonvolatile memory-based distributed file data block read-write method and system
CN111240588B (en) * 2019-12-31 2021-09-24 清华大学 Persistent memory object storage system
CN111400307B (en) * 2020-02-20 2023-06-23 上海交通大学 Persistent hash table access system supporting remote concurrent access
CN111314731A (en) * 2020-02-20 2020-06-19 上海交通大学 RDMA (remote direct memory Access) mixed transmission method, system and medium for large data of video file
CN111367876B (en) * 2020-03-04 2023-09-19 中国科学院成都生物研究所 Distributed file management method based on memory metadata
CN111404931B (en) * 2020-03-13 2021-03-30 清华大学 Remote data transmission method based on persistent memory
CN113485822A (en) * 2020-06-19 2021-10-08 中兴通讯股份有限公司 Memory management method, system, client, server and storage medium
CN111539042B (en) * 2020-07-13 2020-10-30 南京云信达科技有限公司 Safe operation method based on trusted storage of core data files
CN112328560B (en) * 2020-11-25 2024-06-18 北京无线电测量研究所 File scheduling method and system
CN112596669A (en) * 2020-11-25 2021-04-02 新华三云计算技术有限公司 Data processing method and device based on distributed storage
CN112612734B (en) * 2020-12-18 2023-09-26 平安科技(深圳)有限公司 File transmission method, device, computer equipment and storage medium
WO2022160308A1 (en) * 2021-01-30 2022-08-04 华为技术有限公司 Data access method and apparatus, and storage medium
CN112817887B (en) * 2021-02-24 2021-09-17 上海交通大学 Far memory access optimization method and system under separated combined architecture
CN113238856B (en) * 2021-03-09 2022-07-26 西安奥卡云数据科技有限公司 RDMA-based memory management method and device
CN112954068B (en) * 2021-03-09 2022-09-27 西安奥卡云数据科技有限公司 RDMA (remote direct memory Access) -based data transmission method and device
CN112948025B (en) * 2021-05-13 2021-09-14 阿里云计算有限公司 Data loading method and device, storage medium, computing equipment and computing system
CN113204435B (en) * 2021-07-01 2021-12-03 阿里云计算有限公司 Data processing method and system
CN113395359B (en) * 2021-08-17 2021-10-29 苏州浪潮智能科技有限公司 File currency cluster data transmission method and system based on remote direct memory access
CN114302394B (en) * 2021-11-19 2023-11-03 深圳震有科技股份有限公司 Network direct memory access method and system under 5G UPF
CN116204487A (en) * 2021-11-30 2023-06-02 华为技术有限公司 Remote data access method and device
CN114756388B (en) * 2022-03-28 2024-05-31 北京航空航天大学 Method for sharing memory among cluster system nodes according to need based on RDMA
CN114726883B (en) * 2022-04-27 2023-04-07 重庆大学 Embedded RDMA system
CN116886719B (en) * 2023-09-05 2024-01-23 苏州浪潮智能科技有限公司 Data processing method and device of storage system, equipment and medium
CN117453986B (en) * 2023-12-19 2024-05-24 荣耀终端有限公司 Searching method, background server and searching system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1771495A (en) * 2003-05-07 2006-05-10 国际商业机器公司 Distributed file serving architecture system
CN105933325A (en) * 2016-06-07 2016-09-07 华中科技大学 Kernel mode RPC (Remote Procedure CALL) communication acceleration method based on NFSoRDMA (Network File System over Remote Direct Memory Access)
CN105978985A (en) * 2016-06-07 2016-09-28 华中科技大学 Memory management method of user-state RPC over RDMA


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HydraDB: a resilient RDMA-driven key-value middleware for in-memory cluster computing; Yandong Wang; International Conference for High Performance Computing, Networking, Storage & Analysis, IEEE; 2015-12-31; full text *

Also Published As

Publication number Publication date
CN108268208A (en) 2018-07-10

Similar Documents

Publication Publication Date Title
CN108268208B (en) RDMA (remote direct memory Access) -based distributed memory file system
CN106657365B (en) RDMA (remote direct memory Access) -based high-concurrency data transmission method
CN111277616B (en) RDMA-based data transmission method and distributed shared memory system
Jose et al. Memcached design on high performance RDMA capable interconnects
CA2512312C (en) Metadata based file switch and switched file system
CN109327539A (en) A kind of distributed block storage system and its data routing method
US7562110B2 (en) File switch and switched file system
US7512673B2 (en) Rule based aggregation of files and transactions in a switched file system
US7788335B2 (en) Aggregated opportunistic lock and aggregated implicit lock management for locking aggregated files in a switched file system
US8151062B2 (en) Consistency models in a distributed store
CN110177118A (en) A kind of RPC communication method based on RDMA
CN114756388B (en) Method for sharing memory among cluster system nodes according to need based on RDMA
CN111966446B (en) RDMA virtualization method in container environment
CN105138615A (en) Method and system for building big data distributed log
CN101997924A (en) Cloud storage file transfer protocol (CFTP)
US10708379B1 (en) Dynamic proxy for databases
US20240039995A1 (en) Data access system and method, device, and network adapter
CN102307206A (en) Caching system and caching method for rapidly accessing virtual machine images based on cloud storage
WO2017092384A1 (en) Clustered database distributed storage method and device
CN111400307A (en) Persistent hash table access system supporting remote concurrent access
CN102137161B (en) File-level data sharing and storing system based on fiber channel
US20040093390A1 (en) Connected memory management
CN110704541A (en) High-availability distributed method and architecture for Redis cluster multi-data center
CN114885007A (en) Method and electronic device for real-time strong consistency session synchronization
CN116866429A (en) Data access method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant