CN107888657A - Low latency distributed memory system - Google Patents



Publication number
CN107888657A
CN107888657A CN201710941988.5A CN201710941988A CN107888657A CN 107888657 A CN107888657 A CN 107888657A CN 201710941988 A CN201710941988 A CN 201710941988A CN 107888657 A CN107888657 A CN 107888657A
Authority
CN
China
Prior art keywords
data
memory
node
client
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710941988.5A
Other languages
Chinese (zh)
Other versions
CN107888657B (en)
Inventor
黄林鹏
董康平
沈艳艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN201710941988.5A
Publication of CN107888657A
Application granted
Publication of CN107888657B
Legal status: Active
Anticipated expiration


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G06F3/0611 Improving I/O performance in relation to response time
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655 Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0656 Data buffering arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0673 Single storage device
    • G06F3/0679 Non-volatile semiconductor memory device, e.g. flash memory, one time programmable memory [OTP]

Abstract

The invention provides a low-latency distributed storage system that stores data in byte-addressable non-volatile memory, reads and writes the stored data through remote direct memory access (RDMA), and backs the data up to multiple redundant nodes, so as to achieve low latency and high availability. The invention uses a centralized scheduler module to manage a cluster of multiple servers and its index information. A client needs to communicate with the scheduler module only when it connects to the storage system; all subsequent requests are sent directly to the server node that stores the corresponding data. Based on non-volatile memory and remote direct memory access, the invention can provide clients with a key-value storage system interface and a low-latency data storage service.

Description

Low-latency distributed storage system
Technical field
The present invention relates to the technical field of storage, and in particular to a low-latency distributed storage system based on non-volatile memory and remote direct memory access technology.
Background art
In-memory key-value storage systems are widely used in large software systems to provide high-bandwidth, low-latency data storage services. A typical in-memory key-value storage system currently keeps its data in memory to reduce read and write latency, and periodically writes the data to disk to ensure persistence. In addition, to improve fault tolerance, the data is usually replicated over the network to several standby machines in a master-slave manner. Disk performance and network speed are therefore the two bottlenecks that currently limit in-memory key-value storage systems, and how to overcome them is a problem that the field urgently needs to solve.
Technical terms used:
RDMA: Remote Direct Memory Access.
NVM: Non-volatile Memory.
DRAM: Dynamic Random Access Memory, the main memory in widespread use today.
RDMA and its LLP (Lower Layer Protocol) can be implemented on a NIC (network interface card), which is then referred to as an RNIC.
Summary of the invention
In view of the defects in the prior art, the object of the invention is to provide a low-latency distributed storage system.
The low-latency distributed storage system provided by the invention includes:
Scheduler module: manages the server nodes and clients in the storage system, stores the index information of the storage system, and directs each client to the corresponding server node;
Non-volatile memory storage module: manages the data stored in non-volatile memory and provides clients with a concurrent, low-latency storage service;
RDMA module: provides remote access to the data in the non-volatile memory storage module, operates on that data directly using remote direct memory reads and writes, and reduces the access latency of the storage system by bypassing the Ethernet protocol stack of the operating system kernel;
Data redundancy backup module: makes redundant backups of the data in the storage system, including synchronizing data and metadata to physically isolated backup nodes, and guarantees the consistency of master and slave node data; the module maintains this consistency by logging the modified data.
Preferably, managing the server nodes and clients in the storage system includes: managing the joining and removal of server nodes and the joining and leaving of clients, and updating and maintaining the index information of the storage system.
Preferably, managing the data stored in non-volatile memory includes:
Data sharding: the data stored on a single server node is divided into several tablet tables; each tablet table corresponds to one contiguous, non-overlapping range of the key-value hash space, and each tablet table has its own thread that performs the access operations;
Multithreaded execution: for every request to a server node, the tablet table it belongs to is computed from its hash value, and the request is placed in the request queue of that tablet table; the thread of the tablet table always takes and executes the request at the head of the request queue;
Hash-table-based index structure: each tablet table contains an independent hash table that stores the keys of all key-value data in the tablet table together with the hash values of those keys; colliding keys are resolved by separate chaining, and the hash table provides operations to query, insert, and delete the data in the tablet table;
Non-volatile memory allocator: the storage space of each tablet table is managed by an independent non-volatile memory allocator; when key-value data is inserted, the allocator allocates the corresponding space, and when key-value data is deleted, the allocator releases the corresponding space.
Preferably, providing remote access to the data in the non-volatile memory storage module includes:
Remote procedure call interface: built on the remote direct memory access primitives provided by InfiniBand, a remote procedure call interface is offered to the upper layer; the client initiates an operation, the operation is transmitted to the target server over the InfiniBand protocol, and the target server executes it locally;
Communication semantics model: message-oriented, unreliable-transport channel-type primitives are used, which support one-to-many communication and give the database scalability;
User-space protocol layer: using the primitive library provided by InfiniBand, the network protocol layer of the operating system kernel is bypassed, and user-space programs access the InfiniBand network card directly to send and receive data.
Preferably, the remote procedure call interface also includes an Add operation, an extension to the storage system interface that checks whether the supplied key-value data already exists in the database.
Preferably, the data redundancy backup module includes:
Collecting all modifications of a single request: data is backed up by synchronizing all modifications made by one request to the backup nodes; each write is tracked through the non-volatile memory allocator, and all written data and the corresponding offset addresses are recorded in a dedicated send buffer;
Backup data log: the backup data is logged; all data to be backed up is first written to a log area in non-volatile memory, and when a server restarts after a crash, it first reads the data in the log and performs the corresponding write operations, so as to become consistent with the other data replicas;
Connection-oriented remote direct memory write: connection-oriented remote direct memory writes are used for data transfer between the master node and the slave nodes.
Compared with the prior art, the invention has the following beneficial effects:
Based on non-volatile memory and remote direct memory access technology, the invention can provide clients with a key-value storage system interface and a low-latency data storage service.
Brief description of the drawings
Other features, objects, and advantages of the invention will become more apparent from the detailed description of non-limiting embodiments given with reference to the following drawings:
Fig. 1 is the architecture diagram of the scheduler module and the cluster;
Fig. 2 is a schematic diagram of the free lists;
Fig. 3 is a diagram of the working principle of a tablet table;
Fig. 4 is a schematic diagram of master-slave replication.
Detailed description of the embodiments
The invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the invention, but do not limit it in any way. It should be noted that those of ordinary skill in the art can make changes and improvements without departing from the inventive concept, and these all fall within the scope of protection of the invention.
The low-latency distributed storage system provided by the invention includes a scheduler module, a non-volatile memory storage module, an RDMA module, and a data redundancy backup module.
Scheduler module: manages the server nodes and clients in the storage system, including the joining and removal of server nodes and the joining and leaving of clients. It also stores the index information of the storage system, updates and maintains that index information, and directs each client to the corresponding server node. Specifically, it includes the following steps:
Scheduler start-up step: the scheduler is the first component of the cluster to start; it initializes the cluster configuration information, prepares the index data structure, and waits on a fixed port for the join requests of all server nodes.
Scheduler adds a server node: 1. verify the legitimacy of the server node and assign it a globally unique ID; 2. update the index data structure to add the server node.
Scheduler broadcasts the index information: once all server nodes of the cluster have joined, the scheduler finishes updating the index information and broadcasts it to all server nodes.
Scheduler receives client requests: 1. the scheduler waits on a fixed port for client access requests; 2. after a client connects, the scheduler sends it the index information of the cluster.
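The scheduler steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `Scheduler`, the `expected_servers` parameter, the `is_legitimate` check, and the example addresses are hypothetical, and network I/O on the fixed port is left out entirely.

```python
import itertools

class Scheduler:
    """Toy scheduler: registers servers, then publishes the index once all have joined."""

    def __init__(self, expected_servers):
        self.expected_servers = expected_servers
        self._next_id = itertools.count(1)   # source of globally unique IDs
        self.servers = {}                    # server ID -> address
        self.index = None                    # published only when the cluster is complete

    def add_server(self, address, is_legitimate=lambda addr: True):
        # Step 1: verify the node, then assign it a globally unique ID.
        if not is_legitimate(address):
            raise PermissionError(f"server {address} rejected")
        server_id = next(self._next_id)
        # Step 2: update the index data structure.
        self.servers[server_id] = address
        if len(self.servers) == self.expected_servers:
            # Snapshot that would be broadcast to every server node.
            self.index = dict(self.servers)
        return server_id

    def handle_client(self):
        # A client receives the cluster index after it connects.
        if self.index is None:
            raise RuntimeError("cluster is still initializing")
        return dict(self.index)
```

In this sketch the index is just a map from server ID to address; the hash ranges that the scheduler also records are omitted here for brevity.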
Non-volatile memory storage module: manages the data stored in non-volatile memory, including querying, modifying, writing, and deleting data. Through data sharding, it also provides clients with a concurrent, low-latency storage service. Specifically, it includes the following parts:
Data sharding: the data stored on a single server node is divided into several tablet tables; each tablet table corresponds to one contiguous, non-overlapping range of the key-value hash space, and each tablet table has its own thread that performs the access operations;
Multithreaded execution: for every request to a server node, the tablet table it belongs to is computed from its hash value, and the request is placed in the request queue of that tablet table; the thread of the tablet table always takes and executes the request at the head of the request queue;
Hash-table-based index structure: each tablet table contains an independent hash table that stores the keys of all key-value data in the tablet table together with the hash values of those keys; colliding keys are resolved by separate chaining, and the hash table provides operations to query, insert, and delete the data in the tablet table;
Non-volatile memory allocator: the storage space of each tablet table is managed by an independent non-volatile memory allocator; when key-value data is inserted, the allocator allocates the corresponding space, and when key-value data is deleted, the allocator releases the corresponding space.
RDMA module: provides remote access to the data in the non-volatile memory storage module, including an optimized communication semantics model and a remote procedure call interface; it operates on the data in the non-volatile memory storage module directly using remote direct memory reads and writes, and reduces the access latency of the storage system by bypassing the Ethernet protocol stack of the operating system kernel. Specifically, it includes the following parts:
Remote procedure call interface: built on the remote direct memory access primitives provided by InfiniBand, a remote procedure call interface is offered to the upper layer, mainly comprising the Put, Get, and Delete operations that insert, read, and delete key-value data. The client initiates an operation, the operation is transmitted to the target server over the InfiniBand protocol, and the target server executes it locally. The interface also includes an Add operation, an extension to the storage system interface that checks whether the supplied key-value data already exists in the database;
Communication semantics model: InfiniBand supports two kinds of semantic primitives, memory-type primitives and channel-type primitives. Memory-type primitives comprise remote direct memory reads and writes; channel-type primitives comprise sending messages to, or receiving messages from, a specific peer. InfiniBand also provides two transport modes: connection-oriented reliable transport and message-oriented unreliable transport. The invention uses message-oriented, unreliable-transport channel-type primitives, which support one-to-many communication and give the database scalability.
User-space protocol layer: using the primitive library provided by InfiniBand, the network protocol layer of the operating system kernel is bypassed, and user-space programs access the InfiniBand network card directly to send and receive data.
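To make the remote procedure call interface concrete, the following sketch shows one possible way to encode a Put, Get, Delete, or Add request into a message before it is handed to the transport. The patent does not specify a wire format; the header layout (1-byte opcode, 2-byte key length, 4-byte value length), the opcode numbers, and the function names are all assumptions made for illustration.

```python
import struct
from enum import IntEnum

class Op(IntEnum):
    # Operation names come from the text; the numeric codes are hypothetical.
    PUT = 1
    GET = 2
    DELETE = 3
    ADD = 4

# Assumed header: opcode (1 byte), key length (2 bytes), value length (4 bytes),
# little-endian, no padding.
_HEADER = struct.Struct("<BHI")

def encode_request(op, key, value=b""):
    """Serialize one request into a single message."""
    return _HEADER.pack(op, len(key), len(value)) + key + value

def decode_request(buf):
    """Parse a message produced by encode_request."""
    op, key_len, value_len = _HEADER.unpack_from(buf)
    start = _HEADER.size
    key = buf[start:start + key_len]
    value = buf[start + key_len:start + key_len + value_len]
    return Op(op), key, value
```

A fixed-size header keeps each request in one self-contained message, which fits the message-oriented channel primitives described above.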
Data redundancy backup module: makes redundant backups of the data in the storage system, including synchronizing data and metadata to physically isolated backup nodes, and guarantees the consistency of master and slave node data; the module maintains this consistency by logging the modified data. Specifically, it includes the following parts:
Collecting all modifications of a single request: data is backed up by synchronizing all modifications made by one request to the backup nodes; each write is tracked through the non-volatile memory allocator, and all written data and the corresponding offset addresses are recorded in a dedicated send buffer;
Backup data log: the backup data is logged; all data to be backed up is first written to a log area in non-volatile memory, and when a server restarts after a crash, it first reads the data in the log and performs the corresponding write operations, so as to become consistent with the other data replicas;
Connection-oriented remote direct memory write: connection-oriented remote direct memory writes are used for data transfer between the master node and the slave nodes, taking full advantage of their low latency and high bandwidth.
The invention is implemented as follows:
Scheduler adds server nodes: the architecture of the scheduler and the cluster is shown in Fig. 1. After the scheduler starts, it listens on TCP port 9090 and waits for servers to request access to the cluster. The scheduler first adds each server that requests access to its server list. When all servers have joined, the scheduler updates the cluster index list and assigns each server a corresponding hash-value range. Key-value data whose hash falls into a given range is then stored on the corresponding server node. Finally, the scheduler sends the index list to every server node.
Hash-value space: the invention distributes data evenly across the server nodes by hashing the key-value data. The hash-value space used is the range of 64-bit unsigned integers. This space is partitioned according to the number of servers in the cluster, and the parts are assigned to the servers that join the cluster. The assignment is recorded in the index structure of the cluster. When a client accesses the cluster, it uses the recorded assignment to find the server that stores the data in question.
Constructing and broadcasting the index structure: as servers join the cluster, the scheduler assigns them hash ranges of equal size in the order in which they joined. The scheduler records this information in the index structure. When all servers have joined the cluster, the scheduler broadcasts the index structure to all servers. After the cluster is initialized, the scheduler also sends the index information to every client that connects.
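The equal-size partition of the 64-bit hash space and the resulting lookup can be sketched as follows. This is a minimal illustration, not the patented implementation: the patent does not name a hash function, so BLAKE2b truncated to 8 bytes is an assumption, and `build_index`, `owner_of_hash`, and `lookup` are hypothetical names.

```python
import bisect
import hashlib

HASH_SPACE = 1 << 64  # the 64-bit unsigned hash-value space

def hash64(key: bytes) -> int:
    # Assumption: any well-mixed 64-bit hash would do; BLAKE2b is used for illustration.
    return int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")

def build_index(servers):
    """Assign equal-size hash ranges in join order; returns (range starts, servers)."""
    n = len(servers)
    starts = [i * HASH_SPACE // n for i in range(n)]
    return starts, list(servers)

def owner_of_hash(index, h: int):
    """Return the server whose range contains hash value h."""
    starts, servers = index
    # The owner is the last range whose start is <= h.
    return servers[bisect.bisect_right(starts, h) - 1]

def lookup(index, key: bytes):
    """Client-side lookup: hash the key, then find the owning server."""
    return owner_of_hash(index, hash64(key))
```

Because the index is a short sorted list of range starts, a client can cache it and resolve any key locally without contacting the scheduler again.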
Server initialization: after a server starts, it first initializes its non-volatile memory storage region, including initializing the tablet tables, allocating local storage space, and initializing the local hash tables. After local initialization, the server enters the ready state and connects to the cluster. The server stores the index structure information distributed by the scheduler in its local non-volatile memory.
Client accesses the cluster: a client accesses the cluster through the library provided by the invention. The client first connects to the scheduler, which sends it the constructed index structure. The client caches this index structure locally and uses it in subsequent requests to look up the corresponding server. Once the index structure has been received, the client can disconnect from the scheduler.
Client initiates a query request: when a client asks the cluster for the value corresponding to a given key, it first uses the cached index structure to find the server that stores the key. This process has three steps: 1. compute the 64-bit hash of the key; 2. determine the hash range into which this hash falls; 3. obtain the address information of the server that owns that range. The client then sends the request to that server over InfiniBand as an unreliable datagram. Sending has three steps: 1. allocate the send buffer needed for the request in non-volatile memory; 2. copy the request into the send buffer; 3. post the send buffer and the send request to the InfiniBand send queue. Finally, once the request has been sent, the client enters the state of waiting for the server's reply. It first allocates the receive buffer needed for the reply. When the reply to the query arrives, the client copies the value to the specified address space, or reports the error that the key does not exist. It then releases the receive buffer.
Allocating and releasing send and receive buffers: all send and receive buffers are allocated in advance. The invention maintains one free-buffer queue for send buffers and another for receive buffers. Each time a client sends a request, it takes a send buffer from the send-buffer free queue. When the request has been sent, the send buffer is returned to the send-buffer free queue, and a receive buffer is taken from the receive-buffer free queue. When the reply to the request has been received, the receive buffer is returned to the receive-buffer free queue.
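The buffer handling described above can be sketched with two pre-allocated pools. This is an illustration only: `BufferPool` and `request_reply` are hypothetical names, and the `send` and `receive` callables stand in for posting work to the InfiniBand queues, which is not modeled here.

```python
from collections import deque

class BufferPool:
    """Pre-allocated buffers handed out through a free queue."""

    def __init__(self, count, size):
        self.size = size
        self._free = deque(bytearray(size) for _ in range(count))

    def acquire(self):
        if not self._free:
            raise RuntimeError("no free buffer available")
        return self._free.popleft()

    def release(self, buf):
        self._free.append(buf)

    def free_count(self):
        return len(self._free)

def request_reply(send_pool, recv_pool, request, send, receive):
    """One request/reply exchange; `send` and `receive` are stand-in transport hooks."""
    if len(request) > send_pool.size:
        raise ValueError("request larger than a send buffer")
    send_buf = send_pool.acquire()
    send_buf[:len(request)] = request           # copy the request into the send buffer
    send(bytes(send_buf[:len(request)]))
    send_pool.release(send_buf)                 # the request has been sent
    recv_buf = recv_pool.acquire()
    reply_len = receive(recv_buf)               # fills recv_buf, returns the reply length
    reply = bytes(recv_buf[:reply_len])         # copy the reply out before releasing
    recv_pool.release(recv_buf)
    return reply
```

Allocating every buffer up front keeps allocation off the request path, so a request costs only two queue operations per buffer.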
Server receives client requests: the server dedicates one thread to polling for client requests. When the polling thread receives a request from a client, the server first checks whether the hash of the request's key lies in the hash range assigned to this server. If not, it rejects the request and returns an error message to the client. Otherwise, the server uses the hash of the key to find the corresponding local tablet table, appends the request to the tail of that tablet table's request queue, and wakes the tablet table's worker thread to execute the request. After execution, the worker thread replies to the client with the corresponding information or an error message.
Server looks up the tablet table: each server further subdivides the hash range assigned to it by the scheduler and assigns each sub-range to one of its local tablet tables. When the server receives a request, it uses this local assignment to find the tablet table that corresponds to the key of the request.
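A sketch of this second-level lookup is given below. It assumes, as an illustration, that the server splits its range into equal-size sub-ranges (the text only says the range is subdivided), and the names `split_range` and `tablet_for` are hypothetical. The range check mirrors the rejection of requests whose hash does not belong to the server.

```python
import bisect

def split_range(start, end, tablets):
    """Split [start, end) into contiguous sub-ranges; returns their start values."""
    width = end - start
    return [start + i * width // tablets for i in range(tablets)]

def tablet_for(starts, start, end, key_hash):
    """Return the index of the local tablet table that owns key_hash."""
    if not start <= key_hash < end:
        # The hash is outside the range the scheduler assigned to this server.
        raise KeyError("hash outside this server's range")
    # The owner is the last sub-range whose start is <= key_hash.
    return bisect.bisect_right(starts, key_hash) - 1
```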
Tablet table: each tablet table on a server has its own hash table, 64 MB of non-volatile memory, request queue, and worker thread.
Hash table in a tablet table: the hash table has size 1000003, that is, it has 1000003 slots. The storage addresses of the keys mapped to this tablet table are placed in these slots. When different keys map to the same slot, the invention resolves the collision by separate chaining: a key that maps to an occupied slot is linked through the next pointer of the key already stored there. The hash table is located at the end of the tablet table.
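The chained hash table can be sketched as follows. This is a plain in-memory illustration, not a layout for non-volatile memory: the class names are hypothetical, and `address` stands for the storage address of the key-value data inside the tablet table.

```python
class Node:
    """One chain entry: a key, the address of its data, and the next pointer."""
    __slots__ = ("key", "address", "next")

    def __init__(self, key, address):
        self.key, self.address, self.next = key, address, None

class ChainedHashTable:
    def __init__(self, slots=1000003):
        self.slots = [None] * slots

    def insert(self, key_hash, key, address):
        i = key_hash % len(self.slots)
        node = self.slots[i]
        if node is None:
            self.slots[i] = Node(key, address)
            return
        while True:
            if node.key == key:              # existing key: update its address
                node.address = address
                return
            if node.next is None:            # collision: link through the next pointer
                node.next = Node(key, address)
                return
            node = node.next

    def find(self, key_hash, key):
        node = self.slots[key_hash % len(self.slots)]
        while node is not None:
            if node.key == key:
                return node.address
            node = node.next
        return None

    def delete(self, key_hash, key):
        i = key_hash % len(self.slots)
        prev, node = None, self.slots[i]
        while node is not None:
            if node.key == key:
                if prev is None:
                    self.slots[i] = node.next
                else:
                    prev.next = node.next
                return True
            prev, node = node, node.next
        return False
```

A prime slot count such as 1000003 helps spread keys whose hashes share common factors.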
Storage space and memory allocation of a tablet table: as shown in Fig. 2, the storage space of a tablet table, apart from the space occupied by the hash table, is used to store key-value data. The invention uses a custom non-volatile memory allocator to manage the storage space of a tablet table. The allocator maintains one free-block list for each block size from 16 bytes to 1024 bytes, in increments of 16 bytes. For an allocation request, the allocator first rounds the requested size up to the smallest block size not less than it, then searches the corresponding free-block list. If that list contains no free block, it searches the free list of the next larger block size, and so on, until it obtains a free block not smaller than the requested size. Suppose the requested block size is N and the free block found has size M. The block of size M is then split into a block of size N and a block of size M-N. The block of size N is returned to the requester, and the block of size M-N is added to the free list for its size. Correspondingly, when a block of size N is released, the allocator first checks whether its adjacent blocks are free. If so, it removes them from their free lists and merges them with the current block into a larger block of size M. This process repeats until no more free blocks can be merged. Finally, the merged free block is added to the corresponding free list.
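The allocation, split, and merge steps above can be sketched as follows. This is a simplified model that tracks offsets in an arena rather than real non-volatile memory. Two points are assumptions not stated in the text: free blocks larger than 1024 bytes (such as the initial arena) are kept in one extra overflow list, and requests above 1024 bytes are rejected. All names are hypothetical.

```python
from collections import defaultdict

STEP, MAX_BLOCK = 16, 1024
OVERFLOW = MAX_BLOCK + STEP  # assumed extra list for free blocks larger than 1024 bytes

class Allocator:
    def __init__(self, arena_size):
        assert arena_size % STEP == 0
        self.free = {}                  # start offset -> size of each free block
        self.ends = {}                  # end offset -> start offset of each free block
        self.lists = defaultdict(set)   # size class -> start offsets (the free lists)
        self.used = {}                  # start offset -> size of each allocated block
        self._add(0, arena_size)

    @staticmethod
    def _cls(size):
        return min(size, OVERFLOW)

    def _add(self, offset, size):
        self.free[offset] = size
        self.ends[offset + size] = offset
        self.lists[self._cls(size)].add(offset)

    def _remove(self, offset):
        size = self.free.pop(offset)
        del self.ends[offset + size]
        self.lists[self._cls(size)].discard(offset)
        return size

    def alloc(self, nbytes):
        size = max(STEP, -(-nbytes // STEP) * STEP)   # round up to a multiple of 16
        if size > MAX_BLOCK:
            raise ValueError("request exceeds the largest block size")
        for cls in range(size, OVERFLOW + 1, STEP):   # same class first, then larger ones
            if self.lists[cls]:
                offset = min(self.lists[cls])
                found = self._remove(offset)
                if found > size:                      # split: N is returned, M-N stays free
                    self._add(offset + size, found - size)
                self.used[offset] = size
                return offset
        raise MemoryError("no free block large enough")

    def release(self, offset):
        size = self.used.pop(offset)
        while offset + size in self.free:             # merge with the following free block
            size += self._remove(offset + size)
        while offset in self.ends:                    # merge with the preceding free block
            offset = self.ends[offset]
            size += self._remove(offset)
        self._add(offset, size)
```

For example, with a 4096-byte arena, `alloc(20)` is rounded up to 32 bytes and carved from the front of the arena; once every block has been released, the free blocks merge back into a single 4096-byte block.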
Request queue and worker thread of a tablet table: as shown in Fig. 3, each tablet table is assigned its own request queue, and every request mapped to the tablet table is appended to the tail of the queue. When a request is added, the worker thread of the tablet table is woken and takes one pending request from the head of the queue. When the request has been executed, the worker thread sends the reply to the client and checks whether the request queue is empty. If it is not empty, the worker thread repeats the process of taking a request from the head and executing it. Otherwise, the worker thread continues to poll the head of the queue for 20 µs. If a request is added to the queue during that time, the worker thread executes it; otherwise, the worker thread goes to sleep.
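The poll-then-sleep behaviour of the worker thread can be sketched with a condition variable. This is an illustration under assumptions: `TabletWorker`, `execute`, and the `replies` list are hypothetical, replies are collected locally instead of being sent to a client, and a Python thread cannot time a 20 µs window precisely.

```python
import threading
import time
from collections import deque

class TabletWorker:
    POLL_SECONDS = 20e-6  # poll the queue head briefly before going to sleep

    def __init__(self, execute):
        self._queue = deque()
        self._cond = threading.Condition()
        self._execute = execute
        self._stopped = False
        self.replies = []                  # stands in for replies sent to clients
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, request):
        with self._cond:
            self._queue.append(request)    # requests join the tail of the queue
            self._cond.notify()            # wake the worker if it is asleep

    def stop(self):
        with self._cond:
            self._stopped = True
            self._cond.notify()
        self._thread.join()

    def _next_request(self):
        deadline = time.monotonic() + self.POLL_SECONDS
        while not self._queue and time.monotonic() < deadline:
            pass                           # busy-poll phase
        with self._cond:
            while not self._queue and not self._stopped:
                self._cond.wait()          # sleep until a request arrives
            return self._queue.popleft() if self._queue else None

    def _run(self):
        while True:
            request = self._next_request()
            if request is None:            # stopped and the queue is drained
                return
            self.replies.append(self._execute(request))
```

Polling briefly before sleeping avoids the wake-up latency of the condition variable when requests arrive back to back, at the cost of a short burst of CPU time.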
Worker thread of a tablet table executes requests: to execute a query request, the worker thread 1. looks up the specified key in the hash table; 2. if the key exists, returns the corresponding value, and otherwise returns a key-not-found error. To execute an insert request, it 1. looks up the specified key in the hash table; 2. if the key does not exist, calls the memory allocator of the invention to allocate, in the tablet table's storage space, enough space for the key-value data to be inserted; if the key exists: a. if the existing key-value data occupies more space than the new key-value data, the storage space of the existing data is reused and the surplus space is released; b. if the existing key-value data occupies less space than the new key-value data, the storage space of the existing data is released and the memory allocator allocates a larger space to store the new data. To execute a delete request, it 1. looks up the specified key in the hash table; 2. if the key does not exist, the request is complete; otherwise, it releases the storage space occupied by the key and removes its address from the hash table.
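The three request types can be sketched as below. This is a toy model: a Python dict stands in for the hash table, a `bytearray` stands in for a block of non-volatile memory, and allocator calls are only recorded in an `events` list. The class and method names are hypothetical. The text does not say what happens when the old and new data have the same size; this sketch reuses the block in that case.

```python
class Tablet:
    def __init__(self):
        self.index = {}     # key -> bytearray standing in for a block of NVM
        self.events = []    # allocator calls, recorded for illustration

    def _alloc(self, size):
        self.events.append(("alloc", size))
        return bytearray(size)

    def _free(self, size):
        self.events.append(("free", size))

    def get(self, key):
        block = self.index.get(key)
        if block is None:
            raise KeyError(key)                  # the key does not exist
        return bytes(block)

    def put(self, key, value):
        block = self.index.get(key)
        if block is None:
            block = self._alloc(len(value))      # new key: allocate space
        elif len(block) >= len(value):
            surplus = len(block) - len(value)    # case a: reuse, release the surplus
            if surplus:
                self._free(surplus)
                del block[len(value):]
        else:
            self._free(len(block))               # case b: release, allocate a larger block
            block = self._alloc(len(value))
        block[:] = value
        self.index[key] = block

    def delete(self, key):
        block = self.index.pop(key, None)
        if block is not None:                    # a missing key completes the request
            self._free(len(block))
```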
Data replication between servers: to ensure that the cluster can still serve clients continuously when a single server goes down, the invention backs up every piece of data on three physically isolated servers. When one of the servers goes down or becomes unreachable, the remaining servers can still serve client requests. This requires the data on the three servers to be kept consistent. The invention achieves this consistency by synchronizing all local modifications of a server to its backup servers.
Master-slave replication: as shown in Fig. 4, every piece of data in the cluster is stored on three servers. One of them is the master node and the other two are slave nodes. All client requests are sent to the master node. If executing a client request modifies the master node's local storage space, the modifications are recorded. When the master node has finished executing the request, it first synchronizes the modification information to the two slave nodes and waits for their replies. After receiving the replies of both slave nodes, the master node returns the result of the request to the client.
The master node collects all modifications of a single request: A query request does not modify the master node's local data, so no synchronization between the master and slave nodes is needed. For an insert or delete request, every modification to the master node's local non-volatile memory storage space is recorded by the purpose-built memory allocator of the invention. This modification information is stored in a send buffer.
The master node synchronizes all modifications of a single request: After the master node has finished executing the client request, it sends the data in the modification buffer to the two slave nodes simultaneously, specifying the address within each slave node's log area at which the data is to be received. The master node then waits for the replies of the two slave nodes. If and only if both slave nodes reply normally does the master node reply normally to the client. Otherwise, the master node returns the corresponding error to the client.
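The collect-then-synchronize flow described above can be sketched in-process as follows. This is an illustration under stated assumptions, not the patented implementation: `Slave`, `Master`, and `execute_write` are hypothetical names, a thread pool stands in for the simultaneous RDMA transfers, a Python list stands in for each slave's log area, and the sketch does not model what happens to the master's local write when a slave fails to acknowledge, since the text does not specify it.

```python
from concurrent.futures import ThreadPoolExecutor


class Slave:
    """Hypothetical in-process stand-in for a slave node."""

    def __init__(self):
        self.healthy = True
        self.log = []  # stands in for the slave's log area

    def receive(self, mods):
        # Returns True as the acknowledgement, False if the node is down.
        if not self.healthy:
            return False
        self.log.extend(mods)
        return True


class Master:
    def __init__(self, slaves, size=64):
        if len(slaves) != 2:
            raise ValueError("the scheme described uses exactly two slave nodes")
        self.slaves = slaves
        self.mem = bytearray(size)   # ordinary RAM standing in for NVM
        self.send_buffer = []        # (offset, data) records for one request

    def _write(self, offset, data):
        # Every local write is recorded together with its offset.
        self.mem[offset:offset + len(data)] = data
        self.send_buffer.append((offset, bytes(data)))

    def execute_write(self, offset, data):
        self.send_buffer.clear()
        self._write(offset, data)
        mods = list(self.send_buffer)
        # Send to both slaves at once and wait for both replies.
        with ThreadPoolExecutor(max_workers=2) as pool:
            acks = list(pool.map(lambda s: s.receive(mods), self.slaves))
        # Reply normally only if both slaves replied normally.
        return "OK" if all(acks) else "ERROR"
```

The client-visible result depends on both acknowledgements, which is what makes every acknowledged write present on all three copies.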
Modification log area of the slave node: On a slave node, the region that receives and stores the modification information synchronized from the master node is called the NMLOG area. Each tablet on a slave node contains two NMLOGs. Modification information from the master node is written sequentially into NMLOG No. 1. When NMLOG No. 1 is full, a background synchronization thread is woken up and the modification information stored in NMLOG No. 1 is applied to the corresponding positions of the corresponding tablet on the slave node. Meanwhile, NMLOG No. 2 is set to receive modification information, so that modifications subsequently synchronized from the master node are stored in NMLOG No. 2.
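The alternation between the two NMLOGs can be sketched as a double buffer. The following is a simplified illustration with hypothetical names (`SlaveTablet`, `receive`): log capacity is counted in entries rather than bytes, and the full log is applied inline for determinism, whereas the text describes a background thread doing that work.

```python
class SlaveTablet:
    """Hypothetical model of one tablet on a slave node with two NMLOGs."""

    def __init__(self, size, log_capacity):
        self.mem = bytearray(size)    # ordinary RAM standing in for NVM
        self.logs = [[], []]          # NMLOG No. 1 and NMLOG No. 2
        self.active = 0               # index of the NMLOG currently receiving
        self.capacity = log_capacity  # assumed: capacity counted in entries

    def receive(self, offset, data):
        log = self.logs[self.active]
        log.append((offset, bytes(data)))   # sequential append
        if len(log) >= self.capacity:
            full = self.active
            self.active = 1 - self.active   # the other NMLOG now receives
            self._apply(full)               # background thread in the text

    def _apply(self, idx):
        # Replay each logged modification at its recorded offset.
        for offset, data in self.logs[idx]:
            self.mem[offset:offset + len(data)] = data
        self.logs[idx].clear()
```

With a capacity of two entries, the first write only lands in the log; the second fills the log, switches the active NMLOG, and applies both writes to the tablet.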
Master node crash during synchronization: If the master node crashes while modification information is being synchronized, one of the slave nodes is elected as the new master node. The new master node first applies the modification information in its two NMLOGs to its local non-volatile memory storage space. Incomplete modification information is discarded. In this way, the new master node reaches a state approximately consistent with that of the original master node (differing at most by the last modification, which may be incomplete). At the same time, the cluster allocates a new slave node to the new master node, so that it always has two slave nodes.
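The recovery step performed by a newly elected master can be sketched as a log replay that skips incomplete records. The text does not say how an incomplete record is detected, so this sketch assumes each record carries a declared length that can be compared with the bytes actually received; `promote` and the record layout are hypothetical, and the NMLOGs are assumed to be passed oldest first.

```python
def promote(mem, nmlogs):
    """Replay both NMLOGs into local memory; return (applied, discarded).

    Each record is (offset, declared_len, payload). A record whose payload
    is shorter or longer than declared_len is treated as incomplete.
    """
    applied = discarded = 0
    for log in nmlogs:                      # assumed order: oldest log first
        for offset, declared_len, payload in log:
            if len(payload) != declared_len:
                discarded += 1              # incomplete modification: drop it
                continue
            mem[offset:offset + declared_len] = payload
            applied += 1
        log.clear()
    return applied, discarded
```

After the replay, the promoted node holds every complete modification it had received, which matches the approximately consistent state described above.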
Slave node crash during synchronization: If a slave node crashes during synchronization and is able to recover on its own, the master node's data is synchronized to that slave node. If it cannot recover, the cluster allocates a new slave node to the master node and synchronizes the data to the new slave node.
Those skilled in the art will appreciate that, besides implementing the system provided by the invention and its devices, modules, and units purely as computer-readable program code, the method steps can be logically programmed so that the system and its devices, modules, and units realize the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. The system provided by the invention and its devices, modules, and units can therefore be regarded as a hardware component, and the devices, modules, and units it contains for realizing various functions can be regarded as structures within that hardware component; the devices, modules, and units for realizing various functions can also be regarded both as software modules implementing the method and as structures within the hardware component.
Specific embodiments of the invention have been described above. It should be understood that the invention is not limited to the particular embodiments described; those skilled in the art can make various changes or modifications within the scope of the claims without affecting the substance of the invention. Where no conflict arises, the embodiments of this application and the features of those embodiments may be combined with one another arbitrarily.

Claims (6)

  1. A low-latency distributed storage system, characterized by comprising:
    Scheduling module: manages the server nodes and clients in the storage system, stores the index information of the storage system, and directs each client to the corresponding server node;
    Non-volatile memory storage module: manages the data stored in non-volatile memory and provides clients with a concurrent, low-latency storage service;
    RDMA module: provides the ability to access the data in the non-volatile memory storage module remotely, operating on that data directly through remote direct memory read and write, and reduces the access latency of the storage system by bypassing the Ethernet protocol stack of the operating system kernel;
    Data redundancy backup module: performs redundant backup of the data in the storage system, including synchronizing data and metadata to physically isolated backup nodes, and ensures the consistency of the data on the master and slave nodes; the data redundancy backup module maintains this consistency by write-logging the modified data.
  2. The low-latency distributed storage system according to claim 1, characterized in that managing the server nodes and clients in the storage system comprises: managing the addition and removal of server nodes and the joining and leaving of clients, and updating and maintaining the index information of the storage system.
  3. The low-latency distributed storage system according to claim 1, characterized in that managing the data stored in non-volatile memory comprises:
    Data sharding: the data stored on a single server node is divided into several tablets, each tablet corresponding to one contiguous, non-overlapping segment of the hash space of the key-value data, and each tablet being accessed by its own independent thread;
    Multi-threaded execution: for every request received by a server node, the tablet to which the request belongs is computed from its hash value and the request is placed in that tablet's request queue; the thread corresponding to the tablet always fetches and executes the request at the head of the queue;
    Hash-table-based index structure: each tablet contains an independent hash table that stores the keys of all key-value data in the tablet and the hash values of those keys; conflicting keys are resolved by chaining (the open-chain method); the hash table supports query, insert, and delete operations on the data in the tablet;
    Non-volatile memory allocator: the storage space of each tablet is managed by an independent non-volatile memory allocator; when key-value data is inserted, the allocator allocates the corresponding space; when key-value data is deleted, the allocator releases the corresponding space.
  4. The low-latency distributed storage system according to claim 1, characterized in that providing the ability to access the data in the non-volatile memory storage module remotely comprises:
    Remote procedure call interface: built on the remote direct memory access primitives provided by InfiniBand, a remote procedure call interface is offered to the upper layer; an operation is initiated by the client, transferred to the target server over the InfiniBand protocol, and then executed locally by the target server;
    Communication semantic model: message-oriented, connectionless channel-type primitives are used, which support one-to-many communication and give the database scalability;
    User-space protocol layer: using the primitive library provided by InfiniBand, the network protocol layer of the operating system kernel is bypassed, and user-space programs access the InfiniBand network adapter directly to send and receive data.
  5. The low-latency distributed storage system according to claim 4, characterized in that the remote procedure call interface further comprises an Add operation that extends the storage system's interface and checks whether the supplied key-value data already exists in the database.
  6. The low-latency distributed storage system according to claim 1, characterized in that the data redundancy backup module comprises:
    Collecting all modifications of a single request: data is backed up by synchronizing all modifications of a request to the backup nodes; every write is tracked through the non-volatile memory allocator, and all written data together with the corresponding offset addresses are recorded in a dedicated send buffer;
    Backup data log: backup data is write-logged, that is, all data to be backed up is first written to a log area in non-volatile memory; when a server restarts after a crash, it first reads the data in the log and performs the corresponding write operations in order to become consistent with the other replicas;
    Connection-oriented remote direct memory write: data transfer between the master node and the slave nodes uses connection-oriented remote direct memory write.
CN201710941988.5A 2017-10-11 2017-10-11 Low latency distributed storage system Active CN107888657B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710941988.5A CN107888657B (en) 2017-10-11 2017-10-11 Low latency distributed storage system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710941988.5A CN107888657B (en) 2017-10-11 2017-10-11 Low latency distributed storage system

Publications (2)

Publication Number Publication Date
CN107888657A true CN107888657A (en) 2018-04-06
CN107888657B CN107888657B (en) 2020-11-06

Family

ID=61781297

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710941988.5A Active CN107888657B (en) 2017-10-11 2017-10-11 Low latency distributed storage system

Country Status (1)

Country Link
CN (1) CN107888657B (en)

Also Published As

Publication number Publication date
CN107888657B (en) 2020-11-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant