CN107888657A - Low latency distributed memory system - Google Patents
- Publication number: CN107888657A
- Application number: CN201710941988.5A
- Authority
- CN
- China
- Prior art keywords
- data
- memory
- node
- client
- server
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1097—Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
- G06F3/0611—Improving I/O performance in relation to response time
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0655—Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
- G06F3/0656—Data buffering arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/067—Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/0671—In-line storage system
- G06F3/0673—Single storage device
- G06F3/0679—Non-volatile semiconductor memory device, e.g. flash memory, one time programmable memory [OTP]
Abstract
The invention provides a low-latency distributed storage system that stores data in byte-addressable non-volatile memory, reads and writes the stored data through remote direct memory access (RDMA), and backs the data up to multiple redundant nodes to achieve low latency and high availability. A centralized scheduler module manages the cluster of servers and its index information. A client needs to communicate with the scheduler module only when it connects to the storage system; all subsequent requests are sent directly to the server node that stores the corresponding data. Based on non-volatile memory and RDMA, the invention provides a key-value storage interface to clients and delivers a low-latency data storage service.
Description
Technical field
The present invention relates to the technical field of storage and, in particular, to a low-latency distributed storage system based on non-volatile memory and remote direct memory access.
Background art
In-memory key-value storage systems are widely used in large software systems to provide high-bandwidth, low-latency data storage services. A typical in-memory key-value storage system today keeps data in memory to reduce read and write latency and periodically writes the data to disk to ensure persistence. In addition, to improve fault tolerance, data is usually replicated over the network to several standby machines in a master-slave manner. Disk performance and network speed are therefore the two bottlenecks that currently limit in-memory key-value storage systems, and overcoming them is a problem that urgently needs to be solved in this field.
Technical terms used:
RDMA: Remote Direct Memory Access.
NVM: Non-Volatile Memory.
DRAM: Dynamic Random Access Memory, the main memory in widespread use today.
RDMA and its LLP (Lower Layer Protocol) can be implemented on a NIC (network interface card), which is then called an RNIC.
Summary of the invention
In view of the defects in the prior art, the object of the present invention is to provide a low-latency distributed storage system.
The low-latency distributed storage system provided by the invention comprises:
Scheduler module: manages the server nodes and clients in the storage system, stores the index information of the storage system, and directs each client to the corresponding server node;
Non-volatile memory storage module: manages the data stored in non-volatile memory and provides clients with a concurrent, low-latency storage service;
RDMA module: provides remote access to the data in the non-volatile memory storage module, operating on that data directly with RDMA reads and writes, and reduces the access latency of the storage system by bypassing the Ethernet protocol stack of the operating system kernel;
Data redundancy backup module: makes redundant backups of the data in the storage system, synchronizing data and metadata to physically isolated backup nodes, and keeps the data on master and slave nodes consistent by write-ahead logging the modified data.
Preferably, managing the server nodes and clients in the storage system includes: managing the joining and removal of server nodes and the joining and departure of clients, and updating and maintaining the index information of the storage system.
Preferably, managing the data stored in non-volatile memory includes:
Data sharding: the data stored on a single server node is divided into several tablets; each tablet corresponds to one contiguous, non-overlapping range of the key-value hash space, and each tablet is accessed by its own dedicated thread;
Multithreaded execution: for every request arriving at a server node, the tablet it belongs to is computed from the hash value of its key, and the request is placed in that tablet's request queue; the tablet's thread always takes and executes the request at the head of the queue;
Hash-table-based index structure: each tablet contains an independent hash table that stores the keys of all key-value data in the tablet together with the hash values of those keys; key collisions are resolved by separate chaining; the hash table supports lookup, insertion, and deletion of the data in the tablet;
Non-volatile memory allocator: the storage space of each tablet is managed by an independent non-volatile memory allocator, which allocates space when key-value data is inserted and frees it when key-value data is deleted.
Preferably, providing remote access to the data in the non-volatile memory storage module includes:
Remote procedure call interface: built on the RDMA primitives provided by InfiniBand, a remote procedure call interface is offered to the upper layer; the client initiates an operation, the operation is transferred to the target server over the InfiniBand protocol, and the target server then executes it locally;
Communication semantics: message-oriented, unreliable-transport channel primitives are used, which support one-to-many communication and give the database scalability;
User-space protocol layer: using the primitive library provided by InfiniBand, user-space programs access the InfiniBand NIC directly to send and receive data, bypassing the network protocol layers of the operating system kernel.
Preferably, the remote procedure call interface also includes an Add operation, an extension to the storage system interface, which checks whether the supplied key-value data already exists in the database.
Preferably, the data redundancy backup module includes:
Collecting all modifications of a single request: data is backed up by synchronizing all modifications made by one request to the backup nodes; every write is tracked through the non-volatile memory allocator, and all written data and the corresponding offset addresses are recorded in a dedicated send buffer;
Backup data log: backup data is write-ahead logged, that is, all data to be backed up is first written to a log area in non-volatile memory; when a server restarts after a crash, it first reads the data in the log and performs the corresponding writes, so as to become consistent with the other replicas;
Connection-oriented RDMA write: data transfer between the master node and the slave nodes uses connection-oriented RDMA writes.
Compared with the prior art, the present invention has the following beneficial effect:
Based on non-volatile memory and RDMA, the invention provides a key-value storage interface to clients and delivers a low-latency data storage service.
Brief description of the drawings
Other features, objects, and advantages of the invention will become more apparent from the detailed description of non-limiting embodiments given with reference to the following drawings:
Fig. 1 is an architecture diagram of the scheduler module and the cluster;
Fig. 2 is a schematic diagram of the free lists;
Fig. 3 is a diagram of the working principle of a tablet;
Fig. 4 is a schematic diagram of master-slave replication.
Detailed description of the embodiments
The invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the invention, but do not limit it in any way. It should be noted that those of ordinary skill in the art can make changes and improvements without departing from the inventive concept; these all fall within the scope of protection of the invention.
The low-latency distributed storage system provided by the invention comprises a scheduler module, a non-volatile memory storage module, an RDMA module, and a data redundancy backup module.
Scheduler module: manages the server nodes and clients in the storage system, including the joining and removal of server nodes and the joining and departure of clients. It also stores the index information of the storage system, updates and maintains it, and directs each client to the corresponding server node. It involves the following steps:
Scheduler startup: the scheduler is the first component of the cluster to start; it initializes the cluster configuration, prepares the index data structure, and waits on a fixed port for join requests from all server nodes.
Scheduler adds a server node: 1. verify the legitimacy of the server node and assign it a globally unique ID; 2. update the index data structure to include the server node.
Scheduler broadcasts the index information: once all server nodes of the cluster have joined, the scheduler finishes updating the index information and broadcasts it to all server nodes.
Scheduler accepts client requests: 1. the scheduler waits on a fixed port for client connection requests; 2. after a client connects, the scheduler sends it the cluster's index information.
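The scheduler steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `Scheduler`, its methods, the `is_valid` check, and the `send` callback are hypothetical, and networking is replaced by plain function calls.

```python
import itertools

class Scheduler:
    """Toy model of the scheduler's join and broadcast steps (hypothetical API)."""

    def __init__(self, expected_servers):
        self.expected = expected_servers   # number of servers the cluster waits for
        self.servers = {}                  # index structure: server ID -> address
        self._ids = itertools.count(1)     # source of globally unique IDs
        self.index_ready = False

    def add_server(self, address, is_valid=lambda addr: True):
        # Step 1: verify the node and assign it a unique ID.
        if not is_valid(address):
            raise PermissionError(address)
        server_id = next(self._ids)
        # Step 2: update the index data structure.
        self.servers[server_id] = address
        if len(self.servers) == self.expected:
            self.index_ready = True
        return server_id

    def broadcast_index(self, send):
        # Only broadcast once every expected server has joined.
        if not self.index_ready:
            raise RuntimeError("cluster is still waiting for servers")
        for address in self.servers.values():
            send(address, dict(self.servers))

    def connect_client(self):
        # A connecting client simply receives a copy of the index.
        if not self.index_ready:
            raise RuntimeError("cluster is not initialized")
        return dict(self.servers)
```

A real scheduler would listen on a port and authenticate nodes; the sketch keeps only the ordering of the steps.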
Non-volatile memory storage module: manages the data stored in non-volatile memory, including querying, modifying, writing, and deleting data. Through data sharding it also provides clients with a concurrent, low-latency storage service. It comprises the following parts:
Data sharding: the data stored on a single server node is divided into several tablets; each tablet corresponds to one contiguous, non-overlapping range of the key-value hash space, and each tablet is accessed by its own dedicated thread;
Multithreaded execution: for every request arriving at a server node, the tablet it belongs to is computed from the hash value of its key, and the request is placed in that tablet's request queue; the tablet's thread always takes and executes the request at the head of the queue;
Hash-table-based index structure: each tablet contains an independent hash table that stores the keys of all key-value data in the tablet together with the hash values of those keys; key collisions are resolved by separate chaining; the hash table supports lookup, insertion, and deletion of the data in the tablet;
Non-volatile memory allocator: the storage space of each tablet is managed by an independent non-volatile memory allocator, which allocates space when key-value data is inserted and frees it when key-value data is deleted.
RDMA module: provides remote access to the data in the non-volatile memory storage module, including an optimized communication semantics and a remote procedure call interface. It operates on the data in the non-volatile memory storage module directly with RDMA reads and writes, and it reduces the access latency of the storage system by bypassing the Ethernet protocol stack of the operating system kernel. It comprises the following parts:
Remote procedure call interface: built on the RDMA primitives provided by InfiniBand, a remote procedure call interface is offered to the upper layer, chiefly the Put, Get, and Delete operations that insert, read, and delete key-value data. The client initiates an operation, the operation is transferred to the target server over the InfiniBand protocol, and the target server then executes it locally. The interface also includes an Add operation, an extension to the storage system interface, which checks whether the supplied key-value data already exists in the database;
Communication semantics: InfiniBand supports two kinds of primitives, memory primitives and channel primitives. Memory primitives comprise RDMA read and RDMA write; channel primitives comprise sending a message to, or receiving a message from, a specific peer. InfiniBand also offers two transports: connection-oriented reliable transport and message-oriented unreliable transport. The invention uses message-oriented, unreliable-transport channel primitives, which support one-to-many communication and give the database scalability;
User-space protocol layer: using the primitive library provided by InfiniBand, user-space programs access the InfiniBand NIC directly to send and receive data, bypassing the network protocol layers of the operating system kernel.
Data redundancy backup module: makes redundant backups of the data in the storage system, synchronizing data and metadata to physically isolated backup nodes, and keeps the data on master and slave nodes consistent by write-ahead logging the modified data. It comprises the following parts:
Collecting all modifications of a single request: data is backed up by synchronizing all modifications made by one request to the backup nodes; every write is tracked through the non-volatile memory allocator, and all written data and the corresponding offset addresses are recorded in a dedicated send buffer;
Backup data log: backup data is write-ahead logged, that is, all data to be backed up is first written to a log area in non-volatile memory; when a server restarts after a crash, it first reads the data in the log and performs the corresponding writes, so as to become consistent with the other replicas;
Connection-oriented RDMA write: data transfer between the master node and the slave nodes uses connection-oriented RDMA writes, making full use of their low latency and high bandwidth.
The invention is implemented as follows:
Scheduler adds server nodes: the architecture of the scheduler and the cluster is shown in Fig. 1. After startup, the scheduler listens on TCP port 9090 and waits for servers to request to join the cluster. It first adds each requesting server to the server list. Once all servers have joined, the scheduler updates the cluster index list and assigns each server its hash-value range, so that key-value data falling into a given range is stored on the corresponding server node. Finally, the scheduler sends the index list to every server node.
Hash-value space: the invention distributes data evenly across the server nodes by hashing the keys of the key-value data. The hash-value space is the set of 64-bit unsigned integers. It is partitioned according to the number of servers in the cluster, and each server that joins the cluster is assigned one partition. This assignment is recorded in the cluster's index structure. When a client accesses the cluster, it uses the recorded assignment to find the server that stores the data.
Constructing and broadcasting the index structure: as servers join the cluster, the scheduler assigns them equal-sized hash ranges in the order in which they join and records this information in the index structure. When all servers have joined, the scheduler broadcasts the index structure to all of them. After the cluster has been initialized, the scheduler also sends the index information to every client that connects.
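The partitioning of the 64-bit hash-value space and the client-side lookup can be illustrated as follows. This is a minimal sketch: the function names are hypothetical, and BLAKE2b is used only as a stand-in, since the text does not name a hash function.

```python
import hashlib

HASH_SPACE = 1 << 64  # 64-bit unsigned hash-value space

def build_index(servers):
    """Assign equal-sized, contiguous hash ranges to servers in join order."""
    n = len(servers)
    step = HASH_SPACE // n
    index = []
    for i, address in enumerate(servers):
        lo = i * step
        # The last range absorbs the remainder when 2**64 is not divisible by n.
        hi = HASH_SPACE if i == n - 1 else (i + 1) * step
        index.append((lo, hi, address))
    return index

def hash64(key: bytes) -> int:
    # Assumption: any uniform 64-bit hash would do here.
    return int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")

def lookup(index, key: bytes):
    """Return the address of the server whose range contains hash64(key)."""
    h = hash64(key)
    for lo, hi, address in index:
        if lo <= h < hi:
            return address
    raise KeyError(key)  # unreachable while the ranges cover the whole space
```

For example, `lookup(build_index(["s0", "s1", "s2"]), b"user:42")` returns one of the three addresses, and the same key always maps to the same server as long as the index is unchanged.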
Server initialization: after a server starts, it first initializes its non-volatile memory storage region, including initializing the tablets, allocating local storage space, and initializing the local hash tables. After local initialization the server enters the ready state and joins the cluster. The index structure information sent by the scheduler is stored by the server in its local non-volatile memory.
Client joins the cluster: a client accesses the cluster through the library provided by the invention. The client first connects to the scheduler, which sends it the fully constructed index structure. The client caches this index structure locally and uses it in subsequent requests to find the corresponding server. Once the index structure has been received, the client can disconnect from the scheduler.
Client issues a query: when a client asks the cluster for the value corresponding to a given key, it first uses the cached index structure to find the server that stores the key. This lookup has three steps: 1. compute the 64-bit hash value of the key; 2. determine the hash range into which that value falls; 3. obtain the address of the server that owns that range. The client then sends the request to that server over InfiniBand as an unreliable datagram. Sending has three steps: 1. allocate a send buffer for the request in non-volatile memory; 2. copy the request into the send buffer; 3. post the send buffer and the send request to the InfiniBand send queue. Finally, once the request has been sent, the client waits for the server's reply. It first allocates the receive buffer needed for the reply. When the reply to the query arrives, the client copies the value to the specified address space or reports an error that the key does not exist, and then releases the receive buffer.
Allocating and releasing send and receive buffers: all send and receive buffers are allocated in advance. The invention maintains one free-buffer queue for send buffers and one for receive buffers. For each request, the client takes a send buffer from the send free queue. When the request has been sent, the send buffer is returned to the send free queue and a receive buffer is taken from the receive free queue. When the reply has been fully received, the receive buffer is returned to the receive free queue.
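The buffer lifecycle described above can be modeled with two pre-allocated pools. The following is a sketch under assumptions: `BufferPool`, `do_request`, and the `send` and `recv` callbacks are hypothetical names, and the InfiniBand queues are replaced by ordinary Python callables.

```python
from collections import deque

class BufferPool:
    """Fixed set of pre-allocated buffers kept in a free queue."""

    def __init__(self, count, size):
        self.free = deque(bytearray(size) for _ in range(count))

    def acquire(self):
        if not self.free:
            raise RuntimeError("no free buffer")
        return self.free.popleft()

    def release(self, buf):
        self.free.append(buf)

def _fill(buf, data):
    # Refuse oversized data so the pre-allocated buffer never grows.
    if len(data) > len(buf):
        raise ValueError("message larger than buffer")
    buf[:len(data)] = data
    return len(data)

def do_request(send_pool, recv_pool, payload, send, recv):
    send_buf = send_pool.acquire()          # take a send buffer
    try:
        n = _fill(send_buf, payload)        # copy the request into it
        send(bytes(send_buf[:n]))           # hand it to the transport
    finally:
        send_pool.release(send_buf)         # return it once the request is sent
    recv_buf = recv_pool.acquire()          # take a receive buffer for the reply
    try:
        n = _fill(recv_buf, recv())
        return bytes(recv_buf[:n])          # copy the reply out to the caller
    finally:
        recv_pool.release(recv_buf)         # return it once the reply is consumed
```

Because buffers are only moved between the free queue and the caller, no allocation happens on the request path, which is the point of pre-allocating them.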
Server receives a client request: the server runs a dedicated thread that polls for client requests. When the polling thread receives a request, the server first checks whether the hash value of the request's key lies in the hash range assigned to this server. If not, it rejects the request and returns an error message to the client. Otherwise, it uses the hash value of the key to find the corresponding local tablet, appends the request to the tail of that tablet's request queue, and wakes the tablet's worker thread to execute the request. After executing the request, the worker thread replies to the client with the result or with an error message.
Server looks up the tablet: the server further subdivides the hash range assigned to it by the scheduler and assigns each sub-range to a local tablet. When the server receives a request, it uses this local assignment to find the tablet that corresponds to the key of the request.
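The range check and tablet lookup on the server side can be sketched as below. The class `Server`, its fields, and the equal-width subdivision are assumptions made for illustration; the text only states that the server subdivides its range among local tablets.

```python
from collections import deque

class Server:
    """Routes a request to the tablet that owns the key's hash sub-range."""

    def __init__(self, lo, hi, num_tablets):
        self.lo, self.hi = lo, hi                      # range assigned by the scheduler
        # Ceiling division so every hash in [lo, hi) maps to a valid tablet.
        self.step = -(-(hi - lo) // num_tablets)
        self.queues = [deque() for _ in range(num_tablets)]

    def dispatch(self, key_hash, request):
        # Reject keys that fall outside this server's range.
        if not (self.lo <= key_hash < self.hi):
            raise ValueError("key does not belong to this server")
        tablet_id = (key_hash - self.lo) // self.step
        self.queues[tablet_id].append(request)         # enqueue at the tail
        return tablet_id
```

With `Server(100, 200, 4)`, hash 125 lands in tablet 1 and hash 200 is rejected, because the upper bound of the range is exclusive.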
Tablet: each tablet on a server has its own hash table, 64 MB of non-volatile memory space, request queue, and worker thread.
Hash table in a tablet: the hash table has size 1000003, that is, 1000003 slots. The storage addresses of the keys mapped to this tablet are placed in these slots. When different keys map to the same slot, the invention resolves the collision by separate chaining: the address of a key that maps to an occupied slot is linked through the next pointer of the key already there. The hash table is located at the end of the tablet.
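A hash table with separate chaining, as described above, might look like the following sketch. The names `Entry` and `ChainedHashTable` are hypothetical, and the table holds Python objects rather than addresses in non-volatile memory.

```python
class Entry:
    """One chain node: a key, its stored address (or value), and a next pointer."""
    __slots__ = ("key", "value", "next")

    def __init__(self, key, value, next=None):
        self.key, self.value, self.next = key, value, next

class ChainedHashTable:
    def __init__(self, num_slots=1_000_003):   # slot count taken from the text
        self.slots = [None] * num_slots

    def _slot(self, key_hash):
        return key_hash % len(self.slots)

    def get(self, key_hash, key):
        e = self.slots[self._slot(key_hash)]
        while e is not None:
            if e.key == key:
                return e.value
            e = e.next
        return None

    def put(self, key_hash, key, value):
        i = self._slot(key_hash)
        e = self.slots[i]
        if e is None:
            self.slots[i] = Entry(key, value)
            return
        while True:
            if e.key == key:                    # overwrite an existing key
                e.value = value
                return
            if e.next is None:                  # colliding key: link via next pointer
                e.next = Entry(key, value)
                return
            e = e.next

    def delete(self, key_hash, key):
        i = self._slot(key_hash)
        prev, e = None, self.slots[i]
        while e is not None:
            if e.key == key:
                if prev is None:
                    self.slots[i] = e.next
                else:
                    prev.next = e.next
                return True
            prev, e = e, e.next
        return False
```

A prime slot count such as 1000003 is a common choice because it spreads keys more evenly when the slot is computed with a modulo.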
Storage space and memory allocation of a tablet: as shown in Fig. 2, the storage space of a tablet, apart from the space occupied by the hash table, is used to store key-value data. The invention manages this space with a custom non-volatile memory allocator. The allocator maintains one free-block list for each block size from 16 bytes to 1024 bytes in increments of 16 bytes. For an allocation request, the allocator first rounds the requested size up to the smallest block size not less than it and then looks for a free block in the corresponding free list. If none exists, it searches the free list of the next larger block size, and so on, until it finds a free block not smaller than the requested size. Suppose the requested block size is N and the free block found has size M. The block of size M is split into a block of size N and a block of size M-N. The block of size N is returned to the caller, and the block of size M-N is added to the free list for its size. Conversely, when a block of size N is freed, the allocator first checks whether a block adjacent to it is also free. If so, it removes that neighbor from its free list and merges it with the current block into a larger block of size M. This repeats until no more free blocks can be merged. Finally, the merged free block is added to the corresponding free list.
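The split-on-allocate and merge-on-free behavior can be sketched with offsets instead of real memory. This is a simplified model under assumptions: the class `Allocator` and its bookkeeping dictionaries are hypothetical, and free blocks larger than 1024 bytes (such as the initial region) are kept in extra lists keyed by their exact size, a detail the text does not specify.

```python
class Allocator:
    UNIT = 16        # size-class granularity from the text
    MAX_REQ = 1024   # largest size class from the text

    def __init__(self, capacity):
        self.free_lists = {}   # block size -> set of start offsets
        self.free_start = {}   # start offset -> size (free blocks only)
        self.free_end = {}     # end offset -> start offset (free blocks only)
        self.allocated = {}    # start offset -> size
        self._add_free(0, capacity)

    def _add_free(self, off, size):
        self.free_lists.setdefault(size, set()).add(off)
        self.free_start[off] = size
        self.free_end[off + size] = off

    def _remove_free(self, off):
        size = self.free_start.pop(off)
        del self.free_end[off + size]
        self.free_lists[size].discard(off)
        return size

    def alloc(self, nbytes):
        if not 0 < nbytes <= self.MAX_REQ:
            raise ValueError("unsupported request size")
        n = -(-nbytes // self.UNIT) * self.UNIT          # round up to a size class
        fits = [s for s, offs in self.free_lists.items() if s >= n and offs]
        if not fits:
            raise MemoryError("tablet space exhausted")
        m = min(fits)                                    # smallest block that fits
        off = next(iter(self.free_lists[m]))
        self._remove_free(off)
        if m > n:
            self._add_free(off + n, m - n)               # split: keep N, free M-N
        self.allocated[off] = n
        return off

    def free(self, off):
        size = self.allocated.pop(off)
        if off + size in self.free_start:                # merge with the next block
            size += self._remove_free(off + size)
        if off in self.free_end:                         # merge with the previous block
            prev = self.free_end[off]
            size += self._remove_free(prev)
            off = prev
        self._add_free(off, size)
```

Because every free merges eagerly, two neighboring free blocks never coexist, so one check on each side is enough here; the repeated merging described in the text collapses to a single pass.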
Request queue and worker thread of a tablet: as shown in Fig. 3, each tablet has its own request queue, and every request mapped to the tablet is appended to the tail of that queue. When a request is added, the tablet's worker thread is woken and takes a pending request from the head of the queue. When the request has been executed, the worker thread sends the reply to the client and checks whether the request queue is empty. If it is not empty, the worker thread again takes a request from the head and executes it. Otherwise the worker thread keeps polling the head of the queue for 20 µs. If a request is added during that time, the worker thread executes it; otherwise the worker thread goes to sleep.
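The poll-then-sleep behavior of the worker thread can be sketched with a condition variable. The class `TabletWorker`, the `handler` and `reply` callbacks, and the `stop` method are hypothetical additions for illustration; Python timing is far too coarse for a faithful 20 µs window, so the constant only mirrors the text.

```python
import threading
import time
from collections import deque

class TabletWorker:
    POLL_SECONDS = 20e-6   # busy-poll window before sleeping, from the text

    def __init__(self, handler):
        self.queue = deque()
        self.cv = threading.Condition()
        self.handler = handler
        self.running = True
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def submit(self, request, reply):
        with self.cv:
            self.queue.append((request, reply))   # enqueue at the tail
            self.cv.notify()                      # wake the worker if it sleeps

    def stop(self):
        with self.cv:
            self.running = False
            self.cv.notify()
        self.thread.join()

    def _take(self):
        try:
            return self.queue.popleft()           # head of the queue
        except IndexError:
            return None

    def _run(self):
        while True:
            item = self._take()
            if item is None:
                # Queue empty: keep polling the head for a short while.
                deadline = time.perf_counter() + self.POLL_SECONDS
                while item is None and time.perf_counter() < deadline:
                    item = self._take()
            if item is None:
                # Still nothing: sleep until a request arrives or we are stopped.
                with self.cv:
                    while self.running and not self.queue:
                        self.cv.wait()
                    if not self.queue:
                        return
                continue
            request, reply = item
            reply(self.handler(request))          # execute, then answer the client
```

The short polling window trades a little CPU time for lower latency when requests arrive back to back, since waking a sleeping thread is comparatively slow.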
Worker thread executes a request: to execute a query, the tablet's worker thread 1. looks up the specified key in the hash table; 2. if the key exists, returns the corresponding value, and otherwise returns a key-not-found error. To execute an insert: 1. look up the specified key in the hash table; 2. if the key does not exist, call the memory allocator to allocate enough space in the tablet's storage for the key-value data to be inserted; if the key exists, then a. if the existing key-value data occupies more space than the new data, reuse the existing storage and free the excess; b. if the existing key-value data occupies less space than the new data, free the existing storage and have the allocator allocate a larger block for the new data. To execute a delete: 1. look up the specified key in the hash table; 2. if the key does not exist, the request is complete; otherwise free the storage occupied by the key and remove its address from the hash table.
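The three request types can be sketched on top of a stand-in allocator. Everything here is illustrative: `ToyAllocator` only tracks block sizes, a Python `dict` stands in for the hash table, and treating an equal-sized update as a reuse is an assumption, since the text covers only the larger and smaller cases.

```python
class ToyAllocator:
    """Stand-in allocator that only tracks block sizes (hypothetical interface)."""

    def __init__(self):
        self.next_off = 0
        self.live = {}                 # offset -> allocated size

    def alloc(self, size):
        off = self.next_off
        self.next_off += size
        self.live[off] = size
        return off

    def free(self, off):
        del self.live[off]

    def shrink(self, off, new_size):
        self.live[off] = new_size      # give back the unused tail of the block

class Tablet:
    def __init__(self):
        self.index = {}                # key -> (offset, size); stands in for the hash table
        self.store = {}                # offset -> value; stands in for NVM contents
        self.alloc = ToyAllocator()

    def get(self, key):
        if key not in self.index:
            raise KeyError(key)        # key-not-found error
        return self.store[self.index[key][0]]

    def put(self, key, value):
        need = len(key) + len(value)
        if key in self.index:
            off, size = self.index[key]
            if size >= need:
                if size > need:
                    self.alloc.shrink(off, need)   # reuse, free the excess
            else:
                self.alloc.free(off)               # too small: free and reallocate
                del self.store[off]
                off = self.alloc.alloc(need)
        else:
            off = self.alloc.alloc(need)
        self.index[key] = (off, need)
        self.store[off] = value

    def delete(self, key):
        entry = self.index.pop(key, None)
        if entry is None:
            return False               # nothing to do
        self.alloc.free(entry[0])
        del self.store[entry[0]]
        return True
```

Reusing the existing block when the new data fits avoids an allocator round trip and keeps the key's address stable.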
Data replication between servers: to ensure that the cluster can continue to serve clients when a single server crashes, the invention keeps a copy of every piece of data on three physically isolated servers. When one of the servers crashes or becomes unreachable, the remaining servers can still serve client requests. This requires the data on the three servers to stay consistent. The invention achieves this by synchronizing all local modifications on a server to its backup servers.
Master-slave replication: as shown in Fig. 4, every piece of data in the cluster is stored on three servers. One of them is the master node and the other two are slave nodes. All client requests are sent to the master node. While the master node executes a client request, any modifications to its local storage are recorded. When the master node has finished executing the request, it first synchronizes these modifications to the two slave nodes and waits for their replies. After receiving replies from both slave nodes, the master node returns the result of the request to the client.
Master node collects all modifications of a request: a query does not modify the master node's local data, so no synchronization between master and slaves is needed. For an insert or delete, every modification to the master node's local non-volatile memory storage is recorded by the purpose-built memory allocator of the invention. These modifications are all stored in a send buffer.
The master node synchronizes all modifications of a single request: After the master node finishes executing a client request, it simultaneously sends the data in the modification-information buffer to the two slave nodes and specifies the address in each slave node's log at which the data is to be stored. The master node then enters a state of waiting for replies from the two slave nodes. If and only if both slave nodes reply normally does the master node reply normally to the client. Otherwise, the master node returns the corresponding error to the client.
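The "send to both, succeed only if both reply" rule can be sketched as follows. The function name `sync_to_slaves`, the callable-per-slave interface, and the one-second timeout are assumptions made for illustration; a thread pool stands in for the simultaneous transfers.

```python
from concurrent.futures import ThreadPoolExecutor


def sync_to_slaves(mods, slaves, log_address, timeout=1.0):
    """Send mods to every slave at once; return "ok" only if all reply normally.

    Each entry of `slaves` is a hypothetical callable
    write(log_address, mods) -> bool that stores the records at the
    given log address and returns True on success.
    """
    with ThreadPoolExecutor(max_workers=len(slaves)) as pool:
        futures = [pool.submit(write, log_address, mods) for write in slaves]
        replies = []
        for future in futures:
            try:
                replies.append(bool(future.result(timeout=timeout)))
            except Exception:
                # A timeout or a transfer error counts as a missing reply.
                replies.append(False)
    return "ok" if all(replies) else "error"
```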
Modification-information log area on the slave nodes: On a slave node, the region that receives and stores the modification information synchronized from the master node is called the NMLOG area. Each tablet of a slave node contains two NMLOGs. Modification information from the master node is first written sequentially to NMLOG 1. When NMLOG 1 is full, the background synchronization thread is woken up, and the modification information stored in NMLOG 1 is synchronized to the corresponding positions in the slave node's corresponding tablet. Meanwhile, NMLOG 2 is set to receive modification information, and modification information subsequently synchronized from the master node is stored in NMLOG 2.
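The alternation between the two NMLOGs is a form of double buffering, which the following sketch illustrates. It makes several simplifying assumptions: the class `SlaveTablet` is hypothetical, log capacity is counted in records rather than bytes, and the full log is applied inline instead of by a background thread.

```python
class SlaveTablet:
    """Hypothetical slave-side tablet with two alternating NMLOGs."""

    def __init__(self, size, log_capacity):
        self.memory = bytearray(size)  # stand-in for the tablet's storage
        self.logs = [[], []]           # NMLOG 1 and NMLOG 2
        self.active = 0                # index of the log receiving records
        self.log_capacity = log_capacity

    def append(self, offset, payload):
        log = self.logs[self.active]
        log.append((offset, bytes(payload)))
        if len(log) >= self.log_capacity:
            full = self.active
            # Switch the receiving log first, then apply the full one.
            self.active = 1 - self.active
            self.apply(full)

    def apply(self, index):
        """Copy each logged write to its position in the tablet."""
        for offset, payload in self.logs[index]:
            self.memory[offset:offset + len(payload)] = payload
        self.logs[index].clear()
```

Switching the active log before applying the full one is what would let new records keep arriving while the older log is being applied in a real, concurrent implementation.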
Master node crashes during synchronization: If the master node crashes while synchronizing modification information, one of the slave nodes is elected as the new master node. The new master node first applies the modification information in its two NMLOGs to its local non-volatile memory storage space. Incomplete modification information is discarded. In this way, the new master node reaches a state approximately consistent with that of the original master node (excluding the possibly incomplete last modification). Meanwhile, the cluster allocates a new slave node to the new master node, so that it always has two slave nodes.
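The replay performed by a newly elected master can be sketched as below. The text does not say how an incomplete record is detected, so this sketch assumes each record carries a declared length that can be compared with the bytes actually received; the function name `promote` and the record layout are hypothetical.

```python
def promote(memory, nmlogs):
    """Replay both NMLOGs into local storage, skipping incomplete records.

    `nmlogs` is a list of logs, oldest first. Each record is a tuple
    (offset, declared_length, payload); a payload shorter than its
    declared length is treated as incomplete (an assumption).
    Returns the number of records applied.
    """
    applied = 0
    for log in nmlogs:
        for offset, length, payload in log:
            if len(payload) != length:
                continue  # discard the incomplete modification
            memory[offset:offset + length] = payload
            applied += 1
    return applied
```

For example, replaying `[[(0, 2, b"ab")], [(2, 2, b"cd"), (4, 2, b"e")]]` into a six-byte region applies the first two records and drops the truncated third one.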
Slave node crashes during synchronization: If a slave node crashes during synchronization and can recover on its own, the master node's data is synchronized to that slave node. If it cannot recover, the cluster allocates a new slave node to the master node and synchronizes the data to the new slave node.
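The two recovery paths for a failed slave can be summarized in a short sketch. All names here (`repair_replica_set`, the `replicas` mapping, the `spares` list) are hypothetical, and copying a dictionary stands in for the data synchronization.

```python
def repair_replica_set(master_data, replicas, failed, recovered, spares):
    """Restore a full set of slaves after one of them crashes.

    `replicas` maps slave names to their data. If the failed slave
    recovered on its own, it is resynchronized in place; otherwise it is
    replaced by a node taken from `spares` (assumed to be non-empty).
    Returns the name of the slave that now holds the synchronized copy.
    """
    target = failed
    if not recovered:
        del replicas[failed]
        target = spares.pop()
    # Synchronize the master's data to the recovered or replacement slave.
    replicas[target] = dict(master_data)
    return target
```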
Those skilled in the art know that, in addition to implementing the system provided by the present invention and its devices, modules, and units purely as computer-readable program code, the method steps can be logically programmed so that the system and its devices, modules, and units achieve the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Therefore, the system provided by the present invention and its devices, modules, and units can be regarded as a hardware component, and the devices, modules, and units included in it for implementing various functions can also be regarded as structures within the hardware component; the devices, modules, and units for implementing various functions can also be regarded both as software modules implementing the method and as structures within the hardware component.
Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the specific embodiments described above; those skilled in the art can make various changes or modifications within the scope of the claims without affecting the substance of the present invention. Where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with one another arbitrarily.
Claims (6)
- 1. A low-latency distributed storage system, characterized by comprising:
  Scheduling module: manages the server nodes and clients in the storage system, stores the index information of the storage system, and directs each client to the corresponding server node;
  Non-volatile memory storage module: manages the data stored in non-volatile memory and provides concurrent, low-latency storage service to clients;
  RDMA module: provides the ability to remotely access the data in the non-volatile memory storage module, operates directly on the data in the non-volatile memory storage module using remote direct memory read and write functions, and reduces the access latency of the storage system by bypassing the Ethernet protocol stack of the operating system kernel;
  Data redundancy backup module: makes redundant backups of the data in the storage system, including synchronizing data and metadata to physically isolated backup nodes and ensuring the consistency of master and slave node data, wherein the data redundancy backup module writes a log of the modified data in order to maintain the consistency of master and slave node data.
- 2. The low-latency distributed storage system according to claim 1, characterized in that managing the server nodes and clients in the storage system comprises: managing the addition and removal of server nodes and the joining and leaving of clients, and updating and maintaining the index information of the storage system.
- 3. The low-latency distributed storage system according to claim 1, characterized in that managing the data stored in non-volatile memory comprises:
  Data sharding: the data stored on a single server node is divided into several tablets, each tablet corresponds to one contiguous, non-overlapping range of the key-value data hash space, and each tablet is accessed by its own independent thread;
  Multithreaded execution: for every request to a server node, the tablet it belongs to is computed from its hash value and the request is placed in that tablet's request queue; the thread corresponding to the tablet always takes and executes the request at the head of the request queue;
  Hash-table-based index structure: each tablet contains an independent hash table that stores all key-value data in the tablet together with the hash values of the keys; conflicting keys are resolved by separate chaining, and the hash table provides operations to query, insert, and delete the data in the tablet;
  Non-volatile memory allocator: the storage space of each tablet is managed by an independent non-volatile memory allocator; when key-value data is inserted, the non-volatile memory allocator allocates the corresponding space, and when key-value data is deleted, the non-volatile memory allocator releases the corresponding space.
- 4. The low-latency distributed storage system according to claim 1, characterized in that providing the ability to remotely access the data in the non-volatile memory storage module comprises:
  Remote procedure call interface: a remote procedure call interface is provided to the upper layer on top of the remote direct memory access primitives provided by InfiniBand; an operation is initiated by the client, transferred to the target server over the InfiniBand protocol, and then executed locally by the target server;
  Communication semantic model: message-oriented, non-transmitting channel-type primitives are used, which support one-to-many communication and provide scalability for the database;
  User-space protocol layer: the primitive library provided by InfiniBand is used to bypass the network protocol layer of the operating system kernel, so that user-space programs access the InfiniBand network card directly to send and receive data.
- 5. The low-latency distributed storage system according to claim 4, characterized in that the remote procedure call interface further comprises an Add operation, provided as an extension to the storage system, for detecting whether the supplied key-value data already exists in the database.
- 6. The low-latency distributed storage system according to claim 1, characterized in that the data redundancy backup module comprises:
  Collecting all modifications of a single request: the data is backed up by synchronizing all modifications of a request to the backup nodes; every write is tracked by the non-volatile memory allocator, and all written data and the corresponding offset addresses are recorded in a dedicated send buffer;
  Backup data log: a log is written for the backup data, with all data to be backed up written in advance to a log area in non-volatile memory; when a server restarts after a crash, it first reads the data in the log and performs the corresponding write operations in order to reach consistency with the other data replicas;
  Connection-oriented remote direct memory write: connection-oriented remote direct memory write technology is used for data transfer between the master node and the slave nodes.
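As an informal illustration of the data sharding and multithreaded execution described in claim 3, the sketch below routes each request to a tablet by key hash and lets one worker per tablet drain its queue. Everything in it is an assumption made for illustration: the `Server` class, the use of SHA-256 truncated to a 32-bit hash space, and a Python `dict` standing in for the chained hash table.

```python
import hashlib
from collections import deque

HASH_SPACE = 2 ** 32  # assumed size of the key hash space


class Server:
    """Hypothetical server node split into equally sized tablets."""

    def __init__(self, num_tablets):
        self.num_tablets = num_tablets
        self.queues = [deque() for _ in range(num_tablets)]  # one queue per tablet
        self.tables = [{} for _ in range(num_tablets)]       # one index per tablet

    def tablet_of(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        h = int.from_bytes(digest[:4], "big")
        # Each tablet owns one contiguous, non-overlapping hash range.
        return h * self.num_tablets // HASH_SPACE

    def submit(self, op, key, value=None):
        self.queues[self.tablet_of(key)].append((op, key, value))

    def run_tablet(self, index):
        """Body of the tablet's worker: execute requests from the queue head."""
        queue, table = self.queues[index], self.tables[index]
        results = []
        while queue:
            op, key, value = queue.popleft()
            if op == "put":
                table[key] = value
                results.append(True)
            elif op == "get":
                results.append(table.get(key))
            elif op == "delete":
                results.append(table.pop(key, None) is not None)
        return results
```

Because every key maps to exactly one tablet and each tablet has a single worker, requests for the same key are executed in queue order without locking between tablets.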
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710941988.5A CN107888657B (en) | 2017-10-11 | 2017-10-11 | Low latency distributed storage system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710941988.5A CN107888657B (en) | 2017-10-11 | 2017-10-11 | Low latency distributed storage system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107888657A (en) | 2018-04-06 |
CN107888657B (en) | 2020-11-06 |
Family
ID=61781297
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710941988.5A (Active, CN107888657B) | Low latency distributed storage system | 2017-10-11 | 2017-10-11 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107888657B (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102084332A (en) * | 2008-04-06 | 2011-06-01 | 弗森-艾奥公司 | Apparatus, system, and method for converting a storage request into an append data storage command |
CN101577716A (en) * | 2009-06-10 | 2009-11-11 | 中国科学院计算技术研究所 | Distributed storage method and system based on InfiniBand network |
CN104364756A (en) * | 2012-07-11 | 2015-02-18 | 英特尔公司 | Parallel processing of a single data buffer |
US9116819B2 (en) * | 2012-10-17 | 2015-08-25 | Datadirect Networks, Inc. | Reducing metadata in a write-anywhere storage system |
CN104750658A (en) * | 2013-12-27 | 2015-07-01 | 英特尔公司 | Assisted Coherent Shared Memory |
CN106372013A (en) * | 2015-07-24 | 2017-02-01 | 华为技术有限公司 | Remote memory access method, apparatus and system |
CN105404546A (en) * | 2015-11-10 | 2016-03-16 | 上海交通大学 | RDMA and HTM based distributed concurrency control method |
CN105681402A (en) * | 2015-11-25 | 2016-06-15 | 北京文云易迅科技有限公司 | Distributed high speed database integration system based on PCIe flash memory card |
Non-Patent Citations (2)
Title |
---|
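YOUYOU LU et al.: "Octopus: an RDMA-enabled Distributed Persistent Memory File System", "PROCEEDINGS OF THE 2017 USENIX ANNUAL TECHNICAL CONFERENCE" * |
SHU Jiwu et al.: "Research Progress on Storage System Technologies Based on Non-Volatile Memory" (基于非易失性存储器的存储系统技术研究进展), "Science & Technology Review" (科技导报) * |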
Cited By (35)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020024590A1 (en) * | 2018-08-02 | 2020-02-06 | Memverge, Inc. | Persistent memory key-value store in a distributed memory architecture |
CN109491837A (en) * | 2018-11-01 | 2019-03-19 | 郑州云海信息技术有限公司 | A kind of the log fault-tolerance processing method and device of Nonvolatile memory reservoir |
CN109491809A (en) * | 2018-11-12 | 2019-03-19 | 西安微电子技术研究所 | A kind of communication means reducing high-speed bus delay |
CN109767247A (en) * | 2019-01-15 | 2019-05-17 | 武汉费米坊科技有限公司 | A kind of distribution commodity traceability system and source tracing method |
CN109714430A (en) * | 2019-01-16 | 2019-05-03 | 深圳壹账通智能科技有限公司 | Distributed caching method, device, computer system and storage medium |
CN110109889A (en) * | 2019-05-09 | 2019-08-09 | 重庆大学 | A kind of distributed memory file management system |
CN110298031A (en) * | 2019-05-28 | 2019-10-01 | 北京百度网讯科技有限公司 | A kind of Directory Service system and model version consistency allocator |
CN110298031B (en) * | 2019-05-28 | 2023-07-18 | 北京百度网讯科技有限公司 | Dictionary service system and model version consistency distribution method |
CN110262754A (en) * | 2019-06-14 | 2019-09-20 | 华东师范大学 | A kind of distributed memory system and lightweight synchronized communication method towards NVMe and RDMA |
CN110262754B (en) * | 2019-06-14 | 2022-10-04 | 华东师范大学 | NVMe and RDMA-oriented distributed storage system and lightweight synchronous communication method |
CN112099728A (en) * | 2019-06-18 | 2020-12-18 | 华为技术有限公司 | Method and device for executing write operation and read operation |
WO2021043124A1 (en) * | 2019-09-06 | 2021-03-11 | 程延辉 | Kbroker distributed operating system, storage medium, and electronic device |
CN112788082A (en) * | 2019-11-08 | 2021-05-11 | 内江市下一代互联网数据处理技术研究所 | High-availability memory caching system |
CN111049883A (en) * | 2019-11-15 | 2020-04-21 | 北京金山云网络技术有限公司 | Data reading method, device and system of distributed table system |
CN111049883B (en) * | 2019-11-15 | 2022-09-06 | 北京金山云网络技术有限公司 | Data reading method, device and system of distributed table system |
CN110968530A (en) * | 2019-11-19 | 2020-04-07 | 华中科技大学 | Key value storage system based on nonvolatile memory and memory access method |
CN110968530B (en) * | 2019-11-19 | 2021-12-03 | 华中科技大学 | Key value storage system based on nonvolatile memory and memory access method |
CN111078607A (en) * | 2019-12-24 | 2020-04-28 | 上海交通大学 | Method and system for deploying RDMA (remote direct memory Access) and non-volatile memory-oriented network access programming frame |
CN111078607B (en) * | 2019-12-24 | 2023-06-23 | 上海交通大学 | Network access programming framework deployment method and system for RDMA (remote direct memory access) and nonvolatile memory |
CN111400307A (en) * | 2020-02-20 | 2020-07-10 | 上海交通大学 | Persistent hash table access system supporting remote concurrent access |
CN111400307B (en) * | 2020-02-20 | 2023-06-23 | 上海交通大学 | Persistent hash table access system supporting remote concurrent access |
CN111400312A (en) * | 2020-02-25 | 2020-07-10 | 华南理工大学 | Edge storage database based on improved LSM tree |
CN111400312B (en) * | 2020-02-25 | 2023-04-28 | 华南理工大学 | Edge storage database based on improved LSM tree |
CN111368002A (en) * | 2020-03-05 | 2020-07-03 | 广东小天才科技有限公司 | Data processing method, system, computer equipment and storage medium |
CN111381780A (en) * | 2020-03-06 | 2020-07-07 | 西安奥卡云数据科技有限公司 | Efficient byte access storage system for persistent storage |
CN111459418B (en) * | 2020-05-15 | 2021-07-23 | 南京大学 | RDMA (remote direct memory Access) -based key value storage system transmission method |
CN111459418A (en) * | 2020-05-15 | 2020-07-28 | 南京大学 | RDMA (remote direct memory Access) -based key value storage system transmission method |
WO2022089607A1 (en) * | 2020-10-29 | 2022-05-05 | 第四范式(北京)技术有限公司 | Parameter server node recovery method and recovery system |
CN112667620A (en) * | 2020-12-31 | 2021-04-16 | 广州方硅信息技术有限公司 | Data processing method and device, computer equipment and storage medium |
WO2022252862A1 (en) * | 2021-06-02 | 2022-12-08 | 北京字节跳动网络技术有限公司 | Computing storage separation system and data access method therefor, medium, and electronic device |
CN113326155A (en) * | 2021-06-28 | 2021-08-31 | 深信服科技股份有限公司 | Information processing method, device, system and storage medium |
CN113326155B (en) * | 2021-06-28 | 2023-09-05 | 深信服科技股份有限公司 | Information processing method, device, system and storage medium |
CN114338274A (en) * | 2021-12-30 | 2022-04-12 | 上海交通大学 | Heterogeneous industrial field bus fusion method and system |
CN116257521A (en) * | 2023-01-18 | 2023-06-13 | 深存科技(无锡)有限公司 | KV storage method based on FPGA |
CN116257521B (en) * | 2023-01-18 | 2023-11-17 | 深存科技(无锡)有限公司 | KV storage method based on FPGA |
Also Published As
Publication number | Publication date |
---|---|
CN107888657B (en) | 2020-11-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107888657A (en) | Low latency distributed memory system | |
CN110113420B (en) | NVM-based distributed message queue management system | |
US5835908A (en) | Processing multiple database transactions in the same process to reduce process overhead and redundant retrieval from database servers | |
CN101013381B (en) | Distributed lock based on object memory system | |
CN101019105B (en) | Method and apparatus for data storage using striping | |
US8108634B1 (en) | Replicating a thin logical unit | |
US7783607B2 (en) | Decentralized record expiry | |
CN112084258A (en) | Data synchronization method and device | |
CN103116552A (en) | Method and device for distributing storage space in distributed type storage system | |
CN110555001B (en) | Data processing method, device, terminal and medium | |
CN113268472B (en) | Distributed data storage system and method | |
CN105426321A (en) | RDMA friendly caching method using remote position information | |
WO2020199760A1 (en) | Data storage method, memory and server | |
CN110597452A (en) | Data processing method and device of storage system, storage server and storage medium | |
CN112000287A (en) | IO request processing device, method, equipment and readable storage medium | |
CN108540510B (en) | Cloud host creation method and device and cloud service system | |
CN112988680B (en) | Data acceleration method, cache unit, electronic device and storage medium | |
US10284672B2 (en) | Network interface | |
CN106713470A (en) | Distributed cache updating method and cache updating system | |
JP2001184248A (en) | Data access management device in distributed processing system | |
CN107493309B (en) | File writing method and device in distributed system | |
CN101344882A (en) | Data query method, insertion method and deletion method | |
CN107659626B (en) | Temporary metadata oriented separation storage method | |
CN105320676A (en) | Customer data query service method and device | |
CN116955219B (en) | Data mirroring method, device, host and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||