CN100437522C - Long-distance inner server and its implementing method - Google Patents

Publication number
CN100437522C
Authority
CN
China
Prior art keywords
unit
long
memory
distance inner
communication
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CNB2005100983901A
Other languages
Chinese (zh)
Other versions
CN1928839A (en
Inventor
李磊
樊建平
陈明宇
曹政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XFusion Digital Technologies Co Ltd
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CNB2005100983901A
Publication of CN1928839A
Application granted
Publication of CN100437522C
Legal status: Active
Anticipated expiration

Abstract

The disclosed remote memory server is accessed by remote clients through a corresponding protocol. It comprises a communication unit, which provides a high-bandwidth, low-latency communication mechanism and supports different physical and link layers; a control unit; and a memory array unit that accommodates different types of memory modules. A high-speed network dynamically binds the communication unit and the memory array unit, which gives the server high scalability.

Description

A remote memory server and its implementation method
Technical field
The present invention relates to the field of high-performance computing, and in particular to a remote memory server and its implementation method.
Background art
At present, the development of computer architecture faces new challenges. First, device development is constrained by frequency and power consumption. Second, as hardware and software scale and node counts grow, fault tolerance and management problems become increasingly prominent. Innovation at the architecture level is therefore required. At the same time, low-cost informatization requires, first, that resources be highly shared and fully integrated; second, that the life cycle of information systems be extended; and third, that the integrated cost of management, maintenance and development be reduced.
For this reason, the Dynamic Self-organized computer Architecture based on Grid-components (DSAG) has been proposed. As high-performance computer systems continue to grow in scale, a single aggregated system can hardly satisfy the demands of every application, so connection and sharing between systems become inevitable. That being so, the originally aggregated system can be split: different functional components are separated, similar functional components are gathered together and made relatively independent, and the communication links between them are optimized, so that larger-scale systems can be built. This splitting and regrouping may bring the following benefits:
● Aggregating components of the same kind reduces the integration overhead caused by differing component specifications, simplifies system production, and lowers cost through large-scale industrial production.
● Gathering components of the same kind and making them independent makes it easier for other components to share the resource, removes the barriers imposed by the original aggregated system, and improves resource utilization.
● Gathering components of the same kind facilitates the design of trustworthy systems, in particular the implementation of resource backup and failover.
● Separating functional components of different kinds reduces the dependencies between them. The components that make up a system are no longer inherently tied together, which makes it easier to upgrade and expand individual components.
● Components of the same kind, gathered together, offer their function as an external service and hide the differences between internal devices. "Outdated" devices can remain in use, with their performance gap hidden by other techniques, which extends device life cycles and thereby reduces the cost of the information system.
The memory system occupies an important position in a computer system and is the part most closely coupled to the processor. Study shows that the tight coupling of memory and processor causes the following problems for the memory system:
1) The tight coupling of computation and storage causes load imbalance. In traditional PCs and cluster nodes, the processor and memory are bound together and cannot be shared. As a result, memory utilization on some nodes is very low while other nodes lack sufficient physical memory. When physical memory is insufficient, it can only be extended by swapping through virtual memory, and because disk devices are very slow, performance suffers heavily. This tightly coupled structure cannot give applications dynamic, high-capacity memory, which limits machine performance and wastes cost.
2) Memory is owned exclusively by a single node. When the memory of a node fails, the node itself must also fail, its computing power can no longer be used, and both process migration and fault diagnosis take a long time.
3) The integrated cost of upgrades. Because of the tight coupling, upgrading the memory often requires replacing the motherboard and chipset as well.
Network memory schemes have also been proposed, chiefly SAMSON from the Stony Brook campus of the State University of New York and MOMEMTO (MOre MEMory Than Others) from a university in Rio de Janeiro, Brazil. They are implemented mainly at the software level and are in essence built on distributed shared memory (DSM): memory is not separated out and aggregated; instead, the memory of remote nodes is shared. See, for example, document 1: patent No. US6795850, titled "System and method for sharing memory among multiple storage device controllers". That patent provides a method for improving remote memory access performance, but not the remote memory aggregation of the present invention.
Memory Channel is a communication protocol standard for remote storage that the Institute of Electrical and Electronics Engineers (IEEE) is drafting. It mainly addresses the communication protocol, does not specify the host-side architecture or the structure of the memory server, and remains at the draft stage.
None of the memory-oriented research and products above approaches the problem from the angle of splitting and aggregation, so the problems of tightly coupled systems remain. In addition, they fall short in low-cost scalability, adaptability and latency hiding.
Summary of the invention
The object of the present invention is to overcome the above deficiencies of the prior art by providing a remote memory server (Remote Memory Box, RMB) and its implementation method. It overcomes the problems caused by the tight coupling of existing processors and memory, and provides a remote memory service device that is highly shared, expandable at low cost, highly available, and easy to make fault tolerant and to manage.
The problems of a remote memory server center on high bandwidth, low latency, scalability, compatibility and intelligence, while also meeting requirements for testability and low cost. The key problems to be solved are:
(1) Communication interface and protocol: a low-latency, high-bandwidth, highly reliable communication mechanism is needed. It differs from ordinary communication mechanisms in that it must support remote memory server access and the associated application models.
(2) Low-cost expansion and the internal bandwidth bottleneck: only a system that can be expanded on a large scale gains a cost advantage, and expansion must not reduce the total bandwidth of the internal high-speed shared data path. A bus structure expands well but its performance degrades severely; a structure such as a crossbar switch inevitably causes cost to rise sharply as the system scales. A suitable interconnect network architecture is therefore needed that balances scalability, bandwidth and cost. By analyzing the characteristics of the system, a suitable multistage network can be constructed.
(3) Compatibility design: this covers the compatibility of the communication interface, of the memory controller, of the operating system, and so on. The quality of the compatibility design affects key factors such as the application models that can be supported and cost. Modular structural design and standardized interfaces make it possible to hide differences between modules and to interconnect heterogeneous modules.
(4) Storage management: the storage management part handles the allocation and management of memory and address mapping. Given limits on processing power and storage space, the efficiency of fine-grained storage management in a large system is critical. In addition, conflicts in the intermediate interconnect network must be resolved and a degree of load balancing provided. Storage management is implemented largely in software for flexibility, with hardware assistance to improve performance.
(5) Latency hiding: this problem has two parts. One is the hardware and software mechanisms that support latency hiding; the other is the algorithms and software that perform it.
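Item (2) above weighs a crossbar against a multistage network. As a rough illustration only, the sketch below compares textbook switch counts: an N x N crossbar needs N*N crosspoints, while a log2(N)-stage network built from 2x2 switching elements (an Omega-style network is assumed here) needs (N/2)*log2(N) elements. The patent does not name a specific multistage topology, and the comparison says nothing about blocking behaviour; the function names are hypothetical.

```python
def crossbar_crosspoints(ports: int) -> int:
    """An N x N crossbar needs one crosspoint per input/output pair."""
    return ports * ports


def multistage_switches(ports: int) -> int:
    """Count 2x2 switching elements in a log2(N)-stage network.

    Assumption: an Omega-style network with N/2 elements per stage,
    where N is a power of two. This is a textbook figure, not a value
    taken from the patent.
    """
    if ports < 2 or ports & (ports - 1):
        raise ValueError("ports must be a power of two >= 2")
    stages = ports.bit_length() - 1  # exact log2 for powers of two
    return (ports // 2) * stages


for n in (8, 64, 512):
    print(n, crossbar_crosspoints(n), multistage_switches(n))
```

The crossbar count grows quadratically with the port count while the multistage count grows as N log N, which is the cost pressure the text describes.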
To solve the above technical problems, the present invention adopts the following technical solution:
A remote memory server, as shown in Fig. 1, consists of a communication unit 10, a control unit 20 and a memory array unit 30 interconnected through a backplane. The three units are connected in sequence and interconnected by the remote memory server interconnect protocol.
As shown in Fig. 2, the communication unit 10 comprises a communication interface unit 101, a communication interface conversion unit 102 and a communication interconnect interface unit 103, connected in sequence.
As shown in Fig. 3, the control unit 20 comprises a communication interconnect interface unit 201, an address mapping unit 202, an embedded processor unit 203, a local memory unit 204, a remote memory server operation decomposition and control unit 205, a memory access control and conversion unit 206 (the memory access control unit 206), a switch scheduling unit 207 and a high-speed interconnect network unit 208. The communication interconnect interface unit 201 is connected to the address mapping unit 202. The embedded processor unit 203 is connected to the local memory unit 204. The operation decomposition and control unit 205 is connected to the communication interconnect interface unit 201, the address mapping unit 202 and the memory access control unit 206. The switch scheduling unit 207 is connected to the embedded processor unit 203, the operation decomposition and control unit 205 and the high-speed interconnect network unit 208. The memory access control unit 206 is connected to the high-speed interconnect network unit 208. The control unit 20 is implemented with one or more FPGAs or ASICs.
As shown in Fig. 4, the memory array unit 30 comprises a memory array interface unit 301, a memory interface conversion unit 302, a memory controller array unit 303 and a memory module array unit 304.
In the above technical solution, the communication interface unit 101 is the communication port to the client host. It may be a physical and link layer interface of the Gigabit Ethernet, InfiniBand, Myrinet or PCI Express type.
In the above technical solution, the communication interface conversion unit 102 performs the interface conversion between the communication interface unit 101 and the RMB communication interconnect interface unit 103; that is, it converts the protocol supported by the communication interface into the internal interconnect protocol of the RMB.
In the above technical solution, the RMB communication interconnect interface unit 103 is the interconnect interface between the communication unit 10 and the control unit 20, and is connected to the communication interconnect interface unit 201 in the control unit 20.
In the above technical solution, the client host accesses the RMB through the Remote Memory access Protocol (RMP). RMP can be used directly as the message layer protocol, or can run on top of TCP/IP or similar as an application layer protocol. It is independent of the underlying communication protocol and supports a variety of active remote memory operations, including memory copy, move, set, fetch-and-set, migration and memory consistency, as well as user-defined operations.
In the above technical solution, the RMB Interconnect Protocol (RMIP) is the internal interconnect protocol of the RMB. It is a lightweight, point-to-point physical and link layer protocol used for chip-to-chip and board-to-board interconnection. It uses serial differential signaling, which gives it high bandwidth, makes it insensitive to placement and routing, and makes expansion easy.
The RMB interconnect protocol is used not only to connect the communication unit 10 and the control unit 20, but can also interconnect the control unit 20 and the memory array unit 30.
In the above technical solution, the communication interface conversion unit 102 can also implement protocol stack functions and strip off intermediate layer protocols. For example, if the communication interface unit 101 is Ethernet with TCP/IP above it and RMP as the application layer, the interface conversion unit can implement the TCP/IP protocol stack and send the recovered RMP packets to the control unit 20 over RMIP, and vice versa.
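When RMP rides on a byte stream such as TCP, the conversion unit has to find packet boundaries before it can hand RMP packets onward. The patent does not describe the RMP wire format, so the sketch below assumes a hypothetical 4-byte big-endian length prefix purely to illustrate recovering whole packets from a stream; `frame` and `deframe` are illustrative names.

```python
import struct
from typing import List, Tuple

# Assumed framing: 4-byte big-endian length, then the RMP packet bytes.
# This layout is hypothetical; the patent does not define one.
HEADER = struct.Struct("!I")


def frame(rmp_packet: bytes) -> bytes:
    """Prefix an RMP packet with its length for transport over a byte stream."""
    return HEADER.pack(len(rmp_packet)) + rmp_packet


def deframe(stream: bytes) -> Tuple[List[bytes], bytes]:
    """Recover every complete RMP packet and return the unconsumed remainder."""
    packets = []
    pos = 0
    while pos + HEADER.size <= len(stream):
        (length,) = HEADER.unpack_from(stream, pos)
        end = pos + HEADER.size + length
        if end > len(stream):
            break  # packet not fully received yet
        packets.append(stream[pos + HEADER.size:end])
        pos = end
    return packets, stream[pos:]


# One full packet followed by the first 3 bytes of the next header.
received = frame(b"COPY") + frame(b"SET")[:3]
packets, leftover = deframe(received)
```

The leftover bytes would be kept and prepended to the next chunk read from the stream.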
In the above technical solution, the RMB communication interconnect interface unit 201 is connected to the communication unit 10 via RMIP. It unpacks RMP packets from the communication unit 10, and packs data from the memory array unit 30 into RMP packets and sends them to the communication unit 10.
In the above technical solution, the address mapping unit 202 mainly comprises a translation lookaside buffer (TLB), which translates a remote memory virtual address (RMVA) into a remote memory physical address (RMPA). The address sent by the client host, which may be a memory physical address, an IDE protocol address or the like, is converted into an RMVA by the host-side communication interface. After the communication unit 10 of the RMB receives it, the address mapping unit 202 translates it by looking up the TLB to obtain the RMPA. The RMPA corresponds to the actual physical address in the memory array unit 30; that is, it identifies which address in which memory module on which memory array board. If the TLB misses, an exception is raised, and the embedded processor unit 203 searches the page table and fills the TLB.
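The translation path described above (TLB lookup, then a page-table search and TLB fill on a miss) can be sketched as follows. This is a minimal software model, not the hardware design: the page size, the dictionary-based TLB without eviction, and the names `Rmpa` and `AddressMapper` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

PAGE_SIZE = 4096  # assumed page size; the patent does not specify one


@dataclass(frozen=True)
class Rmpa:
    """Remote memory physical address: board, memory module, offset."""
    board: int
    module: int
    offset: int


class AddressMapper:
    def __init__(self, page_table: Dict[int, Tuple[int, int, int]]):
        # virtual page number -> (board, module, base offset); stands in
        # for the page table kept by the embedded processor unit 203.
        self.page_table = page_table
        self.tlb: Dict[int, Tuple[int, int, int]] = {}
        self.misses = 0

    def translate(self, rmva: int) -> Rmpa:
        vpn, page_offset = divmod(rmva, PAGE_SIZE)
        entry = self.tlb.get(vpn)
        if entry is None:
            # TLB miss: search the page table and fill the TLB.
            self.misses += 1
            entry = self.page_table[vpn]  # KeyError if the page is unmapped
            self.tlb[vpn] = entry
        board, module, base = entry
        return Rmpa(board, module, base + page_offset)


mapper = AddressMapper({0: (0, 1, 0x10000), 1: (2, 0, 0x0)})
first = mapper.translate(0x0010)          # miss, then fill
second = mapper.translate(0x0020)         # hit on the same page
third = mapper.translate(PAGE_SIZE + 8)   # miss on a new page
```

A real TLB has a bounded number of entries and a replacement policy; the unbounded dictionary here only shows the hit and miss paths.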
In the above technical solution, the embedded processor unit 203 is a group of high-performance embedded processors running an embedded operating system.
In the above technical solution, the local memory unit 204 provides storage space for the embedded processor unit. It holds the code and data of the embedded operating system, the storage management program, the data prefetching and preprocessing management program, and the interconnect network arbitration and control program, as well as the page table.
In the above technical solution, the operation decomposition and control unit 205 decomposes the various complex operation types requested in RMP into a series of memory read, memory write and processing steps. A memory copy, for example, needs to be decomposed into two steps, a memory read and a memory write. When the requested data volume is large or memory accesses conflict, the operation must also be carried out in batches.
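The decomposition of a memory copy into batched read and write steps can be sketched as below. The batch size, the `Step` record and the function name are hypothetical; the patent only states that a copy becomes a read plus a write and that large requests are handled in batches.

```python
from dataclasses import dataclass
from typing import Iterator

BATCH_BYTES = 4096  # assumed batch size, not taken from the patent


@dataclass(frozen=True)
class Step:
    kind: str     # "read" or "write"
    address: int
    length: int


def decompose_copy(src: int, dst: int, length: int,
                   batch: int = BATCH_BYTES) -> Iterator[Step]:
    """Split a memory copy into alternating read/write steps, one pair per batch."""
    done = 0
    while done < length:
        chunk = min(batch, length - done)
        yield Step("read", src + done, chunk)
        yield Step("write", dst + done, chunk)
        done += chunk


steps = list(decompose_copy(src=0x1000, dst=0x9000, length=10000))
```

With these inputs the copy splits into three batches (4096, 4096 and 1808 bytes), giving six steps in total.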
In the above technical solution, the memory access control unit 206 packs memory read and memory write requests and sends them to the memory array unit 30, and unpacks the packets that come back. The protocol between it and the memory array unit is the Memory Array Access Protocol (MAAP), which is required to support read and write operations on large data streams.
In the above technical solution, the switch scheduling unit 207 is responsible for the mapping between the communication unit 10 and the memory array unit 30. As shown in Fig. 3, the communication unit 10 and, within the control unit 20, the communication interconnect interface unit 201, the address mapping unit 202, the operation decomposition and control unit 205 and the memory access control unit 206 may or may not correspond one to one. Once address mapping is complete, a mapping exists between the address requested by the client host and the address in the memory array, and a high-speed data path must be set up between them. The switch scheduling unit 207 schedules the high-speed interconnect network unit 208 according to the mapping table.
In the above technical solution, the high-speed interconnect network unit 208 connects the communication unit 10 and the memory array unit 30, implements the many-to-many mapping between them, and establishes the high-speed data path. After the switch scheduling unit 207 completes scheduling, the memory access control unit 206 can send and receive data.
In the above technical solution, the switch scheduling unit 207 and the high-speed interconnect network unit 208 form a non-blocking asynchronous switching network. Asynchronous scheduling keeps the interconnect network structure simple: no segmentation/reassembly or internal buffering is needed, and flexible, complex, high-density switching networks and high-speed data paths can be realized.
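A toy model may help picture how a scheduler sets up a data path through the interconnect from a mapping table. The sketch assumes a circuit-style fabric in which each communication-side port connects to at most one memory-side port at a time; the class names, the port numbering and the retry-on-busy behaviour are illustrative assumptions, not the patent's scheduling algorithm.

```python
from typing import Dict


class Interconnect:
    """Toy fabric: one-to-one connections between comm-side and memory-side ports."""

    def __init__(self) -> None:
        self.comm_to_mem: Dict[int, int] = {}
        self.mem_to_comm: Dict[int, int] = {}

    def connect(self, comm_port: int, mem_port: int) -> bool:
        if comm_port in self.comm_to_mem or mem_port in self.mem_to_comm:
            return False  # a port is busy; the caller retries later
        self.comm_to_mem[comm_port] = mem_port
        self.mem_to_comm[mem_port] = comm_port
        return True

    def release(self, comm_port: int) -> None:
        mem_port = self.comm_to_mem.pop(comm_port)
        del self.mem_to_comm[mem_port]


class ExchangeScheduler:
    """Sets up paths using a mapping table of memory board -> memory-side port."""

    def __init__(self, fabric: Interconnect, mapping_table: Dict[int, int]):
        self.fabric = fabric
        self.mapping_table = mapping_table

    def schedule(self, comm_port: int, board: int) -> bool:
        return self.fabric.connect(comm_port, self.mapping_table[board])


fabric = Interconnect()
scheduler = ExchangeScheduler(fabric, {0: 0, 1: 1})
a = scheduler.schedule(comm_port=0, board=1)  # path set up
b = scheduler.schedule(comm_port=1, board=1)  # memory port busy
fabric.release(0)
c = scheduler.schedule(comm_port=1, board=1)  # succeeds after release
```

Once `schedule` returns true, the memory access control side would be free to send and receive data over that path.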
In the above technical solution, the memory array interface unit 301 is connected to the control unit 20 via the remote memory server interconnect protocol and is responsible for sending and receiving memory array access protocol packets.
In the above technical solution, the memory interface conversion unit 302 provides the user interface required by the memory controller array unit 303 and converts data and control signals to and from the memory array interface unit 301.
In the above technical solution, the memory controller array unit 303 contains the controllers of the actual memory modules and controls the memory modules directly. It may consist of several controllers forming an array.
In the above technical solution, the memory module array unit 304 provides a high-capacity memory array, which may consist of DDR RAM, SDRAM or other types of memory modules.
A method for implementing a remote memory server comprises the following steps:
(1) The client host sends a memory access request packet, which may carry any remote memory operation type. The communication interface between the client host and the RMB may be Ethernet, PCI Express or another type.
(2) The communication interface unit 101 receives the packet and passes it to the interface conversion unit 102, which recovers the remote memory access protocol packet and sends it to the control unit 20 via the communication interconnect interface unit 103.
(3) The communication interconnect interface unit 201 in the control unit 20 unpacks the remote memory access protocol packet.
(4) The address mapping unit 202 looks up the translation lookaside buffer (TLB) using the address in the remote memory access protocol packet and finds the corresponding address in the memory array. If the TLB misses, the request is handed to the embedded processor unit 203 for storage management, which allocates or finds the corresponding address and fills the TLB.
(5) The operation decomposition and control unit 205 decomposes a complex operation request into simple operation steps and controls their execution according to the access operation type in the packet.
(6) The switch scheduling unit 207 schedules the high-speed interconnect network unit 208 according to the address mapping.
(7) The memory access control unit 206 generates a memory array access protocol packet and sends it to the memory array unit 30.
(8) In the memory array unit 30, the memory array interface unit 301 unpacks the memory array access protocol packet, and the interface conversion unit 302 converts it into input for the memory controller array unit 303.
(9) The memory controller array unit 303 controls reads and writes of the memory module array unit 304.
(10) The reply is packed into a memory array access protocol packet by the interface conversion unit 302 and the memory array interface unit 301 and sent to the control unit 20.
(11) The memory array access protocol packet passes through the high-speed interconnect network 208 and is received and unpacked by the memory access control unit 206.
(12) The operation decomposition and control unit 205 initiates the next simple step produced by the decomposition in step (5). If a reply or data needs to be returned to the client host, it directs the communication interconnect interface unit 201 to generate a remote memory access protocol response packet and send it to the communication unit 10.
(13) The communication unit 10 sends the remote memory access protocol packet to the client host.
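The thirteen steps can be traced end to end with a small software model. The sketch below collapses the hardware units into one class and records which stage each request passes through. Everything in it is an assumption for illustration: the `RmpRequest` fields, the "read"/"write" operation names, the 256-byte page, and the restriction that an access stays within one page.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RmpRequest:
    op: str            # "read" or "write" (hypothetical operation names)
    rmva: int          # remote memory virtual address
    length: int = 0    # bytes to read
    data: bytes = b""  # bytes to write


class ToyRmb:
    PAGE = 256  # assumed page size for the toy model

    def __init__(self, pages: int) -> None:
        self.memory = bytearray(pages * self.PAGE)  # stands in for unit 30
        self.page_table: Dict[int, int] = {}        # virtual page -> physical page
        self.tlb: Dict[int, int] = {}
        self.next_free = 0
        self.trace: List[str] = []

    def _translate(self, rmva: int) -> int:
        vpn, offset = divmod(rmva, self.PAGE)
        if vpn not in self.tlb:
            # Step 4 miss path: allocate or look up the page, then fill the TLB.
            if vpn not in self.page_table:
                self.page_table[vpn] = self.next_free
                self.next_free += 1
            self.tlb[vpn] = self.page_table[vpn]
            self.trace.append("tlb-miss")
        return self.tlb[vpn] * self.PAGE + offset

    def handle(self, request: RmpRequest) -> bytes:
        self.trace.append("unpack")            # steps 2-3
        addr = self._translate(request.rmva)   # step 4
        self.trace.append("schedule")          # steps 5-6
        if request.op == "write":              # steps 7-11
            self.memory[addr:addr + len(request.data)] = request.data
            reply = b"OK"
        elif request.op == "read":
            reply = bytes(self.memory[addr:addr + request.length])
        else:
            raise ValueError(f"unsupported operation: {request.op}")
        self.trace.append("reply")             # steps 12-13
        return reply


rmb = ToyRmb(pages=4)
ack = rmb.handle(RmpRequest("write", rmva=0x105, data=b"hello"))
data = rmb.handle(RmpRequest("read", rmva=0x105, length=5))
```

The second request hits the TLB entry filled by the first, so the trace shows a single miss.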
In the present invention, the communication unit, control unit and memory array unit inside the RMB are connected by serial differential links and a dedicated lightweight point-to-point protocol, which makes the RMB highly scalable. The asynchronous serial differential high-speed interconnect network maintains high internal interconnect bandwidth as the system scales. By replacing the communication interface unit and the communication interface conversion unit, different communication protocols (such as Gigabit Ethernet, PCI Express or InfiniBand) can be used. By replacing the memory interface conversion unit and the memory controller in the memory array unit, different types of memory module (DDR RAM, SDRAM and so on) can be used. The embedded processor unit runs the storage management program and the latency-hiding algorithms. The high-speed TLB performs fast translation between remote memory virtual addresses and remote memory physical addresses, and the switch scheduling unit completes scheduling quickly, which reduces the latency of the whole system. In summary, the RMB achieves high scalability, high bandwidth, low latency and intelligence.
Typical application scenarios of the Remote Memory Box of the present invention include:
1. Directly extending local memory: local memory and remote memory share a unified address space that is transparent to applications, and efficient latency-hiding mechanisms reduce the impact of remote memory latency.
2. Extending distributed shared memory (DSM): in existing DSM mechanisms, the memory of a remote node is accessed through network interfaces and the interconnect network, and the network overhead between nodes is very high. If an RMB is used as the shared memory device, a large number of memory operations can be carried out inside the RMB, which reduces network overhead.
3. Replacing disk as the swap device: when a host runs short of physical memory, it uses disk as the swap device and swaps part of its memory data out as virtual memory. Disk access latency, however, is on the order of milliseconds or more, so the overhead is very high and performance suffers badly. An RMB offers better latency and bandwidth and is not limited by local disk capacity. If a conversion interface between IDE and the RMB is implemented on the host side, the operating system need not be modified.
4. Serving as a checkpoint device: current mechanisms save checkpoint data to a local or remote disk, which is slow and costly. Using an RMB for checkpoints gives better latency and bandwidth than disk. Moreover, when a node fails, the checkpoint data in the RMB can be bound to another node, which can then take over. Asynchronous saving is also possible: a fast save is made to the RMB and the data is then written asynchronously to a non-volatile storage device such as a disk.
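The asynchronous save in scenario 4 (fast save to the RMB, then a background write to non-volatile storage) can be sketched with a worker thread. Both tiers are plain dictionaries here and the class and method names are hypothetical; the point is only the ordering: `save` returns as soon as the fast tier holds the data.

```python
import queue
import threading
from typing import Dict


class CheckpointStore:
    def __init__(self) -> None:
        self.rmb: Dict[str, bytes] = {}   # fast tier, stands in for RMB memory
        self.disk: Dict[str, bytes] = {}  # slow tier, stands in for a disk
        self._pending: "queue.Queue" = queue.Queue()
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def save(self, node: str, state: bytes) -> None:
        """Fast synchronous save to the RMB tier; the disk write is deferred."""
        self.rmb[node] = state
        self._pending.put((node, state))

    def _drain(self) -> None:
        # Background worker: persist queued checkpoints to the slow tier.
        while True:
            node, state = self._pending.get()
            self.disk[node] = state
            self._pending.task_done()

    def flush(self) -> None:
        """Block until every queued checkpoint has reached the slow tier."""
        self._pending.join()

    def take_over(self, failed_node: str) -> bytes:
        """Another node reads the failed node's checkpoint from the RMB tier."""
        return self.rmb[failed_node]


store = CheckpointStore()
store.save("node-3", b"state-v1")
recovered = store.take_over("node-3")  # available before the disk write finishes
store.flush()
```

`take_over` models the rebinding of checkpoint data to a surviving node; in the patent this is done by remapping memory rather than copying it.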
Compared with the prior art, the beneficial effects of the present invention are as follows:
The remote memory server of the present invention separates out the memory that would otherwise be distributed across individual hosts and provides a remote memory service to multiple hosts in the form of a memory device or memory server. Memory modules are bound to clients dynamically and allocated and released on demand, which achieves a high degree of sharing, lowers the idle rate and reduces wasted resources. Because the resource is managed centrally, rapid replacement, recovery and management after a fault are also easier. In addition, the RMB creates conditions for improving some application models. The advantages of the present invention can be summarized as follows:
(1) High degree of sharing: the RMB serves multiple client hosts, and memory modules are bound to clients dynamically, achieving a high degree of sharing and giving clients access to massive memory. The memory available to a client is whatever memory the server can provide.
(2) Expandable high-capacity memory: the RMB scales well and can support capacities from tens of GB to tens of TB. By replacing the memory controller, it can also support different types of memory module, such as DDR RAM and SDRAM.
(3) Expandable client host ports: the RMB can support a larger number of client hosts, and can support different types of underlying communication interface (Ethernet, InfiniBand, IDE, PCI and so on) and communication protocol. The dedicated message layer protocol, the Remote Memory access Protocol (RMP), supports high-capacity memory access and is independent of the underlying communication link.
(4) High-speed communication: the RMB supports a high-bandwidth, low-latency, reliable communication mechanism. A single communication interface can support a bandwidth of more than 1 Gbps, and multi-channel aggregation can reach 10 Gbps.
(5) Transparency: for some applications, such as swap and checkpointing, the RMB is transparent to the operating system and can be used with operating systems such as Windows.
(6) Dynamic latency hiding: using the embedded processor, associated software and internal loops, the RMB supports user-configurable dynamic latency-hiding mechanisms, including prefetch policies, data preprocessing and data migration. Efficient, configurable prefetch policies reduce the added latency caused by the memory being further from the processor. Data preprocessing completes some memory operations inside the RMB without involving the client host. Migrating data inside the RMB not only avoids the overhead of transferring data between clients over communication links but can also improve system performance.
(7) Effective fault tolerance and management: the memory used by a failed node can be migrated to another node by remapping. Further fault tolerance features can be added inside the RMB; for example, an uninterruptible power supply can be placed in the box, and data export to a high-capacity non-volatile storage device can be supported.
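The dynamic binding in advantage (1) and the remapping in advantage (7) both come down to a table that records which client owns which memory region. A minimal sketch follows, assuming region-granular ownership; the `BindingTable` class and its methods are hypothetical and not taken from the patent.

```python
from typing import Dict, List


class BindingTable:
    """Tracks which client each memory region is currently bound to."""

    def __init__(self) -> None:
        self.owner: Dict[int, str] = {}  # region id -> client name

    def bind(self, region: int, client: str) -> None:
        if region in self.owner:
            raise ValueError(f"region {region} is already bound")
        self.owner[region] = client

    def release(self, region: int) -> None:
        del self.owner[region]

    def regions_of(self, client: str) -> List[int]:
        return sorted(r for r, c in self.owner.items() if c == client)

    def fail_over(self, failed: str, successor: str) -> List[int]:
        """Remap every region of a failed client to a successor; no data is copied."""
        moved = self.regions_of(failed)
        for region in moved:
            self.owner[region] = successor
        return moved


table = BindingTable()
table.bind(0, "A")
table.bind(1, "B")
table.bind(2, "A")
moved = table.fail_over("A", "C")
```

Because only the table changes, the successor can reach the failed node's data without any transfer between clients.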
Description of drawings
Fig. 1 is the overall structural diagram of the remote memory server of the present invention.
Fig. 2 is the structural diagram of the communication unit of the present invention.
Fig. 3 is the structural diagram of the control unit of the present invention.
Fig. 4 is the structural diagram of the memory array unit of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the drawings and specific embodiments:
As shown in Fig. 1, the remote memory server (RMB) consists of a communication unit 10, a control unit 20 and a memory array unit 30, interconnected through a backplane. The RMB and the client host are interconnected by high-speed communication (Gigabit Ethernet, InfiniBand, PCI Express and so on). The upper layer is the Remote Memory access Protocol (RMP), and protocols such as TCP/IP may lie in between.
The units inside the RMB are connected over serial differential signal pairs on the backplane using the RMB interconnect protocol. The message layer protocol between the communication unit 10 and the control unit 20 is RMP, and the message layer protocol between the control unit 20 and the memory array unit is MAAP. The high-speed interconnect network can use asynchronous serial switch chips; the switch scheduling unit in the control unit 20 schedules it asynchronously to form a high-speed data path, after which communication takes place.
The client host accesses the RMB through the Remote Memory access Protocol (RMP). RMP can be used directly as the message layer protocol, or can run on top of TCP/IP or similar as an application layer protocol. It is independent of the underlying communication protocol and supports a variety of active remote memory operations, including memory copy/move/set, fetch-and-set, migration and memory consistency, as well as user-defined operations.
The RMB Interconnect Protocol (RMIP) is the internal interconnect protocol of the RMB. It is a lightweight, point-to-point physical and link layer protocol used for chip-to-chip and board-to-board interconnection. It uses serial differential signaling, which gives it high bandwidth, makes it insensitive to placement and routing, and makes expansion easy.
The RMB interconnect protocol is used not only to connect the communication unit 10 and the control unit 20, but can also interconnect the control unit 20 and the memory array unit 30. The low-cost scalability of the system shows in two respects: (1) this interconnect makes it easy to add more communication units and memory units, and the control unit 20 can also be cascaded through it; (2) replacing the communication interface unit 101 and the communication interface conversion unit 102 is enough to adapt to a different client communication interface.
As shown in Figure 2, communication unit 10 communicates with the client host through communication interface unit 101, which can be implemented with a commercial communication interface chip. Communication interface conversion unit 102 performs the interface conversion between communication interface unit 101 and RMB communication interconnect interface unit 103, that is, it converts the protocol supported by the communication interface into the interconnect protocol used inside the RMB. Communication interface conversion unit 102 recovers RMP packets from the signals of communication interface unit 101 and hands them to RMB communication interconnect interface unit 103 to be sent to the control module; the reverse direction works the same way.
As shown in Figure 3, control module 20 is implemented with multiple FPGAs and performs the core control functions, including storage management, latency hiding, decomposition and control of remote memory operations, and interconnect network scheduling.
RMB communication interconnect interface unit 201 is connected to communication unit 10 through RMIP; it unpacks the RMP packets arriving from communication unit 10 and packs data from memory array unit 30 into RMP packets to be sent to communication unit 10. The interface between control module 20 and communication unit 10 thus uses RMIP, with RMP as the upper-layer protocol. High-speed interconnect network unit 208 can be implemented separately with an FPGA or a commercial chip. Local memory unit 204 can be implemented with on-chip memory or external RAM.
Address mapping unit 202 looks up the TLB to perform address mapping; on a miss it sends an interrupt to the embedded processor unit.
Embedded processor unit 203 is a group of high-performance embedded processors running an embedded operating system. It has three main functions: first, storage management, that is, establishing and removing mappings between clients and memory, searching the page table, filling the TLB and so on; second, running flexible data prefetching and data preprocessing strategies; third, part of the scheduling and control of the interconnect network. Embedded processor unit 203 can use processors embedded in the FPGA or external processors, and works with local memory unit 204 to carry out storage management, latency-hiding algorithms and similar functions.
Operation decomposition and control module 205 breaks a complex operation request into several steps and controls their execution until completion; for example, a memory copy must be decomposed into two steps, a memory read and a memory write. When the requested data volume is large or there are memory access conflicts, the operation must also be carried out in batches.
Memory access control module 206 packs and unpacks MAAP packets: it packs memory read and write requests and sends them to memory array unit 30, and unpacks the packets that come back. The protocol between it and the memory array unit is the Memory Array Access Protocol (MAAP), which supports read and write operations on large data streams.
Exchange scheduling unit 207 manages a mapping table and is responsible for the mapping between communication unit 10 and memory array unit 30. Communication unit 10 and control module 20 are linked through the high-speed interconnect network; once address mapping is complete, a mapping exists between the address requested by the client host and the address in the memory array, and a high-speed data path must be set up between them. Exchange scheduling unit 207 schedules high-speed interconnect network unit 208 to gate the data path.
In this embodiment, each communication unit 10 corresponds one-to-one with an RMB communication interconnect interface unit 201, an address mapping unit 202, an operation decomposition and control module 205 and a memory access control module 206 in control module 20.
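The decomposition performed by operation decomposition and control module 205 can be sketched as follows. This is a minimal illustration, not the patented logic: the function name `decompose_copy`, the step tuple format and the fixed `batch` size are all assumptions, and conflict handling is left out.

```python
from typing import List, Tuple

# (kind, address, length); a hypothetical representation of one simple step
Step = Tuple[str, int, int]


def decompose_copy(src: int, dst: int, length: int, batch: int) -> List[Step]:
    """Split a memory copy into alternating read and write steps.

    Requests larger than `batch` bytes are carried out in several batches,
    each consisting of one read followed by one write.
    """
    if length < 0 or batch <= 0:
        raise ValueError("length must be >= 0 and batch must be > 0")
    steps: List[Step] = []
    offset = 0
    while offset < length:
        n = min(batch, length - offset)  # the last batch may be shorter
        steps.append(("read", src + offset, n))
        steps.append(("write", dst + offset, n))
        offset += n
    return steps
```

For example, `decompose_copy(0, 100, 10, 4)` yields three read/write pairs covering 4, 4 and 2 bytes.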
As shown in Figure 3, high-speed interconnect network unit 208 sits between memory access control module 206 and memory array unit 30. The mapping between communication unit 10 and memory array unit 30 is realized here. High-speed interconnect network unit 208 is scheduled asynchronously and does not need to disassemble or reassemble packets; it is simply a serial switching path, so complex and flexible structures can be built to meet the needs of dynamic binding.
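The dynamic binding between communication units and memory array units can be pictured as a small mapping table that gates paths through the switch. The sketch below is hypothetical: the class and method names are invented, and it assumes that a memory array port carries at most one path at a time, which the text does not state.

```python
from typing import Dict, Optional


class SwitchScheduler:
    """Toy model of a mapping table that gates switch paths."""

    def __init__(self) -> None:
        # communication unit index -> memory array unit index
        self._paths: Dict[int, int] = {}

    def connect(self, comm: int, array: int) -> bool:
        """Bind `comm` to `array`; refuse if another unit holds that array port."""
        in_use_by_other = any(
            a == array and c != comm for c, a in self._paths.items()
        )
        if in_use_by_other:
            return False
        self._paths[comm] = array  # also rebinds a unit that held another port
        return True

    def release(self, comm: int) -> None:
        """Tear down the path held by `comm`, if any."""
        self._paths.pop(comm, None)

    def route(self, comm: int) -> Optional[int]:
        """Return the memory array unit currently bound to `comm`."""
        return self._paths.get(comm)
```

Because only the table changes, no packet has to be taken apart or reassembled when a binding is moved, which matches the description of the switch as a plain serial path.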
As shown in Figure 4, memory array unit 30 is interconnected with the control module (through the high-speed switching network) by RMIP and MAAP; MAAP packing and unpacking is performed by memory array interface unit 301. Memory interface conversion unit 302 performs the interface conversion between MAAP requests and memory controller array unit 303. Memory interface conversion unit 302 and memory array interface unit 301 can be implemented with FPGAs, which makes the logic easy to replace and makes it straightforward to build a product line that uses different memory modules. Memory module array unit 304 provides a large-capacity memory array and uses memory modules of the DDR RAM type.
With reference to Figures 1, 2, 3 and 4, the flow of a client accessing the RMB is as follows:
(1) The client host sends a memory access request packet, which may be any of the remote memory operation types; the communication interface between the client host and the RMB may be Ethernet, PCI-Express or another type;
(2) Communication interface unit 101 receives the packet and hands it to communication interface conversion unit 102, which recovers the remote memory access protocol packet and sends it to control module 20 via interconnect interface unit 103;
(3) Communication interconnect interface unit 201 in control module 20 unpacks the remote memory access protocol packet;
(4) Address mapping unit 202 looks up the TLB using the address in the RMP packet to find the corresponding address in the memory array; if the TLB misses, the request is handed to embedded processor unit 203 for storage management, which allocates or finds the corresponding address and fills the TLB;
(5) Operation decomposition and control module 205 decomposes a complex operation request into simple operation steps according to the access operation type in the packet and controls the execution of those steps; for example, a memory copy operation can be decomposed into two simple steps, a data read and a data write;
(6) Exchange scheduling unit 207 schedules high-speed interconnect network unit 208 according to the address mapping relationship;
(7) Memory access control module 206 generates a memory array access protocol packet and sends it to memory array unit 30;
(8) In memory array unit 30, memory array interface unit 301 unpacks the MAAP packet, and memory interface conversion unit 302 converts it into the input of memory controller array unit 303;
(9) Memory controller array unit 303 controls the reading and writing of memory module array unit 304;
(10) The reply is packed into a MAAP packet by memory interface conversion unit 302 and memory array interface unit 301 and sent to control module 20;
(11) The MAAP packet passes through high-speed interconnect network 208 and is received and unpacked by memory access control module 206;
(12) Operation decomposition and control module 205 initiates the next simple step produced by the decomposition in step (5); if a reply or data needs to be returned to the client host, it directs RMB communication interconnect interface unit 201 to generate an RMP response packet and send it to communication unit 10;
(13) Communication unit 10 sends the RMP packet to the client host.
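Step (4) of the flow above, the TLB lookup with a fall-back to the embedded processor on a miss, can be sketched as below. The page size, the FIFO replacement policy, the bump-style page allocator and all names are assumptions chosen to keep the example short; the patent does not specify them.

```python
PAGE_SIZE = 4096  # assumed page size


class AddressMapper:
    """Toy TLB in front of a page table kept by the embedded processor."""

    def __init__(self, page_table: dict, tlb_capacity: int = 4) -> None:
        self.page_table = page_table   # virtual page -> physical page
        self.tlb: dict = {}            # small cache of page_table entries
        self.tlb_capacity = tlb_capacity
        self.misses = 0
        self._next_free = max(page_table.values(), default=-1) + 1

    def translate(self, vaddr: int) -> int:
        """Map a client-visible address to an address in the memory array."""
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage not in self.tlb:
            self.misses += 1
            self._handle_miss(vpage)   # stands in for the interrupt path
        return self.tlb[vpage] * PAGE_SIZE + offset

    def _handle_miss(self, vpage: int) -> None:
        """Find or allocate the mapping, then fill the TLB."""
        if vpage not in self.page_table:
            self.page_table[vpage] = self._next_free
            self._next_free += 1
        if len(self.tlb) >= self.tlb_capacity:
            oldest = next(iter(self.tlb))  # dicts keep insertion order
            del self.tlb[oldest]
        self.tlb[vpage] = self.page_table[vpage]
```

A second access to the same page is served from the TLB and does not increase `misses`, which is the behaviour the hardware lookup is meant to provide.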
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the invention may still be modified or equivalently substituted, and any modification or partial substitution that does not depart from the spirit and scope of the invention shall fall within the scope of the claims of the invention.

Claims (10)

1. A remote memory server, composed of a communication unit (10), a control module (20) and a memory array unit (30) interconnected through a backplane, which are connected in sequence and interconnected by the remote memory server interconnect protocol;
wherein the communication unit (10) comprises a communication interface unit (101), a communication interface conversion unit (102) and a communication interconnect interface unit (103) connected in sequence;
the control module (20) comprises a communication interconnect interface unit (201) of the remote memory server, an address mapping unit (202), an embedded processor unit (203), a local memory unit (204), an operation decomposition and control module (205), a memory access control module (206), an exchange scheduling unit (207) and a high-speed interconnect network unit (208); wherein the communication interconnect interface unit (201) of the remote memory server is connected to the address mapping unit (202); the embedded processor unit (203) is connected to the local memory unit (204); the operation decomposition and control module (205) is connected respectively to the communication interconnect interface unit (201) of the remote memory server, the address mapping unit (202) and the memory access control module (206); the exchange scheduling unit (207) is connected respectively to the embedded processor unit (203), the operation decomposition and control module (205) and the high-speed interconnect network unit (208); and the memory access control module (206) is connected to the high-speed interconnect network unit (208); the control module (20) is implemented by one or more FPGAs or ASICs;
the memory array unit (30) comprises a memory array interface unit (301), a memory interface conversion unit (302), a memory controller array unit (303) and a memory module array unit (304) connected in sequence.
2. The remote memory server according to claim 1, characterized in that the communication interface unit (101) is the communication port to the client host, and this communication port is a physical-layer and link-layer interface of the Gigabit Ethernet, Infiniband, Myrinet or PCI-Express type;
the communication interface conversion unit (102) performs the interface conversion between the communication interface unit (101) and the communication interconnect interface unit (103);
the communication interconnect interface unit (103) is the interconnect interface between the communication unit (10) and the control module (20), and is connected to the communication interconnect interface unit (201) of the remote memory server in the control module (20).
3. The remote memory server according to claim 1, characterized in that the client host accesses the remote memory server through the remote memory access protocol; the remote memory access protocol serves directly as the message-layer protocol, or runs on top of the TCP/IP protocol as an application-layer protocol; the application-layer protocol supports a variety of active remote memory operations, including memory copy, move, set, fetch-and-set, migration, memory coherence or user-defined operations.
4. The remote memory server according to claim 1, characterized in that the remote memory server interconnect protocol is the interconnect protocol inside the remote memory server; it is a lightweight, point-to-point physical-layer and link-layer protocol used for chip-to-chip and board-to-board interconnection; it uses serial differential signaling; the remote memory server interconnect protocol is used to connect the communication unit (10) and the control module (20), and is also used to interconnect the control module (20) and the memory array unit (30).
5. The remote memory server according to claim 1, characterized in that the communication interface conversion unit (102) also implements the protocol stack function and strips off the intermediate-layer protocols.
6. The remote memory server according to claim 1 or 4, characterized in that the communication interconnect interface unit (201) of the remote memory server is connected to the communication unit (10) through the remote memory server interconnect protocol, and is responsible for unpacking the remote memory server interconnect protocol packets coming from the communication unit (10), or for packing the data coming from the memory array unit (30) into the remote memory server interconnect protocol format and sending it to the communication unit (10);
the address mapping unit (202) mainly comprises a translation look-aside buffer and performs the translation between remote memory virtual addresses and remote memory physical addresses;
the embedded processor unit (203) runs an embedded operating system;
the local memory unit (204) provides storage space for the embedded processor unit (203), holding the code and data of the embedded operating system, the storage management program, the data prefetching and preprocessing management program, the interconnect network arbitration and control program, and the page tables;
the operation decomposition and control module (205) decomposes the various complex operation types requested in the remote memory server interconnect protocol into a series of memory read, memory write and processing steps;
the memory access control module (206) packs memory read and write requests and sends them to the memory array unit (30), or unpacks the packets that are returned; the protocol between it and the memory array unit (30) is the memory array access protocol;
the exchange scheduling unit (207) is responsible for the mapping between the communication unit (10) and the memory array unit (30); the communication unit (10) corresponds one-to-one with the communication interconnect interface unit (201) of the remote memory server, the address mapping unit (202), the operation decomposition and control module (205) and the memory access control module (206) in the control module (20); the exchange scheduling unit (207) schedules the high-speed interconnect network unit (208) according to a mapping table;
the high-speed interconnect network unit (208) connects the communication unit (10) and the memory array unit (30), realizes a many-to-many mapping between them, and sets up high-speed data paths;
the exchange scheduling unit (207) and the high-speed interconnect network unit (208) form a non-blocking asynchronous switching network.
7. The remote memory server according to claim 1, characterized in that the memory array interface unit (301) is connected to the control module (20) through the remote memory server interconnect protocol and is responsible for sending and receiving memory array access protocol packets;
the memory interface conversion unit (302) provides the user interface required by the memory controller array unit (303) and performs the data and control conversion to and from the memory array interface unit (301);
the memory controller array unit (303) consists of the controllers of the specific memory modules, which directly control the memory modules; a plurality of controllers form an array;
the memory module array unit (304) is a memory array.
8. The remote memory server according to claim 7, characterized in that the memory module array unit (304) consists of memory modules of the DDR RAM or SDRAM type.
9. A method for implementing a remote memory server, comprising the following steps:
1) the client host sends a memory access request packet;
2) the communication interface unit (101) receives the packet and hands it to the communication interface conversion unit (102), which recovers the remote memory access protocol packet and sends it to the control module (20) via the communication interconnect interface unit (103);
3) the communication interconnect interface unit (201) of the remote memory server in the control module (20) unpacks the remote memory access protocol packet;
4) the address mapping unit (202) looks up the translation look-aside buffer using the address in the remote memory access protocol packet to find the corresponding address in the memory array; if the translation look-aside buffer misses, the request is handed to the embedded processor unit (203) for storage management, which allocates or finds the corresponding address and fills the translation look-aside buffer;
5) the operation decomposition and control module (205) decomposes a complex operation request into simple operation steps according to the access operation type in the packet, and controls the execution of the simple operation steps;
6) the exchange scheduling unit (207) schedules the high-speed interconnect network unit (208) according to the address mapping relationship;
7) the memory access control module (206) generates a memory array access protocol packet and sends it to the memory array unit (30);
8) in the memory array unit (30), the memory array interface unit (301) unpacks the memory array access protocol packet, and the memory interface conversion unit (302) converts it into the input of the memory controller array unit (303);
9) the memory controller array unit (303) controls the reading and writing of the memory module array unit (304);
10) the reply is packed into a memory array access protocol packet by the memory interface conversion unit (302) and the memory array interface unit (301) and sent to the control module (20);
11) the memory array access protocol packet passes through the high-speed interconnect network (208) and is received and unpacked by the memory access control module (206);
12) the operation decomposition and control module (205) initiates the next simple step produced by the decomposition in step 5); if a reply or data needs to be returned to the client host, it directs the communication interconnect interface unit (201) of the remote memory server to generate a remote memory access protocol response packet and send it to the communication unit (10);
13) the communication unit (10) sends the remote memory access protocol packet to the client host.
10. The method for implementing a remote memory server according to claim 9, characterized in that the communication interface between the client host and the remote memory server is of the Ethernet, Infiniband or PCI-Express type.
CNB2005100983901A 2005-09-09 2005-09-09 Long-distance inner server and its implementing method Active CN100437522C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2005100983901A CN100437522C (en) 2005-09-09 2005-09-09 Long-distance inner server and its implementing method

Publications (2)

Publication Number Publication Date
CN1928839A CN1928839A (en) 2007-03-14
CN100437522C true CN100437522C (en) 2008-11-26

Family

ID=37858810

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2005100983901A Active CN100437522C (en) 2005-09-09 2005-09-09 Long-distance inner server and its implementing method

Country Status (1)

Country Link
CN (1) CN100437522C (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101841438B (en) * 2010-04-02 2011-10-05 中国科学院计算技术研究所 Method or system for accessing and storing stream records of massive concurrent TCP streams
KR101589801B1 (en) 2011-09-20 2016-01-28 엠파이어 테크놀로지 디벨롭먼트 엘엘씨 Peer-to-peer data migration
CN102609378B (en) * 2012-01-18 2016-03-30 中国科学院计算技术研究所 A kind of message type internal storage access device and access method thereof
CN103544123A (en) * 2012-07-16 2014-01-29 深圳市中兴微电子技术有限公司 SDRAM controller and access method for SDRAM memory space
CN103634277B (en) * 2012-08-23 2019-02-05 深圳市腾讯计算机系统有限公司 A kind of method of shared drive, server and system
CN102854942B (en) * 2012-09-07 2016-06-01 赵丰年 A kind of internal memory array
CN103902469B (en) * 2012-12-25 2017-03-15 华为技术有限公司 A kind of method and system of data pre-fetching
CN104516822B (en) * 2013-09-29 2018-01-23 华为技术有限公司 A kind of memory pool access method and equipment
CN104156332B (en) * 2014-08-11 2017-02-15 济南曼维信息科技有限公司 High-performance parallel computing method based on external PCI-E connection
CN104216835B (en) * 2014-08-25 2017-04-05 杨立群 A kind of method and device for realizing internal memory fusion
CN105701020B (en) * 2014-11-28 2018-11-30 华为技术有限公司 A kind of method of internal storage access, relevant apparatus and system
KR101835949B1 (en) * 2014-12-14 2018-03-08 비아 얼라이언스 세미컨덕터 씨오., 엘티디. Cache replacement policy that considers memory access type
CN104731531B (en) * 2015-03-24 2018-01-02 浪潮集团有限公司 A kind of server node architecture design method of separate type high power capacity internal memory
CN106155910B (en) * 2015-03-27 2021-02-12 华为技术有限公司 Method, device and system for realizing memory access
CN108139967B (en) * 2015-10-09 2021-07-20 华为技术有限公司 Converting a data stream into an array
CN108123984A (en) * 2016-11-30 2018-06-05 天津易遨在线科技有限公司 A kind of memory database optimizes server cluster framework
CN106844048B (en) * 2017-01-13 2020-11-06 上海交通大学 Distributed memory sharing method and system based on hardware characteristics
CN109739928A (en) * 2018-12-14 2019-05-10 深圳壹账通智能科技有限公司 Data export method, device, computer equipment and storage medium
CN112667561A (en) * 2020-12-29 2021-04-16 成都旋极历通信息技术有限公司 Implementation mode for realizing UFS array controller in FPGA
CN113722110B (en) * 2021-11-02 2022-04-15 阿里云计算有限公司 Computer system, memory access method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020112113A1 (en) * 2001-01-11 2002-08-15 Yotta Yotta, Inc. Storage virtualization system and methods
US6738872B2 (en) * 2000-12-22 2004-05-18 International Business Machines Corporation Clustered computer system with deadlock avoidance
US6795850B2 (en) * 2002-12-13 2004-09-21 Sun Microsystems, Inc. System and method for sharing memory among multiple storage device controllers
CN1547126A (en) * 2003-12-04 2004-11-17 中国科学院计算技术研究所 Initiator triggered remote memory access virtual-physical address conversion method
CN1564517A (en) * 2004-03-26 2005-01-12 清华大学 Memory-network memory-magnetic disc high speed reliable storage system and its reading/writing method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
网格化的动态自组织高性能计算机体系结构DSAG. 樊建平,陈明宇.计算机研究与发展,第40卷第12期. 2003 *

Also Published As

Publication number Publication date
CN1928839A (en) 2007-03-14

Similar Documents

Publication Publication Date Title
CN100437522C (en) Long-distance inner server and its implementing method
CN102457439B (en) Virtual switching system and method of cloud computing system
CN105516191B (en) System based on the FPGA 10,000,000,000 net Transmission Control Protocol unloading engine TOE realized
CN101901207B (en) Operating system of heterogeneous shared storage multiprocessor system and working method thereof
CN102882864B (en) A kind of virtualization system based on InfiniBand system for cloud computing
CN103583021B (en) The high radix network extended method of increment and system
CN101853237B (en) On-chip system and AXI bus transmission method
CN101163133B (en) Communication system and method of implementing resource sharing under multi-machine virtual environment
CN103188157B (en) A kind of router equipment
CN101859263A (en) Quick communication method between virtual machines supporting online migration
CN101069174A (en) A data processing system and a method for synchronizing data traffic
CN100550003C (en) The implementation method of chip-on communication of built-in isomerization multicore architecture interconnection organisational level
CN105867843A (en) Data transmission method and device
CN105577430A (en) Node management method of high-end fault-tolerant server
CN107038134A (en) A kind of SRIO interface solid hard disks system and its implementation based on FPGA
CN111541599B (en) Cluster software system and method based on data bus
CN102103471B (en) Data transmission method and system
CN104243172B (en) The extension input/output unit and method of a kind of scattered control system
CN202798790U (en) Virtual system based on InfiniBand cloud computing network
CN101827088A (en) Realization method of basic communication protocol based on CPU (Central Processing Unit) bus interconnection
CN101247663B (en) Considerable routing system and its forwarding table generation method
CN106844052A (en) A kind of method and device that fusion cluster is built based on Windows Server
CN100420217C (en) Interframe interconnection communication system and data exchanging method thereof
CN106814976A (en) Cluster storage system and apply its data interactive method
CN110166448B (en) Heterogeneous protocol conversion middleware and method for heterogeneous controller cluster

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: HUAWEI TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: INSTITUTE OF COMPUTING TECHNOLOGY, CHINESE ACADEMY OF SCIENCES

Effective date: 20130530

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 100080 HAIDIAN, BEIJING TO: 518129 SHENZHEN, GUANGDONG PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20130530

Address after: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Patentee after: Huawei Technologies Co., Ltd.

Address before: 100080 Haidian District, Zhongguancun Academy of Sciences, South Road, No. 6, No.

Patentee before: Institute of Computing Technology, Chinese Academy of Sciences

TR01 Transfer of patent right

Effective date of registration: 20211227

Address after: 450046 Floor 9, building 1, Zhengshang Boya Plaza, Longzihu wisdom Island, Zhengdong New Area, Zhengzhou City, Henan Province

Patentee after: Super fusion Digital Technology Co.,Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Patentee before: HUAWEI TECHNOLOGIES Co.,Ltd.

TR01 Transfer of patent right