CN1928839A - Long-distance inner server and its implementing method - Google Patents

Publication number
CN1928839A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2005100983901A
Other languages
Chinese (zh)
Other versions
CN100437522C (en)
Inventor
李磊
樊建平
陈明宇
曹政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XFusion Digital Technologies Co Ltd
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CNB2005100983901A priority Critical patent/CN100437522C/en
Publication of CN1928839A publication Critical patent/CN1928839A/en
Application granted granted Critical
Publication of CN100437522C publication Critical patent/CN100437522C/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The disclosed remote memory server, which remote clients access through a corresponding protocol, comprises a communication unit that provides a high-bandwidth, low-latency communication mechanism and supports different physical and link layers, a control unit, and a memory array unit that accommodates various kinds of memory modules. A high-speed network dynamically binds the communication unit and the memory array unit, giving the server high scalability.

Description

A remote memory server and its implementation method
Technical field
The present invention relates to the field of high-performance computing, and in particular to a remote memory server and its implementation method.
Background art
Computer architecture currently faces new challenges. First, device development is limited by frequency and power consumption. Second, as hardware and software scale and node counts grow, fault tolerance and management problems become increasingly prominent. Innovation at the architecture level is therefore necessary. At the same time, low-cost informatization requires, first, a high degree of resource sharing so that resources are fully integrated and utilized; second, a longer information-system life cycle; and third, a lower overall cost, including management, maintenance, and development.
For these reasons, the Dynamic Self-organized computer Architecture based on Grid-components (DSAG) has been proposed. As high-performance computer systems grow further, a single aggregated system can hardly satisfy the demands of every application, so interconnection and sharing between systems become inevitable. That being so, the originally aggregated system can be split: different functional components are separated, similar functional components are concentrated and made relatively independent, and the communication links between them are optimized, so that a larger-scale system can be built. This splitting and recombination may bring the following benefits:
● Aggregating components of the same kind reduces the integration overhead caused by differing component specifications, simplifies system production, and lowers cost through large-scale industrial production.
● Concentrating components of the same kind and making them independent makes it easier for other components to share the resource, removes the barriers imposed by the original aggregated system, and improves resource utilization.
● Concentrating components of the same kind eases the design of dependable systems, in particular the implementation of resource backup and failover.
● Separating functional components of different kinds reduces the dependence between them. The components that make up the system are no longer inherently connected, which eases the upgrade and expansion of individual components.
● Concentrated components of the same kind provide a functional service externally and mask differences among internal devices. "Outdated" devices can remain in use, with their performance gap hidden by other techniques, which extends device life cycles and thus lowers information-system cost.
The memory system holds an important position in a computer system and is the part most tightly coupled to the processor. Research shows that the tight coupling of memory and processor causes the following problems for the memory system:
1) The tight coupling of computation and storage causes load imbalance. In traditional PCs and cluster nodes, processor and memory are bound together and cannot be shared. As a result, memory utilization is very low on some nodes while other nodes lack sufficient physical memory. When physical memory is insufficient, memory can only be extended by swapping to virtual memory, and because disk access is very expensive, performance suffers heavily. This tightly coupled structure cannot provide applications with dynamic, large-capacity storage; it limits machine performance and wastes cost.
2) Memory is owned exclusively by a node. When a node's memory fails, the node itself must fail, its computing power can no longer be used, and both process migration and fault diagnosis take a long time.
3) The integrated cost of upgrading. Because of the tight coupling, upgrading the memory often requires replacing the motherboard and chipset as well.
Network memory schemes have also been proposed, chiefly SAMSON from Stony Brook, New York, and MOMEMTO (MOre MEMory Than Others) from a university in Rio de Janeiro, Brazil. They are mainly implemented at the software level and are essentially built on distributed shared memory (DSM): memory is not separated and aggregated; instead, the memory of remote nodes is shared. See, for example, Document 1: US Patent No. 6795850, titled "System and method for sharing memory among multiple storage device controllers". That patent provides a method for improving remote memory access performance, not the remote memory aggregation of the present invention.
Memory Channel is a communication protocol standard for remote storage being drafted by the Institute of Electrical and Electronics Engineers (IEEE). However, it mainly addresses the communication protocol, does not specify the host-side architecture or the structure of the memory server, and remains at the draft stage.
None of the above memory-oriented research and products approaches the problem from the angle of splitting and aggregation, so the problems of tightly coupled systems remain. In addition, their capabilities for low-cost expansion, adaptability, and latency hiding are insufficient.
Summary of the invention
The object of the present invention is to overcome the above deficiencies of the prior art by providing a remote memory server (Remote Memory Box, RMB) and its implementation method, which overcome the problems caused by the tight coupling of processor and memory and provide a remote memory service device that is highly shared, expandable at low cost, highly available, and easy to make fault-tolerant and to manage.
The problems of a remote memory server center on high bandwidth, low latency, scalability, compatibility, and intelligence, while also meeting ease-of-testing and low-cost requirements. The key problems to be solved are:
(1) Communication interface and protocol: a low-latency, high-bandwidth, highly reliable communication mechanism is required. It differs from ordinary communication mechanisms in that it must support remote memory box access and its application models.
(2) Low-cost expansion and the internal bandwidth bottleneck: the system gains a cost advantage only if it can be expanded on a large scale, and the aggregate bandwidth of the internal high-speed shared data path must not suffer during expansion. A bus structure scales well, but its performance degrades severely; a crossbar or similar structure causes cost to rise sharply as the system scale grows. A suitable interconnection network structure is therefore needed that balances scalability, bandwidth, and cost. By analyzing the system's characteristics, a suitable multistage network can be constructed.
(3) Compatibility design: this covers the communication interface, the memory controller, the operating system, and so on. The quality of the compatibility design affects key factors such as the supported application models and cost. Modular structural design and standardized interfaces make it possible to mask differences between modules and to interconnect heterogeneous modules.
(4) Storage management: the storage management part must implement allocation management and address mapping for storage. Given limited processing capacity and storage space, the efficiency of fine-grained storage management in a large-scale system is key. In addition, conflicts in the intermediate interconnection network must be resolved and a degree of load balancing provided. Storage management is implemented mostly in software for flexibility, with hardware assistance for performance.
(5) Latency hiding: this problem has two parts. One is the hardware and software mechanisms that support latency hiding; the other is the latency-hiding algorithms and software.
To solve the above technical problems, the present invention adopts the following technical solution:
A remote memory server, as shown in Fig. 1, consists of a communication unit 10, a control unit 20, and a memory array unit 30, which are connected in sequence and interconnected through the RMB interconnect protocol.
As shown in Fig. 2, the communication unit 10 comprises a communication interface unit 101, a communication interface conversion unit 102, and an RMB communication interconnect interface unit 103, connected in sequence.
As shown in Fig. 3, the control unit 20 is arranged as follows: an RMB communication interface unit 201 is connected to an address mapping unit 202; an embedded processor unit 203 is connected to a local memory unit 204; an operation decomposition and control unit 205 is connected to the RMB communication interface unit 201, the address mapping unit 202, and a memory access control unit 206; a switch scheduling unit 207 is connected to the embedded processor unit 203, the operation decomposition and control unit 205, and a high-speed interconnect network unit 208; and the memory access control unit 206 is connected to the high-speed interconnect network unit 208. The control unit 20 can be implemented with one or more FPGAs or ASICs (application-specific integrated circuits).
As shown in Fig. 4, the memory array unit 30 comprises a memory array interface unit 301, a memory interface conversion unit 302, a memory controller array unit 303, and a memory module array unit 304.
In the above technical solution, the communication interface unit 101 is the communication port to the client host and can be a physical and link layer interface such as Gigabit Ethernet, InfiniBand, Myrinet, or PCI Express.
In the above technical solution, the communication interface conversion unit 102 performs interface conversion between the communication interface unit 101 and the RMB communication interconnect interface unit 103; that is, it converts the protocol supported by the communication interface into the RMB-internal interconnect protocol.
In the above technical solution, the RMB communication interconnect interface unit 103 is the interconnect interface between the communication unit 10 and the control unit 20 and is connected to the RMB communication interface unit 201 in the control unit 20.
In the above technical solution, the client host accesses the RMB through the Remote Memory access Protocol (RMP). RMP can serve directly as the message-layer protocol, or run on top of TCP/IP or a similar protocol as an application-layer protocol. It is independent of the underlying communication protocol and supports large addresses and bulk data transfer, with the data width set to 48 bits or more. It also supports a variety of active remote memory operations, including memory copy/move/set, fetch-and-set, migration, memory consistency, and user-defined operations.
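The text does not give the RMP packet layout, so the following is only an illustrative sketch of how a request carrying an operation type and a 48-bit address might be encoded. The field order, the opcode numbers, and the names `pack_rmp` and `unpack_rmp` are assumptions made for this example and are not part of the disclosure.

```python
import struct

# Hypothetical opcode numbering; the source names the operations but not their encoding.
OPCODES = {"read": 1, "write": 2, "copy": 3, "set": 4, "fetch_and_set": 5}

def pack_rmp(op, address, payload=b"", flags=0):
    """Build an assumed RMP packet: opcode(1) flags(1) address(6) length(4) payload."""
    if not 0 <= address < 1 << 48:
        raise ValueError("address must fit in 48 bits")
    header = struct.pack(">BB", OPCODES[op], flags)
    header += address.to_bytes(6, "big")          # 48-bit address field
    header += struct.pack(">I", len(payload))     # payload length in bytes
    return header + payload

def unpack_rmp(packet):
    """Inverse of pack_rmp: return (op, address, flags, payload)."""
    opcode, flags = struct.unpack(">BB", packet[:2])
    address = int.from_bytes(packet[2:8], "big")
    (length,) = struct.unpack(">I", packet[8:12])
    payload = packet[12:12 + length]
    op = next(name for name, code in OPCODES.items() if code == opcode)
    return op, address, flags, payload
```

Because the packet is a plain byte string, the same encoding could be carried either directly on the link or as the payload of a TCP connection, which matches the two deployment modes described above.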
In the above technical solution, the RMB Interconnect Protocol (RMIP) is the RMB-internal interconnect protocol: a lightweight, point-to-point physical and link layer protocol for chip-to-chip and board-to-board interconnection. It uses serial differential signaling and therefore offers high bandwidth, is insensitive to placement and routing, and is easy to scale.
The RMB interconnect protocol is used not only to connect the communication unit 10 and the control unit 20 but also to interconnect the control unit 20 and the memory array unit 30.
In the above technical solution, the communication interface conversion unit 102 can also implement protocol stack functions and strip off intermediate-layer protocols. For example, if the communication interface unit 101 is Ethernet carrying TCP/IP with RMP as the application layer, the interface conversion unit can implement the TCP/IP stack, recover the RMP packets, and send them to the control unit 20 over RMIP, and vice versa.
In the above technical solution, the RMB communication interface unit 201 is connected to the communication unit 10 through RMIP. It unpacks RMP packets from the communication unit 10 and packs data from the memory array unit 30 into RMP packets that are sent to the communication unit 10.
In the above technical solution, the address mapping unit 202 mainly comprises a translation lookaside buffer (TLB) that translates between remote memory virtual addresses (RMVA) and remote memory physical addresses (RMPA). The address sent by the client host may be a physical memory address, an IDE protocol address, or similar; the host-side communication interface converts it to an RMVA. After the communication unit 10 of the RMB receives it, the address mapping unit 202 performs the translation by looking up the TLB to obtain the RMPA. The RMPA corresponds to an actual physical address in the memory array unit 30; that is, it identifies which address of which memory module on which memory array board. On a TLB miss, an exception is raised and the embedded processor unit 203 walks the page table and fills the TLB.
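The lookup-then-fill behaviour described above can be sketched in software as follows. This is a minimal model, assuming a 4096-byte page, a small LRU-replaced TLB, and a page table held as a dictionary; the class name `AddressMapper`, the page size, and the replacement policy are illustrative assumptions, since the text does not specify them.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # assumed page size; not stated in the source

class AddressMapper:
    """Toy model of address mapping unit 202 with a page-table fallback."""

    def __init__(self, page_table, tlb_entries=4):
        # page_table: virtual page -> (board, module, physical page)
        self.page_table = page_table
        self.tlb = OrderedDict()
        self.tlb_entries = tlb_entries
        self.misses = 0

    def translate(self, rmva):
        vpage, offset = divmod(rmva, PAGE_SIZE)
        if vpage in self.tlb:
            self.tlb.move_to_end(vpage)              # refresh LRU position on a hit
        else:
            # TLB miss: stands in for the embedded processor walking the page table.
            self.misses += 1
            self.tlb[vpage] = self.page_table[vpage]  # KeyError if the page is unmapped
            if len(self.tlb) > self.tlb_entries:
                self.tlb.popitem(last=False)         # evict the least recently used entry
        board, module, ppage = self.tlb[vpage]
        return board, module, ppage * PAGE_SIZE + offset
```

The returned triple mirrors the statement that an RMPA identifies a board, a memory module on that board, and an address within the module.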
In the above technical solution, the embedded processor unit 203 is a group of high-performance embedded processors running an embedded operating system.
In the above technical solution, the local memory unit 204 provides storage for the embedded processor unit, holding code and data such as the embedded operating system, the storage management program, the data prefetching and preprocessing management program, the interconnection network arbitration and control program, and the page tables.
In the above technical solution, the operation decomposition and control unit 205 decomposes the various complex operation types requested in RMP into a series of memory reads, memory writes, and processing steps. For example, a memory copy is decomposed into two steps: a memory read and a memory write. When the requested data volume is large or memory accesses conflict, the operation must also be carried out in batches.
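As an illustration of this decomposition, the sketch below splits a memory copy into alternating read and write steps and batches them when the request exceeds a chosen size. The function name `decompose_copy`, the tuple format of a step, and the default batch size are assumptions for the example only.

```python
def decompose_copy(src, dst, length, batch=4096):
    """Split a memory copy into (operation, address, size) steps, one batch at a time."""
    steps = []
    done = 0
    while done < length:
        n = min(batch, length - done)           # size of this batch
        steps.append(("read", src + done, n))   # read one batch from the source
        steps.append(("write", dst + done, n))  # then write it to the destination
        done += n
    return steps

# A 10-byte copy with 4-byte batches yields three read/write pairs.
for step in decompose_copy(0, 100, 10, batch=4):
    print(step)
```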
In the above technical solution, the memory access control unit 206 packs memory read and write requests and sends them to the memory array unit 30, and unpacks the returned packets. The protocol between it and the memory array unit is the Memory Array Access Protocol (MAAP), which must support read and write operations on large data streams.
In the above technical solution, the switch scheduling unit 207 is responsible for the mapping between the communication unit 10 and the memory array unit 30. As shown in Fig. 3, the communication unit 10 corresponds to the RMB communication interface unit 201, address mapping unit 202, operation decomposition and control unit 205, and memory access control unit 206 in the control unit 20 either one-to-one or many-to-many. Once address mapping completes, the address requested by the client host has been mapped to an address in the memory array, and a high-speed data path must be set up between them. The switch scheduling unit 207 schedules the high-speed interconnect network unit 208 according to the mapping table.
In the above technical solution, the high-speed interconnect network unit 208 connects the communication unit 10 and the memory array unit 30, implements the many-to-many mapping between them, and establishes high-speed data paths. After the switch scheduling unit 207 completes scheduling, the memory access control unit 206 can send and receive data.
In the above technical solution, the switch scheduling unit 207 and the high-speed interconnect network unit 208 form a non-blocking asynchronous switching network. Asynchronous scheduling keeps the interconnection network structure simple, requires no segmentation/reassembly or internal buffering, and can realize flexible, complex, high-density switching networks and high-speed data paths.
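One way to picture the mapping table kept by the switch scheduling unit is a table of currently established paths, keyed by memory-side port. The sketch below is an assumption-laden model, not the disclosed design: the class `SwitchScheduler`, the port names, and the rule that a memory port serves one communication port at a time are all introduced here for illustration.

```python
class SwitchScheduler:
    """Toy mapping table: memory-array port -> communication port holding the path."""

    def __init__(self):
        self.paths = {}

    def connect(self, comm_port, mem_port):
        """Try to set up a data path; return False if the memory port is busy."""
        owner = self.paths.get(mem_port)
        if owner is not None and owner != comm_port:
            return False                  # conflict: the caller must wait and retry
        self.paths[mem_port] = comm_port
        return True

    def release(self, comm_port, mem_port):
        """Tear down the path once the transfer has finished."""
        if self.paths.get(mem_port) == comm_port:
            del self.paths[mem_port]
```

In this model the memory access control unit would call `connect` before transferring data and `release` afterwards; a refused connection corresponds to the interconnect conflicts that the storage management part is said to resolve.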
In the above technical solution, the memory array interface unit 301 is connected to the control unit 20 through the RMB interconnect protocol and is responsible for sending and receiving MAAP packets.
In the above technical solution, the memory interface conversion unit 302 provides the user interface required by the memory controller array unit 303 and converts data and control signals to and from the memory array interface unit 301.
In the above technical solution, the memory controller array unit 303 consists of the controllers for the specific memory modules and controls the memory modules directly. Multiple controllers may form an array.
In the above technical solution, the memory module array unit 304 provides a large-capacity memory array and can use DDR RAM, SDRAM, or other types of memory modules.
A method for implementing a remote memory server comprises the following steps:
(1) The client host sends a memory access request packet, which may be any remote memory operation type. The communication interface between the client host and the RMB may be Ethernet, PCI Express, or another type.
(2) The communication interface unit 101 receives the packet and passes it to the interface conversion unit 102, which recovers the Remote Memory access Protocol (RMP) packet and sends it to the control unit 20 via the RMB interconnect interface unit 103.
(3) The RMB communication interface unit 201 in the control unit 20 unpacks the RMP packet.
(4) The address mapping unit 202 looks up the TLB using the address in the RMP packet to find the corresponding address in the memory array. On a TLB miss, the embedded processor unit 203 performs storage management, allocates or finds the corresponding address, and fills the TLB.
(5) The operation decomposition and control unit 205 decomposes a complex operation request into simple steps according to the access operation type in the packet and controls their execution.
(6) The switch scheduling unit 207 schedules the high-speed interconnect network unit 208 according to the address mapping.
(7) The memory access control unit 206 generates a MAAP packet and sends it to the memory array unit 30.
(8) The memory array interface unit 301 in the memory array unit 30 unpacks the MAAP packet, and the interface conversion unit 302 converts it into input for the memory controller array unit 303.
(9) The memory controller array unit 303 controls reads and writes of the memory module array unit 304.
(10) The reply is packed into a MAAP packet by the interface conversion unit 302 and the memory array interface unit 301 and sent to the control unit 20.
(11) The MAAP packet passes through the high-speed interconnect network 208 and is received and unpacked by the memory access control unit 206.
(12) The operation decomposition and control unit 205 initiates the next simple step from the decomposition in step (5). If a reply or data must be returned to the client host, it directs the RMB communication interface unit 201 to generate an RMP response packet and send it to the communication unit 10.
(13) The communication unit 10 sends the RMP packet to the client host.
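The thirteen steps above can be condensed into a small software model of the request path. The sketch below collapses the hardware units into one class and is only meant to show the order of unpacking, address translation with allocate-on-miss, memory access, and reply; the class `RemoteMemoryBox`, the request tuple format, and the page size are assumptions, and the interconnect scheduling of steps (6) to (11) is not modelled.

```python
PAGE = 4096  # assumed page size

class RemoteMemoryBox:
    """Toy end-to-end model of an RMB handling read and write requests."""

    def __init__(self, num_pages):
        self.memory = bytearray(num_pages * PAGE)  # stands in for memory array unit 30
        self.page_table = {}                       # virtual page -> physical page
        self.free_pages = list(range(num_pages))

    def _translate(self, rmva):
        # Step (4): look up the mapping; on a miss, allocate a page and record it.
        vpage, offset = divmod(rmva, PAGE)
        if vpage not in self.page_table:
            self.page_table[vpage] = self.free_pages.pop(0)
        return self.page_table[vpage] * PAGE + offset

    def handle(self, request):
        # Step (3): unpack the request. Accesses are assumed not to cross a page.
        op, rmva, arg = request
        rmpa = self._translate(rmva)
        if op == "write":                          # steps (7) to (9): write to the array
            self.memory[rmpa:rmpa + len(arg)] = arg
            return ("ack", len(arg))               # steps (12) and (13): reply to client
        if op == "read":
            return ("data", bytes(self.memory[rmpa:rmpa + arg]))
        raise ValueError("unsupported operation: " + op)

box = RemoteMemoryBox(num_pages=2)
box.handle(("write", 8192 + 10, b"abc"))
print(box.handle(("read", 8192 + 10, 3)))
```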
In the present invention, serial differential links and a dedicated lightweight point-to-point protocol connect the communication unit, control unit, and memory array unit inside the RMB, which makes the RMB highly scalable. The asynchronous serial differential high-speed interconnect network guarantees high internal interconnect bandwidth even at large scale. Replacing the communication interface unit and the communication interface conversion unit allows different communication protocols to be used (such as Gigabit Ethernet, PCI Express, or InfiniBand). Replacing the memory interface conversion unit and the memory controller in the memory array unit allows different types of memory modules to be used (DDR RAM, SDRAM, and so on). The embedded processor unit runs the storage management program and the latency-hiding algorithms. The high-speed TLB performs fast lookups between remote memory virtual addresses and remote memory physical addresses, and the switch scheduling unit completes scheduling quickly, which reduces the latency of the whole system. In summary, the RMB achieves high scalability, high bandwidth, low latency, and intelligence.
Typical application scenarios of the Remote Memory Box of the present invention include:
1. Directly extending local memory: local memory and remote memory share a unified address space that is transparent to applications, and an efficient latency-hiding mechanism reduces the impact of remote memory latency.
2. Extending distributed shared memory (DSM): in existing DSM mechanisms, the memory of a remote node is accessed through network interfaces and the interconnection network, and the network overhead between nodes is very large. If an RMB is used as the shared memory device, many memory operations can be carried out inside the RMB, which reduces network overhead.
3. Replacing disk as the swap device: when a host runs short of physical memory, it uses the disk as a swap device and swaps part of the memory contents out to serve as virtual memory. Disk access latency, however, is on the order of milliseconds or more, so the overhead is very large and performance suffers badly. The RMB can provide better latency and bandwidth and is not limited by local disk capacity. If a conversion interface between IDE and the RMB is implemented on the host side, the operating system need not be modified.
4. Serving as a checkpoint device: current mechanisms save checkpoint data to a local or remote disk, which is slow and expensive. Checkpointing to an RMB gives better latency and bandwidth than disk. Moreover, when a node fails, the checkpoint data in the RMB can be bound to another node, which can then take over. Asynchronous saving is also possible: the RMB provides a fast save, and the data is then written asynchronously to non-volatile storage such as disk.
Compared with the prior art, the beneficial effects of the present invention are as follows:
The remote memory server of the present invention separates out the memory that would otherwise be distributed across individual hosts and, in the form of a memory device or memory server, provides a remote memory service to multiple hosts. Memory modules are bound to clients dynamically and are allocated and released on demand, which achieves a high degree of sharing, lowers the idle rate, and reduces wasted resources. Because resources are managed centrally, rapid replacement, recovery, and management after a fault are also easier. In addition, the RMB creates conditions for improving some application models. The advantages of the present invention can be summarized as follows:
(1) High degree of sharing: the RMB serves multiple client hosts, and memory modules are bound to clients dynamically, which achieves a high degree of sharing and provides clients with massive memory; the memory available to a client is whatever the server can provide.
(2) Scalable large-capacity memory: the RMB scales well and can support capacities from tens of GB to tens of TB. By replacing the memory controller, it can also support different types of memory modules such as DDR RAM and SDRAM.
(3) Scalable client host ports: the RMB can support a larger number of client hosts, as well as different types of underlying communication interface (Ethernet, InfiniBand, IDE, PCI, and so on) and communication protocol. The dedicated message-layer protocol, the Remote Memory access Protocol (RMP), supports large-capacity memory access and is independent of the underlying communication link.
(4) High-speed communication: the RMB supports a high-bandwidth, low-latency, reliable communication mechanism. A single communication interface can support a bandwidth of more than 1 Gbps, and multi-channel aggregation can reach 10 Gbps.
(5) Transparency: for some applications, such as swap and checkpointing, the RMB is transparent to the operating system, so operating systems such as Windows can be used.
(6) Dynamic latency hiding: using the embedded processor, associated software, and internal loopback, the RMB supports a user-configurable dynamic latency-hiding mechanism that includes prefetch policies, data preprocessing methods, and data migration. An efficient, configurable prefetch policy can reduce the added latency caused by moving memory farther from the processor. Data preprocessing can complete some memory operations inside the RMB without involving the client host. Data migration inside the RMB not only avoids the overhead of transferring data between clients over communication links but can also improve system performance.
(7) Effective fault tolerance and management: the memory used by a failed node can be moved to another node by remapping. Other fault-tolerance features can also be added inside the RMB; for example, an uninterruptible power supply can be placed in the box to support exporting data to large-capacity non-volatile storage.
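Advantage (6) mentions configurable prefetch policies without describing one. As a hedged illustration only, the sketch below shows a simple sequential prefetcher on the client side: on every read it also fetches the next few pages so that later sequential reads hit locally. The class `SequentialPrefetcher`, the `depth` parameter, and the dictionary standing in for the RMB are assumptions and do not come from the source.

```python
class SequentialPrefetcher:
    """Toy sequential prefetch policy in front of a remote page store."""

    def __init__(self, backing, depth=2):
        self.backing = backing   # page number -> bytes; stands in for the RMB
        self.depth = depth       # how many following pages to prefetch
        self.cache = {}          # pages already held locally

    def _fetch(self, page):
        # Copy a page into the local cache if it exists remotely and is not cached yet.
        if page in self.backing and page not in self.cache:
            self.cache[page] = self.backing[page]

    def read(self, page):
        """Return (data, hit) where hit tells whether the page was already cached."""
        hit = page in self.cache
        if not hit:
            self._fetch(page)
        for nxt in range(page + 1, page + 1 + self.depth):
            self._fetch(nxt)     # prefetch the following pages
        return self.cache[page], hit
```

With a depth of two, a sequential scan pays the remote latency only on its first read in this model; a different access pattern would call for a different policy, which is presumably why the text stresses configurability.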
Description of drawings
Fig. 1 is the overall structural diagram of the remote memory server of the present invention.
Fig. 2 is the structural diagram of the communication unit of the present invention.
Fig. 3 is the structural diagram of the control unit of the present invention.
Fig. 4 is the structural diagram of the memory array unit of the present invention.
Embodiment
The present invention is described in further detail below with reference to the drawings and specific embodiments:
As shown in Fig. 1, the remote memory server (RMB) consists of a communication unit 10, a control unit 20, and a memory array unit 30, which are interconnected through a backplane. The RMB and the client host are interconnected by high-speed communication (Gigabit Ethernet, InfiniBand, PCI Express, and so on); the upper layer is the Remote Memory access Protocol (RMP), and protocols such as TCP/IP may sit in between.
The units inside the RMB are connected over serial differential signal pairs on the backplane using the RMB Interconnect Protocol (RMIP). The message-layer protocol between the communication unit 10 and the control unit 20 is RMP, and the message-layer protocol between the control unit 20 and the memory array unit is MAAP. The high-speed interconnect network can use asynchronous serial switch chips; the switch scheduling unit in the control unit 20 schedules them asynchronously to form a high-speed data path, after which communication proceeds.
The client host accesses the RMB through the Remote Memory access Protocol (RMP). RMP can serve directly as the message-layer protocol, or run on top of TCP/IP or a similar protocol as an application-layer protocol. It is independent of the underlying communication protocol and uses 48-bit or 64-bit data to support large addresses and bulk data transfer. It also supports a variety of active remote memory operations, including memory copy/move/set, fetch-and-set, migration, memory consistency, and user-defined operations.
The RMB Interconnect Protocol (RMIP) is the RMB-internal interconnect protocol: a lightweight, point-to-point physical and link layer protocol for chip-to-chip and board-to-board interconnection. It uses serial differential signaling and therefore offers high bandwidth, is insensitive to placement and routing, and is easy to scale.
The RMB interconnect protocol is used not only to connect the communication unit 10 and the control unit 20 but also to interconnect the control unit 20 and the memory array unit 30. The low-cost scalability of the system shows in two respects: (1) this interconnect makes it easy to add more communication units and memory units, and the control unit 20 can also be cascaded through it; (2) replacing the communication interface unit 101 and the communication interface conversion unit 102 is enough to adapt to a different client communication interface.
As shown in Fig. 2, the communication unit 10 communicates with the client host through the communication interface unit 101, which can be implemented with a commercial communication interface chip. The communication interface conversion unit 102 performs interface conversion between the communication interface unit 101 and the RMB communication interconnect interface unit 103; that is, it converts the protocol supported by the communication interface into the RMB-internal interconnect protocol. In the receive direction, the communication interface conversion unit 102 recovers RMP packets from the signals of the communication interface unit 101 and passes them to the RMB communication interconnect interface unit 103, which sends them to the control unit; the transmit direction is the reverse.
As shown in Figure 3, control module 20 is implemented with multiple FPGAs and performs the core control functions, including memory management, latency hiding, decomposition and control of remote memory operations, and interconnect network scheduling. RMB communication interface unit 201 connects to communication unit 10 through RMIP; it unpacks the RMP packets arriving from communication unit 10 and packs data from memory array unit 30 into RMP packets for communication unit 10. High-speed interconnect network unit 208 can be implemented separately with an FPGA or a commercial chip. Local memory unit 204 can be implemented with on-chip memory or external RAM. The control module's interface to communication unit 10 is interconnected through RMIP, with RMP as the upper-layer protocol, and RMB communication interface unit 201 handles the packing and unpacking of RMP packets.
Address mapping unit 202 looks up the TLB to perform address mapping; on a miss, it sends an interrupt to the embedded processor unit. Embedded processor unit 203 is a group of high-performance embedded processors running an embedded operating system. It has three main functions: first, memory management, that is, establishing and removing mappings between clients and memory, searching page tables, filling the TLB, and so on; second, running flexible data prefetching and data preprocessing policies; third, part of the scheduling and control of the interconnect network. Embedded processor unit 203 can use processors embedded in the FPGA or external processors, and it works with local memory unit 204 to carry out memory management, latency-hiding algorithms, and similar functions.
Operation decomposition and control module 205 breaks a complex operation request into several steps and controls their execution until the operation completes; for example, a memory copy must be decomposed into two steps, a memory read and a memory write. When the requested data volume is large, or when memory accesses conflict, the operation must also be carried out in batches.
Memory access control module 206 is responsible for packing and unpacking MAAP packets: it packs memory read and memory write requests and sends them to memory array unit 30, and it unpacks the packets that come back. The protocol between it and the memory array unit is the Memory Array Access Protocol (MAAP), which supports read and write operations on large data streams.
Switch scheduling unit 207 manages a mapping table and is responsible for the mapping between communication unit 10 and memory array unit 30. Communication unit 10 and control module 20 are connected through the high-speed interconnect network. Once address mapping is complete, a mapping exists between the address requested by the client host and the address in the memory array, and a high-speed data path must be set up between them. Switch scheduling unit 207 schedules high-speed interconnect network unit 208 to select the data path. In this embodiment, communication unit 10 corresponds one-to-one with RMB communication interface unit 201, address mapping unit 202, operation decomposition and control module 205, and memory access control module 206 in control module 20.
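The division of labor between address mapping unit 202 (TLB lookup) and embedded processor unit 203 (miss handling, page-table search, TLB fill) can be sketched in software as follows. This is only a model under stated assumptions: the 4096-byte page size, the FIFO eviction policy, the "next free frame" allocation, and the names `AddressMapper`, `translate`, and `_handle_miss` are all hypothetical and are not taken from the patent, which implements these functions in hardware and embedded firmware.

```python
PAGE_SIZE = 4096  # assumed page size; the text does not specify one


class AddressMapper:
    """Software model of a TLB lookup backed by a page table."""

    def __init__(self, page_table, capacity=4):
        # page_table maps (client_id, virtual page) -> physical frame and
        # stands in for the tables kept by the embedded processor.
        self.page_table = page_table
        self.capacity = capacity
        self.tlb = {}  # insertion-ordered, used here as a FIFO
        self.misses = 0
        self.next_frame = max(page_table.values(), default=-1) + 1

    def translate(self, client_id, vaddr):
        """Return the memory-array address for a client virtual address."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        key = (client_id, vpn)
        if key not in self.tlb:
            self.misses += 1
            self._handle_miss(key)  # models the interrupt to unit 203
        return self.tlb[key] * PAGE_SIZE + offset

    def _handle_miss(self, key):
        """Find or allocate a frame, then fill the TLB (FIFO eviction)."""
        if key not in self.page_table:
            self.page_table[key] = self.next_frame
            self.next_frame += 1
        if len(self.tlb) >= self.capacity:
            self.tlb.pop(next(iter(self.tlb)))
        self.tlb[key] = self.page_table[key]
```

A second access to the same page hits the TLB and does not involve the miss handler, which is the point of keeping the lookup in unit 202 and reserving the processor for the slow path.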
As shown in Figure 3, high-speed interconnect network unit 208 sits between memory access control module 206 and memory array unit 30. The mapping between communication unit 10 and memory array unit 30 is realized here. Unit 208 is scheduled asynchronously and does not need to disassemble or reassemble packets; it only switches serial paths. It can therefore implement complex, flexible structures that satisfy the demands of dynamic binding.
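The mapping table managed by switch scheduling unit 207 can be pictured as a small crossbar controller that binds a communication-side port to a memory-side port and releases the binding later. The sketch below is an assumption-laden model: the class `SwitchScheduler`, its methods, and the rule that a memory-side port carries one path at a time are illustrative choices, not details given in the patent.

```python
class SwitchScheduler:
    """Toy mapping table: communication-side port -> memory-side port."""

    def __init__(self):
        self.paths = {}

    def bind(self, comm_port, mem_port):
        """Set up a path; refuse if another port already holds mem_port."""
        busy = mem_port in self.paths.values()
        if busy and self.paths.get(comm_port) != mem_port:
            return False
        self.paths[comm_port] = mem_port
        return True

    def release(self, comm_port):
        """Tear down the path for comm_port, if any."""
        self.paths.pop(comm_port, None)

    def route(self, comm_port):
        """Return the memory-side port currently bound, or None."""
        return self.paths.get(comm_port)
```

Since the switch only selects paths and never inspects packet contents, rebinding a port is just a table update, which is what makes dynamic binding cheap in this model.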
As shown in Figure 4, memory array unit 30 is interconnected with the control module through RMIP and MAAP (via the high-speed switching network); MAAP packing and unpacking are performed by memory array interface unit 301. Memory interface conversion unit 302 performs the interface conversion between MAAP requests and memory controller array unit 303. Units 302 and 301 can be implemented with FPGAs, which makes the logic easy to replace and simplifies building a product line that uses different memory modules. Memory module array unit 304 provides the large-capacity memory array and uses DDR RAM memory modules.
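One way to picture how a request is spread over an array of memory modules is a flat address space that is decoded into a module index and a local offset. The patent does not describe the decoding scheme, so the linear (non-interleaved) layout below, and the names `MemoryArray`, `read`, and `write`, are assumptions made purely for illustration.

```python
class MemoryArray:
    """Toy model of several equally sized memory modules behind one address space."""

    def __init__(self, module_count, module_size):
        self.module_size = module_size
        self.modules = [bytearray(module_size) for _ in range(module_count)]

    def _locate(self, addr):
        """Decode a flat address into (module index, offset inside the module)."""
        index, local = divmod(addr, self.module_size)
        if not 0 <= index < len(self.modules):
            raise IndexError("address outside the memory array")
        return index, local

    def write(self, addr, data):
        """Write bytes, crossing module boundaries where necessary."""
        for i, byte in enumerate(data):
            index, local = self._locate(addr + i)
            self.modules[index][local] = byte

    def read(self, addr, length):
        """Read bytes, crossing module boundaries where necessary."""
        out = bytearray()
        for i in range(length):
            index, local = self._locate(addr + i)
            out.append(self.modules[index][local])
        return bytes(out)
```

Swapping the module type would only change what sits behind `modules`, which mirrors the text's argument for keeping the interface conversion logic replaceable.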
With reference to Figures 1, 2, 3, and 4, the flow of a client accessing the RMB is as follows:
(1) The client host sends a memory access request packet, which may be any remote memory operation type; the communication interface between the client host and the RMB may be Ethernet, PCI-Express, or another type;
(2) Communication interface unit 101 receives the packet and hands it to communication interface conversion unit 102, which recovers the remote memory access protocol (RMP) packet and sends it to control module 20 via RMB communication interconnect interface unit 103;
(3) RMB communication interface unit 201 in control module 20 unpacks the RMP packet;
(4) Address mapping unit 202 looks up the TLB using the address in the RMP packet and finds the corresponding address in the memory array; on a TLB miss, the request is handed to embedded processor unit 203, which performs memory management, allocates or looks up the corresponding address, and fills the TLB;
(5) Operation decomposition and control module 205 decomposes a complex operation request into simple steps according to the operation type in the packet and controls the execution of these steps; for example, a memory copy can be decomposed into two simple steps, a data read and a data write;
(6) Switch scheduling unit 207 schedules high-speed interconnect network unit 208 according to the address mapping;
(7) Memory access control module 206 generates a MAAP packet and sends it to memory array unit 30;
(8) Memory array interface unit 301 in memory array unit 30 unpacks the MAAP packet, and memory interface conversion unit 302 converts it into inputs for memory controller array unit 303;
(9) Memory controller array unit 303 controls the reads and writes of memory module array unit 304;
(10) The reply is packed into a MAAP packet by memory interface conversion unit 302 and memory array interface unit 301 and sent to control module 20;
(11) The MAAP packet passes through high-speed interconnect network unit 208 and is received and unpacked by memory access control module 206;
(12) Operation decomposition and control module 205 initiates the next simple step produced by the decomposition in step (5); if a reply or data must be returned to the client host, it directs RMB communication interface unit 201 to generate an RMP reply packet and send it to communication unit 10;
(13) Communication unit 10 sends the RMP packet to the client host.
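Steps (5) through (12) describe a loop in which a complex request is split into simple read and write steps that are issued one after another. A minimal sketch of that idea for a memory copy, including the batching mentioned for large requests, is shown below. The function names, the `max_batch` parameter, and the use of a `bytearray` as the memory array are assumptions for illustration, and overlapping source and destination ranges are not handled.

```python
def decompose_copy(src, dst, length, max_batch=4096):
    """Split a memory copy into alternating read/write steps of bounded size."""
    if length < 0 or max_batch <= 0:
        raise ValueError("length must be >= 0 and max_batch must be > 0")
    steps = []
    offset = 0
    while offset < length:
        n = min(max_batch, length - offset)
        steps.append(("read", src + offset, n))
        steps.append(("write", dst + offset, n))
        offset += n
    return steps


def run_copy(memory, src, dst, length, max_batch=4096):
    """Execute the decomposed steps against a bytearray standing in for the array."""
    buffer = b""
    for kind, addr, n in decompose_copy(src, dst, length, max_batch):
        if kind == "read":
            buffer = bytes(memory[addr:addr + n])
        else:
            memory[addr:addr + n] = buffer
```

In the flow above, each tuple would correspond to one MAAP request issued in step (7), with the next tuple issued only after the reply for the previous one arrives in step (11).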
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the invention may still be modified or equivalently substituted, and that any modification or partial substitution that does not depart from the spirit and scope of the invention shall fall within the scope of its claims.

Claims (10)

1. A remote memory server, composed of a communication unit (10), a control module (20), and a memory array unit (30), which are connected in sequence and interconnected through the RMB interconnect protocol; wherein,
the communication unit (10) comprises a communication interface unit (101), a communication interface conversion unit (102), and a communication interconnect interface unit (103), connected in sequence;
the control module (20) comprises an RMB communication interface unit (201) connected to an address mapping unit (202); an embedded processor unit (203) connected to a local memory unit (204); an operation decomposition and control module (205) connected respectively to the RMB communication interface unit (201), the address mapping unit (202), and a memory access control module (206); a switch scheduling unit (207) connected respectively to the embedded processor unit (203), the operation decomposition and control module (205), and a high-speed interconnect network unit (208); and the memory access control module (206) connected to the high-speed interconnect network unit (208); the control module (20) can be implemented with one or more FPGAs or ASICs;
the memory array unit (30) comprises a memory array interface unit (301), a memory interface conversion unit (302), a memory controller array unit (303), and a memory module array unit (304), connected in sequence.
2. The remote memory server according to claim 1, characterized in that the communication interface unit (101) is the communication port to the client host, and this port is a physical-layer and link-layer interface of the Gigabit Ethernet, InfiniBand, Myrinet, or PCI-Express type;
the communication interface conversion unit (102) performs the interface conversion between the communication interface unit (101) and the RMB communication interconnect interface unit (103);
the communication interconnect interface unit (103) is the interconnect interface between the communication unit (10) and the control module (20), and is connected to the RMB communication interface unit (201) in the control module (20).
3. The remote memory server according to claim 1, characterized in that the client host accesses the RMB through the remote memory access protocol; this protocol is used directly as a message-layer protocol, or runs on top of TCP/IP or a similar stack as an application-layer protocol; the width the protocol uses for address access and data transfer is set to 48 bits or more; and the protocol supports a variety of active remote memory operations, including memory copy/move/set, fetch-and-set, migration, memory coherence, and user-defined operations.
4. The remote memory server according to claim 1, characterized in that the RMB interconnect protocol is the internal interconnect protocol of the RMB; it is a lightweight, point-to-point physical-layer and link-layer protocol used for chip-to-chip and board-to-board interconnection; it uses serial differential signaling; and the RMB interconnect protocol is used to connect the communication unit (10) to the control module (20), or to interconnect the control module (20) and the memory array unit (30).
5. The remote memory server according to claim 1, characterized in that the communication interface conversion unit (102) can also implement protocol stack functions and strip off intermediate-layer protocols.
6. The remote memory server according to claim 1 or 4, characterized in that the RMB communication interface unit (201) is connected to the communication unit (10) through the RMB interconnect protocol, and is responsible for unpacking the RMB interconnect protocol packets that arrive from the communication unit (10), or for packing the data that arrives from the memory array unit (30) into RMB interconnect protocol format and sending it to the communication unit (10);
the address mapping unit (202) mainly comprises a translation lookaside buffer (TLB) and converts between remote memory virtual addresses and remote memory physical addresses;
the embedded processor unit (203) runs an embedded operating system;
the local memory unit (204) provides storage space for the embedded processor unit (203), holding the code and data of the embedded operating system, the memory management program, the data prefetching and preprocessing management program, and the interconnect network arbitration and control program, as well as data such as page tables;
the operation decomposition and control module (205) decomposes the various complex operation types requested in the RMB interconnect protocol into a series of memory read, memory write, and processing steps;
the memory access control module (206) packs memory read and memory write requests and sends them to the memory array unit (30), or unpacks the packets that are returned; the protocol between it and the memory array unit (30) is the memory array access protocol;
the switch scheduling unit (207) is responsible for the mapping between the communication unit (10) and the memory array unit (30); the communication unit (10) corresponds to the RMB communication interface unit (201), address mapping unit (202), operation decomposition and control module (205), and memory access control module (206) in the control module (20) either one-to-one or many-to-many; the switch scheduling unit (207) schedules the high-speed interconnect network unit (208) according to the mapping table;
the high-speed interconnect network unit (208) connects the communication unit (10) and the memory array unit (30), implements the many-to-many mapping between them, and sets up high-speed data paths;
the switch scheduling unit (207) and the high-speed interconnect network unit (208) together form a non-blocking asynchronous switching network.
7. The remote memory server according to claim 1, characterized in that the memory array interface unit (301) is connected to the control module (20) through the RMB interconnect protocol and is responsible for sending and receiving memory array access protocol packets;
the memory interface conversion unit (302) provides the user interface required by the memory controller array unit (303) and converts data and control signals to and from the memory array interface unit (301);
the memory controller array unit (303) consists of the controllers of the specific memory modules and directly controls the memory modules; multiple controllers can form an array;
the memory module array unit (304) is a memory array.
8. The remote memory server according to claim 7, characterized in that the memory module array unit (304) consists of memory modules of the DDR RAM or SDRAM type.
9. A method for implementing a remote memory server, comprising the following steps:
1) the client host sends a memory access request packet;
2) the communication interface unit (101) receives the packet and hands it to the communication interface conversion unit (102), which recovers the remote memory access protocol (RMP) packet and sends it to the control module (20) via the RMB communication interconnect interface unit (103);
3) the RMB communication interface unit (201) in the control module (20) unpacks the RMP packet;
4) the address mapping unit (202) looks up the TLB using the address in the RMP packet and finds the corresponding address in the memory array; on a TLB miss, the request is handed to the embedded processor unit (203), which performs memory management, allocates or looks up the corresponding address, and fills the TLB;
5) the operation decomposition and control module (205) decomposes a complex operation request into simple steps according to the operation type in the packet and controls the execution of these steps;
6) the switch scheduling unit (207) schedules the high-speed interconnect network unit (208) according to the address mapping;
7) the memory access control module (206) generates a MAAP packet and sends it to the memory array unit (30);
8) the memory array interface unit (301) in the memory array unit (30) unpacks the MAAP packet, and the memory interface conversion unit (302) converts it into inputs for the memory controller array unit (303);
9) the memory controller array unit (303) controls the reads and writes of the memory module array unit (304);
10) the reply is packed into a MAAP packet by the memory interface conversion unit (302) and the memory array interface unit (301) and sent to the control module (20);
11) the MAAP packet passes through the high-speed interconnect network unit (208) and is received and unpacked by the memory access control module (206);
12) the operation decomposition and control module (205) initiates the next simple step produced by the decomposition in step 5); if a reply or data must be returned to the client host, it directs the RMB communication interface unit (201) to generate an RMP reply packet and send it to the communication unit (10);
13) the communication unit (10) sends the RMP packet to the client host.
10. The method for implementing a remote memory server according to claim 9, characterized in that the communication interface between the client host and the RMB is of the Ethernet, InfiniBand, or PCI-Express type.
CNB2005100983901A 2005-09-09 2005-09-09 Long-distance inner server and its implementing method Active CN100437522C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2005100983901A CN100437522C (en) 2005-09-09 2005-09-09 Long-distance inner server and its implementing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2005100983901A CN100437522C (en) 2005-09-09 2005-09-09 Long-distance inner server and its implementing method

Publications (2)

Publication Number Publication Date
CN1928839A true CN1928839A (en) 2007-03-14
CN100437522C CN100437522C (en) 2008-11-26

Family

ID=37858810

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2005100983901A Active CN100437522C (en) 2005-09-09 2005-09-09 Long-distance inner server and its implementing method

Country Status (1)

Country Link
CN (1) CN100437522C (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6738872B2 (en) * 2000-12-22 2004-05-18 International Business Machines Corporation Clustered computer system with deadlock avoidance
WO2002065275A1 (en) * 2001-01-11 2002-08-22 Yottayotta, Inc. Storage virtualization system and methods
US6795850B2 (en) * 2002-12-13 2004-09-21 Sun Microsystems, Inc. System and method for sharing memory among multiple storage device controllers
CN1280735C (en) * 2003-12-04 2006-10-18 中国科学院计算技术研究所 Initiator triggered remote memory access virtual-physical address conversion method
CN100471112C (en) * 2004-03-26 2009-03-18 清华大学 Memory-network memory-magnetic disc high speed reliable storage system and its reading/writing method

Cited By (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101841438A (en) * 2010-04-02 2010-09-22 中国科学院计算技术研究所 Method or system for accessing and storing stream records of massive concurrent TCP streams
CN101841438B (en) * 2010-04-02 2011-10-05 中国科学院计算技术研究所 Method or system for accessing and storing stream records of massive concurrent TCP streams
US9742842B2 (en) 2011-09-20 2017-08-22 Empire Technology Development Llc Peer-to-peer data migration
CN103797473A (en) * 2011-09-20 2014-05-14 英派尔科技开发有限公司 Peer-to-peer data migration
CN103797473B (en) * 2011-09-20 2017-05-24 英派尔科技开发有限公司 Peer-to-peer data migration
WO2013107393A1 (en) * 2012-01-18 2013-07-25 华为技术有限公司 Message-based memory access device and access method thereof
US9870327B2 (en) 2012-01-18 2018-01-16 Huawei Technologies Co., Ltd. Message-based memory access apparatus and access method thereof
CN103544123A (en) * 2012-07-16 2014-01-29 深圳市中兴微电子技术有限公司 SDRAM controller and access method for SDRAM memory space
CN103634277B (en) * 2012-08-23 2019-02-05 深圳市腾讯计算机系统有限公司 A kind of method of shared drive, server and system
CN103634277A (en) * 2012-08-23 2014-03-12 深圳市腾讯计算机系统有限公司 Memory sharing method, server and system
CN102854942B (en) * 2012-09-07 2016-06-01 赵丰年 A kind of internal memory array
CN102854942A (en) * 2012-09-07 2013-01-02 朱龙飞 Internal memory array
CN103902469A (en) * 2012-12-25 2014-07-02 华为技术有限公司 Data prefetching method and system
CN103902469B (en) * 2012-12-25 2017-03-15 华为技术有限公司 A kind of method and system of data pre-fetching
WO2015043379A1 (en) * 2013-09-29 2015-04-02 华为技术有限公司 Memory accessing method and device
CN104516822A (en) * 2013-09-29 2015-04-15 华为技术有限公司 Memory access method and device
CN104516822B (en) * 2013-09-29 2018-01-23 华为技术有限公司 A kind of memory pool access method and equipment
CN104156332A (en) * 2014-08-11 2014-11-19 济南曼维信息科技有限公司 High-performance parallel computing method based on external PCI-E connection
CN104156332B (en) * 2014-08-11 2017-02-15 济南曼维信息科技有限公司 High-performance parallel computing method based on external PCI-E connection
CN104216835A (en) * 2014-08-25 2014-12-17 杨立群 Method and device for implementing memory fusion
CN104216835B (en) * 2014-08-25 2017-04-05 杨立群 A kind of method and device for realizing internal memory fusion
CN105701020B (en) * 2014-11-28 2018-11-30 华为技术有限公司 A kind of method of internal storage access, relevant apparatus and system
CN105701020A (en) * 2014-11-28 2016-06-22 华为技术有限公司 Memory access method, related apparatus and system
CN105701023A (en) * 2014-12-14 2016-06-22 上海兆芯集成电路有限公司 Cache replacement policy that considers memory access type
CN105701023B (en) * 2014-12-14 2019-04-16 上海兆芯集成电路有限公司 In view of the cache replacement policy of memory access type
CN104731531A (en) * 2015-03-24 2015-06-24 浪潮集团有限公司 Server node architecture design method for separated type high-capacity memory
CN104731531B (en) * 2015-03-24 2018-01-02 浪潮集团有限公司 A kind of server node architecture design method of separate type high power capacity internal memory
CN106155910B (en) * 2015-03-27 2021-02-12 华为技术有限公司 Method, device and system for realizing memory access
CN106155910A (en) * 2015-03-27 2016-11-23 华为技术有限公司 A kind of methods, devices and systems realizing internal storage access
CN108139967A (en) * 2015-10-09 2018-06-08 华为技术有限公司 Stream compression is changed to array
CN108139967B (en) * 2015-10-09 2021-07-20 华为技术有限公司 Converting a data stream into an array
CN108123984A (en) * 2016-11-30 2018-06-05 天津易遨在线科技有限公司 A kind of memory database optimizes server cluster framework
CN106844048A (en) * 2017-01-13 2017-06-13 上海交通大学 Distributed shared memory method and system based on ardware feature
CN106844048B (en) * 2017-01-13 2020-11-06 上海交通大学 Distributed memory sharing method and system based on hardware characteristics
CN109739928A (en) * 2018-12-14 2019-05-10 深圳壹账通智能科技有限公司 Data export method, device, computer equipment and storage medium
CN112667561A (en) * 2020-12-29 2021-04-16 成都旋极历通信息技术有限公司 Implementation mode for realizing UFS array controller in FPGA
CN113722110A (en) * 2021-11-02 2021-11-30 阿里云计算有限公司 Computer system, memory access method and device

Also Published As

Publication number Publication date
CN100437522C (en) 2008-11-26

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: HUAWEI TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: INSTITUTE OF COMPUTING TECHNOLOGY, CHINESE ACADEMY OF SCIENCES

Effective date: 20130530

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 100080 HAIDIAN, BEIJING TO: 518129 SHENZHEN, GUANGDONG PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20130530

Address after: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Patentee after: Huawei Technologies Co., Ltd.

Address before: 100080 Haidian District, Zhongguancun Academy of Sciences, South Road, No. 6, No.

Patentee before: Institute of Computing Technology, Chinese Academy of Sciences

TR01 Transfer of patent right

Effective date of registration: 20211227

Address after: 450046 Floor 9, building 1, Zhengshang Boya Plaza, Longzihu wisdom Island, Zhengdong New Area, Zhengzhou City, Henan Province

Patentee after: Super fusion Digital Technology Co.,Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Patentee before: HUAWEI TECHNOLOGIES Co.,Ltd.

TR01 Transfer of patent right