CN100442256C - Method, system, and storage medium for providing queue pairs for I/O adapters - Google Patents

Method, system, and storage medium for providing queue pairs for I/O adapters Download PDF

Info

Publication number
CN100442256C
CN100442256C · CNB2005101246118A · CN200510124611A
Authority
CN
China
Prior art keywords
queue
message
adapter
send queue
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2005101246118A
Other languages
Chinese (zh)
Other versions
CN1815458A (en)
Inventor
David F. Craddock
Thomas A. Gregg
Kevin J. Reilly
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US10/985,460 external-priority patent/US8055818B2/en
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Publication of CN1815458A publication Critical patent/CN1815458A/en
Application granted granted Critical
Publication of CN100442256C publication Critical patent/CN100442256C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Computer And Data Communications (AREA)

Abstract

A low-latency queue pair (QP) is provided for I/O Adapters that eliminates the overhead associated with work queue elements (WQEs) and defines the mechanisms necessary to allow the placement of the message directly on the queue pair.

Description

Method, system, and storage medium for providing queue pairs for I/O adapters
Technical Field
The present disclosure relates generally to computer and processor architecture, input/output (I/O) processing, and operating systems, and more particularly to low-latency queue pairs (QPs) for I/O adapters.
Background Art
I/O adapters, such as remote direct memory access (RDMA) adapters, RDMA network interface cards (RNICs), or InfiniBand™ (IB) host channel adapters (HCAs), define queue pairs (QPs) for conveying messaging information from a software consumer to the adapter over a network fabric. Industry standards, such as the InfiniBand™ Architecture Specification available from the InfiniBand Trade Association and iWARP from the RDMA Consortium, specify that the information carried on a QP takes the form of work queue elements (WQEs) that carry control information pertaining to the message. In addition, one or more data descriptors point to the message data to be transmitted or to the location where a received message is to be placed.
Some QP applications, such as high-performance computing (HPC), need to reduce the latency incurred in sending a message from one compute node to another. The industry-standard mechanisms described above are, however, not well suited to high-performance computing systems. What is needed is a mechanism that enhances the standard QP semantics so that the lower latency required by these applications can be achieved while minimizing the impact on existing hardware.
Summary of the Invention
The present invention is directed to a system, method, and computer-readable medium for providing a low-latency queue pair that eliminates the overhead associated with work queue elements and defines the mechanisms necessary to allow a message to be placed directly on the queue pair.
One aspect is a system for providing queue pairs for an input/output (I/O) adapter, comprising main memory, an I/O adapter, and a processor. The main memory holds a send queue and a receive queue. The I/O adapter places messages received on a link into the receive queue, and transmits messages held in the send queue on the link. The processor is in communication with the main memory and the I/O adapter and executes a consumer process in the main memory. The consumer process accesses the send queue and the receive queue.
Another aspect is a method of providing queue pairs for an I/O adapter. The I/O adapter places messages received on a link into a receive queue. The I/O adapter transmits messages held in a send queue on the link. The receive queue and the send queue are in main memory. A consumer process accesses the send queue and the receive queue. The consumer process executes on a processor that is in communication with the main memory and the I/O adapter.
Another aspect is a computer-readable medium storing instructions for performing a method of providing queue pairs for an I/O adapter. The I/O adapter places messages received on a link into a receive queue. The I/O adapter transmits messages held in a send queue on the link. The receive queue and the send queue are in main memory. A consumer process accesses the send queue and the receive queue. The consumer process executes on a processor that is in communication with the main memory and the I/O adapter.
Brief Description of the Drawings
These and other features, aspects, and advantages of the present invention will be better understood from the following description, appended claims, and accompanying drawings, in which:
Fig. 1 is a diagram of a prior-art distributed computer system that forms an exemplary operating environment for embodiments of the present invention;
Fig. 2 is a diagram of a prior-art host channel adapter, part of the exemplary operating environment for embodiments of the present invention;
Fig. 3 is a diagram illustrating prior-art processing of work requests, part of the exemplary operating environment for embodiments of the present invention;
Fig. 4 is a diagram illustrating a portion of a prior-art distributed computer system in which a reliable connection service is used, that portion being part of the exemplary operating environment for embodiments of the present invention;
Fig. 5 is a diagram of a prior-art layered communication architecture used in the exemplary operating environment for embodiments of the present invention;
Fig. 6 is a block diagram of a prior-art standard queue pair structure; and
Fig. 7 is a block diagram of an exemplary embodiment of a low-latency queue pair.
Detailed Description
Exemplary embodiments of the present invention provide a low-latency queue pair that eliminates the overhead associated with work queue elements and defines the mechanisms necessary to allow a message to be placed directly on the queue pair. The exemplary embodiments are preferably implemented in a distributed computing system, such as a prior-art system area network (SAN) having end nodes, switches, routers, and links interconnecting these components. Figs. 1-5 show various parts of an exemplary operating environment for embodiments of the present invention. Fig. 6 shows a prior-art standard queue pair structure. Fig. 7 shows an exemplary embodiment of a low-latency queue pair.
Fig. 1 is a diagram of a distributed computer system. The distributed computer system represented in Fig. 1 takes the form of a system area network (SAN) 100 and is provided merely for illustrative purposes. The exemplary embodiments of the present invention described below can be implemented on computer systems of numerous other types and configurations. For example, computer systems implementing the exemplary embodiments range from a small server with one processor and a few input/output (I/O) adapters to massively parallel supercomputer systems with hundreds or thousands of processors and thousands of I/O adapters.
SAN 100 is a high-bandwidth, low-latency network interconnecting nodes within the distributed computer system. A node is any component attached to one or more links of a network that forms the origin and/or destination of messages within the network. In the depicted example, SAN 100 includes nodes in the form of host processor node 102, host processor node 104, redundant array of independent disks (RAID) subsystem node 106, and I/O chassis node 108. The nodes illustrated in Fig. 1 are for illustrative purposes only, as SAN 100 can connect any number and any type of independent processor nodes, I/O adapter nodes, and I/O device nodes. Any one of these nodes can function as an end node, which is herein defined to be a device that originates or finally consumes messages or frames in SAN 100.
In one exemplary embodiment, an error handling mechanism is present in the distributed computer system, which allows reliable connection or reliable datagram communication between end nodes in a distributed computing system, such as SAN 100.
A message, as used herein, is an application-defined unit of data exchange, which is a primitive unit of communication between cooperating processes. A packet is one unit of data encapsulated by networking protocol headers and/or trailers. The headers generally provide control and routing information for directing the frame through SAN 100. The trailer generally contains control and cyclic redundancy check (CRC) data for ensuring that packets are not delivered with corrupted contents.
SAN 100 contains the communications and management infrastructure supporting both I/O and interprocessor communications (IPC) within the distributed computer system. The SAN 100 shown in Fig. 1 includes a switched communications fabric 116, which allows many devices to concurrently transfer data with high bandwidth and low latency in a secure, remotely managed environment. End nodes can communicate over multiple ports and utilize multiple paths through the SAN fabric. The multiple ports and paths through the SAN shown in Fig. 1 can be employed for fault tolerance and increased-bandwidth data transfers.
The SAN 100 in Fig. 1 includes switch 112, switch 114, switch 146, and router 117. A switch is a device that connects multiple links together and allows routing of packets from one link to another using the small header destination local identifier (DLID) field. A router is a device that connects multiple subnets together and is capable of routing frames from one link in a first subnet to another link in a second subnet using the large header destination globally unique identifier (DGUID).
In one embodiment, a link is a full-duplex channel between any two network fabric elements, such as end nodes, switches, or routers. Example suitable links include, but are not limited to, copper cables, optical cables, and printed circuit copper traces on backplanes and printed circuit boards.
For reliable service types, end nodes, such as host processor end nodes and I/O adapter end nodes, generate request packets and return acknowledgment packets. Switches and routers pass packets along from the source to the destination. Except for the variant CRC trailer field, which is updated at each stage in the network, switches pass the packets along unmodified. Routers update the variant CRC trailer field and modify other fields in the header as the packet is routed.
In SAN 100 as illustrated in Fig. 1, host processor node 102, host processor node 104, and I/O chassis 108 include at least one channel adapter (CA) to interface to SAN 100. In one embodiment, each channel adapter is an endpoint that implements the channel adapter interface in sufficient detail to source or sink packets transmitted on SAN fabric 116. Host processor node 102 contains channel adapters in the form of host channel adapter 118 and host channel adapter 120. Host processor node 104 contains host channel adapter 122 and host channel adapter 124. Host processor node 102 also includes central processing units 126-130 and a memory 132 interconnected by a bus system 134. Host processor node 104 similarly includes central processing units 136-140 and a memory 142 interconnected by a bus system 144.
Host channel adapters 118 and 120 provide a connection to switch 112, while host channel adapters 122 and 124 provide a connection to switches 112 and 114.
In one embodiment, a host channel adapter is implemented in hardware. In this implementation, the host channel adapter hardware offloads much of the central processing unit and I/O adapter communication overhead. This hardware implementation of the host channel adapter also permits multiple concurrent communications over a switched network without the traditional overhead associated with communication protocols. In one embodiment, the host channel adapters and SAN 100 in Fig. 1 provide the I/O and interprocessor communications (IPC) consumers of the distributed computer system with zero-processor-copy data transfers without involving the operating system kernel process, and employ hardware to provide reliable, fault-tolerant communications.
As indicated in Fig. 1, router 117 is coupled to wide area network (WAN) and/or local area network (LAN) connections to other hosts or other routers. The I/O chassis 108 in Fig. 1 includes an I/O switch 146 and multiple I/O modules 148-156. In these examples, the I/O modules take the form of adapter cards. Example adapter cards illustrated in Fig. 1 include a SCSI adapter card for I/O module 148; an adapter card to fibre channel hubs and fibre channel arbitrated loop (FC-AL) devices for I/O module 152; an Ethernet adapter card for I/O module 150; a graphics adapter card for I/O module 154; and a video adapter card for I/O module 156. Any known type of adapter card can be implemented. I/O adapters also include a switch in the I/O adapter to couple the adapter cards to the SAN fabric. These modules contain target channel adapters 158-166.
In this example, the RAID subsystem node 106 in Fig. 1 includes a processor 168, a memory 170, a target channel adapter (TCA) 172, and multiple redundant and/or striped storage disk units 174. Target channel adapter 172 can be a fully functional host channel adapter.
SAN 100 handles data communications for I/O and interprocessor communications. SAN 100 supports the high bandwidth and scalability required for I/O, and also supports the extremely low latency and low CPU overhead required for interprocessor communications. User clients can bypass the operating system kernel process and directly access network communication hardware, such as host channel adapters, which enables efficient message-passing protocols. SAN 100 is suited to current computing models and is a building block for new forms of I/O and computer cluster communication. Further, SAN 100 in Fig. 1 allows I/O adapter nodes to communicate among themselves or to communicate with any or all of the processor nodes in the distributed computer system. With an I/O adapter attached to SAN 100, the resulting I/O adapter node has substantially the same communication capability as any host processor node in SAN 100.
In one embodiment, the SAN 100 shown in Fig. 1 supports channel semantics and memory semantics. Channel semantics is sometimes referred to as send/receive or push communication operations. Channel semantics is the type of communication employed in a traditional I/O channel, where a source device pushes data and a destination device determines the final destination of the data. In channel semantics, the packet transmitted from a source process specifies a destination process's communication port, but does not specify where in the destination process's memory space the packet will be written. Thus, in channel semantics, the destination process pre-allocates where to place the transmitted data.
In memory semantics, a source process directly reads or writes the virtual address space of a remote node's destination process. The remote destination process need only communicate the location of a buffer for data, and does not need to be involved in the transfer of any data. Thus, in memory semantics, a source process sends a data packet containing the destination buffer memory address of the destination process. In memory semantics, the destination process previously grants permission for the source process to access its memory.
Channel semantics and memory semantics are typically both necessary for I/O and interprocessor communications. A typical I/O operation employs a combination of channel and memory semantics. In an illustrative I/O operation of the distributed computer system shown in Fig. 1, a host processor node, such as host processor node 102, initiates an I/O operation by using channel semantics to send a disk write command to a disk I/O adapter, such as RAID subsystem target channel adapter (TCA) 172. The disk I/O adapter examines the command and uses memory semantics to read the data buffer directly from the memory space of the host processor node. After the data buffer is read, the disk I/O adapter employs channel semantics to push an I/O completion message back to the host processor node.
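The contrast between the two semantics can be sketched in a few lines of Python. This is a toy model under stated assumptions, not adapter code, and all names are hypothetical: a channel-semantic send targets a port and lets the receiver pick the buffer, while a memory-semantic (RDMA) write names a pre-granted address directly.

```python
# Toy model of channel vs. memory semantics (hypothetical names).

class DestProcess:
    def __init__(self):
        self.memory = {}          # simulated virtual address space
        self.posted_buffers = []  # receiver pre-allocates landing buffers
        self.granted = set()      # addresses the source may write (memory semantics)

    # Channel semantics: the receiver decides where the data lands.
    def post_receive(self, addr):
        self.posted_buffers.append(addr)

    def channel_send(self, data):
        addr = self.posted_buffers.pop(0)  # destination chose this in advance
        self.memory[addr] = data
        return addr

    # Memory semantics: the source names the address; access must be pre-granted.
    def rdma_write(self, addr, data):
        if addr not in self.granted:
            raise PermissionError("no access granted for this address")
        self.memory[addr] = data

dest = DestProcess()
dest.post_receive(0x1000)
landed = dest.channel_send(b"disk write cmd")   # source never names an address
dest.granted.add(0x2000)
dest.rdma_write(0x2000, b"payload")             # source names the address itself
```

In the disk-write example above, the write command travels with channel semantics (the adapter posted the receive buffer) while the data buffer is pulled with memory semantics (the host granted access in advance).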
In one exemplary embodiment, the distributed computer system shown in Fig. 1 performs operations that employ virtual addresses and virtual memory protection mechanisms to ensure correct and proper access to all memory. Applications running in such a distributed computer system are not required to use physical addressing for any operations.
Turning next to Fig. 2, a prior-art host channel adapter is depicted. Host channel adapter 200 shown in Fig. 2 includes a set of queue pairs (QPs) 202-210, which are used to transfer messages to the host channel adapter ports 212-216. Buffering of data to host channel adapter ports 212-216 is channeled through virtual lanes (VL) 218-234, where each VL has its own flow control. The subnet manager configures the channel adapter with the local address for each physical port, i.e., the port's LID. The subnet manager agent (SMA) 236 is the entity that communicates with the subnet manager for the purpose of configuring the channel adapter. Memory translation and protection (MTP) 238 is a mechanism that translates virtual addresses to physical addresses and validates access rights. Direct memory access (DMA) 240 provides for direct memory access operations using memory 242 with respect to queue pairs 202-210.
A single channel adapter, such as the host channel adapter 200 shown in Fig. 2, can support thousands of queue pairs. By contrast, a target channel adapter in an I/O adapter typically supports a much smaller number of queue pairs. Each queue pair consists of a send work queue (SWQ) and a receive work queue. The send work queue is used to send channel and memory semantic messages. The receive work queue receives channel semantic messages. A consumer calls an operating-system-specific programming interface, which is herein referred to as Verbs, to place work requests (WRs) onto a work queue.
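The queue pair and its Verbs-style post operations can be modeled roughly as follows. This is a structural sketch with invented names, not the actual Verbs API:

```python
from collections import deque

class QueuePair:
    """Minimal model of a QP: one send work queue, one receive work queue."""
    def __init__(self, qp_num):
        self.qp_num = qp_num
        self.send_wq = deque()   # SWQ: channel and memory semantic messages
        self.recv_wq = deque()   # RWQ: channel semantic messages only

    # Verbs-like entry points: a work request becomes a WQE on the queue.
    def post_send(self, work_request):
        self.send_wq.append(work_request)

    def post_recv(self, work_request):
        self.recv_wq.append(work_request)

qp = QueuePair(qp_num=4)
qp.post_send({"op": "SEND", "segments": [(0x5000, 256)]})   # (virtual addr, length)
qp.post_recv({"scatter": [(0x9000, 4096)]})
```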
With reference now to Fig. 3, a diagram illustrating prior-art processing of work requests is depicted. In Fig. 3, a receive work queue 300, a send work queue 302, and a completion queue 304 are present for processing requests from and for consumer 306. These requests from consumer 306 are eventually sent to hardware 308. In this example, consumer 306 generates work requests 310 and 312 and receives work completion 314. As shown in Fig. 3, work requests placed onto a work queue are referred to as work queue elements (WQEs).
Send work queue 302 contains work queue elements (WQEs) 322-328 describing data to be transmitted on the SAN fabric. Receive work queue 300 contains work queue elements (WQEs) 316-320 describing where to place incoming channel semantic data from the SAN fabric. A work queue element is processed by hardware 308 in the host channel adapter.
The Verbs interface also provides a mechanism for retrieving completed work from completion queue 304. As shown in Fig. 3, completion queue 304 contains completion queue elements (CQEs) 330-336. Completion queue elements contain information about previously completed work queue elements. Completion queue 304 is used to create a single point of completion notification for multiple queue pairs. A completion queue element is a data structure on a completion queue that describes a completed work queue element. The completion queue element contains sufficient information to determine the queue pair and the specific work queue element that completed. A completion queue context is a block of information containing pointers to, the length of, and other information needed to manage the individual completion queues.
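The completion path just described — one completion queue serving several QPs, each CQE naming the QP and WQE that finished — can be sketched like this (a simplified model with hypothetical structures):

```python
from collections import deque

completion_queue = deque()   # one CQ shared by multiple QPs

def hardware_complete(qp_num, wqe_id, status="OK"):
    # A CQE carries enough to identify the QP and the specific WQE.
    completion_queue.append({"qp": qp_num, "wqe": wqe_id, "status": status})

def poll_cq():
    # Verbs-style retrieval of completed work; None when the CQ is empty.
    return completion_queue.popleft() if completion_queue else None

hardware_complete(qp_num=4, wqe_id=328)   # completions from two different QPs
hardware_complete(qp_num=6, wqe_id=316)
first = poll_cq()
```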
Example work requests supported for the send work queue 302 shown in Fig. 3 are as follows. A send work request is a channel semantic operation to push a set of local data segments to the data segments referenced by a remote node's receive work queue element. For example, work queue element 328 contains references to data segment 4 338, data segment 5 340, and data segment 6 342. Each of the send work request's data segments contains a virtually contiguous memory space. The virtual addresses used to reference the local data segments are in the address context of the process that created the local queue pair.
In one embodiment, receive work queue 300 shown in Fig. 3 only supports one type of work queue element, which is referred to as a receive work queue element. The receive work queue element provides a channel semantic operation describing a local memory space into which incoming send messages are written. The receive work queue element includes a scatter list describing several virtually contiguous memory spaces. An incoming send message is written to these memory spaces. The virtual addresses are in the address context of the process that created the local queue pair.
For interprocessor communications, a user-mode software process transfers data through queue pairs directly from where the buffer resides in memory. In one embodiment, the transfer through the queue pairs bypasses the operating system and consumes few host instruction cycles. Queue pairs permit zero-processor-copy data transfer with no operating system kernel involvement. Zero-processor-copy data transfer provides efficient support of high-bandwidth, low-latency communication.
When a queue pair is created, the queue pair is set to provide a selected type of transport service. In one embodiment, a distributed computer system implementing the present invention supports four types of transport services: reliable connection, unreliable connection, reliable datagram, and unreliable datagram connection service.
The reliable connection service is typically used in a portion of a distributed computer system communicating between distributed processes, as generally shown in Fig. 4. Distributed computer system 400 in Fig. 4 includes a host processor node 1, a host processor node 2, and a host processor node 3. Host processor node 1 includes a process A 410. Host processor node 3 includes a process C 420 and a process D 430. Host processor node 2 includes a process E 440.
Host processor node 1 includes queue pairs 4, 6, and 7, each having a send work queue and a receive work queue. Host processor node 2 has a queue pair 9, and host processor node 3 has queue pairs 2 and 5. The reliable connection service of distributed computer system 400 associates a local queue pair with one and only one remote queue pair. Thus, queue pair 4 is used to communicate with queue pair 2; queue pair 7 is used to communicate with queue pair 5; and queue pair 6 is used to communicate with queue pair 9.
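The one-to-one pairing of reliable-connection queue pairs amounts to a bijective mapping, which a few lines can make concrete (the QP numbers come from Fig. 4; the mapping structure itself is illustrative, not part of the specification):

```python
# Reliable connection service: each local QP pairs with exactly one remote QP.
connections = {4: 2, 7: 5, 6: 9}   # local QP number -> remote QP number (Fig. 4)

# The pairing must be exclusive: no two local QPs share a remote peer.
assert len(set(connections.values())) == len(connections)

def peer_of(local_qp):
    # A packet sent on a local QP can only arrive at its one connected peer.
    return connections[local_qp]
```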
A WQE placed on one queue pair in the reliable connection service causes data to be written into the receive memory space referenced by a receive WQE of the connected queue pair. RDMA operations operate on the address space of the connected queue pair.
In one embodiment, the reliable connection service is made reliable because hardware maintains sequence numbers and acknowledges all packet transfers. A combination of hardware and SAN driver software retries any failed communications. The process client of the queue pair obtains reliable communications even in the presence of bit errors, receive underruns, and network congestion. If alternative paths exist in the SAN fabric, reliable communications can be maintained even in the presence of failures of fabric switches, links, or channel adapter ports.
In addition, acknowledgments may be employed to deliver data reliably across the SAN fabric. The acknowledgment may, or may not, be a process-level acknowledgment, i.e., an acknowledgment validating that a receiving process has consumed the data. Alternatively, the acknowledgment may be one that only indicates that the data has reached its destination.
An embodiment of a layered communication architecture 500 for implementing the present invention is generally shown in Fig. 5. The layered architecture diagram of Fig. 5 shows the various layers of the data communication path and the organization of data and control information passed between layers.
The host channel adapter end node protocol layers (employed by end node 511, for instance) include upper level protocols 502 defined by consumer 503, a transport layer 504, a network layer 506, a link layer 508, and a physical layer 510. Switch layers (employed by switch 513, for instance) include link layer 508 and physical layer 510. Router layers (employed by router 515, for instance) include network layer 506, link layer 508, and physical layer 510.
Layered architecture 500 generally follows the outline of a classical communication stack. With respect to the protocol layers of end node 511, for example, upper layer protocols 502 employ Verbs to create messages at transport layer 504. Network layer 506 routes packets between network subnets (516). Link layer 508 routes packets within a network subnet (518). Physical layer 510 sends bits or groups of bits to the physical layers of other devices. Each of the layers is unaware of how the layers above or below it perform their functions.
Consumers 503 and 505 represent applications or processes that employ the other layers for communicating between end nodes. Transport layer 504 provides end-to-end message movement. In one embodiment, the transport layer provides the four types of transport services described above: reliable connection service, reliable datagram service, unreliable datagram service, and raw datagram service. Network layer 506 performs packet routing through a subnet or multiple subnets to destination end nodes. Link layer 508 performs flow-controlled, error-checked, and prioritized packet delivery across links.
Physical layer 510 performs technology-dependent bit transmission. Bits or groups of bits are passed between physical layers via links 522, 524, and 526. Links can be implemented with printed circuit copper traces, copper cable, optical cable, or other suitable links.
Fig. 6 shows a standard prior-art queue pair structure. Fig. 6 is divided into two parts by a dashed horizontal line: main memory 600 above the line and host channel adapter (HCA) 602 below the line.
Main memory 600 holds a send queue 604 and a receive queue 606, which form a queue pair 608. These queues contain work queue elements (WQEs). Each WQE in the send queue describes the characteristics and location of a message to be transmitted on the link. For example, WQE 1 610 points to message 1 611, WQE 2 612 points to message 2 613, WQE 3 614 points to message 3 615, and WQE 4 616 points to message 4 617. In addition, consumer software maintains a send queue tail pointer 618 and a receive queue tail pointer 620.
HCA 602 includes a QP table 622 having multiple entries 624 (QP table entries, a/k/a QP contexts). Each entry 626 includes a send queue head pointer 628, a receive queue head pointer 630, a send queue adder count 636, a receive queue adder count 638, and other information 640.
The standard queue pair shown in Fig. 6 is used in the process of transmitting and receiving messages as follows.
To transmit a message, HCA 602 first fetches the WQE. Then, address translation processes the virtual address, key, and length information in the WQE to determine the physical address of the message in main memory. Next, the message is fetched from main memory 600. Finally, one or more packets are built to transmit the message on the link.
When HCA 602 receives a packet on the link, a portion of the packet header contains a QP number. The adapter places the message in the packet onto the receive queue 606 of the QP 608 with that number. Then, the WQE at the head of the receive queue 606 (WQE 1 660) is fetched to determine where to place the message in main memory 600. The head of the receive queue is pointed to by the receive queue head pointer 630 of the entry 626 in QP table 622 for that QP number. HCA 602 fetches the WQE (WQE 1 660), which contains the virtual address, key, and length describing the location where the message is to be placed; the HCA performs a translation to determine the physical address, and then places the message at that location.
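The standard receive path just described — QP table lookup, WQE fetch, virtual-to-physical translation, then placement — can be summarized in Python (a simplified model; the translation unit and key checks are stubbed, and all names are invented):

```python
# Simplified model of the standard (Fig. 6) receive path.

qp_table = {5: {"recv_head": 0}}                    # one QPTE per QP number
recv_queues = {5: [{"va": 0x9000, "len": 4096}]}    # WQEs posted by software
main_memory = {}

def translate(virtual_addr):
    # Stand-in for the MTP unit: here, identity mapping plus a fixed offset.
    return virtual_addr + 0x100000

def hca_receive(packet):
    qp_num = packet["qp"]                            # QP number from the header
    qpte = qp_table[qp_num]
    wqe = recv_queues[qp_num][qpte["recv_head"]]     # fetch WQE at the head
    pa = translate(wqe["va"])                        # resolve the placement address
    main_memory[pa] = packet["payload"]              # place the message
    qpte["recv_head"] += 1                           # consume the WQE
    return pa

where = hca_receive({"qp": 5, "payload": b"hello"})
```

Note how much work sits on this path — a WQE fetch from main memory plus an address translation — before the payload can land; this is exactly the overhead the low-latency QP of Fig. 7 removes.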
Fig. 7 has shown the right exemplary embodiment of low latency formation.Low latency is meant and is used for message is sent to the time that another node spends from a node.The application of some performance-critical is arranged,, wherein need low latency such as high-performance calculation.For example, compare with using the time that exemplary embodiment of the present invention spent, the time that the about twice of cost is handled in some modelings with I/O adapter of standard QP is sent to storer in another node with the storer of message from a node.
Fig. 7 is by dot-dash horizontal line separated into two parts, i.e. I/O adapter 702 under primary memory on the line 700 and the line.Primary memory 700 and processor, relevant such as server.The user software that moves on processor uses the data that produced by hardware generator, I/O adapter 702.Data can be the data of message or any kind.The example of I/O adapter 702 comprises the adapter of support RDMA (RDMA-capable) or the adapter of RNIC, HCA or any other type.Preferably, I/O adapter 702 is relatively near primary memory 700.
Main memory 700 holds a send queue 704 and a receive queue 706, which together form a queue pair 708.
The adapter 702 contains a QP table 712 having multiple entries (QPTEs, a/k/a QP context) indexed by QP number 716. Each entry 718 contains a send queue head pointer 720, a receive queue head pointer 722, a send queue message length 724, a receive queue message length 726, a send queue accumulator count 728, a receive queue accumulator count 730, a send queue message count 732, a receive queue message count 734, a send-queue-messages-per-completion count 738, a receive queue completion indicator 740, and other information 742. Preferably, the information in the queue pair table 712 is cached in the I/O adapter.
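A minimal sketch of such a QP table entry follows. The reference numerals from Fig. 7 are noted in comments; the field names and default values are assumptions for illustration, not from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class QPTableEntry:
    """Sketch of a low-latency QP table entry (QPTE); names are illustrative."""
    sq_head: int = 0                  # send queue head pointer (720)
    rq_head: int = 0                  # receive queue head pointer (722)
    sq_msg_len: int = 128             # fixed SQ message length (724)
    rq_msg_len: int = 128             # fixed RQ message length (726)
    sq_accum: int = 0                 # send queue accumulator count (728)
    rq_accum: int = 0                 # receive queue accumulator count (730)
    sq_count: int = 0                 # number of SQ messages (732)
    rq_count: int = 0                 # number of RQ messages (734)
    sq_msgs_per_completion: int = 1   # SQ messages per completion (738)
    rq_completion: bool = True        # RQ all-or-nothing completion (740)
    other: dict = field(default_factory=dict)  # other information (742)

# Example: a QP with 256-byte messages and "nothing" mode on the RQ.
qpte = QPTableEntry(sq_msg_len=256, rq_completion=False)
```

Because every field is a small fixed-width quantity, the whole entry caches well in adapter hardware, as the paragraph above notes.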
The exemplary low-latency queue pair shown in Fig. 7 is used, for example, to send and receive messages. To send a message 710, the user application simply places the message 710 directly on the send queue 704. The user notifies the I/O adapter 702 that one or more messages 710 have been placed on the send queue 704 by storing that number into the send queue accumulator count 728 of entry 718. The I/O adapter 702 then fetches the message referenced by the send queue head pointer 720 directly from main memory 700 and builds packets to send on the link. When the adapter 702 receives a packet on the link, it simply moves the message 710 directly into the receive queue 706. The latency is therefore lower than with the standard queue pair shown in Fig. 6, and the operation is more efficient.
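The low-latency send path above can be sketched as follows. The accumulator is modeled as a plain integer (in hardware the notification would be something like a single store to an adapter register), and all names are illustrative:

```python
from collections import deque

def user_post_send(entry, send_queue, messages):
    """User side: place messages directly on the SQ, then notify the
    adapter by adding to the SQ accumulator count."""
    for msg in messages:
        send_queue.append(msg)      # message itself goes on the queue
    entry["sq_accum"] += len(messages)

def adapter_drain_send(entry, send_queue):
    """Adapter side: while the accumulator is non-zero, take the message
    at the SQ head directly and 'transmit' it (returned as packets here).
    No WQE fetch and no address translation are needed."""
    packets = []
    while entry["sq_accum"] > 0:
        packets.append(send_queue.popleft())
        entry["sq_accum"] -= 1
    return packets

entry = {"sq_accum": 0}
sq = deque()
user_post_send(entry, sq, [b"m0", b"m1"])
pkts = adapter_drain_send(entry, sq)
```

Contrast this with the standard path: here the only per-message work is one copy and one counter update on each side.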
One application of the exemplary low-latency queue pair is in high-performance computing environments, where many nodes connected in a cluster perform parallel processing on a very large task. Data and control messages flow between the nodes, and the exemplary embodiment of Fig. 7 helps increase the processing speed of such a system. Typically, a message in such a system might be 128 bytes long.
In contrast to Fig. 6, no WQEs are used in the exemplary embodiment shown in Fig. 7. Eliminating WQEs in the exemplary embodiment of Fig. 7 presents four problems to be solved.
First, the adapter 702 needs to be able to find the message 710 to be sent without any WQE. This is solved by placing the message 710 directly on the send queue 704.
Second, the adapter 702 needs to know the length of the message 710 to be received or sent. This is solved by creating a length attribute of the QP table entry 718, shown as LL message SQ length 724 and LL message RQ length 726. The length is a fixed size, which is advantageous for the adapter 702 hardware. Example message sizes include 128 bytes, 256 bytes, 512 bytes, and so on.
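The fixed size pays off in address arithmetic: the adapter can locate any queue slot with a single multiply-add instead of fetching a per-message descriptor. A sketch, with hypothetical names and a circular queue assumed:

```python
def slot_address(base, index, msg_len, num_slots):
    """With fixed-size messages, slot N of a circular queue sits at a
    computable offset from the queue base -- no WQE lookup required."""
    return base + (index % num_slots) * msg_len

# 128-byte slots in a 16-entry queue based at 0x1000.
addr3 = slot_address(0x1000, 3, 128, 16)    # fourth slot
addr_wrap = slot_address(0x1000, 16, 128, 16)  # wraps back to the base
```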
Third, a software user needs notification that a message transmission has completed successfully, so that the space on the queue can be reclaimed. Traditionally, this information is a parameter available in the WQE. It is desirable to generate one completion queue entry for more than one message 710 at a time, to reduce bandwidth and improve performance. Therefore, each QP table entry 718 contains a send-queue-messages-per-completion count 738. The count 738 can be any required number, including one.
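The coalescing rule amounts to simple arithmetic: with a messages-per-completion count of N, M finished sends produce floor(M/N) completion entries, with the remainder reported once enough further sends finish. A sketch, with hypothetical names:

```python
def completions_for(sends_finished, msgs_per_completion):
    """Number of completion queue entries the adapter would generate
    for a given number of finished sends (field 738 semantics)."""
    return sends_finished // msgs_per_completion

# With a per-completion count of 4, ten finished sends yield two CQEs;
# the remaining two sends are reported after two more complete.
```

A count of 1 degenerates to the traditional one-completion-per-message behavior, so software can choose its own trade-off between notification latency and completion bandwidth.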
Similarly, a software user needs to know when a message 710 has been received. This is solved by an all-or-nothing option, which is the receive queue completion field 740 in the QP table entry 718. In "all" mode, a completion is provided for every message 710 received. In "nothing" mode, no completion is ever provided for received messages 710. In that case, the fact that a message 710 has been received is embedded in the received message 710 itself on the receive queue 706. For example, a valid bit in the message 710 can be polled by the software user to determine when a valid message 710 has been received.
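The "nothing" mode polling might look like the following sketch. The encoding (a valid flag carried in the first byte of each slot) is an assumption for illustration only:

```python
VALID = 0x80  # hypothetical valid-flag bit in the first byte of a slot

def poll_receive(recv_queue, head):
    """'Nothing' mode: no CQE is written; software polls the message
    slot itself for a valid indication set by the adapter."""
    slot = recv_queue[head]
    if slot and slot[0] & VALID:
        return slot[1:]   # payload follows the flag byte
    return None           # nothing valid has arrived yet

rq = [b""] * 4
assert poll_receive(rq, 0) is None       # empty slot: no message yet
rq[0] = bytes([VALID]) + b"payload"      # adapter delivers message, flag set
```

Polling a memory location the adapter writes directly avoids the completion-queue round trip entirely, which is the point of the "nothing" option.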
Fourth, the adapter 702 needs to know when a queue pair 708 is configured as a low-latency queue pair. This is solved by creating a low-latency configuration option. For example, when creating a queue pair, the software user can configure it either as a low-latency queue pair 708 or as a standard queue pair 608 (Fig. 6).
Exemplary embodiments of the present invention have many advantages. They provide a low-latency queue pair that eliminates the overhead associated with work queue elements and define the mechanisms required to place a message directly on the queue pair. These savings can be realized at both the sending and receiving ends of a link. Simulation results have shown that use of the invention can roughly halve node-to-node latency. In addition, exemplary embodiments interoperate with standard nodes that do not implement them, without adverse effect (though the full performance benefit is realized only when both nodes implement them).
As described above, embodiments of the invention may be embodied in the form of computer-implemented processes and apparatuses for practicing those processes. Embodiments of the invention may also be embodied in the form of computer program code containing instructions embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other computer-readable storage medium, wherein, when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing the invention. The present invention may also be embodied in the form of computer program code, for example, whether stored in a storage medium, loaded into and/or executed by a computer, or transmitted over some transmission medium, such as over electrical wiring or cabling, through fiber optics, or via electromagnetic radiation, wherein, when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing the invention. When implemented on a general-purpose microprocessor, the computer program code segments configure the microprocessor to create specific logic circuits.
While the invention has been described with reference to exemplary embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In addition, the various components may be implemented in hardware, software, or firmware, or any combination thereof. Finally, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from its essential scope. Therefore, the invention is not limited to the particular embodiments disclosed as the best or only mode contemplated for carrying out this invention; rather, the invention includes all embodiments falling within the scope of the appended claims. The use of the terms first, second, and the like does not denote any order or importance; these terms are used to distinguish one element from another. The use of the terms a, an, and the like does not denote a limitation of quantity, but rather denotes the presence of at least one of the referenced item.

Claims (16)

1. A system for providing queue pairs for an input/output (I/O) adapter, comprising:
a main memory having a send queue and a receive queue;
an I/O adapter for placing messages themselves, received on a link, onto said receive queue, and for transmitting on the link messages themselves held on said send queue; and
a processor in communication with said main memory and said I/O adapter, said processor executing a user process in said main memory, said user process accessing the send queue and the receive queue,
wherein said I/O adapter includes a queue pair table, said queue pair table including a send queue attribute for said send queue and a receive queue attribute for said receive queue.
2. The system of claim 1, wherein said send queue and said receive queue hold no work queue elements (WQEs).
3. The system of claim 1, wherein said send queue attribute includes a message length.
4. The system of claim 1, wherein said receive queue attribute includes a message length.
5. The system of claim 1, wherein said queue pair table includes a send queue accumulator counter for notifying said I/O adapter when a message has been placed on said send queue.
6. The system of claim 1, wherein said attributes include a send-queue-messages-per-completion count.
7. The system of claim 1, wherein said attributes include a receive queue completion indicator.
8. The system of claim 1, wherein said user process configures a particular queue pair such that said I/O adapter places work queue elements (WQEs) pointing to messages received on said link onto a receive queue for that particular queue pair, and transmits on said link messages pointed to by WQEs held on a send queue for that particular queue pair.
9. A method for providing queue pairs for an input/output (I/O) adapter, comprising:
placing, by an I/O adapter, messages themselves received on a link into a receive queue of a main memory;
transmitting on the link, by the I/O adapter, messages themselves held in a send queue, said send queue being in said main memory; and
accessing said send queue and said receive queue by a user process, said user process executing on a processor in communication with said main memory and said I/O adapter,
wherein said I/O adapter includes a queue pair table, said queue pair table including a send queue attribute for said send queue and a receive queue attribute for said receive queue.
10. The method of claim 9, wherein said send queue and said receive queue hold no work queue elements (WQEs).
11. The method of claim 9, wherein said send queue attribute includes a message length.
12. The method of claim 9, wherein said receive queue attribute includes a message length.
13. The method of claim 9, wherein said queue pair table includes a send queue accumulator counter for notifying said I/O adapter when a message has been placed on said send queue.
14. The method of claim 9, wherein said attributes include a send-queue-messages-per-completion count.
15. The method of claim 9, wherein said attributes include a receive queue completion indicator.
16. The method of claim 9, further comprising:
configuring a particular queue pair by said user process, such that said I/O adapter places work queue elements (WQEs) pointing to messages received on said link onto a receive queue for that particular queue pair, and transmits on said link messages pointed to by WQEs held on a send queue for that particular queue pair.
CNB2005101246118A 2004-11-10 2005-11-09 Method, system, and storage medium for providing queue pairs for I/O adapters Expired - Fee Related CN100442256C (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10/985,460 2004-11-10
US10/985,460 US8055818B2 (en) 2004-08-30 2004-11-10 Low latency queue pairs for I/O adapters

Publications (2)

Publication Number Publication Date
CN1815458A CN1815458A (en) 2006-08-09
CN100442256C true CN100442256C (en) 2008-12-10

Family

ID=36907674

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2005101246118A Expired - Fee Related CN100442256C (en) 2004-11-10 2005-11-09 Method, system, and storage medium for providing queue pairs for I/O adapters

Country Status (1)

Country Link
CN (1) CN100442256C (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7668984B2 (en) * 2007-01-10 2010-02-23 International Business Machines Corporation Low latency send queues in I/O adapter hardware
TW201237632A (en) * 2010-12-21 2012-09-16 Ibm Buffer management scheme for a network processor
US9354933B2 (en) * 2011-10-31 2016-05-31 Intel Corporation Remote direct memory access adapter state migration in a virtual environment
CN104426797B (en) * 2013-08-27 2018-03-13 华为技术有限公司 A kind of communication means and device based on queue
CN103942097B (en) * 2014-04-10 2017-11-24 华为技术有限公司 A kind of data processing method, device and the computer for possessing related device
CN112256407A (en) * 2020-12-17 2021-01-22 烽火通信科技股份有限公司 RDMA (remote direct memory Access) -based container network, communication method and computer-readable medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6480500B1 (en) * 2001-06-18 2002-11-12 Advanced Micro Devices, Inc. Arrangement for creating multiple virtual queue pairs from a compressed queue pair based on shared attributes
US20030035433A1 (en) * 2001-08-16 2003-02-20 International Business Machines Corporation Apparatus and method for virtualizing a queue pair space to minimize time-wait impacts
US20030202519A1 (en) * 2002-04-25 2003-10-30 International Business Machines Corporation System, method, and product for managing data transfers in a network
CN1487417A (en) * 2002-09-05 2004-04-07 �Ҵ���˾ ISCSI drive program and interface protocal of adaptor
US6742075B1 (en) * 2001-12-03 2004-05-25 Advanced Micro Devices, Inc. Arrangement for instigating work in a channel adapter based on received address information and stored context information
US6789143B2 (en) * 2001-09-24 2004-09-07 International Business Machines Corporation Infiniband work and completion queue management via head and tail circular buffers with indirect work queue entries

Also Published As

Publication number Publication date
CN1815458A (en) 2006-08-09

Similar Documents

Publication Publication Date Title
CN100361100C (en) Method and system for hardware enforcement of logical partitioning of a channel adapter's resources in a system area network
US7233570B2 (en) Long distance repeater for digital information
US6748559B1 (en) Method and system for reliably defining and determining timeout values in unreliable datagrams
EP1374521B1 (en) Method and apparatus for remote key validation for ngio/infiniband applications
US8265092B2 (en) Adaptive low latency receive queues
CN100375469C Method and device for emulating multiple logic port on a physical port
US7668984B2 (en) Low latency send queues in I/O adapter hardware
EP1399829B1 (en) End node partitioning using local identifiers
US6578122B2 (en) Using an access key to protect and point to regions in windows for infiniband
US8341237B2 (en) Systems, methods and computer program products for automatically triggering operations on a queue pair
TW583544B (en) Infiniband work and completion queue management via head and tail circular buffers with indirect work queue entries
US7899050B2 (en) Low latency multicast for infiniband® host channel adapters
US6938138B2 (en) Method and apparatus for managing access to memory
US9037640B2 (en) Processing STREAMS messages over a system area network
JP5735883B2 (en) How to delay the acknowledgment of an operation until the local adapter read operation confirms the completion of the operation
US20020073257A1 (en) Transferring foreign protocols across a system area network
US20090077268A1 (en) Low Latency Multicast for Infiniband Host Channel Adapters
US20030035433A1 (en) Apparatus and method for virtualizing a queue pair space to minimize time-wait impacts
US20030018828A1 (en) Infiniband mixed semantic ethernet I/O path
US6990528B1 (en) System area network of end-to-end context via reliable datagram domains
CN100442256C (en) Method, system, and storage medium for providing queue pairs for I/O adapters
US7409432B1 (en) Efficient process for handover between subnet managers
US20020198927A1 (en) Apparatus and method for routing internet protocol frames over a system area network
US6601148B2 (en) Infiniband memory windows management directly in hardware
JP2002305535A (en) Method and apparatus for providing a reliable protocol for transferring data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20081210

Termination date: 20181109

CF01 Termination of patent right due to non-payment of annual fee