CN101409715B - Method and system for communication using InfiniBand network

Info

Publication number
CN101409715B
Authority
CN
China
Prior art keywords
receiver
packet
sender
current data packet
data packet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2008102246636A
Other languages
Chinese (zh)
Other versions
CN101409715A (en
Inventor
林瑶
韩冀中
张洪伟
贺劲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS

Abstract

The invention relates to a method and system for communication using an InfiniBand network. The method comprises: in a first step, a sender and a receiver exchange handshake information, which comprises a QPN used to establish a new connection in the InfiniBand network, the RDMA buffer address of the receiver, and the size of the RDMA buffer of the receiver; in a second step, the sender writes the current data packet directly into the RDMA buffer of the receiver according to the RDMA buffer address and the RDMA buffer size of the receiver; in a third step, the sender updates the receiver buffer state that the sender stores; in a fourth step, when the application data transfer is finished, the sender closes the connection. The advantage of the method and system is that uStream outperforms SDP in both bandwidth and latency and achieves performance comparable to the underlying InfiniBand Verbs interface.

Description

Method and system for communication using an InfiniBand network
Technical field
The present invention relates to InfiniBand networks, and in particular to a method and system for communication using an InfiniBand network.
Background art
Because of their low cost, high performance and good scalability, computer clusters interconnected by Ethernet and other high-performance networks have been used increasingly widely in high-performance computing and enterprise computing since the 1990s. As the main technology for cluster interconnection, system area networks (SANs) have developed rapidly at the same time. Several high-bandwidth, low-latency system area networks, such as Myrinet, Quadrics, SCI and InfiniBand, have gradually become mainstream high-speed interconnects because they offer higher performance than Ethernet.
Among them, InfiniBand is currently one of the most widely used system area networks. It is widely deployed in high-performance computing clusters and has gained acceptance in the enterprise data center market. In the Top500 list of high-performance computers published in November 2007, 24.2% of the cluster computing systems used InfiniBand. InfiniBand offers high bandwidth and low latency, and provides many advanced features, such as remote direct memory access (RDMA) and zero-copy.
The RDMA communication mechanism allows data to be transferred directly between the application's address space and the network. It bypasses the operating system kernel on the critical path of data transfer and reduces the number of memory copies, which makes it an efficient data transfer mechanism. Zero-copy avoids frequent copying of data between the layers of the communication protocol stack and lightens the load on the operating system kernel, so it is an effective way to improve communication performance. How to exploit these advanced features of the InfiniBand network to provide high-performance communication for applications has become a research focus in the field of cluster communication.
At present, the main communication protocol stacks on InfiniBand networks are IPoIB (IP over IB) and SDP (Sockets Direct Protocol), which give Socket-based applications a way to use the advanced features of the InfiniBand network. However, both depend on protocol implementations in the operating system kernel, so they always incur the overhead of user/kernel context switches and data copies, and they are fairly complex. IPoIB uses TCP/IP emulation to map traditional TCP applications onto the InfiniBand network; SDP is designed on the kernel-mode driver interface (kVerbs) and adapts the InfiniBand programming interface to the traditional Socket programming interface. Because it adds an unnecessary protocol layer, IPoIB introduces protocol-processing overhead. SDP, which is implemented on the Send/Receive model and a kernel-bypass message transfer protocol, performs better than IPoIB, but it still incurs the overhead of user/kernel context switches and data copies, breaks the asynchronous model of InfiniBand, and is itself a rather complex protocol.
Some research institutions and scholars have proposed improvements to SDP, either by eliminating the data copy between user and kernel space or by providing an asynchronous communication model, but the user/kernel context-switch overhead and the complexity of SDP remain. Test results show that both the latency and the bandwidth of SDP fall well short of the underlying network driver interface (InfiniBand Verbs): the minimum latency of SDP is almost 6 times that of the underlying InfiniBand Verbs interface, and the peak bandwidth of SDP reaches only about 70% of it. The high bandwidth and low latency of the InfiniBand network therefore cannot be fully exploited through the existing IPoIB and SDP.
Summary of the invention
To solve the above technical problem, a method and system for communication using an InfiniBand network are provided. The aim is to use the characteristics of the InfiniBand network to give applications a communication scheme with higher performance than the existing IPoIB and SDP, so that the high bandwidth and low latency of the InfiniBand network can be fully exploited.
The invention provides a method for communication using an InfiniBand network, comprising:
Step 1, the sender and the receiver exchange handshake information, which comprises a QPN used to create a new connection in the InfiniBand network, the receiver's RDMA buffer address, and the receiver's RDMA buffer size;
Step 2, the sender writes the current data packet directly into the receiver's RDMA buffer according to the receiver's RDMA buffer address and the receiver's RDMA buffer size;
Step 3, the sender updates the receiver buffer state that the sender stores;
Step 4, after the application data transfer is complete, the sender closes the connection.
A data packet comprises a header and data. The header comprises the following parameters: the packet number of the current data packet, the destination address of the current data packet, the data length of the current data packet, a wrap-around address, and the destination address of the next data packet. When the header and the data of the current data packet are sent separately, the wrap-around address is set to the destination address of the data of the current data packet. The destination address of the next data packet tells the receiver where the next data packet will be written. The sender obtains the destination address of the current data packet from the receiver RDMA buffer state that it stores locally.
In step 1, the sender and the receiver exchange the handshake information through a connection initialization message.
The handshake information further comprises the sender's RDMA buffer address and the sender's RDMA buffer size.
In step 2:
If, when sending the current data packet, the sender determines that the contiguous writable space remaining at the tail of the receiver's RDMA buffer is not enough to hold the data of the current data packet, the wrap-around address in the header of the current data packet is set to the destination address of the data of the current data packet; the sender writes the header of the current data packet directly into the contiguous writable space remaining at the tail of the receiver's RDMA buffer according to the destination address of the current data packet, and then writes the data of the current data packet to the head of the receiver's RDMA buffer according to the wrap-around address of the current data packet;
If the sender determines that the contiguous writable space remaining at the tail of the receiver's RDMA buffer is enough to hold the current data packet, it writes the current data packet directly into the receiver's RDMA buffer according to the destination address of the current data packet;
If, when sending the current data packet, the sender determines that the contiguous writable space remaining at the tail of the receiver's RDMA buffer is not enough to hold the header of the next data packet, the destination address of the next data packet in the header of the current data packet is set to the start address of the receiver's RDMA buffer.
In step 3:
When the length of unacknowledged data in the receiver's buffer exceeds a preset threshold, the receiver proactively sends a first request message, which acknowledges the received data and notifies the sender to update the receiver buffer state that it stores; or
When too much unacknowledged data in the receiver's buffer leaves insufficient writable space, the sender proactively sends a second request message, which asks the receiver to acknowledge the data and reply with the first request message so that the sender can update the receiver buffer state that it stores.
Step 2 comprises:
Step 21, the sender's main thread looks up, in the send queue, a buffer suitable for holding the current data packet;
Step 22, the sender's main thread writes the data into this buffer;
Step 23, the sender's main thread determines whether the receiver's buffer has enough space and, if so, wakes the sender's send thread;
Step 24, after being woken, the sender's send thread writes the current data packet directly into the receiver's RDMA buffer; once the data packet has been written into the receiver's RDMA buffer, the send thread notifies the sender's main thread that this buffer can accept a new data packet.
Step 2 and step 3 are carried out alternately.
In step 3, the receiver uses the receiver's control thread to send the first request message, and the sender uses the sender's control thread to send the second request message.
The invention also provides a system for communication using an InfiniBand network, comprising a sender and a receiver.
The sender is configured to exchange handshake information with the receiver, the handshake information comprising a QPN, the receiver's RDMA buffer address, and the receiver's RDMA buffer size; the QPN is used to create a new connection in the InfiniBand network.
The sender is further configured to write the current data packet directly into the receiver's RDMA buffer according to the receiver's RDMA buffer address and RDMA buffer size, to update the receiver buffer state that the sender stores, and to close the connection after the application data transfer is complete.
A data packet comprises a header and data. The header comprises the following parameters: the packet number of the current data packet, the destination address of the current data packet, the data length of the current data packet, a wrap-around address, and the destination address of the next data packet. When the header and the data of the current data packet are sent separately, the wrap-around address is set to the destination address of the data of the current data packet. The destination address of the next data packet tells the receiver where the next data packet will be written. The sender obtains the destination address of the current data packet and the destination address of the next data packet from the receiver RDMA buffer state that it stores locally.
The sender exchanges the handshake information with the receiver through a connection initialization message.
The handshake information further comprises the sender's RDMA buffer address and the sender's RDMA buffer size.
The sender is further configured as follows:
When sending the current data packet, if the sender determines that the contiguous writable space remaining at the tail of the receiver's RDMA buffer is not enough to hold the data of the current data packet, the wrap-around address in the header of the current data packet is set to the destination address of the data of the current data packet; the sender writes the header of the current data packet directly into the contiguous writable space remaining at the tail of the receiver's RDMA buffer according to the destination address of the current data packet, and then writes the data of the current data packet to the head of the receiver's RDMA buffer according to the wrap-around address of the current data packet;
When sending the current data packet, if the sender determines that the contiguous writable space remaining at the tail of the receiver's RDMA buffer is not enough to hold the header of the next data packet, the destination address of the next data packet in the header of the current data packet is set to the start address of the RDMA buffer.
The receiver is further configured to proactively send a first request message when the length of unacknowledged data in the receiver's buffer exceeds a preset threshold, acknowledging the received data and notifying the sender to update the receiver buffer state that it stores; or
The sender is further configured to proactively send a second request message when too much unacknowledged data in the receiver's buffer leaves insufficient writable space, asking the receiver to acknowledge the data and reply with the first request message so that the sender can update the receiver buffer state that it stores.
The sender comprises a main thread and a send thread.
The main thread is configured to look up, in the send queue, a buffer suitable for holding the current data packet, to write the data into this buffer, and to determine whether the receiver's RDMA buffer has enough space and, if so, wake the send thread.
The send thread is configured, after being woken, to write the current data packet directly into the receiver's RDMA buffer and, once the data packet has been written into the receiver's RDMA buffer, to notify the sender's main thread that this buffer can accept a new data packet.
The receiver uses the receiver's control thread to send the first request message, and the sender uses the sender's control thread to send the second request message.
In both bandwidth and latency, uStream delivers a large performance improvement over SDP and achieves performance comparable to the underlying InfiniBand Verbs interface.
The invention implements uStream, a simple and efficient high-performance stream communication unit. It supports the RDMA feature and the zero-copy mechanism of the InfiniBand network and eliminates the overhead of user/kernel context switches. It gives applications a communication technique with higher performance than the existing IPoIB and SDP, so that the high bandwidth and low latency of the InfiniBand network are fully exploited.
Description of drawings
Fig. 1 is a schematic diagram of the separate dual paths (data/control) of uStream;
Fig. 2 is a schematic diagram of the structure of a uStream data packet header;
Fig. 3 is a schematic diagram of the segmented wrap-around operation;
Fig. 4 is a schematic diagram of the unsegmented wrap-around operation;
Fig. 5 shows the first case in which an acknowledgment message is generated and sent;
Fig. 6 shows the second case in which an acknowledgment message is generated and sent;
Fig. 7 shows the working state of the send queue and the state transition diagram of each small buffer;
Fig. 8 is a schematic diagram of the communication process of uStream;
Fig. 9 compares the latency of uStream and SDP;
Fig. 10 compares the latency of uStream and the underlying InfiniBand driver interface (uVerbs);
Fig. 11 compares the bandwidth of uStream and SDP;
Fig. 12 compares the bandwidth of uStream and the underlying InfiniBand driver interface (uVerbs).
Detailed description
The invention designs and implements a high-performance stream communication unit in user mode, named uStream, by applying a series of key techniques. Because it is implemented in user mode, uStream eliminates the overhead of user/kernel context switches and the complexity of depending on the kernel, and it supports the RDMA feature and zero-copy mechanism of InfiniBand. It provides high-performance communication to applications through a stream interface.
uStream uses the following key techniques to eliminate performance overhead in the communication process and thereby achieve high bandwidth and low latency:
1. Separate dual-path (data/control) strategy;
2. Buffer wrap-around algorithm and pre-registration;
3. Non-real-time acknowledgment strategy;
4. Asynchronous send mechanism and zero-copy.
The key techniques used in uStream are introduced one by one below, followed by a brief description of the uStream communication process.
Separate dual-path (data/control) strategy:
A complete communication process comprises data transfer and control-information exchange, which place different requirements on communication. Data transfer usually requires high bandwidth and low latency, whereas the exchange of control messages depends on a reliable connection and timeliness. uStream therefore uses two independent paths, a data path and a control path, to carry out data transfer and control-message exchange respectively. The data path combines RDMA write operations with zero-copy to achieve high bandwidth and low latency. Because RDMA is a one-sided operation, however, it is not well suited to exchanging control messages, so the control path of uStream uses a communication protocol based on the blocking Send/Receive model, such as TCP, SDP or IPoIB, to achieve timely and reliable information exchange.
As shown in Fig. 1, in uStream the data path is responsible for transmitting data packets and the control path is responsible for exchanging control messages. The structure of the uStream data packet and the types of control messages are described below.
Packet structure:
A uStream data packet comprises two parts: a header and data. The header is a 32-byte array containing 5 parameters: the packet number (PSN, Packet Serial Number), the destination address of the current packet, the data segment length, the wrap-around address, and the destination address of the next packet.
Fig. 2 shows the structure of the uStream data packet header. The wrap-around address indicates whether a packet is segmented; when a packet is segmented, its header and data are sent separately. If a packet is not segmented, its wrap-around address is NULL; otherwise, its wrap-around address is the destination address of its data segment. When a packet is not segmented, its header and data are sent together as a whole, and the packet has a single destination address, namely the destination address contained in the header. When a packet is segmented, its header and data are sent separately, so the header and the data segment each have their own destination address: the destination address contained in the header is that of the header, and the destination address of the data segment is placed in the wrap-around address. In addition, the destination address of the next packet tells the receiver where the next packet will be written.
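The header layout described above can be illustrated with a short Python sketch. The text fixes the header at 32 bytes and names its five fields but does not give their widths, so the layout used here (32-bit packet number and length, 64-bit addresses, little-endian) is an assumption chosen only because it adds up to 32 bytes; the names `pack_header`, `unpack_header` and `is_segmented` are hypothetical.

```python
import struct

# Assumed layout (not given in the source): PSN and data length are 32-bit,
# the three addresses are 64-bit, little-endian, no padding.
# 4 + 8 + 4 + 8 + 8 = 32 bytes, matching the fixed header size in the text.
HEADER_FORMAT = "<IQIQQ"
HEADER_LEN = struct.calcsize(HEADER_FORMAT)
NULL_ADDR = 0  # stands in for a NULL wrap-around address


def pack_header(psn, dest_addr, data_len, wrap_addr, next_addr):
    """Serialize the five header fields into a 32-byte header."""
    return struct.pack(HEADER_FORMAT, psn, dest_addr, data_len, wrap_addr, next_addr)


def unpack_header(raw):
    """Parse a 32-byte header back into its five named fields."""
    psn, dest_addr, data_len, wrap_addr, next_addr = struct.unpack(HEADER_FORMAT, raw)
    return {
        "psn": psn,
        "dest_addr": dest_addr,
        "data_len": data_len,
        "wrap_addr": wrap_addr,
        "next_addr": next_addr,
    }


def is_segmented(header):
    """A packet is segmented exactly when its wrap-around address is not NULL."""
    return header["wrap_addr"] != NULL_ADDR
```

In this sketch a receiver would read `wrap_addr` to learn where the data segment was written when header and data arrive separately, and `next_addr` to learn where to look for the following header.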
Control message types:
uStream has four kinds of control messages, used for connection management and flow control. The connection management messages are the connection initialization message (Connection Initialization Message) and the connection close message (Connection Close Message). The former carries the QPN (Queue Pair Number) and the RDMA buffer information, including the Remote Key and the RDMA buffer address and size. The latter is used when closing a connection to notify the remote node to release resources, including the RDMA buffer and the QP (Queue Pair). The Remote Key authorizes the remote node to access local memory, and the RDMA buffer address tells the sender the start address of the destination buffer. The flow control messages are ACK PLEASE and ACK REQUEST, as shown in Fig. 1. The sender and the receiver use the flow control messages to exchange the state of the receiver's buffer.
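As a rough illustration of the four control messages and of what a connection initialization message carries, the following Python sketch encodes one. The numeric message codes, the wire format and the function names are all assumptions made for this example; the source only lists the message types and the fields carried.

```python
import struct
from enum import IntEnum


class ControlType(IntEnum):
    """The four control messages named in the text; the codes are arbitrary."""
    CONN_INIT = 1    # Connection Initialization Message
    CONN_CLOSE = 2   # Connection Close Message
    ACK_PLEASE = 3   # sender asks the receiver to acknowledge data
    ACK_REQUEST = 4  # receiver acknowledges data


# Assumed wire format: type (1 byte), QPN (4), Remote Key (4),
# RDMA buffer address (8), RDMA buffer size (8), little-endian.
CONN_INIT_FORMAT = "<BIIQQ"


def pack_conn_init(qpn, rkey, buf_addr, buf_size):
    """Build the handshake message one side sends to describe its receive buffer."""
    return struct.pack(CONN_INIT_FORMAT, ControlType.CONN_INIT, qpn, rkey, buf_addr, buf_size)


def unpack_conn_init(raw):
    """Parse a handshake message; reject any other control message type."""
    msg_type, qpn, rkey, buf_addr, buf_size = struct.unpack(CONN_INIT_FORMAT, raw)
    if msg_type != ControlType.CONN_INIT:
        raise ValueError("not a connection initialization message")
    return {"qpn": qpn, "rkey": rkey, "buf_addr": buf_addr, "buf_size": buf_size}
```

Since either side of a connection may act as sender or receiver, each side would send one such message over the control path and keep the peer's values as its initial copy of the remote buffer state.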
Buffer wrap-around algorithm and pre-registration:
Buffer registration adds considerable overhead to the communication process. To eliminate it, uStream uses a pre-registered receiver buffer, which also improves the receiver's memory utilization. Because the receiver buffer of uStream is registered in advance, an effective management algorithm is needed to recycle the receiver buffer space. The invention therefore proposes the wrap-around algorithm, which not only solves the problem of reusing buffer space but also supports a stream interface and variable-length packets. The wrap-around algorithm mainly addresses how to reuse buffer space when a packet reaches the tail of the receiver buffer.
The wrap-around algorithm defines two wrap-around operations to solve this problem: segmented wrap-around and unsegmented wrap-around, shown in Fig. 3 and Fig. 4 respectively.
As shown in Fig. 3, segmented wrap-around applies when the contiguous writable space remaining at the tail of the receiver buffer is shorter than a whole packet but long enough for a header. The packet is then split into two parts that are sent separately: the header is written at the tail of the receiver buffer, and the data wraps around and is written at the head of the receiver buffer.
Fig. 4 shows unsegmented wrap-around: when the contiguous writable space remaining at the tail of the receiver buffer is not even enough for a header, the whole packet wraps around and is written at the head of the receiver buffer.
To support variable-length packets, in the wrap-around algorithm the write position of a header is computed by the previous packet, while the write position of the data is computed by the current packet.
With variable-length packets, the sender does not know the size of the next packet when it sends the current one, but the header length is fixed (32 bytes). When sending the current packet, the sender can therefore only determine whether the contiguous writable space left at the tail of the receiver buffer after the current packet is written is enough for the header of the next packet. If it is, the header of the next packet follows directly after the current packet; if not, the header of the next packet is written at the head of the receiver buffer. This is why the destination address of a header is determined by the previous packet. The data segment length of a packet, on the other hand, is known only when that packet is sent. When sending the current packet, the sender therefore determines whether the contiguous writable space remaining at the tail of the receiver buffer is enough for the data segment of this packet. If it is, the data segment is written together with the header; if not, the data segment and the header are separated: the header goes to its destination address at the tail of the receiver buffer, which was already fixed by the previous packet, and the destination address of the data segment is set to the head of the receiver buffer.
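The placement rules above can be sketched as a small Python function. This is a simplified model, not the patented implementation: it treats `tail_pos` as one past the last byte of the buffer, ignores whether the space at the head has already been acknowledged (that check belongs to the Flush operation), and uses the hypothetical names `WritePlan` and `plan_write`.

```python
from dataclasses import dataclass
from typing import Optional

HEADER_LEN = 32  # fixed header length given in the text


@dataclass
class WritePlan:
    header_addr: int          # where this packet's header is written
    data_addr: int            # where this packet's data is written
    wrap_addr: Optional[int]  # None when the packet is not segmented
    next_header_addr: int     # value for the "next packet" header field


def plan_write(head_pos: int, tail_pos: int, next_pack_pos: int, data_len: int) -> WritePlan:
    # The header position was already fixed by the previous packet.
    header_addr = next_pack_pos
    after_header = header_addr + HEADER_LEN

    if after_header + data_len <= tail_pos:
        # Data fits behind the header: send header and data as one unit.
        data_addr, wrap_addr = after_header, None
    else:
        # Segmented wrap-around: header stays at the tail, data goes to the head.
        data_addr, wrap_addr = head_pos, head_pos

    # Decide now where the next header goes, without knowing the next data length.
    end = data_addr + data_len
    if tail_pos - end >= HEADER_LEN:
        next_header_addr = end
    else:
        # Unsegmented wrap-around: the whole next packet starts at the head.
        next_header_addr = head_pos
    return WritePlan(header_addr, data_addr, wrap_addr, next_header_addr)
```

For a buffer spanning addresses 0 to 1000, a 100-byte packet whose header sits at 900 is segmented (data at 0), while a 50-byte packet at the same position fits but leaves only 18 bytes, so the next packet is directed to the head.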
Non-real-time acknowledgment strategy:
In uStream, the sender keeps a copy of the receiver's buffer state (how much data has been written into the buffer, how much has been acknowledged, and how much has not). In the initial phase of connection establishment, the exchange of memory information makes the receiver buffer state held by both sides consistent. During the data transfer phase that follows, the receiver buffer state held by the sender is updated through ACK REQUEST control messages.
Ordinarily, every packet sent would be acknowledged by an ACK REQUEST message, but the resulting flood of acknowledgment messages would frequently interrupt data transfer and cost performance. uStream therefore uses a non-real-time acknowledgment strategy to implement its flow-control mechanism.
Under the non-real-time acknowledgment strategy, ACK REQUEST messages are generated and sent in only two cases. First, when the length of unacknowledged data in the receiver buffer exceeds a preset threshold, the receiver proactively sends an ACK REQUEST message, which acknowledges the received data and notifies the sender to update its state (that is, the address up to which data in the buffer has been acknowledged). Second, when too much unacknowledged data in the receiver buffer leaves insufficient writable space, the sender detects this and proactively sends an ACK PLEASE message, asking the receiver to acknowledge data and reply with an ACK REQUEST message so that the sender can update its state. Fig. 5 and Fig. 6 show these two cases respectively. In the figures, head_pos is the start address of the receiver's RDMA buffer, tail_pos is its tail address, ack_pos is the address of the acknowledged data in the receiver's RDMA buffer, and nextpack_pos (next_pack_pos in the Flush pseudo-code) is the address at which the next packet will be written.
The non-real-time acknowledgment strategy lets RDMA operations run many times in succession without being interrupted by acknowledgment messages and eliminates the overhead of frequently exchanging acknowledgments. Most of the communication in uStream can be completed with a single RDMA write operation, which improves both latency and bandwidth.
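The two triggers of the acknowledgment strategy can be modelled in a few lines of Python. This is a toy model under stated assumptions: the class and function names are hypothetical, messages are represented by plain strings, and an acknowledgment is assumed to cover all data received so far.

```python
class ReceiverAckState:
    """Receiver side: acknowledge only when unacknowledged data passes a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.unacked = 0  # bytes received but not yet acknowledged

    def on_data(self, length):
        """Case 1: called when `length` bytes have arrived in the receive buffer."""
        self.unacked += length
        if self.unacked > self.threshold:
            self.unacked = 0
            return "ACK REQUEST"  # proactive acknowledgment
        return None  # stay silent, the sender keeps writing

    def on_ack_please(self):
        """Case 2: the sender ran short of writable space and asked for an ACK."""
        self.unacked = 0
        return "ACK REQUEST"


def sender_control_message(writable_len, needed_len):
    """Sender side: request an acknowledgment only when space is insufficient."""
    return "ACK PLEASE" if writable_len < needed_len else None
```

With a threshold of 100 bytes, two 60-byte packets produce a single acknowledgment after the second packet rather than one per packet, which is the saving the strategy aims for.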
Asynchronous send mechanism and zero-copy:
To achieve high bandwidth, uStream uses an asynchronous send mechanism, implemented mainly by an asynchronous send queue and an independent send thread. The send buffer of uStream is also registered in advance and is organized as a send queue made up of a series of small buffers, each of which holds one packet. The send process of uStream is zero-copy, which eliminates the data copy from the application to the send buffer. Fig. 7 shows the working state of the send queue and the state transition diagram of each small buffer.
uStream defines an independent send thread to implement the asynchronous send mechanism. The send process of uStream consists of 5 operations carried out by two parallel threads: RDMA_malloc, Write, Flush, Post and Poll. The procedure is as follows. First, the main thread calls RDMA_malloc to look up, in the send queue, a writable small buffer (the sender's buffer) of a size suitable for the packet to be sent, and obtains its address. Next, the main thread calls Write to write data into that small buffer. The main thread then calls Flush to determine whether the receiver buffer has enough space. If it does, the main thread wakes the send thread. If it does not, the main thread sends a control message asking the receiver to acknowledge data, waits for the receiver's acknowledgment, and recomputes the remaining space of the receiver buffer; it repeats this until the receiver buffer is known to have enough space and then wakes the send thread. Once woken, the send thread calls Post to send the packet. The send thread is also responsible for obtaining the completion event of the send operation, after which it updates the state of the small buffer and notifies the main thread so that the buffer can be written with a new packet.
The send thread of uStream is invisible to the application. Because the main thread and the send thread can operate on different small buffers in the send queue at the same time, the application need not wait for one packet to finish sending before submitting the next. The send queue thus works as a pipeline, which gives uStream very high bandwidth.
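The pipelining between the main thread and the send thread can be sketched with Python's standard `queue` and `threading` modules. This is only a model of the hand-off between the two threads: the RDMA write is replaced by appending to a list, the receiver-space check that Flush performs is left out, all small buffers have the same size, and the class name `SendQueue` and its method names are hypothetical.

```python
import queue
import threading


class SendQueue:
    """A pool of small buffers shared by a main thread and a hidden send thread."""

    def __init__(self, slots, slot_size):
        self.free = queue.Queue()   # small buffers the main thread may write
        self.ready = queue.Queue()  # filled buffers waiting to be sent
        for _ in range(slots):
            self.free.put(bytearray(slot_size))
        self.wire = []              # stand-in for the receiver's RDMA buffer
        self.thread = threading.Thread(target=self._send_loop, daemon=True)
        self.thread.start()

    def rdma_malloc(self):
        """Main thread: obtain a writable small buffer (blocks until one is free)."""
        return self.free.get()

    def write(self, slot, payload):
        """Main thread: place the payload in the small buffer."""
        if len(payload) > len(slot):
            raise ValueError("payload larger than small buffer")
        slot[:len(payload)] = payload
        return len(payload)

    def flush(self, slot, length):
        """Main thread: hand the buffer to the send thread (space check omitted)."""
        self.ready.put((slot, length))

    def _send_loop(self):
        while True:
            item = self.ready.get()
            if item is None:  # shutdown marker
                break
            slot, length = item
            self.wire.append(bytes(slot[:length]))  # Post: stands in for the RDMA write
            self.free.put(slot)                     # completion: buffer is writable again

    def close(self):
        self.ready.put(None)
        self.thread.join()
```

A caller loops over `rdma_malloc`, `write` and `flush` for each packet and never waits for an individual send to finish; it blocks only when every small buffer is still in flight.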
The communication process of uStream:
Fig. 8 has described the communication process of uStream.UStream adopts two daemon threads to carry out the exchange of transfer of data and control messages respectively, and the hiding independent thread that sends moves at transmit leg.
As shown in Figure 8, the communication process of uStream may be summarized to be following four steps:
I, receiving-transmitting sides exchange handshaking information comprise QPN, RDMA buffer zone address and size etc.; Transmit leg need be known recipient's RDMA buffer zone address and size, just can carry out the RDMA operation, and each side who connects possibly be transmit leg or recipient, all will exchange RDMA recipient's buffer information separately so receive and dispatch two sides.Wherein, QPN is used on InfiniBand, creating a new connection, specifically is to create formation earlier in this locality to QP, obtains QPN, tells remote node QPN through the exchange handshaking information then, connects and has just set up.In the communication process of uStream, transmit leg has been preserved the address and the size of recipient's buffering area, and is connecting the starting stage of setting up, and the state of recipient's buffering area that receiving-transmitting sides is preserved is consistent.And in data transmission procedure subsequently, recipient's buffer state that transmit leg is preserved is upgraded by the ACK REQUEST message that the recipient sends.
II. Following the asynchronous transmission mechanism, the sender performs an RDMA operation that writes the packet directly into the receiver's RDMA buffer; this process is zero-copy. In the uStream send process, before each RDMA operation the sender first uses the locally stored receiver buffer state to calculate whether the contiguous writable space in the receiver's buffer is large enough for the packet to be sent.
This calculation is performed by the Flush operation, whose pseudocode is as follows:
if (psn > ack_psn) {
    /* some packets are still unacknowledged */
    if (next_pack_pos > ack_pos)
        /* free run extends from the write position to the buffer tail */
        consecutiveLen = tail_pos - next_pack_pos;
    else
        /* free run extends from the write position to the acknowledged position */
        consecutiveLen = ack_pos - next_pack_pos;
} else if (psn == ack_psn) {
    /* everything is acknowledged, so the two positions must coincide */
    if (next_pack_pos == ack_pos)
        consecutiveLen = tail_pos - next_pack_pos;
    else
        exit with error;
} else {
    exit with error;
}
If the calculated contiguous writable space in the receiver's buffer is not enough for the packet to be sent, the wrap-around algorithm is used to decide whether the packet must be split into segments. Once the main thread has determined whether the packet is to be segmented, and has determined the destination address and the wrap-around address, it wakes the send thread, which performs the RDMA operation and sends the packet.
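The Flush calculation and the segmentation decision can be sketched in Python as follows. `consecutive_len` mirrors the pseudocode, using the same variable names. `plan_write` and `next_packet_dest` are hypothetical helpers based on the header-at-tail, data-at-head rule stated in the claims; `HEADER_LEN`, the use of offsets relative to the buffer start, and the conservative condition that the data must fit in front of `ack_pos` are assumptions of this sketch, not details given in the text.

```python
HEADER_LEN = 32  # hypothetical packet header size in bytes


def consecutive_len(psn, ack_psn, next_pack_pos, ack_pos, tail_pos):
    """Contiguous writable bytes in the receiver buffer, as in the Flush pseudocode."""
    if psn > ack_psn:
        if next_pack_pos > ack_pos:
            return tail_pos - next_pack_pos
        return ack_pos - next_pack_pos
    if psn == ack_psn:
        if next_pack_pos == ack_pos:
            return tail_pos - next_pack_pos
        raise ValueError("positions disagree although all packets are acknowledged")
    raise ValueError("acknowledged packet number is ahead of the send packet number")


def plan_write(psn, ack_psn, next_pack_pos, ack_pos, tail_pos, data_len):
    """Return a list of (part, offset) writes, or None if the sender must
    first ask the receiver for an acknowledgement."""
    free = consecutive_len(psn, ack_psn, next_pack_pos, ack_pos, tail_pos)
    if free >= HEADER_LEN + data_len:
        return [("header+data", next_pack_pos)]      # fits without segmentation
    run_ends_at_tail = psn == ack_psn or next_pack_pos > ack_pos
    if run_ends_at_tail and free >= HEADER_LEN and data_len <= ack_pos:
        # Header stays at the tail; the wrap-around address (offset 0) gets the data.
        return [("header", next_pack_pos), ("data", 0)]
    return None


def next_packet_dest(end_pos, tail_pos):
    """Destination of the next packet: wrap to the buffer start when the tail
    cannot hold even a header (the rule of claims 5 and 12)."""
    return 0 if tail_pos - end_pos < HEADER_LEN else end_pos
```

For example, with a 256-byte buffer, a write position of 200 and an acknowledged position of 120, a 100-byte payload does not fit in the 56 bytes left at the tail, so the sketch places the header at offset 200 and the data at offset 0.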
III. ACK PLEASE and ACK REQUEST messages are exchanged to update the receiver buffer state held by the sender. This is the non-real-time acknowledgement policy; it is carried out by the control thread and alternates with the previous step. Under this policy, introduced above, uStream exchanges acknowledgement messages in only two cases. First, when the length of unacknowledged data in the receiver's buffer exceeds the preset receive threshold, the receiver proactively sends an ACK REQUEST message. Second, when the sender finds that the receiver's buffer lacks enough contiguous writable space for a packet, the sender proactively sends an ACK PLEASE message, asking the receiver to acknowledge the data it has received and to reply with an ACK REQUEST message.
IV. Close the connection. After the application data (for example an audio or video file) has been transferred, the sender's main thread tells the control thread to send a connection-close message, notifying the receiver that the connection is about to close, and both sides release their resources. Before the connection is closed, the sender must make sure that all send completion events have been obtained; otherwise both sides wait until the send thread has obtained all send completion events and only then close the connection.
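The two triggers of the non-real-time acknowledgement policy in step III can be modelled with a small Python simulation. It is a simplified sketch under stated assumptions: the buffer is treated as a plain byte count without wrap-around geometry, the receiving application is assumed to have consumed all data by the time it acknowledges, and the class names `Receiver` and `SenderView` are hypothetical.

```python
class Receiver:
    def __init__(self, threshold):
        self.threshold = threshold
        self.unacked = 0           # bytes received but not yet acknowledged

    def on_data(self, nbytes):
        """Account for received data; return the acknowledged byte count
        (an ACK REQUEST) once the threshold is exceeded, else 0."""
        self.unacked += nbytes
        if self.unacked > self.threshold:
            return self.ack_request()
        return 0

    def ack_request(self):
        acked, self.unacked = self.unacked, 0
        return acked


class SenderView:
    """Sender-side copy of the receiver buffer state."""

    def __init__(self, capacity, receiver):
        self.capacity = capacity
        self.in_flight = 0         # bytes the sender believes are unacknowledged
        self.receiver = receiver
        self.ack_please_sent = 0
        self.ack_request_seen = 0

    def _apply_ack(self, acked):
        if acked:
            self.ack_request_seen += 1
            self.in_flight -= acked

    def send(self, nbytes):
        if nbytes > self.capacity:
            raise ValueError("packet larger than the receiver buffer")
        if self.capacity - self.in_flight < nbytes:
            # Case 2: not enough space, so the sender sends ACK PLEASE.
            self.ack_please_sent += 1
            self._apply_ack(self.receiver.ack_request())
        self.in_flight += nbytes
        # Case 1: the receiver acknowledges on its own past the threshold.
        self._apply_ack(self.receiver.on_data(nbytes))
```

In this model, twenty 100-byte packets sent into a 1000-byte buffer with a 500-byte threshold cause only three acknowledgement messages, all initiated by the receiver; raising the threshold reduces that number further, at the cost of the sender occasionally having to ask.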
The delay and bandwidth performance of uStream are presented below. Because SDP performs much better than IPoIB, the following tests mainly compare uStream with SDP.
Delay performance:
Fig. 9 compares the delay performance of uStream and SDP. The receiver buffer is 256K and the uStream send queue length is 16, that is, the send queue holds 16 small buffers. As the figure shows, uStream reduces delay by 40% to 50% relative to SDP for small packets and by 60% to 75% for large packets; its minimum delay is 7.9 µs, far better than SDP's 15.3 µs.
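As a quick check on the figures quoted above, the relative reduction implied by the two minimum delays can be computed directly. The helper name `delay_reduction` is introduced here for illustration only.

```python
def delay_reduction(baseline_us, new_us):
    """Fractional reduction in delay relative to the baseline."""
    return (baseline_us - new_us) / baseline_us


# Minimum delays quoted in the text: SDP 15.3 us, uStream 7.9 us.
print(f"{delay_reduction(15.3, 7.9):.1%}")
```

The result, roughly 48%, falls inside the 40% to 50% range reported for small packets.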
Fig. 10 compares the delay performance of uStream and uVerbs. As Fig. 10 shows, the delay of uStream is at a level comparable to InfiniBand uVerbs. Specifically, for small packets the delay of uStream is roughly twice that of uVerbs, but the gap narrows as packets grow; the two reach the same delay at a packet size of 8K, and as packets grow further uStream even outperforms uVerbs. This is because the uVerbs benchmark uses synchronous post and poll operations, whereas the uStream implementation is asynchronous.
The test results above show that uStream improves considerably on SDP in delay. The reasons are analyzed below from the perspective of the communication semantics and the implementation of uStream.
Pre-registration and zero-copy eliminate memory overhead. Memory registration is very expensive and has a large effect on delay. By using pre-registered buffers and a zero-copy mechanism, uStream eliminates the cost of memory registration and copying, which is critical to improving delay.
The non-real-time acknowledgement policy eliminates the overhead of frequently exchanging acknowledgement messages. Most uStream transfers are completed by a single RDMA write, with no need to exchange acknowledgement messages frequently. SDP, by contrast, must send SrcAvail and SinkAvail messages before sending data, and a WrComp or RdCompl message after the data has been sent.
Use of RDMA write. The uStream data path is implemented with RDMA write, whereas existing SDP implementations use the Send/Recv model; RDMA write has lower delay than Send/Recv.
Bandwidth performance:
Fig. 11 compares the bandwidth of uStream and SDP. The uStream receiver buffer is 8M and the send queue length is 128. As the figure shows, the peak bandwidth of uStream reaches 10.4 Gbps, far above SDP's 7.8 Gbps. Overall, uStream's bandwidth is 30% to 60% higher than SDP's for small packets and on average 30% higher for large packets.
As Fig. 11 shows, in our tests SDP reaches its peak bandwidth when the receiver buffer is about 85K. For uStream, because it uses pre-registered buffers and the wrap-around algorithm, the larger the receiver buffer, the better the bandwidth. The larger the receiver buffer, the longer it takes to reach the receive acknowledgement threshold, so more packets can be sent before an acknowledgement message interrupts the flow, which yields higher bandwidth. In addition, because uStream's buffers are pre-registered, buffer registration time is not counted in the transfer, so even registering a large buffer does not affect uStream's performance, whereas SDP may be affected.
Fig. 12 compares the bandwidth of uStream and uVerbs. As Fig. 12 shows, uStream's peak bandwidth of 10.4 Gbps is very close to uVerbs' 11 Gbps, which demonstrates that uStream can use the underlying communication link effectively and reach performance comparable to InfiniBand Verbs.
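The peak bandwidth figures quoted in this section can be put side by side with a short calculation. The helper name `ratio` is hypothetical; the three input values are the ones given in the text.

```python
def ratio(a_gbps, b_gbps):
    """Bandwidth a expressed as a multiple of bandwidth b."""
    return a_gbps / b_gbps


USTREAM, SDP, UVERBS = 10.4, 7.8, 11.0  # peak bandwidths in Gbps, from the text

print(f"uStream vs SDP:    {ratio(USTREAM, SDP) - 1:.1%} higher")
print(f"uStream vs uVerbs: {ratio(USTREAM, UVERBS):.1%} of the uVerbs peak")
```

On these numbers uStream's peak is about one third higher than SDP's and about 95% of the uVerbs peak, which matches the statement that the two are very close.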
uStream reaches such high bandwidth mainly because it uses an asynchronous send queue and an independent send thread. With uStream, the application can return and submit the next packet as soon as it has written the data into a send buffer and woken the send thread, without waiting for the data to be sent. In addition, the non-real-time acknowledgement policy reduces the number of times data transmission is interrupted, making communication smoother. The send queue can therefore be kept full, so there is always a large amount of data in transit on the uStream data path. Moreover, once the send thread is woken it can send the packets in the send queue in order without interruption, unless the data in some small buffer is not yet ready. This pipelined mode of operation lets uStream make full use of the bandwidth of the underlying InfiniBand Verbs.
Those skilled in the art may make various modifications to the above without departing from the spirit and scope of the invention as defined by the claims. The scope of the invention is therefore not limited to the above description but is determined by the scope of the claims.

Claims (13)

1. A method of communicating over an InfiniBand network, characterized in that a high-performance stream communication unit uStream is implemented in user space, and uStream uses two independent paths, a data path and a control path, to carry out data transfer and the exchange of control messages respectively; the communication process of said uStream comprises:
Step 1: the sender and the receiver exchange handshake information, which includes a queue pair number (QPN) used to create a new connection in the InfiniBand network, the address of the receiver's remote direct memory access (RDMA) buffer, and the size of the receiver's RDMA buffer;
Step 2: following the asynchronous transmission mechanism, the sender writes the current packet directly into the receiver's RDMA buffer according to said address and said size of the receiver's RDMA buffer; wherein
if, when sending the current packet, the sender determines that the contiguous writable space remaining at the tail of the receiver's RDMA buffer is not enough to hold the data of the current packet, the wrap-around address contained in the header of the current packet is set to the destination address of the data of the current packet; the sender writes the header of the current packet directly into the contiguous writable space remaining at the tail of the receiver's RDMA buffer according to the destination address of the current packet, and then writes the data of the current packet to the head of the receiver's RDMA buffer according to the wrap-around address of the current packet;
Step 3: ACK PLEASE and ACK REQUEST messages are exchanged to update the receiver buffer state held by the sender; wherein
when the length of unacknowledged data in the receiver's buffer exceeds a preset threshold, the receiver proactively sends a first request message, acknowledging the received data and notifying the sender to update the receiver buffer state it holds; or, when too much unacknowledged data in the receiver's buffer leaves insufficient writable space, the sender proactively sends a second request message, requesting the receiver to acknowledge data and to reply with a first request message so as to update the receiver buffer state held by the sender;
Step 4: after the application data has been transferred, the sender closes said connection;
wherein Step 2 comprises:
Step 21: the main thread of the sender looks up, in the send queue, a buffer suitable for holding the current packet;
Step 22: the main thread of the sender writes data into this buffer;
Step 23: the main thread of the sender determines whether the receiver's buffer has enough space and, if so, wakes the send thread of the sender;
Step 24: after being woken, the send thread of the sender writes the current packet directly into the receiver's RDMA buffer and, after the packet has been written into the receiver's RDMA buffer, notifies the main thread of the sender that this buffer can be written with a new packet.
2. The method of communicating over an InfiniBand network as claimed in claim 1, characterized in that a packet comprises a header and data; the header comprises the following parameters: the packet number of the current packet, the destination address of the current packet, the data length of the current packet, a wrap-around address, and the destination address of the next packet; wherein the wrap-around address is set to the destination address of the data of the current packet when the header and the data of the current packet are sent separately; the destination address of the next packet informs the receiver of the position at which the next packet will be written; and the sender obtains the destination address of the current packet from the locally stored state of the receiver's RDMA buffer.
3. The method of communicating over an InfiniBand network as claimed in claim 1, characterized in that, in Step 1, the sender and the receiver exchange handshake information through a connection initialization message.
4. The method of communicating over an InfiniBand network as claimed in claim 1, characterized in that the handshake information further includes the address of the sender's RDMA buffer and the size of the sender's RDMA buffer.
5. The method of communicating over an InfiniBand network as claimed in claim 2, characterized in that, in Step 2:
if the sender determines that the contiguous writable space remaining at the tail of the receiver's RDMA buffer is enough to hold the current packet, the current packet is written directly into the receiver's RDMA buffer according to the destination address of the current packet;
if, when sending the current packet, the sender determines that the contiguous writable space remaining at the tail of the receiver's RDMA buffer is not enough to hold the header of the next packet, the destination address of the next packet contained in the header of the current packet is set to the start address of the receiver's RDMA buffer.
6. The method of communicating over an InfiniBand network as claimed in claim 1, characterized in that Step 2 and Step 3 are carried out alternately.
7. The method of communicating over an InfiniBand network as claimed in claim 1, characterized in that, in Step 3, the receiver uses the receiver's control thread to send the first request message, and the sender uses the sender's control thread to send the second request message.
8. A system for communicating over an InfiniBand network, comprising a sender and a receiver, characterized in that a high-performance stream communication unit uStream is implemented in user space, and said uStream uses two independent paths, a data path and a control path, to carry out data transfer and the exchange of control messages respectively; wherein:
the sender is configured to exchange handshake information with the receiver, the handshake information including a queue pair number (QPN), the address of the receiver's remote direct memory access (RDMA) buffer, and the size of the receiver's RDMA buffer; the QPN is used to create a new connection in the InfiniBand network;
the sender is further configured such that:
when sending the current packet, if the sender determines that the contiguous writable space remaining at the tail of the receiver's RDMA buffer is not enough to hold the data of the current packet, the wrap-around address contained in the header of the current packet is set to the destination address of the data of the current packet; the sender writes the header of the current packet directly into the contiguous writable space remaining at the tail of the receiver's RDMA buffer according to the destination address of the current packet, and then writes the data of the current packet to the head of the receiver's RDMA buffer according to the wrap-around address of the current packet; wherein
the sender comprises a main thread and a send thread;
the main thread is configured to look up, in the send queue, a buffer suitable for holding the current packet, to write data into this buffer, and to determine whether the receiver's RDMA buffer has enough space and, if so, to wake the send thread;
the send thread is configured, after being woken, to write the current packet directly into the receiver's RDMA buffer and, after the packet has been written into the receiver's RDMA buffer, to notify the main thread of the sender that this buffer can be written with a new packet;
the sender is further configured, following the asynchronous transmission mechanism, to write the current packet directly into the receiver's RDMA buffer according to the address and the size of the receiver's RDMA buffer; to exchange ACK PLEASE and ACK REQUEST messages so as to update the receiver buffer state held by the sender; and to close said connection after the application data has been transferred;
the receiver is further configured, when the length of unacknowledged data in the receiver's buffer exceeds a preset threshold, to proactively send a first request message, acknowledging the received data and notifying the sender to update the receiver buffer state it holds; or
the sender is further configured, when too much unacknowledged data in the receiver's buffer leaves insufficient writable space, to proactively send a second request message, requesting the receiver to acknowledge data and to reply with a first request message so as to update the receiver buffer state held by the sender.
9. The system for communicating over an InfiniBand network as claimed in claim 8, characterized in that a packet comprises a header and data; the header comprises the following parameters: the packet number of the current packet, the destination address of the current packet, the data length of the current packet, a wrap-around address, and the destination address of the next packet; wherein the wrap-around address is set to the destination address of the data of the current packet when the header and the data of the current packet are sent separately; the destination address of the next packet informs the receiver of the position at which the next packet will be written; and the sender obtains the destination address of the current packet and the destination address of the next packet from the locally stored state of the receiver's RDMA buffer.
10. The system for communicating over an InfiniBand network as claimed in claim 8, characterized in that the sender and the receiver exchange handshake information through a connection initialization message.
11. The system for communicating over an InfiniBand network as claimed in claim 8, characterized in that the handshake information further includes the address of the sender's RDMA buffer and the size of the sender's RDMA buffer.
12. The system for communicating over an InfiniBand network as claimed in claim 9, characterized in that
the sender is further configured such that:
when sending the current packet, if the sender determines that the contiguous writable space remaining at the tail of the receiver's RDMA buffer is not enough to hold the header of the next packet, the destination address of the next packet contained in the header of the current packet is set to the start address of the RDMA buffer.
13. The system for communicating over an InfiniBand network as claimed in claim 8, characterized in that the receiver uses the receiver's control thread to send the first request message, and the sender uses the sender's control thread to send the second request message.
CN2008102246636A 2008-10-22 2008-10-22 Method and system for communication using InfiniBand network Expired - Fee Related CN101409715B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008102246636A CN101409715B (en) 2008-10-22 2008-10-22 Method and system for communication using InfiniBand network

Publications (2)

Publication Number Publication Date
CN101409715A CN101409715A (en) 2009-04-15
CN101409715B true CN101409715B (en) 2012-04-18

Family

ID=40572503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008102246636A Expired - Fee Related CN101409715B (en) 2008-10-22 2008-10-22 Method and system for communication using InfiniBand network

Country Status (1)

Country Link
CN (1) CN101409715B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8898222B2 (en) * 2012-01-19 2014-11-25 International Business Machines Corporation Processing STREAMS messages over a system area network

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102404212A (en) * 2011-11-17 2012-04-04 曙光信息产业(北京)有限公司 Cross-platform RDMA (Remote Direct Memory Access) communication method based on InfiniBand
CN102404398B (en) * 2011-11-17 2015-03-25 曙光信息产业(北京)有限公司 Multi-client-side supported RDMA (Remote Direct Memory Access) message sending method
CN102438048B (en) * 2011-12-15 2014-04-30 北京新媒传信科技有限公司 Method and system for calling remote service from Internet
CN103716360A (en) * 2012-10-09 2014-04-09 宇瞻科技股份有限公司 Method for sharing files in network transmission system
CN103248467B (en) * 2013-05-14 2015-10-28 中国人民解放军国防科学技术大学 Based on the RDMA communication means of sheet inner connection tube reason
EP2840576A4 (en) * 2013-05-20 2015-07-01 Huawei Tech Co Ltd Hard disk and data processing method
CN103440202B (en) * 2013-08-07 2016-12-28 华为技术有限公司 A kind of communication means based on RDMA, system and communication equipment
CN105446936B (en) * 2015-11-16 2018-07-03 上海交通大学 Distributed hashtable method based on HTM and unidirectional RDMA operation
CN105630426A (en) * 2016-01-07 2016-06-01 清华大学 Method and system for obtaining remote data based on RDMA (Remote Direct Memory Access) characteristics
CN108011909B (en) * 2016-10-28 2020-09-01 北京市商汤科技开发有限公司 Communication method and system, electronic device and computer cluster
CN107147722A (en) * 2017-05-19 2017-09-08 郑州云海信息技术有限公司 A kind of IB RTI methods based on RDMA communication mechanisms
CN109144742B (en) * 2017-06-15 2020-02-07 北京忆芯科技有限公司 Method for exchanging information through queue and system for processing queue
CN107451092A (en) * 2017-08-09 2017-12-08 郑州云海信息技术有限公司 A kind of data transmission system based on IB networks
CN107579892A (en) * 2017-08-29 2018-01-12 郑州云海信息技术有限公司 A kind of communication means based on RapidIO agreements and RDMA technologies
WO2019140556A1 (en) * 2018-01-16 2019-07-25 华为技术有限公司 Message transmission method and apparatus
CN109117288B (en) * 2018-08-15 2022-04-12 无锡江南计算技术研究所 Message optimization method for low-delay bypass
CN109067752B (en) * 2018-08-15 2021-03-26 无锡江南计算技术研究所 Method for realizing compatibility of TCP/IP protocol by using RDMA message
CN109274647B (en) * 2018-08-27 2021-08-10 杭州创谐信息技术股份有限公司 Distributed trusted memory exchange method and system
CN110602211B (en) * 2019-09-16 2022-06-14 无锡江南计算技术研究所 Out-of-order RDMA method and device with asynchronous notification
CN111400213B (en) * 2019-09-29 2022-02-18 杭州海康威视系统技术有限公司 Method, device and system for transmitting data
EP4054140A4 (en) 2019-11-22 2022-11-16 Huawei Technologies Co., Ltd. Method for processing non-buffer data write request, and buffer and node
CN111314311A (en) * 2020-01-19 2020-06-19 苏州浪潮智能科技有限公司 Method, system, equipment and medium for improving performance of switch
CN111988241B (en) * 2020-08-20 2022-09-30 恒生电子股份有限公司 Message queuing method, system, device and storage medium
CN112003860B (en) * 2020-08-21 2021-09-21 上海交通大学 Memory management method, system and medium suitable for remote direct memory access
CN113422793A (en) * 2021-02-05 2021-09-21 阿里巴巴集团控股有限公司 Data transmission method and device, electronic equipment and computer storage medium
CN113572582B (en) * 2021-07-15 2022-11-22 中国科学院计算技术研究所 Data transmission and retransmission control method and system, storage medium and electronic device
CN115002047B (en) * 2022-05-20 2023-06-13 北京百度网讯科技有限公司 Remote direct data access method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1494003A (en) * 2002-10-30 2004-05-05 华为技术有限公司 Device and method for realizing interface conversion
CN1549108A (en) * 2003-05-07 2004-11-24 中兴通讯股份有限公司 Method for realizing communication process zero copy information queue
WO2005018149A1 (en) * 2003-08-14 2005-02-24 International Business Machines Corporation System, method, and computer program product for centralized management of an infiniband distributed system area network
CN1624668A (en) * 2003-12-02 2005-06-08 国际商业机器公司 Storing fibre channel information on an infiniband administration data base
CN101135980A (en) * 2006-08-29 2008-03-05 飞塔信息科技(北京)有限公司 Device and method for realizing zero copy based on Linux operating system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120418

Termination date: 20201022