WO2007138250A2 - Computer system - Google Patents

Computer system

Info

Publication number
WO2007138250A2
Authority
WO
WIPO (PCT)
Prior art keywords
queue
lock
application
computer system
payload
Prior art date
Application number
PCT/GB2007/001821
Other languages
English (en)
Other versions
WO2007138250A3 (fr)
Inventor
David James Riddoch
Original Assignee
Solarflare Communications Incorporated
Priority date
Filing date
Publication date
Priority claimed from GB0610506A
Priority claimed from GB0613556A
Priority claimed from GB0613975A
Priority claimed from GB0614220A
Application filed by Solarflare Communications Incorporated
Publication of WO2007138250A2
Publication of WO2007138250A3

Classifications

    • G06F 9/544 Buffers; Shared memory; Pipes
    • G06F 9/526 Mutual exclusion algorithms
    • H04L 49/90 Buffering arrangements
    • H04L 49/901 Buffering arrangements using storage descriptor, e.g. read or write pointers
    • H04L 49/9047 Buffering arrangements including multiple buffers, e.g. buffer pools
    • H04L 49/9089 Reactions to storage capacity overflow: replacing packets in a storage arrangement, e.g. pushout
    • H04L 49/9094 Arrangements for simultaneous transmit and receive, e.g. simultaneous reading/writing from/to the storage element
    • G06F 2209/542 Intercept (indexing scheme relating to G06F 9/54)
    • G06F 2209/548 Queue (indexing scheme relating to G06F 9/54)

Definitions

  • the present application relates to a computer system capable of running a plurality of processes, especially a computer system which is connected to a network, and discloses four distinct inventive concepts which are described below in Sections A to D of the description.
  • Claims 1 to 13 relate to the description in Section A
  • claims 14 to 32 relate to the description in Section B
  • claims 33 to 59 relate to the description in Section C
  • claims 60 to 72 relate to the description in Section D.
  • figures 1 to 8 relate to the description in Section A
  • figures 9 to 12 relate to the description in Section B
  • figures 13 to 15 relate to the description in Section C
  • figures 16 to 22 relate to the description in Section D.
  • Embodiments of each of the inventions described herein may include any one or more of the features described in relation to the other inventions.
  • the present invention relates to a computer system capable of running a plurality of processes, especially a computer system which is connected to a network.
  • Computer systems often operate in a network.
  • the shared lock is prone to a problem called lock contention.
  • a lock becomes contended when a process tries to obtain the lock that is held by another process.
  • the overhead associated with lock contention reduces system performance and is best avoided.
  • In operating system kernels, it is common to use spinlocks. According to the spinlock model, a process repeatedly tries to obtain the lock until it succeeds. This works well when locks are held for short periods of time.
  • this model does not work well at the user level, for example, when the contending processes are threads of user applications where the lock may be held for a considerable period of time.
  • the normal approach in this case is to put the contending process to sleep until the current lock-holding process releases the lock.
  • This model is normally referred to as blocking.
  • An object of the present invention is to reduce the lock contention overhead when queueing items in preparation for subsequent sending over a network.
  • the present invention may provide a computer system which is capable of running a plurality of concurrent processes, the system being operable: to establish a first queue in which items related to data for sending over the network are enqueued, and to which access is governed by a lock; when a first of said processes is denied access to the first queue by the lock, to enqueue the items into a second queue to which access is not governed by the lock; and to arrange for the items in the second queue to be dequeued with the items in the first queue.
  • the present invention avoids the above-mentioned overheads associated with the spinlock or blocking lock contention handling models. Further, by arranging it such that items on the second queue are handled together with items in the first queue, the present invention ensures that the items in the second queue are processed in a timely fashion.
  • the system is operable to integrate items from the second queue into the first queue.
  • the items in the second queue will be dequeued by the system as though they had been enqueued in the first queue from the beginning. Integration may be achieved by linking the first queue and the second queue together.
  • the items in the second queue may be dequeued from the second queue and moved to the first queue.
  • the second queue comprises a data structure facilitating access by concurrent processes. In this way, the integrity of the second queue can be maintained even after it has been integrated into the first queue and might be subject to concurrent manipulation by the first process enqueueing items and another process dequeueing items from the first queue.
  • the second queue comprises a linked list having a head, to which items are added and from which they are removed by atomic instructions. In other embodiments, the second queue may comprise a circular buffer having an input pointer pointing to where items are entered into the buffer, and an output pointer pointing to where items are removed from the buffer, wherein the input and output pointers are prevented from crossing one another.
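For the circular-buffer variant just mentioned, the following is a minimal C sketch, safe for one enqueuer and one dequeuer at a time. The names and the power-of-two capacity are illustrative assumptions, not taken from this application.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QCAP 64u                      /* illustrative power-of-two capacity */

struct circ_queue {
    void *slot[QCAP];
    _Atomic uint32_t in;              /* input pointer: next free slot      */
    _Atomic uint32_t out;             /* output pointer: next item to take  */
};

/* Enqueue fails when the input pointer would cross the output pointer. */
static bool circ_put(struct circ_queue *q, void *item)
{
    uint32_t in = atomic_load(&q->in), out = atomic_load(&q->out);
    if (in - out == QCAP)
        return false;                 /* full: pointers must not cross */
    q->slot[in & (QCAP - 1)] = item;
    atomic_store(&q->in, in + 1);     /* publish only after the slot is written */
    return true;
}
```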
  • Linked lists are suitable structures by which to implement the first and second queues, and preferably the first queue is linked to the second queue by arranging for a pointer of the first queue to point to the second queue, whereby the first and second queues form a single linked list structure.
  • the system comprises means for registering the existence of the second queue, and is further operable, after completing the second queue, to register the existence of the second queue with the registering means if the lock is held by a said process other than the first process.
  • This registration provides the mechanism through which the need to process the second queue can be communicated, and thus delegated, by the first process to a said process other than the first process. However, if after completing the second queue it turns out that the lock is no longer being held and is grabbed by the first process, the first process may itself dequeue for sending the items in the second queue. In such a case, no registration of the second queue takes place.
  • the lock is a single unit of memory which is manipulated atomically.
  • the unit of memory can be a single word, or multiple words, if the processor architecture supports atomic manipulation of multiple words.
  • the registering means may include bits of the lock. This is advantageous because it enables lock manipulation and the determination of whether there exist any second queues for processing to be carried out in the same operation.
  • In those bits of the lock allocated to the registering means, the head of a linked list may be stored.
  • The linked list may comprise items, each item referring to a socket which has formed a said second queue.
  • the system is further operable when a process is about to release a lock to check whether there are any second queues to be dequeued for sending.
  • This may be achieved by the registering means where the existence of second queues is logged.
  • the operation of checking for the existence of any registered second queues should be atomic with respect to releasing the lock, i.e. it should not be possible to release the lock if there is a registered second queue.
  • Said items can include the data itself, especially when the data volume is small, but preferably comprise a pointer to a buffer in, for example, an application's address space where the data is actually held. This saves the overhead of moving the data around during enqueueing.
  • the present invention may provide a computer program for sending data to the network interface of a computer system which is capable of running a plurality of concurrent processes, the computer program being operable to establish a first queue in which items related to data for sending over the network interface are enqueued, and to which access is governed by a lock; and when a said process is denied access to the first queue by the lock, to enqueue the items for sending into a second queue to which access is not governed by said lock, wherein the computer program is further operable to arrange for the items in the second queue to be dequeued with the items in the first queue.
  • the present invention may provide a data carrier bearing the above-mentioned computer program.
  • the present invention may provide a computer system running a plurality of processes, comprising a first queue in which items related to data for sending over a network are enqueued; a lock by which access to the first queue is governed; a second queue to which access is not governed by the lock; wherein the system is operating such that when a said process is denied access to the first queue by the lock, the data item for sending is enqueued in the second queue, and items in the second queue are dequeued with items in the first queue.
  • Figure 1 shows a hardware overview of an embodiment of the invention;
  • Figure 2 shows an overview of various portions of state and associated locks in an embodiment of the invention;
  • Figures 3(a) and 3(b) show algorithms in accordance with an embodiment of the invention;
  • Figures 4 to 6 show an overview of various data structures established by an embodiment of the invention operating according to the algorithms of Figures 3(a) and 3(b);
  • Figure 7 shows the bit structure of the netif lock;
  • Figure 8 shows the structure of a queue formed in accordance with an embodiment of the invention.
  • a computer system 10 in the form of a personal computer comprises a central processing unit 15 and a memory 20, and is connected to a network 35 by an Ethernet connection 30.
  • the Ethernet connection is managed by a network interface card 25 which supports the physical and hardware requirements of the Ethernet system.
  • the physical hardware implementation need not be that of a card: for instance, it could be in the form of an integrated circuit and connector mounted directly on a motherboard.
  • Figure 2 shows the system state when a number of application execution threads 40, 42, 44 have been initiated, and a number of TCP or UDP sockets 46, 48 have been opened up by the applications to communicate with the network interface card 25. Because the sockets 46, 48 can be accessed by more than one thread, some of the state for each socket is protected by a socket-specific lock 52, 54 known as a sock-lock. The sock-locks protect aspects of the operation of the sockets which are independent of other parts of the system. In the drawings, these portions of state are represented diagrammatically by the regions 56, 58. Also, each socket has portions of state all of which are protected by a single shared lock, hereinafter referred to as the network interface or netif lock 60. In the drawings, these portions of state are represented diagrammatically by the regions 62, 64. In the embodiment the network interface lock 60 also protects, as the name suggests, various portions of state related to access to the network interface.
  • the netif lock is implemented as a single word in memory so that it may be manipulated by atomic instructions.
  • Some bits 60a are used to indicate whether it is locked or not.
  • Some bits 60b are used to indicate whether it is contended (i.e. whether other threads are waiting for the lock).
  • Some bits 60c are used to request special actions when the netif is unlocked e.g. via callbacks, as described in the applicant's co-pending patent application GB0504987.9, which is incorporated herein by reference.
  • a set of bits 60d are used to implement a list or register of deferred sockets, which is described in more detail below.
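As a concrete illustration of the lock word just described, the following C sketch packs the four groups of bits into one 32-bit word. The field widths and masks are illustrative assumptions; the application only requires that all of the bits live in a single atomically manipulable word.

```c
#include <stdint.h>

/* Hypothetical layout of the netif lock 60 in one 32-bit word; the
 * actual widths of the bit groups 60a-60d are not specified. */
typedef uint32_t netif_lock_word;

#define NETIF_LOCKED       0x00000001u  /* bits 60a: lock held            */
#define NETIF_CONTENDED    0x00000002u  /* bits 60b: other threads waiting */
#define NETIF_CB_MASK      0x0000000Cu  /* bits 60c: callback requests    */
#define NETIF_DEFER_MASK   0xFFFFFFF0u  /* bits 60d: deferred-socket list head */
#define NETIF_DEFER_SHIFT  4            /* shift to recover the list head */
```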
  • the thread 44 allocates buffers for the data to be sent in the application address space.
  • the data to be sent need not be at the user level.
  • it fills those buffers with data for sending. It will be noted that steps 110, 112 are independent of other parts of the system and so there is no need to obtain access by means of a lock.
  • the thread 44 attempts to grab the netif lock 60 using an atomic compare-and-swap (CAS) operation.
  • the CAS operation compares the bits 60a of the netif lock which are indicative of whether it is locked or not with a predetermined set of bits which represent the lock not being held; if the compared bits are the same, then the bits are swapped for another set of bits indicative of the netif lock being held by the thread 44. Else, if the compared bits are different, indicating the netif lock is already held by another thread, no swap operation is performed.
  • the use of a netif lock comprising only one word and manipulated by an atomic operation guarantees that only one thread can hold the netif lock at one time.
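A minimal sketch of the step-114 lock grab follows, assuming C11 atomics and the NETIF_LOCKED mask from the sketch above; the text only requires that the manipulation be a single atomic compare-and-swap on one word.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static _Atomic uint32_t netif_lock;  /* the single netif lock word */

/* Try once to take the netif lock; returns true on success. */
static bool netif_try_lock(void)
{
    uint32_t expected = atomic_load(&netif_lock) & ~NETIF_LOCKED;
    /* The CAS succeeds only if the lock bits still show "not held";
     * otherwise no swap is performed and the call reports failure. */
    return atomic_compare_exchange_strong(&netif_lock, &expected,
                                          expected | NETIF_LOCKED);
}
```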
  • If the thread 44 found the netif lock 60 in an unheld condition and took possession of it at step 114, it moves on to step 116, where it enqueues the items in the send queue 70 in the socket 48 which it is using.
  • Each queue item 72a-d comprises a pointer field 74 which points to the next item in the queue, a field 75 indicating the length of the data for sending, and a field 76 indicating its start address in the application address space 77.
  • The use of IOVEC pointers means that the data itself, the volume of which might be quite high, need not be moved while the send queue is being formed.
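The queue items of Figure 8 map naturally onto a small C structure. A possible sketch, with illustrative field names:

```c
#include <stddef.h>

/* One send-queue item (72a-d in Figure 8): an IOVEC-style descriptor
 * referencing data left in place in the application address space 77. */
struct send_item {
    struct send_item *next;   /* field 74: next item in the queue        */
    size_t            len;    /* field 75: length of the data            */
    void             *base;   /* field 76: start address of the data     */
};
```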
  • If the thread 44 failed to obtain the netif lock, it, at step 124, obtains the sock-lock 54 and establishes a send prequeue 85 as illustrated in Figure 5.
  • the send prequeue 85 comprises a send prequeue head pointer 86 and can be the same basic structure as the send queue, but differs in that access to the send prequeue is governed by a sock-lock, in this particular case the sock-lock 54 associated with the socket 48.
  • enqueueing of the data items may be accomplished by an IOVEC pointer structure as shown in Figure 8.
  • although the send prequeue 85 is protected by a sock-lock generally, it is possible that, as will be described later, it will be dequeued by a netif-lock-holding thread which will ignore the sock-lock.
  • the location to which the send prequeue head pointer 86 points is manipulated using an atomic instruction, which means that a thread enqueueing data items into the queue need not synchronize with a thread dequeueing data items from the queue.
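A sketch of such a lock-free enqueue onto the head pointer 86, reusing the send_item structure and C11 atomics from the earlier sketches: the enqueueing thread retries its compare-and-swap until it wins, and never needs to coordinate with a dequeuer.

```c
static _Atomic(struct send_item *) prequeue_head;  /* head pointer 86 */

/* Push an item at the head of the send prequeue without holding any lock. */
static void prequeue_push(struct send_item *item)
{
    struct send_item *old = atomic_load(&prequeue_head);
    do {
        item->next = old;  /* link to the head we observed */
        /* On failure, 'old' is reloaded with the current head value. */
    } while (!atomic_compare_exchange_weak(&prequeue_head, &old, item));
}
```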
  • the thread 44, at step 126, operating on the bits 60a, performs an atomic CAS operation. If the netif lock 60 was dropped during the formation of the send prequeue 85, then it is grabbed and the socket 48 is not registered as a deferred socket.
  • the send prequeue is integrated into the send queue and the sock-lock is released.
  • the integration is carried out by transferring the linked list structure itself into the send queue, i.e. by removing items from the send prequeue 85 and transferring those items to the send queue 70.
  • the pointer at the end of the send queue is simply made equal to the send prequeue head pointer 86, whereby the send queue and the send prequeues are effectively concatenated.
  • the send queue head pointer 71 is simply made equal to the send prequeue head pointer 86.
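Integration is thus a single pointer assignment. A sketch under the assumption that the caller tracks the address of the send queue's final next pointer (illustrative names, reusing the send_item sketch):

```c
/* Concatenate the prequeue onto the send queue. sq_tail_next is the
 * address of the final next pointer of a non-empty send queue. */
static void integrate_prequeue(struct send_item **sq_head,
                               struct send_item **sq_tail_next,
                               struct send_item *pq_head)
{
    if (*sq_head == NULL)
        *sq_head = pq_head;       /* empty queue: head pointer 71 := head pointer 86 */
    else
        *sq_tail_next = pq_head;  /* else: end-of-queue pointer := head pointer 86   */
}
```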
  • the send queue, including the linked send prequeue, is dequeued and transmitted onto the network. Because, at step 130, no deference is paid to any sock-lock, it is essential, as mentioned previously, that the head of the send prequeue is manipulated atomically, because while items are being dequeued and sent over the network, another thread could be enqueueing more items.
  • the socket 48 is registered as a deferred socket.
  • the register or list of deferred sockets is constructed as a linked list 90 comprising a head item formed by the bits 60d of the netif lock 60 and linked items 91a, 91b.
  • Each item 60d, 91a, 91b comprises a pointer pointing to a socket which has formed a send prequeue which was not able to be immediately sent.
  • In the example, only sockets 46, 48 are present and, therefore, the socket 48 is registered as the first and head item in the register 90 in bits 60d.
  • Had other sockets already been deferred, the socket 48 would be registered in one of the linked items 91a, 91b. In this case, the thread 44, having obtained the sock-lock 54 at step 124, still holds it. Holding the sock-lock before registering the socket as deferred is important as it ensures that the socket registration takes place only once. For this reason, in other implementations where the send prequeue is not protected by a sock-lock, it is necessary to grab the sock-lock just before registration. After registration, the thread continues with other tasks. The fact that, as here, the socket is registered as a deferred socket only when the netif lock is being held by another thread is a necessary condition for this embodiment of the invention to operate properly.
  • the thread 44 need not wait in limbo until the netif lock is available again, but the act of registering the socket delegates the handling of the send prequeue to the current lock holding thread, i.e. thread 40. This is because a thread is never allowed to drop the netif lock when there are sockets registered as deferred. So after the thread 40 has dequeued its send queue 80 and sent the data over the network, it makes a check for any sockets which have been registered as deferred.
  • Figure 3(b) shows the algorithm which is used whenever a thread wants to drop the netif lock.
  • the thread performs, by means of a single atomic instruction, a comparison between the bits 60c and bits 60d to check whether they are all set such that there are no callbacks registered to be performed amongst the bits 60c, and there are no sockets registered in the deferred sockets list 90. If there are no callbacks or deferred sockets registered, the netif lock 60 is dropped (still step 140).
  • Otherwise, the thread enters a slow path (step 144), and checks individually which of the bits 60c, 60d indicate that action is required and attends to the actions which need to be done before attempting again to drop the netif lock at step 140.
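A sketch of the Figure 3(b) unlock path under the same illustrative bit layout: a single compare-and-swap both verifies that nothing is pending and drops the lock, so the lock can never be released while a deferred socket is registered. The slow-path helper is a hypothetical stand-in.

```c
/* Hypothetical: performs registered callbacks and drains deferred sockets. */
extern void do_callbacks_and_deferred_work(uint32_t lock_val);

static void netif_unlock(void)
{
    for (;;) {
        uint32_t v = atomic_load(&netif_lock);
        if ((v & (NETIF_CB_MASK | NETIF_DEFER_MASK)) == 0) {
            /* Fast path (step 140): the CAS drops the lock only if the
             * same word still shows no callbacks and no deferred sockets. */
            if (atomic_compare_exchange_strong(&netif_lock, &v,
                                               v & ~NETIF_LOCKED))
                return;
        } else {
            /* Slow path (step 144): attend to the pending work first. */
            do_callbacks_and_deferred_work(v);
        }
    }
}
```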
  • the socket 48 is registered as a deferred socket, and so the thread 40 transfers the prequeue 85 into its send queue 70. Again, this is done simply by making the send queue head pointer 71 equal to the send prequeue head pointer 86, as shown in Figure 6.
  • the send queue 70 may be dequeued and the data sent to the network 30 according to the algorithms of the transport protocol.
  • the network interface lock 60 has been used as the single shared lock protecting the send queues 70, 80, but in other embodiments, another shared lock unconnected with the role of protecting access to the network interface, and which may be dedicated to protecting the send queues, may be used instead. It will be understood by the skilled person that, as long as the role of protecting the send queues is provided, it is not material to the invention whether the shared lock is also used to protect access to any other shared resources.
  • the present invention relates to a computer system having a network interface and which is capable of running a plurality of concurrent processes.
  • the application When data is received from the network interface, it can be delivered to a receive buffer of a destination application using a variety of receive models.
  • a received packet is de-multiplexed to an appropriate socket and the payload is delivered to the receive queue for that socket. Then, at some later time, the application transfers the payload to its receive buffer. In the case where the application requires the payload before it has been received, the application may choose to block until it arrives.
  • the application may pre-allocate an application receive buffer to which subsequently received payload should be delivered. Normally, if a receive buffer has been allocated, then received payload is delivered directly to the receive buffer. If a receive buffer has not been allocated, then the payload is enqueued on the receive queue. Often, an application will allocate many receive buffers for incoming payload, and so descriptors for the allocated receive buffers are stored in a queue.
  • the state of a socket may be directly manipulated by processes operating in multiple address spaces, including the address spaces of the operating system kernel and one or more others for user-level processes.
  • the process which is handling the receipt of an incoming packet and operating in one context may not be able to access the application receive buffers which may reside in another address space.
  • the present invention may provide, a computer system having a network interface and capable of running a plurality of concurrent processes, the system being arranged to
  • the system being operable when a said process, holding the first lock, processes incoming payload to
  • a first-lock-holding process may fail to take possession of the second lock and so be prevented from loading the payload directly into an available application receive buffer, but, by setting the control flag it ensures that another process holding the second lock is signaled that this work needs to be carried out.
  • the payload can nonetheless be enqueued without delay on the first queue.
  • a process, on taking possession of only the second lock is empowered to dequeue the payload from the first queue, and transfer it to application receive buffer.
  • the incoming payload can be enqueued on the first queue by default. Alternatively, the payload is enqueued on the first queue only when no application receive buffer descriptor is specified in the second queue or the said process fails to obtain the second lock.
  • the system is further operable such that the another said process, in response to the control flag being set and when holding the second lock, dequeues payload from the first queue and transfers it to an application receive buffer specified in the second queue.
  • the said another process which is signaled by the control flag and which then goes on to dequeue the payload from the first queue may be the same process which initially set the control flag. For example, this might happen when the process initially tries to grab the second lock, but fails, sets the control flag and goes on to perform a series of further operations. Then, just before releasing the first lock, it tries one final time to grab the second lock. If this time it is successful, because during the performance of the further operations the second lock was dropped by another process, the process will be able to take care of dequeueing the payload from the first queue itself.
  • the attempt to obtain the second lock, and upon failing, setting the control flag is performed by an atomic instruction, for example, a compare-and- swap instruction.
  • bits implementing the second lock and the control flag reside in the same word of memory.
  • the present invention may provide a computer system having a network interface and running a plurality of concurrent processes, the system
  • the system determines from the second queue whether an application receive buffer is available for the payload; and if an application receive buffer is available, attempts to take possession of the second lock, and if the attempt fails, sets a control flag as a signal to another said process.
  • the present invention may provide a computer program for a computer system having a network interface and which is capable of running a plurality of concurrent processes, the computer program being operable
  • the present invention may provide a data carrier for the above computer program.
  • the present invention may provide a computer system having a network interface and capable of running a plurality of concurrent processes in a plurality of address spaces, the system being arranged to establish a receive queue structure comprising at least one pointer to an application receive buffer, the system being operable when a said process processes incoming payload to
  • a process is able to ensure that payload destined for an application receive buffer reaches that destination despite the fact the pointer to the application receive buffer is valid only in a different address space.
  • the system is operable, upon determining that said application receive buffer is not accessible in the current address space, to make a system call to the kernel context in order to load the payload into the application receive buffer.
  • any address space can be accessed provided the page tables for the address space are known.
  • information about the address space in which the pointer(s) is valid is stored either together with the pointer(s) or in the state of a socket with which the application received buffer is associated.
  • a reference to the page tables for the socket is stored in a kernel-private portion of the socket state.
  • a pointer may not successfully resolve to an application receive buffer even though it is valid. This may happen because the memory has been paged-out to disk, or because a physical memory page has not yet been allocated.
  • the system is further operable, if a process fails to address an application receive buffer, to enqueue the payload in the receive queue structure and set a control flag as a signal to another process that some payload in the receive queue structure needs to be moved to an application receive buffer.
  • the system is operable, upon determining that said application receive buffer is not accessible in the current address space, to arrange for a thread in the appropriate address space to run in order that the payload may be loaded into the application receive buffer.
  • this may be achieved by scheduling an Asynchronous Procedure Call (APC).
  • Said receive queue structure preferably comprises a first receive queue on which received payload can be enqueued, and a second receive queue on which descriptors for application receive buffers can be enqueued.
  • the present invention may provide a computer system having a network interface and running a plurality of concurrent processes in a plurality of address spaces, the system establishing a receive queue structure comprising at least one pointer to an application receive buffer, wherein, when a said process processes incoming payload, the system identifying from the receive queue structure an application receive buffer for the payload; determining whether said application receive buffer is accessible in the current address space; and if it is not, arranging for loading of the payload into the application receive buffer to take place in a context in which said application receive buffer is accessible.
  • the present invention may provide a computer program for a computer system having a network interface which is capable of running a plurality of concurrent processes in a plurality of address spaces, the program being arranged to establish a receive queue structure comprising at least one pointer to an application receive buffer, the computer program being operable when a said process processes incoming payload to
  • the present invention may provide a data carrier for the above computer program.
  • Figure 9 shows an overview of hardware suitable for performing the invention.
  • Figure 10 shows an overview of various portions of state in a first embodiment of the invention
  • Figure 11 shows an algorithm in accordance with the invention.
  • Figure 12 shows an overview of various portions of state in a second embodiment of the invention.
  • a computer system 10 in the form of a personal computer comprises a central processing unit 15 and a memory 20, and is connected to a network 35 by an Ethernet connection 30.
  • the Ethernet connection is managed by a network interface card 25 which supports the physical and hardware requirements of the Ethernet system.
  • the physical hardware implementation need not be that of a card: for instance, it could be in the form of an integrated circuit and connector mounted directly on a motherboard.
  • FIG 10 shows the system state of a first embodiment of the invention.
  • an application execution thread 40 has been initiated and a TCP socket 50 has been opened up enabling the application to communicate via the network interface card 25 over the network.
  • Processing of data at the TCP layer involves two linked list queue structures: a receive queue (RQ) 70 and an asynchronous receive queue (ARQ) 80, both of which are first-in first-out (FIFO).
  • the RQ 70 comprises a plurality of items 71 in which each item 71 references a block of data after TCP processing.
  • Each item 71 comprises a pointer portion 71a which points to the start of the data block, and a block-length portion 71b giving the length of the block.
  • the memory region where blocks of data are stored in buffers after TCP processing is designated 95.
  • the ARQ 80 comprises a plurality of items 81 in which each item 81 references an application receive buffer which the application 40 has pre-allocated for incoming data, for example, when an application invokes an asynchronous (or overlapped) receive request.
  • Each item 81 consists of an application receive buffer descriptor comprising a pointer portion 81a which points to the start of the buffer and a buffer-length portion 81b defining the length of the buffer.
  • the memory region which may be allocated for buffers for incoming data is designated 42. In other embodiments, queue structures other than linked lists may be used.
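A sketch of the two FIFO structures of Figure 10 as C linked-list items; the field names are illustrative assumptions.

```c
#include <stddef.h>

struct rq_item {                  /* item 71 on the receive queue RQ 70 */
    struct rq_item *next;
    void           *data;         /* 71a: start of post-TCP data block  */
    size_t          len;          /* 71b: length of the block           */
};

struct arq_item {                 /* item 81 on the ARQ 80              */
    struct arq_item *next;
    void            *buf;         /* 81a: start of application receive buffer */
    size_t           buf_len;     /* 81b: length of the buffer          */
};
```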
  • the first lock 62 is a shared lock which is widely used to protect various portions of the system state.
  • this lock is referred to as the network interface lock or netif lock.
  • the second lock 52 is a lock which is dedicated to protecting certain portions of the state of the socket 50.
  • this lock is referred to as a sock-lock.
  • the right to put an item 71 onto the RQ 70 is governed by the netif lock 62 and this right is denoted in Figure 10 by the arrow 62-P.
  • the right to remove/get an item from the RQ 70 is governed by the sock-lock 52 and this right is denoted in Figure 10 by the arrow 52-G. Any process which enqueues an item on the RQ 70 has first to take possession of the netif lock 62, whereas a process which dequeues an item has first to take possession of the sock-lock 52.
  • the right to both put an item 71 onto the ARQ 80 and the right to remove/get an item from the ARQ 80 are governed by the sock-lock 52 and these rights are denoted in Figure 10 by the arrows 52-P and 52-G, respectively.
  • any application which seeks to enqueue an application receive buffer descriptor on the ARQ 80 has first to take possession of the sock-lock 52, and similarly any process which dequeues an item has also first to take possession of the sock-lock 52.
  • a drain bit 58 is included within the socket state. From time to time, the system makes a check to see whether there is a receive event which is ready for processing and queueing. If there is, then the algorithm shown in Figure 11 is carried out. In this case, it is assumed that a process 68 acts as the receive process and that it, at this point, is already in possession of the netif lock 62.
  • the process 68 can be a user-level process, a kernel thread or an interrupt service routine.
  • the TCP layer protocol processing and de-multiplexing is carried out, and the post TCP layer processing payload/data block 93 is stored in a memory space 95.
  • the data block 93 is enqueued on the RQ 70 by adding an item 71 onto the RQ 70 which references the data block 93.
  • a check is made to see whether the ARQ 80 contains any items 81. If there are buffer descriptors in the ARQ 80, this means that the application 40 has allocated some buffers for incoming data, and so the receive process 68 tries, at step 106, to grab the sock-lock 52. If there are no buffer descriptors in the ARQ 80, this means that there are no application receive buffers yet allocated, and so the receive process 68, having already deposited the payload in the RQ 70, moves on to further tasks or finishes as the case may be.
  • If the receive process 68 succeeds in taking possession of the sock-lock 52 without blocking, it, at step 108, performs a so-called 'drain down' operation, in which data blocks referenced in the RQ 70 (and actually stored in the memory region 95) are transferred to the buffers listed in the ARQ 80, i.e. to the memory region 42.
  • the single action of taking possession of the sock-lock 52 empowers the receive process to invoke the drain down operation, which requires dequeueing rights for both the RQ 70 and ARQ 80. Filled buffers are either removed from the ARQ 80 or their buffer-length portion 81b adjusted. At the end of the drain down operation, notification may be made to the application 40 that the operation has occurred.
  • If the sock-lock 52 could not be obtained, then, at step 110, the drain bit 58 is set. This is the instant shown in Figure 10, as the application process thread 40 holds the sock-lock 52.
  • the attempt to grab the sock-lock 52 at step 106, and the setting of the drain bit 58 at step 110, are atomic.
  • a word of memory can contain bits serving as the sock-lock 52 and a bit serving as the drain bit 58, and the steps 108, 110 can be performed using an atomic compare-and-swap operation.
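A sketch of that combined operation, assuming the sock-lock bits and the drain bit share one 32-bit word as described: a single compare-and-swap loop either takes the lock or sets the drain bit, and the return value tells the caller which happened. The bit values are illustrative.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SOCK_LOCKED 0x1u   /* the sock-lock 52 */
#define DRAIN_BIT   0x2u   /* the drain bit 58 */

static _Atomic uint32_t sock_word;

/* Returns true if the caller now holds the sock-lock and should drain;
 * false if the lock was held and the drain bit was set instead. */
static bool sock_trylock_or_set_drain(void)
{
    uint32_t v = atomic_load(&sock_word);
    for (;;) {
        uint32_t want = (v & SOCK_LOCKED) ? (v | DRAIN_BIT)
                                          : (v | SOCK_LOCKED);
        /* On failure the CAS reloads v with the current word value. */
        if (atomic_compare_exchange_weak(&sock_word, &v, want))
            return !(v & SOCK_LOCKED);
    }
}
```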
  • the process 68 then goes about other processing actions, and the set drain bit 58 serves as a signal to another process that the socket 50 needs attention, specifically that a drain down operation needs to be performed.
  • This technique of delegating a required action from one process to another was described in the applicant's co- pending patent application GB0504987.9, which is incorporated herein by reference.
  • the check, at step 104, to determine whether the ARQ 80 is empty or not, can be carried out before incoming payload is enqueued in the RQ 70 (at step 102). Thus, when the ARQ 80 is non-empty the RQ 70 can be completely bypassed.
  • Figure 12 shows the system state of a second embodiment of the invention operating on a Linux™ operating system in a multiple address space environment, including a kernel context and at least one user-level context.
  • the second embodiment is substantially the same as the first embodiment, except that it includes features, discussed hereinafter, to handle the multiple address space environment.
  • the RQ 70 and ARQ 80 reside in shared memory that is directly accessible in at least two, and possibly all, of the multiple address spaces.
  • an address space is associated with the socket.
  • Each address space is allocated an address space tag which uniquely corresponds to a single address space.
  • the address space tag for the socket is stored in the shared socket state and is designated by reference numeral 56.
  • a reference to the page tables for the address space associated with the socket 50 is stored in kernel-private buffer 60 rather than in the shared socket state to ensure that it cannot be corrupted by user-level processes as that would be a hazard to system security.
  • the second embodiment operates similarly to the first embodiment and essentially performs the Figure 11 algorithm. However, before performing step 108 where the receive process 68 is required to write to the application receive buffer 42, the process 68 compares the address space tag 56 of the socket with that of the current address space. If they do not match (and the process is not executing in the kernel context), then the process 68 cannot address the application receive buffer 42 because the pointer portion 81a of the application receive buffer descriptor will only validly resolve to the correct address within the same address space.
  • any address space for any process can be accessed provided the page tables for the address space are known. Therefore, the task of loading the application receive buffer 42 is passed to a kernel context routine which, using the page tables in the kernel-private buffer 60 and standard operating system routines, is able to resolve the relevant pointer 81a to the correct address.
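A sketch of the decision made before the drain down in this second embodiment; every identifier here (the tag helpers, drain_down(), and the kernel delegation call) is a hypothetical stand-in for the OS-specific mechanism the text describes.

```c
#include <stdbool.h>
#include <stdint.h>

struct socket_state { uint32_t addr_space_tag; /* the tag 56 */ };

/* All four helpers are hypothetical stand-ins for OS-specific code. */
extern uint32_t current_addr_space_tag(void);
extern bool     in_kernel_context(void);
extern void     drain_down(struct socket_state *s);        /* step 108 */
extern void     kernel_drain_down(struct socket_state *s); /* via system call */

static void drain_in_right_context(struct socket_state *s)
{
    if (s->addr_space_tag == current_addr_space_tag() || in_kernel_context())
        drain_down(s);        /* pointer 81a resolves in this address space */
    else
        kernel_drain_down(s); /* kernel resolves 81a via the page tables
                                 referenced in the kernel-private buffer 60 */
}
```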
  • a pointer may not successfully resolve to a buffer even though it is valid. This may happen because the memory has been paged-out to disk, or because a physical memory page has not yet been allocated. In such circumstances, it is not possible for the process 68 to access the buffer. Instead the data block is added to the RQ 70 and the system ensures that the application thread 40 (or another thread using the same address space) is awoken, and subsequently performs a drain-down operation. The normal paging mechanisms of the operating system will make the application receive buffer available in this case.
  • the present invention relates to a computer system operating a user-level stack.
  • an application running on a computer system communicates over a network by opening up a socket.
  • the socket connects the application to remote entities by means of a network protocol, for example, TCP/IP.
  • the application can send and receive TCP/IP messages by invoking the operating system's networking functionality via system calls which cause the messages to be transported across the network. System calls cause the CPU to switch to a privileged level and start executing routines in the operating system.
  • An alternative approach is to use an architecture in which at least some of the networking functionality, including the stack implementation, is performed at the user level, for example, as described in the applicant's co-pending PCT applications WO 2004/079981 and WO 2005/104475.
  • Some I/O synchronisation mechanisms involve the application directly interrogating sockets, which it specifies, to obtain I/O status information for the specified sockets.
  • the interrogation is performed by operating system routines which are invoked by a system call.
  • the application specifies the sockets of interest to it in the argument of the system call.
  • I/O synchronisation mechanisms involve the use of I/O synchronisation objects, such as, for example, I/O completion ports in Windows™.
  • I/O synchronisation objects can be associated with one or more sockets and later interrogated by the application to provide I/O status information on the overlapped I/O operations on the associated socket(s).
  • the creation, updating and interrogation of the I/O synchronisation objects is performed by operating system routines which are invoked by a system call.
  • the application specifies the I/O synchronisation object of interest to it in the argument of the system call.
  • I/O synchronisation mechanisms are used with a user-level stack architecture.
  • Because the operating system is not operating the stack itself, it is blind, in some systems, to the data traffic passing through a particular socket.
  • the user-level stack has the responsibility of keeping the I/O synchronisation object updated as appropriate.
  • this updating is performed by a system call and is, therefore, expensive; it should not be performed when it is not necessary.
  • interrupts are generated by incoming events in order to allow prompt updating of the stack.
  • an interrupt incurs a particularly heavy overhead and so interrupts may be selectively enabled. While selective enablement of interrupts is beneficial in terms of overall system performance, at any given instant, there is a danger that an application requesting I/O status information from an I/O synchronisation mechanism via a system call may be given a misleading result.
  • the present invention may provide a computer system capable of operating in a network and being arranged to establish a user-level stack, the system, when an application associates an I/O synchronisation object with a socket, being responsive to the system call, made by the application, for associating the I/O synchronisation object with the socket to record said association in memory accessible to a user-level process.
  • the association of an I/O synchronisation object with the socket is recorded in its user-level state.
  • the system is configured to direct said system call made by the application to a user-level routine, in which the recording of the said association may take place.
  • Preferably, the I/O synchronisation object comprises an I/O completion port, and the association is made by the system call CreateIoCompletionPort().
  • the system call may also serve to associate the I/O completion port with another file object.
  • the I/O synchronisation object may be created by one system call and associated with a socket or other file object by a separate system call.
  • the present invention may provide a computer system capable of operating in a network and being arranged to establish a user-level stack, the system being configured to direct a system call, made by an application, requesting I/O status information to a user-level routine which is operable to update a user-level stack;
  • Configuring the system to direct an application's system call to a user-level routine provides an opportunity for the user-level routine to update a user-level stack before going on to make a system call mimicking or duplicating that made by the application.
  • this aspect of the invention can reduce the likelihood that misleading information will be returned to the application.
  • the user-level routine updates a user-level stack which has relevance to the request from the application.
  • This is particularly advantageous in systems with more than one stack, since it may be beneficial to perform updates on only those one or more stacks which have relevance to the information requested by the application.
  • the relevance of a particular stack to the request from the application can be ascertained from the sockets specified in the application's request.
  • the I/O synchronisation mechanisms involve I/O synchronisation objects
  • the relevance of a particular stack can be ascertained from the one or more sockets with which the I/O synchronisation object is associated.
  • the user-level routine updates all the associated user-level stacks.
  • the user-level routine can update all the user-level stacks regardless of their number and relevance to the request made by the application.
  • the nature of the I/O status information may vary.
  • the I/O status information comprises event-based information, for example, information about I/O operations which have completed
  • the I/O status information may comprise state-based information, for example, information about whether or not the socket has received data available for reading.
  • the I/O synchronisation mechanism comprises an I/O synchronisation object associated with a said user-level stack.
  • Preferably, the I/O synchronisation object comprises an I/O completion port, and the system call GetQueuedCompletionStatus() returns a list of completed I/O operations.
  • the user-level routine is operable, based on certain operating conditions, to make a determination as to whether it is currently opportune to update the user-level stack, and to update the user-level stack only when it is determined to be opportune.
  • the system call may be made without updating the stack.
  • the determination may include a check as to whether there is any data awaiting processing by the user-level stack. In some cases there may not be, and so there is no point in arranging for the stack to be specially updated.
  • the system may comprise a lock that governs the right to update the user-level stack, wherein the determination may include a check as to whether the lock is locked. It is preferred that if the lock is not obtained without blocking, meaning that it is already locked, i.e. held by another process thread, then similarly the system call should be made without updating the stack. In this case, it is possible that the process which is currently holding the lock may attend to the updating of the stack. It is desirable in a user-level stack to avoid the heavy overhead incurred by interrupts. However, it is at times desirable to enable interrupts because a user-level process may not be available to update the stack itself, for example, because it is blocked. Accordingly, it is preferred that the interrupts for the user-level stack are selectively enabled.
  • a flag may be used to store the enablement status of the interrupts for the user-level stack.
  • said interrupts are not enabled when, during the stack updating, a process thread was awoken. It is advantageous not to enable interrupts when, during the updating of the stack, a process thread was awoken, since that thread will probably take care of updating the stack.
  • said interrupts are not enabled, if the lock was locked. Again, another process thread may well take care of updating the stack.
  • said interrupts are not enabled, if the said system call made by the application was non-blocking.
  • the determination of whether it is opportune to update the stack includes a check as to whether the interrupts are enabled. If they are enabled, then the stack is not updated.
  • the system is configured to direct the system call made by the application to the user-level routine using a dll interception mechanism which is discussed hereinafter.
  • the present invention may provide a computer system operating in a network and providing a user-level stack, the system, when an application associates an I/O synchronisation object with a socket, responding to the system call, made by the application, for associating the I/O synchronisation object with the socket to record said association in memory accessible to a user-level process.
  • the present invention may provide a computer program for a computer system capable of operating in a network and being arranged to establish a user-level stack, the program, when an application associates an I/O synchronisation object with a socket, being responsive to the system call, made by the application, for associating the I/O synchronisation object with the socket to record said association in memory accessible to a user-level process.
  • the present invention may provide a data carrier bearing the above computer program.
  • the present invention may provide a computer system operating in a network and providing a user-level stack, the system directing a system call, made by an application, requesting I/O status information to a user-level routine which
  • the present invention may provide a computer program for a computer system capable of operating in a network and being arranged to establish a user-level stack, the system being configured to direct a system call, made by an application, requesting I/O status information to the program which runs at the user level and which is operable to
  • the present invention may provide a data carrier bearing the above computer program.
  • the invention may provide a method for use in a computer system capable of operating in a network and being arranged to establish a user-level stack, the method comprising, when an application associates an I/O synchronisation object with a socket, responding to the system call, made by the application, for associating the I/O synchronisation object with the socket to record said association in memory accessible to a user-level process.
  • the invention may provide a method for use in a computer system operating in a network and providing a user-level stack, the method comprising directing a system call, made by an application, requesting I/O status information to a user-level routine which
  • Figure 13 shows the basic architecture of a computer system operating a user-level stack;
  • Figure 14 illustrates the operation of a computer system in accordance with an embodiment of the invention having created an I/O completion port;
  • Figure 15 shows a routine in accordance with an aspect of the invention.
  • a basic example of an architecture of a computer system 10 operating the Windows™ operating system and providing networking functionality at the user level is shown in Figure 13.
  • the system 10 comprises the operating system 12, the user/application space 14 and the network interface hardware 16.
  • When an application 18a, 18b wants to request a networking operation, it does so via a user-mode library using the Winsock API.
  • the user-mode library comprises a Winsock API implementation 22 and a Winsock service provider or WSP 24.
  • Windows™ is supplied with a default WSP, the Microsoft™ TCP/IP WSP or MS WSP. This WSP does very little other than map the networking operations requested by an application onto the corresponding operating system networking call interface and then invoke the appropriate operating system routine.
  • the WSP 24 provides the networking functionality, including the implementation of the stack 30.
  • When an application 18a, 18b requires a networking-related operation to be performed, it invokes a command, say send() or receive(), supported by the Winsock API, which is then carried out by the WSP 24.
  • the WSP 24 can also make use of existing operating system networking functionality to the extent that it needs to.
  • Figure 14 shows the computer system 10 where, for clarity of illustration, the user-mode library has been omitted.
  • the stack 30 has been illustrated as comprising a receive path 30R and a transmit path 30T. In the situation in Figure 14, the application 18a has opened up a socket 32 for communication with the network.
  • a lock 40 governs the right to update both paths 30R, 30T of the stack 30.
  • the system 10 may also comprise other stacks (not shown), each of which is protected by its own lock.
  • the application has chosen to set up an I/O completion port 35 for the socket 32.
  • the I/O completion port 35 is shown as being associated with only the socket 32. In other embodiments, it may be associated with more than one socket and other system objects.
  • the I/O completion port 35 serves as a repository in which the details of completed overlapped I/O operations made on the socket 32 are stored as a list. Details of various types of I/O operations are stored including, for example, transmitting, receiving, connecting, disconnecting, and accepting new connections. It will be noted that by virtue of being associated with the socket 32, the I/O completion port 35 inherently becomes associated with the stack 30 which services the socket 32.
  • the completion port is set up by a CreateIoCompletionPort() system call by the application 18a.
  • this system call does not pass through the WSP 24, and so in the normal course of events, the CreateIoCompletionPort() function would be looked up from a function table maintained in the application. From this table, a pointer corresponding to the CreateIoCompletionPort() function would be identified and the operating system code referenced by the pointer invoked.
  • the system is configured to direct the CreateIoCompletionPort() system call to a user-level dll (dynamic link library) function, denoted by the reference numeral 45 in the drawings. This configuration is achieved by replacing the original pointer for the CreateIoCompletionPort() system call with a pointer to the user-level dll 45.
  • the dll 45 may be thought, from the perspective of the application 18a, to be intercepting its original system call.
  • the prefix intercept_ will hereinafter be used for the dll function name.
  • intercept_CreateIoCompletionPort() is operable to ascertain the socket for which an I/O completion port has been requested, i.e. the socket 32, and to record the association in the user-level state of the socket.
  • the application 18a may from time to time initiate I/O requests on the socket 32. Depending on the state of the system, these I/O requests may complete immediately, or at a later time as a result of, for example, the processing of network events in the event queue 31a.
  • the stack 30 will be notified of the completion and make a check in the user-level state of the socket 32, specifically the completion port indicator 33, and determine whether the socket 32 has an associated I/O completion port 35. If there is an associated I/O completion port 35, then a system call is made to update the I/O completion port 35 to that effect. If there is no associated I/O completion port 35, then, no system call is made.
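A sketch of that completion-time check; the socket state field and the posting helper are illustrative names, and the point is simply that the expensive system call is skipped when the indicator 33 is clear.

```c
#include <stdbool.h>

struct user_socket { bool completion_port_indicator; /* indicator 33 */ };
struct io_result;                                    /* opaque completion details */

/* Hypothetical: crosses into the kernel to post the completion. */
extern void post_to_completion_port(struct user_socket *s, struct io_result *r);

static void on_io_complete(struct user_socket *s, struct io_result *r)
{
    /* Checked entirely at user level: no kernel crossing is needed
     * when the socket has no associated I/O completion port. */
    if (s->completion_port_indicator)
        post_to_completion_port(s, r);
}
```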
  • Protocol processing stacks implemented in an operating system tend to be interrupt driven. This means no matter what the relevant application is doing the stack will always be prioritised when network events occur (e.g. when data is received from the network and is passed onto an event queue) because the protocol processing software is invoked in response to such events by means of interrupts.
  • user-level architectures, which involve certain processing tasks being driven by a user process rather than by an operating system, can suffer from the following disadvantage. If the driving process is blocking, waiting for control of the CPU, or performing a non-data-transfer task, then the user-level stack may not be given control of the CPU.
  • the application 18a makes the GetQueuedCompletionStatus() system call.
  • This system call returns a list of completed overlapped I/O operations for the sockets associated with the I/O completion port. In this case, only the socket 32 is associated with the I/O completion port 35.
  • the GetQueuedCompletionStatus() system call by an application is made with a timeout argument set to timeout, e.g. GetQueuedCompletionStatus(..., timeout).
  • a flag, do_enable_interrupts, to signal whether to enable interrupts, is set to true. True means that interrupts should be enabled and false means that they should not be.
  • at step 54, a test is made to determine whether the user-level stack 30 needs updating. If it needs updating, then, at step 56, an attempt is made to grab the lock 40 without blocking. If possession of the lock 40 is taken, then the situation is opportune to update the stack.
  • at step 58, the stack is updated at user level.
  • at step 60, the lock 40 is released.
  • at step 62, a test is made to determine whether, during stack updating at step 58, a process thread was awoken; if a process thread was awoken, do_enable_interrupts is set to false.
  • at step 64, a system call, GetQueuedCompletionStatus(..., 0), is made. It will be noted that the timeout argument is set to zero, meaning the system call is non-blocking and will return immediately.
  • at step 66, it is determined whether any data was returned by the system call, or whether the original timeout argument supplied by the calling application, i.e. timeout, was zero.
  • if timeout was zero or some data was returned by the system call, then a return is made to the calling application at step 67. This is because, if the timeout argument was set to zero, the application wanted a non-blocking response; and, if some data was returned, it should be promptly reported to the application.
  • at step 54, if the stack did not need updating, the routine goes straight to step 64, where the non-blocking system call GetQueuedCompletionStatus(..., 0) is made.
  • at step 56, if the attempt to grab the lock 40 without blocking failed, then, at step 68, the do_enable_interrupts flag is set to false and the routine 50 goes on to step 64, where the non-blocking system call GetQueuedCompletionStatus(..., 0) is made.
  • the do_enable_interrupts flag is inspected, and if it is true and interrupts for the stack 30 are not already enabled, then they are enabled.
  • the system call GetQueuedCompletionStatus(..., timeout) is made, and, when this returns, the routine 50 then returns to the calling application.
  • the routine may return to the beginning at step 52, and continue spinning in a loop until either some data is returned by GetQueuedCompletionStatus() or a short spin period has elapsed, whereafter the routine continues at step 72.
  • the advantage of this approach is that interrupts are not enabled if any I/O operations complete within the spin period.
  • intercept_GetQueuedCompletionStatus() 50 invokes a stack update if operating conditions (for example, the availability of the lock 40 and the state of the receive event queues 31a, 31b, 31c) make it advantageous to do so, whereby the list returned to the calling application 18a should be up-to-date; a sketch follows.
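  • A minimal sketch of the routine 50 as described above. The helper names (stack_needs_update, try_lock, update_stack, and sys_GetQueuedCompletionStatus as a stand-in for the original, non-intercepted system call) and the simplified argument list are assumptions; step numbers follow the description where it states them:

```c
#include <stdbool.h>

struct ulstack;                                     /* user-level stack 30 */
extern bool stack_needs_update(struct ulstack *);
extern bool try_lock(struct ulstack *);             /* grab lock 40, non-blocking */
extern void drop_lock(struct ulstack *);
extern bool update_stack(struct ulstack *);         /* true if a thread was woken */
extern bool interrupts_enabled(struct ulstack *);
extern void enable_interrupts(struct ulstack *);
extern bool sys_GetQueuedCompletionStatus(void *port, void *out,
                                          unsigned timeout_ms);

bool intercept_GetQueuedCompletionStatus(struct ulstack *stk, void *port,
                                         void *out, unsigned timeout_ms)
{
    bool do_enable_interrupts = true;                /* step 52 */

    if (stack_needs_update(stk)) {                   /* step 54 */
        if (try_lock(stk)) {                         /* step 56 */
            bool woken = update_stack(stk);          /* step 58 */
            drop_lock(stk);                          /* step 60 */
            if (woken)                               /* step 62 */
                do_enable_interrupts = false;
        } else {                                     /* lock grab failed */
            do_enable_interrupts = false;            /* step 68 */
        }
    }

    /* step 64: non-blocking poll (timeout forced to zero) */
    bool got_data = sys_GetQueuedCompletionStatus(port, out, 0);

    if (got_data || timeout_ms == 0)                 /* step 66 */
        return got_data;                             /* step 67 */

    /* inspect the flag; enable interrupts for the stack 30 if needed */
    if (do_enable_interrupts && !interrupts_enabled(stk))
        enable_interrupts(stk);

    /* blocking call with the caller's original timeout */
    return sys_GetQueuedCompletionStatus(port, out, timeout_ms);
}
```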
  • although the embodiment has been described with reference to a particular I/O synchronisation object supported by Windows, namely an I/O completion port, the invention may be implemented on any other suitable operating system.
  • I/O synchronisation objects supported by other operating systems include "kqueues", "epoll", and "realtime signal queues".
  • the invention may be implemented using an I/O synchronisation mechanism which does not use an I/O synchronisation object, for example, "poll" and "select".
  • the present invention relates to a computer system having a network interface and which is capable of running a plurality of concurrent processes.
  • when data is received from the network interface, it can be delivered to a receive buffer of a destination application using a variety of receive models.
  • a received packet is de-multiplexed to an appropriate socket and the payload is delivered to the receive queue for that socket. Then, at some later time, the application transfers the payload to its receive buffer. In the case where the application requires the payload before it has been received, the application may choose to block until it arrives.
  • the application may pre-allocate an application receive buffer to which subsequently received payload should be delivered. Normally, if a receive buffer has been allocated, then received payload is delivered directly to the receive buffer. If a receive buffer has not been allocated, then the payload is enqueued on the receive queue. Often, an application will allocate many receive buffers for incoming payload, and so descriptors for the allocated receive buffers are stored in a queue.
  • the present invention may provide a computer system having a network interface and capable of running a plurality of concurrent processes, the system being arranged to
  • the above-defined lock regime is advantageous in that an application process operating according to the asynchronous receive model, when allocating a new application receive buffer for incoming payload, is able, by virtue of taking possession of the second lock when the second queue is empty, to transfer payload from the first queue to the new application receive buffer without further having to obtain possession of the first lock.
  • the above-defined lock regime is also advantageous in that an application process operating according to the synchronous model is able, by virtue of taking possession of the second lock when the second queue is empty, to transfer payload from the first queue to a new application receive buffer specified by the application process without further having to take possession of the first lock.
  • the above-defined lock regime is also advantageous in that, when a process holding the first lock is processing incoming payload, it is able either, when the second queue is empty, to enqueue the incoming payload on the first queue, or, when the second queue is not empty, to perform a drain down operation as described later, without further having to obtain possession of the second lock.
  • the system, when an application process holding the second lock has a new descriptor for enqueueing on the second queue, is operable, when the second queue is empty, to dequeue payload from the first queue and transfer it to the application receive buffer specified by the said new descriptor.
  • the system, when an application process seeks to receive incoming payload and takes possession of the second lock, is operable to transfer payload from the first queue to a new application receive buffer specified by the application process.
  • the system, when a process holding the first lock processes incoming payload, is operable, when the second queue is empty, to enqueue the incoming payload on the first queue, and, when the second queue is not empty, to transfer payload from the first queue to a buffer specified by a descriptor in the second queue.
  • payload transferred from the first queue to a buffer may from time to time be the said incoming payload, where said incoming payload is deposited in the first queue without first determining the condition of the second queue.
  • the incoming payload, when the second queue is not empty, may completely bypass the first queue.
  • the right to enqueue on the second queue is governed by the second lock.
  • an application process operating according to the asynchronous receive model is able, by virtue of taking possession of the second lock when the second queue is empty, to not only transfer payload from the first queue to the new application receive buffer, but also add the descriptor corresponding to the new buffer to the second queue without having to obtain possession of the first lock.
  • adding the descriptor corresponding to the new application receive buffer takes place when either the first queue is empty or the second queue is non-empty.
  • the system may be operable to perform a drain down operation in which, while the first and second queues are not empty, payload is transferred from the first queue to buffers specified in the second queue.
  • payload is transferred until there is no more payload in the first queue or there are no more buffers specified in the second queue, whichever comes first.
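  • A sketch of the drain down operation under the above definition; the item types and field names here are illustrative assumptions (the first queue holding payload blocks, the second holding application receive buffer descriptors), not details from the patent:

```c
#include <stddef.h>
#include <string.h>

struct payload  { char *data; size_t len;     struct payload  *next; };
struct buf_desc { char *buf;  size_t buf_len; struct buf_desc *next; };

/* Transfer payload from the first queue into the buffers specified in
 * the second queue until one of the queues is exhausted. */
void drain_down(struct payload **first_q, struct buf_desc **second_q)
{
    while (*first_q != NULL && *second_q != NULL) {
        struct payload  *p = *first_q;
        struct buf_desc *d = *second_q;
        size_t n = p->len < d->buf_len ? p->len : d->buf_len;

        memcpy(d->buf, p->data, n);        /* copy as much as fits */

        /* consume the copied payload... */
        p->data += n;
        p->len  -= n;
        if (p->len == 0)
            *first_q = p->next;            /* block fully drained */

        /* ...and the filled part of the buffer (adjust or remove) */
        d->buf     += n;
        d->buf_len -= n;
        if (d->buf_len == 0)
            *second_q = d->next;           /* buffer completely filled */
    }
}
```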
  • One situation in which the drain down operation (sketched above) is performed is after the application process has caused the new buffer descriptor to be added to the second queue but, during that operation, payload has been added to the first queue, whereby it has become non-empty. At this point, an attempt is made to take possession of the first lock. Once the first lock is obtained, the drain down operation is carried out.
  • Another situation is when the process responsible for delivering the incoming payload to a socket finds that the second queue is not empty. This also presents an opportunity for draining down from the first queue if it is needed.
  • the present invention may provide a computer system having a network interface and running a plurality of concurrent processes, the system
  • the present invention may provide a computer program for a computer system having a network interface and capable of running a plurality of concurrent processes, the program being arranged to
  • the present invention may provide a data carrier bearing the above program.
  • Figure 16 shows an overview of hardware suitable for performing the invention.
  • Figure 17 shows an overview of various portions of the state of a socket in an embodiment of the invention.
  • Figure 18 shows a first lock configuration applied to the Figure 17 embodiment.
  • Figure 19 shows a second lock configuration applied to the Figure 17 embodiment.
  • Figure 20 shows an algorithm for delivering a packet to the socket in accordance with the invention.
  • Figure 21 shows an algorithm according to an asynchronous receive model of operation.
  • Figure 22 shows an algorithm according to a synchronous receive model of operation.
  • a computer system 10 in the form of a personal computer (PC) comprises a central processing unit 15 and a memory 20, and is connected to a network 35 by an Ethernet connection 30.
  • the Ethernet connection is managed by a network interface card 25 which supports the physical and hardware requirements of the Ethernet system.
  • although described as a card, the physical hardware implementation need not be that of a card: for instance, it could be in the form of an integrated circuit and connector mounted directly on a motherboard.
  • Figure 17 shows the system state of an embodiment of the invention.
  • an application process thread 40 has been initiated and a TCP socket 50 has been opened up, enabling the application to communicate via the network interface card 25 over the network.
  • Processing of data at the TCP layer involves two linked list queue structures: a receive queue (RQ) 70 and an asynchronous receive queue (ARQ) 80, both of which are first-in first-out (FIFO).
  • the RQ 70 comprises a plurality of items 71, each of which references a block of data after TCP processing.
  • each item 71 comprises a pointer portion 71a, which points to the start of the data block, and a block-length portion 71b giving the length of the block.
  • the memory region where blocks of data are stored in buffers after TCP processing is designated 95.
  • the ARQ 80 comprises a plurality of items 81, each of which references an application receive buffer which the application 40 has pre-allocated for incoming data, for example, when an application makes an asynchronous receive request.
  • Each item 81 consists of an application receive buffer descriptor comprising a pointer portion 81a which points to the start of the buffer and a buffer-length portion 81b defining the length of the buffer.
  • the memory region which may be allocated for buffers for incoming data is designated 42. In other embodiments, queue structures other than linked lists may be used.
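  • As a minimal sketch of these two structures (the field names are illustrative, not from the patent):

```c
#include <stddef.h>

/* Item 71 on the receive queue (RQ) 70: references a block of
 * post-TCP-processing data held in memory region 95. */
struct rq_item {
    char           *data;     /* pointer portion 71a: start of data block */
    size_t          len;      /* block-length portion 71b */
    struct rq_item *next;     /* linked-list FIFO chaining */
};

/* Item 81 on the asynchronous receive queue (ARQ) 80: a descriptor for
 * an application receive buffer pre-allocated from region 42. */
struct arq_item {
    char            *buf;     /* pointer portion 81a: start of buffer */
    size_t           buf_len; /* buffer-length portion 81b */
    struct arq_item *next;
};

/* Both queues are first-in first-out linked lists. */
struct rq  { struct rq_item  *head, *tail; };
struct arq { struct arq_item *head, *tail; };
```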
  • the first lock 62 is a shared lock which is widely used to protect various portions of the system state, including, for example, portions of state of other (unshown) sockets.
  • this lock is referred to as the network interface lock or netif lock.
  • the second lock 52 is a lock which is dedicated to protecting certain portions of the state of the specific socket 50.
  • this lock is referred to as the sock-lock 52.
  • Figure 18 represents the lock configuration when the ARQ 80 contains no items 81.
  • the symbol 0 is used to represent an empty condition.
  • the right to put an item 71 onto the RQ 70 is governed by the netif lock 62 and this right is denoted in Figure 18 by the arrow 62-P.
  • the right to remove/get an item from the RQ 70 is governed by the sock-lock 52 and this right is denoted in Figure 18 by the arrow 52-G. Any process which enqueues an item on the RQ 70 has first to take possession of the netif lock 62, whereas a process, which dequeues an item has first to take possession of the sock-lock 52.
  • the right to put an item 81 onto the ARQ 80 is governed by the sock-lock 52, and this right is denoted in Figure 18 by the arrow 52-P.
  • any application which seeks to enqueue an application receive buffer descriptor on the ARQ 80 has first to take possession of the sock-lock 52.
  • since the ARQ 80 is empty in Figure 18, no right to dequeue from it has been illustrated.
  • Figure 19 represents the lock configuration when the ARQ 80 contains at least one item 81.
  • the rights both to put an item 71 onto the RQ 70 and to remove/get an item 71 from the RQ 70 are governed by the netif lock 62 and these rights are denoted in Figure 19 by the arrows 62-P and 62-G. Any process which enqueues an item on the RQ 70 has first to take possession of the netif lock 62, and any process which seeks to dequeue an item has likewise first to take possession of the netif lock 62.
  • the right to put an item 81 onto the ARQ 80 is governed by the sock-lock 52, and the right to remove/get an item from the ARQ 80 is governed by the netif lock 62 and these rights are denoted in Figure 19 by the arrows 52-P and 62-G, respectively.
  • any application which seeks to enqueue an application receive buffer descriptor on the ARQ 80 has first to take possession of the sock-lock 52, and any process which dequeues an item has first to take possession of the netif lock 62. The rights under the two lock configurations are summarised in the sketch below.
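  • The following sketch summarises which lock governs each queue operation under the two configurations (the enum and function names are illustrative assumptions):

```c
#include <stdbool.h>

enum which_lock { SOCK_LOCK_52, NETIF_LOCK_62 };

/* Putting an item 71 onto the RQ 70: netif lock, in both configurations. */
enum which_lock lock_for_rq_put(bool arq_empty)
{
    (void)arq_empty;
    return NETIF_LOCK_62;            /* 62-P in Figures 18 and 19 */
}

/* Getting an item 71 from the RQ 70: sock-lock while the ARQ is empty
 * (Figure 18, 52-G); netif lock once it is not (Figure 19, 62-G). */
enum which_lock lock_for_rq_get(bool arq_empty)
{
    return arq_empty ? SOCK_LOCK_52 : NETIF_LOCK_62;
}

/* Putting an item 81 onto the ARQ 80: sock-lock, in both configurations. */
enum which_lock lock_for_arq_put(bool arq_empty)
{
    (void)arq_empty;
    return SOCK_LOCK_52;             /* 52-P in Figures 18 and 19 */
}

/* Getting an item 81 from the ARQ 80: only arises when the ARQ is
 * non-empty (Figure 19), where the right is governed by the netif lock. */
enum which_lock lock_for_arq_get(void)
{
    return NETIF_LOCK_62;            /* 62-G in Figure 19 */
}
```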
  • From time to time, the system makes a check to see whether there is a received packet which is ready for processing and queueing. When there is, the algorithm 100 for delivering the received packet to the socket 50, as shown in Figure 20, is invoked/performed.
  • suppose that a process 68 is awakened and acts as the receive process, and that it, at this point, has already taken possession of the netif lock 62.
  • the TCP layer protocol processing and demultiplexing is carried out, and the post-TCP-processing payload/data block 93 is stored in a memory space 95.
  • at step 102, the data block 93 is enqueued on the RQ 70 by adding onto the RQ 70 an item 71 which references the data block 93.
  • at step 104, a check is made to see whether the ARQ 80 contains any items 81. If there are buffer descriptors in the ARQ 80, this means that the Figure 19 lock configuration is valid and the process 68, being already in possession of the netif lock 62, has dequeue rights for both the RQ 70 and the ARQ 80. Therefore, without having to attempt to take possession of another lock, the process 68, at step 106, is able to drain down payload in the RQ 70 into the buffers referenced by descriptors in the ARQ 80 to the extent that the buffers have been pre-allocated, i.e. until either the RQ 70 or the ARQ 80 becomes empty. Filled buffers are either removed from the ARQ 80 or their buffer-length portion 81b adjusted.
  • the check, at step 104, to determine whether the ARQ 80 is empty or not can be carried out before incoming payload is enqueued in the RQ 70 (at step 102).
  • in this variation, when the ARQ 80 is non-empty and the RQ 70 is empty, the RQ 70 can be completely bypassed.
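  • A sketch of algorithm 100 under the above description; the sock type and queue helpers are assumed names, and drain_down follows the earlier sketch:

```c
#include <stddef.h>
#include <stdbool.h>

struct rq;  struct arq;                            /* as sketched earlier */
struct sock { struct rq *rq; struct arq *arq; };   /* socket 50, illustrative */

extern void rq_put(struct rq *q, char *data, size_t len); /* add item 71 */
extern bool arq_is_empty(const struct arq *q);
/* Transfer payload from the RQ into ARQ buffers until one queue empties,
 * removing filled buffers or adjusting their buffer-length portion 81b. */
extern void drain_down(struct rq *rq, struct arq *arq);

/* Caller is the receive process 68, which already holds the netif lock 62. */
void deliver_packet(struct sock *sk, char *post_tcp_block, size_t len)
{
    rq_put(sk->rq, post_tcp_block, len);    /* step 102: enqueue item 71 */

    if (!arq_is_empty(sk->arq)) {           /* step 104: any items 81? */
        /* Figure 19 configuration: the netif lock already held confers
         * dequeue rights on both queues, so no second lock is needed. */
        drain_down(sk->rq, sk->arq);        /* step 106 */
    }
}
/* In the variation noted above, the step-104 check is made before step
 * 102, so that, when the ARQ is non-empty and the RQ empty, the payload
 * bypasses the RQ entirely. */
```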
  • the netif lock 62 is a shared lock and thus, according to the described lock regime, it governs the enqueueing of payload to not only the RQ 70 of the socket 50 but may do so too for the other, unshown, sockets. In such a case, it will be appreciated that many sockets can be serviced while incurring the overhead of obtaining the netif lock only once.
  • when the application process thread 40 wants to operate according to the asynchronous receive model, as mentioned earlier, and has a new receive buffer 43 for allocation, it performs/invokes the asynchronous receive algorithm 108 shown in Figure 21.
  • the descriptor for the new receive buffer 43 may be specified as the argument of a call, like async_recv(), which runs the algorithm 108.
  • at step 110, the application 40 takes possession of the sock-lock 52 in order to obtain the right to enqueue the descriptor corresponding to the new receive buffer onto the ARQ 80. It will be noted that, at this point, no test is first required to determine the condition of the ARQ 80, as in both the possible Figure 18 and Figure 19 lock configurations the right to enqueue on the ARQ 80 is governed by the sock-lock 52.
  • a test is performed to determine whether the ARQ 80 is empty. If the ARQ is empty, then the Figure 18 lock configuration applies, whereby the application 40, which already holds the sock-lock 52, thus already has the right to dequeue from the RQ 70.
  • at step 114, a check is made to determine whether there is any payload/data in the RQ 70. If there is, then, at step 116, it is transferred from the RQ 70 to the new buffer 43. In this case, it will be noted that a descriptor corresponding to the new buffer 43 is not put onto the ARQ 80 at any time. Furthermore, it will be appreciated that performance of step 116, in terms of access to the RQ 70 and the ARQ 80, requires only the right to dequeue from the RQ 70. This means that, in this branch of the algorithm, with the Figure 18 lock configuration being valid, only possession of the sock-lock, which was already taken at step 110, is required. At step 130, the sock-lock 52 is dropped.
  • at step 118, a descriptor 44 corresponding to the new buffer 43 is enqueued on the ARQ 80. With the application 40 holding the sock-lock 52, this is the situation shown in Figure 19.
  • at step 120, a check is made to determine whether, during the performance of step 118, further payload has been enqueued on the RQ 70, i.e. whether the RQ 70 is still empty. If it is still empty, then the sock-lock is dropped at step 130. In this case, the new descriptor 44 has been enqueued on the ARQ, but no further work needed to be performed before dropping the sock-lock.
  • at step 122, the netif lock 62 is grabbed. This lock is needed to dequeue items from the RQ 70 because, in this branch of the algorithm, the Figure 19 lock configuration applies.
  • the payload in the RQ 70 is drained down, i.e. transferred to application receive buffers specified by the descriptors in the ARQ 80, to the extent permitted by the availability of buffers or, in other words, until either the RQ or the ARQ becomes empty.
  • the netif lock 62 is dropped. Obtaining the shared netif lock at step 122 might tend to result in blocking, but entering this part of the algorithm is a not-so-common occurrence.
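  • A sketch of algorithm 108 under the above description; the lock and queue helpers are assumed names, and the behaviour when the ARQ is already non-empty on entry (which the text does not detail) is folded into the step-118 branch:

```c
#include <stddef.h>
#include <stdbool.h>

struct rq;  struct arq;
struct sock { struct rq *rq; struct arq *arq; };

extern void sock_lock(struct sock *);    /* take sock-lock 52 */
extern void sock_unlock(struct sock *);
extern void netif_lock(struct sock *);   /* take shared netif lock 62 */
extern void netif_unlock(struct sock *);
extern bool rq_is_empty(const struct rq *);
extern bool arq_is_empty(const struct arq *);
extern void arq_put(struct arq *, char *buf, size_t len); /* enqueue descriptor 44 */
extern void transfer_payload(struct rq *, char *buf, size_t len);
extern void drain_down(struct rq *, struct arq *);

void async_recv(struct sock *sk, char *new_buf, size_t buf_len) /* buffer 43 */
{
    sock_lock(sk);                                       /* step 110 */

    if (arq_is_empty(sk->arq) && !rq_is_empty(sk->rq)) { /* ARQ empty? step 114 */
        /* Figure 18 configuration: the sock-lock alone confers the right
         * to dequeue from the RQ; the descriptor never enters the ARQ. */
        transfer_payload(sk->rq, new_buf, buf_len);      /* step 116 */
    } else {
        arq_put(sk->arq, new_buf, buf_len);              /* step 118 */
        if (!rq_is_empty(sk->rq)) {                      /* step 120 */
            netif_lock(sk);                              /* step 122: may block */
            drain_down(sk->rq, sk->arq);                 /* drain down the RQ */
            netif_unlock(sk);                            /* netif lock dropped */
        }
    }
    sock_unlock(sk);                                     /* step 130 */
}
```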
  • when the application process thread 40 wants to operate according to the synchronous receive model, as mentioned earlier, it invokes/performs the following synchronous receive algorithm 140, specifying a new application receive buffer.
  • the descriptor for the new receive buffer 43 may be specified as the argument of a call, like, for example, recv(), which runs the algorithm 140.
  • the sock-lock 52 is grabbed.
  • a check is made, at step 144, on the condition of the ARQ 80 and if the ARQ 80 is non-empty, then, regardless of the condition of the RQ 70, blocking occurs at step 146.
  • at step 148, a check is made on the condition of the RQ 70 and, if it is non-empty, payload is transferred from the RQ 70 to the new application receive buffer. On the other hand, if the RQ 70 is empty, then blocking occurs at step 152.
  • in a variation, the condition of the ARQ 80 could be checked before grabbing the sock-lock 52 and, if it is non-empty, blocking could occur at that point. However, in this variation, after grabbing the sock-lock 52, it is still necessary to check the condition of the ARQ 80 again in order to verify that another thread has not enqueued a buffer descriptor on the ARQ 80 in the meantime. In other embodiments, and depending on the way the application process thread calls the synchronous receive algorithm, an error can be returned immediately to the application process instead of blocking.
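  • A sketch of algorithm 140 under the above description; what happens after the blocking steps is left to the figure, so the blocking helpers here are assumptions that return once the relevant condition clears:

```c
#include <stddef.h>
#include <stdbool.h>

struct rq;  struct arq;
struct sock { struct rq *rq; struct arq *arq; };

extern void sock_lock(struct sock *);
extern void sock_unlock(struct sock *);
extern bool rq_is_empty(const struct rq *);
extern bool arq_is_empty(const struct arq *);
extern void transfer_payload(struct rq *, char *buf, size_t len);
extern void block_until_arq_drained(struct sock *);  /* step 146 */
extern void block_until_rq_payload(struct sock *);   /* step 152 */

void sync_recv(struct sock *sk, char *new_buf, size_t buf_len)
{
    sock_lock(sk);                       /* grab the sock-lock 52 */

    if (!arq_is_empty(sk->arq))          /* step 144 */
        /* non-empty ARQ: block regardless of the condition of the RQ
         * (or, in other embodiments, return an error immediately) */
        block_until_arq_drained(sk);     /* step 146 */

    if (!rq_is_empty(sk->rq))            /* step 148 */
        transfer_payload(sk->rq, new_buf, buf_len); /* RQ to new buffer */
    else
        block_until_rq_payload(sk);      /* step 152 */

    sock_unlock(sk);
}
```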
  • the algorithms 100, 108, 140 are typically operating system routines which, in the examples described, are called by the process 68 and the application process 40. Although, in terms of implementation at the code level, locks are picked up/taken and released within those routines, possession of the lock is said to reside with the higher-level calling processes 68, 40 on whose behalf the routines are working.

Abstract

The invention concerns a computer system capable of running a plurality of concurrent processes. The system is operable to form a first queue on which items associated with data to be sent over the network are enqueued, access to which is controlled by a lock. When the lock denies a first of the said processes access to the first queue, the system enqueues the items on a second queue, access to which is not controlled by the lock, and allows items on the second queue to be dequeued together with items on the first queue.
PCT/GB2007/001821 2006-05-25 2007-05-18 Computer system WO2007138250A2 (fr)

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
GB0610506A GB0610506D0 (en) 2006-05-25 2006-05-25 Computer system
GB0610506.8 2006-05-25
GB0613556A GB0613556D0 (en) 2006-07-07 2006-07-07 Computer system
GB0613556.0 2006-07-07
GB0613975.2 2006-07-13
GB0613975A GB0613975D0 (en) 2006-07-13 2006-07-13 Computer System
GB0614220A GB0614220D0 (en) 2006-07-17 2006-07-17 Computer system
GB0614220.2 2006-07-17

Publications (2)

Publication Number Publication Date
WO2007138250A2 true WO2007138250A2 (fr) 2007-12-06
WO2007138250A3 WO2007138250A3 (fr) 2008-01-17

Family

ID=38426542

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2007/001821 WO2007138250A2 (fr) Computer system

Country Status (1)

Country Link
WO (1) WO2007138250A2 (fr)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0466339A2 * 1990-07-13 1992-01-15 International Business Machines Corporation Method for passing task messages in a data processing system
US5758184A (en) * 1995-04-24 1998-05-26 Microsoft Corporation System for performing asynchronous file operations requested by runnable threads by processing completion messages with different queue thread and checking for completion by runnable threads
US5951706A (en) * 1997-06-30 1999-09-14 International Business Machines Corporation Method of independent simultaneous queueing of message descriptors
US6651146B1 (en) * 2000-02-24 2003-11-18 International Business Machines Corporation Method and apparatus for managing access contention to a linear list without the use of locks
EP1213892A2 * 2000-12-05 2002-06-12 Microsoft Corporation Method and apparatus for implementing a client-side program stack
US20020174258A1 (en) * 2001-05-18 2002-11-21 Dale Michele Zampetti System and method for providing non-blocking shared structures
WO2003055157A1 * 2001-12-19 2003-07-03 Inrange Technologies Corporation Deferred queuing in a buffered switch
US20040031044A1 (en) * 2002-08-08 2004-02-12 Jones Richard A. Method for increasing performance of select and poll applications without recompilation
WO2005018179A1 * 2003-08-07 2005-02-24 Intel Corporation Method, system and article of manufacture for utilizing host memory from an offload adapter

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
FINKEL R A: "An Operating Systems Vade Mecum, CONCURRENCY" OPERATING SYSTEMS VADE MECUM, ENGLEWOOD CLIFFS, PRENTICE HALL, US, 1989, pages 274-313, XP002266962 *
KNESTRICK, C C: "Lunar: A User-Level Stack Library for Network Emulation" THESIS, 24 February 2004 (2004-02-24), pages I-VII,1-58, XP002457631 *
LEA D: "Concurrent Programming in Java: Design Principles and Patterns, Second Edition" PRENTICE HALL, 25 October 1999 (1999-10-25), pages 1-18, XP002457351 ISBN: 0-201-31009-0 *
MICHAEL M M ET AL: "SIMPLE, FAST, AND PRACTICAL NON-BLOCKING AND BLOCKING CONCURRENT QUEUE ALGORITHMS" PROCEEDINGS OF THE 15TH ANNUAL SYMPOSIUM ON PRINCIPLES OF DISTRIBUTED COMPUTING. PHILADELPHIA, MAY 23 - 26, 1996, PROCEEDINGS OF THE ANNUAL SYMPOSIUM ON PRINCIPLES OF DISTRIBUTED COMPUTING (PODC), NEW YORK, ACM, US, vol. SYMP. 15, 23 May 1996 (1996-05-23), pages 267-275, XP000681051 ISBN: 0-89791-800-2 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8346975B2 (en) 2009-03-30 2013-01-01 International Business Machines Corporation Serialized access to an I/O adapter through atomic operation
EP2770430A1 (fr) * 2013-02-25 2014-08-27 Texas Instruments France System and method for scheduling atomic tasks in a multi-core processor to avoid parallel execution of processes of the same atomicity
US10409655B2 (en) * 2014-03-31 2019-09-10 Solarflare Communications, Inc. Ordered event notification
US11321150B2 (en) 2014-03-31 2022-05-03 Xilinx, Inc. Ordered event notification
JP2017117448A (ja) * 2015-12-26 2017-06-29 Intel Corporation Application-level network queueing

Also Published As

Publication number Publication date
WO2007138250A3 (fr) 2008-01-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07732843

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07732843

Country of ref document: EP

Kind code of ref document: A2