WO2006109087A2 - Data processing system - Google Patents

Data processing system

Info

Publication number
WO2006109087A2
Authority
WO
WIPO (PCT)
Prior art keywords
data
processing system
interface
instruction
data processing
Prior art date
Application number
PCT/GB2006/001389
Other languages
French (fr)
Other versions
WO2006109087A3 (en)
WO2006109087B1 (en)
Inventor
Steven Leslie Pope
David James Riddoch
Kieran Mansley
Martin Porter
Original Assignee
Level 5 Networks Incorporated
Priority date
Filing date
Publication date
Priority claimed from GB0507482A
Priority claimed from GB0507739A
Priority claimed from GB0508288A
Application filed by Level 5 Networks Incorporated
Priority to EP06726786A (published as EP1875708A2)
Publication of WO2006109087A2
Publication of WO2006109087A3
Publication of WO2006109087B1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/544Buffers; Shared memory; Pipes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/545Interprogram communication where tasks reside in different layers, e.g. user- and kernel-space
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00Routing or path finding of packets in data switching networks
    • H04L45/24Multipath
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/19Flow control; Congestion control at layers above the network layer
    • H04L47/193Flow control; Congestion control at layers above the network layer at the transport layer, e.g. TCP related
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/16Implementation or adaptation of Internet protocol [IP], of transmission control protocol [TCP] or of user datagram protocol [UDP]
    • H04L69/161Implementation details of TCP/IP or UDP/IP stack architecture; Specification of modified or new header fields
    • H04L69/162Implementation details of TCP/IP or UDP/IP stack architecture; Specification of modified or new header fields involving adaptations of sockets based mechanisms
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/16Implementation or adaptation of Internet protocol [IP], of transmission control protocol [TCP] or of user datagram protocol [UDP]

Definitions

  • the present application relates to data processing systems and discloses three distinct inventive concepts which are described below in Sections A to C of the description.
  • Claims 1 to 74 relate to the description in Section A
  • claims 75 to 92 relate to the description in Section B
  • claims 93 to 128 relate to the description in Section C.
  • figures 1 and 2 relate to the description in Section A
  • figures 3 to 5 relate to the description in Section B
  • figures 6 to 10 relate to the description in Section C.
  • Embodiments of each of the inventions described herein may include any one or more of the features described in relation to the other inventions.
  • the present invention relates to the transmission of instructions and data in a data processing system, and in particular but not exclusively to intercepting responses transmitted within a data processing system.
  • FIG. 1 represents equipment capable of implementing a prior art protocol stack, such as a transmission control protocol (TCP) stack in a computer connected to a network.
  • the equipment includes an application 1, a socket 2 and an operating system 3 incorporating a kernel 4.
  • the socket connects the application to remote entities by means of a network protocol, in this example TCP/IP.
  • the application can send and receive TCP/IP messages by opening a socket and reading and writing data to and from the socket, and the operating system causes the messages to be transported across the network.
  • the application can invoke a system call (syscall) for transmission of data through the socket and then via the operating system to the network.
  • Syscalls can be thought of as functions taking a series of arguments which cause execution of the CPU to switch to a privileged level and start executing the operating system.
  • a given syscall will be composed of a specific list of arguments, and the combination of arguments will vary depending on the type of syscall.
  • Syscalls made by applications in a computer system can indicate a file descriptor (sometimes called a handle), which is usually an integer number that identifies an open file within a process.
  • a file descriptor is obtained each time a file is opened or a socket or other resource is created.
  • File descriptors can be re-used within a computer system, but at any given time a descriptor uniquely identifies an open file or other resource. Thus, when a resource (such as a file) is closed down, the descriptor will be destroyed, and when another resource is subsequently opened the descriptor can be re-used to identify the new resource. Any operations which for example read from, write to or close the resource take the corresponding file descriptor as an input parameter.
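  • By way of illustration only (this minimal C sketch is not part of the original disclosure, and the file path is hypothetical), the descriptor lifecycle described above looks as follows: opening a resource yields an integer descriptor, operations on the resource take that descriptor as an input parameter, and closing the resource destroys the descriptor so that the same integer can later identify a different resource.

        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            char buf[128];

            /* Opening a resource yields an integer file descriptor. */
            int fd = open("/tmp/example.txt", O_RDONLY);   /* hypothetical path */
            if (fd < 0) {
                perror("open");
                return 1;
            }

            /* Operations on the resource take the descriptor as an input parameter. */
            ssize_t n = read(fd, buf, sizeof(buf));
            if (n >= 0)
                printf("read %zd bytes via descriptor %d\n", n, fd);

            /* Closing the resource destroys the descriptor; the same integer may
             * later be re-used to identify a subsequently opened resource. */
            close(fd);
            return 0;
        }
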
  • When a network related application program interface (API) call is made through the socket library this causes a system call to be made, which creates (or opens) a new file descriptor.
  • the accept() system call takes as an input a pre-existing file descriptor which has been configured to await new connection requests, and returns as an output a newly created file descriptor which is bound to the connection state corresponding to a newly made connection.
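  • A minimal sketch of that accept() behaviour, assuming a TCP listening socket on an arbitrary example port (the port number and the absence of error handling are simplifications, not part of the disclosure):

        #include <netinet/in.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <unistd.h>

        int main(void)
        {
            /* Pre-existing descriptor configured to await new connection requests. */
            int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
            struct sockaddr_in addr;
            memset(&addr, 0, sizeof(addr));
            addr.sin_family = AF_INET;
            addr.sin_addr.s_addr = htonl(INADDR_ANY);
            addr.sin_port = htons(5000);                  /* arbitrary example port */
            bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
            listen(listen_fd, 5);

            /* accept() takes the listening descriptor as input and returns a newly
             * created descriptor bound to the state of the newly made connection. */
            int conn_fd = accept(listen_fd, NULL, NULL);
            if (conn_fd >= 0) {
                printf("listening descriptor %d, connection descriptor %d\n",
                       listen_fd, conn_fd);
                close(conn_fd);
            }
            close(listen_fd);
            return 0;
        }
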
  • the system call when invoked causes the operating system to execute algorithms which are specific to the file descriptor.
  • a descriptor table which contains a list of file descriptors and, for each descriptor, pointers to a set of functions that can be carried out for that descriptor.
  • the table is indexed by descriptor number and includes pointers to calls, state data, memory mapping capabilities and ownership bits for each descriptor.
  • the operating system selects a suitable available descriptor for a requesting process and temporarily assigns it for use to that process.
  • Certain management functions of a computing device are conventionally managed entirely by the operating system. These functions typically include basic control of hardware (e.g. networking hardware) attached to the device. When these functions are performed by the operating system the state of the computing device's interface with the hardware is managed by and is directly accessible to the operating system.
  • An alternative architecture is a user-level architecture, as described in the applicant's copending PCT applications WO 2004/079981 and WO 2005/104475. In a user-level architecture at least some of the functions usually performed by the operating system are performed by code running at user level. In a user-level architecture at least some of the state of the function can be stored by the user-level code. This can cause difficulties when an application performs an operation that requires the operating system to interact with or have knowledge of that state. Examples of syscalls are select() and poll(). These can be used by an application for example to determine which descriptors in use by the application have data ready for reading or writing.
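  • As a concrete illustration of such a call (a sketch, not taken from the disclosure; the helper name poll_ready is invented here), an application might use select() as follows to discover which of its descriptors have data ready for reading:

        #include <stdio.h>
        #include <sys/select.h>

        /* Ask the operating system which of the given descriptors currently have
         * data ready for reading. */
        static void poll_ready(const int *fds, int nfds)
        {
            fd_set readfds;
            int i, maxfd = -1;

            FD_ZERO(&readfds);
            for (i = 0; i < nfds; i++) {
                FD_SET(fds[i], &readfds);
                if (fds[i] > maxfd)
                    maxfd = fds[i];
            }

            struct timeval timeout = { 1, 0 };            /* wait at most one second */
            if (select(maxfd + 1, &readfds, NULL, NULL, &timeout) > 0)
                for (i = 0; i < nfds; i++)
                    if (FD_ISSET(fds[i], &readfds))
                        printf("descriptor %d has data ready\n", fds[i]);
        }

        int main(void)
        {
            int fds[] = { 0 };                            /* e.g. standard input */
            poll_ready(fds, 1);
            return 0;
        }
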
  • FIG. 2 shows components implementing a TCP stack for use in accordance with embodiments of the present invention.
  • Layers of the stack include an application 1 and a socket 2 provided by a socket library.
  • the socket library is an application program interface (API) for building software applications.
  • the socket library can carry out various functions, including creating descriptors and storing information.
  • an operating system 3 comprising a TCP kernel 4, and a proprietary TCP user-level stack 5.
  • Although TCP (Transmission Control Protocol) is referred to by way of example, other protocols such as RTP (Real-Time Transport Protocol) could be used. Non-Ethernet protocols could also be used.
  • the user-level stack is connected to hardware 6 in figure 2.
  • the hardware could be a network interface card (NIC). It interfaces with a network so that data can be transferred between the system of figure 2 and other data processing systems.
  • Data received at the NIC or other hardware 6 is transmitted within the system of figure 2 according to the file descriptor with which it is associated. For example, L5 data will be transmitted onto a receive event queue 7 within the stack 5.
  • When the application 1 wishes to determine whether any data intended for processing by the application has recently been received by the hardware, it initiates a select() or poll() call listing a set of file descriptors. The call is passed to the OS via the socket 2, and a response is returned to the application 1 to indicate, for each descriptor listed in the select() call, whether any new data is available for that descriptor.
  • some of the descriptors will relate to queues run by the L5 stack, whereas some will relate to components in the OS (such as a driver 11 for a storage connection). Both types of data need to be handled by the system of figure 2, and it is desirable to avoid servicing the OS and the user-level TCP stack descriptors separately since this would be inefficient.
  • a data processing system comprising: a set of data stores for storing items of incoming data; an operating system that supports: a series of descriptors, wherein each item of incoming data is associated with one of the descriptors and each descriptor is associated with one of the stores, a plurality of the descriptors being associated with one of the stores, there being a common descriptor indicative of the descriptors of the plurality; and an instruction in respect of a set of the descriptors to which it returns a response indicating for which of the descriptors the stores contain data that has not yet been handled; and an interface for modifying communications between the operating system and a source of the said instruction so as to cause the response to the said instruction to indicate by way of the common descriptor whether the stores contain data for any of the plurality of the descriptors that are members of the set.
  • the data processing system could also comprise a processing unit arranged to process the items of incoming data according to a protocol and produce an output representing the processed items.
  • the protocol is suitably TCP/IP.
  • the operating system is preferably arranged to form the said response based on the output of the processing unit.
  • the processing unit could be arranged to process the items of incoming data at user level, or the processing unit could be implemented in hardware on an interface between the data processing system and a data transmission network.
  • the instruction is preferably a user-level instruction.
  • the interface could be responsive to the user-level instruction for forming a communication to the operating system that represents an instruction of the same type and that identifies, instead of any descriptors of the plurality, the said common descriptor.
  • the interface could be an instruction library. More specifically, it could be a socket library.
  • the data processing system may further comprise a data structure storing, for each descriptor, an indication of whether or not it is a member of the set.
  • the data processing system may also comprise an ordering unit arranged to: monitor the indications stored in the data structure; and if the monitoring indicates that descriptors that are members of the set are excessively interleaved with descriptors that are not members of the set, cause the descriptors to be reordered such that the descriptors of the set form a contiguous group.
  • the ordering unit could be arranged to perform the monitoring periodically.
  • the said instruction could be a select call, or it could be a poll call.
  • the said instruction preferably originates from an application running in the data processing system.
  • a data processing system comprising: a set of data stores for storing items of incoming data, a first one of the data stores being associated with a function of the system; an operating system that supports: a series of descriptors, wherein each item of incoming data is associated with one of the descriptors and each descriptor is associated with one of the stores; and an instruction in respect of one or more of the descriptors to which it returns a response indicating for which of those descriptors the stores contain data that has not yet been handled; and an interface for modifying communications between the operating system and a source of the said instruction so as to cause the response to the said instruction to omit any descriptors other than those associated with any of a subset of the stores that includes the said first one of the data stores.
  • the function is preferably at user level.
  • the interface could be an instruction library. More specifically, it could be a socket library.
  • the said instruction could be a select call, or it could be a poll call.
  • the said instruction preferably originates from an application running in the data processing system.
  • an interface for use in a data processing system comprising: a set of data stores for storing items of incoming data; an operating system that supports: a series of descriptors, wherein each item of incoming data is associated with one of the descriptors and each descriptor is associated with one of the stores, a plurality of the descriptors being associated with one of the stores, there being a common descriptor indicative of the descriptors of the plurality; and an instruction in respect of a set of the descriptors to which it returns a response indicating for which of the descriptors the stores contain data that has not yet been handled; the interface being arranged to modify communications between the operating system and a source of the said instruction so as to cause the response to the said instruction to indicate by way of the common descriptor whether the stores contain data for any of the plurality of the descriptors that are members of the set.
  • a data carrier defining software for operation as an interface in a data processing system, the data processing system comprising: a set of data stores for storing items of incoming data; an operating system that supports: a series of descriptors, wherein each item of incoming data is associated with one of the descriptors and each descriptor is associated with one of the stores, a plurality of the descriptors being associated with one of the stores, there being a common descriptor indicative of the descriptors of the plurality; and an instruction in respect of a set of the descriptors to which it returns a response indicating for which of the descriptors the stores contain data that has not yet been handled; the interface being arranged to modify communications between the operating system and a source of the said instruction so as to cause the response to the said instruction to indicate by way of the common descriptor whether the stores contain data for any of the plurality of the descriptors that are members of the set.
  • Figure 1 shows a prior art computer system
  • FIG. 2 shows a computer system in accordance with embodiments of the present invention.
  • the operating system (OS) 3 incorporates a driver 11 for a piece of hardware such as a disk and a TCP driver or helper 12 for supporting the stack 5.
  • the user-level stack will be referred to as a Level 5 (or L5) stack.
  • the TCP driver 12 is mapped onto the TCP stack 5 by means of a file descriptor.
  • There can be one user-level TCP stack 5 for each application that requires one. This can provide better performance than if a stack is shared between applications. Each stack is located in the same address space as the application that it serves.
  • When L5 data is received at the NIC 6 it is passed to the event queue 7 in the user-level stack, and is flagged to the TCP driver 12 by means of the mapping between the user-level stack 5 and the driver 12. The driver 12 is thereby informed when new L5 data is available. Preferably read-only memory mapping is permitted between the OS and the L5 stack to avoid corruption of data held in the OS by the stack 5.
  • a single event queue will be provided for a given transport library (or socket library) and there will usually be one instance of the transport library associated with each application.
  • a first alternative in accordance with an embodiment of the invention is for the library 2 to intercept a select() call from the application 1, identify all L5 file descriptors identified in the call, and replace them all with a single descriptor denoting L5 descriptors.
  • the single descriptor could suitably be the descriptor used to map the driver 12 onto the stack 5.
  • the select() call, once modified by the library, is passed to the OS.
  • a response is then created by the OS, having polled the TCP driver 12, to indicate whether any L5 descriptors have new data in the event queue 7. This response is based on the results of the TCP/IP validation processing carried out when incoming data is received at the event queue. Data from a given network endpoint can be identified within an event queue by means of the associated file descriptor.
  • the response, once created by the OS, is intercepted by the library 2 and sent to the application, so that the application can establish whether any L5 data is waiting to be handled. If the response indicates that there is new L5 data, the application will need to process the event queue 7 by checking the L5 file descriptors by means of the L5 helper. In this way, unnecessary accessing of the event queue 7 can be avoided when the response indicates that there is no new L5 data.
  • the library could refrain from modifying the parameters of the select() call itself, but could instead modify the response to the select() call to replace any L5 descriptors mentioned in the response with a reference to a single descriptor denoting L5 descriptors.
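  • The first alternative can be pictured with the following hedged C sketch of an interposing routine. The names l5_driver_fd, is_l5_descriptor() and real_select (the latter assumed to have been resolved to the OS's own select(), for example via the dynamic linker) are illustrative assumptions about state held by the library, not part of the disclosure; write/except sets and the return value are passed through unmodified for brevity.

        #include <sys/select.h>

        /* Hypothetical state maintained by the interposing library. */
        extern int l5_driver_fd;              /* descriptor mapping the driver onto the stack */
        extern int is_l5_descriptor(int fd);  /* does this descriptor belong to the L5 stack? */
        extern int (*real_select)(int, fd_set *, fd_set *, fd_set *, struct timeval *);

        int intercepted_select(int nfds, fd_set *readfds, fd_set *writefds,
                               fd_set *exceptfds, struct timeval *timeout)
        {
            fd_set modified;
            int fd, any_l5 = 0;

            /* Replace every L5 descriptor named by the application with the single
             * common descriptor; keep OS descriptors (e.g. disk) as they are. */
            FD_ZERO(&modified);
            for (fd = 0; readfds != NULL && fd < nfds; fd++) {
                if (!FD_ISSET(fd, readfds))
                    continue;
                if (is_l5_descriptor(fd))
                    any_l5 = 1;
                else
                    FD_SET(fd, &modified);
            }
            if (any_l5)
                FD_SET(l5_driver_fd, &modified);

            int maxfd = (l5_driver_fd >= nfds) ? l5_driver_fd + 1 : nfds;
            int ret = real_select(maxfd, readfds ? &modified : NULL,
                                  writefds, exceptfds, timeout);

            /* Hand the OS's answer back to the application: if the common descriptor
             * is marked ready, new L5 data is waiting in the event queue. */
            if (readfds != NULL)
                *readfds = modified;
            return ret;
        }
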
  • a second alternative for efficiently handling new data is particularly appropriate when the TCP file descriptors are busy, in other words when a large amount of TCP data is being received at the hardware 6 and passed to the event queue 7.
  • This approach effectively assigns a high priority to the TCP descriptors, in preference to descriptors related to other components such as the storage connection driver 11.
  • the approach involves directly accessing the queue 7 and ignoring new data intended for components of the system other than the TCP stack. This can be achieved by removing at the library any non-L5 descriptors from a select() call sent from the application, so that it appears to the application that no non-L5 data is available.
  • the library may have access to a data store that stores a record of which of the descriptors are L5 descriptors.
  • the socket library could be prompted by the receipt of a new select() call from the application to access the OS to collect new disk data.
  • the library may be able to respond to select() calls in one of two modes: by indicating for all descriptors specified in the select() call whether there is data waiting to be handled, or by indicating for only those descriptors that are specified in the select() call and that are also L5 descriptors whether there is data waiting to be handled.
  • One convenient way to employ these modes is to respond to a select call using the first mode if more than a predetermined time has elapsed since the last response using the first mode and otherwise to respond using the second mode.
  • Another way is to respond to every n-th select() call using the first mode, and to all other select() calls with the second mode, where n is a predetermined integer.
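  • Either policy amounts to a small piece of bookkeeping in the library, sketched below under stated assumptions (the threshold constant, the function names and the value of n are all illustrative):

        #include <time.h>

        #define FULL_MODE_INTERVAL_SECS 1     /* hypothetical threshold between full responses */

        /* First policy: answer in the first mode (report all descriptors) if more
         * than a predetermined time has elapsed since the last first-mode response,
         * otherwise answer in the second mode (report only L5 descriptors). */
        static int use_full_mode_timed(void)
        {
            static time_t last_full;
            time_t now = time(NULL);

            if (now - last_full >= FULL_MODE_INTERVAL_SECS) {
                last_full = now;
                return 1;                     /* first mode */
            }
            return 0;                         /* second mode */
        }

        /* Second policy: answer every n-th select() call in the first mode. */
        static int use_full_mode_every_nth(int n)
        {
            static int count;
            return (++count % n) == 0;
        }
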
  • details of the data being written to the event queue can be fed back to the application so that the application can determine whether the L5 descriptors are busy, and thus whether the second alternative, involving ignoring data intended for other parts of the system, is appropriate. If the L5 descriptors are not busy then the first alternative, involving accessing of the stack 5 only when L5 data is available, is likely to be more efficient.
  • a dup2(a,b) call has the effect of duplicating the file or other resource represented by descriptor "a" and creating a new resource represented by descriptor "b" and having the same properties.
  • a descriptor that has a system-wide significance, for example the descriptor that maps on to error output (commonly descriptor #2)
  • an element of the system can monitor the arrangement of the descriptors. For example, it could periodically analyse the arrangement of the descriptors.
  • When the L5 descriptors are disaggregated beyond a predetermined level, for example when they are split by other descriptors into more than a predetermined number of groups, the element initiates a reordering of the descriptors using dup2() operations to reduce the disaggregation of the L5 descriptors, and most preferably bring them into a contiguous group.
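  • A hedged sketch of the underlying dup2() step (the helper name is invented; the bookkeeping needed to pick target numbers, and to update any tables that record descriptor ownership, is omitted):

        #include <unistd.h>

        /* Move the resource currently identified by old_fd so that it is identified
         * by new_fd instead: dup2() duplicates the resource onto new_fd (closing
         * whatever was previously at new_fd), after which old_fd can be released.
         * Care is needed not to clobber descriptors with system-wide significance,
         * such as descriptor 2 (error output). */
        static int move_descriptor(int old_fd, int new_fd)
        {
            if (old_fd == new_fd)
                return new_fd;
            if (dup2(old_fd, new_fd) < 0)
                return -1;
            close(old_fd);
            return new_fd;
        }
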
  • the invention relates to the storing of environment variables.
  • Environment variables are variables that define the manner in which a process (or other entity) operates within a data processing system; in other words, they define the environment in which a process operates.
  • an environment variable can specify a feature of a process such as dependence on a particular library of instructions for the proper functioning of the process.
  • the presence of such an environment variable, which could suitably specify the location of the library so that the library could automatically load when the process starts up, can ensure that the process can function as intended by means of the instructions in the library, for example by ensuring that instructions issued by the process are intercepted by the library.
  • An environment variable could also specify the priority with which instructions are passed to different libraries.
  • Environment variables may be dynamic or static, depending on the aspect of a process to which they relate. When a process is being launched on a host processing system, environment variables may be transmitted to that host from another processing system to enable the necessary environment to be created within the host to support the process. Environment variables are commonly stored within a data processing system, typically in memory allocated to the associated process, representing a complete specification of that process. In the context of the invention, memory allocated to an entity is memory to which access may be limited to that entity or to entities operating under the control of that entity.
  • the stored environment variables can be utilised by a process when it is re-starting, or when a process is copying itself, for example in an operation such as fork() or exec() in a Unix-based data processing system. In order for the process to be successfully launched, the full set of environment variables defining the process should be stored against the process in advance of the process starting.
  • a process may be configured to cause the deletion of some or all of its environment variables when that process undergoes a duplication operation such as an exec() system call.
  • If a library operates as an interface between a process and an operating system then the process's environment variables will typically identify the location and configuration of the library. Then, when the process restarts, the process can access the stored environment variables and identify that the library needs to be loaded before the process can perform its functions. However, if a process destroys part or all of its environment before requesting an exec() operation then any information previously held in the environment variables identifying such a library may no longer be available, and the process may be initiated without any link to the library. This could restrict the subsequent performance of the process.
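  • As a hedged illustration of why the environment matters here, the sketch below shows a process consulting a hypothetical environment variable (L5_INTERPOSE_LIB, a name invented for this example) that records where the interposing library lives; if that variable has been destroyed before an exec(), the library cannot be located and loaded. On many Unix-like systems a dynamic-linker variable such as LD_PRELOAD plays a comparable role.

        #include <dlfcn.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* Illustrative only: look up the library's location in the environment and
         * load it so that its interception of subsequent calls can take effect. */
        static void *load_interposing_library(void)
        {
            const char *path = getenv("L5_INTERPOSE_LIB");   /* hypothetical name */
            if (path == NULL) {
                fprintf(stderr, "library location lost from environment\n");
                return NULL;
            }
            return dlopen(path, RTLD_NOW | RTLD_GLOBAL);
        }
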
  • a method for retaining within a data processing system a set of environment variables defining the operation of a first entity supported by the data processing system, wherein the set of environment variables is initially stored in a first store in the data processing system comprising: automatically copying the set of environment variables into a back-up store; intercepting at an interface instructions issued by one or more processes running on the data processing system; in response to intercepting an instruction, restoring by means of the interface the set of environment variables to the first store from the back-up store; and subsequently permitting execution of the instruction.
  • the step of automatically copying may be performed by means of the interface.
  • the method could further comprise the step of, prior to restoring the set of environment variables, causing the deletion of the set of environment variables from the first store. It could further comprise the step of, prior to restoring the set of environment variables, determining by means of the interface whether the set of environment variables has been deleted from the first store.
  • the method may further comprise the step of, after intercepting the said instruction, determining whether the instruction is of a first type, and only if that determination is positive performing the step of restoring the set of environment variables.
  • the first type could suitably be a type indicating that the issue of the said instruction may have been preceded by deletion of the set of environment variables from the first store.
  • the said instruction could be an exec() call.
  • the interface is suitably an interface between the first entity and an operating system of the data processing system.
  • the step of permitting execution of the instruction could involve passing the instruction to the operating system.
  • the interface could be a library.
  • the first store could be in memory allocated to the first entity, and the back-up store could be in memory allocated to the interface.
  • the set of environment variables suitably identifies a memory location in which the interface is stored.
  • the first entity may be one of the one or more processes, and the said instruction may be an instruction issued by the first entity.
  • the said instruction could also be an instruction issued by a process other than the first entity.
  • the first entity could be a library.
  • the step of automatically copying the set of environment variables into a back-up store is preferably performed in response to initialisation of the first entity.
  • the set of environment variables is preferably such as to cause the interception of instructions by the interface.
  • the method could further comprise providing data storage means indicating one or more sets of environment variables; accessing the data storage means to determine the indicated sets of environment variables; and performing the above-recited steps in respect of the indicated sets of environment variables.
  • the step of accessing the data storage means is preferably performed by the interface.
  • the one or more sets of environment variables may include environment variables defining one or more entities other than the first entity.
  • the data storage means could suitably be a configuration file or an application.
  • the method could further comprise the step of writing data to the data storage means to specify a set of environment variables for which the above-recited steps are to be performed.
  • the step of writing data to the data storage means could suitably be performed by means of an application program interface.
  • the instruction could be such as to cause re- initialisation of the first entity.
  • an interface in a data processing system wherein the data processing system has a first store that initially stores a set of environment variables defining the operation of a first entity supported by the data processing system, the interface being arranged to: automatically copy the set of environment variables into a back-up store; intercept instructions issued by one or more processes running on the data processing system; in response to intercepting an instruction, restore the set of environment variables to the first store from the back-up store; and subsequently permit execution of the instruction.
  • the interface could suitably be a set of instructions or routines such as a library.
  • a data processing system comprising an interface as set out above.
  • a data carrier carrying data defining an interface as set out above.
  • the library 2 illustrated in figure 2 represents an interface between the application (or process) 1 and the OS 3.
  • the library may suitably be a transport library handling data transfer functionality to support the transfer of data to and from the process over the network. It is configured to intercept communications from the process 1, including system calls intended for the OS or the user-level stack 5. Details of the library are stored as one or more environment variables in memory allocated to the process such that the library is linked to the process. Thus, if the process were to be restarted, the library would be loaded at an early stage in the launching of the process due to its presence within the process's environment variables.
  • the environment variables of a process may be stored in any suitable way, but in this example are held in a file having a list of (name,value) pairs, where the name is the name of an environment variable and the value is the current value of the variable.
  • the "value" field is initially set to a default value, and is modified if the value changes.
  • the entries in the list can be modified or added to by means of a user or a configuration script.
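  • Purely as an illustration of that format (the variable names, values and path are hypothetical, not taken from the disclosure), one possible rendering of such (name,value) pairs is:

        L5_INTERPOSE_LIB=/opt/l5/lib/interpose.so
        L5_STACK_CONFIG=default
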
  • When a process is about to make an exec() system call, it may choose to destroy its own environment first for security reasons. Thus, it may arrange for the deletion of the environment variables stored in the memory allocated to it. It could then proceed to issue an exec() call without those environment variables presenting a security risk.
  • the library of the present embodiment is configured to make a copy of a process's environment variables automatically when the process starts up.
  • the library holds the copied environment variables in the memory allocated to the library.
  • the library can enable the environment variables to be restored to the process's memory after the process has caused them to be deleted for security reasons.
  • the library achieves this by identifying an intercepted call from the process as indicating that the process may have requested deletion of its state, including environment variables, and in response replacing the deleted environment variables (and optionally any other state) in the process's memory.
  • When the library subsequently permits the intercepted call (such as an exec() call) to be executed, the environment variables detailing the library will be accessible by the process, so that the library will be loaded. Subsequent operations of the process can then be intercepted by the library as intended.
  • the library surreptitiously copies and restores the environment variables while the process itself believes them to have been destroyed.
  • the library can ensure its continued existence in conjunction with the process by this mechanism.
  • the library withholds the intercepted system call from the OS (or user-level stack) until the destroyed environment variables have been restored, and then permits the call to proceed once the environment variables are again available to the process.
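  • A hedged C sketch of this copy-and-restore behaviour: at start-up the library snapshots the process's environment into its own memory, and when it intercepts a call of a type that may have been preceded by environment destruction (such as exec()) it puts the variables back before allowing the call to proceed. The function names are invented, and memory-ownership and error handling are omitted.

        #include <stdlib.h>
        #include <string.h>

        extern char **environ;

        static char **saved_env;      /* copy held in memory allocated to the library */

        /* Called when the library is initialised: copy the process's environment
         * variables into the library's own memory. */
        static void save_environment(void)
        {
            int n = 0, i;
            while (environ[n] != NULL)
                n++;
            saved_env = malloc((n + 1) * sizeof(char *));
            for (i = 0; i < n; i++)
                saved_env[i] = strdup(environ[i]);
            saved_env[n] = NULL;
        }

        /* Called when an intercepted call indicates that the process may have
         * destroyed its environment: restore the saved variables to the process's
         * environment before permitting the call to go ahead. */
        static void restore_environment(void)
        {
            int i;
            for (i = 0; saved_env != NULL && saved_env[i] != NULL; i++) {
                char *copy = strdup(saved_env[i]);
                char *eq = strchr(copy, '=');
                if (eq != NULL) {
                    *eq = '\0';
                    setenv(copy, eq + 1, 1);      /* re-create the (name,value) pair */
                }
                free(copy);
            }
        }
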
  • the library need only maintain a copy of the process's environment variables that relate to the library itself; for example, the location of the library in memory, and its configuration. This information may be sufficient to cause the successful loading of the library when the process starts up. However, in general it may be desirable for the library to maintain a copy of other items of information, such as other state owned by that process, or even state owned by other elements of the data processing system, such as other processes or libraries.
  • Embodiments of the invention could conveniently be extended to enable the library 2 (or another library) to hold copies of environment variables of other processes, other libraries, or any entities, so that these environment variables could be restored following deletion.
  • a configuration file or a separate application could be set up to maintain a list of the state that a given library is intended to copy.
  • the file or application could contain an entry for every element that is active in a data processing system, together with an indication of whether or not a library copy is desired for that element.
  • a simpler configuration could be used, whereby wildcards could indicate, for example, that environment variables for all applications except X, Y and Z are to be copied, or that all environment variables for application X are to be copied, except those relating to elements A and B.
  • the configuration file or application could conveniently be written to by means of an application program interface (API), to enable entities within the data processing system to request the saving and restoration of some or all of their environment variables.
  • a library running in order to support a process additionally maintains its state defining its own operation.
  • This state is typically stored in memory allocated to the library.
  • this state may include details of the network connection (for example a TCP connection) which it is configured to support.
  • When an exec() call is executed in respect of a process, it will typically cause the destruction of the state defining libraries acting to support the operation of the process.
  • the library 2 is an example of such a supporting library, and it could thus be deleted on the execution of an exec() in respect of the process 1.
  • the library 2 is configured to maintain a copy of its own state in memory allocated to the OS. In this way, if the library's own memory is deleted the state can subsequently be restored from the OS memory, as described in more detail below.
  • records detailing the ownership of file handles in use in the data processing system may be stored as library state in memory allocated to the library. This state can enable the efficient routing of instructions and data within the system between the process, the user-level stack and the OS. If this information is deleted (together with other state of the library) when the process undergoes an exec() operation, then the shortcut routing mechanism that had been created for the process will no longer be available following the exec(). The file handle ownership information would therefore need to be re-acquired in order to activate the previous shortcuts.
  • the library's state is copied into the OS prior to a system call that will result in its destruction from the library's own memory space, such that the state can later be restored as required from the OS memory.
  • the state could be saved in any location where it would not be destroyed on the execution of any system call issued by the process.
  • the environment variables of the process are copied into library memory automatically each time the process starts up, so that they are available in the library memory whenever they may be needed for the purpose of restoring the process's own copy.
  • the library copy may be deleted each time the process shuts down, or alternatively it could be retained and then checked against the new values each time the process starts up again, with any changes being copied into the library's memory.
  • the library's own state could be saved into the OS (or elsewhere) each time the library is loaded. Alternatively, it could be saved only when the library intercepts a call from the process which it interprets as indicating that process state (and the corresponding library) may be destroyed, for example an exec() call.
  • the library could be configured to recognise particular types of system call issued from the process as implying that destruction of the process's environment may accompany or precede the call. In the case of interception of such a call, the library could be configured to check the process's environment variables, stored in the process's own memory, to determine whether they are still intact. If so, the library could continue to operate as normal; if not, the library could trigger restoration of the process's environment variables so that they are present during the subsequent operation of the process. Although it is not anticipated to be an efficient manner of operation, it is conceivable that the library could restore the process's environment variables periodically, or each time an instruction from the process is intercepted by the library, if the environment variables are not already intact.
  • destruction or deletion of environment variables or state could involve the removal of all such information from the memory in which it is held, or it could involve modifying the information, for example by resetting variables to their default values.
  • Although the preferred embodiment of the invention involves storing the copy of the process state into the library's memory, it could suitably be held in any location where it would not be affected by the process's destruction of its own environment.
  • the present invention relates to the transmission of data in a network, and in particular to handling control data in such a data processing system within a network.
  • Figure 3 represents equipment capable of implementing a prior art protocol stack, such as a transmission control protocol (TCP) stack in a computer connected to a network.
  • the equipment includes an application 1, a socket 2 and an operating system 3 incorporating a kernel 4.
  • the socket connects the application to remote entities by means of a network protocol, in this example TCP/IP.
  • Data can be transmitted across the network via hardware 6, which interfaces with the network so that data can be transferred between the system of figure 3 and other data processing systems.
  • the hardware could be a network interface card (NIC).
  • the application 1 can send and receive TCP/IP messages by opening a socket and reading and writing data to and from the socket, and the operating system causes the messages to be transported across the network.
  • the application can invoke a system call (syscall) for transmission of data through the socket and then via the operating system to the network.
  • Syscalls can be thought of as functions taking a series of arguments which cause execution of the CPU to switch to a privileged level and start executing the operating system.
  • a given syscall will be composed of a specific list of arguments, and the combination of arguments will vary depending on the type of syscall.
  • Syscalls made by applications in a computer system can indicate a file descriptor (sometimes called a handle), which is usually an integer number that identifies an open file within a process.
  • a file descriptor is obtained each time a file is opened or a socket or other resource is created.
  • File descriptors can be re-used within a computer system, but at any given time a descriptor uniquely identifies an open file or other resource. Thus, when a resource (such as a file) is closed down, the descriptor will be destroyed, and when another resource is subsequently opened the descriptor can be re-used to identify the new resource. Any operations which for example read from, write to or close the resource take the corresponding file descriptor as an input parameter.
  • When a network related application program interface (API) call is made through the socket library this causes a system call to be made, which creates (or opens) a new file descriptor.
  • the accept() system call takes as an input a pre-existing file descriptor which has been configured to await new connection requests, and returns as an output a newly created file descriptor which is bound to the connection state corresponding to a newly made connection.
  • the system call when invoked causes the operating system to execute algorithms which are specific to the file descriptor.
  • a descriptor table which contains a list of file descriptors and, for each descriptor, pointers to a set of functions that can be carried out for that descriptor.
  • the table is indexed by descriptor number and includes pointers to calls, state data, memory mapping capabilities and ownership bits for each descriptor.
  • the operating system selects a suitable available descriptor for a requesting process and temporarily assigns it for use to that process.
  • Certain management functions of a computing device are conventionally managed entirely by the operating system. These functions typically include basic control of hardware (e.g. networking hardware) attached to the device. When these functions are performed by the operating system the state of the computing device's interface with the hardware is managed by and is directly accessible to the operating system.
  • An alternative architecture is a user-level architecture, as described in the applicant's copending PCT applications WO 2004/079981 and WO 2005/104475. In a user-level architecture at least some of the functions usually performed by the operating system are performed by code running at user level. In a user-level architecture at least some of the state of the function can be stored by the user-level code. This can cause difficulties when an application performs an operation that requires the operating system to interact with or have knowledge of that state.
  • Control data within a network can include statistics relating to the use of functions and interfaces within the network, and can include control messages, for example transmitted within the network using Internet Control Message Protocol (ICMP).
  • Control data can be requested by means of a system call sent by an application running on a data processing system.
  • FIG. 4 shows components implementing a user-level TCP stack.
  • Layers of the stack include an application 1 and a socket 2 provided by a socket library.
  • the socket library is an application program interface (API) for building software applications.
  • the socket library can carry out various functions, including creating descriptors and storing information.
  • an operating system 3 comprising a TCP kernel 4, and a proprietary TCP user-level stack 5.
  • a helper or driver 12 for the TCP user-level stack is provided within the operating system.
  • Although TCP is referred to by way of example, other protocols could also be used in accordance with embodiments of the invention, for example UDP (User Datagram Protocol), ICMP (Internet Control Message Protocol) or RTP (Real-Time Transport Protocol). Non-Ethernet protocols could also be used.
  • the user-level stack is connected to hardware 6, such as a NIC.
  • control data (such as an Address Resolution Protocol message) relating to the TCP stack is transmitted directly from the NIC 6 to the stack.
  • Control data such as route changes can also be transmitted to the stack through the syscall API.
  • Control data relating to other functions of the data processing system is transmitted to the operating system and stored there.
  • When one network entity that has a single address (e.g. a MAC or IP address) communicates with another entity, problems can arise in handling the command data.
  • some of the command data may relate to overall operation of the network entity: for instance flow control data such as an ICMP source quench message. This data must be used by both stacks. The control data could simply be copied to both stacks.
  • each stack might send its own response to such data, which would be difficult for the other entity to interpret.
  • one stack might send a TCP RESET message to the network entity in respect of a connection which is being managed by the other stack.
  • some of the control data might be intended for only one of the stacks. It would be inefficient for that data to be handled by both stacks. There is therefore a need for an improved way of handling control and other like data, especially but not exclusively in data communication entities that implement multiple full or partial protocol stacks for a single protocol.
  • a data processing system comprising: a network interface; an operating system capable of transmitting traffic data by means of a network protocol via a first set of communication links over the interface and receiving control data of the protocol via the interface; and a network data transmission function external to the operating system and also capable of transmitting traffic data by means of the network protocol via a second set of communication links over the interface; the data processing system being arranged to route solely to the operating system control messages relating collectively to the operation of the first and second sets of communication links received by the interface, and the operating system being arranged to share the control messages with the network data transmission function so as to permit the operating system and the network data transmission system to react collectively to the control messages.
  • the operating system may be arranged to share with the network data transmission function control messages designated according to the protocol for setting data transmission and/or reception parameters by permitting the network data transmission function to access such messages via the operating system.
  • the said network protocol could be Transmission Control Protocol or Internet Control Message Protocol.
  • the operating system preferably comprises a kernel and a driver, and the data processing system is preferably arranged to route the control messages solely to the kernel and the kernel is arranged to share the control messages with the network data transmission function via the driver.
  • the operating system may be arranged to share with the network data transmission function control messages designated according to the protocol for setting data transmission and/or reception parameters by: storing the content of such messages in memory allocated to the driver; and permitting the network data transmission function to access the content in that memory.
  • the operating system is preferably arranged to inform the network data transmission function of changes to the control data.
  • the operating system may be arranged to copy the control data to the network data transmission function; and the operating system may be arranged to selectively copy the control data to the network data transmission function.
  • the operating system is preferably arranged to copy to the network data transmission function control data relating to the network data transmission function.
  • the operating system is preferably arranged not to copy to the network data transmission function control messages designated according to the protocol for requesting a response indicative of the responsiveness of the data processing system.
  • the control messages designated according to the protocol for requesting a response indicative of the responsiveness of the data processing system may include ping messages.
  • the data processing system could be arranged to route solely to the network data transmission function destination unreachable messages relating to any of the second set of communication links.
  • the operating system may be arranged to copy to the network data transmission function instructions to cease the transmission of data from the data processing system.
  • a data processing system comprising: an operating system capable of transmitting traffic data by means of a network protocol via a first set of communication links over a network interface; and a network data transmission function external to the operating system and also capable of transmitting traffic data by means of the network protocol via a second set of communication links over the interface; the data processing system being arranged to route solely to the operating system a system call requesting data relating to the status of the first and second sets of communication links, and the operating system being arranged to, in response to receiving such a system call: request such status data from the network data transmission function in respect of data transmission and/or reception by the network data transmission function; determine such status data in respect of data transmission and/or reception by the operating system; and on receiving such status data from the network data transmission function, form a response to the function call that is indicative collectively of the status data received from the network transmission function and the status data determined in respect of the operating system.
  • the kernel of the operating system is preferably arranged to perform the said requesting, determining and forming.
  • an operating system for a data processing system comprising a network data transmission function external to the operating system and capable of transmitting traffic data by means of a network protocol via a first set of communication links over a network interface; the operating system being capable of transmitting traffic data by means of the network protocol via a second set of communication links over the interface and being arranged to, in response to receiving a system call requesting data relating to the status of the first and second sets of communication links: request such status data from the network data transmission function in respect of data transmission and/or reception by the network data transmission function; determine such status data in respect of data transmission and/or reception by the operating system; and on receiving such status data from the network data transmission function, form a response to the function call that is indicative collectively of the status data received from the network transmission function and the status data determined in respect of the operating system.
  • a data carrier carrying data defining an operating system as defined above.
  • Figure 3 shows a prior art computer system
  • Figure 4 shows a prior art computer system incorporating a user-level stack
  • FIG. 5 shows a system in accordance with embodiments of the present invention.
  • control data 10 relating to the operating system 3 is sent via the NIC 6 to the operating system.
  • Control data 11 relating to the TCP user-level stack 5 is also transmitted to the operating system, where it is stored in conjunction with the data 10. This can ensure that TCP flow control data is handled in a coherent way.
  • a filtering function filters incoming control data.
  • the filtering function has access to filter definition data that indicates where each type of control data should be directed in the communication entity: whether to the operating system's stack, the user-level stack or both.
  • the filter definition data may conveniently be implemented as a look-up or masking table, as will be described in more detail below.
  • when an item of control data is received, the filtering function checks the filter definition data to determine where to direct the item, and then directs the item accordingly.
  • the filter definition data is stored in a data structure 13, which is accessible by the TCP driver 12 by means of memory mapping 14.
  • the memory mapping is preferably read only, to avoid corruption of the data structure by the TCP driver.
  • There is one user-level TCP stack 5 for each application that requires one. This can provide better performance than if a stack is shared between applications.
  • Each stack is located in the same address space as the application that it serves. None of the user-level stacks is a master stack. It is preferred for the stack of the operating system to be the master stack since it can be expected to be available whenever the operating system is operational, whether or not a user-level stack has been configured.
  • a system call initiated at a control API by an application 1 requesting statistical data representing usage of the TCP stack 5 is intercepted by the kernel 4.
  • the kernel accesses the data structure 13 to obtain the required data.
  • a response is transmitted from the operating system to the application containing the requested information.
  • the kernel incorporates control data from the TCP user-level stack and any other user-level stacks in the data processing system into the response in order that the response to the function call can be representative collectively of the status of the operating system's stack and the user-level stacks.
  • Calls that request statistical data can, for example, be made through a UNIX /proc/net interface or the Windows WMI interface.
  • One example is the command cat /proc/net/snmp on a Linux system, which will result in a read system call to the /proc/net/snmp special file, which causes the kernel network subsystem to return the ICMP Management Information Base (MIB) statistics.
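  • For example (a sketch, not from the disclosure), a monitoring program can obtain those statistics simply by reading the special file; on Linux the lines beginning "Icmp:" carry the ICMP MIB counters:

        #include <stdio.h>

        int main(void)
        {
            char line[512];
            FILE *f = fopen("/proc/net/snmp", "r");
            if (f == NULL)
                return 1;
            while (fgets(line, sizeof(line), f) != NULL)
                fputs(line, stdout);      /* includes the "Icmp:" MIB lines */
            fclose(f);
            return 0;
        }
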
  • Such calls may, for example, be sent from the application level through a control API (application programming interface), for example through /proc/net in a UNIX-like system.
  • the kernel of the system is arranged so that it can efficiently react to that request with a response that takes account of the relevant ICMP MIB status data from both the operating system's TCP stack and any user-level TCP stacks.
  • the system is configured so that the request itself is sent to the operating system only, and not to the user level stacks.
  • the operating system is configured to process such requests for status data by combining the relevant data from its own stack with the relevant data from the user-level stacks.
  • the operating system requests, in response to the receipt of the function call, the relevant data from the user-level stacks. It may conveniently do this by way of the appropriate driver which may be installed as part of the operating system.
  • On receiving the relevant data from the user-level stacks, the operating system combines that data with the corresponding data from its own stack. Thus, if the request is for the total number of packets sent it would add together the numbers from the user-level stack and the operating system stack. If the request were for an average data rate it would determine a combined average taking account of the relative amounts of data sent by the user-level and operating system stacks. It then responds to the system call by returning the appropriate data. The data about the operating system's own stack could be gathered before or after the data about the user-level stacks is received or requested.
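  • As a minimal sketch of that combination step (the structure and field names here are hypothetical, not taken from the patent), counters from the operating-system stack and the user-level stacks can be summed, and an average rate derived from the combined totals:
```c
/* Illustrative sketch only: combining per-stack statistics as described above.
 * The structure and field names are hypothetical. */
#include <stdio.h>

struct stack_stats {
    unsigned long packets_sent;   /* total packets sent by this stack */
    unsigned long bytes_sent;     /* total bytes sent by this stack   */
    double        seconds_active; /* measurement window, in seconds   */
};

/* Counters are simply added; the combined average rate then reflects the
 * relative amounts of data sent by each stack over the common window. */
static struct stack_stats combine(const struct stack_stats *s, int n)
{
    struct stack_stats total = { 0, 0, 0.0 };
    for (int i = 0; i < n; i++) {
        total.packets_sent += s[i].packets_sent;
        total.bytes_sent   += s[i].bytes_sent;
        if (s[i].seconds_active > total.seconds_active)
            total.seconds_active = s[i].seconds_active;
    }
    return total;
}

int main(void)
{
    struct stack_stats stats[] = {
        { 1200, 1500000, 10.0 },   /* operating-system stack */
        { 8400, 9800000, 10.0 },   /* user-level stack       */
    };
    struct stack_stats t = combine(stats, 2);
    printf("total packets sent: %lu\n", t.packets_sent);
    printf("combined average rate: %.0f bytes/s\n",
           t.bytes_sent / t.seconds_active);
    return 0;
}
```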
  • not all control data passing to the data processing system through the hardware 6 need be transmitted to the operating system.
  • a filtering arrangement can be implemented, for example at the hardware level, to determine which control data should be sent to which stack function of the system. For example, it could be determined that certain data is passed to the operating system only (such as the data 10 and 11 in figure 5), while other data is passed to the TCP stack only, and further data is passed to both the operating system and the TCP stack. For example:
  • a response indicating "destination unreachable" is preferably transmitted only to the user-level stacks in the system, to be stored as TCP control data. Each stack then interprets this incoming response and determines, based on its own list of outstanding connection requests, whether the response was intended for it.
  • a "ping" message received at the hardware 6 is transmitted only to the operating system.
  • the operating system can then respond to that message without the user-level stack having to use processing time on it, which also avoids the possibility of two ping responses being sent.
  • a "source quench" control message is sent both to the operating system and to the TCP stack, since this message should prevent both stacks from transmitting data.
  • rules can be set, and the filter definition data configured, to determine the destination of any type of control message in accordance with its applicability to the operating system or to user-level stacks.
  • the filter definition data will typically filter incoming messages based on their protocol type and, where appropriate, IP addresses and protocol port numbers. Incoming messages are compared against the content of the filter definition data and routed according to the rules defined by that data.
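  • A minimal sketch of such a look-up table, assuming a simplified rule format keyed on protocol type and ICMP message type (the names and values are hypothetical), might be:
```c
/* Illustrative sketch only: a filter-definition table keyed on protocol type
 * and ICMP message type, with a destination mask saying whether a message is
 * routed to the operating-system stack, the user-level stack or both.
 * Names and values are hypothetical. */
#include <stdint.h>
#include <stdio.h>

#define DEST_OS   0x1   /* deliver to the operating-system stack */
#define DEST_USER 0x2   /* deliver to the user-level stack       */

struct filter_rule {
    uint8_t protocol;    /* IP protocol number: 1 = ICMP          */
    uint8_t icmp_type;   /* ICMP message type, 0xFF matches any   */
    uint8_t dest_mask;   /* DEST_OS, DEST_USER or both            */
};

static const struct filter_rule rules[] = {
    { 1, 3, DEST_USER           },   /* destination unreachable -> user-level only */
    { 1, 8, DEST_OS             },   /* echo request ("ping")   -> OS only         */
    { 1, 4, DEST_OS | DEST_USER },   /* source quench           -> both            */
};

static uint8_t lookup_dest(uint8_t protocol, uint8_t icmp_type)
{
    for (size_t i = 0; i < sizeof(rules) / sizeof(rules[0]); i++) {
        if (rules[i].protocol == protocol &&
            (rules[i].icmp_type == 0xFF || rules[i].icmp_type == icmp_type))
            return rules[i].dest_mask;
    }
    return DEST_OS;   /* default: hand unmatched control data to the OS */
}

int main(void)
{
    printf("ping             -> mask 0x%x\n", lookup_dest(1, 8));
    printf("dest unreachable -> mask 0x%x\n", lookup_dest(1, 3));
    printf("source quench    -> mask 0x%x\n", lookup_dest(1, 4));
    return 0;
}
```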
  • the present approach is applicable to control data of various types. It has been described in relation to TCP and ICMP data but it could be used with data of other types.
  • the present invention relates to the transmission of data between a pair of data processing units, and in particular but not exclusively to data transmission in networks.
  • FIG. 6 shows schematically the architecture of a networked system.
  • the system comprises two data processors 10a, 10b (such as personal computers or servers), each of which has a network interface 11a, 11b (such as a NIC).
  • the network interfaces are linked together over a data network 12.
  • the data network could be an Ethernet network (e.g. using Gigabit Ethernet) or could employ any other suitable protocol.
  • Each data processor has an operating system 13a, 13b which includes a kernel 14a, 14b and a device driver 15a, 15b for controlling communications between the data processor and its network interface.
  • the operating system supports applications or processes 16a, 16b running on the data processor.
  • a transport library 17a, 17b provides the applications/processes with routines that can be used for controlling communications over the network, and supervises communications between the applications/processes and the operating system.
  • Each data processor has a memory 18a, 18b.
  • Figure 7 illustrates such a system and shows a transmitting station 1 for sending data to a series of receiving stations 2. A link is shown for sending data between the transmitting station and a switch 3a. The data can then be passed either to a hub 4, to a further switch 3b, or directly to a receiving station 2.
  • the transmitting station could suitably be a personal computer or a server, or it could be any other network device needing to send or receive data, such as a dedicated network appliance or a multimedia terminal.
  • the transmitting station preferably includes a user-level stack for handling transmission and reception of data.
  • at least one TCP stack could be provided in the transmitting station to enable transmission of data from that station across the network by TCP.
  • the routes between the transmitting station and the various receiving stations illustrated in figure 7 may have different maximum transmission rates beyond which the respective receiving station will start dropping packets.
  • the difference in transmission rate between links can be caused by many factors.
  • the maximum possible transmission rate may be determined by, for example:
  • acknowledge messages to inform the transmitting station when the receiving station successfully receives transmitted data.
  • a receiving station will typically send an acknowledge message to a transmitting station each time a packet, or a predetermined number of packets, is received. For example, an acknowledge message can be sent for every two packets received.
  • Figure 8 illustrates an exemplary arrangement of bytes of data for transmission across a network in packets.
  • the packets 30 each contain a number of bytes 31.
  • packet P1 contains byte numbers N to N+M
  • packet P2 contains byte numbers N+M+1 to N+2M+1.
  • acknowledgement messages generally include an indication of the byte number of the last byte in the last received packet, rather than indicating the last received packet number.
  • the transmitting station interprets an acknowledgement as meaning that all bytes prior to the indicated byte number have been successfully received over the network. At this time, the transmitting station can remove all such prior data from buffers on the interface since this data is no longer required for transmission.
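  • A minimal sketch of that cumulative interpretation at the transmitter, using hypothetical names, might be:
```c
/* Illustrative sketch only: a transmitter treating a cumulative acknowledgement
 * as meaning that all bytes before the acknowledged byte number have arrived,
 * so the corresponding buffered data can be released. Names are hypothetical. */
#include <stdint.h>
#include <stdio.h>

static uint32_t snd_una = 1000;   /* lowest byte number not yet acknowledged */

static void on_ack(uint32_t ack)
{
    if (ack > snd_una) {
        printf("freeing bytes %u..%u from the retransmit buffer\n",
               (unsigned)snd_una, (unsigned)(ack - 1));
        snd_una = ack;            /* everything below `ack` is now confirmed */
    }
}

int main(void)
{
    on_ack(1460);   /* acknowledgement for the first packet  */
    on_ack(2920);   /* acknowledgement for the second packet */
    return 0;
}
```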
  • the transmitting station can modify the rate at which it transmits data over a particular link. For example, it could make a decision to increase the transmission rate if it is receiving acknowledgements for every packet sent, and the rate could be progressively increased until it becomes apparent by means of a lack of acknowledgements that data is no longer being reliably received at the receiving station, and at that point decrease the rate to a more reliable level.
  • When an expected packet is not received at a receiving station, the receiving station sends multiple acknowledge (ACK) messages identifying the last successfully received packet until the missing packet is received.
  • If the next received packet is P3 then the receiving station can determine that the byte number of the first byte in P3 does not run consecutively on from the byte number of the last byte in P1. It therefore recognizes that data is missing from the stream. In order to alert the transmitting station to the loss of data it re-transmits the ACK indicating P1.
  • a further ACK is sent, again identifying the last byte that was received in order.
  • the transmitting station interprets these duplicate ACKs as the loss of data subsequent to P1.
  • the transmitting station cannot determine from the duplicate ACK how much data has been lost over the network - it simply knows that at least one packet has been lost. It therefore re-transmits all data transmitted since the byte number acknowledged in the last received ACK. This algorithm is known as Fast Retransmit and is fully described in RFC2001.
  • the packets may be received at the receiving station in the order: P1, P3, P2.
  • when the receiving station receives P1 followed by P3 it will transmit a duplicate ACK, causing the re-transmission of P2 and a reduction in the rate of future transmissions. However, these steps are unnecessary since P2 would have been received at the receiving station anyway, after P3, and there was thus no need for it to be re-transmitted.
  • a transmitting station will recognise a threshold number of duplicate ACKs (dupACKs) beyond which data that is apparently lost will be re-transmitted. For example, after receiving three identical ACKs it may be programmed to begin retransmitting. This can help to avoid unnecessary re-transmissions in cases of reordering, since when a delayed packet is received at a receiving station, a new ACK can then be sent and the transmitting station will recognise that subsequent data has been received.
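  • A minimal sketch of such a dupACK counter at the transmitter (the names and the threshold handling are illustrative only) might be:
```c
/* Illustrative sketch only: counting identical ACKs at the transmitter and
 * triggering retransmission once a threshold of three duplicates is reached.
 * Variable names are hypothetical. */
#include <stdint.h>
#include <stdio.h>

#define DUPACK_THRESHOLD 3

static uint32_t last_ack  = 0;   /* byte number carried by the last ACK */
static int      dup_count = 0;   /* identical ACKs seen in a row        */

static void on_ack(uint32_t ack)
{
    if (ack == last_ack) {
        if (++dup_count == DUPACK_THRESHOLD)
            printf("retransmit data from byte %u\n", (unsigned)ack);
    } else {
        last_ack  = ack;         /* new data acknowledged: reset counter */
        dup_count = 0;
    }
}

int main(void)
{
    on_ack(1460);   /* ordinary ACK                         */
    on_ack(1460);   /* 1st duplicate                        */
    on_ack(1460);   /* 2nd duplicate                        */
    on_ack(1460);   /* 3rd duplicate: retransmission starts */
    return 0;
}
```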
  • if the dupACK threshold is relatively large then this can have undesirable impacts on the efficiency of data transmission. Firstly, it can give rise to a long delay before re-transmission in the event of a true loss of data; secondly, an insufficient number of packets may be transmitted after a lost packet (if the lost packet was near to the end of a stream of data), meaning that the receiving station does not send a sufficient number of dupACKs to cause the transmitting station to re-transmit the lost data.
  • a method for receiving data by means of a data transmission system from a data source via two routes by a protocol according to which data is transmitted in the form of sequential packets and the transmission by a receiver of a predetermined number of acknowledgement messages in respect of a packet is used to signal a data source to retransmit subsequent packets, the predetermined number being greater than one, the method comprising: receiving data packets from the data source via both of the two routes; and on receiving a data packet from the data source: determining whether any packets in sequence between that packet and the previous packet received over the same route as that packet for which a first acknowledgement message has been transmitted have not been received, and if they have not, transmitting a further acknowledgement message to the data source in respect of that previous packet.
  • the method could further comprise the step of issuing from a receiver an acknowledgement message to the data source in response to receiving a predetermined number of packets from the data source.
  • the data source may comprise two transmission ports, each port forming a node within a respective one of the two routes.
  • the data transmission system is preferably such that the route via which each data packet is transmitted from the data source is determined in dependence on an algorithm.
  • the data source and the receiver have access to the algorithm.
  • the method could further comprise the step of: at the data source, modifying information within a packet of data in dependence on the route determined for transmission of that packet.
  • the said information could be a Media Access Control (MAC) address.
  • MAC Media Access Control
  • the method could further comprise the step of, on receiving a data packet from the data source, determining from the modified information the route via which the packet was transmitted.
  • the algorithm is preferably such as to balance the transmission of data over the routes such that substantially the same number of bytes is transmitted over each route in a given time period.
  • the protocol is preferably such as to support the selection of one of multiple routes between a source and a receiver for routing of each packet in dependence on data contained in the respective packet.
  • the protocol could suitably be TCP over Internet Protocol.
  • the algorithm is preferably such as to determine the said route for each packet in dependence on a TCP sequence number contained in the packet.
  • the step of determining whether any packets have not been received may comprise determining by means of the algorithm the routes via which (i) the said data packet, (ii) the said previous packet, and (iii) any packets in sequence between the said packet and the said previous packet were transmitted.
  • the sequence of data packets in the data transmission system may be defined by identifiers of bytes contained in a data stream from which the packets are formed. Each of the said bytes could be the byte constituting the first byte of traffic data within a respective packet.
  • a receiver in a data transmission system, the receiver being arranged for receiving data from a data source via two routes by a protocol according to which data is transmitted in the form of sequential packets and the transmission by the receiver of a predetermined number of acknowledgement messages in respect of a packet is used to signal the data source to retransmit subsequent packets, the predetermined number being greater than one,
  • the receiver comprising: a transmission unit for transmitting acknowledgement messages; and a determining unit arranged to, on receiving a data packet, determine whether any packets in sequence between that packet and the previous packet received over the same route as that packet for which a first acknowledgement message has been transmitted by the transmission unit have not been received, and if they have not, cause the transmission unit to transmit a further acknowledgement message to the data source in respect of that previous packet.
  • the transmission unit is preferably arranged to transmit an acknowledgement message to the data source in response to receiving at the receiver a predetermined number of packets from the data source.
  • the route via which each data packet is transmitted from the data source is preferably determined in dependence on an algorithm, and the data source and the receiver may have access to the algorithm.
  • a method for transmitting a data message from a transmitter to a receiver, the message being conveyed in the form of data units each containing a portion of the message, there being a plurality of routes available for transmitting the data units between the transmitter and the receiver, the method comprising: selecting for each data unit one of the plurality of routes, the selection being performed in dependence on the position in the data message of at least part of the portion of the data message contained in the data unit; and transmitting that data unit over the selected route.
  • the said at least part of the portion is preferably the byte in the portion whose position in the data message is closest to the start of the message.
  • the portion of the data message may consist of contiguous bytes of the data message in their order in the data message.
  • Each data unit could comprise only a single portion of the data message.
  • the position is preferably determined as the offset of the part of the portion from the start of the message.
  • a transmitter for transmitting a data message from a transmitter to a receiver, the message being conveyed in the form of data units each containing a portion of the message, there being a plurality of routes available for transmitting the data units between the transmitter and the receiver, the transmitter comprising: a selection unit arranged to select for each data unit one of the plurality of routes, the selection being performed in dependence on the position in the data message of at least part of the portion of the data message contained in the data unit; and a transmission unit arranged to transmit that data unit over the selected route.
  • Figure 6 shows a prior art data transmission system
  • Figure 7 shows a prior art network
  • Figure 8 shows the arrangements of bytes and packets within a data stream
  • Figure 9 shows a section of a data transmission system
  • Figure 10 shows a data stream resulting from a pair of transmitting ports.
  • Figure 9 shows a part of a data transmission network in which a pair of NICs 11c, 11d are transmitting data via a pair of switches 3c, 3d to endpoints in the network.
  • Each NIC has two transmission ports 40 and 41, termed port 0 (40) and port 1 (41).
  • each port is arranged to transmit data to a respective one of the two switches.
  • the port 0s can communicate with each other, and the port 1s can communicate with each other, but, as shown, the switches cannot communicate with each other.
  • an application transmitting data across the network from one of the NICs will cause data to be transmitted alternately from port 0 and port 1 on the NIC.
  • sequence number information in the packet, for example the sequence number of the first byte in the packet (i.e. its offset in bytes from the start of the message), will be determined and applied to an algorithm.
  • the TCP sequence number could be extracted for entry into the algorithm.
  • the algorithm will then cause one of the ports to be selected for transmitting that packet.
  • the algorithm could be that if the number of the first byte of the packet (numbered in sequence from the start of the message) is even then a first route is chosen, and if it is odd then a second route is chosen.
  • Another example would be to divide the sequence number of the first byte of the packet by the maximum segment size (i.e. packet size), and use low order bits of the result to choose the route.
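  • A minimal sketch of that second example algorithm, assuming a hypothetical maximum segment size of 1460 bytes, might be:
```c
/* Illustrative sketch only: choose a transmission port by dividing the TCP
 * sequence number of the packet's first byte by the maximum segment size and
 * taking the low-order bit of the result. The MSS value is hypothetical. */
#include <stdint.h>
#include <stdio.h>

#define MSS 1460u   /* assumed maximum segment size in bytes */

static unsigned select_port(uint32_t first_byte_seq)
{
    return (first_byte_seq / MSS) & 1u;   /* low-order bit selects port 0 or 1 */
}

int main(void)
{
    /* successive full-sized packets alternate between the two ports, so the
     * number of bytes sent over each route stays approximately equal */
    for (uint32_t seq = 0; seq < 5 * MSS; seq += MSS)
        printf("packet starting at byte %5u -> port %u\n",
               (unsigned)seq, select_port(seq));
    return 0;
}
```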
  • the number of bytes transmitted from each port will be approximately equal.
  • reordering of transmitted data can occur.
  • a first packet of data transmitted from port 0 prior to a second packet of data transmitted from port 1 might be received at a receiving station after the second packet. This may be due for example to a difference in the volume of traffic or a difference in the number of switches along the two routes.
  • Figure 10 shows a data stream comprising packets 1 to 9. Bytes within these packets are in consecutive numerical order. Packets 1, 2, 3, 4 and 5 are transmitted sequentially from port 0 of NIC 11c. Packets 6, 7, 8 and 9 are then transmitted sequentially from port 1 of the NIC.
  • a receiving station receives the data stream in the following order: 1, 2, 3, 4, 6, 7, 8, 9, 5. ACKs are sent following the receipt of each of packets 1 to 4.
  • the receiver recognises that the first byte of this packet does not follow consecutively from the last byte of packet 4, and that there is thus a gap of at least one packet in the received data.
  • because the receiving station can deduce the port from which at least the first packet of the missing data was transmitted from the NIC, it can determine that the missing data - packet 5 - was sent from port 0 whereas packet 6 was sent from port 1.
  • the receiving station therefore refrains from sending a dupACK on receipt of the out-of-order packet 6 and instead awaits the next packet from port 0. This is received after packet 9, and packet 5 is then successfully received without re-transmission having been triggered.
  • ACKs are sent only in respect of the packets which arrive in their proper sequence.
  • no new ACKs are sent until packet 5 has been received.
  • an ACK is sent identifying the last received in-order packet, packet 9.
  • An ACK indicates that all data up to the byte indicated in the ACK has been received.
  • embodiments of the invention interpret a gap on a port and subsequent receipt of data from the same port as data loss, and a gap on a port and subsequent receipt of data from another port as reordering.
  • In order for the receiving station to be able to accurately determine the port from which a given packet was sent, it is desirable for both the transmitting station and the receiving station to have access to the same algorithm for selecting a port number for transmission of a packet. In this way, when a gap is observed at the receiving station, the receiving station can determine whether the first missing packet would have been transmitted on the same port as subsequent received data. If so, the receiving station interprets the lack of data as a loss over the communication link. If not, the receiving station interprets this initially as re-ordering and awaits the next batch of packets from the link on which the missing packet was sent. If the packet then duly arrives, no further action is required. However, if the missing packet does not then arrive, this is interpreted as loss and re-transmission is triggered by means of dupACKs.
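  • A minimal sketch of that receiver-side decision, re-using the hypothetical port-selection rule sketched earlier so that both ends apply the same algorithm, might be:
```c
/* Illustrative sketch only: the receiver applies the same hypothetical
 * port-selection rule as the transmitter, so when a gap appears it can tell
 * whether the first missing packet was sent on the same port as the packet
 * that has just arrived (loss) or on the other port (probable re-ordering). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MSS 1460u   /* assumed maximum segment size, as in the earlier sketch */

static unsigned port_of(uint32_t first_byte_seq)
{
    return (first_byte_seq / MSS) & 1u;   /* same rule as the transmitter */
}

/* `arrived_seq` is the first byte of the packet just received; `expected_seq`
 * is the next byte the receiver was expecting. Returns true if a dupACK
 * should be sent now. */
static bool should_send_dupack(uint32_t arrived_seq, uint32_t expected_seq)
{
    if (arrived_seq <= expected_seq)
        return false;                     /* no gap: nothing to signal */
    unsigned missing_port = port_of(expected_seq);
    unsigned arrived_port = port_of(arrived_seq);
    /* gap on the same port -> loss; gap on the other port -> wait for that
     * port's next packet before treating the data as lost */
    return missing_port == arrived_port;
}

int main(void)
{
    /* the receiver expects byte 4*MSS (start of a packet sent on port 0) but a
     * packet starting at byte 5*MSS (sent on port 1) arrives instead: this is
     * treated as re-ordering and the dupACK is withheld */
    printf("send dupACK? %s\n",
           should_send_dupack(5 * MSS, 4 * MSS) ? "yes" : "no");
    return 0;
}
```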
  • a receiving station can consider data received from each transmission port separately. It can treat the data being received from each port as an individual stream. This can assist in the determination of whether a missing packet implies loss or simply re-ordering. For example, if data is sent in the order indicated below: Table 1
  • the receiver recognises that the port 0 data was all received in order and the port 1 data was all received in order, and thus there was no loss of data.
  • the receiver would recognise that the received port 1 data is missing at least one packet, and would trigger re-transmission accordingly.
  • ACKs would be sent after the receipt of packet 1, and then after the receipt of packet 2 (since this is the next in-order packet).
  • the ACK sent following the receipt of packet 2 could identify packet 3, since this had already been successfully received by the time packet 2 was received.
  • no new ACKs would be sent after the receipt of packet 2 since the next in-order packet, packet 4, was not received. Receipt of packet 6 would thus trigger a dupACK identifying packet 3, this being the last in-order packet successfully received.
  • Acknowledgement messages can specify an amount of data (a "window") which the receiving station wishes the transmitting station to send in its next transmission.
  • Prior art mechanisms can be utilised in conjunction with embodiments of the present invention to inform the transmitting station when it re-transmits more data than necessary.
  • a receiving station generally does not know how many packets of data have been lost when a missing packet is identified. All it knows is that at least one packet has been lost. The transmitting station therefore does not know accurately how much data needs to be re-sent in the event of dupACKs. It can estimate, from the number of packets sent since it transmitted the byte identified in the duplicated ACK, which packet(s) have not been received, and on that basis it can begin re-transmitting all those packets.
  • a prior art Selective Acknowledgement mechanism, improved by the incorporation of reordering information from the receiver, can be implemented to inform the transmitter that too much data is being re-transmitted.
  • the mechanism relies on information obtained at the transmitter from the receiver. In particular, the following information can be obtained by the transmitter:
  • This data can conveniently be sent in messages from the receiver to the transmitter, and can be used to improve the efficiency of communications from the transmitter to the receiver.
  • the receiving station can be provided with information detailing specific routes between the transmitting station and the receiving station, preferably including details of any switches in the routes and maximum transmission rates of links within the routes, and including details of which transmission ports are using which routes in a given set-up. This information can be used by the receiving station in improving its understanding of instances of re-ordering. If re-ordering is anticipated by the receiving station on the basis of such information, the receiving station can refrain from sending dupACKs while awaiting delayed data, and thus unnecessary re-transmission of data can be avoided further.
  • Embodiments of the invention can be applied in respect of switches, or other nodes within a route, as well as in respect of transmitters.
  • a receiver could be provided with knowledge of rules used by the switch to make routing decisions, and the accuracy with which the receiver could predict instances of re-ordering could thereby be improved.
  • a user-level stack implements the invention.
  • the skilled person will understand that the invention is not limited to implementation in such a stack.
  • Embodiments of the invention are suitable for application to a single TCP application, thereby improving the retransmission characteristics over a single connection between a transmitter and a receiver. These embodiments are preferable to prior art techniques which can typically only be used for multiple connections and which effectively bond the connections together.
  • the present invention is not limited to use with TCP.
  • the present invention can advantageously be used with other protocols that call for retransmission of unacknowledged messages, and under which data units may traverse two or more paths between a transmitter and a receiver.

Abstract

A data processing system comprising: a set of data stores for storing items of incoming data; an operating system that supports: a series of descriptors, wherein each item of incoming data is associated with one of the descriptors and each descriptor is associated with one of the stores, a plurality of the descriptors being associated with one of the stores, there being a common descriptor indicative of the descriptors of the plurality; and an instruction in respect of a set of the descriptors to which it returns a response indicating for which of the descriptors the stores contain data that has not yet been handled; and an interface for modifying communications between the operating system and a source of the said instruction so as to cause the response to the said instruction to indicate by way of the common descriptor whether the stores contain data for any of the plurality of the descriptors that are members of the set.

Description

DATA PROCESSING SYSTEM
The present application relates to data processing systems and discloses three distinct inventive concepts which are described below in Sections A to C of the description.
Claims 1 to 74 relate to the description in Section A, claims 75 to 92 relate to the description in Section B, and claims 93 to 128 relate to the description in Section C.
In the appended drawings, figures 1 and 2 relate to the description in Section A, figures 3 to 5 relate to the description in Section B, and figures 6 to 10 relate to the description in Section C.
Embodiments of each of the inventions described herein may include any one or more of the features described in relation to the other inventions.
Where reference numerals are used in a Section of the description they refer only to the figures that relate to the description in that Section.
SECTION A INTERCEPTING MESSAGES
The present invention relates to the transmission of instructions and data in a data processing system, and in particular but not exclusively to intercepting responses transmitted within a data processing system.
Figure 1 represents equipment capable of implementing a prior art protocol stack, such as a transmission control protocol (TCP) stack in a computer connected to a network. The equipment includes an application 1 , a socket 2 and an operating system 3 incorporating a kernel 4. The socket connects the application to remote entities by means of a network protocol, in this example TCP/IP. The application can send and receive TCP/IP messages by opening a socket and reading and writing data to and from the socket, and the operating system causes the messages to be transported across the network. For example, the application can invoke a system call (syscall) for transmission of data through the socket and then via the operating system to the network. Syscalls can be thought of as functions taking a series of arguments which cause execution of the CPU to switch to a privileged level and start executing the operating system. A given syscall will be composed of a specific list of arguments, and the combination of arguments will vary depending on the type of syscall.
Syscalls made by applications in a computer system can indicate a file descriptor (sometimes called a handle), which is usually an integer number that identifies an open file within a process. A file descriptor is obtained each time a file is opened or a socket or other resource is created. File descriptors can be re-used within a computer system, but at any given time a descriptor uniquely identifies an open file or other resource. Thus, when a resource (such as a file) is closed down, the descriptor will be destroyed, and when another resource is subsequently opened the descriptor can be re-used to identify the new resource. Any operations which for example read from, write to or close the resource take the corresponding file descriptor as an input parameter.
When a network related application program interface (API) call is made through the socket library this causes a system call to be made, which creates (or opens) a new file descriptor. For example the accept() system call takes as an input a pre-existing file descriptor which has been configured to await new connection requests, and returns as an output a newly created file descriptor which is bound to the connection state corresponding to a newly made connection. The system call when invoked causes the operating system to execute algorithms which are specific to the file descriptor. Typically there exists within the operating system a descriptor table which contains a list of file descriptors and, for each descriptor, pointers to a set of functions that can be carried out for that descriptor. Typically, the table is indexed by descriptor number and includes pointers to calls, state data, memory mapping capabilities and ownership bits for each descriptor. The operating system selects a suitable available descriptor for a requesting process and temporarily assigns it for use to that process.
Certain management functions of a computing device are conventionally managed entirely by the operating system. These functions typically include basic control of hardware (e.g. networking hardware) attached to the device. When these functions are performed by the operating system the state of the computing device's interface with the hardware is managed by and is directly accessible to the operating system. An alternative architecture is a user-level architecture, as described in the applicant's copending PCT applications WO 2004/079981 and WO 2005/104475. In a user-level architecture at least some of the functions usually performed by the operating system are performed by code running at user level. In a user-level architecture at least some of the state of the function can be stored by the user-level code. This can cause difficulties when an application performs an operation that requires the operating system to interact with or have knowledge of that state. Examples of syscalls are select() and poll(). These can be used by an application for example to determine which descriptors in use by the application have data ready for reading or writing.
Figure 2 shows components implementing a TCP stack for use in accordance with embodiments of the present invention. Layers of the stack include an application 1 and a socket 2 provided by a socket library. The socket library is an application program interface (API) for building software applications. The socket library can carry out various functions, including creating descriptors and storing information. Additionally, there is an operating system 3 comprising a TCP kernel 4, and a proprietary TCP user-level stack 5. It will be understood by the skilled person that although TCP is referred to by way of example, other protocols could also be used in accordance with embodiments of the invention. For example, User Datagram Protocol (UDP), Internet Control Message Protocol (ICMP) or Real-Time Transport Protocol (RTP) could be used. Non-Ethernet protocols could be used. The user-level stack is connected to hardware 6 in figure 2. The hardware could be a network interface card (NIC). It interfaces with a network so that data can be transferred between the system of figure 2 and other data processing systems.
Data received at the NIC or other hardware 6 is transmitted within the system of figure 2 according to the file descriptor with which it is associated. For example, L5 data will be transmitted onto a receive event queue 7 within the stack 5.
When the application 1 wishes to determine whether any data intended for processing by the application has recently been received by the hardware, it initiates a select() or poll() call listing a set of file descriptors. The call is passed to the OS via the socket 2, and a response is returned to the application 1 to indicate, for each descriptor listed in the select() call, whether any new data is available for that descriptor. In general, some of the descriptors will relate to queues run by the L5 stack, whereas some will relate to components in the OS (such as a driver 11 for a storage connection). Both types of data need to be handled by the system of figure 2, and it is desirable to avoid servicing the OS and the user-level TCP stack descriptors separately since this would be inefficient.
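By way of illustration only, a conventional use of select() by an application might look like the following C sketch; the descriptors monitored here (a pipe standing in for a network descriptor, and standard input) are placeholders rather than descriptors of the system of figure 2:
```c
/* Illustrative sketch only: an application using select() to ask which of a
 * set of descriptors has data ready. A pipe stands in for a network
 * descriptor; standard input stands in for another resource. */
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

int main(void)
{
    int pfd[2];
    if (pipe(pfd) < 0) { perror("pipe"); return 1; }
    (void)write(pfd[1], "x", 1);         /* make the read end readable */

    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(STDIN_FILENO, &readfds);      /* e.g. a console descriptor  */
    FD_SET(pfd[0], &readfds);            /* e.g. a network descriptor  */

    struct timeval tv = { 0, 100000 };   /* wait at most 100 ms        */
    int ready = select(pfd[0] + 1, &readfds, NULL, NULL, &tv);
    if (ready > 0) {
        if (FD_ISSET(pfd[0], &readfds))
            printf("descriptor %d has data waiting to be handled\n", pfd[0]);
        if (FD_ISSET(STDIN_FILENO, &readfds))
            printf("stdin has data waiting to be handled\n");
    } else if (ready == 0) {
        printf("no new data on any listed descriptor\n");
    } else {
        perror("select");
    }
    return 0;
}
```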
According to a first aspect of the present invention there is provided a data processing system comprising: a set of data stores for storing items of incoming data; an operating system that supports: a series of descriptors, wherein each item of incoming data is associated with one of the descriptors and each descriptor is associated with one of the stores, a plurality of the descriptors being associated with one of the stores, there being a common descriptor indicative of the descriptors of the plurality; and an instruction in respect of a set of the descriptors to which it returns a response indicating for which of the descriptors the stores contain data that has not yet been handled; and an interface for modifying communications between the operating system and a source of the said instruction so as to cause the response to the said instruction to indicate by way of the common descriptor whether the stores contain data for any of the plurality of the descriptors that are members of the set.
The data processing system could also comprise a processing unit arranged to process the items of incoming data according to a protocol and produce an output representing the processed items. The protocol is suitably TCP/IP. The operating system is preferably arranged to form the said response based on the output of the processing unit. The processing unit could be arranged to process the items of incoming data at user level, or the processing unit could be implemented in hardware on an interface between the data processing system and a data transmission network.
The instruction is preferably a user-level instruction. The interface could be responsive to the user-level instruction for forming a communication to the operating system that represents an instruction of the same type and that identifies, instead of any descriptors of the plurality, the said common descriptor.
The interface could be an instruction library. More specifically, it could be a socket library. The data processing system may further comprise a data structure storing, for each descriptor, an indication of whether or not it is a member of the set.
The data processing system may also comprise an ordering unit arranged to: monitor the indications stored in the data structure; and if the monitoring indicates that descriptors that are members of the set are excessively interleaved with descriptors that are not members of the set, cause the descriptors to be reordered such that the descriptors of the set form a contiguous group. The ordering unit could be arranged to perform the monitoring periodically.
The said instruction could be a select call, or it could be a poll call.
The said instruction preferably originates from an application running in the data processing system.
According to a second aspect of the present invention there is provided a data processing system comprising: a set of data stores for storing items of incoming data, a first one of the data stores being associated with a function of the system; an operating system that supports: a series of descriptors, wherein each item of incoming data is associated with one of the descriptors and each descriptor is associated with one of the stores; and an instruction in respect of one or more of the descriptors to which it returns a response indicating for which of those descriptors the stores contain data that has not yet been handled; and an interface for modifying communications between the operating system and a source of the said instruction so as to cause the response to the said instruction to omit any descriptors other than those associated with any of a subset of the stores that includes the said first one of the data stores.
The function is preferably at user level. The interface could be an instruction library. More specifically, it could be a socket library.
The said instruction could be a select call, or it could be a poll call.
The said instruction preferably originates from an application running in the data processing system.
According to a third aspect of the present invention there is provided an interface for use in a data processing system, the data processing system comprising: a set of data stores for storing items of incoming data; an operating system that supports: a series of descriptors, wherein each item of incoming data is associated with one of the descriptors and each descriptor is associated with one of the stores, a plurality of the descriptors being associated with one of the stores, there being a common descriptor indicative of the descriptors of the plurality; and an instruction in respect of a set of the descriptors to which it returns a response indicating for which of the descriptors the stores contain data that has not yet been handled; the interface being arranged to modify communications between the operating system and a source of the said instruction so as to cause the response to the said instruction to indicate by way of the common descriptor whether the stores contain data for any of the plurality of the descriptors that are members of the set.
According to a fourth aspect of the present invention there is provided a data carrier defining software for operation as an interface in a data processing system, the data processing system comprising: a set of data stores for storing items of incoming data; an operating system that supports: a series of descriptors, wherein each item of incoming data is associated with one of the descriptors and each descriptor is associated with one of the stores, a plurality of the descriptors being associated with one of the stores, there being a common descriptor indicative of the descriptors of the plurality; and an instruction in respect of a set of the descriptors to which it returns a response indicating for which of the descriptors the stores contain data that has not yet been handled; the interface being arranged to modify communications between the operating system and a source of the said instruction so as to cause the response to the said instruction to indicate by way of the common descriptor whether the stores contain data for any of the plurality of the descriptors that are members of the set.
The present invention will now be described by way of example with reference to the accompanying drawings, in which:
Figure 1 shows a prior art computer system; and
Figure 2 shows a computer system in accordance with embodiments of the present invention.
In the system of figure 2, the operating system (OS) 3 incorporates a driver 11 for a piece of hardware such as a disk and a TCP driver or helper 12 for supporting the stack 5. In a particular non-limiting example, the user-level stack will be referred to as a Level 5 (or L5) stack. The TCP driver 12 is mapped onto the TCP stack 5 by means of a file descriptor.
In this arrangement there can be one user-level TCP stack 5 for each application that requires one. This can provide better performance than if a stack is shared between applications. Each stack is located in the same address space as the application that it serves.
When L5 data is received at the NIC 6 it is passed to the event queue 7 in the user- level stack, and is flagged to the TCP driver 12 by means of the mapping between the user-level stack 5 and the driver 12. The driver 12 is thereby informed when new L5 data is available. Preferably read only memory mapping is permitted between the OS and the L5 stack to avoid corruption of data held in the OS by the stack 5.
Typically, a single event queue will be provided for a given transport library (or socket library) and there will usually be one instance of the transport library associated with each application. However it is possible for one library instance to manage a number of event queues. Since one transport library is capable of supporting a large number of sockets (i.e. application level connections), it can therefore occur that a single queue contains data relating to a number of network endpoints, and thus a single queue can contain data relating to a number of file descriptors.
In the system of figure 2, when new data is received at an event queue it is processed, for example using TCP/IP. In this way incoming data can be validated - in other words the stack 5 can ensure that the new data is compliant with the rules of TCP/IP - and the stack 5 can then register its presence in the event queue. Prior to such validation, the stack may not be aware of the content of new data in a queue. In order to handle new data efficiently, a first alternative in accordance with an embodiment of the invention is for the library 2 to intercept a select() call from the application 1, identify all L5 file descriptors identified in the call, and replace them all with a single descriptor denoting L5 descriptors. The single descriptor could suitably be the descriptor used to map the driver 12 onto the stack 5. The select() call, once modified by the library, is passed to the OS. A response is then created by the OS, having polled the TCP driver 12, to indicate whether any L5 descriptors have new data in the event queue 7. This response is based on the results of the TCP/IP validation processing carried out when incoming data is received at the event queue. Data from a given network endpoint can be identified within an event queue by means of the associated file descriptor.
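The following C sketch is illustrative only and is not the library of figure 2: it shows one possible way an interposing library could rewrite the fd_set passed to select(), stripping the user-level descriptors and substituting a single descriptor that maps the user-level stack onto its driver. The helper is_l5_descriptor() and the constant L5_DRIVER_FD are hypothetical placeholders for the library's own descriptor records.
```c
/* Illustrative sketch only (not the library of figure 2): rewrite the fd_set
 * that an application passed to select(), stripping user-level descriptors
 * and substituting the single descriptor that maps the user-level stack onto
 * its driver. is_l5_descriptor() and L5_DRIVER_FD are hypothetical. */
#include <stdbool.h>
#include <stdio.h>
#include <sys/select.h>

#define L5_DRIVER_FD 10                    /* hypothetical driver-mapping descriptor */

static bool is_l5_descriptor(int fd)       /* hypothetical stand-in for a lookup in  */
{                                          /* the library's own descriptor records   */
    return fd >= 16 && fd < 32;
}

/* Rewrite the caller's read set before it is passed on to the OS. */
static void rewrite_fdset(int nfds, fd_set *readfds)
{
    bool any_l5 = false;
    for (int fd = 0; fd < nfds; fd++) {
        if (FD_ISSET(fd, readfds) && is_l5_descriptor(fd)) {
            FD_CLR(fd, readfds);           /* remove each user-level descriptor */
            any_l5 = true;
        }
    }
    if (any_l5)
        FD_SET(L5_DRIVER_FD, readfds);     /* replace them with the single descriptor */
}

int main(void)
{
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(4, &readfds);    /* ordinary OS descriptor, e.g. a storage driver */
    FD_SET(17, &readfds);   /* user-level stack descriptor                   */
    FD_SET(20, &readfds);   /* user-level stack descriptor                   */

    rewrite_fdset(32, &readfds);

    for (int fd = 0; fd < 32; fd++)
        if (FD_ISSET(fd, &readfds))
            printf("descriptor %d will be passed to the OS select()\n", fd);
    return 0;
}
```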
The response, once created by the OS, is intercepted by the library 2 and sent to the application, so that the application can establish whether any L5 data is waiting to be handled. If the response indicates that there is new L5 data, the application will need to process the event queue 7 by checking the L5 file descriptors by means of the L5 helper. In this way, unnecessary accessing of the event queue 7 can be avoided when the response indicates that there is no new L5 data. Alternatively, the library could refrain from modifying the parameters of the select() call itself, but could instead modify the response to the select() call to replace any L5 descriptors mentioned in the response with a reference to a single descriptor denoting L5 descriptors. A second alternative for efficiently handling new data is particularly appropriate when the TCP file descriptors are busy, in other words when a large amount of TCP data is being received at the hardware 6 and passed to the event queue 7. This approach effectively assigns a high priority to the TCP descriptors, in preference to descriptors related to other components such as the storage connection driver 11. The approach involves directly accessing the queue 7 and ignoring new data intended for components of the system other than the TCP stack. This can be achieved by removing at the library any non-L5 descriptors from a select() call sent from the application, so that it appears to the application that no non-L5 data is available. In order to achieve this the library may have access to a data store that stores a record of which of the descriptors are L5 descriptors.
A check is made by the socket library directly with the event queue 7 to identify new L5 data. If no data is found, the library can stay spinning (i.e. re-checking) for a certain period of time on a given select() call. However, because the library is not accessing the OS during this period of time, new data for the disk driver may be waiting in the OS to be handled and the application would be unaware of it. Thus, in one embodiment a timer is run to count the period of time for which the library is spinning on the queue 7, and after a predetermined time the library is triggered to access the OS to acquire any disk data waiting to be handled. The predetermined time could for example be 100μs. Alternatively or in addition, the socket library could be prompted by the receipt of a new select() call from the application to access the OS to collect new disk data. Thus, according to this second alternative, the library may be able to respond to select() calls in one of two modes: by indicating for all descriptors specified in the select() call whether there is data waiting to be handled, or by indicating for only those descriptors that are specified in the select() call and that are also L5 descriptors whether there is data waiting to be handled. One convenient way to employ these modes is to respond to a select call using the first mode if more than a predetermined time has elapsed since the last response using the first mode and otherwise to respond using the second mode. Another way is to respond to every n-th select() call using the first mode, and to all other select() calls with the second mode, where n is a predetermined integer.
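A minimal sketch of the spinning behaviour described above, with hypothetical placeholders for the event-queue check and the fallback to the operating system, might be:
```c
/* Illustrative sketch only: spin on a user-level event queue for up to 100
 * microseconds, then fall back to the operating system so that data waiting
 * for other descriptors is not ignored indefinitely. The two helper routines
 * are hypothetical placeholders. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define SPIN_LIMIT_NS 100000L              /* 100 microseconds */

static bool queue_has_new_data(void)       /* placeholder: check the event queue */
{
    return false;
}

static void ask_os_for_other_data(void)    /* placeholder: make the real OS call */
{
    printf("falling back to the OS for non-user-level descriptors\n");
}

static long elapsed_ns(const struct timespec *start)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (now.tv_sec - start->tv_sec) * 1000000000L +
           (now.tv_nsec - start->tv_nsec);
}

int main(void)
{
    struct timespec start;
    clock_gettime(CLOCK_MONOTONIC, &start);

    /* re-check (spin on) the user-level queue until data arrives or the
     * time limit expires */
    while (!queue_has_new_data()) {
        if (elapsed_ns(&start) > SPIN_LIMIT_NS) {
            ask_os_for_other_data();
            break;
        }
    }
    return 0;
}
```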
Suitably, details of the data being written to the event queue can be fed back to the application so that the application can determine whether the L5 descriptors are busy, and thus whether the second alternative, involving ignoring data intended for other parts of the system, is appropriate. If the L5 descriptors are not busy then the first alternative, involving accessing of the stack 5 only when L5 data is available, is likely to be more efficient.
Typically the file descriptors listed in a select() call from the application are in numerical order. This can improve efficiency since all L5 descriptors can be kept together in a block, away from other descriptors of the application. It is convenient to monitor the assignment of descriptors and reorder them if the L5 descriptors become mixed up with other descriptors of the application. This reordering can be achieved using Dup2() calls. A Dup2(a,b) call has the effect of duplicating the file or other resource represented by descriptor "a" and creating a new resource represented by descriptor "b" and having the same properties. One example of when such a call might be useful is when a descriptor that has a system-wide significance (for example the descriptor that maps on to error output - commonly descriptor #2) is to be redirected on to some other file or device. Accordingly, an element of the system (conveniently the socket library) can monitor the arrangement of the descriptors. For example, it could periodically analyse the arrangement of the descriptors. When the L5 descriptors are disaggregated beyond a predetermined level: for example when they are split by other descriptors into more than a predetermined number of groups, the element initiates a reordering of the descriptors using dup2() operations to reduce the disaggregation of the L5 descriptors, and most preferably bring them into a contiguous group.
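By way of illustration only, the following C sketch shows the basic dup2() operation on which such reordering relies; the descriptor numbers are arbitrary:
```c
/* Illustrative sketch only: the basic dup2() operation on which such
 * reordering relies. The descriptor numbers are arbitrary. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (pipe(fds) < 0) { perror("pipe"); return 1; }

    int old_fd = fds[0];   /* descriptor sitting at an inconvenient number  */
    int new_fd = 40;       /* target number chosen to sit next to its group */

    /* dup2(a, b) duplicates the resource behind descriptor a onto descriptor b;
     * afterwards both numbers refer to the same open resource */
    if (dup2(old_fd, new_fd) < 0) { perror("dup2"); return 1; }
    close(old_fd);         /* release the old number so it can be re-used   */

    printf("descriptor moved from %d to %d\n", old_fd, new_fd);
    close(new_fd);
    close(fds[1]);
    return 0;
}
```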
In another aspect, the invention relates to the storing of environment variables. Environment variables are variables that define the manner in which a process (or other entity) operates within a data processing system; in other words, they define the environment in which a process operates. For example, an environment variable can specify a feature of a process such as dependence on a particular library of instructions for the proper functioning of the process. The presence of such an environment variable, which could suitably specify the location of the library so that the library could automatically load when the process starts up, can ensure that the process can function as intended by means of the instructions in the library, for example by ensuring that instructions issued by the process are intercepted by the library. An environment variable could also specify the priority with which instructions are passed to different libraries.
Environment variables may be dynamic or static, depending on the aspect of a process to which they relate. When a process is being launched on a host processing system, environment variables may be transmitted to that host from another processing system to enable the necessary environment to be created within the host to support the process. Environment variables are commonly stored within a data processing system, typically in memory allocated to the associated process, representing a complete specification of that process. In the context of the invention, memory allocated to an entity is memory to which access may be limited to that entity or to entities operating under the control of that entity. The stored environment variables can be utilised by a process when it is re-starting, or when a process is copying itself, for example in an operation such as fork() or exec() in a Unix-based data processing system. In order for the process to be successfully launched, the full set of environment variables defining the process should be stored against the process in advance of the process starting.
In some circumstances, it can be desirable to delete some or all of the stored environment variables relating to a process. Such deletion can be desirable for security reasons, for example to avoid permitting access to certain functions or memory locations by a new process that is modelled on a currently running process. Consequently, a process may be configured to cause the deletion of some or all of its environment variables when that process undergoes a duplication operation such as an exec() system call. Although this achieves the security benefits referred to above, it can create the problem that the 'child process' does not have all the functionality of the parent process since the parent process's environment has been destroyed. An example of such a problem is in the use of an interface such as a library between a process and an operating system. The applicant's co-pending PCT application PCT/GB2006/000852 describes how efficiency advantages can be gained by implementing an interface between a process and an operating system to assist in the passing of data and instructions within a data processing system. For example, by maintaining a record of file handle ownership locally at the interface, instructions that incorporate a file handle can be passed directly to the element of the data processing system for which they are intended, rather than passing first to the kernel in order that the intended destination can be identified.
If a library operates as an interface between a process and an operating system then the process's environment variables will typically identify the location and configuration of the library. Then, when the process restarts, the process can access the stored environment variables and identify that the library needs to be loaded before the process can perform its functions. However, if a process destroys part or all of its environment before requesting an exec() operation then any information previously held in the environment variables identifying such a library may no longer be available, and the process may be initiated without any link to the library. This could restrict the subsequent performance of the process.
It is therefore desirable to provide a method and system for retaining environment variables in a data processing system.
According to a fifth aspect of the invention there is provided a method for retaining within a data processing system a set of environment variables defining the operation of a first entity supported by the data processing system, wherein the set of environment variables is initially stored in a first store in the data processing system, the method comprising: automatically copying the set of environment variables into a back-up store; intercepting at an interface instructions issued by one or more processes running on the data processing system; in response to intercepting an instruction, restoring by means of the interface the set of environment variables to the first store from the back-up store; and subsequently permitting execution of the instruction.
The step of automatically copying may be performed by means of the interface.
The method could further comprise the step of, prior to restoring the set of environment variables, causing the deletion of the set of environment variables from the first store. It could further comprise the step of, prior to restoring the set of environment variables, determining by means of the interface whether the set of environment variables has been deleted from the first store.
The method may further comprise the step of, after intercepting the said instruction, determining whether the instruction is of a first type, and only if that determination is positive performing the step of restoring the set of environment variables. The first type could suitably be a type indicating that the issue of the said instruction may have been preceded by deletion of the set of environment variables from the first store.
The said instruction could be an exec() call.
The interface is suitably an interface between the first entity and an operating system of the data processing system. The step of permitting execution of the instruction could involve passing the instruction to the operating system.
The interface could be a library.
The first store could be in memory allocated to the first entity, and the back-up store could be in memory allocated to the interface.
The set of environment variables suitably identifies a memory location in which the interface is stored. It could also include details of the configuration of the interface. The first entity may be one of the one or more processes, and the said instruction may be an instruction issued by the first entity. The said instruction could also be an instruction issued by a process other than the first entity.
The first entity could be a library.
The step of automatically copying the set of environment variables into a back-up store is preferably performed in response to initialisation of the first entity.
The set of environment variables is preferably such as to cause the interception of instructions by the interface.
The method could further comprise providing data storage means indicating one or more sets of environment variables; accessing the data storage means to determine the indicated sets of environment variables; and performing the above-recited steps in respect of the indicated sets of environment variables.
The step of accessing the data storage means is preferably performed by the interface.
The one or more sets of environment variables may include environment variables defining one or more entities other than the first entity.
The data storage means could suitably be a configuration file or an application.
The method could further comprise the step of writing data to the data storage means to specify a set of environment variables for which the above-recited steps are to be performed. The step of writing data to the data storage means could suitably be performed by means of an application program interface.
The instruction could be such as to cause re-initialisation of the first entity.
According to a sixth aspect of the invention there is provided an interface in a data processing system, wherein the data processing system has a first store that initially stores a set of environment variables defining the operation of a first entity supported by the data processing system, the interface being arranged to: automatically copy the set of environment variables into a back-up store; intercept instructions issued by one or more processes running on the data processing system; in response to intercepting an instruction, restore the set of environment variables to the first store from the back-up store; and subsequently permit execution of the instruction.
The interface could suitably be a set of instructions or routines such as a library.
According to a seventh aspect of the invention there is provided a data processing system comprising an interface as set out above.
According to an eighth aspect of the invention there is provided a data carrier carrying data defining an interface as set out above.
An exemplary embodiment of this aspect of the invention will now be described with reference to figure 2.
The library 2 illustrated in figure 2 represents an interface between the application (or process) 1 and the OS 3. In the present example of a networked data processing system, the library may suitably be a transport library handling data transfer functionality to support the transfer of data to and from the process over the network. It is configured to intercept communications from the process 1 , including system calls intended for the OS or the user-level stack 5. Details of the library are stored as one or more environment variables in memory allocated to the process such that the library is linked to the process. Thus, if the process were to be restarted, the library would be loaded at an early stage in the launching of the process due to its presence within the process's environment variables. The environment variables of a process may be stored in any suitable way, but in this example are held in a file having a list of (name,value) pairs, where the name is the name of an environment variable and the value is the current value of the variable. The "value" field is initially set to a default value, and is modified if the value changes. The entries in the list can be modified or added to by means of a user or a configuration script.
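By way of illustration only (this sketch is not part of the specification), the following C fragment shows how a process's environment appears at user level as name=value pairs; the TRANSPORT_LIB_PATH entry mentioned in the comment is a hypothetical example of a library-related variable, not a name taken from the specification.

/* Minimal sketch: reading a process's environment as name=value pairs. */
#include <stdio.h>
#include <string.h>

extern char **environ;          /* provided by the C runtime */

int main(void)
{
    /* Each entry has the form "NAME=value"; the transport-library settings
     * described in the text would simply be extra entries, e.g. a
     * hypothetical "TRANSPORT_LIB_PATH=/opt/lib/transport.so". */
    for (char **e = environ; *e != NULL; e++) {
        const char *eq = strchr(*e, '=');
        if (eq != NULL)
            printf("name=%.*s value=%s\n", (int)(eq - *e), *e, eq + 1);
    }
    return 0;
}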
As indicated above, when a process is about to make an exec() system call, it may choose to destroy its own environment first for security reasons. Thus, it may arrange for the deletion of the environment variables stored in the memory allocated to it. It could then proceed to issue an exec() call without those environment variables presenting a security risk.
The library of the present embodiment is configured to make a copy of a process's environment variables automatically when the process starts up. The library holds the copied environment variables in the memory allocated to the library. By means of the copied environment variables, the library can enable the environment variables to be restored to the process's memory after the process has caused them to be deleted for security reasons. In a preferred embodiment, the library achieves this by identifying an intercepted call from the process as indicating that the process may have requested deletion of its state, including environment variables, and in response replacing the deleted environment variables (and optionally any other state) in the process's memory. Then, when the library subsequently permits the intercepted call (such as an exec() call) to be executed, the environment variables detailing the library will be accessible by the process, so that the library will be loaded. Subsequent operations of the process can then be intercepted by the library as intended. In other words, the library surreptitiously copies and restores the environment variables while the process itself believes them to have been destroyed. The library can ensure its continued existence in conjunction with the process by this mechanism. Preferably, the library withholds the intercepted system call from the OS (or user-level stack) until the destroyed environment variables have been restored, and then permits the call to proceed once the environment variables are again available to the process.
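A minimal sketch of this copy-and-restore idea follows, assuming a Unix-like system and an LD_PRELOAD-style interposing library; the choice of LD_PRELOAD as the variable that keeps the library linked in, and the interception of execv() rather than the whole exec() family, are simplifying assumptions for illustration and are not details taken from the specification.

/* Sketch only: a library that saves an environment variable at start-up
 * and restores it before an intercepted exec-type call proceeds. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static char *saved_value;   /* back-up copy held in the library's own memory */

__attribute__((constructor))
static void save_environment(void)
{
    /* Copy the variable that keeps this library linked into the process. */
    const char *v = getenv("LD_PRELOAD");
    saved_value = v ? strdup(v) : NULL;
}

int execv(const char *path, char *const argv[])
{
    /* If the process has wiped its environment, restore the variable
     * before letting the call proceed, so the library is reloaded. */
    if (saved_value && getenv("LD_PRELOAD") == NULL)
        setenv("LD_PRELOAD", saved_value, 1);

    int (*real_execv)(const char *, char *const[]) =
        (int (*)(const char *, char *const[]))dlsym(RTLD_NEXT, "execv");
    return real_execv(path, argv);
}

Compiled as a shared object (for example with gcc -shared -fPIC -ldl) and preloaded into the process, the constructor runs at start-up and the wrapper restores the variable before the real call is made.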
For the purpose of ensuring its continued link with a process, the library need only maintain a copy of the process's environment variables that relate to the library itself; for example, the location of the library in memory, and its configuration. This information may be sufficient to cause the successful loading of the library when the process starts up. However, in general it may be desirable for the library to maintain a copy of other items of information, such as other state owned by that process, or even state owned by other elements of the data processing system, such as other processes or libraries. Embodiments of the invention could conveniently be extended to enable the library 2 (or another library) to hold copies of environment variables of other processes, other libraries, or any entities, so that these environment variables could be restored following deletion. To this end, a configuration file or a separate application could be set up to maintain a list of the state that a given library is intended to copy. The file or application could contain an entry for every element that is active in a data processing system, together with an indication of whether or not a library copy is desired for that element. Alternatively, a simpler configuration could be used, whereby wildcards could indicate, for example, that environment variables for all applications except X, Y and Z are to be copied, or that all environment variables for application X are to be copied, except those relating to elements A and B. The configuration file or application could conveniently be written to by means of an application program interface (API), to enable entities within the data processing system to request the saving and restoration of some or all of their environment variables.
A library running in order to support a process (or a plurality of processes) additionally maintains its state defining its own operation. This state is typically stored in memory allocated to the library. In the case of a transport library, this state may include details of the network connection (for example a TCP connection) which it is configured to support. When an exec() call is executed in respect of a process, it will typically cause the destruction of the state defining libraries acting to support the operation of the process. The library 2 is an example of such a supporting library, and it could thus be deleted on the execution of an exec() in respect of the process 1.
In order to guard against such destruction, the library 2 is configured to maintain a copy of its own state in memory allocated to the OS. In this way, if the library's own memory is deleted the state can subsequently be restored from the OS memory, as described in more detail below.
Continuing the above-described example of the management of file handle ownership at the library 2, records detailing the ownership of file handles in use in the data processing system may be stored as library state in memory allocated to the library. This state can enable the efficient routing of instructions and data within the system between the process, the user-level stack and the OS. If this information is deleted (together with other state of the library) when the process undergoes an exec() operation, then the shortcut routing mechanism that had been created for the process will no longer be available following the exec(). The file handle ownership information would therefore need to be re-acquired in order to activate the previous shortcuts.
To avoid duplication of this kind, and to protect state of the library in general, the library's state is copied into the OS prior to a system call that will result in its destruction from the library's own memory space, such that the state can later be restored as required from the OS memory. In general, the state could be saved in any location where it would not be destroyed on the execution of any system call issued by the process.
Preferably the environment variables of the process are copied into library memory automatically each time the process starts up, so that they are available in the library memory whenever they may be needed for the purpose of restoring the process's own copy. The library copy may be deleted each time the process shuts down, or alternatively it could be retained and then checked against the new values each time the process starts up again, with any changes being copied into the library's memory.
Similarly, the library's own state could be saved into the OS (or elsewhere) each time the library is loaded. Alternatively, it could be saved only when the library intercepts a call from the process which it interprets as indicating that process state (and the corresponding library) may be destroyed, for example an exec() call.
The library could be configured to recognise particular types of system call issued from the process as implying that destruction of the process's environment may accompany or precede the call. In the case of interception of such a call, the library could be configured to check the process's environment variables, stored in the process's own memory, to determine whether they are still intact. If so, the library could continue to operate as normal; if not, the library could trigger restoration of the process's environment variables so that they are present during the subsequent operation of the process. Although it is not anticipated to be an efficient manner of operation, it is conceivable that the library could restore the process's environment variables periodically, or each time an instruction from the process is intercepted by the library, if the environment variables are not already intact.
It may be noted that where the destruction or deletion of environment variables or state is referred to herein, such destruction or deletion could involve the removal of all such information from the memory in which it is held, or it could involve modifying the information, for example by resetting variables to their default values. It should also be noted that while the preferred embodiment of the invention involves storing the copy of the process state into the library's memory, it could suitably be held in any location where it would not be affected by the process's destruction of its own environment.
Although this aspect of the invention has been described by way of example with reference to Unix compatible system calls such as exec(), it will be understood by the skilled person that the invention may be equally applicable to other types of data processing system and to instructions according to any protocol where the instruction may involve the destruction of the environment of one or more entities in a data processing system.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.
SECTION B CONTROL DATA
The present invention relates to the transmission of data in a network, and in particular to handling control data in such a data processing system within a network.
Figure 3 represents equipment capable of implementing a prior art protocol stack, such as a transmission control protocol (TCP) stack in a computer connected to a network. The equipment includes an application 1, a socket 2 and an operating system 3 incorporating a kernel 4. The socket connects the application to remote entities by means of a network protocol, in this example TCP/IP. Data can be transmitted across the network via hardware 6, which interfaces with the network so that data can be transferred between the system of figure 3 and other data processing systems. The hardware could be a network interface card (NIC).
The application 1 can send and receive TCP/IP messages by opening a socket and reading and writing data to and from the socket, and the operating system causes the messages to be transported across the network. For example, the application can invoke a system call (syscall) for transmission of data through the socket and then via the operating system to the network. Syscalls can be thought of as functions taking a series of arguments which cause execution of the CPU to switch to a privileged level and start executing the operating system. A given syscall will be composed of a specific list of arguments, and the combination of arguments will vary depending on the type of syscall.
Syscalls made by applications in a computer system can indicate a file descriptor (sometimes called a handle), which is usually an integer number that identifies an open file within a process. A file descriptor is obtained each time a file is opened or a socket or other resource is created. File descriptors can be re-used within a computer system, but at any given time a descriptor uniquely identifies an open file or other resource. Thus, when a resource (such as a file) is closed down, the descriptor will be destroyed, and when another resource is subsequently opened the descriptor can be re-used to identify the new resource. Any operations which for example read from, write to or close the resource take the corresponding file descriptor as an input parameter.
When a network related application program interface (API) call is made through the socket library this causes a system call to be made, which creates (or opens) a new file descriptor. For example the accept() system call takes as an input a pre-existing file descriptor which has been configured to await new connection requests, and returns as an output a newly created file descriptor which is bound to the connection state corresponding to a newly made connection. The system call when invoked causes the operating system to execute algorithms which are specific to the file descriptor. Typically there exists within the operating system a descriptor table which contains a list of file descriptors and, for each descriptor, pointers to a set of functions that can be carried out for that descriptor. Typically, the table is indexed by descriptor number and includes pointers to calls, state data, memory mapping capabilities and ownership bits for each descriptor. The operating system selects a suitable available descriptor for a requesting process and temporarily assigns it for use to that process.
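As an illustration of the descriptor behaviour described above (not taken from the specification), the following C fragment shows a pre-existing listening descriptor and the new, per-connection descriptor returned by accept(); the port number is arbitrary and error handling is omitted for brevity.

/* Illustrative only: accept() consumes a listening descriptor and returns
 * a newly created descriptor bound to the new connection. */
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);   /* first descriptor */

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5555);                       /* arbitrary example port */

    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listen_fd, 8);

    int conn_fd = accept(listen_fd, NULL, NULL);       /* second, per-connection descriptor */
    printf("listening fd=%d, connection fd=%d\n", listen_fd, conn_fd);

    close(conn_fd);
    close(listen_fd);                                  /* descriptors become free for re-use */
    return 0;
}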
Certain management functions of a computing device are conventionally managed entirely by the operating system. These functions typically include basic control of hardware (e.g. networking hardware) attached to the device. When these functions are performed by the operating system the state of the computing device's interface with the hardware is managed by and is directly accessible to the operating system. An alternative architecture is a user-level architecture, as described in the applicant's copending PCT applications WO 2004/079981 and WO 2005/104475. In a user-level architecture at least some of the functions usually performed by the operating system are performed by code running at user level. In a user-level architecture at least some of the state of the function can be stored by the user-level code. This can cause difficulties when an application performs an operation that requires the operating system to interact with or have knowledge of that state. Control data within a network can include statistics relating to the use of functions and interfaces within the network, and can include control messages, for example transmitted within the network using Internet Control Message Protocol (ICMP). Control data can be requested by means of a system call sent by an application running on a data processing system.
Figure 4 shows components implementing a user-level TCP stack. Layers of the stack include an application 1 and a socket 2 provided by a socket library. The socket library is an application program interface (API) for building software applications. The socket library can carry out various functions, including creating descriptors and storing information. Additionally, there is an operating system 3 comprising a TCP kernel 4, and a proprietary TCP user-level stack 5. A helper or driver 12 for the TCP user-level stack is provided within the operating system. It will be understood by the skilled person that although TCP is referred to by way of example, other protocols could also be used in accordance with embodiments of the invention. For example, User Datagram Protocol (UDP), ICMP or Real-Time Transport Protocol (RTP) could be used. Non-Ethernet protocols could be used. The user-level stack is connected to hardware 6, such as a NIC.
In the prior art system illustrated in figure 4, control data (such as an Address Resolution Protocol message) relating to the TCP stack is transmitted directly from the NIC 6 to the stack. Control data such as route changes can also be transmitted to the stack through the syscall API. Control data relating to other functions of the data processing system is transmitted to the operating system and stored there. When one network entity that has a single address (e.g. a MAC or IP address) to which command data is sent, but which implements two parallel communication stacks, communicates with another entity, problems can arise in handling the command data. First, some of the command data may relate to overall operation of the network entity: for instance flow control data such as an ICMP source quench message. This data must be used by both stacks. The control data could simply be copied to both stacks. However, in some circumstances this might result in each stack sending its own response to such data, which would be difficult for the other entity to interpret. For example, one stack might send a TCP RESET message to the network entity in respect of a connection which is being managed by the other stack. Second, some of the control data might be intended for only one of the stacks. It would be inefficient for that data to be handled by both stacks. There is therefore a need for an improved way of handling control and other like data, especially but not exclusively in data communication entities that implement multiple full or partial protocol stacks for a single protocol.
According to a first aspect of the present invention there is provided a data processing system comprising: a network interface; an operating system capable of transmitting traffic data by means of a network protocol via a first set of communication links over the interface and receiving control data of the protocol via the interface; and a network data transmission function external to the operating system and also capable of transmitting traffic data by means of the network protocol via a second set of communication links over the interface; the data processing system being arranged to route solely to the operating system control messages relating collectively to the operation of the first and second sets of communication links received by the interface, and the operating system being arranged to share the control messages with the network data transmission function so as to permit the operating system and the network data transmission system to react collectively to the control messages.
The operating system may be arranged to share with the network data transmission function control messages designated according to the protocol for setting data transmission and/or reception parameters by permitting the network data transmission function to access such messages via the operating system.
The said network protocol could be Transmission Control Protocol or Internet Control Message Protocol.
The operating system preferably comprises a kernel and a driver, and the data processing system is preferably arranged to route the control messages solely to the kernel and the kernel is arranged to share the control messages with the network data transmission function via the driver.
The operating system may be arranged to share with the network data transmission function control messages designated according to the protocol for setting data transmission and/or reception parameters by: storing the content of such messages in memory allocated to the driver; and permitting the network data transmission function to access the content in that memory.
The operating system is preferably arranged to inform the network data transmission function of changes to the control data.
The operating system may be arranged to copy the control data to the network data transmission function; and the operating system may be arranged to selectively copy the control data to the network data transmission function. The operating system is preferably arranged to copy to the network data transmission function control data relating to the network data transmission function.
The operating system is preferably arranged not to copy to the network data transmission function control messages designated according to the protocol for requesting a response indicative of the responsiveness of the data processing system.
The control messages designated according to the protocol for requesting a response indicative of the responsiveness of the data processing system may include ping messages.
The data processing system could be arranged to route solely to the network data transmission function destination unreachable messages relating to any of the second set of communication links. The operating system may be arranged to copy to the network data transmission function instructions to cease the transmission of data from the data processing system.
According to a second aspect of the present invention there is provided a data processing system comprising: an operating system capable of transmitting traffic data by means of a network protocol via a first set of communication links over a network interface; and a network data transmission function external to the operating system and also capable of transmitting traffic data by means of the network protocol via a second set of communication links over the interface; the data processing system being arranged to route solely to the operating system a system call requesting data relating to the status of the first and second sets of communication links, and the operating system being arranged to, in response to receiving such a system call: request such status data from the network data transmission function in respect of data transmission and/or reception by the network data transmission function; determine such status data in respect of data transmission and/or reception by the operating system; and on receiving such status data from the network data transmission function, form a response to the function call that is indicative collectively of the status data received from the network transmission function and the status data determined in respect of the operating system.
The kernel of the operating system is preferably arranged to perform the said requesting, determining and forming.
According to a third aspect of the present invention there is provided an operating system for a data processing system comprising a network data transmission function external to the operating system and capable of transmitting traffic data by means of a network protocol via a first set of communication links over a network interface; the operating system being capable of transmitting traffic data by means of the network protocol via a second set of communication links over the interface and being arranged to, in response to receiving a system call requesting data relating to the status of the first and second sets of communication links: request such status data from the network data transmission function in respect of data transmission and/or reception by the network data transmission function; determine such status data in respect of data transmission and/or reception by the operating system; and on receiving such status data from the network data transmission function, form a response to the function call that is indicative collectively of the status data received from the network transmission function and the status data determined in respect of the operating system.
According to a fourth aspect of the present invention there is provided a data carrier carrying data defining an operating system as defined above.
The present invention will now be described by way of example with reference to the accompanying drawings, in which:
Figure 3 shows a prior art computer system;
Figure 4 shows a prior art computer system incorporating a user-level stack; and
Figure 5 shows a system in accordance with embodiments of the present invention.
In the system of figure 5, control data 10 relating to the operating system 3 is sent via the NIC 6 to the operating system. Control data 11 relating to the TCP user-level stack 5 is also transmitted to the operating system, where it is stored in conjunction with the data 10. This can ensure that TCP flow control data is handled in a coherent way.
To implement this system a filtering function filters incoming control data. The filtering function has access to filter definition data that indicates where each type of control data should be directed in the communication entity: whether to the operating system's stack, the user-level stack or both. The filter definition data may conveniently be implemented as a look-up or masking table, as will be described in more detail below. When each item of control data is received the unit checks the filter definition data to determine where to direct the item, and then directs the item accordingly. The filter definition data is stored in a data structure 13, which is accessible by the TCP driver 12 by means of memory mapping 14. The memory mapping is preferably read only, to avoid corruption of the data structure by the TCP driver.
In a preferred arrangement there is one user-level TCP stack 5 for each application that requires one. This can provide better performance than if a stack is shared between applications. Each stack is located in the same address space as the application that it serves. None of the user-level stacks is a master stack. It is preferred for the stack of the operating system to be the master stack since it can be expected to be available whenever the operating system is operational, whether or not a user-level stack has been configured.
In a preferred embodiment of the invention, a system call initiated at a control API by an application 1 requesting statistical data representing usage of the TCP stack 5 is intercepted by the kernel 4. The kernel then accesses the data structure 13 to obtain the required data. A response is transmitted from the operating system to the application containing the requested information. In addition, the kernel incorporates control data from the TCP user-level stack and any other user-level stacks in the data processing system into the response in order that the response to the function call can be representative collectively of the status of the operating system's stack and the user-level stacks.
Calls that request statistical data can, for example, be made through a UNIX /proc/net interface or the Windows WMI interface. One example is the command cat /proc/net/snmp on a Linux system, which will result in a read system call to the /proc/net/snmp special file, which causes the kernel network subsystem to return the ICMP Management Information Base (MIB) statistics. Such calls may, for example, be sent from the application level through a control API (application programming interface), for example through /proc/net in a UNIX-like system. Thus, if the system receives a read system function call the kernel of the system is arranged so that it can efficiently react to that request with a response that takes account of the relevant ICMP MIB status data from both the operating system's TCP stack and any user-level TCP stacks. To achieve this, first the system is configured so that the request itself is sent to the operating system only, and not to the user-level stacks. Second, the operating system is configured to process such requests for status data by combining the relevant data from its own stack with the relevant data from the user-level stacks. The operating system requests, in response to the receipt of the function call, the relevant data from the user-level stacks. It may conveniently do this by way of the appropriate driver which may be installed as part of the operating system. On receiving the relevant data from the user-level stacks it combines that data with the corresponding data from its own stack. Thus, if the request is for the total number of packets sent it would add together the numbers from the user-level stack and the operating system stack. If the request were for an average data rate it would determine a combined average taking account of the relative amounts of data sent by the user-level and operating system stacks. It then responds to the system call by returning the appropriate data. The data about the operating system's own stack could be gathered before or after the data about the user-level stacks is received or requested.
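The aggregation step can be pictured with the following hedged sketch in C; the structure layout and function name are illustrative only and do not correspond to any actual kernel interface.

/* Hedged sketch of the aggregation described above: the kernel-side handler
 * adds per-stack counters before answering a statistics request. */
#include <stddef.h>

struct stack_stats {
    unsigned long packets_sent;
    unsigned long packets_received;
    unsigned long retransmits;
};

/* Combine the operating system's own counters with those reported by each
 * user-level stack (obtained via the driver). */
struct stack_stats combine_stats(const struct stack_stats *os,
                                 const struct stack_stats *user,
                                 size_t n_user_stacks)
{
    struct stack_stats total = *os;
    for (size_t i = 0; i < n_user_stacks; i++) {
        total.packets_sent     += user[i].packets_sent;
        total.packets_received += user[i].packets_received;
        total.retransmits      += user[i].retransmits;
    }
    return total;
}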
As indicated above, not all of the control data passing to the data processing system through the hardware 6 need be transmitted to the operating system. There is preferably a filtering arrangement implemented, for example at the hardware level, to determine which control data should be sent to which stack function of the system. For example, it could be determined that certain data is passed to the operating system only (such as the data 10 and 11 in figure 5), while other data is passed to the TCP stack only, and further data is passed to both the operating system and the TCP stack. For example:
1. If a "connect" message originating from the user-level TCP stack 5 is undelivered, a response indicating "destination unreachable" is preferably transmitted only to the user-level stacks in the system, to be stored as TCP control data. Each stack then interprets this incoming response and determines, based on its own list of outstanding connection requests, whether the response was intended for it.
2. A "ping" message received at the hardware 6 is transmitted only to the operating system. The operating system can then respond to that message without the user-level stack having to use processing time on it, which also avoids the possibility of two ping responses being sent.
3. A "source quench" control message is sent both to the operating system and to the TCP stack, since this message should prevent both stacks from transmitting data. In general, rules can be set, and the filter definition data configured, to determine the destination of any type of control message in accordance with its applicability to the operating system or to user-level stacks. The filter definition data will typically filter incoming messages based on their protocol type and, where appropriate, IP addresses and protocol port numbers. Incoming messages are compared against the content of the filter definition data and routed according to the rules defined by that data.
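A possible shape for the filter definition data, matching the three examples above, is sketched below in C; the table contents and the use of raw ICMP type numbers (8 for echo request, 3 for destination unreachable, 4 for source quench) are illustrative assumptions rather than details prescribed by the invention.

/* Illustrative filter-definition table for incoming control messages. */
enum dest { TO_OS = 1, TO_USER_STACK = 2, TO_BOTH = TO_OS | TO_USER_STACK };

struct filter_rule {
    int msg_type;           /* e.g. an ICMP type or other protocol identifier */
    enum dest destination;
};

static const struct filter_rule filter_table[] = {
    { 8, TO_OS },           /* ICMP echo request ("ping")      */
    { 3, TO_USER_STACK },   /* ICMP destination unreachable    */
    { 4, TO_BOTH },         /* ICMP source quench              */
};

/* Look up where an incoming control message should be delivered. */
static enum dest route_control_message(int msg_type)
{
    for (unsigned i = 0; i < sizeof(filter_table) / sizeof(filter_table[0]); i++)
        if (filter_table[i].msg_type == msg_type)
            return filter_table[i].destination;
    return TO_OS;           /* default: let the operating system handle it */
}

Incoming control messages would be matched against such a table and delivered to the operating system's stack, the user-level stack, or both, according to the rule that matches.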
As indicated above, the present approach is applicable to control data of various types. It has been described in relation to TCP and ICMP data but it could be used with data of other types.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.
SECTION C DATA TRANSMISSION
The present invention relates to the transmission of data between a pair of data processing units, and in particular but not exclusively to data transmission in networks.
It is generally acknowledged that the transmission of data between data processing units in a network entails the risk of data loss. If a stream of packets of data is sent from a transmitting station to a receiving station, not all packets of the stream may be received at the receiving station. One possible reason for data loss is congestion at one or more points in the network.
Figure 6 shows schematically the architecture of a networked system. The system comprises two data processors 10a, 10b (such as personal computers or servers), each of which has a network interface 11a, 11b (such as a NIC). The network interfaces are linked together over a data network 12. The data network could be an Ethernet network (e.g. using Gigabit Ethernet) or could employ any other suitable protocol. Each data processor has an operating system 13a, 13b which includes a kernel 14a, 14b and a device driver 15a, 15b for controlling communications between the data processor and its network interface. The operating system supports applications or processes 16a, 16b running on the data processor. A transport library 17a, 17b provides the applications/processes with routines that can be used for controlling communications over the network, and supervises communications between the applications/processes and the operating system. Each data processor has a memory 18a, 18b.
In general, when data is to be transmitted from processor 10a to processor 10b, the data is transmitted from memory 18a, via interface 11a, across the network 12 to interface 11 b and then to the processor 10b. Direct memory access (DMA) can be used for transferring data. In DMA a network interface accesses local memory directly so that separate commands causing retrieval of data from memory need not be issued by a central processing unit. DMA can thus be more efficient than a programmed input-output (PIO) arrangement, which requires such commands to be sent, in a situation where a relatively large block of contiguous data is to be retrieved from the memory.
In a typical network of data processors, data transferred from one processor to another will pass through a number of intermediate switches or hubs. Figure 7 illustrates such a system and shows a transmitting station 1 for sending data to a series of receiving stations 2. A link is shown for sending data between the transmitting station and a switch 3a. The data can then be passed either to a hub 4, to a further switch 3b, or directly to a receiving station 2.
The transmitting station could suitably be a personal computer or a server, or it could be any other network device needing to send or receive data, such as a dedicated network appliance or a multimedia terminal. The transmitting station preferably includes a user-level stack for handling transmission and reception of data. For example, in a network that supports transmission control protocol (TCP) at least one TCP stack could be provided in the transmitting station to enable transmission of data from that station across the network by TCP.
The routes between the transmitting station to the various receiving stations illustrated in figure 7 may have different transmission rates beyond which the respective receiving station will start dropping packets. The difference in transmission rate between links can be caused by many factors. For a given route between the transmitting station and the receiving station, the maximum possible transmission rate may be determined by, for example:
The number of switches or hubs between the transmitting station and the various receiving stations;
The speed of the individual links between the transmitting station and the receiving station. (For instance, in an Ethernet network the weakest link could be as slow as half-duplex at 10 Mbps);
The possibility of many-to-one congestion at a particular link, i.e. multiple nodes transmitting to one node;
The efficiency of the TCP stack of the receiving station in handling packets.
Data processors in typical prior art networks utilise acknowledge messages to inform the transmitting station when the receiving station successfully receives transmitted data. A receiving station will typically send an acknowledge message to a transmitting station each time a packet, or a predetermined number of packets, is received. For example, an acknowledge message can be sent for every two packets received.
Figure 8 illustrates an exemplary arrangement of bytes of data for transmission across a network in packets. The packets 30 each contain a number of bytes 31. For example, packet P1 contains byte numbers N to N+M, and packet P2 contains byte numbers N+M+1 to N+2M+1. In general, the number of bytes which a packet can contain is not fixed, and so acknowledgement messages generally include an indication of the byte number of the last byte in the last received packet, rather than indicating the last received packet number. The transmitting station interprets an acknowledgement as meaning that all bytes prior to the indicated byte number have been successfully received over the network. At this time, the transmitting station can remove all such prior data from buffers on the interface since this data is no longer required for transmission.
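The cumulative byte-number convention can be summarised by the following small sketch (illustrative only; the names are not taken from the specification):

/* Sketch of cumulative byte-number acknowledgement: the ACK names the last
 * byte of the newest in-order packet, regardless of packet size. */
struct segment { unsigned long first_byte; unsigned long length; };

/* Advance the cumulative ACK if the segment continues the in-order stream.
 * Returns the byte number that should be reported in the next ACK. */
unsigned long update_ack(unsigned long acked_byte, struct segment seg)
{
    if (seg.first_byte == acked_byte + 1)          /* runs on consecutively */
        return seg.first_byte + seg.length - 1;    /* last byte now received */
    return acked_byte;                             /* gap: repeat previous ACK */
}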
The transmitting station can modify the rate at which it transmits data over a particular link. For example, it could decide to increase the transmission rate while it is receiving acknowledgements for every packet sent, progressively increasing the rate until a lack of acknowledgements makes it apparent that data is no longer being reliably received at the receiving station, and at that point decrease the rate to a more reliable level.
When an expected packet is not received at a receiving station the receiving station sends multiple acknowledge (ACK) messages identifying the last successfully received packet until the missing packet is received. With reference to figure 8, if P1 is received, an ACK is sent to the transmitting station. If the next received packet is P3 then the receiving station can determine that the byte number of the first byte in P3 does not run consecutively on from the byte number of the last byte in P1. It therefore recognises that data is missing from the stream. In order to alert the transmitting station to the loss of data it re-transmits the ACK indicating P1. In response to the receipt of each further packet (or each predetermined number of packets) a further ACK is sent, again identifying the last byte that was received in order. The transmitting station interprets these duplicate ACKs as indicating the loss of data subsequent to P1. The transmitting station cannot determine from the duplicate ACK how much data has been lost over the network - it simply knows that at least one packet has been lost. It therefore re-transmits all data transmitted since the byte number acknowledged in the last received ACK. This algorithm is known as Fast Retransmit and is fully described in RFC 2001.
In a typical prior art network, when a transmitting station receives duplicate ACKs it subsequently reduces the transmission rate in an effort to avoid further loss of data. There is thus a fairly severe penalty to be paid for the loss of a packet across a network: not only is there an overhead associated with re-transmitting the lost data and any subsequent data that had already been transmitted, but there is also a subsequent reduction in transmission rate.
However, although prior art networks tend to interpret a missing packet as data loss, a packet can in fact be missing from a stream for other reasons. One important reason is that it is out of order among the packets in a stream. Thus, in one example with reference to figure 8, the packets may be received at the receiving station in the order: P1, P3, P2. In a typical prior art system, when the receiving station receives P1 followed by P3 it will transmit a duplicate ACK, causing the re-transmission of P2 and a reduction in the rate of future transmissions. However, these steps are unnecessary since P2 would have been received at the receiving station anyway, after P3, and there was thus no need for it to be re-transmitted. It can therefore be seen that there can be a considerable reduction in efficiency in the event of a packet being received out of order in a stream. In general, a transmitting station will recognise a threshold number of duplicate ACKs (dupACKs) beyond which data that is apparently lost will be re-transmitted. For example, after receiving three identical ACKs it may be programmed to begin retransmitting. This can help to avoid unnecessary re-transmissions in cases of reordering, since when a delayed packet is received at a receiving station, a new ACK can then be sent and the transmitting station will recognise that subsequent data has been received.
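The transmitter-side threshold behaviour described above can be sketched as follows; the structure and function names are illustrative, and the threshold of three simply follows the example in the text.

/* Sketch of the duplicate-ACK threshold at the transmitter: retransmission
 * starts only once the same byte number has been acknowledged several times. */
#include <stdbool.h>

#define DUPACK_THRESHOLD 3

struct tx_state {
    unsigned long last_ack;   /* highest byte number acknowledged so far */
    int dup_count;            /* how many times that ACK has been repeated */
};

/* Returns true if the transmitter should begin retransmitting from
 * byte last_ack + 1. */
bool on_ack_received(struct tx_state *s, unsigned long acked_byte)
{
    if (acked_byte == s->last_ack) {
        s->dup_count++;
        return s->dup_count >= DUPACK_THRESHOLD;
    }
    s->last_ack = acked_byte;   /* new data acknowledged: reset the count */
    s->dup_count = 0;
    return false;
}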
However, if the dupACK threshold is relatively large then this can have undesirable impacts on the efficiency of data transmission. Firstly, it can give rise to a long delay before re-transmission in the event of a true loss of data; secondly, an insufficient number of packets may be transmitted after a lost packet (if the lost packet was near to the end of a stream of data), meaning that the receiving station does not send a sufficient number of dupACKs to cause the transmitting station to re-transmit the lost data.
It is desirable to provide an efficient data transmission system which supports retransmission of unreceived data.
According to a first aspect of the present invention there is provided a method for receiving data by means of a data transmission system from a data source via two routes by a protocol according to which data is transmitted in the form of sequential packets and the transmission by a receiver of a predetermined number of acknowledgement messages in respect of a packet is used to signal a data source to retransmit subsequent packets, the predetermined number being greater than one, the method comprising: receiving data packets from the data source via both of the two routes; and on receiving a data packet from the data source: determining whether any packets in sequence between that packet and the previous packet received over the same route as that packet for which a first acknowledgement message has been transmitted have not been received, and if they have not, transmitting a further acknowledgement message to the data source in respect of that previous packet.
The method could further comprise the step of issuing from a receiver an acknowledgement message to the data source in response to receiving a predetermined number of packets from the data source.
The data source may comprise two transmission ports, each port forming a node within a respective one of the two routes.
The data transmission system is preferably such that the route via which each data packet is transmitted from the data source is determined in dependence on an algorithm. Preferably the data source and the receiver have access to the algorithm.
The method could further comprise the step of: at the data source, modifying information within a packet of data in dependence on the route determined for transmission of that packet. The said information could be a Media Access Control (MAC) address.
The method could further comprise the step of, on receiving a data packet from the data source, determining from the modified information the route via which the packet was transmitted.
The algorithm is preferably such as to balance the transmission of data over the routes such that substantially the same number of bytes is transmitted over each route in a given time period.
The protocol is preferably such as to support the selection of one of multiple routes between a source and a receiver for routing of each packet in dependence on data contained in the respective packet.
The protocol could suitably be TCP over Internet Protocol. The algorithm is preferably such as to determine the said route for each packet in dependence on a TCP sequence number contained in the packet.
The step of determining whether any packets have not been received may comprise determining by means of the algorithm the routes via which (i) the said data packet, (ii) the said previous packet, and (iii) any packets in sequence between the said packet and the said previous packet were transmitted.
The sequence of data packets in the data transmission system may be defined by identifiers of bytes contained in a data stream from which the packets are formed. Each of the said bytes could be the byte constituting the first byte of traffic data within a respective packet.
According to a second aspect of the present invention there is provided a receiver in a data transmission system, the receiver being arranged for receiving data from a data source via two routes by a protocol according to which data is transmitted in the form of sequential packets and the transmission by the receiver of a predetermined number of acknowledgement messages in respect of a packet is used to signal the data source to retransmit subsequent packets, the predetermined number being greater than one, the receiver comprising: a transmission unit for transmitting acknowledgement messages; and a determining unit arranged to, on receiving a data packet, determine whether any packets in sequence between that packet and the previous packet received over the same route as that packet for which a first acknowledgement message has been transmitted by the transmission unit have not been received, and if they have not, cause the transmission unit to transmit a further acknowledgement message to the data source in respect of that previous packet.
The transmission unit is preferably arranged to transmit an acknowledgement message to the data source in response to receiving at the receiver a predetermined number of packets from the data source.
The route via which each data packet is transmitted from the data source is preferably determined in dependence on an algorithm, and the data source and the receiver may have access to the algorithm.
According to a third aspect of the present invention there is provided a method for transmitting a data message from a transmitter to a receiver, the message being conveyed in the form of data units each containing a portion of the message, there being a plurality of routes available for transmitting the data units between the transmitter and the receiver, the method comprising: selecting for each data unit one of the plurality of routes, the selection being performed in dependence on the position in the data message of at least part of the portion of the data message contained in the data unit; and transmitting that data unit over the selected route.
The said at least part of the portion is preferably the byte in the portion whose position in the data message is closest to the start of the message.
The portion of the data message may consist of contiguous bytes of the data message in their order in the data message.
Each data unit could comprise only a single portion of the data message.
The position is preferably determined as the offset of the part of the portion from the start of the message.
According to a fourth aspect of the present invention there is provided a transmitter for transmitting a data message from a transmitter to a receiver, the message being conveyed in the form of data units each containing a portion of the message, there being a plurality of routes available for transmitting the data units between the transmitter and the receiver, the transmitter comprising: a selection unit arranged to select for each data unit one of the plurality of routes, the selection being performed in dependence on the position in the data message of at least part of the portion of the data message contained in the data unit; and a transmission unit arranged to transmit that data unit over the selected route.
The present invention will now be described by way of example with reference to the accompanying drawings, in which:
Figure 6 shows a prior art data transmission system;
Figure 7 shows a prior art network;
Figure 8 shows the arrangements of bytes and packets within a data stream;
Figure 9 shows a section of a data transmission system; and
Figure 10 shows a data stream resulting from a pair of transmitting ports.
Figure 9 shows a part of a data transmission network in which a pair of NICs 11c, 11d are transmitting data via a pair of switches 3c, 3d to endpoints in the network. Each NIC has two transmission ports 40 and 41, termed port 0 (40) and port 1 (41). In this example, each port is arranged to transmit data to one of the two switches. The port 0s can communicate with each other, and the port 1s can communicate with each other, but, as shown, the switches cannot communicate with each other. In a preferred embodiment, an application transmitting data across the network from one of the NICs will cause data to be transmitted alternately from port 0 and port 1 on the NIC. For a given packet, sequence number information in the packet - for example, the sequence number of the first byte in the packet (i.e. its offset in bytes from the start of the message) - will be determined and applied to an algorithm. For example, in a TCP system, the TCP sequence number could be extracted for entry into the algorithm. The algorithm will then cause one of the ports to be selected for transmitting that packet. For example, the algorithm could be that if the number of the first byte of the packet (numbered in sequence from the start of the message) is even then a first route is chosen, and if it is odd then a second route is chosen. Another example would be to divide the sequence number of the first byte of the packet by the maximum segment size (i.e. packet size), and use low order bits of the result to choose the route. Preferably, over time the number of bytes transmitted from each port will be approximately equal.
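The second example algorithm mentioned above can be written as a short function; the maximum segment size of 1460 bytes is an illustrative assumption.

/* Sketch of the route-selection rule: the sequence number of the first
 * payload byte, divided by the maximum segment size, picks one of two ports. */
#define MSS 1460U                 /* example maximum segment size in bytes */

/* Returns 0 or 1: the transmission port for a packet whose first payload
 * byte has the given offset from the start of the message. */
unsigned select_port(unsigned long seq_offset)
{
    return (unsigned)((seq_offset / MSS) & 1UL);  /* low-order bit of the segment index */
}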
In a system incorporating multiple transmission ports on a NIC, reordering of transmitted data can occur. Thus, a first packet of data transmitted from port 0 prior to a second packet of data transmitted from port 1 might be received at a receiving station after the second packet. This may be due for example to a difference in the volume of traffic or a difference in the number of switches along the two routes.
For simplicity of discussion, an exemplary system in which ACKs are sent after every single received packet will be considered, but it will be understood that other arrangements are possible within the scope of the invention.
Figure 10 shows a data stream comprising packets 1 to 9. Bytes within these packets are in consecutive numerical order. Packets 1, 2, 3, 4 and 5 are transmitted sequentially from port 0 of NIC 11c. Packets 6, 7, 8 and 9 are then transmitted sequentially from port 1 of the NIC.
A receiving station receives the data stream in the following order: 1 , 2, 3, 4, 6, 7, 8, 9, 5. ACKs are sent following the receipt of each of packets 1 to 4. When packet 6 is received, the receiver recognises that the first byte of this packet does not follow consecutively from the last byte of packet 4, and that there is thus a gap of at least one packet in the received data. However, due to the fact that the receiving station can deduce knowledge of the port from which at least the first packet of the missing data was transmitted from the NIC, it can determine that the missing data - packet 5 - was sent from port 0 whereas packet 6 was sent from port 1. The receiving station therefore refrains from sending a dupACK on receipt of the out-of-order packet 6 and instead awaits the next packet from port 0. This is received after packet 9, and packet 5 is then successfully received without re-transmission having been triggered.
Preferably, in the situation where packets arrive out of order (as in the example of figure 10), ACKs are sent only in respect of the packets which arrive in their proper sequence. Thus, after packet 4 no new ACKs are sent until packet 5 has been received. At this time, an ACK is sent identifying the last received in-order packet, packet 9. An ACK indicates that all data up to the byte indicated in the ACK has been received. In general, embodiments of the invention interpret a gap on a port and subsequent receipt of data from the same port as data loss, and a gap on a port and subsequent receipt of data from another port as reordering.
In order for the receiving station to be able to accurately determine the port from which a given packet was sent, it is desirable for both the transmitting station and the receiving station to have access to the same algorithm for selecting a port number for transmission of a packet. In this way, when a gap is observed at the receiving station, the receiving station can determine whether the first missing packet would have been transmitted on the same port as subsequent received data. If so, the receiving station interprets the lack of data as a loss over the communication link. If not, the receiving station interprets this initially as re-ordering and awaits the next batch of packets from the link on which the missing packet was sent. If the packet then duly arrives, no further action is required. However, if the missing packet does not then arrive, this is interpreted as loss and re-transmission is triggered by means of dupACKs.
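A hedged sketch of this receiver-side decision follows, re-using the select_port() rule from the earlier sketch so that the transmitter and receiver apply the same algorithm; it is illustrative only and omits the time-out handling that a full implementation would need.

/* Sketch of the receiver-side decision: on seeing a gap, only send a
 * duplicate ACK if the missing data would have travelled over the same
 * port as the packet that exposed the gap. */
#include <stdbool.h>

unsigned select_port(unsigned long seq_offset);   /* shared with the transmitter */

/* expected_offset: first byte still awaited (start of the gap).
 * arrived_offset:  first byte of the packet that has just arrived.
 * Returns true if the receiver should emit a duplicate ACK now. */
bool should_send_dupack(unsigned long expected_offset, unsigned long arrived_offset)
{
    if (arrived_offset == expected_offset)
        return false;                              /* in order: send a normal ACK instead */

    unsigned missing_port = select_port(expected_offset);
    unsigned arrived_port = select_port(arrived_offset);

    /* Same port: the gap cannot be explained by reordering between ports,
     * so treat it as loss.  Different port: wait for that port's next data. */
    return missing_port == arrived_port;
}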
It can be desirable for a receiving station to consider data received from each transmission port separately. It can treat the data being received from each port as an individual stream. This can assist in the determination of whether a missing packet implies loss or simply re-ordering. For example, if data is sent in the order indicated below: Table 1
then the receiver recognises that the port 0 data was all received in order and the port 1 data was all received in order, and thus there was no loss of data.
On the other hand, if the same transmitted data were received as follows: Table 3
then the receiver would recognise that the received port 1 data is missing at least one packet, and would trigger re-transmission accordingly.
In the above example, assuming an arrangement in which each packet received in order is acknowledged, ACKs would be sent after the receipt of packet 1, and then after the receipt of packet 2 (since this is the next in-order packet). The ACK sent following the receipt of packet 2 could identify packet 3, since this had already been successfully received by the time packet 2 was received. In the situation shown in Table 3, no new ACKs would be sent after the receipt of packet 2 since the next in-order packet, packet 4, was not received. Receipt of packet 6 would thus trigger a dupACK identifying packet 3, this being the last in-order packet successfully received.
Acknowledgement messages can specify an amount of data (a "window") which the receiving station wishes the transmitting station to send in its next transmission.
Prior art mechanisms can be utilised in conjunction with embodiments of the present invention to inform the transmitting station when it re-transmits more data than necessary. As noted above, a receiving station generally does not know how many packets of data have been lost when a missing packet is identified. All it knows is that at least one packet has been lost. The transmitting station therefore does not know accurately how much data needs to be re-sent in the event of dupACKs. It can estimate, from the number of packets sent since it transmitted the byte identified in the duplicated ACK, which packet(s) have not been received, and on that basis it can begin re-transmitting all those packets. However, it may be that the receiver has received all but one of the packets sent since the last acknowledged byte, and in that case the re-transmission of the other packets is unnecessary and inefficient. In order to deal with this situation, a prior art Selective Acknowledgement mechanism, improved by the incorporation of reordering information from the receiver, can be implemented to inform the transmitter that too much data is being re-transmitted. The mechanism relies on information obtained at the transmitter from the receiver. In particular, the following information can be obtained by the transmitter:
- the packets (or, more correctly, the byte numbers) received at the receiver;
- the packets received in duplicate at the receiver;
- the packets that have not yet been received at the receiver.
This data can conveniently be sent in messages from the receiver to the transmitter, and can be used to improve the efficiency of communications from the transmitter to the receiver.
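A minimal sketch of such a feedback message is given below; the field names and layout are assumptions for illustration and do not correspond to any standard Selective Acknowledgement wire format.

# Illustrative receiver report carrying the three items listed above; the
# field names are hypothetical and do not reflect an actual wire format.

from dataclasses import dataclass, field
from typing import List, Tuple

ByteRange = Tuple[int, int]  # half-open [start, end) byte range

@dataclass
class ReceiverReport:
    received: List[ByteRange] = field(default_factory=list)    # ranges seen
    duplicated: List[ByteRange] = field(default_factory=list)  # seen more than once
    missing: List[ByteRange] = field(default_factory=list)     # outstanding holes

def bytes_to_retransmit(report: ReceiverReport) -> int:
    """With such a report the transmitter need resend only the holes, rather
    than everything after the last cumulatively acknowledged byte."""
    return sum(end - start for start, end in report.missing)

report = ReceiverReport(received=[(0, 1460), (2920, 8760)],
                        missing=[(1460, 2920)])
print(bytes_to_retransmit(report))  # 1460: a single lost segment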
In one embodiment, the receiving station can be provided with information detailing specific routes between the transmitting station and the receiving station, preferably including details of any switches in the routes and maximum transmission rates of links within the routes, and including details of which transmission ports are using which routes in a given set-up. This information can be used by the receiving station in improving its understanding of instances of re-ordering. If re-ordering is anticipated by the receiving station on the basis of such information, the receiving station can refrain from sending dupACKs while awaiting delayed data, and thus unnecessary re-transmission of data can be avoided further.
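One way in which such route information might be exploited is sketched below; the route table, link speeds and function names are invented for the purpose of illustration and are not taken from the specification.

# Sketch: with knowledge of each port's route (here reduced to a link speed),
# the receiver can estimate how long out-of-order data could plausibly lag
# and hold back dupACKs until that budget has expired.  Values are invented.

import time

ROUTES = {0: {"link_mbps": 10_000}, 1: {"link_mbps": 1_000}}  # hypothetical set-up

def reorder_budget(port: int, bytes_in_flight: int) -> float:
    """Worst-case extra delay (in seconds) expected on this port's route."""
    return (bytes_in_flight * 8) / (ROUTES[port]["link_mbps"] * 1_000_000)

def should_send_dupack(port_of_missing: int, bytes_in_flight: int,
                       gap_first_seen: float) -> bool:
    """Fall back to dupACKs only once the expected re-ordering window passes."""
    waited = time.monotonic() - gap_first_seen
    return waited > reorder_budget(port_of_missing, bytes_in_flight)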
Embodiments of the invention can be applied in respect of switches, or other nodes within a route, as well as in respect of transmitters. Thus a receiver could be provided with knowledge of rules used by the switch to make routing decisions, and the accuracy with which the receiver could predict instances of re-ordering could thereby be improved.
In a preferred embodiment, a user-level stack implements the invention. However, the skilled person will understand that the invention is not limited to implementation in such a stack.
Embodiments of the invention are suitable for application to a single TCP application, thereby improving the retransmission characteristics over a single connection between a transmitter and a receiver. These embodiments are preferable to prior art techniques which can typically only be used for multiple connections and which effectively bond the connections together.
The present invention is not limited to use with TCP. The present invention can advantageously be used with other protocols that call for retransmission of unacknowledged messages, and under which data units may traverse two or more paths between a transmitter and a receiver.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Claims

1. A data processing system comprising: a set of data stores for storing items of incoming data; an operating system that supports: a series of descriptors, wherein each item of incoming data is associated with one of the descriptors and each descriptor is associated with one of the stores, a plurality of the descriptors being associated with one of the stores, there being a common descriptor indicative of the descriptors of the plurality; and an instruction in respect of a set of the descriptors to which it returns a response indicating for which of the descriptors the stores contain data that has not yet been handled; and an interface for modifying communications between the operating system and a source of the said instruction so as to cause the response to the said instruction to indicate by way of the common descriptor whether the stores contain data for any of the plurality of the descriptors that are members of the set.
2. A data processing system as claimed in claim 1 comprising a processing unit arranged to process the items of incoming data according to a protocol and produce an output representing the processed items.
3. A data processing system as claimed in claim 2 wherein the protocol is TCP/IP.
4. A data processing system as claimed in claim 2 or claim 3 wherein the operating system is arranged to form the said response based on the output of the processing unit.
5. A data processing system as claimed in any of claims 2 to 4 wherein the processing unit is arranged to process the items of incoming data at user level.
6. A data processing system as claimed in any of claims 2 to 4 wherein the processing unit is implemented in hardware on an interface between the data processing system and a data transmission network.
7. A data processing system as claimed in any preceding claim, wherein the instruction is a user-level instruction.
8. A data processing system as claimed in claim 7, wherein the interface is responsive to the user-level instruction for forming a communication to the operating system that represents an instruction of the same type and that identifies, instead of any descriptors of the plurality, the said common descriptor.
9. A data processing system as claimed in any preceding claim wherein the interface is an instruction library.
10. A data processing system as claimed in claim 9 wherein the interface is a socket library.
11. A data processing system as claimed in any preceding claim comprising a data structure storing, for each descriptor, an indication of whether or not it is a member of the set.
12. A data processing system as claimed in claim 11 further comprising an ordering unit arranged to: monitor the indications stored in the data structure; and if the monitoring indicates that descriptors that are members of the set are excessively interleaved with descriptors that are not members of the set, cause the descriptors to be reordered such that the descriptors of the set form a contiguous group.
13. A data processing system as claimed in claim 12 wherein the ordering unit is arranged to perform the monitoring periodically.
14. A data processing system as claimed in any preceding claim wherein the said instruction is a select call.
15. A data processing system as claimed in any of claims 1 to 13 wherein the said instruction is a poll call.
16. A data processing system as claimed in any preceding claim wherein the said instruction originates from an application running in the data processing system.
17. A data processing system comprising: a set of data stores for storing items of incoming data, a first one of the data stores being associated with a function of the system; an operating system that supports: a series of descriptors, wherein each item of incoming data is associated with one of the descriptors and each descriptor is associated with one of the stores; and an instruction in respect of one or more of the descriptors to which it returns a response indicating for which of those descriptors the stores contain data that has not yet been handled; and an interface for modifying communications between the operating system and a source of the said instruction so as to cause the response to the said instruction to omit any descriptors other than those associated with any of a subset of the stores that includes the said first one of the data stores.
18. A data processing system as claimed in claim 17 wherein the interface is a socket library.
19. A data processing system as claimed in claim 17 or claim 18 wherein the said function is at user level.
20. A data processing system as claimed in any of claims 17 to 19 wherein the said instruction is a select call.
21. A data processing system as claimed in any of claims 17 to 19 wherein the said instruction is a poll call.
22. A data processing system as claimed in any of claims 17 to 20 wherein the said instruction originates from an application running in the data processing system.
23. An interface for use in a data processing system, the data processing system comprising: a set of data stores for storing items of incoming data; an operating system that supports: a series of descriptors, wherein each item of incoming data is associated with one of the descriptors and each descriptor is associated with one of the stores, a plurality of the descriptors being associated with one of the stores, there being a common descriptor indicative of the descriptors of the plurality; and an instruction in respect of a set of the descriptors to which it returns a response indicating for which of the descriptors the stores contain data that has not yet been handled; the interface being arranged to modify communications between the operating system and a source of the said instruction so as to cause the response to the said instruction to indicate by way of the common descriptor whether the stores contain data for any of the plurality of the descriptors that are members of the set.
24. A data carrier defining software for operation as an interface in a data processing system, the data processing system comprising: a set of data stores for storing items of incoming data; an operating system that supports: a series of descriptors, wherein each item of incoming data is associated with one of the descriptors and each descriptor is associated with one of the stores, a plurality of the descriptors being associated with one of the stores, there being a common descriptor indicative of the descriptors of the plurality; and an instruction in respect of a set of the descriptors to which it returns a response indicating for which of the descriptors the stores contain data that has not yet been handled; the interface being arranged to modify communications between the operating system and a source of the said instruction so as to cause the response to the said instruction to indicate by way of the common descriptor whether the stores contain data for any of the plurality of the descriptors that are members of the set.
25. A method for retaining within a data processing system a set of environment variables defining the operation of a first entity supported by the data processing system, wherein the set of environment variables is initially stored in a first store in the data processing system, the method comprising: automatically copying the set of environment variables into a back-up store; intercepting at an interface instructions issued by one or more processes running on the data processing system; in response to intercepting an instruction, restoring by means of the interface the set of environment variables to the first store from the back-up store; and subsequently permitting execution of the instruction.
26. A method according to claim 25 wherein the step of automatically copying is performed by means of the interface.
27. A method according to claim 25 or claim 26 further comprising the step of, prior to restoring the set of environment variables, causing the deletion of the set of environment variables from the first store.
28. A method according to any of claims 25 to 27 further comprising the step of, prior to restoring the set of environment variables, determining by means of the interface whether the set of environment variables has been deleted from the first store.
29. A method according to any of claims 25 to 28 further comprising the step of, after intercepting the said instruction, determining whether the instruction is of a first type, and only if that determination is positive performing the step of restoring the set of environment variables.
30. A method according to claim 29 wherein the first type is a type indicating that the issue of the said instruction may have been preceded by deletion of the set of environment variables from the first store.
31. A method according to any of claims 25 to 30 wherein the said instruction is an exec() call.
32. A method according to any of claims 25 to 31 wherein the interface is an interface between the first entity and an operating system of the data processing system.
33. A method according to claim 32 wherein the step of permitting execution of the instruction involves passing the instruction to the operating system.
34. A method according to any of claims 25 to 33 wherein the interface is a library.
35. A method according to any of claims 25 to 34 wherein the first store is in memory allocated to the first entity.
36. A method according to any of claims 25 to 35 wherein the back-up store is in memory allocated to the interface.
37. A method according to any of claims 25 to 36 wherein the set of environment variables identifies a memory location in which the interface is stored.
38. A method according to any of claims 25 to 37 wherein the set of environment variables includes details of the configuration of the interface.
39. A method according to any of claims 25 to 38 wherein the first entity is one of the one or more processes.
40. A method according to any of claims 25 to 39 wherein the said instruction is an instruction issued by the first entity.
41. A method according to any of claims 25 to 39 wherein the said instruction is an instruction issued by a process other than the first entity.
42. A method according to any of claims 25 to 38 wherein the first entity is a library.
43. A method according to any of claims 25 to 42 wherein the step of automatically copying the set of environment variables into a back-up store is performed in response to initialisation of the first entity.
44. A method according to any of claims 25 to 43 wherein the set of environment variables is such as to cause the interception of instructions by the interface.
45. A method according to any of claims 25 to 44 further comprising: providing data storage means indicating one or more sets of environment variables; accessing the data storage means to determine the indicated sets of environment variables; and performing the steps of claim 25 in respect of the indicated sets of environment variables.
46. A method according to claim 45 wherein the step of accessing the data storage means is performed by the interface.
47. A method according to claim 45 or claim 46 wherein the one or more sets of environment variables include environment variables defining one or more entities other than the first entity.
48. A method according to any of claims 45 to 47 wherein the data storage means is a configuration file or an application.
49. A method according to any of claims 45 to 48 further comprising the step of writing data to the data storage means to specify a set of environment variables for which the steps of claim 25 are to be performed.
50. A method according to claim 49 wherein the step of writing data to the data storage means is performed by means of an application program interface.
51. A method according to any of claims 25 to 50 wherein the instruction is such as to cause re-initialisation of the first entity.
52. An interface in a data processing system, wherein the data processing system has a first store that initially stores a set of environment variables defining the operation of a first entity supported by the data processing system, the interface being arranged to: automatically copy the set of environment variables into a back-up store; intercept instructions issued by one or more processes running on the data processing system; in response to intercepting an instruction, restore the set of environment variables to the first store from the back-up store; and subsequently permit execution of the instruction.
53. An interface according to claim 52 further arranged to, prior to restoring the set of environment variables, determine whether the set of environment variables has been deleted from the first store.
54. An interface according to claim 52 or claim 53 further arranged to, after intercepting the said instruction, determine whether the instruction is of a first type, and only if that determination is positive perform the step of restoring the set of environment variables.
55. An interface according to claim 54 wherein the first type is a type indicating that the issue of the said instruction may have been preceded by deletion of the set of environment variables from the first store.
56. An interface according to any of claims 52 to 55 wherein the said instruction is an exec() call.
57. An interface according to any of claims 52 to 56 arranged to interface between the first entity and an operating system of the data processing system.
58. An interface according to claim 57 wherein the step of permitting execution of the instruction involves passing the instruction to the operating system.
59. An interface according to any of claims 52 to 58 wherein the interface is a library.
60. An interface according to any of claims 52 to 59 wherein the first store is in memory allocated to the first entity.
61. An interface according to any of claims 52 to 58 wherein the back-up store is in memory allocated to the interface.
62. An interface according to any of claims 52 to 61 wherein the set of environment variables identifies a memory location in which the interface is stored.
63. An interface according to any of claims 52 to 62 wherein the set of environment variables includes details of the configuration of the interface.
64. An interface according to any of claims 52 to 63 wherein the first entity is one of the one or more processes.
65. An interface according to any of claims 52 to 64 wherein the said instruction is an instruction issued by the first entity.
66. An interface according to any of claims 52 to 64 wherein the said instruction is an instruction issued by a process other than the first entity.
67. An interface according to any of claims 52 to 66 wherein the first entity is a library.
68. An interface according to any of claims 52 to 67 arranged to perform the step of automatically copying the set of environment variables into a back-up store in response to initialisation of the first entity.
69. An interface according to any of claims 52 to 68 wherein the set of environment variables is such as to cause interception of instructions by the interface.
70. An interface according to any of claims 52 to 69 wherein the data processing system comprises data storage means indicating one or more sets of environment variables, and the interface is further arranged to: access the data storage means to determine the indicated sets of environment variables; and perform the steps recited in claim 52 in respect of the indicated sets of environment variables.
71. An interface according to claim 70 wherein the one or more sets of environment variables include environment variables defining one or more entities other than the first entity.
72. An interface according to claim 70 or claim 71 wherein the data storage means is a configuration file or an application.
73. A data processing system comprising an interface according to any of claims 52 to 72.
74. A data carrier carrying data defining an interface according to any of claims 52 to 72.
75. A data processing system comprising: a network interface; an operating system capable of transmitting traffic data by means of a network protocol via a first set of communication links over the interface and receiving control data of the protocol via the interface; and a network data transmission function external to the operating system and also capable of transmitting traffic data by means of the network protocol via a second set of communication links over the interface; the data processing system being arranged to route solely to the operating system control messages relating collectively to the operation of the first and second sets of communication links received by the interface, and the operating system being arranged to share the control messages with the network data transmission function so as to permit the operating system and the network data transmission system to react collectively to the control messages.
76. A data processing system as claimed in claim 75, wherein the operating system is arranged to share with the network data transmission function control messages designated according to the protocol for setting data transmission and/or reception parameters by permitting the network data transmission function to access such messages via the operating system.
77. A data processing system as claimed in claim 75 or claim 76 wherein the said network protocol is Transmission Control Protocol.
78. A data processing system as claimed in claim 75 or 76 wherein the said network protocol is Internet Control Message Protocol.
79. A data processing system as claimed in any of claims 75 to 78 wherein the operating system comprises a kernel and a driver, the data processing system is arranged to route the control messages solely to the kernel and the kernel is arranged to share the control messages with the network data transmission function via the driver.
80. A data processing system as claimed in claim 79 as dependent directly or indirectly on claim 76 wherein the operating system is arranged to share with the network data transmission function control messages designated according to the protocol for setting data transmission and/or reception parameters by: storing the content of such messages in memory allocated to the driver; and permitting the network data transmission function to access the content in that memory.
81. A data processing system as claimed in any of claims 75 to 80 wherein the operating system is arranged to inform the network data transmission function of changes to the control data.
82. A data processing system as claimed in any of claims 75 to 81 wherein the operating system is arranged to copy the control data to the network data transmission function.
83. A data processing system as claimed in claim 82 wherein the operating system is arranged to selectively copy the control data to the network data transmission function.
84. A data processing system as claimed in claim 83 wherein the operating system is arranged to copy to the network data transmission function control data relating to the network data transmission function.
85. A data processing system as claimed in claim 83 or claim 84 wherein the operating system is arranged not to copy to the network data transmission function control messages designated according to the protocol for requesting a response indicative of the responsiveness of the data processing system.
86. A data processing system as claimed in claim 85, wherein the control messages designated according to the protocol for requesting a response indicative of the responsiveness of the data processing system include ping messages.
87. A data processing system as claimed in any of claims 75 to 86 further arranged to route solely to the network data transmission function destination unreachable messages relating to any of the second set of communication links.
88. A data processing system as claimed in any of claims 82 to 87 wherein the operating system is arranged to copy to the network data transmission function instructions to cease the transmission of data from the data processing system.
89. A data processing system comprising: an operating system capable of transmitting traffic data by means of a network protocol via a first set of communication links over a network interface; and a network data transmission function external to the operating system and also capable of transmitting traffic data by means of the network protocol via a second set of communication links over the interface; the data processing system being arranged to route solely to the operating system a system call requesting data relating to the status of the first and second sets of communication links, and the operating system being arranged to, in response to receiving such a system call: request such status data from the network data transmission function in respect of data transmission and/or reception by the network data transmission function; determine such status data in respect of data transmission and/or reception by the operating system; and on receiving such status data from the network data transmission function, form a response to the function call that is indicative collectively of the status data received from the network transmission function and the status data determined in respect of the operating system.
90. A data processing system as claimed in claim 89, wherein it is the kernel of the operating system that is arranged to perform the said requesting, determining and forming.
91. An operating system for a data processing system comprising a network data transmission function external to the operating system and capable of transmitting traffic data by means of a network protocol via a first set of communication links over a network interface; the operating system being capable of transmitting traffic data by means of the network protocol via a second set of communication links over the interface and being arranged to, in response to receiving a system call requesting data relating to the status of the first and second sets of communication links: request such status data from the network data transmission function in respect of data transmission and/or reception by the network data transmission function; determine such status data in respect of data transmission and/or reception by the operating system; and on receiving such status data from the network data transmission function, form a response to the function call that is indicative collectively of the status data received from the network transmission function and the status data determined in respect of the operating system.
92. A data carrier carrying data defining an operating system as claimed in claim 91.
93. A method for receiving data by means of a data transmission system from a data source via two routes by a protocol according to which data is transmitted in the form of sequential packets and the transmission by a receiver of a predetermined number of acknowledgement messages in respect of a packet is used to signal a data source to retransmit subsequent packets, the predetermined number being greater than one, the method comprising: receiving data packets from the data source via both of the two routes; and on receiving a data packet from the data source: determining whether any packets in sequence between that packet and the previous packet received over the same route as that packet for which a first acknowledgement message has been transmitted have not been received, and if they have not, transmitting a further acknowledgement message to the data source in respect of that previous packet.
94. A method as claimed in claim 93 further comprising the step of issuing from a receiver an acknowledgement message to the data source in response to receiving a predetermined number of packets from the data source.
95. A method as claimed in claim 93 or claim 94 wherein the data source comprises two transmission ports, each port forming a node within a respective one of the two routes.
96. A method as claimed in any of claims 93 to 95 wherein the data transmission system is such that the route via which each data packet is transmitted from the data source is determined in dependence on an algorithm.
97. A method as claimed in claim 96 wherein the data source and the receiver have access to the algorithm.
98. A method as claimed in claim 96 or claim 97 further comprising the step of: at the data source, modifying information within a packet of data in dependence on the route determined for transmission of that packet.
99. A method as claimed in claim 98 wherein the said information is a Media Access Control (MAC) address.
100. A method as claimed in claim 98 or claim 99 further comprising the step of, on receiving a data packet from the data source, determining from the modified information the route via which the packet was transmitted.
101. A method as claimed in claim 96 wherein the algorithm is such as to balance the transmission of data over the routes such that substantially the same number of bytes is transmitted over each route in a given time period.
102. A method as claimed in any of claims 93 to 101 wherein the protocol is such as to support the selection of one of multiple routes between a source and a receiver for routing of each packet in dependence on data contained in the respective packet.
103. A method as claimed in claim 102 wherein the protocol is TCP over Internet Protocol.
104. A method as claimed in claim 103 as dependent on claim 96 wherein the algorithm is such as to determine the said route for each packet in dependence on a TCP sequence number contained in the packet.
105. A method as claimed in claim 96 wherein the step of determining whether any packets have not been received comprises determining by means of the algorithm the routes via which (i) the said data packet, (ii) the said previous packet, and (iii) any packets in sequence between the said packet and the said previous packet were transmitted.
106. A method as claimed in any of claims 93 to 105 wherein the sequence of data packets in the data transmission system is defined by identifiers of bytes contained in a data stream from which the packets are formed.
107. A method as claimed in claim 106 wherein each of the said bytes is the byte constituting the first byte of traffic data within a respective packet.
108. A receiver in a data transmission system, the receiver being arranged for receiving data from a data source via two routes by a protocol according to which data is transmitted in the form of sequential packets and the transmission by the receiver of a predetermined number of acknowledgement messages in respect of a packet is used to signal the data source to retransmit subsequent packets, the predetermined number being greater than one, the receiver comprising: a transmission unit for transmitting acknowledgement messages; and a determining unit arranged to, on receiving a data packet, determine whether any packets in sequence between that packet and the previous packet received over the same route as that packet for which a first acknowledgement message has been transmitted by the transmission unit have not been received, and if they have not, cause the transmission unit to transmit a further acknowledgement message to the data source in respect of that previous packet.
109. A receiver as claimed in claim 108 wherein the transmission unit is arranged to transmit an acknowledgement message to the data source in response to receiving at the receiver a predetermined number of packets from the data source.
110. A receiver as claimed in claim 108 or claim 109 wherein the route via which each data packet is transmitted from the data source is determined in dependence on an algorithm.
111. A receiver as claimed in claim 110 wherein the data source and the receiver have access to the algorithm.
112. A receiver as claimed in claim 110 or claim 111 wherein the data transmission system is such that information within a packet of data is modified prior to transmission from the data source in dependence on the route determined for transmission of that packet.
113. A receiver as claimed in claim 112 wherein the said information is a Media Access Control (MAC) address.
114. A receiver as claimed in claim 112 or claim 113 further arranged to determine from the modified information in a received packet the route via which the packet was transmitted.
115. A receiver as claimed in claim 110 wherein the algorithm is such as to balance the transmission of data over the routes such that substantially the same number of bytes is transmitted over each route in a given time period.
116. A receiver as claimed in any of claims 108 to 115 wherein the protocol is such as to support the selection of one of multiple routes between a source and a receiver for routing of each packet in dependence on data contained in the respective packet.
117. A receiver as claimed in claim 116 wherein the protocol is TCP over Internet Protocol.
118. A receiver as claimed in claim 117 as dependent on claim 110 wherein the algorithm is such as to determine the said route for each packet in dependence on a TCP sequence number contained in the packet.
119. A receiver as claimed in claim 110 wherein the determining unit is arranged to determine by means of the algorithm the routes via which (i) the said data packet, (ii) the said previous packet, and (iii) any packets in sequence between the said packet and the said previous packet were transmitted.
120. A receiver as claimed in any of claims 108 to 119 wherein the sequence of data packets in the data transmission system is determined in dependence on identifiers of bytes contained in a data stream from which the packets are formed.
121. A receiver as claimed in claim 120 wherein each of the said bytes is the byte constituting the first byte of data traffic within a respective packet.
122. A receiver as claimed in any of claims 108 to 121 wherein each of the routes includes a respective transmission port at the data source.
123. A method for transmitting a data message from a transmitter to a receiver, the message being conveyed in the form of data units each containing a portion of the message, there being a plurality of routes available for transmitting the data units between the transmitter and the receiver, the method comprising: selecting for each data unit one of the plurality of routes, the selection being performed in dependence on the position in the data message of at least part of the portion of the data message contained in the data unit; and transmitting that data unit over the selected route.
124. A method as claimed in claim 123, wherein the said at least part of the portion is the byte in the portion whose position in the data message is closest to the start of the message.
125. A method as claimed in claim 123 or 124, wherein the portion of the data message consists of contiguous bytes of the data message in their order in the data message.
126. A method as claimed in any of claims 123 to 125, wherein each data unit comprises only a single portion of the data message.
127. A method as claimed in any of claims 123 to 126, wherein the position is determined as the offset of the part of the portion from the start of the message.
128. A transmitter for transmitting a data message from a transmitter to a receiver, the message being conveyed in the form of data units each containing a portion of the message, there being a plurality of routes available for transmitting the data units between the transmitter and the receiver, the transmitter comprising: a selection unit arranged to select for each data unit one of the plurality of routes, the selection being performed in dependence on the position in the data message of at least part of the portion of the data message contained in the data unit; and a transmission unit arranged to transmit that data unit over the selected route.
PCT/GB2006/001389 2005-04-13 2006-04-13 Data processing system WO2006109087A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP06726786A EP1875708A2 (en) 2005-04-13 2006-04-13 Data processing system

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
GB0507482A GB0507482D0 (en) 2005-04-13 2005-04-13 Intercepting messages
GB0507482.8 2005-04-13
GB0507739.1 2005-04-15
GB0507739A GB0507739D0 (en) 2005-04-15 2005-04-15 Control data
GB0508288A GB0508288D0 (en) 2005-04-25 2005-04-25 Data transmission
GB0508288.8 2005-04-25

Publications (3)

Publication Number Publication Date
WO2006109087A2 true WO2006109087A2 (en) 2006-10-19
WO2006109087A3 WO2006109087A3 (en) 2007-02-01
WO2006109087B1 WO2006109087B1 (en) 2007-04-26

Family

ID=36617156

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2006/001389 WO2006109087A2 (en) 2005-04-13 2006-04-13 Data processing system

Country Status (2)

Country Link
EP (1) EP1875708A2 (en)
WO (1) WO2006109087A2 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6219825B1 (en) * 1995-01-10 2001-04-17 Hewlett-Packard Company Profile based optimization of shared libraries
WO2002001830A1 (en) * 2000-06-27 2002-01-03 Siemens Aktiengesellschaft Method and array for transmitting secured information
WO2002101549A2 (en) * 2001-06-11 2002-12-19 Sap Aktiengesellschaft Initializing virtual machine that subsequently executes application
US20030026218A1 (en) * 2001-10-25 2003-02-06 Sandeep Singhai System and method for token-based PPP fragment scheduling

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
BALAKRISHNAN, H. et al.: "Improving reliable transport and handoff performance in cellular wireless networks", Wireless Networks, ACM, New York, NY, US, 1 December 1995, pages 469-481, XP000543510, ISSN: 1022-0038, section 3 *
BANIKAZEMI, M. et al.: "Comparison and evaluation of design choices for implementing the Virtual Interface Architecture (VIA)", Lecture Notes in Computer Science, Springer Verlag, New York, NY, US, vol. 1797, 2000, pages 145-161, XP001117458, ISSN: 0302-9743 *
COMPAQ COMPUTER CORP. et al.: "Virtual Interface Architecture Specification", version 1.0, 16 December 1997, XP002216244 *
KUI-FAI LEUNG et al.: "G-snoop: enhancing TCP performance over wireless networks", Proceedings of the Ninth International Symposium on Computers and Communications (ISCC 2004), Alexandria, Egypt, 28 June - 1 July 2004, IEEE, Piscataway, NJ, US, pages 545-550, XP010742059, ISBN: 0-7803-8623-X, sections 2.B, 3.B *
MANSLEY, K.: "Engineering a User-Level TCP for the CLAN Network", Proceedings of ACM SIGCOMM, vol. 2003, 25 August 2003, pages 1-9, XP007901357 *
PRATT, I. et al.: "Arsenic: a user-accessible gigabit Ethernet interface", Proceedings IEEE INFOCOM 2001, 20th Annual Joint Conference of the IEEE Computer and Communications Societies, Anchorage, AK, 22-26 April 2001, vol. 1 of 3, pages 67-76, XP010538686, ISBN: 0-7803-7016-3 *

Also Published As

Publication number Publication date
EP1875708A2 (en) 2008-01-09
WO2006109087A3 (en) 2007-02-01
WO2006109087B1 (en) 2007-04-26

Similar Documents

Publication Publication Date Title
US11210148B2 (en) Reception according to a data transfer protocol of data directed to any of a plurality of destination entities
US9021142B2 (en) Reflecting bandwidth and priority in network attached storage I/O
US9357003B1 (en) Failover and migration for full-offload network interface devices
US8190960B1 (en) Guaranteed inter-process communication
JP4825794B2 (en) User level stack
US8489761B2 (en) Onload network protocol stacks
US6941379B1 (en) Congestion avoidance for threads in servers
KR101365838B1 (en) Improved distributed kernel operating system
US8625431B2 (en) Notifying network applications of receive overflow conditions
US9667729B1 (en) TCP offload send optimization
KR101363167B1 (en) Improved distributed kernel operating system
US20030046330A1 (en) Selective offloading of protocol processing
JP2007527172A (en) Failover and load balancing
JP2013511884A (en) Dynamically connected transport service
US8605578B1 (en) System and method for handling of destination host side congestion
US20070291782A1 (en) Acknowledgement filtering
JP4071098B2 (en) Architecture and runtime environment for network filter drivers
WO2007074343A2 (en) Processing received data
WO2008121690A2 (en) Data and control plane architecture for network application traffic management device
US8150996B2 (en) Method and apparatus for handling flow control for a data transfer
EP1875708A2 (en) Data processing system
Madden Challenges Using the Linux Network Stack for Real-Time Communication
Fox et al. IBM's Shared Memory Communications over RDMA (SMC-R) Protocol

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase in: Ref country code: DE
WWW Wipo information: withdrawn in national office (Country of ref document: DE)
WWE Wipo information: entry into national phase (Ref document number: 2006726786; Country of ref document: EP)
NENP Non-entry into the national phase in: Ref country code: RU
WWW Wipo information: withdrawn in national office (Country of ref document: RU)
WWP Wipo information: published in national office (Ref document number: 2006726786; Country of ref document: EP)