WO2002023400A2 - Network distributed tracking wire transfer protocol - Google Patents

Network distributed tracking wire transfer protocol

Info

Publication number
WO2002023400A2
WO2002023400A2 (PCT/US2001/027383)
Authority
WO
WIPO (PCT)
Prior art keywords
ndtp
string
int
con
udp
Prior art date
Application number
PCT/US2001/027383
Other languages
French (fr)
Other versions
WO2002023400A3 (en)
Inventor
John K. Overton
Stephen W. Bailey
Original Assignee
Overx, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US09/661,222 external-priority patent/US7103640B1/en
Application filed by Overx, Inc. filed Critical Overx, Inc.
Priority to EP01966551A priority Critical patent/EP1358576A2/en
Priority to AU2001287055A priority patent/AU2001287055A1/en
Publication of WO2002023400A2 publication Critical patent/WO2002023400A2/en
Publication of WO2002023400A3 publication Critical patent/WO2002023400A3/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/10015 Access to distributed or replicated servers, e.g. using brokers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/26 Special purpose or proprietary protocols or architectures
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/30 Definitions, standards or architectural aspects of layered protocol stacks
    • H04L69/32 Architecture of open systems interconnection [OSI] 7-layer type protocol stacks, e.g. the interfaces between the data link level and the physical level
    • H04L69/322 Intralayer communication protocols among peer entities or protocol data unit [PDU] definitions
    • H04L69/329 Intralayer communication protocols among peer entities or protocol data unit [PDU] definitions in the application layer [OSI layer 7]

Definitions

  • This invention relates generally to the storage and retrieval of information, and in particular, to a protocol for dynamic and spontaneous global search and retrieval of information across a distributed network regardless of the data format.
  • Data can reside in many different places.
  • a client seeking information sends a request to a server.
  • At a server, typically only data that are statically associated with that server are returned.
  • the search is also usually restricted to previously known systems. The search is thus conducted only where the server knows in advance to look.
  • Known retrieval systems are typically designed to search for data in limited forms.
  • a client requests files based on a subject, like a person's name. In the search results, therefore, only text files of people's names may be retrieved.
  • Another problem in current retrieval systems is that the client may receive text and image files in the search results, but cannot seamlessly access the image files.
  • Yet another problem in current retrieval systems is that video and sound files related to the request may not even be found in the search results. For example, a doctor might be able to retrieve medical records on a specific patient, but cannot view the MRI or X-ray results associated with that record.
  • a distributed data collection is a system where data is stored and retrieved among multiple machines connected by a network.
  • each machine in which some portion of the data in a distributed data collection may reside is called a "data repository machine", or simply a "data repository”.
  • One commonly asked question in a data repository environment is: Where is data associated with a particular entity in a distributed data collection? The data location is a key question when a distributed data collection has highly dynamic data distribution properties.
  • the invention provides a network distributed tracking wire transfer protocol, and a system and method for using the protocol in a networked environment.
  • the network distributed tracking wire transfer protocol includes two basic components: identification strings for specifying the identity of an entity in the distributed data collection, and location strings for specifying network locations of data associated with an entity.
  • the protocol accommodates variable length identifier and location strings. Relationships between identification strings and location strings can be dynamically and spontaneously manipulated thus allowing the corresponding data relationships also to change dynamically, spontaneously, and efficiently.
  • the network distributed tracking wire transfer protocol is application independent, organizationally independent, and geographically independent.
  • a system of components using the network distributed tracking protocol for storing and identifying data within a distributed data collection.
  • the components include (1) a data repository for storing data in the distributed data collection, (2) a client entity for manipulating data in the distributed data collection, and (3) a first server entity operative to locate data in the distributed data collection, which may be coupled to a client entity and/or data repository.
  • a client entity transmits an identifier string to the first server entity along with the client request, and the first server entity provides a set of location strings to the client entity in response thereto.
  • the first server entity maps the identifier string received from the client entity to a set of location strings.
  • the network may also include any number of additional server entities coupled to the first server entity.
  • a location string specifies the location of data associated with an entity in the distributed data collection and the identification string specifies the identification of an entity in a distributed data collection.
  • a first data repository entity stores data by associating an identification string with each particular stored unit of data, and by mapping the identification string to a location string associated with the first data repository.
  • the identification string and location string for the particular unit of data are registered at a first server entity coupled to the first data repository entity.
  • a request is transmitted from a client entity to the first server entity to retrieve at least one location string associated with the stored unit of data in the distributed data collection.
  • the request includes the identification string associated with the particular stored unit of data.
  • the request is received at the first server entity, which responds to the client entity by providing at least one location string associated with the particular stored unit of data to the client entity.
  • the request may also be transmitted to a second server entity prior to responding to the client entity, where the second server entity is coupled to the first server entity and includes the mapping of the identification string and location strings for the particular units of data.
  • the second server entity responds to the client entity by providing the at least one location string associated with the particular stored unit of data to the client entity.
  • the network distributed tracking protocol of the invention is a networking protocol that efficiently manages mappings from one or more identifier strings to zero or more location strings.
  • the protocol permits client entities to add and remove identifier/location associations, and request the current set of locations for an identifier or identifiers from server entities that comply with the protocol.
  • the protocol is designed for use in the larger context of a distributed data collection. As such, it supports an architecture in which information about where data associated with particular application entities can be managed and obtained independently of the data itself.
  • the protocol and its associated servers thus maintain a mapping between entity identifiers and data locations.
  • the identifier/location mapping maintained by the servers is very dynamic. Regardless of the expected system context in a distributed data collection, the protocol can be used for any application in which one-to- one or one-to-many associations among strings are to be maintained and accessed on a network.
  • the protocol supports identifier and location strings of up to 2^32-4 bytes in length, but in most applications it is expected that the strings are typically short. String length is not fixed by the protocol, except by this upper bound. Accordingly, string formats are controlled at a local level and not dictated by the protocol.
  • FIG. 1 is an example of multiple outstanding protocol requests.
  • FIG. 2 is a layout of one presently preferred string format.
  • FIG. 3 is a layout of one presently preferred NDTP_GET message.
  • FIG. 4 is a layout of one presently preferred NDTP_GET_RSP message.
  • FIG. 5 is a layout of one presently preferred NDTP_PUT message.
  • FIG. 6 is a layout of one presently preferred NDTP_PUT_RSP message.
  • FIG. 7 is a layout of one presently preferred NDTP_DEL message.
  • FIG. 8 is a layout of one presently preferred NDTP_DEL_RSP message.
  • FIG. 9 is a layout of one presently preferred NDTP_RDR_RSP message, where FIG. 9(a) shows a server table layout, and FIG. 9(b) shows a redirection function layout.
  • FIG. 10 is a system block diagram showing a multi-server implementation environment of the network distributed tracking wire transfer protocol of the invention.
  • FIG. 11 is a system diagram showing an NDTP server constellation configuration.
  • FIG. 12 is a system diagram showing a client-centric constellation approach.
  • FIG. 13 is a system diagram showing a server-centric constellation approach.
  • NDTP network distributed tracking protocol
  • An "identifier string” or an “identifier” is a unique string with which zero or more location strings are associated in an NDTP server.
  • a "data location” or a “location” is a string that is a member of a set of strings associated with an identifier string in an NDTP server.
  • An “NDTP client” or a “client” is a network-attached component that initiates update or lookup of identifier/location mappings from an NDTP server with NDTP request messages.
  • An “NDTP server” or a “server” is a network-attached component that maintains a set of identifier/location mappings that are modified or returned in response to NDTP request messages from NDTP clients.
  • Network Byte Order is the ordering of bytes that compose an integer of larger than a single byte as defined in the Internet Protocol (IP) suite.
  • IP Internet Protocol
  • Network Byte Order specifies a big-endian, or most significant byte first, representation of multibyte integers. In this specification a byte is preferably composed of eight bits.
  • NDTP Network Distributed Tracking Protocol
  • NDTP is a transactional protocol, which means that each operation within NDTP consists of a request message from an NDTP client to an NDTP server, followed by an appropriate response message from the NDTP server to the NDTP client.
  • NDTP defines an entity key (or "identifier") as a unique string used to refer to an entity about which a distributed query is performed.
  • the NDTP server treats the entity key as an unstructured stream of octets, which is assumed to be unique to a particular entity.
  • the precise structure of the NDTP entity key and the mechanism for ensuring its uniqueness are a function of the application in which the NDTP server is used.
  • the NDTP entity key might be a unique customer identifier, for example, a Social Security Number, in either printable or binary integer form, as is appropriate to the application.
  • NDTP also defines a data location specifier as a string used to specify a data repository in which data associated with a particular entity may be found.
  • NDTP server treats NDTP data location specifiers as unstructured streams of octets.
  • the structure of an NDTP data location specifier is a function of the application in which the NDTP server is used.
  • an NDTP data location specifier might be an Internet machine name and a TCP/IP port number for a relational database server, or an HTTP Universal Resource Locator (URL), or some concatenation of multiple components.
  • the NDTP server efficiently maintains and dispenses one to zero or one to many relationships between entity keys and data location specifiers.
  • an entity key may be associated with any number of data location specifiers.
  • the NDTP server is updated to indicate an association between the entity key and the data repository's data location specifier.
  • the NDTP server supplies the set of data repositories in which data may be found for that entity key.
  • the protocol of the invention is designed to provide maximum transaction throughput from the NDTP server, associated clients, and associated data repositories when appropriate.
  • the design goal is realized through two design principles:
  • NDTP messages should preferably be as short as possible to maximize the rate of NDTP transactions for a given network communication bandwidth.
  • NDTP messages should preferably be structured for efficient processing on existing machine architectures.
  • Design Optimizations.
  • Numerical fields of an NDTP message are preferably represented in binary integer format rather than ASCII or other printable format to minimize host processing overhead and network utilization. Numerical fields of NDTP messages are also aligned on 32-bit boundaries to minimize the cost of manipulation on current machine architectures. Manipulating unaligned multibyte integer quantities on modern machine architectures usually incurs an extra cost ranging from mild to severe compared to manipulating the same quantities in aligned form. In keeping with other network protocol standards including TCP/IP, multioctet integer quantities in NDTP are preferably encoded using the big endian integer interpretation convention, as set forth above.
  • NDTP is designed to support asynchronous operation, where many requests may be sent to an NDTP server before a response from any of them is received.
  • Each NDTP message is preceded by a fixed-size, 12-octet header, using the preferred data structure:

        typedef struct ndtp_hdr {
            uint8_t  op;       /* opcode */
            uint8_t  pad[3];
            uint32_t id;       /* transaction identifier */
            uint32_t size;     /* total request size following the header */
        } ndtp_hdr_t;
  • NDTP_DEL delete request
  • NDTP_DEL_RSP delete response
  • NDTP_RDR_RSP provide redirection
  • id: Each "_RSP" message echoes the id field of the associated request.
  • size: Size in octets of the remainder of the NDTP message. The size field should always be a multiple of 4 octets.
  • Variably sized portions of NDTP messages are preferably defined with a size field rather than some other delimiter mechanism to facilitate efficient reading of NDTP messages. Requests may be made to the network layer to read the entire variably sized portion of an NDTP message, rather than reading small chunks while scanning for a delimiter. Furthermore, client and server resource management can be more efficient since the size of NDTP messages is known before reading.
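  • As an illustration of this size-driven reading, a minimal sketch appears below. It is not part of the patent text: it assumes a POSIX stream socket, repeats the header layout given above for self-containment, and uses an illustrative read_fully() helper that loops until the requested byte count has arrived.

        /* Hypothetical sketch: read one complete NDTP message from a TCP
         * connection by first reading the fixed 12-octet header, then reading
         * the 'size' octets that follow it in one request to the network layer. */
        #include <stdint.h>
        #include <stdlib.h>
        #include <sys/types.h>
        #include <unistd.h>
        #include <arpa/inet.h>   /* ntohl */

        typedef struct ndtp_hdr {     /* header layout as given above */
            uint8_t  op;              /* opcode */
            uint8_t  pad[3];
            uint32_t id;              /* transaction identifier */
            uint32_t size;            /* total request size following the header */
        } ndtp_hdr_t;

        /* Loop until exactly 'len' bytes have been read or an error occurs. */
        static int read_fully(int fd, void *buf, size_t len)
        {
            uint8_t *p = buf;
            while (len > 0) {
                ssize_t n = read(fd, p, len);
                if (n <= 0)
                    return -1;
                p += n;
                len -= (size_t)n;
            }
            return 0;
        }

        static int read_ndtp_message(int fd, ndtp_hdr_t *hdr, uint8_t **body)
        {
            if (read_fully(fd, hdr, sizeof(*hdr)) < 0)
                return -1;
            hdr->id = ntohl(hdr->id);          /* multibyte fields are big-endian */
            hdr->size = ntohl(hdr->size);
            *body = malloc(hdr->size);
            if (*body == NULL)
                return -1;
            return read_fully(fd, *body, hdr->size);  /* one read of the whole body */
        }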
  • the variably sized portions of NDTP messages are composed of zero or more NDTP strings:

        typedef struct ndtp_str {
            uint32_t len;
            uint8_t  data[];
        } ndtp_str_t;
  • arrays denoted in this document with "[ ]” imply a dimension which is only known dynamically and this indefinite array size specifier is not allowed in C struct definitions. Note also the following:
  • len: the number of significant octets of data following the len field in the data area.
  • data: len octets of data, followed by up to 3 octets of padding, to ensure that the total length of the NDTP string structure is a multiple of 4 octets.
  • Because the variably sized portions of NDTP messages are composed of zero or more NDTP strings, and NDTP strings preferably occupy an even multiple of 4 octets, the "size" field of NDTP message headers will preferably be a multiple of 4 octets.
  • NDTP preferably has a simple, stateless request/response structure.
  • Each request message 10 sent by a client 12 has a corresponding response message 14 returned by the server 16.
  • NDTP is asynchronous in nature. Many requests 10 from a single client 12 may be outstanding simultaneously, and responses 14 may or may not be returned from the server 16 in the order in which the requests 10 were issued. Each NDTP request 10 contains an NDTP request identifier 18 that is returned in the NDTP response 14 for the associated request 10. An NDTP client 12 uses a unique NDTP request identifier 18 for each NDTP request 10 that is outstanding at the same time to an NDTP server 16 if it wishes to correlate responses with requests. There are four operations preferably supported by the protocol:
  • the response to adding a location association to an identifier 18 is a simple acknowledgement. If the location is already associated with the identifier 18, adding the association has no effect, but the request 10 is still acknowledged appropriately. In other words, the NDTP add operation is idempotent.
  • the response to deleting a location association from an identifier 18 is a simple acknowledgement. If the location is not currently associated with the identifier 18, deleting the association has no effect, but the request 10 is still acknowledged appropriately. In other words, the NDTP delete operation is idempotent.
  • the response 14 to getting all locations associated with an identifier 18 is a list of the locations presently associated with an identifier 18. If no locations are currently associated with an identifier 18, a list of length zero is returned.
  • NDTP messages 10, 14 preferably have a regular structure that consists of a message operation code, followed by a request identifier 18, followed by a string area length (in bytes) 20, followed by zero or more strings 22, as shown in FIG. 2.
  • NDTP message formats are preferably independent of the network transport layer used to carry them.
  • NDTP preferably defines mappings of these messages 10, 14 onto TCP and UDP transport layers (described in detail below), but other mappings could also be defined and it is likely that these NDTP message formats would not require change. For example, the notation
  • ROUND4(x) means x, rounded up to the next multiple of 4.
  • Multibyte integers in NDTP messages are represented in network byte order; using the big-endian convention. In other words, the most significant byte of a multibyte integer is sent first, followed by the remainder of the bytes, in decreasing significance order.
  • Strings in NDTP are represented as counted strings, with a 32-bit length field 20, followed by the string data 22, followed by up to 3 bytes of padding 24 to make the total length of the string representation a multiple of 4 bytes.
  • the NDTP_GET message has a message operation code 30 of 0, and a single NDTP string 32 which is the identifier string for which to get associated location strings. This layout is shown diagrammatically in FIG. 3.
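  • A short sketch of serializing an NDTP_GET request according to this layout appears below. It is not taken from the patent; the builder function name, the ROUND4 macro, and the byte offsets are illustrative applications of the header and counted-string formats described above.

        /* Hypothetical sketch: serialize an NDTP_GET request (opcode 0) for a
         * given identifier string.  The 12-octet header is followed by one
         * counted string whose data is padded to a 4-octet multiple. */
        #include <stdint.h>
        #include <stdlib.h>
        #include <string.h>
        #include <arpa/inet.h>   /* htonl */

        #define NDTP_GET_OP 0
        #define ROUND4(x)   (((x) + 3u) & ~3u)

        static uint8_t *build_ndtp_get(uint32_t req_id, const uint8_t *ident,
                                       uint32_t ident_len, size_t *msg_len)
        {
            uint32_t str_area = 4 + ROUND4(ident_len);  /* len field + padded data */
            size_t total = 12 + str_area;               /* header + string area */
            uint8_t *buf = calloc(1, total);            /* calloc zeroes the padding */
            if (buf == NULL)
                return NULL;

            buf[0] = NDTP_GET_OP;                       /* op; bytes 1-3 are pad */
            uint32_t id_n = htonl(req_id);
            uint32_t size_n = htonl(str_area);          /* size of remainder of message */
            memcpy(buf + 4, &id_n, 4);
            memcpy(buf + 8, &size_n, 4);

            uint32_t len_n = htonl(ident_len);          /* counted string: length ... */
            memcpy(buf + 12, &len_n, 4);
            memcpy(buf + 16, ident, ident_len);         /* ... then data, then padding */

            *msg_len = total;
            return buf;
        }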
  • the NDTP_GET_RSP message has a message operation code 40 of 1, and zero or more strings 42 that are the locations currently associated with the requested identifier. This layout is shown diagrammatically in FIG. 4.
  • NDTP PUT Format
  • the NDTP_PUT message has a message operation code 50 of 2, and two NDTP strings 52, 54.
  • the first string 52 is the identifier for which to add a location association, and the second string 54 is the location to add. This layout is shown diagrammatically in FIG. 5.
  • the NDTP_PUT_RSP message has a message operation code 60 of 3, and zero NDTP strings. This layout is shown diagrammatically in FIG. 6.
  • the NDTP_DEL message has a message operation code 70 of 4, and two NDTP strings 72, 74.
  • the first string 72 is the identifier from which to delete a location association, and the second string 74 is the location to delete. This layout is shown diagrammatically in FIG. 7.
  • NDTP DEL RSP Format
  • The NDTP_DEL_RSP message has a message operation code 80 of 5, and zero NDTP strings. This layout is shown diagrammatically in FIG. 8.
  • the NDTP_RDR_RSP message has a message operation code 90 of 6, and one or more NDTP strings 92, 94. Two layouts apply, which are shown diagrammatically in FIGS. 9(a) and 9(b).
  • the NDTP_GET message contains a single NDTP string which is the entity key for which associated data locations are requested:

        typedef struct ndtp_get {
            ndtp_hdr_t hdr;
            ndtp_str_t key;
        } ndtp_get_t;
  • the NDTP_GET_RSP message contains zero or more NDTP strings which are the data location specifiers associated with the NDTP entity key:

        typedef struct ndtp_get_rsp {
            ndtp_hdr_t hdr;
            uint32_t   rsps;
            ndtp_str_t values[];
        } ndtp_get_rsp_t;
  • the NDTP_PUT message contains two NDTP strings which are (1) the NDTP entity key and (2) the NDTP data location specifier which is to be associated with the NDTP entity key.
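  • The extracted text omits a structure definition at this point; by analogy with the ndtp_del structure shown below, and using the same illustrative variable-length member convention noted earlier, a plausible (assumed, not quoted) layout would be:

        typedef struct ndtp_put {
            ndtp_hdr_t hdr;
            ndtp_str_t key;      /* NDTP entity key */
            ndtp_str_t data;     /* data location specifier to associate (assumed) */
        } ndtp_put_t;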
  • the NDTP_PUT_RSP message has no NDTP strings, and simply indicates that the requested entity key/data location specifier association was added:

        typedef struct ndtp_put_rsp {
            ndtp_hdr_t hdr;
        } ndtp_put_rsp_t;
  • the requested entity key/data location specifier association is added in addition to any other associations already maintained by the NDTP server. If the requested entity key/data location specifier association is already in effect, the NDTP_PUT still succeeds and results in an NDTP_PUT_RSP message.
  • the NDTP_DEL message contains two NDTP strings which are (1) the NDTP entity key and (2) the NDTP data location specifier which is to be unassociated with the NDTP entity key:

        typedef struct ndtp_del {
            ndtp_hdr_t hdr;
            ndtp_str_t key;
            ndtp_str_t data;
        } ndtp_del_t;
  • the NDTP_DEL_RSP message has no NDTP strings, and simply indicates that the requested entity key/data location specifier association was deleted:

        typedef struct ndtp_del_rsp {
            ndtp_hdr_t hdr;
        } ndtp_del_rsp_t;
  • If the requested entity key/data location specifier association is not present, the NDTP_DEL still succeeds and results in an NDTP_DEL_RSP message.
  • NDTP supports a distributed server implementation for which two principal redirection methods apply: (1) embedded redirection links, and (2) passed functions.
  • the passed functions method in turn has two variants: (a) a well-known function, and (b) a communicated function. (These methods and variants are described in further detail below.)
  • the NDTP server network front end preferably maximizes NDTP transaction throughput, including concurrent NDTP requests from a single client as well as NDTP requests from multiple concurrent clients.
  • NDTP defines a transaction oriented protocol, which can be carried over any of a variety of lower level network transport protocols.
  • TCP/IP: TCP/IP provides a ubiquitously implemented transport which works effectively on both local area and wide area networks.
  • An NDTP client using TCP/IP preferably connects with the NDTP server at an established TCP port number, and then simply writes NDTP request messages through the TCP/IP connection to the server, which then writes NDTP response messages back to the client through the same TCP/IP connection in the reverse direction.
  • TCP/IP implementations perform buffering and aggregation of many small messages into larger datagrams, which are carried more efficiently through the network infrastructure.
  • Running NDTP on top of TCP/IP will take advantage of this behavior when the NDTP client is performing many NDTP requests. For example, a data repository which is undergoing rapid addition of data records associated with various entities will perform many rapid NDTP_PUT operations to the NDTP server that can all be carried on the same NDTP TCP/IP connection.
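  • A sketch of such pipelining is given below. It is not from the patent; build_ndtp_put(), write_fully(), and read_ndtp_message() are assumed helpers in the spirit of the earlier sketches, and responses are matched to requests through the echoed id field.

        /* Hypothetical sketch: issue many NDTP_PUT requests on one TCP
         * connection before reading any responses, then drain the responses. */
        #include <stdint.h>
        #include <stdlib.h>

        typedef struct ndtp_hdr {     /* header layout as given above */
            uint8_t  op;
            uint8_t  pad[3];
            uint32_t id;
            uint32_t size;
        } ndtp_hdr_t;

        /* Assumed helpers (not defined in the patent text). */
        extern uint8_t *build_ndtp_put(uint32_t id, const char *key,
                                       const char *loc, size_t *len);
        extern int write_fully(int fd, const void *buf, size_t len);
        extern int read_ndtp_message(int fd, ndtp_hdr_t *hdr, uint8_t **body);

        static int bulk_put(int fd, const char **keys, const char **locs, uint32_t n)
        {
            for (uint32_t i = 0; i < n; i++) {      /* send all requests first */
                size_t len;
                uint8_t *msg = build_ndtp_put(i, keys[i], locs[i], &len);
                if (msg == NULL)
                    return -1;
                int rc = write_fully(fd, msg, len);
                free(msg);
                if (rc < 0)
                    return -1;
            }
            for (uint32_t i = 0; i < n; i++) {      /* then collect the responses */
                ndtp_hdr_t hdr;
                uint8_t *body = NULL;
                if (read_ndtp_message(fd, &hdr, &body) < 0)
                    return -1;
                /* hdr.id identifies which outstanding NDTP_PUT was acknowledged */
                free(body);
            }
            return 0;
        }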
  • UDP/IP: If an NDTP client only performs occasional, isolated NDTP operations, or there are a vast number of clients communicating with an NDTP server, TCP/IP will not offer the best possible performance because many traversals of the network are required to establish a TCP/IP connection, and yet more network traversals are required to transfer actual NDTP messages themselves. For such isolated NDTP transactions, depending upon the application and network infrastructure in use, it is beneficial to have the NDTP server employ UDP/IP, which is a widely available connectionless datagram protocol.
  • UDP/IP does not support reliable data transfer, or any congestion control mechanism.
  • NDTP clients using UDP/IP must implement reliability and congestion control by maintaining transaction timeouts and performing exponential retry backoff, precisely analogous to the congestion control mechanisms implemented by Ethernet and other well-known UDP protocols.
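  • One possible shape of that client-side mechanism is sketched below (not from the patent; the initial 2-second timeout, the retry count, and the helper name are illustrative choices):

        /* Hypothetical sketch: send an NDTP request in a UDP datagram and
         * retransmit with an exponentially growing timeout until a response
         * arrives or the retries are exhausted. */
        #include <stdint.h>
        #include <sys/types.h>
        #include <sys/socket.h>
        #include <sys/select.h>
        #include <sys/time.h>

        static ssize_t ndtp_udp_request(int sock, const void *req, size_t req_len,
                                        void *rsp, size_t rsp_cap)
        {
            unsigned timeout_s = 2;                      /* initial timeout (illustrative) */
            for (int attempt = 0; attempt < 5; attempt++) {
                if (send(sock, req, req_len, 0) < 0)     /* socket already connect()ed */
                    return -1;

                fd_set rfds;
                FD_ZERO(&rfds);
                FD_SET(sock, &rfds);
                struct timeval tv = { .tv_sec = timeout_s, .tv_usec = 0 };
                int ready = select(sock + 1, &rfds, NULL, NULL, &tv);
                if (ready > 0)
                    return recv(sock, rsp, rsp_cap, 0);  /* datagram with response(s) */
                if (ready < 0)
                    return -1;
                timeout_s *= 2;                          /* exponential retry backoff */
            }
            return -1;  /* caller may fall back to NDTP/TCP, as discussed below */
        }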
  • the NDTP protocol is stateless from the standpoint of the NDTP server, which means that there is no congestion control or reliability burden on the server; it is all implemented in a distributed manner by the NDTP UDP/IP clients.
  • the NDTP server network front end preferably services NDTP query requests in a FIFO style by reading the NDTP_GET message, performing the lookup for the entity key in the NDTP server string store, and writing the NDTP_GET_RSP message.
  • Each NDTP query is independent of any other NDTP transactions (other queries or updates), so it is possible to process multiple NDTP queries simultaneously on multiprocessor machines.
  • the NDTP server permits this by not performing multiprocessor locking in the NDTP query processing path.
  • the current prototype NDTP server preferably does not create multiple read service threads per NDTP connection, so multiprocessing will only occur while processing queries from different NDTP connections. Nonetheless, the NDTP server could be extended to support multiprocessing of NDTP queries from a single NDTP connection if this turned out to be advantageous.
  • processing NDTP updates requires the high latency operation of committing the change to nonvolatile storage.
  • the NDTP server network front end preferably supports multiple concurrent asynchronous update transactions. Also, each update is preferably performed atomically to avoid creating an inconsistent state in the string store.
  • the string store supports only a single mutator thread, which means that all NDTP updates are serialized through the string store mutator critical code sections.
  • the string store mutation mechanism is implemented as a split transaction.
  • a call is made to the string store mutation function, which returns immediately indicating either that the mutation is complete, or that the completion will be signaled asynchronously through a callback mechanism.
  • the mutator function might indicate an immediate completion on an NDTP_PUT operation if the entity key/data location specifier mapping was already present in the database.
  • the network front end will immediately write the update response message back to the client.
  • the network front end maintains a queue of NDTP updates for which it is awaiting completion.
  • When the completion callback is called by the string store log file update mechanism, the network front end writes the NDTP update response messages for all completed updates back to the clients. If no new NDTP update requests are arriving from NDTP clients, and there are some incomplete updates in the update queue, the network front end preferably calls the string store log buffer flush function to precipitate the completion of the incomplete updates in the update queue.
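  • A rough sketch of this split-transaction update queue appears below. It is only illustrative: the queue layout and the send_rsp() helper are assumptions, although the variable names loosely echo the server source listing excerpted at the end of this document.

        /* Hypothetical sketch: the front end queues each deferred update,
         * and when the string store signals completion it writes the queued
         * responses back to the waiting clients. */
        #include <stdint.h>

        #define UPD_QUEUE_LEN 256

        typedef struct pending_upd {
            int      client_fd;      /* connection awaiting the response */
            uint32_t request_id;     /* echoed back in the _RSP message */
        } pending_upd_t;

        static pending_upd_t upd_queue[UPD_QUEUE_LEN];
        static unsigned upd_head, upd_tail;          /* tail: oldest incomplete update */

        extern void send_rsp(int fd, uint32_t request_id);   /* assumed helper */

        /* Called when the mutator defers completion to the log-flush callback. */
        static void queue_update(int fd, uint32_t request_id)
        {
            upd_queue[upd_head % UPD_QUEUE_LEN] = (pending_upd_t){ fd, request_id };
            upd_head++;
        }

        /* Completion callback invoked by the string store log file update
         * mechanism: 'completions' updates have reached nonvolatile storage. */
        static void upd_done(unsigned completions)
        {
            while (completions-- > 0 && upd_tail != upd_head) {
                pending_upd_t *p = &upd_queue[upd_tail % UPD_QUEUE_LEN];
                send_rsp(p->client_fd, p->request_id);
                upd_tail++;
            }
        }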
  • the NDTP server network front end may be conditionally compiled to use either of two standard synchronous I/O multiplexing mechanisms, select or poll, or to use threads to prevent blocking the server waiting for events on individual connections.
  • the select and poll interfaces are basically similar in their nature, but different in the details.
  • the multiplexing function is called to determine if any of the connections have input available, and if so, it is read into the connection's input buffer. Once a complete NDTP request is in the buffer, it is acted upon. Similarly, the network front end maintains an output buffer for each connection, and if there is still a portion of an NDTP response message to send, and the connection has some output buffer available, more of the NDTP response message is sent.
  • the threaded version of the NDTP server network front end preferably creates two threads for each NDTP connection, one for reading and one for writing. While individual threads may block as input data or output buffer is no longer available on a connection, the thread scheduling mechanism ensures that if any of the threads can run, they will.
  • the threaded version of the NDTP server is most likely to offer best performance on modern operating systems, since it will permit multiple processors of a system to be used, and the thread scheduling algorithms tend to be more efficient than the synchronous I/O multiplexing interfaces. Nonetheless, the synchronous I/O multiplexing versions of the NDTP server will permit it to run on operating systems with poor or nonexistent thread support.
  • The mapping operation in both a TCP and a UDP environment appears below.
  • TCP Transmission Control Protocol
  • IP Internet Protocol
  • TCP provides reliable, bi-directional stream data transfer.
  • TCP also implements adaptive congestion avoidance to ensure data transfer across a heterogeneous network with various link speeds and traffic levels.
  • NDTP is preferably carried on TCP in the natural way.
  • An NDTP/TCP client opens a connection with a server on a well-known port. (The well-known TCP (and UDP) port numbers can be selected arbitrarily by the initial NDTP implementer. Port numbers that do not conflict with existing protocols should preferably be chosen.)
  • the client sends NDTP requests 10 to the server 16 on the TCP connection, and receives responses 14 back on the same connection.
  • If protocol errors are detected on an NDTP/TCP connection, the NDTP/TCP connection should be closed immediately. If further NDTP/TCP communication is required after an error has occurred, a new NDTP/TCP connection should be opened.
  • detectable protocol errors include:
  • NDTP/TCP servers 16 and clients 12 need not maintain any additional form of operation timeout. The only transport errors that can occur will result in gross connection level errors.
  • a client 12 should assume that any NDTP requests 10 for which it has not received responses 14 have not been completed. Incomplete operations may be retried. However, whether unacknowledged NDTP requests 10 have actually been completed is implementation dependent. Any partially received NDTP messages should also be ignored.
  • The Unreliable Datagram Protocol (UDP) is a best-effort datagram protocol that, like TCP, is also part of the universally implemented subset of the IP suite.
  • UDP provides connectionless, unacknowledged datagram transmission.
  • the minimal protocol overhead associated with UDP can deliver extremely high performance if used properly.
  • NDTP/UDP clients 12 send UDP datagrams with NDTP request messages 10 to a well-known UDP port (see above).
  • NDTP/UDP servers 16 return NDTP response messages 14 to the client's 12 selected local UDP port indicated in the NDTP/UDP datagram containing the requests 10.
  • NDTP/UDP does not require any form of connection or other association to be established in advance. An NDTP interchange begins simply with the client request message 10.
  • the mapping of NDTP onto UDP permits multiple NDTP messages to be sent in a single UDP datagram.
  • UDP datagrams encode the length of their payload, so when a UDP datagram is received, the exact payload length is available.
  • the recipient of an NDTP/UDP datagram will read NDTP messages from the beginning of the UDP datagram payload until the payload is exhausted.
  • a sender of an NDTP/UDP datagram is free to pack as many NDTP messages as will fit in a UDP datagram.
  • IP provides mechanisms for discovering this maximum transfer size, called the Path Maximum Transfer Unit (Path MTU), but a discussion of these mechanisms is beyond the scope of this specification.
  • An implementation of NDTP/UDP should preferably respect these datagram size limitations.
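  • A small sketch of packing several already-serialized NDTP messages into one UDP datagram, up to a given payload limit, is shown below; the structure and function names are illustrative (not from the patent), and max_payload would come from Path MTU discovery as discussed above.

        /* Hypothetical sketch: pack as many complete NDTP messages as will fit
         * into one UDP datagram buffer. */
        #include <stdint.h>
        #include <string.h>

        typedef struct {
            const uint8_t *buf;
            size_t         len;     /* always a multiple of 4 per the size rules above */
        } ndtp_msg_t;

        static size_t pack_datagram(uint8_t *dgram, size_t max_payload,
                                    const ndtp_msg_t *msgs, size_t nmsgs,
                                    size_t *consumed)
        {
            size_t used = 0, i = 0;
            while (i < nmsgs && used + msgs[i].len <= max_payload) {
                memcpy(dgram + used, msgs[i].buf, msgs[i].len);
                used += msgs[i].len;
                i++;
            }
            *consumed = i;          /* number of messages packed into this datagram */
            return used;            /* payload length to pass to sendto() */
        }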
  • an NDTP/UDP client 12 implementation should implement a timeout mechanism to await the response for each outstanding NDTP request 10.
  • this response timer is implementation dependent, and may be set adaptively as a client 12 receives responses from a server 16, but a reasonable default maximum value is preferably 60 seconds. If a response 14 is not received within the response timeout, the client 12 may retransmit the request 10. NDTP/UDP servers 16 need not maintain any timeout mechanisms.
  • the client 12 retry mechanism may place some requirements on a client's 12 use of the NDTP request identifier 18 field. If the response timer is shorter than the maximum lifetime of a datagram in the network, it is possible that a delayed response will arrive after the response timer for the associated request has expired. An NDTP/UDP client 12 implementation should ensure that this delayed response is not mistaken for a response to a different active NDTP request 10. Distinguishing current responses from delayed ones is called antialiasing.
  • One presently preferred way to perform antialiasing in NDTP/UDP is to ensure that NDTP request identifier 18 values are not reused more frequently than the maximum datagram lifetime.
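  • A minimal sketch of this antialiasing rule appears below (not from the patent): request identifiers are drawn from a monotonically increasing counter, so a value is not reused within the maximum datagram lifetime, and responses whose identifiers do not match an active request are ignored. The table size and helper names are illustrative.

        #include <stdint.h>
        #include <stdbool.h>

        #define MAX_OUTSTANDING 64

        static uint32_t next_id;                       /* wraps only after 2^32 requests */
        static uint32_t active_ids[MAX_OUTSTANDING];
        static bool     active[MAX_OUTSTANDING];

        static uint32_t alloc_request_id(unsigned slot)
        {
            active_ids[slot] = next_id++;
            active[slot] = true;
            return active_ids[slot];
        }

        /* Returns true if the response id matches an active request; delayed or
         * aliased responses are skipped, as the text above requires. */
        static bool accept_response_id(uint32_t rsp_id, unsigned *slot_out)
        {
            for (unsigned s = 0; s < MAX_OUTSTANDING; s++) {
                if (active[s] && active_ids[s] == rsp_id) {
                    active[s] = false;
                    *slot_out = s;
                    return true;
                }
            }
            return false;   /* invalid id: ignore this NDTP message */
        }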
  • NDTP/UDP client 12 implementations that use the NDTP request identifier 18 for antialiasing should ignore (i.e., skip) NDTP messages within an NDTP/UDP datagram with invalid NDTP request identifier 18 values.
  • NDTP/UDP implementations detecting any other protocol error should also preferably discard the remainder of the current NDTP/UDP datagram without processing any further NDTP requests from that datagram.
  • Some examples of such detectable errors include: • Illegal NDTP message operation code;
  • NDTP_GET, NDTP_GET_RSP, NDTP_PUT or NDTP_DEL;
  • NDTP/UDP: Because NDTP/UDP messages are limited to the length of a single UDP datagram payload, NDTP/UDP cannot be used to transfer long NDTP messages. For example, it would be very difficult to send an NDTP_GET message with NDTP/UDP for a 64K byte identifier string. This case is avoidable by a client 12 realizing that an NDTP message is too long to send as a UDP datagram and using NDTP/TCP instead. However, a greater limitation is that NDTP currently provides no mechanism for an NDTP server to indicate that a response is too large to fit in a single UDP datagram.
  • In such a case, the NDTP server 16 should not send a response 14, and it may or may not choose to complete the request 10.
  • the recovery mechanism in this case preferably is that, after several unsuccessful attempts to use NDTP/UDP, a client 12 may try again with NDTP/TCP.
  • Because UDP does not provide any form of congestion avoidance, it is possible that the simple retry strategy specified for NDTP/UDP can create network congestion.
  • Network congestion can cause a severe degradation in the successful delivery of all network traffic (not just NDTP traffic, nor just the traffic from the particular client/server 12, 16 pair) through a congested network link.
  • Congestion will occur when an NDTP/UDP implementation is sending datagrams faster than can be accommodated through a network link. Sending a large number of NDTP/UDP datagrams all at once is the most likely way to trigger such congestion. Sending a single NDTP/UDP datagram, assuming it is smaller than the Path MTU, and then waiting for a response 14 is unlikely to create congestion.
  • NDTP/UDP should be confined to contexts where clients 12 send few outstanding requests at a time, or where network congestion is avoided through network engineering techniques.
  • network congestion is a highly dynamic property that is a function of network traffic from all sources through a network link and will vary over time over any given network path.
  • An NDTP/UDP client 12 implementation can recover from network congestion by switching to NDTP/TCP after several failed retries using NDTP/UDP. Failure due to network congestion may be indistinguishable from failure due to UDP packet size limitations, but since the recovery strategy is the same in both cases, there is no need to distinguish these cases.
  • NDTP/UDP: Given the stateless, transactional nature of NDTP, NDTP/UDP generally performs much better than NDTP/TCP. This performance improvement is measurable both in terms of the maximum sustainable transaction rate of an NDTP server 16, and the latency of a single response to an NDTP client 12.
  • DNS Domain Name Service
  • NDTP fits naturally in the UDP model. It is a working assumption of NDTP (and DNS) that for every NDTP transfer, there will be an associated transfer of real data that is an order of magnitude or more greater in size than the NDTP protocol traffic. This property will naturally limit the amount of NDTP traffic on a network.
  • congestion avoidance mechanism for NDTP/UDP.
  • security implies an existing, durable association between NDTP clients 12 and NDTP servers 16. This association is much like (and in the case of SSL, it is) a network connection. Therefore, depending upon what security technology is applied, developing a congestion avoidance mechanism for NDTP/UDP may be an irrelevant exercise.
  • NDTP provides two mechanisms for server redirection.
  • the redirection mechanisms allow cluster and hierarchical topologies, and mixtures of such topologies (described in detail below).
  • the first redirection mechanism supported by NDTP, embedded redirection links, uses an application-specific convention to return redirection pointers as NDTP data location strings.
  • For example, if location strings are W3C URLs, a URL with the schema ndtp: could be a server indirection pointer.
  • An NDTP_GET_RSP message may contain any mixture of real data location strings and NDTP server redirection pointers. In this case, the client must issue the same NDTP_GET query message to other NDTP servers indicated by the redirection pointers.
  • the total set of data location strings associated with the supplied identifier string is the collection of all the data location strings returned from all the NDTP servers queried.
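  • A brief sketch of how a client might follow such embedded redirection links appears below. It is not from the patent; ndtp_get() and add_location() are assumed helpers, and a production client would also guard against redirection cycles and duplicate locations.

        /* Hypothetical sketch: walk the location strings returned for an
         * identifier, treat those beginning with the "ndtp:" schema as pointers
         * to further NDTP servers, re-issue the same query there, and
         * accumulate the union of real data locations. */
        #include <string.h>

        extern size_t ndtp_get(const char *server_url, const char *ident,
                               char **locations, size_t max);     /* assumed */
        extern void add_location(const char *loc);                /* assumed */

        static void resolve_locations(const char *server_url, const char *ident)
        {
            char *locs[128];
            size_t n = ndtp_get(server_url, ident, locs, 128);
            for (size_t i = 0; i < n; i++) {
                if (strncmp(locs[i], "ndtp:", 5) == 0)
                    resolve_locations(locs[i], ident);   /* follow the redirection link */
                else
                    add_location(locs[i]);               /* real data location string */
            }
        }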
  • the embedded redirection link technique does not require any specific NDTP protocol support. Therefore, it could be used within the protocol as currently defined.
  • the second redirection mechanism which is specified as a future extension of NDTP, is having the server return an NDTP_RDR_RSP message in response to an NDTP request for which the NDTP server has no ownership of the supplied identifier string.
  • the NDTP_RDR_RSP mechanism applies to all NDTP requests, not just NDTP_GET.
  • the second redirection mechanism has two variants.
  • the first variant of the NDTP_RDR_RSP function mechanism specifies a well-known function that all NDTP server and client implementations know when they are programmed, and the NDTP_RDR_RSP message carries a table of NDTP server URLs.
  • the format of the NDTP_RDR_RSP message with an NDTP server URL table is shown in FIG. 9(a).
  • the appropriate NDTP server is selected from the table in the NDTP_RDR_RSP message by applying a well-known function to the identifier string and using the function result as an index into the NDTP server table.
  • the well-known function preferably applied is the hashpjw function presented by Aho, Sethi and Ullman in their text Compilers: Principles, Techniques, and Tools.
  • the size parameter is the number of elements in the NDTP server URL table returned in the NDTP_RDR_RSP message.
  • the size parameter must be a prime number; therefore, the NDTP server URL table must also have a prime number of elements.
  • the same NDTP server may appear multiple times in the NDTP server URL table. For example, if the server URL table has 2039 elements, by putting one NDTP server URL in the first 1019 table elements, and a second NDTP server URL in the second 1020 table elements, the responsibility for the index string space will be split roughly in half.
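  • A sketch of this well-known-function variant appears below: the identifier string is hashed with hashpjw and the result, taken modulo the prime table size carried in the NDTP_RDR_RSP message, indexes the NDTP server URL table. The hashpjw implementation follows the Aho, Sethi and Ullman text; the surrounding selection helper is illustrative, not from the patent.

        #include <stdint.h>
        #include <stddef.h>

        static uint32_t hashpjw(const uint8_t *s, size_t len)
        {
            uint32_t h = 0, g;
            for (size_t i = 0; i < len; i++) {
                h = (h << 4) + s[i];
                g = h & 0xf0000000u;
                if (g != 0) {
                    h ^= g >> 24;
                    h &= ~g;
                }
            }
            return h;
        }

        /* 'table' has 'size' entries, where 'size' is prime, per the text above. */
        static const char *select_ndtp_server(const uint8_t *ident, size_t ident_len,
                                              const char **table, uint32_t size)
        {
            return table[hashpjw(ident, ident_len) % size];
        }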
  • the second variant of the NDTP_RDR_RSP function mechanism specifies that a general function description will be sent to the NDTP client in the NDTP_RDR_RSP message.
  • the NDTP client will apply this function to the identifier string and the output of the function will be the NDTP server URL to which to send NDTP requests for the particular identifier string.
  • the advantage of this technique over the well-known function approach is that it allows application-specific partitions of the identifier string space. This can permit useful administrative control. For example, if General Electric manages all identifiers beginning with the prefix "GE", a general function can be used to make this selection appropriately.
  • the disadvantage of using a general function is it may be less efficient to compute than a well-known function.
  • NDTP is expected to be applied in environments that make extensive use of the Java programming platform. Therefore the
  • NDTP_RDR_RSP mechanism preferably uses a feature of the Java programming language called "serialized representation" to communicate generalized functions in the NDTP_RDR_RSP message.
  • serialized representation: A serialized form of a Java object is a stream of bytes that represents the precise state of the object, including its executable methods. For example, the Java Remote Method Invocation (RMI) facility uses this serialized form to communicate objects between machines.
  • NDTP_RDR_RSP contains the serialized form of an object implementing this Java interface:
  • the format of the NDTP_RDR_RSP message with a Java Serialized form of the NDTP redirection function is specifically identified in FIG. 9(b).
  • the NDTP server redirection mechanism also permits construction of NDTP server clusters (described below). It is expected that the identifier string hash function will be defined at the time NDTP is implemented, but the actual list of NDTP servers 90 will change from application to application and within a single application throughout the lifetime of the system. Therefore, it is necessary for clients to be able to discover updated NDTP server lists, and any other relevant dynamic parameters of the server selection function as these inputs change.
  • An NDTP server hierarchy 100 permits identifier/location association data to be owned and physically controlled by many different entities.
  • An NDTP server cluster should be managed by a single administrative entity 102, and the distribution of data can be for performance and scaling purposes.
  • a server hierarchy would provide some fault isolation so portions of the identifier/location association data can be accessed and updated in the presence of failures of some NDTP servers 104.
  • an NDTP server hierarchy can localize NDTP update operations (NDTP_PUT and NDTP_DEL), which can improve performance and reduce network load.
  • a hierarchical NDTP server topology also allows organizations to maintain their own local NDTP server 104 or NDTP server cluster 102 that manages associations to data locations that are within the organizations' control.
  • Upper tier NDTP servers 108 would be used to link the various leaf NDTP servers 104.
  • FIG. 11 illustrates an NDTP server constellation 110 as it relates to a client 112 and a data repository 114.
  • the client 112 and data repository 114 of FIG. 11 were merged into the single client entity 106 for ease of discussion. Their distinction can now be separated and identified in order to illustrate the storage and retrieval of data in a distributed data collection.
  • a client 112 consults the server constellation 110, which may be construed in either of two forms (see FIGS. 12 and 13), and which returns location strings in response to a client 112 request. Once the client 112 has the location string for a particular unit of data, the client 112 contacts and retrieves information directly from the data repository 114. In one embodiment, if the client 112 contains a data repository 114, internal application logic would facilitate this interaction.
  • The term "data collection" is being employed rather than the term "database" because "database" frequently invokes images of Relational Database Systems (which are only one application of the protocol); an NDTP data collection could just as easily be routing tables as it could be files or records in an RDBMS.
  • NDTP server constellations 110 preferably have two basic organizational paradigms: Client-Centric and Server-Centric.
  • NDTP supports both by design, and both approaches apply to all aspects of managing the relationships between identifiers and locations, such as data retrieval, index manipulation, and server redirection. Each will be discussed separately below.
  • NDTP_RDR_RSP redirection response message
  • This design constructs operating patterns for (1) redirection, (2) index operations, and (3) hierarchical or cluster topologies.
  • the important point is that the Network Distributed Tracking Protocol is designed to support highly configurable methods for processing index-related operations.
  • NDTP supports two specific redirection mechanisms, which are not mutually exclusive and may be combined in any way within a single NDTP server constellation 110. This formation may increase performance when many clients (not shown) participate, since client processing is emphasized rather than server processing.
  • the first NDTP redirection mechanism uses a distinctively encoded location string for each NDTP server 120a,b that contains additional location strings associated with the identifier string supplied in the NDTP request 122a,b. This is an embedded redirection link. For example, if location strings are some form of HTTP URL, a URL with the schema specifier ndtp: would indicate a redirection. Using this scheme, the location strings associated with an identifier string may be spread among multiple NDTP servers 120a,b. In addition to redirection, in FIG. 12, all index manipulation operations continue to apply, but they are directed at the correct NDTP server 120b for which they apply: NDTP_GET, NDTP_PUT, NDTP_DEL.
  • the second NDTP redirection mechanism uses a NDTP_RDR_RSP message to indicate that the server 120a to which the NDTP request 122a was directed does not contain any of the location strings associated with the identifier string supplied in the NDTP request 122a.
  • the NDTP_RDR_RSP message contains all the information required for the originator of the NDTP request 122a to reissue the original NDTP request 122b to a different NDTP server 120b that does have location strings associated with the identifier string supplied in the NDTP request 122b.
  • This information may be an array of NDTP server hosts from which one is selected by applying a well-known function to the identifier string supplied in the NDTP request 122b, or the communicated function to apply as well as a list or other description of the NDTP server hosts from which to choose, as described above.
  • Figure 12 illustrates a cluster topology for client interaction with NDTP servers 120.
  • a single client queries a first server 120a (Server0), learns of a new index location (Server1), and then contacts that server 120b (Server1) for the operations it wishes to execute on the index that the client identifies.
  • the basic idea is that a client asks a server 120a to process an index operation. If the contacted server 120a does not have all the information, as for example in a redirect, then it passes the request to another server 120b. If the second server 120b is appropriate it responds appropriately, or it passes the request on to another server (not shown), and so on.
  • Figure 12 could also illustrate a hierarchical topology if a client (not shown) contacted another client in a handoff as shown in FIG. 10, where a client 106 "asks up" to another client 106, and so on.
  • the server constellation 110 could also be using a hierarchical organization or a cluster organization for managing indices.
  • the important point of this topology is pushing processing emphasis toward clients (not shown) rather than toward servers 120a,b.
  • Such protocol design has scale implications as the number of participating machines/mechanisms increases, since it distributes aggregate processing.
  • FIG. 13 shows the server constellation 110 characterizing "server-centric" functionality.
  • an NDTP server 130a receives a request 132a from a client (not shown).
  • the server 130a passes the request to a second server 130b (Server1), which is an appropriate server for the process, and the second server 130b returns a response 134a to the first server 130a (Server0). If the second server 130b (Server1) was not appropriate, it could pass the request to another server (not shown), and so on.
  • Each NDTP server 130a,b will combine the results of NDTP requests 132a,b it has performed of other NDTP servers 130a,b with whatever responses 134a,b it generates locally for the original NDTP request 132a, and the combined response 134b will be the appropriate response for the original NDTP request 132a.
  • This design constructs operating patterns for (1) index operations and (2) hierarchical or cluster topologies.
  • All index manipulation operations continue to apply, but they are directed at the correct NDTP server 130a,b for which they apply: NDTP_GET, NDTP_PUT, NDTP_DEL.
  • FIG. 13 illustrates an hierarchical topology for client interaction with NDTP servers 130.
  • a single client queries a first server 130a (Server0), which is not appropriate, and so the first server 130a (not the client) itself contacts an appropriate server 130b (Server1) for operations it "passes through" to execute on the index that the client has identified.
  • FIG. 13 could illustrate a cluster topology if a server 130a contacted another server 130b in what is known as a "peer" handoff.
  • the important point of this topology is that it pushes processing emphasis toward servers 130a,b rather than toward clients. Since index processing services can be centralized, the indices can be administered more conveniently in certain cases.
  • The simplest NDTP server constellation 110 is a single server 130, and the protocol is designed to permit massive scale with a single or simple server constellation. Highly configurable installations are possible using "client-centric" or "server-centric" techniques. NDTP server constellations 110 composed of more than one NDTP server may use any combination of the two approaches for performance optimization and data ownership properties. Client-centric and server-centric approaches can be used to build NDTP server clusters, NDTP server trees, NDTP server trees of NDTP server clusters, or any other useful configuration. The NDTP design thus explicitly addresses the emerging "peer-to-peer" topologies called "pure" and "hybrid".
  • the "pure” peer-to-peer approach emphasizes symmetric communication among peers, and is achievable through the “server-centric” approach.
  • the “hybrid” peer-to-peer approach emphasizes asymmetric communication among non-peer participants, and is achievable through the “client-centric” approach.
  • NDTP permits any additional mixtures between client-centric and server-centric approaches to provide superior configurability and performance tuning.
  • NDTP preferably has no provisions for security. Three key features of security should therefore be provided:
  • NDTP/TCP will be extended using SSL/X.509 to support these security features in a straightforward, 'industry standard' way.
  • IPSec supports securing all IP traffic, not just TCP, between two endpoints. IPSec is a somewhat more heavyweight technology than SSL, and the rate of adoption in industry is somewhat slow. Nonetheless, it can provide the relevant capabilities to NDTP/UDP.
  • TCP has provided decades of solid service, and is so widely implemented that the mainstream computer industry could not imagine using another protocol to replace it.
  • TCP lacks several features that may be necessary to enable the next step in network applications.
  • the TCP design assumed pure software implementations by relatively powerful host computers.
  • developments in network technology have increased the packet rate that a TCP implementation must handle to deliver full network speed beyond the capabilities of even increasingly powerful host computers.
  • TCP's design makes this very difficult. It is unclear whether it will become possible to implement the relevant portions of TCP in hardware in a timely fashion.
  • NDTP clients could likely still use a cheaper software implementation of the new transport because of individual clients' modest performance demands.
  • the Network Distributed Tracking Protocol is a networking protocol that runs on top of any stream (e.g. TCP) or datagram (e.g. UDP) network transport layer.
  • the goal of NDTP is to support a network service that efficiently manages mappings from each individual key string, an identifier, to an arbitrary set of strings, locations.
  • NDTP permits protocol participating clients to add and remove identifier/location associations, and request the current set of locations for an identifier from protocol servers.
  • NDTP is designed for use in the larger context of a distributed data collection. As such, it supports an architecture in which information about where data associated with particular application entities can be managed and obtained independently of the data itself. One way to understand this is as a highly dynamic DNS for data.
  • DNS maintains a mapping between names and machines.
  • NDTP and its associated servers maintain a mapping between entity identifiers and data locations.
  • the identifier/location mapping maintained by NDTP servers is much more dynamic (more frequent updates) than the domain name/IP address mapping maintained by DNS.
  • NDTP is designed to support very fast, very diverse, and very large scale mapping manipulations.
  • NDTP can be used for any application in which one-to-zero or one-to-many associations among strings are to be maintained and accessed on a network.
  • the term identifier is likely to make sense in most cases, but the term location may not.
  • Although NDTP supports identifier and location strings of up to 2^32-4 bytes in length, it is a general assumption that the strings are typically short.
  • the invention provides for the management and manipulation of indices and their associated relationships. Even more importantly, it is the manipulation of dynamic and spontaneous relationships between indices and locations, not the indices and locations, that is the core significance.
  • the Network Pistributed Tracking Protocol was written to manipulate these relationships, of which indices (identifiers) and locations are components of the aggregate solution.
  • ...c: NDTP server test client TCP-specific code; ndtpc_udp.c: NDTP server test client UDP-specific code; ndtpd.c: NDTP server top level; ndtpd.h: NDTP server top level definitions; ndtpd_sync.c: NDTP server synchronous I/O multiplexing version; ndtpd_sync.h: NDTP server synchronous I/O multiplexing version definitions; ndtpd_thr.c: NDTP server threaded I/O version; ndtpd_thr.h: NDTP server threaded I/O version definitions; ndtpd_udp.c: NDTP server UDP-specific code; ndtpd_udp.h: NDTP server UDP-specific code definitions; ox_mac.h: OverX machine specific abstraction definitions; ox_thread.h: OverX thread abstraction definitions; ss_test.c: String store test driver; string_store.c: String store implementation; string_store.h: String store interface definitions; timestamp.h: Performance testing time stamp definitions
  • (Fragmentary appendix source code excerpts from the files listed in the manifest above, covering the NDTP test clients, the synchronous I/O multiplexing, threaded, and UDP variants of the NDTP server, and the string store implementation, appeared here in the original listing.)
  • This program source is the property of OverX, Inc. and contains information which is confidential and proprietary to OverX, Inc. No part of this source may be copied, reproduced, disclosed to third parties, or transmitted in any form or by any means, electronic or mechanical, for any purpose without the express written consent of OverX, Inc.

Abstract

A network distributed tracking wire transfer protocol for storing and retrieving data across a distributed data collection. The protocol includes a location string for specifying the network location of data associated with an entity in the distributed data collection, and an identification string for specifying the identity of an entity in the distributed data collection. According to the protocol, the length of the location string and the length of the identification string are variable, and an association between an identification string and a location string can be spontaneously and dynamically changed. The network distributed tracking wire transfer protocol is application independent, organizationally independent, and geographically independent. A method for using the protocol in a distributed data collection environment and a system for implementing the protocol are also provided.

Description

NETWORK DISTRIBUTED TRACKING WIRE TRANSFER PROTOCOL
RELATED APPLICATIONS
This application claims priority to provisional patent application serial no. 60/153,709, entitled SIMPLE DATA TRANSPORT PROTOCOL METHOD
AND APPARATUS, filed on September 13, 1999, and to regular patent application no. 09/111,896, entitled SYSTEM AND METHOD FOR ESTABLISHING AND RETRIEVING DATA BASED ON GLOBAL INDICES, filed on July 8, 1998.
FIELD OF THE INVENTION
This invention relates generally to the storage and retrieval of information, and in particular, to a protocol for dynamic and spontaneous global search and retrieval of information across a distributed network regardless of the data format.
BACKGROUND OF THE INVENTION
Data can reside in many different places. In existing retrieval systems and methods, a client seeking information sends a request to a server. Typically, only data that are statically associated with that server are returned. Disadvantageously, the search is also usually restricted to previously known systems. The search is thus conducted only where the server knows in advance to look.
Another disadvantage of known retrieval systems is the difficulty in accessing data in different forms. Known retrieval systems are typically designed to search for data in limited forms. One example is where a client requests files based on a subject, like a person's name. In the search results, therefore, only text files of peoples' names may be retrieved. Another problem in current retrieval systems is that the client may receive text and image files in the search results, but cannot seamlessly access the image files. Yet another problem in current retrieval systems is that video and sound files related to the request may not even be found in the search results. For example, a doctor might be able to retrieve medical records on a specific patient, but cannot view MRI or X-Ray results associated with that record.
A distributed data collection is a system where data is stored and retrieved among multiple machines connected by a network. Typically, each machine in which some portion of the data in a distributed data collection may reside is called a "data repository machine", or simply a "data repository". One commonly asked question in a data repository environment is: Where is data associated with a particular entity in a distributed data collection? The data location is a key question when a distributed data collection has highly dynamic data distribution properties.
In networked environments where there are a large number of data repositories and any particular entity does not store data in all the repositories, a mechanism is needed that would permit queries to be directed only at data repositories with relevant information. It would also be beneficial to permit membership in the set of data repositories itself to be highly dynamic. Such a system would support on-the-fly addition and removal of data repositories from a distributed data collection seamlessly and without the need to reprogram the client and server participants.
BRIEF SUMMARY OF THE INVENTION
In view of the above, the invention provides a network distributed tracking wire transfer protocol, and a system and method for using the protocol in a networked environment. The network distributed tracking wire transfer protocol includes two basic components: identification strings for specifying the identity of an entity in the distributed data collection, and location strings for specifying network locations of data associated with an entity. The protocol accommodates variable length identifier and location strings. Relationships between identification strings and location strings can be dynamically and spontaneously manipulated thus allowing the corresponding data relationships also to change dynamically, spontaneously, and efficiently. In addition, the network distributed tracking wire transfer protocol is application independent, organizationally independent, and geographically independent.
In another aspect of the invention, a system of components using the network distributed tracking protocol is provided for storing and identifying data within a distributed data collection. The components include (1) a data repository for storing data in the distributed data collection, (2) a client entity for manipulating data in the distributed data collection, and (3) a first server entity operative to locate data in the distributed data collection, which may be coupled to a client entity and/or data repository. In a network with these components, a client entity transmits an identifier string to the first server entity along with the client request, and the first server entity provides a set of location strings to the client entity in response thereto. The first server entity maps the identifier string received from the client entity to a set of location strings. The network may also include any number of additional server entities coupled to the first server entity.
According to yet another aspect of the invention, a method is provided for storing and retrieving tracking information over a network using a wire transfer protocol. A location string specifies the location of data associated with an entity in the distributed data collection and the identification string specifies the identification of an entity in a distributed data collection. A first data repository entity stores data by associating an identification string with each particular stored unit of data, and by mapping the identification string to a location string associated with the first data repository. The identification string and location string for the particular unit of data are at a first server entity coupled to the first data repository entity. A request is transmitted from a client entity to the first server entity to retrieve at least one location string associated with the stored unit of data in the distributed data collection. The request includes the identification string associated with the particular stored unit of data. The request is received at the first server entity, which responds to the client entity by providing at least one location string associated with the particular stored unit of data to the client entity. The request may also be transmitted to a second server entity prior to responding to the client entity, where the second server entity is coupled to the first server entity and includes the mapping of the identification string and location strings for the particular units of data. In such case, the second server entity responds to the client entity by providing the at least one location string associated with the particular stored unit of data to the client entity.
The network distributed tracking protocol of the invention is a networking protocol that efficiently manages mappings from one or more identifier strings to zero or more location strings. The protocol permits client entities to add and remove identifier/location associations, and request the current set of locations for an identifier or identifiers from server entities that comply with the protocol.
The protocol is designed for use in the larger context of a distributed data collection. As such, it supports an architecture in which information about where data associated with particular application entities can be managed and obtained independently of the data itself. The protocol and its associated servers thus maintain a mapping between entity identifiers and data locations. The identifier/location mapping maintained by the servers is very dynamic. Regardless of the expected system context in a distributed data collection, the protocol can be used for any application in which one-to-one or one-to-many associations among strings are to be maintained and accessed on a network.
In any context, the protocol supports identifier and location strings of up to 2^32-4 bytes in length, but in most applications it is expected that the strings are typically short. String length is not fixed by the protocol, except by the upper bound. Accordingly, string formats are controlled at a local level and not dictated by the protocol.
These and other features and advantages of the invention will become apparent upon a review of the following detailed description of the presently preferred embodiments of the invention, when viewed in conjunction with the appended drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is an example of multiple outstanding protocol requests.
FIG. 2 is a layout of one presently preferred string format.
FIG. 3 is a layout of one presently preferred NDTP_GET message.
FIG. 4 is a layout of one presently preferred NDTP_GET_RSP message.
FIG. 5 is a layout of one presently preferred NDTP_PUT message.
FIG. 6 is a layout of one presently preferred NDTP_PUT_RSP message.
FIG. 7 is a layout of one presently preferred NDTP_DEL message.
FIG. 8 is a layout of one presently preferred NDTP_DEL_RSP message.
FIG. 9 is a layout of one presently preferred NDTP_RDR_RSP message, where FIG. 9(a) shows a server table layout, and FIG. 9(b) shows a redirection function layout.
FIG. 10 is a system block diagram showing a multi-server implementation environment of the network distributed tracking wire transfer protocol of the invention.
FIG. 11 is a system diagram showing an NDTP server constellation configuration.
FIG. 12 is a system diagram showing a client-centric constellation approach.
FIG. 13 is a system diagram showing a server-centric constellation approach.
DETAILED DESCRIPTION OF THE PRESENTLY PREFERRED
EMBODIMENTS OF THE INVENTION
The following terms are used to describe the operation of the presently preferred network distributed tracking protocol (NDTP). An "identifier string" or an "identifier" is a unique string with which zero or more location strings are associated in an NDTP server. A "data location" or a "location" is a string that is a member of a set of strings associated with an identifier string in an NDTP server. An "NDTP client" or a "client" is a network-attached component that initiates update or lookup of identifier/location mappings from an NDTP server with NDTP request messages. An "NDTP server" or a "server" is a network- attached component that maintains a set of identifier/location mappings that are modified or returned in response to NDTP request messages from NDTP clients. The term "Network Byte Order" is the ordering of bytes that compose an integer of larger than a single byte as defined in the Internet Protocol (IP) suite. Preferably, Network Byte Order specifies a big-endian, or most significant byte first, representation of multibyte integers. In this specification a byte is preferably composed of eight bits.
Network Distributed Tracking Protocol (NDTP)
The Network Distributed Tracking Protocol (NDTP) efficiently tracks the location of data associated with an individual entity in a distributed data collection. NDTP is a transactional protocol, which means that each operation within NDTP consists of a request message from an NDTP client to an NDTP server, followed by an appropriate response message from the NDTP server to the NDTP client. NDTP defines an entity key (or "identifier") as a unique string used to refer to an entity about which a distributed query is performed. The NDTP server treats the entity key as an unstructured stream of octets, which is assumed to be unique to a particular entity. The precise structure of the NDTP entity key and the mechanism for ensuring its uniqueness are a function of the application in which the NDTP server is used. In a customer oriented application, the NDTP entity key might be a unique customer identifier, for example, a Social Security Number, in either printable or binary integer form, as is appropriate to the application. NDTP also defines a data location specifier as a string used to specify a data repository in which data associated with a particular entity may be found.
As with NDTP entity keys, the NDTP server treats NDTP data location specifiers as unstructured streams of octets. The structure of an NDTP data location specifier is a function of the application in which the NDTP server is used. For example, an NDTP data location specifier might be an Internet machine name and a TCP/IP port number for a relational database server, or an HTTP Universal Resource Locator (URL), or some concatenation of multiple components.
The NDTP server efficiently maintains and dispenses one to zero or one to many relationships between entity keys and data location specifiers. In other words, an entity key may be associated with any number of data location specifiers. When data for a particular entity is added to a data repository, the NDTP server is updated to indicate an association between the entity key and the data repository's data location specifier. When a query is performed for an entity key, the NDTP server supplies the set of data repositories in which data may be found for that entity key.
General NDTP Mechanics
The protocol of the invention is designed to provide maximum transaction throughput from the NDTP server, associated clients, and associated data repositories when appropriate. The design goal is realized through two design principles:
1. NDTP messages should preferably be as short as possible to maximize the rate of NDTP transactions for a given network communication bandwidth.
2. NDTP messages should preferably be structured for efficient processing on existing machine architectures.
Design Optimizations
Numerical fields of an NDTP message are preferably represented in binary integer format rather than ASCII or other printable format to minimize host processing overhead and network utilization. Numerical fields of NDTP messages are also aligned on 32-bit boundaries to minimize the cost of manipulation on current machine architectures. Manipulating unaligned multibyte integer quantities on modern machine architectures usually incurs an extra cost ranging from mild to severe compared to manipulating the same quantities in aligned form. In keeping with other network protocol standards including TCP/IP, multioctet integer quantities in NDTP are preferably encoded using the big endian integer interpretation convention, as set forth above.
To overcome network latency, NDTP is designed to support asynchronous operation, where many requests may be sent to an NDTP server before a response from any of them is received.
Each NDTP message is preceded by a fixed size, 12-octet header, using the preferred data structure:

typedef struct ndtp_hdr {
    uint8_t  op;        /* opcode */
    uint8_t  pad[3];
    uint32_t id;        /* transaction identifier */
    uint32_t size;      /* total request size following the header */
} ndtp_hdr_t;

where: op:
NDTP message numerical operation code.
NDTP_GET: get request
NDTP_GET_RSP: get response
NDTP_PUT: put request
NDTP_PUT_RSP: put response
NDTP_DEL: delete request
NDTP_DEL_RSP: delete response
NDTP RDR RSP: provide redirection id:
Client supplied operation request used to distinguish responses from multiple outstanding NDTP asynchronous requests. Each "_RSP" message echoes the id field of the associated request. size:
Size, in octets of the remainder of the NDTP message. The size field should always be a multiple of 4 octets. Variably sized portions of NDTP messages are preferably defined with a size field rather than some other delimiter mechanism to facilitate efficient reading of NDTP messages. Requests may be made to the network layer to read the entire variably sized portion of an NDTP message, rather than reading small chunks while scanning for a delimiter. Furthermore, client and server resource management can be more efficient since the size of NDTP messages is known before reading.
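Purely for illustration, and not as part of the protocol definition, a client might fill in this header as sketched below; the helper name fill_hdr is hypothetical, the opcode value would be one of those listed above, and the htonl() calls reflect the Network Byte Order convention already described.

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* htonl() */

typedef struct ndtp_hdr {
    uint8_t  op;         /* opcode */
    uint8_t  pad[3];
    uint32_t id;         /* transaction identifier */
    uint32_t size;       /* total request size following the header */
} ndtp_hdr_t;

/* Hypothetical helper: prepare a header for a request whose variably
 * sized portion occupies body_size octets (already a multiple of 4,
 * per the padding rules described below). */
static void fill_hdr(ndtp_hdr_t *hdr, uint8_t op, uint32_t id, uint32_t body_size)
{
    memset(hdr, 0, sizeof(*hdr));             /* zeroes the pad octets */
    hdr->op   = op;                           /* e.g. 2 for NDTP_PUT */
    hdr->id   = htonl(id);                    /* multibyte fields in Network Byte Order */
    hdr->size = htonl(body_size);
}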
The variably sized portions of NDTP messages are composed of zero or more NDTP strings:

typedef struct ndtp_str {
    uint32_t len;
    uint8_t  data[];
} ndtp_str_t;
Note that the C struct definitions in this document are schematic, and not necessarily fully compliant structures in the C programming language.
Specifically, arrays denoted in this document with "[ ]" imply a dimension which is only known dynamically and this indefinite array size specifier is not allowed in C struct definitions. Note also the following:
len:
The number of significant octets of data following the len field in the data area.
data:
len octets of data, followed by up to 3 octets of padding, to ensure that the total length of the NDTP string structure is a multiple of 4 octets.
The padding octets are not included in the len field. Because the variably sized portions of NDTP messages are composed of zero or more NDTP strings, and NDTP strings preferably occupy an even multiple of 4 octets, this ensures that the "size" field of NDTP message headers will preferably be a multiple of 4 octets.
Protocol Structure
An example of multiple outstanding NDTP requests and the use of request identifiers is shown in FIG. 1. NDTP preferably has a simple, stateless request/response structure. Each request message 10 sent by a client 12 has a corresponding response message 14 returned by the server
16. To maximize server 16 throughput and use of available network bandwidth, NDTP is asynchronous in nature. Many requests 10 from a single client 12 may be outstanding simultaneously, and responses 14 may or may not be returned from the server 16 in the order in which the requests 10 were issued. Each NDTP request 10 contains an NDTP request identifier 18 that is returned in the NDTP response 14 for the associated request 10. An NDTP client 12 uses a unique NDTP request identifier 18 for each NDTP request 10 that is outstanding at the same time to an NDTP server 16 if it wishes to correlate responses with requests. There are four operations preferably supported by the protocol:
• Add a location association to an identifier.
• Delete a location association from an identifier.
• Get all locations associated with an identifier.
• Provide a redirect instruction to identify an alternative server.
The response to adding a location association to an identifier 18 is a simple acknowledgement. If the location is already associated with the identifier 18, adding the association has no effect, but the request 10 is still acknowledged appropriately. In other words, the NDTP add operation is idempotent. The response to deleting a location association from an identifier 18 is a simple acknowledgement. If the location is not currently associated with the identifier 18, deleting the association has no effect, but the request 10 is still acknowledged appropriately. In other words, the NDTP delete operation is idempotent. The response 14 to getting all locations associated with an identifier 18 is a list of the locations presently associated with an identifier 18. If no locations are currently associated with an identifier 18, a list of length zero is returned.
Message Formats
NDTP messages 10, 14 preferably have a regular structure that consists of a message operation code, followed by a request identifier 18, followed by a string area length (in bytes) 20, followed by zero or more strings 22, as shown in FIG. 2. As those skilled in the art will appreciate, NDTP message formats are preferably independent of the network transport layer used to carry them. NDTP preferably defines mappings of these messages 10, 14 onto TCP and UDP transport layers (described in detail below), but other mappings could also be defined and it is likely that these NDTP message formats would not require change. For example, the notation
ROUND4(x) means x, rounded up to the next multiple of 4.
Integer Format
Multibyte integers in NDTP messages are represented in network byte order; using the big-endian convention. In other words, the most significant byte of a multibyte integer is sent first, followed by the remainder of the bytes, in decreasing significance order.
String Format
Strings in NDTP are represented as counted strings, with a 32-bit length field 20, followed by the string data 22, followed by up to 3 bytes of padding 24 to make the total length of the string representation equal to
ROUND4(length). This layout is shown diagrammatically in FIG. 2.
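A brief sketch of writing one such counted string follows; the helper name put_ndtp_string is hypothetical rather than taken from the reference implementation, ROUND4 is the rounding notation defined above, and the padding octets are written as zero and excluded from the length field.

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

#define ROUND4(x) (((x) + 3u) & ~3u)   /* x rounded up to the next multiple of 4 */

/* Hypothetical helper: write one NDTP counted string (32-bit length in
 * Network Byte Order, the string data, then zero padding) into buf and
 * return the number of octets consumed, i.e. 4 + ROUND4(length). */
static size_t put_ndtp_string(uint8_t *buf, const uint8_t *data, uint32_t len)
{
    uint32_t nlen = htonl(len);
    memcpy(buf, &nlen, 4);                        /* length field */
    memcpy(buf + 4, data, len);                   /* string data */
    memset(buf + 4 + len, 0, ROUND4(len) - len);  /* up to 3 padding octets */
    return 4 + ROUND4(len);
}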
NDTP GET Format
The NDTP_GET message has a message operation code 30 of 0, and a single NDTP string 32 which is the identifier string for which to get associated location strings. This layout is shown diagrammatically in FIG. 3.
NDTP GET RSP Format
The NDTP_GET_RSP message has a message operation code 40 of 1, and zero or more strings 42 that are the locations currently associated with the requested identifier. This layout is shown diagrammatically in FIG. 4.
NDTP PUT Format
The NDTP_PUT message has a message operation code 50 of 2, and two NDTP strings 52, 54. The first string 52 is the identifier for which to add a location association, and the second string 54 is the location to add. This layout is shown diagrammatically in FIG. 5.
NDTP PUT RSP Format
The NDTP_PUT_RSP message has a message operation code 60 of 3, and zero NDTP strings. This layout is shown diagrammatically in FIG. 6.
NDTP DEL Format
The NDTP_DEL message has a message operation code 70 of 4, and two NDTP strings 72, 74. The first string 72 is the identifier from which to delete a location association, and the second string 74 is the location to delete. This layout is shown diagrammatically in FIG. 7.
NDTP DEL RSP Format
The NDTP_DEL_RSP message has a message operation code 80 of 5, and zero NDTP strings. This layout is shown diagrammatically in FIG. 8.
NDTP RDR RSP Format
The NDTP_RDR_RSP message has a message operation code 90 of 6, and one or more NDTP strings 92, 94. Two layouts apply, which are shown diagrammatically in FIGS. 9(a) and 9(b).
A general description of the usage and operation of these protocol messages is provided below.
NDTP GET Transaction
The NDTP_GET message contains a single NDTP string which is the entity key for which associated data locations are requested.

typedef struct ndtp_get {
    ndtp_hdr_t hdr;
    ndtp_str_t key;
} ndtp_get_t;
The NDTP_GET_RSP message contains zero or more NDTP strings which are the data location specifiers associated with the NDTP entity key:

typedef struct ndtp_get_rsp {
    ndtp_hdr_t hdr;
    uint32_t   rsps;
    ndtp_str_t values[];
} ndtp_get_rsp_t;
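As a hedged illustration of consuming such a response, the sketch below walks the string area of an NDTP_GET_RSP using each 32-bit length field and the 4-octet padding rule; the function and callback names are hypothetical, and if the rsps count shown in the structure above is carried on the wire it simply precedes the strings and shifts the starting offset by 4 octets.

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

#define ROUND4(x) (((x) + 3u) & ~3u)

/* Hypothetical parser: payload points at the string area of an
 * NDTP_GET_RSP and payload_size is its length in octets; cb is
 * invoked once per location string. */
static void for_each_location(const uint8_t *payload, uint32_t payload_size,
                              void (*cb)(const uint8_t *loc, uint32_t len))
{
    uint32_t off = 0;
    while (off + 4 <= payload_size) {
        uint32_t len;
        memcpy(&len, payload + off, 4);
        len = ntohl(len);
        if (off + 4 + len > payload_size)
            break;                       /* malformed: string runs past the payload */
        cb(payload + off + 4, len);      /* one data location specifier */
        off += 4 + ROUND4(len);          /* skip the data plus its padding */
    }
}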
NDTP PUT Transaction
The NDTP_PUT message contains two NDTP strings which are (1) the NDTP entity key and (2) the NDTP data location specifier which is to be associated with the NDTP entity key.

typedef struct ndtp_put {
    ndtp_hdr_t hdr;
    ndtp_str_t key;
    ndtp_str_t data;
} ndtp_put_t;
The NDTP_PUT_RSP message has no NDTP strings, and simply indicates that the requested entity key/data location specifier association was added:

typedef struct ndtp_put_rsp {
    ndtp_hdr_t hdr;
} ndtp_put_rsp_t;
The requested entity key/data location specifier association is added in addition to any other associations already maintained by the NDTP server. If the requested entity key/data location specifier association is already in effect, the NDTP_PUT still succeeds and results in an NDTP_PUT_RSP message.
NDTP DELETE Transaction
The NDTP_DEL message contains two NDTP strings which are (1) the NDTP entity key and (2) the NDTP data location specifier which is to be unassociated with the NDTP entity key:

typedef struct ndtp_del {
    ndtp_hdr_t hdr;
    ndtp_str_t key;
    ndtp_str_t data;
} ndtp_del_t;
The NDTP_DEL_RSP message has no NDTP strings, and simply indicates that the requested entity key/data location specifier association was deleted.

typedef struct ndtp_del_rsp {
    ndtp_hdr_t hdr;
} ndtp_del_rsp_t;
If the requested entity key/data location specifier association is not in effect, the NDTP_DEL still succeeds and results in an NDTP_DEL_RSP message.
NDTP RDR RSP Message
NDTP supports a distributed server implementation for which two principal redirection methods apply: (1) embedded redirection links, and (2) passed functions. The passed functions method in turn has two variants: (a) a well-known function, and (b) a communicated function. (These methods and variants are described in further detail below.)
Network Front End
The NDTP server network front end preferably maximizes NDTP transaction throughput, including concurrent NDTP requests from a single client as well as NDTP requests from multiple concurrent clients.
Network Communication Mechanism
NDTP defines a transaction oriented protocol, which can be carried over any of a variety of lower level network transport protocols. For example: TCP/IP: TCP/IP provides a ubiquitously implemented transport which works effectively on both local area and wide area networks. An NDTP client using TCP/IP preferably connects with the NDTP server at an established TCP port number, and then simply writes NDTP request messages through the TCP/IP connection to the server, which then writes NDTP response messages back to the client through the same TCP/IP connection in the reverse direction. TCP/IP implementations perform buffering and aggregation of many small messages into larger datagrams, which are carried more efficiently through the network infrastructure. Running NDTP on top of TCP/IP will take advantage of this behavior when the NDTP client is performing many NDTP requests. For example, a data repository which is undergoing rapid addition of data records associated with various entities will perform many rapid NDTP_PUT operations to the NDTP server that can all be carried on the same NDTP TCP/IP connection.
UDP/IP: If an NDTP client only performs occasional, isolated NDTP operations, or there are a vast number of clients communicating with an NDTP server, TCP/IP will not offer the best possible performance because many traversals of the network are required to establish a TCP/IP connection, and yet more network traversals are required to transfer actual NDTP messages themselves. For such isolated NDTP transactions, depending upon the application and network infrastructure in use, it is beneficial to have the NDTP server employ UDP/IP, which is a widely available connectionless datagram protocol.
However, UDP/IP does not support reliable data transfer, or any congestion control mechanism. This means that NDTP clients using UDP/IP must implement reliability and congestion control by maintaining transaction timeouts and performing exponential retry backoff, precisely analogous to the congestion control mechanisms implemented by Ethernet and other well known UDP protocols. Those skilled in the art will note that the NDTP protocol is stateless from the standpoint of the NDTP server, which means that there is no congestion control or reliability burden on the server; it is all implemented in a distributed manner by the NDTP UDP/IP clients.
Still Higher Performance (ST): Both TCP/IP and, to a lesser degree, UDP/IP suffer from high host CPU overhead. Like the relatively long latency of TCP/IP, this host CPU consumption is considered just the "cost of doing business" where TCP/IP provides ubiquitous connectivity. If an NDTP server were running in a more constrained environment, where ubiquitous connectivity was not required, its absolute performance could be improved substantially by using a different protocol that is optimized to reduce CPU overhead and latency, such as the Scheduled Transfer (ST) protocol.
None of these network implementation issues are particularly unique to NDTP, however. All similar protocols face similar tradeoffs, and what art exists to improve the performance of such protocols applies fully to NDTP as well.
NDTP Query Processing
Servicing NDTP query requests does not require high latency operations, such as disk I/O. Therefore, the NDTP server network front end preferably services NDTP query requests in a FIFO style by reading the NDTP_GET message, performing the lookup for the entity key in the NDTP server string store, and writing the NDTP_GET_RSP message. Each NDTP query is independent of any other NDTP transactions (other queries or updates), so it is possible to process multiple NDTP queries simultaneously on multiprocessor machines. The NDTP server permits this by not performing multiprocessor locking in the NDTP query processing path.
The current prototype NDTP server preferably does not create multiple read service threads per NDTP connection, so multiprocessing will only occur while processing queries from different NDTP connections. Nonetheless, the NDTP server could be extended to support multiprocessing of NDTP queries from a single NDTP connection if this turned out to be advantageous.
NDTP Update Processing
Unlike NDTP queries, processing NDTP updates requires the high latency operation of committing the change to nonvolatile storage. To maintain high performance on NDTP updates, the NDTP server network front end preferably supports multiple concurrent asynchronous update transactions. Also, each update is preferably performed atomically to avoid creating an inconsistent state in the string store. Currently, the string store supports only a single mutator thread, which means that all NDTP updates are serialized through the string store mutator critical code sections. As is traditional in transactional systems, the string store mutation mechanism is implemented as a split transaction.
When an NDTP update is processed, a call is made to the string store mutation function, which returns immediately indicating either that the mutation is complete, or that the completion will be signaled asynchronously through a callback mechanism. The mutator function might indicate an immediate completion on an NDTP_PUT operation if the entity key/data location specifier mapping was already present in the database. In this case, the network front end will immediately write the update response message back to the client. For updates which are not immediately completed, the network front end maintains a queue of NDTP updates for which it is awaiting completion. When the completion callback is called by the string store log file update mechanism, the network front end writes the NDTP update response messages for all completed updates back to the clients. If no new NDTP update requests are arriving from NDTP clients, and there are some incomplete updates in the update queue, the network front end preferably calls the string store log buffer flush function to precipitate the completion of the incomplete updates in update queue.
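The split-transaction update path just described might be organized roughly as follows. This is a simplified sketch rather than the server's actual code; the pending_upd_t record, the callback name, and the assumption that the echoed request identifier is kept in Network Byte Order are all illustrative.

#include <stdint.h>
#include <stddef.h>
#include <unistd.h>

/* Fixed NDTP header, as defined earlier in this description. */
typedef struct ndtp_hdr {
    uint8_t  op;
    uint8_t  pad[3];
    uint32_t id;       /* echoed exactly as received (already in Network Byte Order) */
    uint32_t size;     /* zero: an update response carries no string area */
} ndtp_hdr_t;

/* Hypothetical record for one update awaiting its log commit. */
typedef struct pending_upd {
    struct pending_upd *next;
    int                 sock;      /* connection to answer once the commit finishes */
    uint32_t            id;        /* request identifier to echo */
    uint8_t             rsp_op;    /* NDTP_PUT_RSP or NDTP_DEL_RSP opcode */
} pending_upd_t;

/* Completion callback: called when the log buffer holding these queued
 * mutations has reached nonvolatile storage; each waiting client then
 * receives its update response. */
static void updates_committed(pending_upd_t *done)
{
    for (pending_upd_t *u = done; u != NULL; u = u->next) {
        ndtp_hdr_t rsp = { u->rsp_op, {0, 0, 0}, u->id, 0 };
        (void)write(u->sock, &rsp, sizeof(rsp));
    }
}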
Multiple Connection Handling
Handling multiple clients in a single server process requires that the server process not block waiting for events from a single client, such as newly received data forming an NDTP request message, or clearing a network output buffer so an NDTP response message can be written. The NDTP server network front end may be conditionally compiled to use either of two standard synchronous I/O multiplexing mechanisms, select or poll, or to use threads to prevent blocking the server waiting for events on individual connections. The select and poll interfaces are basically similar in their nature, but different in the details. When compiled for synchronous I/O multiplexing, the NDTP server network front end maintains an input buffer for each connection. The multiplexing function is called to determine if any of the connections have input available, and if so, it is read into the connection's input buffer. Once a complete NDTP request is in the buffer, it is acted upon. Similarly, the network front end maintains an output buffer for each connection, and if there is still a portion of an NDTP response message to send, and the connection has some output buffer available, more of the NDTP response message is sent.
The threaded version of the NDTP server network front end preferably creates two threads for each NDTP connection, one for reading and one for writing. While individual threads may block as input data or output buffer is no longer available on a connection, the thread scheduling mechanism ensures that if any of the threads can run, they will. The threaded version of the NDTP server is most likely to offer the best performance on modern operating systems, since it will permit multiple processors of a system to be used, and the thread scheduling algorithms tend to be more efficient than the synchronous I/O multiplexing interfaces. Nonetheless, the synchronous I/O multiplexing versions of the NDTP server will permit it to run on operating systems with poor or nonexistent thread support.
A more detailed description of the mapping operation in both a TCP and UDP environment appears below.
TCP Mapping
As those skilled in the art will appreciate, the Transmission Control Protocol (TCP) is a connection-oriented protocol that is part of a universally implemented subset of the Internet Protocol (IP) suite. TCP provides reliable, bi-directional stream data transfer. TCP also implements adaptive congestion avoidance to ensure data transfer across a heterogeneous network with various link speeds and traffic levels. NDTP is preferably carried on TCP in the natural way. An NDTP/TCP client opens a connection with a server on a well-known port. (The well-known TCP (and UDP) port numbers can be selected arbitrarily by the initial NDTP implementer. Port numbers that do not conflict with existing protocols should preferably be chosen.) The client sends NDTP requests 10 to the server 16 on the TCP connection, and receives responses 14 back on the same connection. While it is permissible for a single client 12 to open multiple NDTP/TCP connections to the same server 16, this practice is discouraged to preserve relatively limited connection resources on the NDTP server 16. The asynchronous nature of NDTP should make it unnecessary for a client 12 to open multiple NDTP/TCP connections to a single server 16.
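A minimal NDTP/TCP client connection sketch is shown below; the port number is an arbitrary placeholder (the specification leaves the well-known port to the initial implementer), and error handling is reduced to the essentials.

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define NDTP_TCP_PORT 5001   /* placeholder; the well-known port is left to the implementer */

/* Open one NDTP/TCP connection to an NDTP server; the caller then writes
 * NDTP request messages and reads NDTP response messages on this socket. */
static int ndtp_tcp_connect(const char *server_ip)
{
    int sock = socket(PF_INET, SOCK_STREAM, 0);
    if (sock < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(NDTP_TCP_PORT);
    addr.sin_addr.s_addr = inet_addr(server_ip);

    if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(sock);
        return -1;
    }
    return sock;
}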
If protocol errors are detected on an NDTP/TCP connection, the NDTP/TCP connection should be closed immediately. If further NDTP/TCP communication is required after an error has occurred, a new NDTP/TCP connection should be opened. Some examples of detectable protocol errors include:
• Illegal NDTP message operation code;
• Nonzero String Area Length in NDTP_PUT_RSP or NDTP_GET_RSP;
• Inconsistent String Area Length and String Length(s) in NDTP_GET, NDTP_GET_RSP, NDTP_PUT or NDTP_DEL;
• Unexpected NDTP request identifier by client.
Due to the reliable nature of TCP, NDTP/TCP servers 16 and clients 12 need not maintain any additional form of operation timeout. The only transport errors that can occur will result in gross connection level errors. A client 12 should assume that any NDTP requests 10 for which it has not received responses 14 have not been completed. Incomplete operations may be retried. However, whether unacknowledged NDTP requests 10 have actually been completed is implementation dependent. Any partially received NDTP messages should also be ignored.
UDP Mapping
As those skilled in the art will appreciate, the Unreliable Datagram Protocol (UDP) is a best-effort datagram protocol that, like TCP, is also part of the universally implemented subset of the IP suite. UDP provides connectionless, unacknowledged datagram transmission. The minimal protocol overhead associated with UDP can deliver extremely high performance if used properly. NDTP/UDP clients 12 send UDP datagrams with NDTP request messages 10 to a well-known UDP port (see above). NDTP/UDP servers 16 return NDTP response messages 14 to the client 12 selected local UDP port indicated in the NDTP/UDP datagram containing the requests 10. NDTP/UDP does not require any form of connection or other association to be established in advance. An NDTP interchange begins simply with the client request message 10.
For efficiency, the mapping of NDTP onto UDP permits multiple NDTP messages to be sent in a single UDP datagram. UDP datagrams encode the length of their payload, so when a UDP datagram is received, the exact payload length is available. The recipient of an NDTP/UDP datagram will read NDTP messages from the beginning of the UDP datagram payload until the payload is exhausted. Thus, a sender of an NDTP/UDP datagram is free to pack as many NDTP messages as will fit in a UDP datagram.
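The receive side of this rule can be sketched as follows: after one UDP datagram is received, complete NDTP messages are consumed from its payload until the payload is exhausted. The 12-octet header layout is the one defined above; the dispatch and handler names are illustrative.

#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <arpa/inet.h>

/* Fixed 12-octet NDTP header, as defined earlier. */
typedef struct ndtp_hdr {
    uint8_t  op;
    uint8_t  pad[3];
    uint32_t id;
    uint32_t size;    /* octets following the header, Network Byte Order */
} ndtp_hdr_t;

/* Consume every complete NDTP message packed into one UDP datagram
 * payload; the header is handed to the handler as received (fields
 * still in Network Byte Order). */
static void ndtp_udp_dispatch(const uint8_t *dgram, size_t dgram_len,
                              void (*handle)(const ndtp_hdr_t *hdr, const uint8_t *body))
{
    size_t off = 0;
    while (off + sizeof(ndtp_hdr_t) <= dgram_len) {
        ndtp_hdr_t hdr;
        memcpy(&hdr, dgram + off, sizeof(hdr));
        uint32_t body_len = ntohl(hdr.size);
        if (off + sizeof(hdr) + body_len > dgram_len)
            break;    /* inconsistent NDTP message length and UDP datagram length */
        handle(&hdr, dgram + off + sizeof(hdr));
        off += sizeof(hdr) + body_len;
    }
}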
The largest possible UDP datagram payload is presently slightly smaller than 64K bytes. In addition, there may be a performance penalty sending UDP datagrams that are larger than the maximum datagram size allowed by the physical network links between the sender and intended recipient. IP provides mechanisms for discovering this maximum transfer size, called the Path Maximum Transfer Unit (Path MTU), but a discussion of these mechanisms is beyond the scope of this specification. An implementation of NDTP/UDP should preferably respect these datagram size limitations.
Unlike TCP, UDP does not provide reliable data delivery. Therefore, an NDTP/UDP client 12 implementation should implement a timeout mechanism to await the response for each outstanding NDTP request 10. The exact duration of this response timer is implementation dependent, and may be set adaptively as a client 12 receives responses from a server 16, but a reasonable default maximum value is preferably 60 seconds. If a response 14 is not received within the response timeout, the client 12 may retransmit the request 10. NDTP/UDP servers 16 need not maintain any timeout mechanisms.
Depending upon the exact timeout values selected, the client 12 retry mechanism may place some requirements on a client's 12 use of the NDTP request identifier 18 field. If the response timer is shorter than the maximum lifetime of a datagram in the network, it is possible that a delayed response will arrive after the response timer for the associated request has expired. An NDTP/UDP client 12 implementation should ensure that this delayed response is not mistaken for a response to a different active NDTP request 10. Distinguishing current responses from delayed ones is called antialiasing. One presently preferred way to perform antialiasing in NDTP/UDP is to ensure that NDTP request identifier 18 values are not reused more frequently than the maximum datagram lifetime.
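One way a client might combine the response timer, retransmission, and request identifier antialiasing rules is sketched below; the helper functions, the initial two second timer, the five attempt limit, and the doubling backoff are illustrative assumptions, with only the 60 second ceiling taken from the suggested default above.

#include <stdint.h>
#include <stdbool.h>

/* Assumed helpers, not defined here: transmit one request datagram
 * carrying the given request identifier, and wait up to timeout_s
 * seconds for a response datagram that echoes that identifier
 * (responses carrying any other identifier are simply ignored). */
bool ndtp_udp_send(uint32_t id);
bool ndtp_udp_wait_rsp(uint32_t id, unsigned timeout_s);

static uint32_t next_request_id;   /* not reused within the maximum datagram lifetime */

static bool ndtp_udp_transact(void)
{
    uint32_t id = next_request_id++;
    unsigned timeout_s = 2;                  /* illustrative initial response timer */

    for (int attempt = 0; attempt < 5; attempt++) {
        if (!ndtp_udp_send(id))
            return false;
        if (ndtp_udp_wait_rsp(id, timeout_s))
            return true;
        timeout_s *= 2;                      /* exponential retry backoff */
        if (timeout_s > 60)
            timeout_s = 60;                  /* suggested default maximum */
    }
    return false;                            /* caller may fall back to NDTP/TCP */
}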
NDTP/UDP client 12 implementations that use the NDTP request identifier 18 for antialiasing should ignore (i.e., skip) NDTP messages within an NDTP/UDP datagram with invalid NDTP request identifier 18 values. Client 12 or server 16 NDTP/UDP implementations detecting any other protocol error should also preferably discard the remainder of the current NDTP/UDP datagram without processing any further NDTP requests from that datagram. Some examples of such detectable errors include:
• Illegal NDTP message operation code;
• Nonzero String Area Length in NDTP_PUT_RSP or NDTP_GET_RSP;
• Inconsistent String Area Length and String Length(s) in NDTP_GET, NDTP_GET_RSP, NDTP_PUT or NDTP_DEL;
• Inconsistent NDTP message length and UDP datagram length.
Because NDTP/UDP messages are limited to the length of a single UDP datagram payload, NDTP/UDP cannot be used to transfer long NDTP messages. For example, it would be very difficult to send an NDTP_GET message with NDTP/UDP for a 64K byte identifier string. This case is avoidable by a client 12 realizing that an NDTP message is too long to send as a UDP datagram and using NDTP/TCP instead. However, a greater limitation is that NDTP currently provides no mechanism for an NDTP server 16 to indicate that a response is too large to fit in a UDP datagram. In this case, the NDTP server 16 should not send a response 14, and it may or may not choose to complete the request 10. The recovery mechanism in this case preferably is, after several unsuccessful attempts to use NDTP/UDP, a client 12 may try again with NDTP/TCP.
Because UDP does not provide any form of congestion avoidance, it is possible that the simple retry strategy specified for NDTP/UDP can create network congestion. Network congestion can cause a severe degradation in the successful delivery of all network traffic (not just NDTP traffic, nor just the traffic from the particular client/server 12, 16 pair) through a congested network link. Congestion will occur when an NDTP/UDP implementation is sending datagrams faster than can be accommodated through a network link. Sending a large number of NDTP/UDP datagrams all at once is the most likely way to trigger such congestion. Sending a single NDTP/UDP datagram, assuming it is smaller than the Path MTU, and then waiting for a response 14 is unlikely to create congestion. Therefore, the use of NDTP/UDP should be confined to contexts where clients 12 send few outstanding requests at a time, or where network congestion is avoided through network engineering techniques. Those skilled in the art will appreciate that network congestion is a highly dynamic property that is a function of network traffic from all sources through a network link and will vary over time over any given network path. An NDTP/UDP client 12 implementation can recover from network congestion by switching to NDTP/TCP after several failed retries using NDTP/UDP. Failure due to network congestion may be indistinguishable from failure due to UDP packet size limitations, but since the recovery strategy is the same in both cases, there is no need to distinguish these cases.
NDTP/UDP Congestion Avoidance
Given the stateless, transactional nature of NDTP, NDTP/UDP generally performs much better than NDTP/TCP. This performance improvement is measurable both in terms of the maximum sustainable transaction rate of an NDTP server 16, and the latency of a single response to an NDTP client 12. In the same way as the Domain Name Service (DNS), NDTP fits naturally in the UDP model. It is a working assumption of NDTP (and DNS) that for every NDTP transfer, there will be an associated transfer of real data that is an order of magnitude or more greater in size than the NDTP protocol traffic. This property will naturally limit the amount of NDTP traffic on a network. However, in applications where NDTP traffic reaches high levels, particularly at network 'choke points' which are not within the control of network engineers, it may be desirable to support a congestion avoidance mechanism for NDTP/UDP. However, those skilled in the art will appreciate that the other main future requirement of NDTP, security (described below), implies an existing, durable association between NDTP clients 12 and NDTP servers 16. This association is much like (and in the case of SSL, it is) a network connection. Therefore, depending upon what security technology is applied, developing a congestion avoidance mechanism for NDTP/UDP may be an irrelevant exercise.
Server Redirection Mechanism
NDTP provides two mechanisms for server redirection. The redirection mechanisms allow cluster and hierarchical topologies, and mixtures of such topologies (described in detail below). The first redirection mechanism supported by NDTP, embedded redirection links, uses an application specific convention to return redirection pointers as NDTP data location strings. For example, if location strings are W3C URLs, a URL with the schema ndtp: could be a server indirection pointer. An NDTP_GET_RSP message may contain any mixture of real data location strings and NDTP server redirection pointers. In this case, the client must issue the same NDTP_GET query message to other NDTP servers indicated by the redirection pointers. The total set of data location strings associated with the supplied identifier string is the collection of all the data location strings returned from all the NDTP servers queried. The embedded redirection link technique does not require any specific NDTP protocol support. Therefore, it could be used within the
NDTP protocol as is, and does not require further description in this specification.
The second redirection mechanism, which is specified as a future extension of NDTP, is having the server return an NDTP_RDR_RSP message in response to an NDTP request for which the NDTP server has no ownership of the supplied identifier string. Those skilled in the art will note that unlike the embedded redirection links mechanism, the NDTP_RDR_RSP mechanism applies to all NDTP requests, not just NDTP_GET.
As mentioned above, the second redirection mechanism has two variants. The first variant of the NDTP_RDR_RSP function mechanism specifies a well-known function that all NDTP server and client implementations know when they are programmed, and the NDTP_RDR_RSP message carries a table of NDTP server URLs. The format of the NDTP_RDR_RSP message with an NDTP server URL table is shown in FIG. 9(a).
The appropriate NDTP server is selected from the table in the NDTP_RDR_RSP message by applying a well-known function to the identifier string and using the function result as an index into the NDTP server table. The well-known function preferably applied is the hashpjw function presented by Aho, Sethi and Ullman in their text Compilers, Principles, Techniques and
Tools:

uint32_t hash(uint8_t *s, uint32_t slen, uint32_t size)
{
    uint32_t g;
    uint32_t i;
    uint32_t h = 0;
    uint8_t  c;

    for (i = 0; i < slen; i++) {
        c = s[i];
        h = (h << 4) + c;
        g = (h & 0xf0000000);
        if (g) {
            h ^= g >> 24;
            h ^= g;
        }
    }
    return h % size;
}
In this case, the size parameter is the number of elements in the NDTP server URL table returned in the NDTP_RDR_RSP message. For the hashpjw function to behave correctly, the size parameter must be a prime number, therefore the NDTP server URL table must also have a prime number of elements. Those skilled in the art will appreciate that the same NDTP server may appear multiple times in the NDTP server URL table. For example, if the server URL table has 2039 elements, by putting one NDTP server URL in the first 1019 table elements, and a second NDTP server URL in the second 1020 table elements, the responsibility for the index string space will be split roughly in half. The second variant of the NDTP_RDR_RSP function mechanism specifies that a general function description will be sent to the NDTP client in the NDTP_RDR_RSP message. The NDTP client will apply this function to the identifier string and the output of the function will be the NDTP server URL to which to send NDTP requests for the particular identifier string. The advantage of this technique over the well-known function approach is that it allows application-specific partitions of the identifier string space. This can permit useful administrative control. For example, if General Electric manages all identifiers beginning with the prefix "GE", a general function can be used to make this selection appropriately. The disadvantage of using a general function is it may be less efficient to compute than a well-known function.
There are a variety of possible mechanisms for sending function descriptions. NDTP is expected to be applied in environments that make extensive use of the Java programming platform. Therefore the
NDTP_RDR_RSP mechanism preferably uses a feature of the Java programming language called "serialized representation" to communicate generalized functions in the NDTP_RDR_RSP message. A serialized form of a Java object is a stream of bytes that represents the precise state of the object, including its executable methods. For example, the Java Remote Method
Invocation (RMI) mechanism uses serialized objects to send executable code to a remote platform for execution. The NDTP_RDR_RSP message contains the serialized form of an object implementing this Java interface:
interface NDTPRedirectFunction {
    String selectServer(byte[] identifier);
}
The format of the NDTP_RDR_RSP message with a Java Serialized form of the NDTP redirection function is specifically identified in FIG. 9(b).
The NDTP server redirection mechanism also permits construction of NDTP server clusters (described below). It is expected that the identifier string hash function will be defined at the time NDTP is implemented, but the actual list of NDTP servers 90 will change from application to application and within a single application throughout the lifetime of the system. Therefore, it is necessary for clients to be able to discover updated NDTP server lists, and any other relevant dynamic parameters of the server selection function as these inputs change.
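As a concrete client-side illustration of the well-known function variant described above, the sketch below indexes an NDTP server URL table with the hashpjw-style hash; the ndtp_server_table_t type and select_server wrapper are hypothetical names, and parsing of the NDTP_RDR_RSP message itself is omitted.

#include <stdint.h>

/* hash() is the hashpjw-style function shown earlier in this section. */
uint32_t hash(uint8_t *s, uint32_t slen, uint32_t size);

/* Hypothetical in-memory view of the server URL table carried in an
 * NDTP_RDR_RSP message (parsing of the message itself is not shown). */
typedef struct ndtp_server_table {
    uint32_t     entries;       /* must be a prime number, as noted above */
    const char **server_urls;   /* entries NDTP server URLs */
} ndtp_server_table_t;

/* Pick the NDTP server responsible for a given identifier string. */
static const char *select_server(const ndtp_server_table_t *tbl,
                                 uint8_t *identifier, uint32_t identifier_len)
{
    uint32_t i = hash(identifier, identifier_len, tbl->entries);
    return tbl->server_urls[i];
}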
Hierarchical Server Topology
While the NDTP server topology supported by the server redirection mechanism described above and shown in FIGS. 9(a) and 9(b) is an extremely powerful and general scaling technique, suitable for diverse topology deployments, some applications might still benefit from a specifically hierarchical server topology. An NDTP server hierarchy 100, such as that shown in FIG. 10, permits identifier/location association data to be owned and physically controlled by many different entities. An NDTP server cluster should be managed by a single administrative entity 102, and the distribution of data can be for performance and scaling purposes. Furthermore, a server hierarchy would provide some fault isolation so portions of the identifier/location association data can be accessed and updated in the presence of failures of some NDTP servers 104. Finally, an NDTP server hierarchy can localize NDTP update operations (NDTP_PUT and NDTP_DEL), which can improve performance and reduce network load.
A hierarchical NDTP server topology also allows organizations to maintain their own local NDTP server 104 or NDTP server cluster 102 that manages associations to data locations that are within the organizations' control. Upper tier NDTP servers 108 would be used to link the various leaf
NDTP servers 104.
Server Constellations
The NDTP server organization also allows NDTP servers to be combined in various ways to build server constellations that offer arbitrary server performance scalability and administrative control of the location of portions of the identifier/data location relationship mappings. Figure 11 illustrates an NDTP server constellation 110 as it relates to a client 112 and a data repository 114. In FIG. 10, the client 112 and data repository 114 of FIG. 11 were merged into the single client entity 106 for ease of discussion. They are now shown separately in order to illustrate the storage and retrieval of data in a distributed data collection.
As shown in FIG. 11, a client 112 consults the server constellation 110, which may be constructed in either of two forms (see FIGS. 12 and 13), and which returns location strings in response to a client 112 request. Once the client 112 has the location string for a particular unit of data, the client 112 contacts and retrieves information directly from the data repository 114. In one embodiment, if the client 112 contains a data repository 114, internal application logic would facilitate this interaction. Those skilled in the art will appreciate that the term "data collection" is being employed rather than the term "database" because "database" frequently invokes images of Relational Database Systems (which is only one application of the protocol); an NDTP data collection could just as easily be routing tables as it could be files or records in an RDBS database.
NDTP server constellations 110 preferably have two basic organizational paradigms: Client-Centric and Server-Centric. NDTP supports both by design, and both approaches apply to all aspects of managing the relationships between identifiers and locations, such as data retrieval, index manipulation, and server redirection. Each will be discussed separately below.
Client-Centric Approach

The first basic pattern that NDTP supports is driven by the client 112, and can be called "client-centric". Referring to FIG. 12, a single client (not shown) asks a server 120a in the server constellation 110 for operations that the client desires executed (represented by arrow 1 in FIG. 12). If the client does not receive the data requested, it will receive a redirection response message (NDTP_RDR_RSP) from the contacted server 120a (arrow 2). The client then uses the information it receives to ask another server 120b for the operations the client wants to initiate (arrow 3). A successful response from the second server 120b is then sent to the client (arrow 4).
This design constructs operating patterns for (1) redirection, (2) index operations, and (3) hierarchical or cluster topologies. The important point is that the Network Distributed Tracking Protocol is designed to support highly configurable methods for processing index-related operations.
NDTP supports two specific redirection mechanisms, which are not mutually exclusive and may be combined in any way within a single NDTP server constellation 110. This formation may increase performance when many clients (not shown) participate, since client processing is emphasized rather than server processing. The first NDTP redirection mechanism uses a distinctively encoded location string for each NDTP server 120a,b that contains additional location strings associated with the identifier string supplied in the NDTP request 122a,b. This is an embedded redirection link. For example, if location strings are some form of HTTP URL, a URL with the schema specifier ndtp: would indicate a redirection. Using this scheme, the location strings associated with an identifier string may be spread among multiple NDTP servers 120a,b. In addition to redirection, in FIG. 12, all index manipulation operations (NDTP_GET, NDTP_PUT, NDTP_DEL) continue to apply, but they are directed at the correct NDTP server 120b to which they apply.
The second NDTP redirection mechanism uses an NDTP_RDR_RSP message to indicate that the server 120a to which the NDTP request 122a was directed does not contain any of the location strings associated with the identifier string supplied in the NDTP request 122a. The NDTP_RDR_RSP message contains all the information required for the originator of the NDTP request 122a to reissue the original NDTP request 122b to a different NDTP server 120b that does have location strings associated with the identifier string supplied in the NDTP request 122b. This information may be an array of NDTP server hosts, from which one is selected by applying a well-known function to the identifier string supplied in the NDTP request 122b, or it may be a communicated function to apply, together with a list or other description of the NDTP server hosts from which to choose, as described above.
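The following C fragment is a hypothetical client-side sketch of the two redirection mechanisms just described. The helper names are illustrative and do not appear in the appended source listing; select_server refers to the selection sketch shown earlier.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* From the earlier selection sketch: maps an identifier to a server URL. */
extern const char *select_server(const char **url_table, uint32_t table_size,
                                 const uint8_t *identifier, size_t id_len);

/* Mechanism 1: a location string carrying the ndtp: schema is an embedded
 * redirection link naming another NDTP server that holds further location
 * strings for the same identifier string. */
static int
is_embedded_redirect(const char *location_string)
{
    return strncmp(location_string, "ndtp:", 5) == 0;
}

/* Mechanism 2: an NDTP_RDR_RSP indicates the contacted server holds no
 * locations for the identifier.  The client applies the well-known (or
 * communicated) function to the identifier string and reissues the original
 * request, unchanged, to the server selected from the table carried in the
 * response. */
static const char *
rdr_rsp_target(const char **url_table, uint32_t table_size,
               const uint8_t *identifier, size_t id_len)
{
    return select_server(url_table, table_size, identifier, id_len);
}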
Figure 12 illustrates a cluster topology for client interaction with NDTP servers 120. A single client queries a first server 120a (Server0), learns of a new index location (Server1), and then contacts that server 120b (Server1) for the operations it wishes to execute on the index that the client identifies. The basic idea is that a client asks a server 120a to process an index operation. If the contacted server 120a does not have all the information, as for example in a redirect, then it passes the request to another server 120b. If the second server 120b is appropriate, it responds appropriately; otherwise it passes the request on to another server (not shown), and so on. Figure 12 could also illustrate a hierarchical topology if a client (not shown) contacted another client in a handoff as shown in FIG. 10, where a client 106 "asks up" to another client 106, and so on. Behind the scenes, the server constellation 110 could also be using a hierarchical organization or a cluster organization for managing indices. The important point of this topology is that it pushes processing emphasis toward clients (not shown) rather than toward servers 120a,b. Such protocol design has scale implications as the number of participating machines/mechanisms increases, since it distributes aggregate processing.
Server-Centric Approach
The second basic pattern that the Network Distributed Tracking Protocol provides is a "Server-Centric Approach". Figure 13 shows the server constellation 110 characterizing "server-centric" functionality. In this figure, an NDTP server 130a (Server0) receives a request 132a from a client (not shown). The server 130a (Server0) passes the request to a second server 130b (Server1), which is an appropriate server for the process, and the second server 130b returns a response 134a to the first server 130a (Server0). If the second server 130b (Server1) were not appropriate, it could pass the request to another server (not shown), and so on. Each NDTP server 130a,b will combine the results of NDTP requests 132a,b it has made of other NDTP servers 130a,b with whatever responses 134a,b it generates locally for the original NDTP request 132a, and the combined response 134b will be the appropriate response for the original NDTP request 132a. This design constructs operating patterns for (1) index operations and
(2) hierarchical or cluster topologies. The important point is that the Network Distributed Tracking Protocol is designed to support highly configurable methods for processing index-related operations, but this method emphasizes server processing rather than client processing. In FIG. 13, all index manipulation operations (NDTP_GET, NDTP_PUT, NDTP_DEL) continue to apply, but they are directed at the correct NDTP server 130a,b to which they apply.
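As a rough illustration of the combining step in the server-centric pattern, the following hypothetical C fragment merges a server's locally held location strings with those returned by a peer server for the same identifier, producing the single result set sent back to the original requester. The types and names are illustrative assumptions only; in the appended source listing the server side is built on the string store (string_store.c) rather than on this structure.

#include <stddef.h>

/* A set of location strings associated with one identifier string. */
typedef struct location_set {
    size_t       count;
    const char **locations;
} location_set_t;

/* Combine local results with a peer server's results for the same
 * identifier, bounded by the caller's output buffer capacity. */
static size_t
merge_location_sets(const char **out, size_t capacity,
                    const location_set_t *local, const location_set_t *peer)
{
    size_t n = 0;
    size_t i;

    for (i = 0; i < local->count && n < capacity; i++) {
        out[n++] = local->locations[i];
    }
    for (i = 0; i < peer->count && n < capacity; i++) {
        out[n++] = peer->locations[i];
    }
    return n;
}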
Figure 13 illustrates a hierarchical topology for client interaction with NDTP servers 130. A single client queries a first server 130a (Server0), which is not appropriate, and so the first server 130a (not the client) itself contacts an appropriate server 130b (Server1) for the operations it "passes through" to execute on the index that the client has identified. Alternatively, FIG. 13 could illustrate a cluster topology if a server 130a contacted another server 130b in what is known as a "peer" handoff. The important point of this topology is that it pushes processing emphasis toward servers 130a,b rather than toward clients. Since index processing services can be centralized, the indices can be administered more conveniently in certain cases.
The simplest NDTP server constellation 110 is a single server 130, and the protocol is designed to permit massive scale with a single or simple server constellation. Highly configurable installations are possible using "client-centric" or "server-centric" techniques. NDTP server constellations 110 composed of more than one NDTP server may use any combination of the two approaches for performance optimization and data ownership properties. Client-centric and server-centric approaches can be used to build NDTP server clusters, NDTP server trees, NDTP server trees of NDTP server clusters, or any other useful configuration. NDTP design thus explicitly addresses the emerging "peer-to-peer" topologies called "pure" and "hybrid". The "pure" peer-to-peer approach emphasizes symmetric communication among peers, and is achievable through the "server-centric" approach. The "hybrid" peer-to-peer approach emphasizes asymmetric communication among non-peer participants, and is achievable through the "client-centric" approach. Beyond the pure and hybrid approaches that NDTP allows, as described above, NDTP permits any additional mixtures between client-centric and server-centric approaches to provide superior configurability and performance tuning.
Security
NDTP preferably has no provisions for security. Three key features of security should therefore be provided:
• Data privacy (encryption)
• Client 12 authentication
• Client 12 authorization
NDTP/TCP will be extended using SSL/X.509 to support these security features in a straightforward, 'industry standard' way.
Adding security to NDTP/UDP also requires technology other than SSL. For example, IPSec supports securing all IP traffic, not just TCP, between two endpoints. IPSec is a somewhat more heavyweight technology than SSL, and the rate of adoption in industry is somewhat slow. Nonetheless, it can provide the relevant capabilities to NDTP/UDP.
Additional Transport Layers
The early-adopter portion of the industry is in a state of turmoil regarding network transport protocols. On one hand, TCP has provided decades of solid service, and is so widely implemented that the mainstream computer industry could not imagine using another protocol to replace it. On the other hand, TCP lacks several features that may be necessary to enable the next step in network applications. In particular, the TCP design assumed pure software implementations by relatively powerful host computers. However, developments in network technology have increased the packet rate that a TCP implementation must handle to deliver full network speed beyond the capabilities of even increasingly powerful host computers. To take the next step, much of the packet processing work must be off-loaded to hardware, and TCP's design makes this very difficult. It is unclear whether it will become possible to implement the relevant portions of TCP in hardware in a timely fashion. If this does not happen, one of the many new transport layers currently under development (ST, SCTP, VI, etc.) may emerge as a market leader in high performance networking. In this case, a layering of NDTP on top of a new hardware-accelerated transport would permit NDTP servers to deliver greatly increased transaction rates.
Even with the use of a hardware-accelerated transport layer, however, the only benefit to a typical NDTP client would be lower cost of service due to cheaper NDTP server platform requirements. On the flip side, NDTP clients could likely still use a cheaper software implementation of the new transport because of individual clients' modest performance demands.
As can be seen, the Network Distributed Tracking Protocol is a networking protocol that runs on top of any stream (e.g. TCP) or datagram (e.g. UDP) network transport layer. The goal of NDTP is to support a network service that efficiently manages mappings from each individual key string, an identifier, to an arbitrary set of strings, locations. NDTP permits protocol-participating clients to add and remove identifier/location associations, and to request the current set of locations for an identifier from protocol servers. NDTP is designed for use in the larger context of a distributed data collection. As such, it supports an architecture in which information about where data associated with particular application entities is stored can be managed and obtained independently of the data itself. One way to understand this is as a highly dynamic DNS for data. DNS maintains a mapping between names and machines. NDTP and its associated servers maintain a mapping between entity identifiers and data locations. The identifier/location mapping maintained by NDTP servers is much more dynamic (more frequent updates) than the domain name/IP address mapping maintained by DNS. NDTP is designed to support very fast, very diverse, and very large scale mapping manipulations.
Regardless of the expected system context of NDTP in a distributed data collection, those skilled in the art will appreciate that NDTP can be used for any application in which one-to-zero or one-to-many associations among strings are to be maintained and accessed on a network. In applications of NDTP other than distributed databases, the term identifier is likely to make sense in most cases, but the term location may not. In any context, however, although NDTP supports identifier and location strings of up to 2^32-4 bytes in length, it is a general assumption that the strings are typically short.
Those skilled in the art will note that the invention provides for the management and manipulation of indices and their associated relationships. Even more importantly, it is the manipulation of dynamic and spontaneous relationships between indices and locations, not the indices and locations themselves, that is of core significance. The Network Distributed Tracking Protocol was written to manipulate these relationships, of which indices (identifiers) and locations are components of the aggregate solution.
It is to be understood that a wide range of changes and modifications to the embodiments described above will be apparent to those skilled in the art, and are contemplated. It is therefore intended that the foregoing detailed description be regarded as illustrative, rather than limiting, and that it be understood that it is the following claims, including all equivalents, that are intended to define the spirit and scope of the invention.
Manifest of OverX NDTP Server (OxServer) files:
buff_udp.c:     buffered UDP sockets implementation
buff_udp.h:     buffered UDP sockets interface
dq.h:           doubly linked, circular queue package
ndtp.h:         NDTP protocol definition
ndtpc.c:        NDTP server test client top level
ndtpc.h:        NDTP server test client top level definitions
ndtpc_tcp.c:    NDTP server test client TCP-specific code
ndtpc_udp.c:    NDTP server test client UDP-specific code
ndtpd.c:        NDTP server top level
ndtpd.h:        NDTP server top level definitions
ndtpd_sync.c:   NDTP server synchronous I/O multiplexing version
ndtpd_sync.h:   NDTP server synchronous I/O multiplexing version definitions
ndtpd_thr.c:    NDTP server threaded I/O version
ndtpd_thr.h:    NDTP server threaded I/O version definitions
ndtpd_udp.c:    NDTP server UDP-specific code
ndtpd_udp.h:    NDTP server UDP-specific code definitions
ox_mach.h:      OverX machine specific abstraction definitions
ox_thread.h:    OverX thread abstraction definitions
ss_test.c:      String store test driver
string_store.c: String store implementation
string_store.h: String store interface definitions
timestamp.h:    Performance testing time stamp definitions
buff_udp.c Page 1
* OverX Network Distributed Tracking Protocol Server
* Copyright 2000 OverX, Inc. *
* This program source is the property of OverX, Inc. and contains
* information which is confidential and proprietary to OverX, Inc.. No
* part of this source may be copied, reproduced, disclosed to third parties,
* or transmitted in any form or by any means, electronic or mechanical
* for any purpose without the express written consent of OverX, Inc.. *
*/ static const char *buff_udp_id = "$Id: buff_udp.c,v 1.8 2000/02/10 00:10:19 steph Exp $ (OVERX)";
#include "ox_mach.h" /* includes inttypes */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <assert.h>
#ifdef USE_POLL
#include <poll.h>
#endif
#include <sys/types.h>
#include <sys/time.h>
#include <sys/param.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include "buff_udp.h"

#define UDP_MTU 1400
/*
* Internal function prototypes */ static int udp_do_rx(buff_udp_t *udp) ; buff_udp_t * init_udp (int port, int server, int sockBufSize)
{ int i; buff_udp_t *udp = (buff_udp_t *) malloc (sizeof (buff_udp_t) ) ; struct sockaddr_in servAddr; if (udp) { udp->SOCk = socket (PF_INET, SOCK_DGRAM, 0) ; if (udp->sock < 0) { perror ( "can ' t open socket"); exit (11) ; } bzero ( (caddr_t) &servAddr, sizeof (servAddr) ) , servAddr. sin_family = PF_INET;
Copyright 2000 OverX, Inc., All Rights Reserved buff_udp.c Page 2
servAddr . sin_port = htons (server ? port : 0 ) ; servAddr . sin_addr . s_addr = htonl (INADDR_ANY) ; if (bind (udp->sock, (struct sockaddr * ) &servAddr, sizeof (servAddr) ) < 0) ( perror ( "can ' t bind socket " ) ; exi ( 12 ) ; } if ( 0 ) { struct sockaddr_in addr; i = sizeof (addr) ; getsockname(udp->sock, (struct sockaddr *) &addr, &i) ; printf ( "bound to %s:%d\n", inet_ntoa(addr. sin_addr) , ntohs (addr.sin_port) ) ; } i = sockBufSize,- if (setsockopt(udp->sock, SOL_SOCKET, SO_SNDBUF, (void *) &i, sizeof(i)) < 0) { perror ( " setsockopt S0_SNDBUF" ) ; } i = sockBufSize; if ( setsockopt (udp->sock, S0L_S0CKET, SO_RCVBUF, (void *) &i, sizeo (i) ) < 0) { perror ( " setsockopt S0_RCVBUF" ) ; } udp->port = port; udp->rxBuffSize = UDP_MTU; udp->rxPacketLen = 0; udp->rxConsumer = 0; udp->rxPackets = 0; udp->rxBuff = (uint8_t *) malloc (udp->rxBu fSize) ; if ( !udp->rxBuff) { free (udp) ; udp = NULL; } else { udp->txBu£fSize = UDP_MTU; udp->txProducer = 0; udp->txPackets = 0; udp->txBytes = 0; udp->txBu£f = (uint8_t *) malloc (udp->txBuffSize) ; if (!udp->txBuff) { free(udp->rxBuf ) ; free (udp) ; udp = NULL; } } } return udp; ) int udp_rx_avail (buff_udp_t *udp, int wait)
{
Copyright 2000 OverX, Inc., All Rights Reserved buff_udp.c Page 3
if (udp->rxConsumer != udp->rxPacketLen) ( return 1 ; #ifdef ndef /*
* This was one way to do it, but we don't seem to get blasted out of our
* recvfrom when we get a signal (I believe there's a way to make that
* happen, but I don't remember at the moment), so instead of doing an
* rx when we want to wait, we still do a poll/select. It's slightly
* less efficient at times of lower load because we do two syscalls
* instead of one, but it shouldn't really have a big impact on performance */
} else if (wait) { int status = udp_do_rx(udp) ,- if (status) { fprintf (stderr, "udp_rx_avail: error waiting: %s\n", strerror (status) ) ; } return status == 0; #endif
} else { #ifdef USE_POLL int ready; struct pollfd pollfdsflj; pollfds[0] .fd = udp->sock; pollfds [0] .events = POLLIN; pollfds [0] .revents = 0; ready = poll (pollfds, 1, wait ? -1 : 0) ; if (ready < 0 && errno != EAGAIN && errno != EINTR) { fprintf (stderr, "poll error: %s\n" , strerror (errno) ) ,- assert (0) ; /* BUG */ } return ready > 0; #else fd_set readFds int maxSelect; struct timeval tv;
FD_ZERO (&readFds) ;
FD_SET(udp->sock, &readFds) ; tv. tv_sec = 0 ; tv . v_usec = 0 ; maxSelect = select (udp->sock + 1, &readFds, NULL, NULL, wait ? NULL : &tv) ; if (maxSelect < 0 && errno != EINTR) { fprintf (stderr, "error in select: %s\n" , strerro (errno) ) ; assert (0) /* BUG */ } return maxSelect > 0,- #endif
} }• ssize_t udp_rx (buff_udp_t *udp, ssize_t be, struct sockaddr_in *addr, void **data)
{ if (udp->rxConsumer == udρ->rxPacketLen) {
Copyright 2000 OverX, Inc., All Rights Reserved buff_udp.c Page 4
int status = udp_do_rx(udp) ,- if (status) { fprintf (stderr, "udp_rx: error in do_rx: %s", strerror (status) ) ; return 0; } } bcopy( (caddr_t) &udp->rxAddr, (caddr_t) addr, sizeo (udp->rxAddr) ) ; *data = &udp->rxBuff [udp->rxConsumer] ; return udp_more_rx(udp, be); } static int udp_do_rx (buff_udp_t *udp)
{ int addrSize = sizeof (udp->rxAddr) ; bzero( (caddr_t) &udp->rxAddr, addrSize) ,- udp->rxAddr.sin_family = PF_INET; udp->rxAddr.sin_port = htons (udp->port) ; udp->rxAddr. sin_addr. s_addr = htonl (INADDR_ANY) ; udp->rxConsumer = 0 ; udp->rxPacketLen = recvfrom(udp->sock, udp->rxBuff, udp->rxBuffSize, 0,
(struct sockaddr *) &udp->rxAddr, kaddrSize) ; if (udp->rxPacketLen == -1) { printf ("blasted out of RX\n"); // BOGUS udp->rxPacketLen = 0 ; return errno; } udp->rxPackets++; return 0; }
ssize_t udp_more_rx (buff_udp_t *udp, ssize_t be)
{ ssize_t bcO = be; if (udp->rxConsumer + be > udp->rxPacketLen) { bcO = udp->rxPacketLen - udp->rxConsumer; } udp->rxConsumer += bcO ; return bcO; } ssize_t udp_tx_left (buff_udp_t *udp)
C return udp->txBu fSize - udp->txProducer; } caddr_t udp_tx (buf _udp_t *udp, ssize_t be, struct sockaddr_in *addr)
{ caddr_t p;
Copyright 2000 OverX, Inc., All Rights Reserved buff_udp.c Page 5
if (be > udp->txBuffΞize) { fprintf (stderr, "upd_tx: data too large for packet (%d versus %d)\n", (int) be, (int) udp->txBuf Size) ; return NULL; } if (udp->txProducer + be > udp->txBuffSize
|| (bcmp( (caddr_t) &udp->txAddr, (caddr_t) addr, sizeof (udp->txAddr) ) && udp->txProducer != 0) ) { udp_tx_flush (udp) ; } if (udp->txProducer == 0) { bcopy( (caddr_t) addr, (caddr_t) &udp->txAddr, sizeof (*addr) ) ; } p = &udp->txBuff [udp->txProducer] ,- udp->txProducer += be; return p; } void udp_untx (buff_udp_t *udp, ssize_t be)
{ if (be > udp->txProducer) { fprintf (stderr, "upd_untx: too much (%d only have %d\n" , (int) be, (int) udp->txProducer) ; } udp->txProducer -= be; } void udp_tx_flush (buf _udp_t *udp)
{ if (udp->txProducer != 0) { ssize_t be; be = sendto(udp->sock, udp->txBuff, udp->txProducer, 0,
(struct sockaddr *) &udp->txAddr, sizeof (udp->txAddr) ) ; if (be < 0) { perror ( "udp_tx: error in sendto"); } else if (be != udp->txProducer) { fprintf (stderr, "short UDP sendto (tried %d, sent %d)\n", (int) udp->txProducer, (int) be); } else { udp->txPackets++; udp->txBytes += be; } udp->txProducer = 0; } }
Copyright 2000 OverX, Inc., All Rights Reserved buff_udp.h Page 1
/*
* OverX Network Distributed Tracking Protocol Server
* Copyright 2000 OverX, Inc. *
* This program source is the property of OverX, Inc. and contains
* information which is confidential and proprietary to OverX, Inc.. No
* part of this source may be copied, reproduced, disclosed to third parties,
* or transmitted in any form or by any means, electronic or mechanical
* for any purpose without the express written consent of OverX, Inc.. *
* $Id: buff_udp.h,v 1.5 2000/01/26 21:20:28 steph Exp $
*/
#ifndef _BUFF_UDP_H
#define _BUFF_UDP_H
#include "ox_mach.h" /* includes inttypes */
#include <sys/types.h>
#include <sys/socket.h>
/*
* I think this thing is going to be opaque to the client, but
* I haven't committed to it, so it's here. Certainly, the client
 * doesn't need to (and shouldn't) modify anything in it
 */
typedef struct buff_udp {
    int                 sock;
    uint16_t            port;
    ssize_t             rxBuffSize;
    ssize_t             rxPacketLen;
    ssize_t             rxConsumer;
    caddr_t             rxBuff;
    int                 rxPackets;
    struct sockaddr_in  rxAddr;
    ssize_t             txBuffSize;
    ssize_t             txProducer;
    caddr_t             txBuff;
    struct sockaddr_in  txAddr;
    int                 txPackets;
    ssize_t             txBytes;
} buff_udp_t;

buff_udp_t *init_udp(int port, int server, int sockBufSize);
int         udp_rx_avail(buff_udp_t *udp, int wait);
ssize_t     udp_rx(buff_udp_t *udp, ssize_t bc, struct sockaddr_in *addr,
                   void **data);
ssize_t     udp_more_rx(buff_udp_t *udp, ssize_t bc);
ssize_t     udp_tx_left(buff_udp_t *udp);
caddr_t     udp_tx(buff_udp_t *udp, ssize_t bc, struct sockaddr_in *addr);
void        udp_untx(buff_udp_t *udp, ssize_t bc);
void        udp_tx_flush(buff_udp_t *udp);

#endif
Copyright 2000 OverX, Inc., All Rights Reserved. dq.h Page 1
* Doubly Linked Queue Manipulation
* Copyright 2000 OverX, Inc.
* This program source is the property of OverX, Inc. and contains
* information which is confidential and proprietary to OverX, Inc.. No
* part of this source may be copied, reproduced, disclosed to third parties,
* or transmitted in any form or by any means, electronic or mechanical
* for any purpose without the express written consent of OverX, Inc..
$Id: dq.h,v 1.2 1999/10/05 21:12:54 steph Exp $
"I
#ifndef _DQ_H
#define _DQ_H
/*
* Doubly linked list manipulation
*/ typedef struct dq { struct dq *n; struct dq *p;
} dq_t;
#define INITQ(q) \ do { \
(q)->n = (q) ; \
(q) ->p = (q) ; \
} while (0) ;
#define EMPTYQ(q) ((q)->n == (q))
/* recover element pointer from embedded DQ links */
((define QLTOE(p, t. f) \
((t) ( ( (ptrdif f_t) (P)) \
- (( (ptrdif f_t) (&((t) (p))->f)) - ( (ptrdif _t) ((t) (p))))))
((define INSQH(q, e) \ do ( \
((dq_t *) (e))->n = (q)->n; \
((dq_t *) (e))->p = (q) ; \
(q)->n->p = (dq_t *) (e); \
(q)->n = (dq_t *) (e); \
} while (0) ;
((define INSQT(q, e) \ do { \
((dq_t *) (e))->n = (q).; \
(■(dq_t *) (e))->p = (q)->p; \
(q)->p->n = (dq_t *) (e); \
(q)->p = (dq_t *) (e) ; \
} while (0) ;
#define REMQ(e) do {
(((dq_t *) (e))->p)- •>n ((<3q_t (e) )- >n; (((dg_t *) (e))->n)- >P ((dq_t (e) )- >p;
} while (0);
((define REMQH(q, e) do {
Copyright 2000 OverX, Inc., All Rights Reserved. dq.h Page 2
((dσ_t *) e) = (q)->n; \
(((dq_t *) (e))->p)->n = ( (dq_t - M (e)) ->n; \
(((dq_t *) (e))->n)->p = ( (dq_t J ') (e)) ->p; \
) while (0) ;
#define REMQT(q, e) \ do { \
((dq_t *) e) = tq)->p; \
(((dq_t *) (e))->p)->n = ( (dq_t 'k) (e)) ->n; \
(((dq_t *) (e))->n)->p = ( (dq_t 'k) (e)) ->p; \
} while (0);
#define APQ(q, f, a) \ do { \ dq_t *e, *e0, *q0; \ qO = (q) ; \ for (e=q0->n; e != qO; e = e0- { \ eO = e->n; \
(void) (f) (e, a) ; \
} \
} while (0) ;
/* Like MAPQ except f is predicate . Ξtop when f is true
((define FINDQ(q, f, a) \ do { \ dq_t *e, *e0, *q0; \ qO = (q); \ for (e=q0->n; e != qO; e = eO) { \ eO = e->n; \ if ((f) (e, a) ) break; \
} \
} while (0) ;
((define FORALLQ(q, e) \ for(((dσ_t *) (e)) = (q) ->n; \
((dq_t *) (e)) != (q) ; \
((dq_t *) (e)) = ((dq_t *) (e) ) ->n) ffendif
Copyright 2000 OverX, Inc., All Rights Reserved. ndtp.h Page 1
/ *
* Data types for OverX Network Distributed Tracking Protocol
* Copyright 2000 OverX, Inc. *
* This program source is the property of OverX, Inc. and contains
* information which is confidential and proprietary to OverX, Inc.. No
* part of this source may be copied, reproduced, disclosed to third parties,
* or transmitted in any form or by any means, electronic or mechanical
* for any purpose without the express written consent of OverX, Inc.. *
* $Id: ndtp.h,v 1.7 1999/11/18 21:45:49 steph Exp $ */
#ifndef _NDTP_H_
#define _NDTP_H_

#include "ox_mach.h"    /* includes inttypes */

/* NDTP network service port number */
#define NDTP_PORT 4444

enum {
    NDTP_GET,
    NDTP_GET_RSP,
    NDTP_PUT,
    NDTP_PUT_RSP,
    NDTP_DEL,
    NDTP_DEL_RSP,
    NDTP_RDR_RSP
};

/*
 * An ndtp_str_t is variable length, but always padded to a multiple
 * of sizeof (uint32_t).
 * Use NDTP_NEXT_STR to skip to the next ndtp_str_t in a protocol message
 * with multiple strings.
 */
typedef struct ndtp_str {
    uint32_t len;
    uint8_t  data[1];
} ndtp_str_t;

/* number of bytes occupied by an ndtp_str of length l */
#define NDTP_STR_SIZE(l) \
    (((l) + 2 * sizeof (uint32_t) - 1) & ~(sizeof (uint32_t) - 1))
#define NDTP_NEXT_STR(s) \
    ((ndtp_str_t *) ((((ndtp_str_t *) (s))->data) \
                     + ((((ndtp_str_t *) (s))->len + sizeof (uint32_t) - 1) \
                        & ~(sizeof (uint32_t) - 1))))

typedef struct ndtp_hdr {
    uint8_t  op;        /* opcode */
    uint8_t  pad[3];
    uint32_t id;        /* transaction identifier */
    uint32_t size;      /* total request size following the header */
} ndtp_hdr_t;

typedef struct ndtp_get {
    ndtp_hdr_t hdr;
    ndtp_str_t key;
} ndtp_get_t;

/* An ndtp_get_rsp has enough ndtp_str_ts at the end to fill the payload */
typedef struct ndtp_get_rsp {
    ndtp_hdr_t hdr;
    ndtp_str_t value;   /* actually n of these */
} ndtp_get_rsp_t;

typedef struct ndtp_put {
    ndtp_hdr_t hdr;
    ndtp_str_t key;
    /* ndtp_str_t value; */  /* commented out 'cause key is variable len */
} ndtp_put_t;

typedef struct ndtp_put_rsp {
    ndtp_hdr_t hdr;
} ndtp_put_rsp_t;

typedef struct ndtp_del {
    ndtp_hdr_t hdr;
    ndtp_str_t key;
    /* ndtp_str_t value; */  /* commented out 'cause key is variable len */
} ndtp_del_t;

typedef struct ndtp_del_rsp {
    ndtp_hdr_t hdr;
} ndtp_del_rsp_t;

typedef struct ndtp_rdr_rsp {
    ndtp_hdr_t hdr;
    ndtp_str_t servers /* [] */;
} ndtp_rdr_rsp_t;

typedef union ndtp_req {
    ndtp_hdr_t     hdr;
    ndtp_get_t     get;
    ndtp_get_rsp_t get_rsp;
    ndtp_put_t     put;
    ndtp_put_rsp_t put_rsp;
    ndtp_del_t     del;
    ndtp_del_rsp_t del_rsp;
    ndtp_rdr_rsp_t rdr_rsp;
} ndtp_req_t;

#endif
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpc.c Page 1
/ *
* OverX Network Distributed Tracking Protocol Test Client
* Main Driver
* Copyright 2000 OverX, Inc. *
* This program source is the property of OverX, Inc. and contains
* information which is confidential and proprietary to OverX, Inc.. No
* part of this source may be copied, reproduced, disclosed to third parties,
* or transmitted in any form or by any means, electronic or mechanical
* for any purpose without the express written consent of OverX, Inc..
static const char *ndtpc_id =
"$Id: ndtpc.c,v 1.13 2000/01/26 19:14:54 steph Exp $";
#include "ox_mach.h"
#include <sys/types.h>
#include <errno.h>
#include <netdb.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>
#include <string.h>
#include <assert.h>
#include <stdio.h>
#include <sys/time.h>
#include <stdlib.h>
#include <limits.h>
#ifdef HAVE_GETOPT_H
#include <getopt.h>
#endif

#include "ndtp.h"
#include "ndtpc.h"
/*
* Global variables (visible to the protocol specific implementation)
*/ char *progname; ssize_t string_size = 10; int ndtpc_debug = 0;
/*
* Internal global variables
*/ static int use_random = 0; #define STATE_SIZE 8 ((define STATE_SEED 5 #ifdef HAVE_RANDOM_STATE static char req_string_state[STATE_SIZE] ; static char rsp_string_state[STATE_SIZE] ; ffelse static int req seed:
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpc.c Page 2
static int rsp_seed; ttendif static int next_req; static int next_rsp;
/*
* Internal function prototypes
*/ static void usage (void); static unsigned long read_num_arg(void) ; static void init_generators (void) ; static void get_string(char *s, size_t size, int req); int main (int argc, char *argv[])
{ long outstanding = 1; long iterations = 1; int passes = 1; char *host = "localhost"; int port = NDTP_P0RT; int tcpNoDelay = 0; int sockBufSize = 64 * 1024; int testDelete = 0; int testPut = 0; int testGet = 0; int testCheck = 0; int puts = 0 ; int gets = 0 ,- int deletes = 0; double putElapsed = 0.0; double getElapsed = 0.0; double deleteElapsed = 0.0; double rate; struct hostent *hostent; struct sockaddr_in addr; int c; int i; struct timeval stop; struct timeval start; progname = argv [0] ; for (;;) { c = getoptfargc, argv, "b:cD:dgpi:I:ph:P:no:rs: " ) ; if (c == -1) break; switch (c) { case 'b' : sockBufSize = read_num_arg ( ) ; break; case ' c' : testCheck = (testCheck == 0) ; break; case 'd' : testDelete = ( testDelete == 0 ) ;
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpc.c Page 3
break; case ' D ' : ndtpc_debug = read_num_arg ( ) ; break; case ' g ' : testGet = (testGet == 0) ,- break; case 'h' : host = optarg,- break; case ' i ' : iterations = read_num_arg ( ) ; break; case ' I ' : passes = read_num_arg ( ) ; break; case 'n' : tcpNoDelay = (tcpNoDelay == 0); break; case 'o' : outstanding = read_num_arg ( ) ; break; case 'p' : testPut = (testPut == 0) ,- break; case 'P' : port = atoi (optarg) ; break; case 'r ' : use_random = (use_random == 0) ,- break; case 's ' : string_size = read_num_arg ( ) ,- break; default: usage ( ) ; } } hostent = gethostbyname(host) ; if (hostent == NULL) { perror ("Host not found"); exit (3); } bzero( (caddr_t) &addr, sizeof (addr) ) ; addr.sin_port = htons(port); addr. sin_family = AF_INET; bcopy(hostent->h_addr, (caddr_t) &addr.sin_addr, hostent->h_length) ; init_proto(&addr, sockBufSize, tcpNoDelay); for ( i = 0 ; i < passes ; i++ ) { if ( testPut) { init_generators ( ) ;
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpc.c Page 4
gettimeofday (&start, NULL); test_put (iterations, outstanding) ,- gettimeofday (&stop, NULL) ,- putElapsed +=
(stop.tv_sec - start .tv_sec)
+ (((double) (stop. tv_usec - start . tv_usec) ) / le6) ; puts += iterations ; } if (testGet) { init_generators ( ) ; gettimeofday (&start, NULL) ; test_get (iterations, outstanding, testCheck); gettimeofday (Sstop, NULL) ; getElapsed +=
(stop.tv_sec - start . tv_sec)
+ (((double) (stop. tv_usec - start .tv_usec) ) / leδ) ; gets += iterations ,- } if (testDelete) { init_generators ( ) ,- gettimeofday(&start, NULL) ; test_delete(iterations, outstanding) ; gettimeofday(Scstop, NULL) ; deleteElapsed +=
(stop.tv_sec - start . tv_sec)
+ (((double) (stop.tv_usec - star . tv_usec) ) / le6) ; deletes += iterations; } } print_summary( ) ; if (puts) { rate = 0.0; if (putElapsed != 0.0) { rate = puts / putElapsed; } printf("%u puts in %5.2f sec, %5.2f puts/sec\n", puts, putElapsed, rate) ; } if (gets) { rate = 0.0; if (getElapsed != 0.0) { rate = gets / getElapsed; } printf("%u gets in %5.2f sec, %5.2f gets/sec\n", gets, getElapsed, rate) ; }
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpc.c Page 5
if (deletes) { rate = 0.0; if (deleteElapsed != 0.0) { rate = deletes / deleteElapsed; } printf("%u deletes in %5.2f sec, %5.2f deletes/sec\n" , deletes, deleteElapsed, rate) ; } if (puts I I deletes ] | gets) { rate = 0.0; if (putElapsed + getElapsed + deleteElapsed != 0.0) { rate = (puts + gets + deletes)
/ (putElapsed + getElapsed + deleteElapsed) ; } printf("%u total ops in %5.2f sec, %5.2f total ops/sec\n", puts + gets + deletes, putElapsed + getElapsed + deleteElapsed, rate) ;
return 0 ; } static void usage (void) { fprintf (stderr "usage: %s [opts]\n", progname); print ( " -c: check get results (only applies to -g)\n"); print ( " -D n: set debug flags (default is 0)\n"); printf ( " -d: test string store delete operation\n" ) ; printf ( " -g= test get operatιon\n" ) ; print ( " -h server host\n" ) ; printf ( " -i n[u] iteration count\n"); printf (" u: k: 1024 iters, m: 1024*1024 iters\n"),- printf (" (default is l)\n") ; printf ( " -I n[u] pass (outer loop) count\n"); printf (" u: k: 1024 iters, m: 1024*1024 passes\n"); printf ( " (default is l)\n") ; tifdef ndef printf ( " -m n[u] ; mappings per key\n" ) ; printf ( " u: k: 1024 iters, m: 1024*1024 iters\n"); printf (" (default is l)\n") ;
(tendif printf (" -P test put operation\n" ) ; printf (" -P: server port\n"); printf ( ' -r: use random keys and data\n" ) ; printf ( ' -s n[u] : string (key and return data) size\n"); printf ( ' u: k: 1024 bytes, m: 1024*1024 bytes\n" ) ; printf ( g: 1024*1024*1024 bytes (default is lm)\n"); printf ( -c, -d, -g, and -p may be used in any combination. \n" ) ; printf ( '\nEXAMPLES:\n"'); printf ( ' %Ξ -p - -i 64000 -h mecca \n", progname) ,- print ( " %s -p --c -g -d -i 2000 -I 10 -h servl -P 3333\n\n", progname) exit (10 3);
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpc.c Page 6
static unsigned long read_num_arg (void) { unsigned long a; char *s; a = strtoul (optarg, &s, 0) ; switch (s[0]) { case ' \0 ' : break; case 'g' : case 'G' : if (s[l] != '\0') { usage ( ) ;
} a *= (1024 * 1024 * 1024) ; break; case 'k' : case 'K' : if (s[l] != '\0') { usage ( ) ;
} a *= 1024; break; case 'm' : case 'M' : if (s[l] != '\0') { usage ( ) ,-
} a *= (1024 * 1024) ; break; default: usage ( ) ,- } return a; } static void init_generators (void)
{
(fifdef HAVE_RAND0M_STATE initstate(STATE_SEED, req_string_state, STATE_SIZE) initstate(STATE_SEED, rsp_string_state, STATE_SIZE) #else rsp_seed = STATE_SEED; req_seed = STATE_SEED; srandom(STATE_SEED) ,- #endif next_req = 1 ; next_rsp = 1 ; } void get_req_string (char *s, size_t size)
{ get_string(s, size, 1) ;
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpc.c Page 7
} void get_rsp_string ( char *s , size_t size)
{ get_string (s, size, 0); }
/*
* Can pass NULL in s to tell getstring to just consume a value in whatever
* iterator sequence it is using. This is used by the get test to ensure that
* getting the same values that were put . Each put consumes multiple values
* from the iterator one for the key, and one for each value.
*/ static void get_string(char *s, size_t size, int req)
( int i; unsigned long n = 0; if ( !use_random) { if (req) { next_req++; } else { next_rsp++ ; } if (!s) { return; } (fifdef HAVE_RANDOM_STATE } else { setstate(req ? req_string_state : rsp_string_state) ,- (tendif } for ( i = 0 ; i < size ; i++ ) { if ( i % ( sizeof (int) * 2 ) == 0 ) { if (use_random) { (tifdef HAVE_RANDOM_STATE n = (unsigned long) random! ) ; ttelse srandom (req ? req_seed : rsp_seed) ; n = (unsigned long) randomO ; (req ? req_seed : rsp_seed) = n ; #endif
} else { n = req ? next_reg : next_rsp; } } if (s) { s[i] = 'a' + (n & Oxf) ; n »= 4; } } } void
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpc.c Page 8
compare_string(char *expected, int expectedLen, char *got, int gotLen) ( if (gotLen != expectedLen
|| bcmp(got, expected, gotLen) ) { printf ( "value miscompare\n" ) ; printf (" expected: "); print_bytes (expected, expectedLen) ; printf (" (%d bytes)\n got: ", expectedLen); print_bytes (got, gotLen) ; printf (" (%d bytes)\n", gotLen) ; > } void print_bytes (char *bytes, int len)
{ int i ; for (i = 0; i < len; i++) { printf ("%c", bytes [i] );
} } void dump_buf (char *buf, int len)
{ int i; #define DUMP_ IDTH 16 for(i = 0; i < len; i++) { if (!(i % DUMP_ IDTH) ) { printf ("\t") ; } printf ("%02x ", buf[i]); if ((i % DUMP_ IDTH) == DUMP_WIDTH - 1) { printf ( "\n" ) ,- } } if (i % DUMP_WIDTH) { printf ("\n") ; }
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpc.h Page 1
* OverX Network Distributed Tracking Protocol Server
* Copyright 2000 OverX, Inc.
*
* This program source is the property of OverX, Inc. and contains
* information which is confidential and proprietary to OverX, Inc.. No
* part of this source may be copied, reproduced, disclosed to third parties,
* or transmitted in any form or by any means, electronic or mechanical
* for any purpose without the express written consent of OverX, Inc..
*
* "$Id: ndtpc.h,v 1.3 2000/01/24 15:54:18 steph Exp $ (OVERX)"; */
#ifndef _NDTPC_H
#define _NDTPC_H

#include <sys/types.h>
#include <netinet/in.h>
/*
* Global flags */ extern char *progname; extern int ndtpc_debug; extern ssize_t string_size;
/*
* Exported functions */ void get_req_string(char *s, size_t size) ; void get_rsp_string(char *s, size_t size) ; void compare_string (char *expected, int expectedLen, char *got, int gotLen); void print_bytes (char *bytes, int len) ,- void durnp_buf (char *buf, int len);
/*
* External entries into protocol specific code */ extern void init_proto (struct sockaddr_in *addr, ssize_t sockBufSize, int tcpNoDelay) ; extern void test_put(int iterations, int outstanding); extern void test_get(int iterations, int outstanding, int testCheck); extern void test_delete(int iterations, int outstanding) ; extern void print_summary (void) ;
/*
* Debugging print macros */
# ef ine NDTPC_DEBUG
((define DEBUG_TOP 0x1
((def ine DEBUG_GET 0x2
((define DEBUG_PUT 0x4
((def ine DEBUG_DEL 0x8
((define DEBUG_C0N 0x10 ttifdef NDTPC_DEBUG
#define DINIT (N) static char *module = N
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpc.h Page 2
((define PRINTD(F, X) \ do ( \ if (ndtpc_debug & (int) (F) ) { \
/* VARARGS */ \ printf ("%-20s", module); \
/* VARARGS */ \ printf X; \
} \
} while (0)
((define PRINT0D(F, X) \ do { \ if (ndtpc_debug & (int) (F) ) { \
/* VARARGS */ \ printf X; \
} \ } while (0) ((define PRINTSD(F, X) \ do { \ if (ndtpc_debug & ( int) (F) ) { \ ss_i_t i; \ /* VARARGS */ \ printf ("%-20s", module) ; \ for (i = 0; i < (X) ->len; i++) { \ /* VARARGS */ \ printf ( "%c" , (X) ->data [ i ) ) ; \ } \ } \
} while (0) (telse
((define DINIT (F) ((define PRINTD (F, X) ifendif #endif
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpc_tcp.c
/*
* OverX Network Distributed Tracking Protocol Test Client
* TCP Module
* Copyright 2000 OverX, Inc.
* This program source is the property of OverX, Inc. and contains
* information which is confidential and proprietary to OverX, Inc.. No
* part of this source may be copied, reproduced, disclosed to third parties,
* or transmitted in any form or by any means, electronic or mechanical
* for any purpose without the express written consent of OverX, Inc..
* */ static const char *ndtpc_tcp_id =
"$Id: ndtpc_tcp.c,v 1.4 2000/01/26 19:14:55 steph Exp $",-
((include "ox_mach.h" /* includes inttypes */
((include <sys/types.h>
((include <sys/uio.h>
((include <errno.h>
((include <sys/socket .h>
((include <netdb.h>
((include <unistd.h>
((include <string.h>
((include <assert.h>
((include <netinet/in.h> /* struct sockaddr_in */ ttifdef HAVE_NETINET_TCP_H
((include <netinet/tcp.h> /* TCP_NODELAY */
(fendif
((include <stdio.h>
#include <sys/time.h>
#include <stdlib.h>
((include "ndtp.h" #include "ndtpc.h"
/*
* Local globals */ static int sock;
/*
* Internal function prototypes */ static void test_upd(int iterations, int outstanding, int put) ; static int recv_bytes ( int sock, uint8_t *buf, ssize_t bufSize) ; void init_proto(struct sockaddr_in *addr, ssize_t sockBufSize, int tcpNoDelay)
{ sock = socket ( PF_INET, SOCK_STREAM, 0 ) ; if (sock < 0 ) { perror ( " socket ailed" ) ,- exit ( 17 ) ; )
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpc_tcp.c
if (setsockopt (sock, SOL_SOCKET, S0_SNDBUF, (void *) ksockBu Size, sizeof (sockBufSize) ) < 0) { perror ("setsockopt SO_SNDBUF failed"); } if (setsockopt (sock, SOL_SOCKET, S0_RCVBUF, (void *) &sockBufSize, sizeof (sockBufSize) ) < 0) { perro ("setsockopt SO_RCVBUF failed"); } if (connec (sock, (struct sockaddr *) addr, sizeof (*addr) ) < 0) { perror ("tcp connect failed"); exi (3) ; } if (setsockopt (sock, IPPROTO_TCP, TCP_NODELAY, (void *) &tcpNoDelay, sizeof (tcpNoDelay) ) < 0) { perror ( "setsockopt TCP_NODELAY failed" ),- } } void test_put (int iterations, int outstanding)
{ test_upd(iterations, outstanding, 1) ; } void test__get (int iterations, int outstanding, int testCheck)
{ size_t ndtpStrSize = NDTP_STR_SIZE(string_size) ; ndtp_hdr_t *hdr = (ndtp_hdr_t *) malloc (ndtpStrSize + sizeof (ndtp_hdr_t) ndtp_str_t *ns = & ( (ndtp_get_t *) hdr) ->key; size_t rxBufSize = 128; uint8_t *rxBuf = malloc (rxBufSize) ; ndtp_str_t *value = (ndtp_str_t *) rxBuf; char *checkData = malloc ( string_size) ; long currentlyOutstanding = 0; int id = 0; ndtp_hdr_t rxHdr; ssize_t be; int i; hdr->op = NDTP_GET; hdr->size = htonl (ndtpStrSize) ,- ns->len = htonl (string_size) ; assert (hdr) ; assert (rxBuf) ; for (i = 0; i < iterations; i++) { while (id < iterations && currentlyOutstanding < outstanding) { hdr->id = id; id++; get_req_string( (char *) ns->data, string_size) ; get_req_string(NULL, string_size) ; be = write(sock, hdr, ndtpStrSize + sizeo ( *hdr) ) ; if (be < 0) { perror ( "error writing") ;
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpc_tcp.c
} if (be < ndtpStrSize + sizeof ( *hdr) ) { fprintf (stderr, "short write %lu\n", (unsigned long) be); } currentlyOutstanding++ ; ) recv_bytes (sock, (void *) &rxHdr, sizeof (rxHdr) ) ; rxHdr.size = ntohl (rxHdr .size) ; if (rxBufSize < rxHdr.size) { freefrxBuf) ; rxBuf = malloc (rxHdr.size) ,- assert (rxBuf) ; } recv_bytes (sock, (void *) rxBuf, rxHdr.size); if (testCheck) { get_rsp_string(NULL, string_size) ; get_rsp_string(checkData, string_size) ; if (rxHdr.size != ndtpStrSize) C printf ( "wrong size get response (expected %d, got %d)\n", ndtpStrSize, rxHdr.size); dump_buf ( (caddr_t) rxBuf, rxHdr.size); } else { value->len = ntohl (value->len) ; compare_string(checkData, string_size, value->data, value->len) ; } } currentlyOutstanding-- ,- } free (hdr) ; free(rxBuf) ; free(checkData) ; } void test_delete (int iterations, int outstanding)
{ test_upd(iterations, outstanding, 0) ; } void print_summary (void)
{
}
static void test_upd (int iterations , int outstanding, int put)
{ size_t ndtpStrSize = 2 * NDTP_STR_SIZE(string_size) ; ndtp_hdr_t *hdr = (ndtp_hdr_t *) malloc (sizeof (ndtp_hdr_t) + ndtpStrSize); long currentlyOutstanding = 0; int id = 0; ndtp_hdr_t rxHdr; ndtp_str_t *ns ; int i;
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpc_tcp.c
assert (hdr) ; hdr->op = put ? NDTP_PUT : NDTP_DEL; hdr->size = htonl (ndtpStrSize) ; for (i = 0; i < iterations; i++) { while (currentlyOutstanding < outstanding && id < iterations) ( ns = &( (ndtp_put_t *) hdr) ->key; hdr->id = id; id++; ns->len = htonl (string_size) ; get_req_string( (char *) ns->data, string_size) ; ns = (ndtp_str_t *) ( ( (uint8_t *) ns) + NDTP_STR_SIZE (string_size) ) ; ns->len = htonl (string_size) ; get_req_string( (char *) ns->data, string_size) ,- write(sock, hdr, sizeof (*hdr) + ndtpStrSize); currentlyOutstanding++; } recv_bytes (sock, (void *) &rxHdr, sizeo (rxHdr) ) ; rxHdr.size = ntohl (rxHdr.size) ; if (rxHdr.size != 0) { printf ( "Bogus data: hdr. size == 0x%x\n" , rxHdr.size) ,- } currentlyOutstandin — ; } free (hdr) ,- } static int recv_bytes (int sock, uint8_t *buf, ssize_t bufSize)
{ ssize_t be; ssize_t bcO; be = 0; while (be < bufSize) { bcO = recv(sock, &(buf[bc]), bufSize - be, 0); if (bcO <= 0) { return errno ;
) be += bcO; } return 0; }
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpc_udp.c Page 1
* OverX Network Distributed Tracking Protocol Test Client
* UDP Module
* Copyright 2000 OverX, Inc. *
* This program source is the property of OverX, Inc. and contains
* information which is confidential and proprietary to OverX, Inc.. No
* part of this source may be copied, reproduced, disclosed to third parties,
* or transmitted in any form or by any means, electronic or mechanical
* for any purpose without the express written consent of OverX, Inc.. *
*/ static const char *ndtpc_udp_id =
"$Id: ndtpc_udp.c,v 1.5 2000/01/26 21:20:28 steph Exp $";
((include "ox_mach.h" /* includes inttypes */
#include <sys/types .h> ((include <sys/uio.h> #include <errno.h> ((include <sys/socket.h> ((include <netdb.h> ((include <unistd.h> ((include <string.h> ((include <assert.h> ((include <netinet/in.h> /* struct sockaddr_in */ ((include <stdio.h> ((include <sys/time.h> ((include <stdlib.h> ((include <limits.h>
((include "ndtp.h" #include "ndtpc. h" ((include "buff__udp.h"
/"
Local globals
*/ static buff_udp_t *buff_udp; static struct sockaddr_in serverAddr; static int putRxPackets = 0; static int putTxPackets = 0; static int getRxPackets = 0; static int getTxPackets = 0; static int deleteRxPackets 0; static int deleteTxPackets 0;
I*
* Internal function prototypes
*/ static void test_upd(int iterations, int outstanding, int put) ; static ndtp_hdr_t *get_rsp(void) ; void init_proto (struct sockaddr_in *addr, ssize_t sockBufSize, int tcpNoDelay)
{
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpc_udp.c Page 2
buf f_udp = init_udp (ntohs (addr->sin_port ) , 0 , 64 * 1024 ) ; bcopy ( ( caddr_t ) addr, ( caddr_t ) & server Addr , sizeof ( * addr) ) ; ) void test_put ( int iterations , int outstanding)
{ test_upd ( iterations , outstanding, 1 ) ;
} void test_get ( int iterations , int outstanding , int testCheck)
{
DINIT ( " test_get " ) ; size_t ndtpStrSize = NDTP_STR_SIZE(string_size) ; char *checkData = malloc (string_size) ,- long currentlyOutstanding = 0; int done = 0 ; int id = 0; int startTxPackets,- int startRxPackets; int reqsPerPacket; int wantMore; int i; ndtp_hdr_t *hdr; ndtp_str_t *ns; assert (checkData) ; startTxPackets = buff_udp->txPackets; startRxPackets = buf _udp->rxPackets; reqsPerPacket = udp_tx_left (buff_udp) / (sizeof (ndtp_hdr_t) + ndtpStrSize) ,- if (reqsPerPacket > outstanding) { reqsPerPacket = outstanding; } while (done < iterations) { wantMore = (outstanding - currentlyOutstanding >= reqsPerPacket
&& id < iterations) ; if ( IwantMore
|| udp_rx_avail (buf _udp, 0)) { PRINTD(DEBUG_TOP | DEBUG_GET, ("getting response\n" ) ) ; hdr = get_rsp ( ) ; if (testCheck) { get_rsp_string(NULL, string_size) ; get_rsp_string (checkData, string_size) ; -i««-£_,.»---?*.-.- if (ndtpStrSize != hdr->size) { printf ( "wrong size get response (expected %d, got %d) \n" , ndtpStrSize, hdr->size) ; dump_buf ( (caddr_t) hdr, sizeof (ndtp_hdr_t) + hdr->size) ; } else { ns = & ( (ndtp_get_rsp_t *) hdr)->value; ns->len = ntohl (ns->len) ,- compare_string (checkData, string_size, ns->data, ns->len) ; } }
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpc_udp.c Page 3
currentlyOutstanding— ; done++; } else if (wantMore) ( for (i = 0; i < reqsPerPacket && id < iterations; i++) { hdr = (ndtp_hdr_t *) udp_tx(buf _udp, sizeof (ndtp_hdr__t) + ndtpStrSize, &serverAddr) ; if (!hdr) { fprintf (stderr, "test_get: unable to TX %d bytes\n", sizeof (ndtp_hdr_t) + ndtpStrSize); exit (9) ; } hdr->op = NDTP_GET; hdr->id = id; hdr->size = htonl (ndtpStrSize) ; ns = & ( (ndtp_get_t *) hdr)->key; ns->len = htonl (string_size) ; get_req_string( (char *) ns->data, string_size) ; get_req_string(NULL, string_size) ; currentlyθutstanding++; id++; }
PRINTD(DEBUG_TOP | DEBUG_GET, ("sent %d gets\n", i) ) ; udp_tx_flush(buff_udp) ; } ) getTxPackets += buf _udp->txPackets - StartTxPackets; getRxPackets += buff_udp->rxPackets - startRxPackets; free (checkData) ; } void test_delete (int iterations, int outstanding)
{ test_upd ( iterations , outstanding , 0 ) ; ) void print_summary (void)
{ if (putTxPackets) { printf ("%d put packets sent, %d put response packets received\n", putTxPackets, putRxPackets) ; } if (getTxPackets) { print ("%d get packets sent, %d get response packets received\n" , getTxPackets, getRxPackets) ; } if (deleteTxPackets) { printf ("%d delete packets sent, %d delete response packets received\n", deleteTxPackets, deleteRxPackets) ; } printf ("%d packets sent, %d response packets received\n", putTxPackets + getTxPackets + deleteTxPackets, putRxPackets + getRxPackets + deleteTxPackets) ,- )
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpc_udp.c Page 4
static void test_upd ( t iterations, mt outstanding, mt put)
£
DINIT ( " test_upd" ) , sιze_t ndtpStrSize = 2 * NDTP_STR_SIZE ( strmg_size) ; long currentlyOutstanding = 0 ; t done = 0 ; mt id = 0 ; int startTxPackets; mt startRxPackets; int reqsPerPacket; t wantMore; mt l; ndtp_hdr_t *hdr; ndtp_str_t *ns; startTxPackets = buff_udp->txPackets; startRxPackets = buff_udp->rxPackets; reqsPerPacket = udp_tx_left (bu f_udp) / (sizeof (ndtp_hdr_t) + ndtpStrSize); if (reqsPerPacket > outstanding) { reqsPerPacket = outstanding; } while (done < iterations) { wantMore = (outstanding - currentlyOutstanding >= reqsPerPacket
&& id < iterations) ; if ( 'wantMore
|| udp_rx_avaιl (buff_udp, 0)) { PRINTD (DEBUG_TOP | DEBUG_PUT, ("getting response\n" ) ) , hdr = get_rsp ( ) ; currentlyOutstanding-- ; done++; } else if (wantMore) { for (i = 0; l < reqsPerPacket && id < iterations; ι++) { hdr = (ndtp_hdr_t *) udp_tx(buff_udp, sizeo (ndtp_hdr_t) + ndtpStrSize, sserverAddr) ; if (!hdr) { fprintf (stderr, "test_upd: unable to tx %d bytes\n", sizeof (ndtp_hdr_t) + ndtpStrSize); exit (8) ; > hdr->op = put ? NDTP_PUT : NDTP_DEL; hdr->sιze = htonl (ndtpStrSize) ; hdr->id = id; ns = & ( (ndtp_put_t *) hdr)->key; ns->len = htonl (stnng_sιze) ; get_req_strmg( (char *) ns->data, strmg_sιze) ; ns = (ndtp_str_t *) ( ( (umt8_t *) ns) + NDTP_STR_SIZE(strmg_sιze) ) ; ns->len = htonl (stnng_sιze) ; get_req_strmg( (char *) ns->data, strmg_sιze) ; currentlyOutstandmg++ ; id++; } PRINTD (DEBUG_TOP | DEBUG_PUT,
("sent %d %s\n", I, put ' "puts" : "deletes")); udp_tx_flush (buff_udp) ;
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpc_udp.c Page 5
} } if (put) { putTxPackets += buff_udp->txPackets - startTxPackets; putRxPackets += buff_udp->rxPackets - startRxPackets,- } else { deleteTxPackets += buff_udp->txPackets - startTxPackets; deleteRxPackets += buff_udp->rxPackets - startRxPackets,- } } static ndtp_hdr_t * get_rsp (void)
{ ssize_t be; ndtp_hdr_t *hdr; struct sockaddr_in addr; be = udp_rx(buff_udp, sizeof (*hdr) , &addr, (void **) &hdr) ; if (be < sizeof (*hdr) ) { fprintf (stderr, "runt response header received (%d bytes) \n" , (int) bc); } hdr->size = ntohl (hdr->size) ; be = udp_more_rx (bu f_udp, hdr->size) ; if (be < hdr->size) { fprintf (stderr,
"runt response payload received (%d versus %d) \n" , (int) be, (int) hdr->size) ; ) return hdr; }
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_udp.c Page 1
* OverX Network Distributed Tracking Protocol Server
* Synchronous Multiplexing Variant
* Copyright 2000 OverX, Inc.
* This program source is the property of OverX, Inc. and contains
* information which is confidential and proprietary to OverX, Inc.. No
* part of this source may be copied, reproduced, disclosed to third parties,
* or transmitted in any form or by any means, electronic or mechanical
* for any purpose without the express written consent of OverX, Inc.. *
*/ static const char *ndtpd_udp__id = "$Id: ndtpd_udp.c,v 1.7 2000/02/09 18:01:59 steph Exp $ (OVERX) ";
((include "ox_mach.h" /* includes inttypes */
((include <stdio.h> ((include <unistd.h> ((include <stdlib.h> ((include <string.h> #include <errno.h> ((include <assert.h> ((include <signal.h> ((include <fcntl.h> ((include <sys/types.h> ((include <sys/time.h> (finclude <sys/socket.h> ifinclude <netinet/in.h> ((include <arpa/inet.h> ((ifdef HAVE__NETINET_TCP_H ((include <netinet/tcp.h> #endif
#ifdef NDTPD_POLL ((include <poll.h> ifendif
#include "dq.h" ((include "string_store.h" ((include "ndtp.h" ((include "timestamp.h" #include "ndtpd.h" #include "ndtpd_udp.h" (finclude "buff_udp.h"
/*
* Global state
*/ static int ops_done; static int puts_done; static int deletes_done; static int gets_done; static int assocs_returned; buff_udp_t *buff_udp; ss_desc_t *string_store; static int max_upds;
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_udp.c Page 2
static int upds_live; static int upds_ready; static int upd_head; static int upd_tail; static upd_state_t *upds; static int do_reset = 0; static void serve (ss_desc_t *ss) ; static void proc_req(ss_desc_t *ss) ; static void get (ndtp_get_t *req, ss_desc_t *ss) ; static void upd(ndtp_put_t *req, ss_desc_t *ss) ,- static void upd_rsp(upd_state_t *upd) ; static void upd_done (void *arg, ss_i_t completions) ,- static void send_upd_rsps (void) ; void ndtpd_serve (ss_desc_t *ss, int port, int sockBufSize, int tcpNoDelay, int maxUpds)
{ string_store = ss; max_upds = maxUpds; ss->callbackFunc = upd_done; ss->callbackArg = NULL; ops_done = 0; puts_done = 0 ,- deletes_done = 0 ; gets_done = 0; assocs_returned = 0; upds_live = 0; upds_ready = 0; upd_head = 0; upd_tail = 0; upds = (upd_state_t *) malloc (sizeof (upd_state_t) * max_upds) , if ( !upds) { fprintf (stderr, "unable to malloc upds table\n"); assert (0); /* MEMERR */ } bzero( (caddr_t) upds, sizeof (upd_state_t) * max_upds) ; buff_udp = init_udp(port, 1, sockBufSize); serve (ss) ;
} void . ndtpd_reset_store (ss_desc_t *ss)
{ do_reset = 1; } static void serve (ss_desc_t *ss)
{
DINITCserve" ) ;
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_udp.c Page 3
for ( ; ; ) { if (upds_live < max_upds && udp_rx_avail (buff_udp, 0)) {
PRINTD (DEBUG_TOP, ("processing request\n" ) ) ; proc_req(ss) ; } else if (do_reset) { ss_empty (ss) ,- do_reset = 0 ; } else if (upds_ready) ( send_upd_rsps ( ) ; } else if (upds_live) {
PRINTD (DEBUG_TOP, ("waiting for updates: %d\n", upds_live) ) ; ss_wait_update (ss) ,- } else {
PRINTD (DEBUG_TOP, ("snoozing for input\n")); udp_tx_flush (buff_udp) ; udp_rx_avai1 (buff_udp, 1) ; } } } void proc_req (ss_desc_t *ss)
{
DINI ( "proc_req" ) ; struct sockaddr_in *addr = &upds [upd_head] .addr; ndtp_hdr_t *hdr; ssize_t be; be = udp_rx(bu _udp, sizeof (ndtp_hdr__t) , addr, (void **) &hdr) ; if (be < sizeof (ndtp_hdr__t) ) { fprintf (stderr, "runt NDTP request (%d byes)\n", (int) be) ; }
PRINTD (DEBUG_TOP,
("request from %s:%d\n", inet_ntoa (addr->sin_addr) , ntohs (addr->sin_port) ) ) ,- hdr->size = ntohl (hdr->size) ,- be = udp_more_rx(buff_udp, hdr->size) ,- if (be < hdr->size) { fprintf (stderr, "runt NDTP payload (%d versus %d) \n" , (int) be, (int) hdr->size) ; return; } switch (hdr->op) { case NDTP_GET: get ( (ndtp_get_t *) hdr, string_store) ; break; case NDTP_PUT: upd ( (ndtp_put_t * ) hdr, string_store) ; break; case NDTP_DEL : upd ( (ndtp_put_t * ) hdr, string_store) ; break; default : fprintf ( stderr, "unexpected ndtp request : 0x%x\n" , hdr->op) ■
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_udp.c Page 4
} } static void get (ndtp_get_t *req, ss_desc_t *ss)
{
DINIT("get"); ndtp_hdr_t *hdr = &req->hdr; ndtp_str_t *key = &req->key; ssize_t be; ssize_t bcO; int assocs; ss_i_t totalLen; ndtp_get_rsp_t *r; ndtp_str_t *s; ndtp_str_t *s0; int i; key->len = ntohl (key->len); if (hdr->size < key->len) { fprintf (stderr, "get payload (%u) too short for key (%u)\n", hdr->size, key->len); return; } if (udp_tx_left(buff_udp) < sizeof (ndtp_hdr_t) + NDTP_STR_SIZE(0)) { udp_tx_flush (buff_udp); } be = udp_tx_left (buff_udp); r = (ndtp_get_rsp_t *) udp_tx(buff_udp, be, &upds [upd_head].addr); assert (r); s = &r->value; bcO = be - sizeof (ndtp_hdr_t); s->len = bcO; assocs = ss_lookup (ss, key, s, &totalLen); if (bcO < totalLen) { udp_untx(buff_udp, be); udp_tx_flush (buff_udp); r = (ndtp_get_rsp_t *) udp_tx(buff_udp, sizeof (ndtp_hdr_t) + totalLen, &upds [upd_head].addr); if (!r) { fprintf (stderr, "get response too long for datagram (%u)\n", totalLen); return;
} s = &r->value; s->len = totalLen; assocs = ss_lookup(ss, key, s, &totalLen); } else { udp_untx(buff_udp, bcO - totalLen); } for (i = 0; i < assocs; i++) {
s0 = NDTP_NEXT_STR (s); s->len = htonl (s->len); s = s0; }
Copyright 2000 OverX, Inc., All Rights Reseived. ndtpd_udp.c Page 5
r->hdr.op = NDTP_GET_RSP; r->hdr.id = hdr->id; r->hdr.size = htonl (totalLen); ops_done++; gets_done++; assocs_returned += assocs; PRINTD (DEBUG_GET,
("get from %s:%d\n", inet_ntoa (upds [upd_head].addr.sin_addr), ntohs(upds [upd_head].addr.sin_port))); } static void upd (ndtp_put_t *req, ss_desc_t *ss)
{ ndtp_hdr_t *hdr = &req->hdr; ndtp_str_t *key = &req->key; ndtp_str_t *data; uint32_t keyLen; uint32_t dataLen; keyLen = ntohl (key->len); if (hdr->size < NDTP_STR_SIZE (keyLen) + NDTP_STR_SIZE(0)) { fprintf (stderr, "update payload (%u) too short for key (%u)\n", hdr->size, keyLen); return; } else { key->len = keyLen; data = NDTP_NEXT_STR(key); dataLen = ntohl (data->len); if (hdr->size < NDTP_STR_SIZE (keyLen) + NDTP_STR_SIZE (dataLen)) { fprintf (stderr,
"update payload (%u) too short for key (%u) + data (%u)\n", hdr->size, keyLen, dataLen); } else { int olduh = upd_head; int del = (hdr->op == NDTP_DEL); uint8_t rspOp = del ? NDTP_DEL_RSP : NDTP_PUT_RSP; int async; data->len = dataLen; upds [upd_head].id = hdr->id; upds [upd_head].op = rspOp; upd_head = (upd_head + 1) % max_upds; upds_live++; async = del ? ss_delete(ss, key, data) : ss_add(ss, key, data); if (!async) { upd_rsp (&upds [olduh]); upd_head = olduh; upds_live--; }
} } } static void
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_udp.c Page 6
upd_rsp (upd_state_t *upd) { ndtp_hdr_t *rsp; assert (sizeof (ndtp_put_rsp_t) == sizeof (ndtp_hdr_t)); assert (sizeof (ndtp_del_rsp_t) == sizeof (ndtp_hdr_t)); rsp = (ndtp_hdr_t *) udp_tx(buff_udp, sizeof (ndtp_hdr_t), &upd->addr); if (!rsp) { fprintf (stderr, "upd_rsp: couldn't send %u byte response\n", sizeof (ndtp_hdr_t)); return; } rsp->op = upd->op; rsp->id = upd->id; rsp->size = 0; if (upd->op == NDTP_DEL_RSP) { deletes_done++; } else { puts_done++; } } static void upd_done (void *arg, ss_i_t completions) {
TIMESTAMP_ENDP (("upd_done %u %u", completions, upds_live)); assert (completions <= upds_live); upds_ready += completions; send_upd_rsps (); } static void send_upd_rsps (void) { upd_state_t *upd; assert (sizeof (ndtp_put_rsp_t) == sizeof (ndtp_hdr_t)); assert (sizeof (ndtp_del_rsp_t) == sizeof (ndtp_hdr_t)); assert (upds_ready <= upds_live); while (upds_ready && udp_tx_left (buff_udp) >= sizeof (ndtp_hdr_t)) { upd = &upds [upd_tail]; upd_rsp (upd); upds_live--; upds_ready--; upd_tail = (upd_tail + 1) % max_upds; } udp_tx_flush (buff_udp); }
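The handlers above expect every datagram to begin with a fixed NDTP header (opcode, request identifier, and a payload size that proc_req() converts with ntohl), followed for GET by a single length-prefixed key string. The following is a minimal sketch of a matching request builder under that assumed layout; the struct, field widths, and helper name below are illustrative stand-ins, since the authoritative definitions of ndtp_hdr_t, ndtp_str_t, and the NDTP_GET opcode live in ndtp.h and are not reproduced here:

    /* Illustrative only: layout, padding, and opcode value are assumptions, not ndtp.h. */
    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>
    typedef struct { uint8_t op; uint8_t pad[3]; uint32_t id; uint32_t size; } example_hdr_t;
    static size_t example_build_get (uint8_t *buf, uint32_t id, const void *key, uint32_t keyLen)
    {
        example_hdr_t hdr;
        uint32_t len = htonl (keyLen);               /* key length, network byte order */
        hdr.op = 0;                                  /* placeholder for the NDTP_GET opcode */
        memset (hdr.pad, 0, sizeof (hdr.pad));
        hdr.id = id;                                 /* echoed back verbatim in the response */
        hdr.size = htonl (sizeof (len) + keyLen);    /* payload size only, excluding the header */
        memcpy (buf, &hdr, sizeof (hdr));
        memcpy (buf + sizeof (hdr), &len, sizeof (len));
        memcpy (buf + sizeof (hdr) + sizeof (len), key, keyLen);
        return sizeof (hdr) + sizeof (len) + keyLen;
    }

The reply carries the same header shape: get() above fills in NDTP_GET_RSP, copies the request id, and sets the size to cover the returned association strings.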
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd.c Page 1
* OverX Network Distributed Tracking Protocol Server
* Copyright 2000 OverX, Inc. *
* This program source is the property of OverX, Inc. and contains
* information which is confidential and proprietary to OverX, Inc.. No
* part of this source may be copied, reproduced, disclosed to third parties,
* or transmitted in any form or by any means, electronic or mechanical
* for any purpose without the express written consent of OverX, Inc.. *
*/ static const char *ndtpd_id = "$Id: ndtpd.c,v 1.19 2000/02/09 18:01:58 steph Exp $ (OVERX)";
#include "ox_mach.h" /* includes inttypes */
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <signal.h>
#ifdef HAVE_GETOPT_H
#include <getopt.h>
#endif
#include "ndtpd.h"
#include "string_store.h"
#define DEFAULT_MAX_UPDS 1024
/*
* Global values */ char *progname; int ndtpd_debug = 0;
/*
* Internal function prototypes */ static unsigned long read_num_arg (void); static void usage (void); static void reset_store(void); ss_desc_t ss; int main (int argc, char **argv)
{ char *storeName = "/var/tmp/test.ss"; int tcpNoDelay = 0; int sockBufSize = 64 * 1024; int port = NDTP_PORT; ss_i_t storeSize = 1 * 1024 * 1024; ssize_t logBufSize = 64 * 1024; int logBufs = 2; int load = 0; int maxUpds = DEFAULT_MAX_UPDS;
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd.c Page 2
struct sigaction sigAct; int status; int c; char *s;
/* TODO: basename, FreeBSD doesn't have it */ progname = argv[0]; for (;;) { c = getopt(argc, argv, "b:d:lnN:p:s:S:u:"); if (c == -1) break; switch (c) { case 'b': sockBufSize = read_num_arg (); break; case 'd': ndtpd_debug = strtoul (optarg, &s, 0); if (*s != '\0') { usage ();
} break; case 'l': load = 1; break; case 'p': port = strtoul (optarg, &s, 0); if (*s != '\0') { usage ();
} break; case 's': storeSize = read_num_arg (); break; case 'S': logBufSize = read_num_arg (); break; case 'n': tcpNoDelay = (tcpNoDelay == 0); break; case 'N': logBufs = read_num_arg (); break; case 'u': maxUpds = read_num_arg (); break; default: usage (); } } if (optind < argc) { storeName = argv [optind]; } ss_debug = ndtpd_debug >> 16;
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd.c Page 3
ndtpd_debug &= 0xffff; ss.arenaSize = storeSize; ss.chunkSize = SS_CHUNK_SIZE; ss.chunksPerHash = SS_CHUNKS_PER_HASH; ss.callbackFunc = NULL; ss.callbackArg = NULL; ss.logBufSize = logBufSize; ss.logBufs = logBufs; status = ss_init (&ss); if (status) { fprintf (stderr, "unable to initialize string store: %s\n", strerror (status)); exit (4); } printf ("starting with %lu chunks free\n", (unsigned long) ss.chunksFree); if (load) { status = ss_read_log(&ss, storeName); if (status) { fprintf (stderr, "unable to read log %s: %s\n", storeName, strerror (status)); exit (5); } printf ("%lu chunks free after reading %s\n",
(unsigned long) ss.chunksFree, storeName); } status = ss_new_log(&ss, storeName, 0); if (status) { fprintf (stderr, "unable to open new log %s: %s\n", storeName, strerror (status)); exit (6); } sigAct.sa_handler = (void (*)()) reset_store; sigemptyset (&sigAct.sa_mask); sigAct.sa_flags = 0; if (sigaction (SIGUSR1, &sigAct, NULL) < 0) { perror ("unable to install SIGUSR1 handler"); exit (7); } ndtpd_serve(&ss, port, sockBufSize, tcpNoDelay, maxUpds); ss_close_log(&ss); return 0; } static unsigned long read_num_arg (void) { unsigned long a; char *s; a = strtoul (optarg, &s, 0);
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd.c Page 4
switch (s[0]) { case '\0': break; case 'g': case 'G': if (s[1] != '\0') { usage ();
} a *= (1024 * 1024 * 1024); break; case 'k': case 'K': if (s[1] != '\0') { usage ();
} a *= 1024; break; case 'm': case 'M': if (s[1] != '\0') { usage ();
} a *= (1024 * 1024); break; default: usage (); } return a;
} static void usage (void) { printf ("usage: %s [opts] [ssFile]\n", progname); printf (" -d n: set debug flags (default is 0)\n"); printf (" -l: test string store load operation\n"); printf (" -n: set TCP_NODELAY socket option\n"); printf (" -p n: set port number (default is %d)\n", NDTP_PORT); printf (" -s n[u]: string store arena size\n"); printf (" u: k: 1024 bytes, m: 1024*1024 bytes\n"); printf (" g: 1024*1024*1024 bytes (default is 1m)\n"); exit (100); } static void reset_store (void) { ndtpd_reset_store(&ss); }
void dump_hdr (void *hdr)
{ uint32_t *p = (uint32_t *) hdr; printf (" 0x%08x 0x%08x 0x%08x\n", p[0], p[1], p[2]); }
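For reference, the options parsed above combine as in the following illustrative invocation (the flag letters come from the getopt string and usage() text; the particular values are arbitrary examples rather than defaults from the sources):

    ndtpd -p 5001 -s 16m -S 128k -N 4 -u 2048 -d 0x3 /var/tmp/test.ss

This would listen on port 5001, allocate a 16 MB string store arena with four 128 KB log buffers, allow 2048 in-flight updates, enable the DEBUG_TOP and DEBUG_GET print categories, and use /var/tmp/test.ss as the store log file.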
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd.c Page 5
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd.h Page 1
/*
* OverX Network Distributed Tracking Protocol Server
* Copyright 2000 OverX, Inc. *
* This program source is the property of OverX, Inc. and contains
* information which is confidential and proprietary to OverX, Inc.. No
* part of this source may be copied, reproduced, disclosed to third parties,
* or transmitted in any form or by any means, electronic or mechanical
* for any purpose without the express written consent of OverX, Inc. *
* "$Id: ndtpd.h,v 1.6 2000/02/09 18:01:58 steph Exp $ (OVERX)"; */
#ifndef _NDTPD_H
#define _NDTPD_H
#include "string_store.h"
/*
* Global flags */ extern char *progname; extern int ndtpd_debug;
/*
* Convenience function defined in ndtpd */ extern void dump_hdr (void *hdr);
/*
* External entries into actual server code */ extern void ndtpd_serve (ss_desc_t *ss, int port, int sockBufSize, int tcpNoDelay, int maxUpds); extern void ndtpd_reset_store (ss_desc_t *ss);
/*
* Debugging print macros */
#define NDTPD_DEBUG
#define DEBUG_TOP 0x1
#define DEBUG_GET 0x2
#define DEBUG_PUT 0x4
#define DEBUG_CON 0x8
#ifdef NDTPD_DEBUG
#define DINIT(N) static char *module = N
#define PRINTD(F, X) do { if (ndtpd_debug & (int) (F)) { \
/* VARARGS */ printf ("%-20s", module); \
/* VARARGS */ printf X; \
} } while (0)
#define PRINT0D(F, X) do { \
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd.h Page 2
if (ndtpd_debug & (int) (F)) { \
/* VARARGS */ printf X; \
} \
} while (0)
#define PRINTSD(F, X) do { if (ndtpd_debug & (int) (F)) { ss_i_t i; \
/* VARARGS */ printf ("%-20s", module); for (i = 0; i < (X)->len; i++) { \
/* VARARGS */ printf ("%c", (X)->data[i]); \
} \
} \
} while (0)
#else
#define DINIT(F)
#define PRINTD(F, X)
#endif
#endif
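As a usage note tied to main() in ndtpd.c above (which splits the -d argument as ss_debug = ndtpd_debug >> 16 and then ndtpd_debug &= 0xffff): the low 16 bits of the -d value select the print categories defined here, and the high 16 bits are handed to the string store's own debug flag. For example, -d 0x3 enables DEBUG_TOP and DEBUG_GET tracing; a value such as 0x10003 would additionally set the string store flag to 1, though the meaning of ss_debug values is defined in string_store.h, which is not reproduced here.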
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_sync.c Page 1
/*
* OverX Network Distributed Tracking Protocol Server
* Synchronous Multiplexing Variant
* Copyright 2000 OverX, Inc. *
* This program source is the property of OverX, Inc. and contains
* information which is confidential and proprietary to OverX, Inc.. No
* part of this source may be copied, reproduced, disclosed to third parties,
* or transmitted in any form or by any means, electronic or mechanical
* for any purpose without the express written consent of OverX, Inc.. *
*/ static const char *ndtpd_sync_id = "$Id: ndtpd_sync.c,v 1.6 2000/02/09 18:01:59 steph Exp $ (OVERX)";
#include "ox_mach.h" /* includes inttypes */
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <assert.h>
#include <signal.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#ifdef HAVE_NETINET_TCP_H
#include <netinet/tcp.h>
#endif
#ifdef USE_POLL
#include <poll.h>
#endif
#include "dq.h"
#include "string_store.h"
#include "ndtp.h"
#include "timestamp.h"
#include "ndtpd.h"
#include "ndtpd_sync.h"
#define CON_BACKLOG 8
#define DEFAULT_BUF_BC 2048
/*
* Global state */ static int ops_done; static int puts_done; static int deletes_done; static int gets_done; static int assocs_returned; ss_desc_t *string_store; static dq_t free_cons;
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_sync.c Page 2
static dq_t active_cons ; static dq_t upds_waiting; static int listen_sock; static int sock_buf_size; static int tcp_no_delay; static int cons_size; static con_t *cons;
#ifdef USE_POLL
static struct pollfd *pollfds;
#endif
static int max_upds; static int upds_live; static int upd_head; static int upd_tail; static upd_state_t *upds;
#ifndef USE_POLL
static fd_set read_fds; static fd_set write_fds; static int max_fd;
#endif
static int do_reset = 0;
static void do_listen(int port); static void serve (ss_desc_t *ss); static void do_accept (void); static void do_read (con_t *con); static void proc_req(con_t *con); static void get(con_t *con, ss_desc_t *ss); static void upd(con_t *con, ss_desc_t *ss); static void do_write(con_t *con); static void add_upd_rsp (con_t *con, uint32_t id, uint8_t rspOp); static void upd_done (void *arg, ss_i_t completions); static void close_con(con_t *con); static void find_max_sock(dq_t *con, int *maxFd); void ndtpd_serve (ss_desc_t *ss, int port, int sockBufSize, int tcpNoDelay, int maxUpds)
{ int i; string_store = ss; sock_buf_size = sockBufSize; tcp_no_delay = tcpNoDelay; max_upds = maxUpds ; ss->callbackFunc = upd_done; ss->callbackArg = NULL; ops_done = 0 ; puts_done = 0 ;
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_sync.c Page 3
deletes_done = 0; gets_done = 0; assocs_returned = 0; upds_live = 0; upd_head = 0; upd_tail = 0; upds = (upd_state_t *) malloc (sizeof (upd_state_t) * max_upds); if (!upds) { fprintf (stderr, "unable to malloc upds table\n"); assert (0); /* MEMERR */ }
INITQ(&free_cons) ;
INITQ(&active_cons);
INITQ(&upds_waiting); cons_size = getdtablesize(); cons = (con_t *) malloc (sizeof (con_t) * cons_size); if (!cons) { fprintf (stderr, "unable to malloc cons table\n"); assert (0); /* MEMERR */ }
#ifdef USE_POLL
pollfds = malloc (sizeof (struct pollfd) * cons_size); if (!pollfds) { fprintf (stderr, "unable to malloc pollfd table\n"); assert (0); /* MEMERR */ }
#endif
for (i = 0; i < cons_size; i++) { con_t *con = &cons[i]; con->reqBuf = (uint8_t *) malloc (DEFAULT_BUF_BC); if (!con->reqBuf) { fprintf (stderr, "unable to malloc request buffer\n"); assert (0); /* MEMERR */ } con->reqBufSize = DEFAULT_BUF_BC; con->rspBuf = (uint8_t *) malloc (DEFAULT_BUF_BC); if (!con->rspBuf) { fprintf (stderr, "unable to malloc response buffer\n"); assert (0); /* MEMERR */ } con->rspBufSize = DEFAULT_BUF_BC; INSQT(&free_cons, &cons[i]); } do_listen(port); serve (ss); } void ndtpd_reset_store (ss_desc_t *ss)
{ do_reset = 1 ;
}
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_sync.c Page 4
static void do_listen (int port) { struct sockaddr_in addr; int i; listen_sock = socket (PF_INET, SOCK_STREAM, 0); if (listen_sock < 0) { perror ("can't open socket"); exit (11); } addr.sin_family = PF_INET; addr.sin_port = htons (port); addr.sin_addr.s_addr = htonl (INADDR_ANY); if (bind(listen_sock, (struct sockaddr *) &addr, sizeof (addr))) { perror ("Can't bind address"); exit (12); } i = 1; if (setsockopt (listen_sock, SOL_SOCKET, SO_REUSEADDR, &i, sizeof (i)) < 0) { perror ("Can't set SO_REUSEADDR"); exit (13); } if (setsockopt(listen_sock, SOL_SOCKET, SO_SNDBUF, (void *) &sock_buf_size, sizeof (sock_buf_size)) < 0) { perror ("setsockopt SO_SNDBUF"); } if (setsockopt (listen_sock, SOL_SOCKET, SO_RCVBUF, (void *) &sock_buf_size, sizeof (sock_buf_size)) < 0) { perror ("setsockopt SO_RCVBUF"); } if (fcntl(listen_sock, F_SETFL, O_NONBLOCK)) { perror ("error setting socket nonblocking"); exit (14); } if (listen(listen_sock, CON_BACKLOG)) { perror ("error listening"); exit (15); } }
#if defined (USE_POLL)
/* use poll (2) */ static void serve (ss_desc_t *ss)
{ con_t *con; int tmo;
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_sync.c Page 5
int list; int check; int ready; int i; for (;;) { list = 0; check = 0; if (!EMPTYQ(&free_cons)) { pollfds[0].fd = listen_sock; pollfds[0].events = POLLIN; check++; list = 1; }
FORALLQ(&active_cons, con) { pollfds[check].events = 0; if (con->flags & CON_FLAG_READING) { pollfds[check].events = POLLIN; } if (con->flags & CON_FLAG_WRITING) { pollfds[check].events |= POLLOUT; } if (pollfds[check].events) { pollfds[check].fd = con->sock; check++; } } if (upds_live) { tmo = 0; } else {
#ifdef INFTIM
tmo = INFTIM;
#else
tmo = -1;
#endif
} ready = poll (pollfds, check, tmo); if (ready < 0) { if (errno == EINTR) { ready = 0; } else { fprintf (stderr, "error in poll: %s\n", strerror (errno)); assert (0); /* BUG */ } } if (do_reset) { ss_empty(ss); do_reset = 0; } i = 0; if (ready) { if (list) { if (pollfds[0].revents & POLLIN) { assert (!EMPTYQ(&free_cons)); do_accept (); } i++;
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_sync.c Page 6
} for (; i < check; i++) { if (pollfds[i].revents & POLLIN) { do_read (&cons [pollfds[i].fd]); } if (pollfds[i].revents & POLLOUT) { do_write(&cons [pollfds[i].fd]); } } } else { ss_wait_update (ss); } } }
#else
/* use select (2) */ static void serve (ss_desc_t *ss) { int maxSelect; struct timeval tv; struct timeval *tvp; fd_set readFds0; fd_set writeFds0; int i; max_fd = listen_sock; FD_ZERO (&read_fds); FD_SET(listen_sock, &read_fds); FD_ZERO (&write_fds); tv.tv_sec = 0; tv.tv_usec = 0; for (;;) { if (upds_live) { tvp = &tv; } else { tvp = NULL; } readFds0 = read_fds; writeFds0 = write_fds; maxSelect = select (max_fd + 1, &readFds0, &writeFds0, NULL, tvp); if (do_reset) { ss_empty(ss); do_reset = 0; } else { if (maxSelect < 0) { fprintf (stderr, "error in select: %s\n", strerror (errno)); assert (0); /* BUG */ } if (maxSelect) { if (FD_ISSET(listen_sock, &readFds0)) { do_accept (); } for (i = 0; i <= max_fd; i++) { if (i != listen_sock && FD_ISSET(i, &readFds0)) { do_read(&cons[i]); }
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_sync.c Page 7
} for (i = 0; i <= max_fd; i++) { if (i != listen_sock && FD_ISSET(i, &writeFds0)) { do_write (&cons[i]); } } } else { ss_wait_update(ss); } } } }
#endif
static void do_accept (void) { int tmp; int sock; struct sockaddr_in addr; con_t *con; tmp = sizeof (addr); sock = accept (listen_sock, (struct sockaddr *) &addr, &tmp); if (sock < 0) { perror ("accept failed"); assert (0); /* BUG */
} if (setsockopt (sock, IPPROTO_TCP, TCP_NODELAY, (void *) &tcp_no_delay, sizeof (tcp_no_delay)) < 0) { perror ("setsockopt TCP_NODELAY"); } con = &cons [sock]; con->sock = sock; con->flags = (CON_FLAG_OPEN | CON_FLAG_READING); con->reqBufOff = 0; con->rspBufOff = 0; con->rspBc = 0; con->updsLive = 0;
REMQ(con) ;
INSQT (&active_cons, con);
#ifndef USE_POLL
if (sock > max_fd) { max_fd = sock;
}
FD_SET (sock, &read_fds); if (EMPTYQ (&free_cons)) {
FD_CLR ( listen_sock, &read_fds ) ;
}
#endif
} static void do_read (con_t *con)
{
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_sync.c Page 8
size_t be; size_t bcO; int readPayload = (con->flags & CON_FLAG_READ_PAYLOAD); uint8_t *buf; if (readPayload) { buf = &con->reqBuf [con->reqBufOff]; be = con->reqHdr.size - con->reqBufOff; } else { buf = (uint8_t *) &con->reqHdr; be = sizeof (con->reqHdr) - con->reqBufOff; } bcO = read(con->sock, &buf [con->reqBufOff], be); if (bcO <= 0) { if (bcO == 0) {
/* connection was closed by the far end */ close_con(con); } else if (errno != EAGAIN) { fprintf (stderr, "error reading from con: %s\n", strerror (errno)); close_con(con); } } else if (be == bcO) { if (!readPayload) { con->reqHdr.size = ntohl (con->reqHdr.size); } if (readPayload || con->reqHdr.size == 0) { con->flags &= ~CON_FLAG_READ_PAYLOAD; proc_req(con); } else {
con->flags |= CON_FLAG_READ_PAYLOAD; } con->reqBufOff = 0; } else { con->reqBufOff += bcO; } } static void proc_req (con_t *con)
{ switch (con->reqHdr.op) { case NDTP_GET: get (con, string_store) ; break; case NDTP_PUT: upd(con, string_store) ; break; case NDTP_DEL: upd(con, string_store) ; break; default: fprintf (stderr, "unexpected ndtp request: 0x%x\n", con->reqHdr.op) ; close_con(con) ; } }
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_sync.c Page 9
static void get (con_t *con, ss_desc_t *ss)
{ ndtp_str_t *key = (ndtp_str_t *) con->reqBuf; ndtp_hdr_t *hdr = &con->reqHdr; int assocs; ss_i_t totalLen; ndtp_get_rsp_t *r; ndtp_str_t *s; ndtp_str_t *s0; int i; assert (! (con->flags & CON_FLAG_WRITING)); assert (con->rspBc == 0); assert (con->rspBufOff == 0); key->len = ntohl (key->len); if (hdr->size < key->len) { fprintf (stderr, "get payload (%u) too short for key (%u)\n", hdr->size, key->len); close_con(con); return; } for (;;) { s = &((ndtp_get_rsp_t *) con->rspBuf)->value; s->len = con->rspBufSize - sizeof (ndtp_hdr_t); assocs = ss_lookup(ss, key, s, &totalLen); if (con->rspBufSize - sizeof (ndtp_hdr_t) < totalLen) { uint8_t *rspBuf0; rspBuf0 = (uint8_t *) malloc (totalLen + sizeof (ndtp_hdr_t)); if (!rspBuf0) { fprintf (stderr, "couldn't allocate %u bytes for response buffer\n", totalLen + sizeof (ndtp_hdr_t)); close_con(con); return;
} free(con->rspBuf); con->rspBuf = rspBuf0; con->rspBufSize = totalLen + sizeof (ndtp_hdr_t); } else { break; } } s = &((ndtp_get_rsp_t *) con->rspBuf)->value; for (i = 0; i < assocs; i++) {
s0 = NDTP_NEXT_STR (s); s->len = htonl (s->len); s = s0; } r = (ndtp_get_rsp_t *) con->rspBuf; r->hdr.op = NDTP_GET_RSP; r->hdr.id = hdr->id; r->hdr.size = htonl (totalLen);
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_sync.c Page 10
con->rspBufOff = 0; con->rspBc = sizeof (r->hdr) + totalLen; ops_done++; gets_done++; assocs_returned += assocs; do_write(con) ; } static void upd (con_t *con, ss_desc_t *ss)
{ ndtp_str_t *data; ndtp_hdr_t *hdr = &con->reqHdr; ndtp_str_t *key = (ndtp_str_t *) con->reqBuf; uint32_t keyLen; uint32_t dataLen; keyLen = ntohl (key->len); if (hdr->size < NDTP_STR_SIZE (keyLen) + NDTP_STR_SIZE(0)) { fprintf (stderr, "update payload (%u) too short for key (%u)\n", hdr->size, keyLen); close_con(con); return; } else { key->len = keyLen; data = NDTP_NEXT_STR(key); dataLen = ntohl (data->len); if (hdr->size < NDTP_STR_SIZE (keyLen) + NDTP_STR_SIZE (dataLen)) { fprintf (stderr,
"update payload (%u) too short for key (%u) + data (%u)\n", hdr->size, keyLen, dataLen); close_con(con); } else { if (upds_live < max_upds) { int olduh = upd_head; int del = (con->reqHdr.op == NDTP_DEL); uint8_t rspOp = del ? NDTP_DEL_RSP : NDTP_PUT_RSP; int async; data->len = dataLen; upds [upd_head].id = hdr->id; upds [upd_head].con = con; upds [upd_head].op = rspOp; upd_head = (upd_head + 1) % max_upds; upds_live++; con->updsLive++; if (del) { async = ss_delete (ss, key, data); } else { async = ss_add(ss, key, data); } if (!async) { upd_head = olduh; upds_live--; con->updsLive--; add_upd_rsp(con, hdr->id, rspOp);
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_sync.c Page 11
if ( ! (con->flags & CON_FLAG_WRITING) ) { do_write (con) ; } }
} else {
#ifndef USE_POLL
FD_CLR (con->sock, &read_fds);
#endif
INSQT (&upds_waiting, &con->updWaitLinks); con->flags = ((con->flags | CON_FLAG_UPD_WAITING) & ~CON_FLAG_READING); } } } } static void do_write (con_t *con)
{ size_t be; size_t bcO; int beenHereBefore = (con->flags & CON_FLAG_WRITING); bcO = con->rspBc - con->rspBufOff; be = write (con->sock, &con->rspBuf [con->rspBufOff], bcO); if (be < 0) { if (errno == EAGAIN) { be = 0; } else { fprintf (stderr, "error sending %u byte get rsp: %s\n", bcO, strerror (errno)); close_con (con); return; } } con->rspBufOff += be; if (con->rspBufOff == con->rspBc) { con->rspBufOff = 0; con->rspBc = 0; if (beenHereBefore) {
#ifndef USE_POLL
FD_CLR(con->sock, &write_fds);
#endif
if (! (con->flags & CON_FLAG_UPD_WAITING)) { con->flags = ((con->flags | CON_FLAG_READING) & ~CON_FLAG_WRITING);
#ifndef USE_POLL
FD_SET(con->sock, &read_fds) ; #endif
} else { con->flags &= ~CON_FLAG_WRITING; } } } else { if (!beenHereBefore) {
#ifndef USE_POLL
FD_SET ( con->sock, &write_fds) ;
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_sync.c Page 12
FD_CLR(con->sock, &read_fds);
#endif
con->flags = ((con->flags | CON_FLAG_WRITING) & ~CON_FLAG_READING); } } } static void add_upd_rsp (con_t *con, uint32_t id, uint8_t rspOp)
{ ndtp_hdr_t *r; assert (sizeof (ndtp_put_rsp_t) == sizeof (ndtp_hdr_t)); assert (sizeof (ndtp_del_rsp_t) == sizeof (ndtp_hdr_t)); if (con->rspBufSize - con->rspBc < sizeof (ndtp_hdr_t)) { uint8_t *newBuf; size_t be; be = con->rspBufSize + (con->updsLive + 1) * sizeof (ndtp_hdr_t); newBuf = malloc (be); if (!newBuf) { fprintf (stderr, "couldn't allocate %u bytes for response buffer\n", be); close_con(con); return; } bcopy(con->rspBuf, newBuf, con->rspBc); free(con->rspBuf); con->rspBuf = newBuf; con->rspBufSize = be; } r = (ndtp_hdr_t *) (con->rspBuf + con->rspBc); r->op = rspOp; r->id = id; r->size = 0; con->rspBc += sizeof (ndtp_hdr_t); if (rspOp == NDTP_DEL_RSP) { deletes_done++; } else { puts_done++; } } static void upd_done (void *arg, ss_i_t completions)
{ int i; upd_state_t *us; dq_t updRspCons; dq_t *l; con_t *con;
TIMESTAMP_ENDP ( ( "upd_done %u %u" , completions , upds_live) ) ; assert (completions <= upds_live) ;
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_sync.c Page 13
INITQ (&updRspCons); for (i = 0; i < completions; i++) { us = &upds [upd_tail]; con = us->con; if (con) { con->updsLive--; add_upd_rsp (con, us->id, us->op); if ((con->flags & (CON_FLAG_UPD_RSP | CON_FLAG_WRITING)) == 0) { con->flags |= CON_FLAG_UPD_RSP; INSQT (&updRspCons, &con->updRspLinks); } } upd_tail = (upd_tail + 1) % max_upds; } while (!EMPTYQ(&updRspCons)) {
REMQH (&updRspCons, l); con = QLTOE (l, con_t *, updRspLinks); con->flags &= ~CON_FLAG_UPD_RSP; do_write(con); } upds_live -= completions; while (!EMPTYQ(&upds_waiting) && upds_live < max_upds) { REMQH (&upds_waiting, con); upd(con, string_store); if ((con->flags & (CON_FLAG_OPEN | CON_FLAG_WRITING)) == CON_FLAG_OPEN) {
#ifndef USE_POLL
FD_SET(con->sock, &read_fds);
#endif
con->flags = ((con->flags | CON_FLAG_READING) & ~CON_FLAG_UPD_WAITING); } else { con->flags &= ~CON_FLAG_UPD_WAITING; } } } static void close_con (con_t *con)
{ int i; if (con->updsLive) { for (i = upd_tail; i != upd_head; i = (i + 1) % max_upds) { if (upds[i].con == con) { upds[i].con = NULL; } } } if (con->flags & CON_FLAG_UPD_WAITING) {
REMQ(&con->updWaitLinks) ; } if (con->flags & CON_FLAG_UPD_RSP) {
REMQ(&con->updRspLinks) ; } #ifndef USE_POLL
FD_CLR(con->sock, &read_fds) ;
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_sync.c Page 14
FD_CLR(con->sock, &write_fds) ; if (con->sock == max_fd) { max_fd = listen_sock;
MAPQ(&active_cons, f ind_max_sock, &max_fd) ; } if (EMPTYQ(&free_cons) ) {
FD_SET(listen_sock, &read_fds) ,- } # endif
REMQ(con); /* from active_cons */
INSQT(&free_cons, con); close (con->sock) ; con->flags = 0; } static void f ind_max_sock (dq_t *con, int *maxFd)
{ int sock = ((con_t *) con)->sock; if (sock > *maxFd) { *maxFd = sock;
} }
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_sync.h Page 1
/ *
* OverX Network Distributed Tracking Protocol Server Definitions
* Copyright 2000 OverX, Inc.
* This program source is the property of OverX, Inc. and contains
* information which is confidential and proprietary to OverX, Inc.. No
* part of this source may be copied, reproduced, disclosed to third parties,
* or transmitted in any form or by any means, electronic or mechanical
* for any purpose without the express written consent of OverX, Inc.. *
*/ #ifndef _NDTPD_SYNC_H
#define _NDTPD_SYNC_H
#include "ox_mach.h" /* includes inttypes */
#include "ndtp.h"
#include "dq.h"
/*
* This is the state for an open connection */ typedef struct con { dq_t links; int sock; int flags; uint8_t *reqBuf; size_t reqBufSize; size_t reqBufOff; ndtp_hdr_t reqHdr; uint8_t *rspBuf; size_t rspBufSize; size_t rspBufOff; size_t rspBc; int updsLive; dq_t updWaitLinks; dq_t updRspLinks; } con_t;
#define CON_FLAG_READING 0x01
#define CON_FLAG_WRITING 0x02
#define CON_FLAG_OPEN 0x04
#define CON_FLAG_READ_PAYLOAD 0x08
#define CON_FLAG_UPD_WAITING 0x10
#define CON_FLAG_UPD_RSP 0x20
/*
* This is the state for a pending update (PUT or DEL) request */ typedef struct upd_state { dq_t links; uint32_t id; /* protocol id for response */ uint8_t op; /* opcode for response */ con_t *con; } upd_state_t;
#endif
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd hr.c Page 1
/ *
* OverX Network Distributed Tracking Protocol Server
* Threaded Variant
* Copyright 2000 OverX, Inc. *
* This program source is the property of OverX, Inc. and contains
* information which is confidential and proprietary to OverX, Inc.. No
* part of this source may be copied, reproduced, disclosed to third parties,
* or transmitted in any form or by any means, electronic or mechanical
* for any purpose without the express written consent of OverX, Inc.. *
*/ static const char *ndtpd_thr_id = "$Id: ndtpd_thr.c,v 1.14 2000/02/10 00:10:19 steph Exp $ (OVERX)";
#include "ox_mach.h" /* includes inttypes */
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <assert.h>
#include <signal.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#ifdef HAVE_NETINET_TCP_H
#include <netinet/tcp.h>
#endif
#include "ox_thread.h"
#include "dq.h"
#include "string_store.h"
#include "ndtp.h"
#include "timestamp.h"
#include "ndtpd.h"
#include "ndtpd_thr.h"
#define CON_BACKLOG 8
#define INITIAL_CONS 32
#define DEFAULT_BUF_SIZE 2048
#define UPDATE_INTERVAL 20000 /* in microseconds */
/*
* Global variables
*/ static int tcp_no_delay = 0; static int sock_buf_size = 64 * 1024; static int ops_done; static int puts_done; static int deletes_done; static int gets_done; static int assocs_returned; static int upds = 0;
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_thr.c Page 2
static int async_upds = 0; static ss_desc_t *string_store; static int listen_sock; static THR_T update_thread; static int updated; static dq_t cons_free; static int waiting_cons; static MUT_T cons_lock; static COND_T cons_avail; static dq_t upds_free; static int waiting_upds; /* # of threads waiting for an UPD block */
static MUT_T upds_lock; static COND_T upds_avail; static int upds_live; static dq_t upds_pending; static int waiting_mutate; static MUT_T mutate_lock; static COND_T mutate_go; static int empty_req; static MUT_T empty_lock; static COND_T empty_go; static THR_T empty_thread;
#define MUTATE_START do { \
MUT_LOCK(mutate_lock); waiting_mutate++; if (waiting_mutate > 1) { \
COND_WAIT(mutate_go, mutate_lock); \
} \
MUT_UNLOCK(mutate_lock); } while (0)
#define MUTATE_DONE do { \
MUT_LOCK(mutate_lock); waiting_mutate--; if (waiting_mutate) { COND_SIGNAL(mutate_go); \
} \
MUT_UNLOCK(mutate_lock); } while (0)
#define GET_UPD(u) do { \
MUT_LOCK(upds_lock); waiting_upds++; while (EMPTYQ(&upds_free)) { \
COND_WAIT(upds_avail, upds_lock); \
} \
REMQH(&upds_free, (u)); waiting_upds--; \
MUT_UNLOCK(upds_lock); } while (0)
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_thr.c Page 3
#define FREE_UPD(u) do { \
MUT_LOCK(upds_lock); \
INSQT(&upds_free, (u)); if (waiting_upds) { \
COND_SIGNAL(upds_avail); \
} \
MUT_UNLOCK(upds_lock); } while (0)
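As a compact illustration of how these macros pair in the code that follows (a sketch distilled from the upd() and writer() functions below, not additional code from the original sources):

    upd_t *u;
    GET_UPD(u);        /* block until a free upd_t is available from the pool */
    MUTATE_START;      /* at most one thread mutates the string store at a time */
    /* ... fill in u->rsp, queue u on upds_pending, call ss_add() or ss_delete() ... */
    MUTATE_DONE;
    /* ... later, after u->rsp has been written back to the requesting socket ... */
    FREE_UPD(u);       /* return the block and wake any reader waiting in GET_UPD */

The pool is sized by the maxUpds argument to ndtpd_serve(), so GET_UPD also acts as the server's back-pressure point when the string store falls behind.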
static void *emptier (void *ss); static void do_listen(int port); static void serve (void); static void do_accept (void); static void *reader(void *arg); static int recv_bytes (con_t *con, uint8_t *buf, size_t bufSize); static void close_con(con_t *con); static int get(con_t *con, ndtp_hdr_t *hdr, uint8_t *payload, ss_desc_t *ss); static int upd(con_t *con, ndtp_hdr_t *hdr, uint8_t *payload, ss_desc_t *ss); static void upd_done (void *arg, ss_i_t completions); static void *writer(void *arg); static void *updater (void *ss); void ndtpd_serve (ss_desc_t *ss, int port, int sockBufSize, int tcpNoDelay, int maxUpds) { int i; int status; ss->callbackFunc = upd_done; ss->callbackArg = NULL; string_store = ss; sock_buf_size = sockBufSize; tcp_no_delay = tcpNoDelay; ops_done = 0; puts_done = 0; deletes_done = 0; gets_done = 0; assocs_returned = 0; upds_live = 0; waiting_cons = 0; status = MUT_INIT(cons_lock); if (status) { fprintf (stderr, "unable to init cons list lock: %s\n", strerror (status)); assert (0); /* MEMERR */ } status = COND_INIT(cons_avail); if (status) { fprintf (stderr, "unable to init cons avail cond: %s\n", strerror (status)); assert (0); /* MEMERR */ }
INITQ(&cons_free) ; for (i = 0; i < INITIAL_CONS; i++) {
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_thr.c Page 4
con_t *con; con = (con_t *) malloc (sizeof (con_t)); if (!con) { fprintf (stderr, "unable to malloc connection state block\n"); assert (0); /* MEMERR */ } con->rspBuf = (uint8_t *) malloc (DEFAULT_BUF_SIZE); if (!con->rspBuf) { fprintf (stderr, "unable to malloc response buffer\n"); assert (0); /* MEMERR */ } con->rspBufSize = DEFAULT_BUF_SIZE; status = MUT_INIT(con->readerLock); if (status) { fprintf (stderr, "unable to init con lock: %s\n", strerror (status)); assert (0); /* MEMERR */ } status = COND_INIT(con->readerGo); if (status) { fprintf (stderr, "unable to init go cond: %s\n", strerror (status)); assert (0); /* MEMERR */ } status = MUT_INIT(con->writerLock); if (status) { fprintf (stderr, "unable to init con lock: %s\n", strerror (status)); assert (0); /* MEMERR */ } status = COND_INIT(con->writerGo); if (status) { fprintf (stderr, "unable to init go cond: %s\n", strerror (status)); assert (0); /* MEMERR */ } con->flags = 0; con->inUpdDoneCons = 0; INITQ(&con->updsDone); INSQT(&cons_free, con); } waiting_cons = 0; status = MUT_INIT(upds_lock); if (status) { fprintf (stderr, "unable to init upds lock: %s\n", strerror (status)); assert (0); /* MEMERR */ } status = COND_INIT(upds_avail); if (status) { fprintf (stderr, "unable to init upds avail cond: %s\n", strerror (status)); assert (0); /* MEMERR */ }
INITQ(&upds_free) ; for (i = 0; i < maxUpds; i++) { upd_t *upd; upd = (upd_t * ) malloc ( sizeof (upd_t) ) ; INSQT ( &upds_free , upd) ;
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd„thr.c Page 5
} waiting_mutate = 0; status = MUT_INIT(mutate_lock); if (status) { fprintf (stderr, "unable to init mutator lock: %s\n", strerror (status)); assert (0); /* MEMERR */ } status = COND_INIT(mutate_go); if (status) { fprintf (stderr, "unable to init mutator go cond: %s\n", strerror (status)); assert (0); /* MEMERR */ } INITQ(&upds_pending); empty_req = 0;
MUT_INIT(empty_lock) ;
COND_INIT(empty_go) ; status = THR_CREATE(empty_thread, emptier, string_store) ; if (status) { fprintf (stderr, "empty thread creation failed: %s\n" , strerror (status) ) ; assert(O); /* MEMERR */ } updated = 0 ; status = THR_CREATE(update_thread, updater, string_store) ; if (status) { fprintf (stderr, "updater thread creation failed: %s\n" , strerror (status) ) ,- assert (0); /* MEMERR */ } do_listen (port) ; serve ( ) ; } void ndtpd_reset_store (ss_desc_t *ss)
{ printf ("Asking for empty\n" ) ; // BOGUS
MUT_LOCK(empty_lock) ; empty_req++ ;
COND_SIGNAL(empty_go) ;
MUT_UNLOCK(empty_lock) ; } static void * emptier (void *ss) {
MUT_LOCK(empty_lock); while (1) { if (empty_req) { printf ("Emptying\n"); // BOGUS
empty_req = 0; MUT_UNLOCK(empty_lock); MUTATE_START;
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_thr.c Page 6
ss_empty((ss_desc_t *) ss); MUTATE_DONE; MUT_LOCK(empty_lock); } else { printf ("Snoozing for empty\n"); // BOGUS
COND_WAIT(empty_go, empty_lock); } } } static void do_listen (int port) { struct sockaddr_in addr; int i; listen_sock = socket (PF_INET, SOCK_STREAM, 0); if (listen_sock < 0) { perror ("can't open socket"); exit (11); } addr.sin_family = PF_INET; addr.sin_port = htons (port); addr.sin_addr.s_addr = htonl (INADDR_ANY); if (bind(listen_sock, (struct sockaddr *) &addr, sizeof (addr))) { perror ("Can't bind address"); exit (12); } i = 1; if (setsockopt (listen_sock, SOL_SOCKET, SO_REUSEADDR, &i, sizeof (i)) < 0) { perror ("Can't set SO_REUSEADDR"); exit (13); } if (setsockopt (listen_sock, SOL_SOCKET, SO_SNDBUF, (void *) &sock_buf_size, sizeof (sock_buf_size)) < 0) { perror ("setsockopt SO_SNDBUF"); } if (setsockopt (listen_sock, SOL_SOCKET, SO_RCVBUF, (void *) &sock_buf_size, sizeof (sock_buf_size)) < 0) { perror ("setsockopt SO_RCVBUF"); } if (listen (listen_sock, CON_BACKLOG)) { perror ("error listening"); exit (15); } } static void serve (void)
{ for ( ; ; ) {
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_thr.c Page 7
MUT_LOCK(cons_lock); while (EMPTYQ(&cons_free)) { waiting_cons = 1;
COND_WAIT(cons_avail, cons_lock) ;
} waiting_cons = 0; MUT_UNLOCK(cons_lock); do_accept (); } } static void do_accept (void) { int tmp; int sock; int status; struct sockaddr_in sockaddr; con_t *con; tmp = sizeof (sockaddr); sock = accept (listen_sock, (struct sockaddr *) &sockaddr, &tmp); if (sock < 0) { perror ("accept failed"); return; } if (setsockopt (sock, IPPROTO_TCP, TCP_NODELAY, (void *) &tcp_no_delay, sizeof (tcp_no_delay)) < 0) { perror ("setsockopt TCP_NODELAY"); }
MUT_LOCK (cons_lock); REMQH (&cons_free, con); MUT_UNLOCK(cons_lock); MUT_LOCK(con->readerLock); con->sock = sock; con->updsPending = 0; con->flags |= CON_FLAG_OPEN; if (! (con->flags & CON_FLAG_HAVE_THREADS)) { status = THR_CREATE(con->reader, reader, con); if (status) { fprintf (stderr, "reader thread creation failed: %s\n", strerror(status)); assert (0); /* MEMERR */ } status = THR_CREATE(con->writer, writer, con); if (status) { fprintf (stderr, "writer thread creation failed: %s\n", strerror (status)); assert (0); /* MEMERR */ }
/* TODO: should handle errors here by just closing the connection */ con-> lags |= CON_FLAG_HAVE_THREADS; }
COND_SIGNAL(con->readerGo); MUT_UNLOCK(con->readerLock); }
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_thr.c Page 8
static void * reader (void *arg) { con_t *con = (con_t *) arg; size_t reqBufSize = DEFAULT_BUF_SIZE; uint8_t *reqBuf = malloc (reqBufSize); ndtp_hdr_t hdr; int opened; int status; if (!reqBuf) { fprintf (stderr, "unable to malloc req buf for thread\n"); assert (0); /* MEMERR */ } for (;;) {
MUT_LOCK(con->readerLock); while (! (con->flags & CON_FLAG_OPEN)) {
COND_WAIT(con->readerGo, con->readerLock); }
MUT_UNLOCK(con->readerLock); opened = 1; while (opened) { status = recv_bytes (con, (uint8_t *) &hdr, sizeof (hdr)); if (status) { fprintf (stderr, "unable to receive header: %s\n", strerror (status)); opened = 0; continue;
} hdr.size = ntohl (hdr.size); if (reqBufSize < hdr.size) { uint8_t *reqBuf0; reqBuf0 = (uint8_t *) malloc (hdr.size); if (!reqBuf0) { fprintf (stderr, "couldn't allocate %u bytes for request buffer\n", hdr.size); opened = 0; continue; } free(reqBuf); reqBuf = reqBuf0; reqBufSize = hdr.size; } status = recv_bytes(con, reqBuf, hdr.size); if (status) { fprintf (stderr, "unable to receive payload: %s\n", strerror (status)); opened = 0; continue; } switch (hdr.op) { case NDTP_GET: opened = get (con, &hdr, reqBuf, string_store); break; case NDTP_PUT: case NDTP_DEL: opened = upd (con, &hdr, reqBuf, string_store);
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_thr.c Page 9
break; default: fprintf (stderr, "unexpected ndtp request: 0x%x\n", hdr.op); opened = 0; continue;
} } printf ("%u updates, %u async\n", upds, async_upds); upds = 0; async_upds = 0; close_con(con); } } static int recv_bytes (con_t *con, uint8_t *buf, size_t bufSize)
{ size_t be; size_t bcO; be = 0; while (be < bufSize) { bcO = read(con->sock, &buf[be], bufSize - be); if (bcO <= 0) { if (bcO == 0) { return ECONNRESET; } else { return errno; } } be += bcO; } return 0; } static void close_con (con_t *con)
{ upd_t *upd;
MUTATE_START; if (con->updsPending) {
FORALLQ (&upds_pending, upd) { if (upd->con == con) { upd->con = NULL;
con->updsPending--; } } }
MUTATE_DONE; assert (!con->updsPending); close (con->sock); MUT_LOCK (con->readerLock); con->flags &= ~CON_FLAG_OPEN; MUT_UNLOCK (con->readerLock); MUT_LOCK (cons_lock); INSQH (&cons_free, con);
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_thr.c Page 10
if (waiting_cons) {
COND_SIGNAL(cons_avail) ; }
MUT_UNLOCK(cons_lock) ; } static int get (con_t *con, ndtp_hdr_t *hdr, uint8_t *payload, ss_desc_t *ss)
{ ndtp_str_t *key = (ndtp_str_t *) payload; int assocs; size_t be; ss_i_t totalLen; ndtp_get_rsp_t *r; ndtp_str_t *s; ndtp_str_t *s0; int i; key->len = ntohl (key->len); if (hdr->size < key->len) { fprintf (stderr, "get payload (%u) too short for key (%u)\n", hdr->size, key->len); return 0; } for (;;) { s = &((ndtp_get_rsp_t *) con->rspBuf)->value; s->len = con->rspBufSize - sizeof (ndtp_hdr_t); assocs = ss_lookup(ss, key, s, &totalLen); if (con->rspBufSize - sizeof (r->hdr) < totalLen) { uint8_t *rspBuf0; rspBuf0 = (uint8_t *) malloc (totalLen + sizeof (ndtp_hdr_t)); if (!rspBuf0) { fprintf (stderr, "couldn't allocate %u bytes for response buffer\n", totalLen + sizeof (ndtp_hdr_t)); return 0;
} free (con->rspBuf); con->rspBuf = rspBuf0; con->rspBufSize = totalLen + sizeof (ndtp_hdr_t); } else { break; } } s = &((ndtp_get_rsp_t *) con->rspBuf)->value; for (i = 0; i < assocs; i++) { s0 = NDTP_NEXT_STR (s); s->len = htonl (s->len); s = s0; } r = (ndtp_get_rsp_t *) con->rspBuf; r->hdr.op = NDTP_GET_RSP; r->hdr.id = hdr->id; r->hdr.size = htonl (totalLen);
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_thr.c Page 11
be = write (con->sock, r, sizeof (r->hdr) + totalLen); if (be < 0) { perror ("error writing"); return 0; } else if (be < sizeof (r->hdr) + totalLen) { fprintf (stderr, "short write\n"); return 0; } ops_done++; gets_done++; assocs_returned += assocs; return 1; } static int upd (con_t *con, ndtp_hdr_t *hdr, uint8_t *payload, ss_desc_t *ss)
{ int del = (hdr->op == NDTP_DEL); ndtp_str_t *key = (ndtp_str_t *) payload; int async; ndtp_str_t *data; upd_t *upd; size_t be; assert (sizeof (ndtp_put_rsp_t) == sizeof (ndtp_hdr_t)); assert (sizeof (ndtp_del_rsp_t) == sizeof (ndtp_hdr_t)); key->len = ntohl (key->len); if (hdr->size < NDTP_STR_SIZE (key->len) + NDTP_STR_SIZE(0)) { fprintf (stderr, "upd payload (%u) too short for key (%u)\n", hdr->size, key->len); return 0;
} data = NDTP_NEXT_STR(key); data->len = ntohl (data->len); if (hdr->size < NDTP_STR_SIZE(key->len) + NDTP_STR_SIZE(data->len)) { fprintf (stderr, "upd payload (%u) too short for key (%u) + data (%u)\n", hdr->size, key->len, data->len); return 0; }
GET_UPD(upd) ;
MUTATE_START; upd->con = con; upd->rsp.op = del ? NDTP_DEL_RSP : NDTP_PUT_RSP; upd->rsp.id = hdr->id; upd->rsp.size = 0; con->updsPending++ ; upds_live++;
INSQT(&upds_pending, upd); async = del ? ss_delete (ss, key, data) : ss_add(ss, key, data); if (async) {
MUTATE_DONE; } else {
REMQ(upd) ;
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_thr.c Page 12
upds_live--; con->updsPending-- ; MUTATE_DONE; be = write (con->sock, &upd->rsp, sizeof (upd->rsp) ) ; FREE_UPD(upd) ; if (be < sizeof (upd->rsp) ) { if (be < 0) { fprintf (stderr, "error sending update rsp: %s\n" , strerror (errno) ) ,- } else { fprintf (stderr, "only sent %lu bytes of %lu byte update rsp\n",
(unsigned long) be, (unsigned long) sizeof (upd->rsp) ) ; } return 0 ; } } return 1; }
/*
* Environment :
* This function is only entered by the single thread currently
* inside a MUTATE_START/MUTATE_END pair */ static void upd_done (void *arg, ss_i_t completions )
{ int i; dq_t cons; con_t *con; upd_t *upd; upds++; updated = 1; INITQ (&cons); for (i = 0; i < completions; i++) { REMQH (&upds_pending, upd); con = upd->con; if (con) { if (!con->inUpdDoneCons) { con->inUpdDoneCons = 1; INSQT (&cons, &con->updDoneConsLinks); }
MUT_LOCK (con->writerLock) ; INSQT (&con->updsDone, upd) ; MUT_UNLOCK (con->writerLock) ; } else { / *
* This is not a nominal path (connection was closed with pending
* updates) , so performance isn ' t critical */
FREE_UPD (upd) ;
} } while (!EMPTYQ(&cons)) { REMQH (&cons, con); con = QLTOE(con, con_t *, updDoneConsLinks);
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd hr.c Page 13
con->inUpdDoneCons = 0; MUT_LOCK (con->writerLock); if (con->writerSleep) {
COND_SIGNAL (con->writerGo);
} MUT_UNLOCK (con->writerLock);
} } static void * writer (void *arg)
{ con_t *con = (con_t *) arg; upd_t *upd; size_t be; for (;;) {
MUT_LOCK(con->writerLock); while (EMPTYQ(&con->updsDone)) { con->writerSleep = 1;
COND_WAIT (con->writerGo, con->writerLock); } con->writerSleep = 0; REMQH (&con->updsDone, upd); con->updsPending--; MUT_UNLOCK(con->writerLock); be = write (con->sock, &upd->rsp, sizeof (upd->rsp)); FREE_UPD(upd); if (be < sizeof (upd->rsp)) { if (be < 0) { fprintf (stderr, "error sending update rsp: %s\n", strerror (errno));
} else { fprintf (stderr, "only sent %lu bytes of %lu byte update rsp\n" , (unsigned long) be, (unsigned long) sizeof (upd->rsp) ) ;
}
/* TODO: force a close */ } } }
/*
* The first test in this code is fast (no locking) and approximate.
* When the server is well loaded, the first test will rule out the
* need for an update most times.
* The second test is accurate, and just saves some work when the
* approximation suggests that an update might be necessary, when
* it actually is not. It is actually safe to call ss_wait_update
* any number of times and at any time (inside a
* MUTATE_START/MUTATE_END pair), it's just less efficient. */ static void *updater (void *ss) { for (;;) { if (!updated && !EMPTYQ(&upds_pending)) { MUTATE_START; if (!updated && !EMPTYQ(&upds_pending)) {
Copyright 2000 OverX, Inc , All Rights Reserved. ndtpd_thr.c Page 14
async_upds++ ; ss_wait_update( (ss_desc_t *) ss); }
MUTATE_DONE; } updated = 0; usleep (UPDATE_INTERVAL); } }
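The double test that the comment above describes can be read in isolation as follows (an illustrative restatement of the updater() loop body, under the same assumption that only code bracketed by MUTATE_START/MUTATE_DONE may touch the string store):

    if (!updated && !EMPTYQ(&upds_pending)) {        /* cheap, unlocked, possibly stale */
        MUTATE_START;                                /* serialize with the reader threads */
        if (!updated && !EMPTYQ(&upds_pending)) {    /* accurate re-check under the lock */
            ss_wait_update(ss);                      /* push any pending work through the store */
        }
        MUTATE_DONE;
    }
    updated = 0;
    usleep(UPDATE_INTERVAL);

A spurious pass through ss_wait_update is harmless, as the comment notes, so the unlocked first test only needs to be right most of the time.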
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_thr.h Page 1
/*
* OverX Network Distributed Tracking Protocol Server Definitions
* Copyright 2000 OverX, Inc.
*
* This program source' is the property of OverX, Inc. and contains
* information which is confidential and proprietary to OverX, Inc.. No
* part of this source may be copied, reproduced, disclosed to third parties,
* or transmitted in any form or by any means, electronic or mechanical
* for any purpose without the express written consent of OverX, Inc.. *
*/ #ifndef _NDTPD_THR_H
#define _NDTPD_THR_H
#include "ox_mach.h" /* includes inttypes */
#include "ox_thread.h"
#include "ndtp.h"
#include "dq.h"
/*
* This is the state for an open connection */ typedef struct con { dq_t links; int sock; volatile int flags; uint8_t *rspBuf; size_t rspBufSize; int updsPending;
THR_T reader;
MUT_T readerLock;
COND_T readerGo;
THR_T writer;
MUT_T writerLock;
COND_T writerGo; int writerSleep; dq_t updsDone; int inUpdDoneCons; /* owned and operated by upd_done */ dq_t updDoneConsLinks; /* owned and operated by upd_done */ } con_t;
#define CON_FLAG_OPEN 0x01
#define CON_FLAG_HAVE_THREADS 0x02
/*
* This is the state for a pending update ( PUT or DEL) request */ typedef struct upd { dq_t links ; con_t *con; ndtp_hdr_t rsp; } upd_t ;
# endif
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_udp.c Page 1
/*
* OverX Network Distributed Tracking Protocol Server
* Synchronous Multiplexing Variant
* Copyright 2000 OverX, Inc. *
* This program source is the property of OverX, Inc. and contains
* information which is confidential and proprietary to OverX, Inc.. No
* part of this source may be copied, reproduced, disclosed to third parties,
* or transmitted in any form or by any means, electronic or mechanical
* for any purpose without the express written consent of OverX, Inc.. *
*/ static const char *ndtpd_udp_id = "$Id: ndtpd_udp.c,v 1.7 2000/02/09 18:01:59 steph Exp $ (OVERX)"; ftinclude "ox_mach.h" /* includes inttypes */
(tinclude <stdio.h>
((include <unistd.h>
((include <stdlib.h>
#include <string.h>
((include <errno.h> tfinclude <assert.h>
((include <signal.h>
((include <fcntl.h>
((include <sys/types.h>
((include <sys/time.h>
#include <sys/socket .h>
((include <netinet/in.h>
#include <arpa/inet .h>
(tifdef HAVE_NETINET_TCP_H
#include <netinet/tcp.h>
(fendif ttifdef NDTPD_POLL
#include <poll.h>
#endif
#include "dq.h"
((include "string_store.h"
#include "ndtp.h"
((include "timestamp.h"
#include "ndtpd.h"
#include "ndtpd_udp.h"
#include "buff_udp.h"
/*
* Global state
*/ static int ops_done; static int puts_done; static int deletes_done; static int gets_done; static int assocs_returned; buff_udp_t *buff_udp; ss_desc_t *string_store; static int max_upds;
Copyright 2000 OverX, Inc., All Rights Reserved. - I l l - ndtpd_udp.c Page 2
static int upds_live; static int upds_ready; static int upd_head; static int upd_tail; static upd_state_t *upds; static int do_reset = 0 ; static void serve (ss_desc_t *ss) ; static void proc_req (ss_desc_t *ss) ; static void get (ndtp_get_t *req, ss_desc_t *ss) ; static void upd(ndtp_put_t *req, ss_desc_t *ss) ,- static void upd_rsp(upd_state_t *uρd) ,- static void upd_done (void *arg, ss_i_t completions); static void send_upd_rsps (void) ; void ndtpd_serve (ss_desc_t *ss, int port, int sockBufSize, int tcpNoDelay, int maxUpds)
{ string_store = ss; max_upds = maxUpds; ss->callbackFunc = upd_done; ss->callbackArg = NULL; ops_done = 0 ; puts_done = 0 r deletes_done = 0; gets_done = 0; assocs_returned = 0; upds_live = 0; upds_ready = 0 ; upd_head = 0; upd_tail = 0 ,- upds = (upd_state_t *) malloc ( sizeof (upd_state_t) * max_upds) if (lupds) { fprintf (stderr, "unable to malloc upds table\n" ) ; assert(O); /* MEMERR */ } bzero( (caddr_t) upds, sizeof (upd_state_t) * max_upds) ; buf _udp = init_udp(port, 1, sockBufSize); serve (ss) ;
} void ndtpd_reset_store (ss_desc_t *ss)
{ do_reset = 1 ; } static void serve (ss_desc_t *ss)
{
DINIT C serve" ) ;
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_udp.c Page 3
for (;;) { if (upds_live < max_upds && udp_rx_avail (buff_udp, 0)) {
PRINTD (DEBUG_TOP, ("processing request\n" ) ) ; proc_req(ss) ,- } else if (do_reset) { ss_empty(ss) ; do_reset = 0; } else if (upds_ready) { send_upd_rsps ( ) ; } else if (upds_live) {
PRINTD(DEBUG_TOP, ("waiting for updates: %d\n", upds_live) ) ; ss_wait_update (ss) ; } else {
PRINTD (DEBUG_TOP, ("snoozing for input\n")); udp_tx_flush (buff_udp) ,- udp_rx_avail (buff_udp, 1) ; } } } void proc_req (ss_desc_t *ss)
{
DINIT ( "proc_req" ) ; struct sockaddr_in *addr = &upds [upd_head] .addr; ndtp_hdr_t *hdr; ssize_t be; be = udp_rx(buff_udp, sizeof (ndtp_hdr_t) , addr, (void **) &hdr) ; if (be < sizeof (ndtp_hdr_t) ) { fprintf (stderr, "runt NDTP request (%d byes)\n", (int) be) ; }
PRINTD (DEBUG_TOP,
("request from %s:%d\n", inet_ntoa(addr->sin_addr) , ntohs (addr->sin_port) ) ) ; hdr->size = ntohl (hdr->size) ; be = udp_more_rx(buff_udp, hdr->size) ; if (be < hdr->size) { fprint (stderr, "runt NDTP payload (%d versus %d)\n", (int) be, (int) hdr->size) ; return; } switch (hdr->op) { case NDTP_GET: get ( (ndtp_get_t *) hdr, string_store) ; break; case NDTP_PUT: upd( (ndtp_put_t *) hdr, string_store) ; break; case NDTP_DEL: upd ( (ndtp_put_t * ) hdr, string_store) ; break; default : fprintf ( stderr, "unexpected ndtp request : 0x%x\n" , hdr->op) ;
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_udp.c Page 4
} } static void get (ndtp_get_t *req, ss_desc_t *ss)
{
DINITC'get") ; ndtp_hdr_t *hdr = &req->hdr; ndtp_str_t *key = &req->key; ssize_t be; ssize_t bcO; int assocs; ss_i_t totalLen; ndtp_get_rsp__t *r; ndtp_str_t *s; ndtp_str_t *s0; int i ,- key->len = ntohl (key->len) ; if (hdr->size < key->len) { fprintf (stderr, "get payload ( u) too short for key (%u)\n", hdr->size, key->len) ; return; } if (udp_tx_left(buff_udp) < sizeof (ndtp_hdr_t) + NDTP_STR_SIZE(0) ) { udp_tx_ lush (buff_udp) ; } be = udp_tx_left (buff_udp) ; r = (ndtp_get_rsp_t *) udp_tx (buff_udp, be, &upds [upd_head] .addr) ; assert (r) ,- s = &r->value; bcO = be - sizeof (ndtp_hdr_t) ,- s->len = bcO; assocs = ss_lookup(ss, key, s, &totalLen) ,- if (bcO < totalLen) { udp_untx(buff_udp, be) ; udp_tx_ lush (buff_udp) ; r = (ndtp_get_rsp_t *) udp_tx(buff_udp, sizeof (ndtp_hdr_t) + totalLen, &upds [upd_head] .addr) ; if (!r) { fprintf (stderr, "get response too long for datagram (%u)\n", totalLen); return;
} s = &r->value; s->len = totalLen; assocs = ss_lookup(ss, key, s, &totalLen) ; } else { udp_untx(buff_udp, bcO - totalLen); } for (i = 0; i < assocs; i++) {
S0 = NDTP_NEXT_STR(S) ; s->len = htonl (s->len) ; s = s0; }
Copyright 2000 OverX, Inc., All Rights Reserved. ndtρd_udp.c Page 5
r->hdr.op = NDTP_GET_RSP; r->hdr.id = hdr->id; r->hdr.size = htonl (totalLen) ; ops_done++; gets_done++; assocs_returned += assocs;
PRINTD (DEBUG_GET,
("get from %s:%d\n", inet ntoa (upds [upd_head] .addr . sin_addr) , ntohs (upds [upd_head] .addr.sin_port) ) ) ; } static void upd (ndtp_put_t *req, ss_desc_t *ss)
{ ndtp_hdr_t *hdr = &req->hdr; ndtp_str_t *key = &req->key; ndtp_str_t *data; uint32_t keyLen; uint32_t dataLen; keyLen = ntohl (key->len) ; if (hdr->size < NDTP_STR_SIZE ( keyLen) + NDTP_STR_SIZE ( 0 ) ) { fprintf ( stderr, " update payload ( %u) too short for key ( %u) \n" , hdr->size , keyLen) ; return; } else { key->len = keyLen; data = NDTP_NEXT_STR(key) ; dataLen = ntohl (data->len) ,- if (hdr->size < NDTP_STR_SIZE (keyLen) + NDTP_STR_SIZE (dataLen) ) { fprintf (stderr,
"update payload (%u) too short for key (%u) + data (%u)\n" hdr->size, keyLen, dataLen) ; } else { int olduh = upd_head; int del = (hdr->op == NDTP_DEL) ; uint8_t rspOp = del ? NDTP_DEL_RSP : NDTP_PUT_RSP; int async; data->len = dataLen; upds [upd_head] .id = hdr->id; upds [upd_head] .op = rspOp; upd_head = (upd„head + 1) % max_upds; upds_live++; async = del ? ss_delete(ss, key, data) : ss_add(ss, key, data) ; if (lasync) { upd_rsp (&upds [olduh] ) ; upd_head = olduh; upds_live— ; } } } } static void
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_udp.c Page 6
upd_rsp (upd_state_t *upd) { ndtp_hdr_t *rsp; assert (sizeo (ndtp_put_rsp_t) == sizeof (ndtp_hdr_t) ) ; assert (sizeof (ndtp_del_rsp_t) == sizeof (ndtp_hdr_t) ) ; rsp = (ndtp_hdr_t *) udp_tx(buff_udp, sizeof (ndtp_hdr_t) , &upd->addr) ; if (!rsp) { fprintf (stderr, "upd_rsp: couldn't send %u byte response\n", sizeof (ndtp_hdr_t) ) ; return; } rsp->op = upd->op; rsp->id = upd->id; rsp->size = 0; if (upd->op == NDTP_DEL_RSP) { deletes_done++ ; } else { puts_done++ ; } } static void upd_done (void *arg, ss_i_t completions)
{
TIMESTAMP_ENDP(("upd_done %u %u", completions, upds_live)); assert(completions <= upds_live); upds_ready += completions; send_upd_rsps(); } static void send_upd_rsps (void) { upd_state_t *upd; assert(sizeof(ndtp_put_rsp_t) == sizeof(ndtp_hdr_t)); assert(sizeof(ndtp_del_rsp_t) == sizeof(ndtp_hdr_t)); assert(upds_ready <= upds_live); while (upds_ready && udp_tx_left(buff_udp) >= sizeof(ndtp_hdr_t)) { upd = &upds[upd_tail]; upd_rsp(upd); upds_live--; upds_ready--; upd_tail = (upd_tail + 1) % max_upds; } udp_tx_flush(buff_udp); }
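The server code above defers PUT and DEL responses until the string store reports the update durable: each accepted request is parked in a fixed ring of upd_state_t slots, and the completion callback drains that ring in arrival order. The following standalone sketch (illustrative names only, not taken from the listing) shows the same head/tail bookkeeping in isolation:

/* pending_ring_sketch.c -- minimal, self-contained illustration of the
 * deferred-response ring used by the UDP server; all names are hypothetical. */
#include <stdio.h>

#define MAX_PENDING 8

typedef struct {
    unsigned id;   /* protocol id to echo back */
    unsigned op;   /* response opcode to send */
} pending_t;

static pending_t pending[MAX_PENDING];
static int head, tail, live, ready;

/* park a response slot when an update is accepted asynchronously */
static void queue_pending(unsigned id, unsigned op)
{
    pending[head].id = id;
    pending[head].op = op;
    head = (head + 1) % MAX_PENDING;
    live++;
}

/* completion callback: 'completions' logged updates became durable */
static void updates_done(unsigned completions)
{
    ready += completions;
    while (ready > 0) {             /* drain responses in arrival order */
        pending_t *p = &pending[tail];
        printf("send response op=0x%x id=%u\n", p->op, p->id);
        tail = (tail + 1) % MAX_PENDING;
        live--;
        ready--;
    }
}

int main(void)
{
    queue_pending(1, 0x21);
    queue_pending(2, 0x23);
    updates_done(2);                /* both parked responses now go out */
    return 0;
}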
Copyright 2000 OverX, Inc., All Rights Reserved. ndtpd_udp.h Page 1
/*
* OverX Network Distributed Tracking Protocol Server Definitions
* Copyright 2000 OverX, Inc.
* This program source is the property of OverX, Inc. and contains
* information which is confidential and proprietary to OverX, Inc.. No
* part of this source may be copied, reproduced, disclosed to third parties,
* or transmitted in any form or by any means, electronic or mechanical
* for any purpose without the express written consent of OverX, Inc.. *
*/ #ifndef _NDTPD_SYNC_H #define _NDTPD_SYNC_H
#include "ox_mach.h" /* includes inttypes */
#include "ndtp.h" #include "dq.h"
/ *
* This is the state for a pending update (PUT or DEL) request */ typedef struct upd_state { dq_t links; struct sockaddr_in addr; / * IP address for response */ uint32_t id; /* protocol id for response */ uint8_t op; /* opcode for response */
} upd_state_t; #endif
Copyright 2000 OverX, Inc., All Rights Reserved. ox_mach.h Page 1
/* OverX machine/OS specific definitions Copyright 2000 OverX, Inc.
This program source is the property of OverX, Inc. and contains information which is confidential and proprietary to OverX, Inc.. No part of this source may be copied, reproduced, disclosed to third parties, or transmitted in any form or by any means, electronic or mechanical for any purpose without the express written consent of OverX, Inc..
$Id: ox_mach.h,v 1.5 2000/01/2 15:54:19 steph Exp $
*/
#ifndef _OX_MACH_H #define _OX_MACH_H #define HAVE_NETINET_TCP_H #define OX_O_BINARY 0 #define OX_POINTER_SIZE 32
#ifdef CYGWIN
#define HAVE_GETOPT_H #undef HAVE_NETINET_TCP_H #undef OX_O_BINARY #define OX_O_BINARY O_BINARY
/* replacement for inttypes.h */ #include <sys/types.h> typedef u_int8_t uint8_t; typedef u_int16_t uint16_t; typedef u_int32_t uint32_t; typedef u_int64_t uint64_t; typedef ptrdiff_t intptr_t; typedef ptrdiff_t uintptr_t; #endif
#ifdef FreeBSD
#include <inttypes.h> #define HAVE_RANDOM_STATE #endif #ifdef linux #include <inttypes.h> #endif
#ifdef sun #include <inttypes.h> #endif
#ifdef osf
#include <sys/types.h> typedef u_char uint8_t; typedef u_short uint16_t; typedef u_int uint32_t; typedef u_long uint64_t; typedef ptrdiff_t intptr_t; typedef ptrdiff_t uintptr_t;
Copyright 2000 OverX, Inc., All Rights Reserved. ox_mach.h Page 2
#undef OX_POINTER_SIZE #define OX_POINTER_SIZE 64 #endif #endif
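ox_mach.h exists to guarantee fixed-width integer types on platforms that lack <inttypes.h>. When porting, a quick sanity check like the one below can confirm the widths; it is standalone and purely illustrative, and the typedef names are stand-ins rather than the header's own:

/* size_check_sketch.c -- hypothetical width check for a new port */
#include <assert.h>
#include <stdio.h>

typedef unsigned char  ox_uint8;    /* stand-in for the uint8_t typedef above */
typedef unsigned short ox_uint16;   /* stand-in for uint16_t */
typedef unsigned int   ox_uint32;   /* stand-in for uint32_t */

int main(void)
{
    /* the NDTP wire format depends on these exact widths */
    assert(sizeof(ox_uint8)  == 1);
    assert(sizeof(ox_uint16) == 2);
    assert(sizeof(ox_uint32) == 4);
    printf("fixed-width integer types look correct on this platform\n");
    return 0;
}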
Copyright 2000 OverX, Inc., All Rights Reserved. ox_thread.h Page 1
/*
* OverX thread abstraction definitions (Sun vs Pthreads)
* Copyright 2000 OverX, Inc. *
* This program source is the property of OverX, Inc. and contains
* information which is confidential and proprietary to OverX, Inc.. No
* part of this source may be copied, reproduced, disclosed to third parties,
* or transmitted in any form or by any means, electronic or mechanical
* for any purpose without the express written consent of OverX, Inc.. *
* $Id$ */
#ifndef _OX_THREAD_H #define _OX_THREAD_H
#if defined (USE_SUN_THREADS) #include <thread.h> #include <string.h> typedef thread_t THR_T; typedef mutex_t MUT_T; typedef cond_t COND_T;
#define THR_CREATE(t, f, a) (thr_create(NULL, 0, (f), (a), 0, &(t))) #define MUT_INIT(m) (mutex_init(&(m), USYNC_THREAD, NULL)) #define COND_INIT(c) (cond_init(&(c), USYNC_THREAD, NULL))
#define MUT_LOCK(m) do { \ int status = mutex_lock(&(m)); \ if (status) { \ fprintf(stderr, "lock failed: %s\n", strerror(status)); \ abort(); \
} \ } while (0)
#define MUT_UNLOCK(m) do { \ int status = mutex_unlock(& (m) ) ; \ if (status) { \ fprintf (stderr, "unlock failed: %s\n" , strerror (status) ) ; \ abort ( ) ; \
} \
} while (0)
#define COND_SIGNAL(c) do { \ int status = cond_signal(&(c)); \ if (status) { \ fprintf(stderr, "signal failed: %s\n", strerror(status)); \ abort(); \
} \
} while (0)
#define COND_WAIT(c, m) do { \ int status = cond_wait(&(c), &(m)); \ if (status) { \ fprintf(stderr, "signal failed: %s\n", strerror(status)); \ abort(); \
} \
} while (0)
#else
#include <pthread.h> #include <string.h>
Copyright 2000 OverX, Inc., All Rights Reserved. ox_thread.h Page 2
typedef pthread_t THR_T; typedef pthread_mutex_t MUT_T; typedef pthread_cond_t COND_T;
#define THR_CREATE(t, f, a) (pthread_create(&(t), NULL, (f), (a))) #define MUT_INIT(m) (pthread_mutex_init(&(m), NULL)) #define COND_INIT(c) (pthread_cond_init(&(c), NULL))
#define MUT_LOCK(m) do { \ int status = pthread_mutex_lock(&(m) ) ; \ if (status) { \ fprintf (stderr, "lock failed: %s\n", strerror (status) ) ; \ abort ( ) ; \
} \ } while (0)
#define MUT_UNLOCK(m) do { \ int status = pthread_mutex_unlock(&(m)); \ if (status) { \ fprintf(stderr, "unlock failed: %s\n", strerror(status)); \ abort(); \
} \
} while (0)
#define COND_SIGNAL(c) do { \ int status = pthread_cond_signal(&(c)); \ if (status) { \ fprintf(stderr, "signal failed: %s\n", strerror(status)); \ abort(); \
} \
} while (0)
#define COND_WAIT(c, m) do { \ int status = pthread_cond_wait(&(c), &(m)); \ if (status) { \ fprintf(stderr, "signal failed: %s\n", strerror(status)); \ abort(); \
} \ } while (0) #endif #endif /* _OX_THREAD_H */
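The macros above hide the choice between Solaris threads and Pthreads behind one interface. A minimal sketch of how the Pthreads side is typically driven (one worker blocked on a condition variable, signalled by the main thread) follows; it is standalone, with the macro bodies written out inline, and is only an illustration of the pattern, not part of the server sources:

/* cond_sketch.c -- compile with -lpthread; illustrative only */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t work_ready = PTHREAD_COND_INITIALIZER;
static int have_work;

static void *worker(void *arg)
{
    (void) arg;
    pthread_mutex_lock(&lock);                  /* MUT_LOCK(lock) */
    while (!have_work) {
        pthread_cond_wait(&work_ready, &lock);  /* COND_WAIT(work_ready, lock) */
    }
    have_work = 0;
    pthread_mutex_unlock(&lock);                /* MUT_UNLOCK(lock) */
    printf("worker: handled one request\n");
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);     /* THR_CREATE(t, worker, NULL) */
    pthread_mutex_lock(&lock);
    have_work = 1;
    pthread_cond_signal(&work_ready);           /* COND_SIGNAL(work_ready) */
    pthread_mutex_unlock(&lock);
    pthread_join(t, NULL);
    return 0;
}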
Copyright 2000 OverX, Inc., All Rights Reserved. ss_test.c Page 1
/ *
* OverX String Store Test Driver
* Copyright 2000 OverX, Inc. *
* This program source is the property of OverX, Inc. and contains
* information which is confidential and proprietary to OverX, Inc.. No
* part of this source may be copied, reproduced, disclosed to third parties,
* or transmitted in any form or by any means, electronic or mechanical
* for any purpose without the express written consent of OverX, Inc.. *
*/ static const char *ss_test_id = "$Id: ss_test.c,v 1.16 2000/01/18 15:30:06 steph Exp $ (OVERX)";
#include "ox_mach.h" /* includes inttypes */
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <sys/time.h>
#ifdef HAVE_GETOPT_H
#include <getopt.h> #endif
#include "string_store.h"
#include "ndtp.h"
#define BLOCK_SIZE 512
#define DEBUG_ADD 0x1 #define DEBUG_DELETE 0x2 #define DEBUG_GET 0x4 int debug; char *progname; static ss_desc_t ss; static unsigned long read_num_arg(void); static void usage(void); static void add(ndtp_str_t *key, ndtp_str_t *data); static void delete(ndtp_str_t *key, ndtp_str_t *data); static void get(ndtp_str_t *key, ndtp_str_t *checkData, int checkElts); static void print_string(ndtp_str_t *s); static int eq_string(ndtp_str_t *s1, ndtp_str_t *s2); static void callback(void *arg, ss_i_t completions); static ss_i_t async_updates_done; static ss_i_t sync_updates_done;
#define MAX_STRING 128 int main (int argc, char **argv)
{ char *storeName = "/var/tmp/test.ss";
Copyright 2000 OverX, Inc., All Rights Reserved. ss_test.c Page 2
ss_i_t storeSize = 1 * 1024 * 1024; ss_i_t logBufSize = 64 * 1024; ss_i_t logBufs = 2; int passes = 1; int iters = 1; int mappings = 1; int deleteMappings = 0; int testLoad = 0; int testPut = 0; int testDelete = 0; int testGet = 0; int testCheck = 0; int testDump = 0; int status; int c; int i; int pass; int j; int ops; int puts; int deletes; int gets; int madeLog; char *s; ndtp_str_t *key; char keyData[NDTP_STR_SIZE(MAX_STRING)]; ndtp_str_t *data; char dataData[NDTP_STR_SIZE(MAX_STRING)]; ndtp_str_t *checkData; ndtp_str_t *checkData0; struct timeval start; struct timeval end; double loadElapsed; double putElapsed; double deleteElapsed; double closeElapsed; double getElapsed; double dumpElapsed; double totalElapsed; debug = 0;
/* TODO: basename, FreeBSD doesn't have it */ progname = argv[0]; for (;;) { c = getopt(argc, argv, "cD:dgi:I:lM:m:N:ps:S:w"); if (c == -1) break; switch (c) { case 'c': testCheck = (testCheck == 0); break; case 'D': debug = strtoul(optarg, &s, 0); if (*s != '\0') {
Copyright 2000 OverX, Inc., All Rights Reserved. ss_test.c Page 3
usage ( ) ;
} break; case 'd': testDelete = (testDelete == 0); deleteMappings = 1; break; case 'g': testGet = (testGet == 0); break; case 'i': iters = read_num_arg(); break; case 'I': passes = read_num_arg(); break; case 'l': testLoad = (testLoad == 0); break; case 'M': deleteMappings = read_num_arg(); break; case 'N': logBufs = read_num_arg(); break; case 'p': testPut = (testPut == 0); break; case 's': storeSize = read_num_arg(); break; case 'S': logBufSize = read_num_arg(); break; case 'w': testDump = (testDump == 0); break; default: usage(); } } if (optind < argc) { storeName = argv[optind]; } ss_debug |= debug >> 16; ops = 0; puts = 0; deletes = 0; gets = 0; async_updates_done = 0; sync_updates_done = 0; loadElapsed = 0.0; putElapsed = 0.0; deleteElapsed = 0.0;
Copyright 2000 OverX, Inc., All Rights Reserved. ss_test.c Page 4
closeElapsed = 0.0; getElapsed = 0.0; dumpElapsed = 0.0; checkData = NULL; if (testCheck) { checkData = (ndtp_str_t *) malloc (mappings * NDTP_STR_SIZE(MAX_STRING) ) ; if (! checkData) { fprintf (stderr, "unable to malloc check data\n"); exit (2) ; } }
ss.arenaSize = storeSize; ss.chunkSize = SS_CHUNK_SIZE; ss.chunksPerHash = SS_CHUNKS_PER_HASH; ss.callbackFunc = callback; ss.callbackArg = NULL; ss.logBufSize = logBufSize; ss.logBufs = logBufs; status = ss_init(&ss); if (status) { fprintf(stderr, "unable to initialize string store: %s\n", strerror(status)); exit(4); } printf("starting with %lu chunks free\n", (unsigned long) ss.chunksFree); madeLog = 0; for (pass = 0; pass < passes; pass++) { if (testLoad) { gettimeofday(&start, NULL); status = ss_read_log(&ss, storeName); if (status) { fprintf(stderr, "unable to read log %s: %s\n", storeName, strerror(status)); exit(5); } gettimeofday(&end, NULL); loadElapsed += (end.tv_sec - start.tv_sec)
+ (((double) (end.tv_usec - start.tv_usec)) / 1e6); printf("%lu chunks free after reading %s\n",
(unsigned long) ss.chunksFree, storeName); } if (!madeLog && (testPut || testDelete)) { status = ss_new_log(&ss, storeName, 0); if (status) { fprintf(stderr, "unable to open new log %s: %s\n", storeName, strerror(status)); exit(6); } madeLog = 1; }
Copyright 2000 OverX, Inc., All Rights Reserved. ss_test.c Page 5
key = (ndtp_str_t *) keyData; data = (ndtp_str_t *) dataData; if (testPut) { srandom(0); gettimeofday(&start, NULL); for (i = 0; i < iters; i++) { sprintf(key->data, "%lu", (unsigned long) random()); key->len = strlen(key->data); for (j = 0; j < mappings; j++) { sprintf(data->data, "%lu", (unsigned long) random()); data->len = strlen(data->data); add(key, data); ops++; puts++; } } gettimeofday(&end, NULL); putElapsed += (end.tv_sec - start.tv_sec)
+ (((double) (end.tv_usec - start.tv_usec)) / 1e6); } if (testDelete) {
/* Delete the first association for each put */ srandom(0); gettimeofday(&start, NULL); for (i = 0; i < iters; i++) { sprintf(key->data, "%lu", (unsigned long) random()); key->len = strlen(key->data); for (j = 0; j < mappings; j++) { if (j < deleteMappings) { sprintf(data->data, "%lu", (unsigned long) random()); data->len = strlen(data->data); delete(key, data); ops++; deletes++; } else {
/* keep the random number generator in sync */ (void) random(); } } } gettimeofday(&end, NULL); deleteElapsed += (end.tv_sec - start.tv_sec)
+ (((double) (end.tv_usec - start.tv_usec)) / 1e6); } if ((testPut || testDelete) && pass == passes - 1) { gettimeofday(&start, NULL); ss_close_log(&ss); gettimeofday(&end, NULL); closeElapsed += (end.tv_sec - start.tv_sec)
+ (((double) (end.tv_usec - start.tv_usec)) / 1e6); } if (testGet) { srandom(0);
Copyright 2000 OverX, Inc., All Rights Reserved. ss_test.c Page 6
gettimeofday(&start, NULL); for (i = 0; i < iters + 1; i++) { if (testCheck) { sprintf(key->data, "%lu", (unsigned long) random()); key->len = strlen(key->data); checkData0 = checkData; for (j = 0; j < mappings; j++) { if (j < deleteMappings) {
(void) random(); } else { sprintf(checkData0->data, "%lu", (unsigned long) random()); checkData0->len = strlen(checkData0->data); assert(checkData0->len <= MAX_STRING); checkData0 = NDTP_NEXT_STR(checkData0); } } if (i == iters) {
/* Test a key that (probably) isn't there */ get(key, checkData, 0); } else { get(key, checkData, mappings - deleteMappings); } ops++; gets++; } else { for (j = 0; j < mappings + 1; j++) { sprintf(key->data, "%lu", (unsigned long) random()); key->len = strlen(key->data); get(key, NULL, 0); ops++; gets++; } } } gettimeofday(&end, NULL); getElapsed += (end.tv_sec - start.tv_sec)
+ (((double) (end.tv_usec - start.tv_usec)) / 1e6); } if (testDump) { gettimeofday(&start, NULL); status = ss_new_log(&ss, storeName, 1); if (status) { fprintf(stderr, "unable to dup to new log %s: %s\n", storeName, strerror(status)); exit(7); } ss_close_log(&ss); gettimeofday(&end, NULL); dumpElapsed += (end.tv_sec - start.tv_sec)
+ (((double) (end.tv_usec - start.tv_usec)) / 1e6); }
if (testLoad) { double rate = 0.0;
Copyright 2000 OverX, Inc., All Rights Reserved. ss_test.c Page 7
if (loadElapsed != 0.0) { rate = ss.logRecsRead / loadElapsed; } printf("%d recs loaded in %5.2f s, %5.2f reads/s\n", ss.logRecsRead, loadElapsed, rate); } if (testPut) { double rate = 0.0; if (putElapsed != 0.0) { rate = puts / putElapsed; } printf("%d puts in %5.2f s, %5.2f puts/s\n", puts, putElapsed, rate);
} if (testDelete) { double rate = 0.0; if (deleteElapsed != 0.0) { rate = deletes / deleteElapsed; } printf("%d deletes in %5.2f s, %5.2f deletes/s\n", deletes, deleteElapsed, rate); } if (testDelete || testPut) { printf("%u async updates, %u sync updates, %u updates\n", async_updates_done, sync_updates_done, puts + deletes); } if (testGet) { double rate = 0.0; if (getElapsed != 0.0) { rate = gets / getElapsed; } printf("%d gets in %5.2f s, %5.2f gets/s\n", gets, getElapsed, rate); } if (testDump) { double rate = 0.0; if (dumpElapsed != 0.0) { rate = (ss.chunks - ss.chunksFree) / dumpElapsed; } printf("%d chunks dumped in %5.2f s, %5.2f chunks/s\n", ss.chunks - ss.chunksFree, dumpElapsed, rate); } if (testPut || testDelete || testDump) { double rate = 0.0; if ((putElapsed + deleteElapsed + dumpElapsed) != 0.0) { rate = ss.logRecsWritten /
(putElapsed + deleteElapsed + dumpElapsed); } printf("%d recs logged, %5.2f writes/s\n", ss.logRecsWritten, rate); }
{ double rate = 0.0; totalElapsed = putElapsed + deleteElapsed + closeElapsed + getElapsed; if (totalElapsed) { rate = ops / totalElapsed; }
Copyright 2000 OverX, Inc., All Rights Reserved. ss_test.c Page 8
printf("%d ops done in %5.2f s total, %5.2f ops/s\n", ops, totalElapsed, rate); } return 0; } static unsigned long read_num_arg (void) { unsigned long a; char *s; a = strtoul(optarg, &s, 0); switch (s[0]) { case '\0': break; case 'g': case 'G': if (s[1] != '\0') { usage();
} a *= (1024 * 1024 * 1024); break; case 'k': case 'K': if (s[1] != '\0') { usage();
} a *= 1024; break; case 'm': case 'M': if (s[1] != '\0') { usage();
} a *= (1024 * 1024); break; default: usage(); } return a; } static void usage (void) { fprintf(stderr, "usage: %s [opts] [ssFile]\n", progname); printf(" -c: check get results (only makes sense with -g)\n"); printf(" -D n set debug flags (default is 0)\n"); printf(" -d: test string store delete operation\n"); printf(" -g: test get operation\n"); printf(" -i nn[u]: iteration count\n"); printf(" u: k: 1024 iters, m: 1024*1024 iters\n"); printf(" (default is 1)\n"); printf(" -I nn[u]: pass count\n"); printf(" u: k: 1024 passes, m: 1024*1024 passes\n"); printf(" (default is 1)\n");
Copyright 2000 OverX, Inc., All Rights Reserved. ss_test.c Page 9
printf(" -l: test string store load operation\n"); printf(" -m n[u]: mappings per key\n"); printf(" u: k: 1024 iters, m: 1024*1024 iters\n"); printf(" (default is 1)\n"); printf(" -p: test put operation\n"); printf(" -s n[u]: string store arena size\n"); printf(" u: k: 1024 bytes, m: 1024*1024 bytes\n"); printf(" g: 1024*1024*1024 bytes (default is 1m)\n"); printf(" -w: test string dump (write) operation\n\n"); printf(" -c, -d, -g, -l, -p and -w may be used in any combination.\n"); printf("\nEXAMPLES:\n"); printf(" %s -p -s 64m -i 64000 -m\n", progname); printf(" %s -s 64m -l -c -g -m 2 -i 64000\n\n", progname); exit(100); }
static void add (ndtp_str_t *key, ndtp_str_t *data)
{ sync_updates_done += (ss_add(&ss, key, data) ? 0 : 1); if (debug & DEBUG_ADD) { print_string(key); printf(" mapped to "); print_string(data); printf("\n"); } } static void delete (ndtp_str_t *key, ndtp_str_t *data)
{ sync_updates_done += (ss_delete(&ss, key, data) ? 0 : 1); if (debug & DEBUG_DELETE) { print_string(data); printf(" unmapped from "); print_string(key); printf("\n"); } }
#define GET_BUFFER_SIZE 2048 static void get (ndtp_str_t *key, ndtp_str_t *checkData, int checkElts)
{ uint8_t getBuffer[GET_BUFFER_SIZE]; ndtp_str_t *data; int elts; ss_i_t totalLen; int i; data = (ndtp_str_t *) getBuffer; data->len = GET_BUFFER_SIZE; elts = ss_lookup(&ss, key, data, &totalLen);
if (totalLen > GET_BUFFER_SIZE) {
Copyright 2000 OverX, Inc., All Rights Reserved. ss_test.c Page 10
print_string(key); printf(" would get %lu bytes, but we only supplied %lu\n",
(unsigned long) totalLen, (unsigned long) GET_BUFFER_SIZE); } if (checkData) { if (elts != checkElts) { print_string(key); printf(" maps to %d elements, %d expected\n", elts, checkElts); } checkData = NULL; } if (elts) { if (debug & DEBUG_GET) { print_string(key); printf("maps to %d elts:\n", elts); } for (i = 0; i < elts; i++) { if (debug & DEBUG_GET) { printf("\t"); print_string(&data[i]);
} if (checkData && !eq_string(data, checkData)) { print_string(key); printf(" mapped to "); print_string(data); printf(" expected "); print_string(checkData); printf("\n"); checkData = 0;
} data = NDTP_NEXT_STR(data); if (checkData) { checkData = NDTP_NEXT_STR(checkData);
} } } else { if (debug & DEBUG_GET) { printf("no mapping for "); print_string(key); printf("\n"); } } } static void callback (void *arg, ss_i_t completions)
{ async_updates_done += completions; } static void print_string (ndtp_str_t *s)
{ ss_i_t i;
Copyright 2000 OverX, Inc., All Rights Reserved. ss_test.c Page 11
for (i = 0; i < s->len; i++) { printf("%c", s->data[i]); } } static int eq_string (ndtp_str_t *s1, ndtp_str_t *s2)
{ return s1->len != s2->len
|| memcmp(s1->data, s2->data, s1->len); }
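ss_test.c drives the string store with length-prefixed ndtp_str_t values and walks multi-value results with NDTP_NEXT_STR. The standalone sketch below illustrates that kind of packed, length-prefixed layout; the struct shape and the 4-byte padding are assumptions made for the example, since ndtp.h itself is not reproduced here:

/* strwalk_sketch.c -- assumed layout, for illustration only */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t len;      /* number of data bytes that follow */
    uint8_t data[1];   /* 'len' bytes, assumed padded to a 4-byte boundary */
} str_rec_t;

#define STR_REC_SIZE(n) (sizeof(uint32_t) + (((n) + 3u) & ~3u))
#define NEXT_STR_REC(s) ((str_rec_t *) ((uint8_t *) (s) + STR_REC_SIZE((s)->len)))

int main(void)
{
    uint8_t buf[256];
    str_rec_t *s = (str_rec_t *) buf;
    const char *vals[] = { "alpha", "beta", "gamma" };
    int i;
    for (i = 0; i < 3; i++) {              /* pack three values back to back */
        s->len = (uint32_t) strlen(vals[i]);
        memcpy(s->data, vals[i], s->len);
        s = NEXT_STR_REC(s);
    }
    s = (str_rec_t *) buf;
    for (i = 0; i < 3; i++) {              /* walk them again, as the checks do */
        printf("%.*s\n", (int) s->len, (char *) s->data);
        s = NEXT_STR_REC(s);
    }
    return 0;
}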
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 1
/ *
* OverX String Store
* Copyright 2000 OverX, Inc. *
* This program source is the property of OverX, Inc. and contains
* information which is confidential and proprietary to OverX, Inc.. No
* part of this source may be copied, reproduced, disclosed to third parties,
* or transmitted in any form or by any means, electronic or mechanical
* for any purpose without the express written consent of OverX, Inc.. *
*/ static const char *string_store_id = "$Id: string_store.c,v 1.20 2000/02/09 18:01:59 steph Exp $ (OVERX)";
#include "ox_mach.h" /* includes inttypes */
#include <stddef.h> #include <errno.h> #include <fcntl.h> #include <math.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #include <assert.h> #include <sys/mman.h> #include <sys/stat.h> #include <sys/types.h> #ifdef SS_RTAIO #include <aio.h> #endif #ifdef SS_THREAD_IO #include "ox_thread.h" #endif #include "dq.h" #include "string_store.h" #define LOG_FILE_MODE (S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP)
/*
* Arenas smaller than this are silly (not to mention that the
* initialization code might break) */
#define MIN_ARENA_SIZE (1024 * 1024)
/*
* Mapping function helper types */ typedef struct { ss_desc_t *ss; int status; } ss_log_cstr_helper_t; typedef struct { ss_desc_t *ss; ndtp_str_t *s; ss_i_t left;
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 2
int status; ss_i_t lastOffset; } ss_new_cstr_helper_t; typedef struct { ndtp_str_t *s; int eq; } eq_helper_t; typedef struct { ndtp_str_t *data; ss_i_t left; ss_i_t totalLen; ss_i_t index; } ss_copy_mapping_helper_t; typedef struct { ss_rec_t *found; ss_i_t foundNum; ss_cstr_t *data; ss_rec_t *free; ss_i_t freeNum; ss_rec_t **anchor; } ss_find_assoc_helper_t;
/* * Log file reader state */ typedef struct ss_read_log_state { ss_desc_t *ss; ss_i_t state; ss_cstr_t *cstr; ss_cstr_t *cstrChunk; ss_cstr_t **cstrPrev; ss_i_t cstrLen; ss_i_t cstrChunks; ss_i_t cstrChunkldx; ss_i_t cstrChunkLeft; ss_i_t cstrHash; } ss_read_log_state_t; enum {
LOG_STATE_IDLE,
LOG_STATE_NEED_MORE,
LOG_STATE_CSTR_CHUNKS,
LOG_STATE_CSTR_CHARS,
LOG_STATE_ALIGN,
LOG_STATE_DONE };
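/*
 * Reader state transitions (see the switch in ss_read_log below):
 *   LOG_STATE_IDLE        - at a record boundary; dispatch on the record opcode
 *   LOG_STATE_NEED_MORE   - a fixed-size record straddled the read buffer
 *   LOG_STATE_CSTR_CHUNKS - consuming the chunk-offset list of a string record
 *   LOG_STATE_CSTR_CHARS  - consuming the string's characters
 *   LOG_STATE_ALIGN       - skipping padding to the next cell boundary, then IDLE
 *   LOG_STATE_DONE        - end record, end of file, or error
 */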
/* * Internal function prototypes */ static ss_i_t ss_next_prime (ss_i_t p) ; static int ss_init_log_buf (ss_desc_t *ss, ss_log_buf_t *lb, ssize_t size, uint8_t *b) ;
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 3
#ifdef SS_THREAD_IO static void *ss_log_buf_writer (void *arg) ; #endif static void ss_flush_log(ss_desc_t *ss) ; static int ss_dump_to_log(ss_desc_t *ss) ; static void ss_start_log_write(ss_desc_t *ss) ; static void ss_wait_buf_not_full (ss_log_buf_t *lb) ; static void ss_sched_completion(ss_desc_t *ss) ; static uint8_t *ss_log_buf_alloc(ss_desc_t *ss, size_t be); static void ss_log_cstr (ss_desc_t *ss, ss_cstr_t *cs) ; static int ss_log_cstr_chunk_helper (ss_cstr_t *cs, uint8_t *p, ss_i_t offset, ss_i_t be, void *arg) ; static int ss_log_cstr_chars_helper (ss_cstr_t *cs, uint8_t *p, ss_i_t offset, ss_i_t be, void *arg) ; static void ss_log_new_map(ss_desc_t *ss, ss_rec_t **recp, ss_rec_t *rec, ss_cstr_t *data) ; static void ss_log_map(ss_desc_t *ss, ss_rec_t *from, ss_cstr_t *data) ; static void ss_log_delete(ss_desc_t *ss, ss_cstr_t *key, ss_cstr_t *data) ; static int ss_proc_log_rec(ss_log_rec_t *rec, ss_read_log_state_t *ls) ; static int ss_log_start_string(ss_read_log_state_t *ls, ss_log_string_t *rec) ; static int ss_log_proc_new_map(ss_read_log_state_t *ls, ss_log_new_map_t *rec) ; static int ss_log_proc_map(ss_read_log_state_t *ls, ss_log_map_t *rec) ; static int ss_log_proc_unmap(ss_read_log_state_t *ls, ss_log_unmap_t *rec) ; static int ss_log_cstr_next_chunk(ss_i_t offset, ss_read_log_state_t *ls) ; static ss_i_t ss_log_cstr_next_chars (uint8_t *p, ss_i_t left, ss_read_log_state_t *ls) ; static void ss_init_chunk_free(ss_desc_t *ss) ; static void *ss_alloc_chunk(ss_desc_t *ss) ; static int ss_valid_offset(ss_desc_t *ss, ss_i_t offset) ; static int ss_valid_chunk(ss_desc_t *ss, ss_i_t chunkOffset) ; static int ss_valid_free_chunk(ss_desc_t *ss, ss_i_t chunkOffset) ; static void *ss_get_chunk(ss_desc_t *ss, ss_i_t chunkOffset); static void ss_free_chunk(ss_desc_t *ss, void *chunk) ; static ss_i_t ss_len_to_chunks (ss_i_t len); typedef int (*ss_cstr_map_f_t) (ss_cstr_t *cs, uint8_t *p, ss_i_t offset, ss_i_t be, void *arg) ; static void ss_cstr_map(ss_cstr_t *cs, ss_cstr_map_f_t f, void *arg, int postDeref) ; static ss_cstr_t *ss_new_cstr (ss_desc_t *ss, ndtp_str_t *s, ss_i_t hash) ; static int ss_new_cstr_helper (ss_cstr_t *cs, uint8_t *p, ss_i_t offset, ss_i_t be, void *arg) ; static void ss_inc_cstr_ref (ss_cstr_t *cs) ; static void ss_dec_cstr_ref (ss_desc_t *ss, ss_cstr_t *cs) ; static void ss_free_cstr (ss_desc_t *ss, ss_cstr_t *cs) ; static int ss_free_cstr_chunk(ss_cstr_t *cs, uint8_t *p, ss_i_t offset, ss_i_t be, void *arg) ; static int ss_free_rec(ss_rec_t *rec, ss_i_t recNum, ss_cstr_t *cs, ss_i_t fieldNum, void *arg) ; static int ss_cstr_eq(ndtp_str_t *s, ss_cstr_t *cs) ; static int ss_cstr_eq_helper(ss_cstr_t *cs, uint8_t *p, ss_i_t offset, ss_i_t be, void *arg) ; static void ss_cstr_to_str (ss_cstr_t *cs, ndtp_str_t *s) ; static int ss_cstr_to_str_helper (ss_cstr_t *cs, uint8_t *p, ss_i_t offset, ss_i_t be, void *arg) ; typedef int (*ss_rec_map_f_t) (ss_rec_t *rec, ss_i_t recNum, ss_cstr_t *cs, ss_i_t fieldNum, void *arg) ; static void ss_rec_map(ss_cstr_t *cs, ss_rec_map_f_t f, void *arg) ;
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 4
static int ss_copy_mapping(ss_rec_t *rec, ss_i_t recNum, ss_cstr_t *cs, ss_i_t fieldNum, void *arg) ; static ss_i_t ss_hash(uint8_t *s, ss_i_t len, ss_i_t h) ; static ss_cstr_t *ss_find(ss_desc_t *ss, ndtp_str_t *s, ss_i_t *hp) ; static int ss_find_assoc (ss_rec_t *rec, ss_i_t recNum, ss_cstr_t *cs, ss_i_t fieldNum, void *arg) ; static int ss_do_delete(ss_desc_t *ss, ss_cstr_t *key, ss_cstr_t *data) ; ss_cstr_t *find_one_value(ss_cstr_t *key) ,- static void dumpBuf (uint8_t *buf , int bufSize) ;
/* * Log file record names */ static char *ss_log_rec_names[] = {
"end",
"string",
"new_map",
"map",
"unmap" }; int ss_debug; #ifdef STRING_STORE_DEBUG #define DINIT(N) static char *module = N #define PRINTD(F, X) \ do { \ if (ss_debug & (int) (F)) { \ /* VARARGS */ \ printf("%-20s", module); \ /* VARARGS */ \ printf X; \
} \
} while (0)
#define PRINT0D(F, X) \ do { \ if (ss_debug & (int) (F)) { \
/* VARARGS */ \ printf X; \
} \
} while (0) #define PRINTSD(F, X) \ do { \ if (ss_debug & (int) (F)) { \ ss_i_t i; \
/* VARARGS */ \ printf ("%-20s", module); \ for (i = 0; i < (X)->len; i++) { \
/* VARARGS */ \ printf ("%c", (X) ->data[i] ) ; \
} \
} \
} while (0) #else #define DINIT(F) #define PRINTD(F, X) #endif
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 5
/************************************************************************* *************************************************************************
* ** ***
*** Initialization ***
*** ** * ************************************************************************* *************************************************************************/
int ss_init (ss_desc_t *ss)
{
DINIT("ss_init") ; ss_i_t i; uint8_t *b; size_t be; int status; ss_debug |= SS_DEBUG_ERR;
PRINTD (SS_DEBUG_INIT,
("arenaSize: 0x%lx, chunkSize 0x%lx, chunksPerHash:' %f\n" , (unsigned long) ss->arenaSize, (unsigned long) ss->chunkSize, ss->chunksPerHash) ) ;
/* quick sanity checks */ assert(sizeof(ss_cstr_t) == SS_CHUNK_SIZE); assert(sizeof(ss_rec_t) == SS_CHUNK_SIZE); if (SS_CHUNK_SIZE != ss->chunkSize) { PRINTD(SS_DEBUG_ERR | SS_DEBUG_INIT,
("wrong chunk size (%lu expected %lu)\n", (unsigned long) SS_CHUNK_SIZE, (unsigned long) ss->chunkSize)); return EINVAL; } if (ss->logBufSize < 64) { ss->logBufSize = 64; } if (ss->logBufs < 2) { ss->logBufs = 2; } if (ss->arenaSize < MIN_ARENA_SIZE) {
PRINTD (SS_DEBUG_ERR | SS_DEBUG_INIT,
("arena too small (%lu)\n", (unsigned long) ss->arenaSize)); return EINVAL; } ss->logFile = -1; ss->hashSize = ss_next_prime((ss_i_t) (((double) (ss->arenaSize / ss->chunkSize)) / ss->chunksPerHash)); ss->firstChunk =
((ss->hashSize * sizeof(dq_t) + (ss->chunkSize - 1)) & ~(ss->chunkSize - 1));
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 6
ss->chunks = (ss->arenaSize - ss->firstChunk) / ss->chunkSize; be = ss->arenaSize + sizeof (ss_log_buf_t) * ss->logBufs
+ ss->logBufSize * ss->logBufs + getpagesize(); ss->arena = (uint8_t *) malloc(be); if (!ss->arena) {
PRINTD (SS_DEBUG_ERR | SS_DEBUG_INIT,
("string store malloc of %u bytes failed\n" , be)); return ENOMEM; } ss->hash = (dq_t *) ss->arena; for (i = 0; i < ss->hashSize; i++) {
INITQ(&ss->hash[i] ) ; } ss->firstChunk += (ptrdiff_t) ss->arena; ss_init_chunk_free(ss) ; b = (uint8_t *) (ss->firstChunk + ss->chunks * ss->chunkSize) ; ss->logBufDesc = (ss_log_buf_t *) b; b += sizeof (ss_log_buf_t) * ss->logBufs; b = (uint8_t *)
(((ptrdiff_t) (b + getpagesize() - 1)) & ~(((ptrdiff_t) getpagesize()) - 1)); for (i = 0; i < ss->logBufs; i++) { status = ss_init_log_buf(ss, &ss->logBufDesc[i], ss->logBufSize, b); if (status) { free(ss->arena); return status;
} b += ss->logBufSize; } ss->nextLogBuf = 0; ss->logRecsRead = 0 ; ss->logRecsWritten = 0;
PRINTD (SS_DEBUG_INIT, ("done\n")); return 0; } static ss_i_t ss_next_prime (ss_i_t p) { ss_i_t prime, factor, maxfactor; prime = (p & ~1) + 1; for (;;) { maxfactor = ((int) sqrt((double) prime)) + 1; for (factor = 3; factor <= maxfactor && prime % factor != 0; factor += 2); if (factor > maxfactor) { break;
} prime += 2; } return prime;
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 7
************************************************************************* ************************************************************************* ** * * * *
*** Log File Writing ***
* * * * * *
************************************************************************* ************************************************************************* int ss_new_log (ss_desc_t *ss, const char *logFileName, int dump)
{
DINIT("ss_new_log"); ss_log_desc_t logDesc; uint32_t t; int logFile; int oldLogFile; off_t oldFilePos; off_t filePos; int status; ss_flush_log(ss); if ((logFile = open(logFileName,
O_WRONLY | O_CREAT | O_TRUNC | O_APPEND | OX_O_BINARY, LOG_FILE_MODE)) < 0) { status = errno;
PRINTD(SS_DEBUG_ERR | SS_DEBUG_INIT, ("create %s failed\n", logFileName)); return status; } t = 0x11223344; logDesc.endian =
(*((uint8_t *) &t) == 0x44) ? SS_ENDIAN_LITTLE : SS_ENDIAN_BIG; logDesc.cellSize = sizeof(ss_i_t); logDesc.chunkSize = ss->chunkSize; logDesc.chunks = ss->chunks; if (write(logFile, &logDesc, sizeof(logDesc)) < 0) { status = errno; close(logFile); return status; } filePos = sizeof(logDesc); if (dump) { oldLogFile = ss->logFile; oldFilePos = ss->filePos; ss->logFile = logFile; ss->filePos = filePos; status = ss_dump_to_log(ss); if (status) { ss->logFile = oldLogFile;
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 8
ss->filePos = oldFilePos; close(logFile); return status; } filePos = ss->filePos; ss->logFile = oldLogFile; ss->filePos = oldFilePos; } status = ss_close_log(ss); if (status) { close(logFile); return status; } ss->logFile = logFile; ss->filePos = filePos; return status; } int ss_close_log (ss_desc_t *ss)
{ if (ss->logFile != -1) { ss_flush_log(ss) ; if (close (ss->logFile) < 0) { return errno; } ss->logFile = -1; } return 0; } void ss_wait_update (ss_desc_t *ss)
{ ss_log_buf_t *lb; int i;
/* Search for the oldest full buffer and wait for it to be emptied */ for (i = 1; i < ss->logBufs; i++) { lb = &ss->logBufDesc[ (ss->nextLogBuf + i) % ss->logBufs] ; if (lb->full) { ss_wait_buf_not_full (lb) ; return; } } lb = &ss->logBufDesc[ss->nextLogBuf] ; if (lb->used) { ss_start_log_write(ss) ; ss_wait_buf_not_full (lb) ,- } } void ss_wait_update_all (ss_desc_t *ss)
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 9
{ ss_flush_log(ss);
} static void ss_flush_log (ss_desc_t *ss)
{ int i; if (ss->logFile != -1) { if (ss->logBufDesc [ss->nextLogBuf] .used) { ss_start_log_write(ss) ; } for(i = 0; i < ss->logBufs; i++) { ss_wait_buf_not_full (&ss->logBufDesc [i] ) ; } } } static int ss_init_log_buf (ss_desc_t *ss, ss_log_buf_t *lb, ssize_t size, uint8_t *b)
{
DINIT("ss_init_log_buf"); int status = 0; lb->size = size; lb->used = 0; lb->buf = b; lb->completions = 0; lb->full = 0; lb->ss = ss; #if defined (SS_RTAIO) lb->aiocb.aio_buf = b; lb->aiocb.aio_sigevent.sigev_notify = SIGEV_NONE; lb->aiocb.aio_reqprio = 0; lb->aiocb.aio_offset = 0; #elif defined (SS_THREAD_IO) lb->flags = 0; if ((status = MUT_INIT(lb->lock)) != 0) { PRINTD(SS_DEBUG_ERR | SS_DEBUG_INIT,
("unable to init mutex: %s\n", strerror(status))); } else if ((status = COND_INIT(lb->writeReq)) != 0) { PRINTD (SS_DEBUG_ERR | SS_DEBUG_INIT,
("unable to init writeReq cond var: %s\n", strerror(status))); } else if ((status = COND_INIT(lb->writeDone)) != 0) { PRINTD (SS_DEBUG_ERR | SS_DEBUG_INIT,
("unable to init writeDone cond var: %s\n", strerror(status))); } else if ((status = THR_CREATE(lb->thread, ss_log_buf_writer, lb)) != 0) { PRINTD (SS_DEBUG_ERR | SS_DEBUG_INIT,
("unable to create thread: %s\n", strerror(status))); } #endif /* SS_RTAIO elif SS_THREAD_IO */ return status; }
#ifdef SS_THREAD_IO
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 10
static void * ss_log_buf_writer (void *arg)
{ ss_log_buf_t *lb = (ss_log_buf_t *) arg; ss_desc_t *ss = lb->ss;
MUT_LOCK(lb->lock); for (;;) { if (lb->flags & LOG_BUF_FLAG_WRITING) { MUT_UNLOCK(lb->lock); if (pwrite(ss->logFile, lb->buf, lb->used, lb->filePos) < 0) { lb->status = errno; }
MUT_LOCK(lb->lock); lb->flags &= ~LOG_BUF_FLAG_WRITING; if (lb->flags & LOG_BUF_FLAG_WAIT_WRITE) { COND_SIGNAL(lb->writeDone); } } else {
COND_WAIT(lb->writeReq, lb->lock); } } return 0; } #endif /* SS_THREAD_IO */
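/*
 * Note on the threaded writer above: ss_start_log_write marks the current
 * buffer LOG_BUF_FLAG_WRITING and signals writeReq; this thread performs the
 * pwrite, clears the flag, and signals writeDone only when a waiter has set
 * LOG_BUF_FLAG_WAIT_WRITE. ss_wait_buf_not_full blocks on writeDone and then
 * fsyncs the log file before handing the buffer back to ss_log_buf_alloc.
 */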
/*
* Dump a complete snapshot of the string store to a log file
* This must only be done at the beginning of the log file */ static int ss_dump_to_log (ss_desc_t *ss)
{ ss_i_t i; ss_i_t j; int newed; dq_t *cs; ss_rec_t *rec; ss_rec_t **anchor; for (i = 0; i < ss->hashSize; i++) { for (cs = ss->hash[i].n; cs != &ss->hash[i]; cs = cs->n) { ss_log_cstr(ss, (ss_cstr_t *) cs);
} } for (i = 0; i < ss->hashSize; i++) { for (cs = ss->hash[i].n; cs != &ss->hash[i]; cs = cs->n) { /*
* There may be some error scenarios where a string is in the hash
* table but unused. Theoretically, these are (at the current time)
* the result of programming errors. Nonetheless, this check is
* in here for robustness */ if (((ss_cstr_t *) cs)->rec || ((ss_cstr_t *) cs)->ref) { anchor = &((ss_cstr_t *) cs)->rec;
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 11
for (rec = *anchor; rec; rec = rec->next) { newed = 0; for (j = 0; j < SS_REC_FIELDS; j++) { if (rec->fields[j]) { if (newed) { ss_log_map(ss, rec, rec->fields[j]); } else { ss_log_new_map(ss, anchor, rec, rec->fields[j]); newed = 1; } } } anchor = &rec->next; } } } } return 0; } static void ss_start_log_write (ss_desc_t *ss)
{
DINIT("ss_start_log_write"); ss_log_buf_t *lb = &ss->logBufDesc[ss->nextLogBuf]; size_t used = lb->used; assert(!lb->full); #ifdef SS_RTAIO lb->aiocb.aio_fildes = ss->logFile; lb->aiocb.aio_nbytes = used; if (aio_write(&lb->aiocb) < 0) {
PRINTD(SS_DEBUG_LOG_WRITE | SS_DEBUG_ERR,
("aio_write of 0x%lx bytes to log file failed: %s\n", (unsigned long) used, strerror(errno))); assert(0); /* IOERR */ } #elif defined (SS_THREAD_IO) MUT_LOCK(lb->lock); assert(!(lb->flags & LOG_BUF_FLAG_WRITING)); lb->filePos = ss->filePos; lb->flags |= LOG_BUF_FLAG_WRITING; lb->status = 0; COND_SIGNAL(lb->writeReq); MUT_UNLOCK(lb->lock); #else if (write(ss->logFile, lb->buf, used) < 0) { PRINTD (SS_DEBUG_LOG_WRITE | SS_DEBUG_ERR,
("write of 0x%lx bytes to log file failed: %s\n", (unsigned long) used, strerror(errno))); assert(0); /* IOERR */ } #endif lb->full = 1; ss->filePos += used; ss->nextLogBuf = (ss->nextLogBuf + 1) % ss->logBufs; }
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 12
static void ss_wait_buf_not_full (ss_log_buf_t *lb)
{
DINIT("ss_wait_buf_not_full"); ss_desc_t *ss = lb->ss; if (lb->full) { #ifdef SS_RTAIO const struct aiocb *aiocbl[1]; int status; aiocbl[0] = &lb->aiocb; if (aio_suspend(aiocbl, 1, NULL) < 0) {
PRINTD (SS_DEBUG_LOG_WRITE | SS_DEBUG_ERR,
("aio_suspend failed: %s\n", strerror(status))); assert(0); /* IOERR */
} status = aio_error (&lb->aiocb) ; if (status) {
PRINTD (SS_DEBUG_LOG_WRITE | SS_DEBUG_ERR,
("aio_write ultimately failed: %s\n", strerror(status))); assert(0); /* IOERR */ } #elif defined(SS_THREAD_IO) MUT_LOCK(lb->lock); if (lb->flags & LOG_BUF_FLAG_WRITING) { lb->flags |= LOG_BUF_FLAG_WAIT_WRITE; while (lb->flags & LOG_BUF_FLAG_WRITING) { COND_WAIT(lb->writeDone, lb->lock); } lb->flags &= ~LOG_BUF_FLAG_WAIT_WRITE; }
MUT_UNLOCK(lb->lock) ; if (lb->status) {
PRINTD (SS_DEBUG_LOG_WRITE | SS_DEBUG_ERR,
("write ultimately failed: %s\n" , strerror (lb->status) )) ; assert(O); /* IOERR */ } #endif if (fsync(ss->logFile) < 0) {
PRINTD (SS_DEBUG_LOG_WRITE | SS_DEBUG_ERR,
("fsync failed: %s\n", strerror(errno))); assert(0); /* IOERR */ } lb->full = 0; lb->used = 0; if (lb->completions) {
(ss->callbackFunc) (ss->callbackArg, lb->completions) ; lb->completions = 0; } } } static void ss_sched_completion (ss_desc_t *ss)
{
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 13
/ *
* TODO: if buffer is full proactively start the write
*/ ss->logBufDesc[ss->nextLogBuf].completions++; ss->logRecsWritten++; } static uint8_t * ss_log_buf_alloc (ss_desc_t *ss, size_t be)
{ ss_log_buf_t *lb; uint8_t *p; again: lb = &ss->logBufDesc[ss->nextLogBuf]; ss_wait_buf_not_full(lb); assert(be <= lb->size); if (lb->size - lb->used < be) { /* * This path will only be taken once */ ss_start_log_write(ss); goto again; } p = &lb->buf[lb->used]; lb->used += be; return p; } static void ss_log_cstr (ss_desc_t *ss, ss_cstr_t *cs)
{ ss_log_string_t *ls; ss_log_cstr_helper_t ch;
ls = (ss_log_string_t *) ss_log_buf_alloc(ss, sizeof(ss_log_string_t) - sizeof(ls->chunks)); ls->op = SS_LOG_STRING; ls->len = cs->len; ch.ss = ss; ss_cstr_map(cs, ss_log_cstr_chunk_helper, &ch, 0); ss_cstr_map(cs, ss_log_cstr_chars_helper, &ch, 0); } static int ss_log_cstr_chunk_helper (ss_cstr_t *cs, uint8_t *p, ss_i_t offset, ss_i_t be, void *arg) { ss_log_cstr_helper_t *ch = (ss_log_cstr_helper_t *) arg; ss_desc_t *ss = ch->ss; ss_i_t *p0; p0 = (ss_i_t *) ss_log_buf_alloc(ss, sizeof(ss_i_t)); *p0 = (ss_i_t) (((ptrdiff_t) cs) - ss->firstChunk);
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 14
return 0; } static int ss_log_cstr_chars_helper (ss_cstr_t *cs, uint8_t *p, ss_i_t offset, ss_i_t be, void *arg)
{ ss_log_cstr_helper_t *ch = (ss_log_cstr_helper_t *) arg; ss_desc_t *ss = ch->ss; uint8_t *p0; ss_i_t bcAlig; bcAlig = ((be + sizeof(ss_i_t) - 1) & ~(sizeof(ss_i_t) - 1)); p0 = ss_log_buf_alloc(ss, bcAlig); bcopy(p, p0, be); return 0; } static void ss_log_new_map (ss_desc_t *ss, ss_rec_t **recp, ss_rec_t *rec, ss_cstr_t *data)
{ ss_log_new_map_t *lm; lm = (ss_log_new_map_t *) ss_log_buf_alloc(ss, sizeof(ss_log_new_map_t)); lm->op = SS_LOG_NEW_MAP; lm->from = (ss_i_t) (((ptrdiff_t) recp) - ss->firstChunk); lm->rec = (ss_i_t) (((ptrdiff_t) rec) - ss->firstChunk); lm->to = (ss_i_t) (((ptrdiff_t) data) - ss->firstChunk); assert(ss_valid_offset(ss, lm->from)); assert(ss_valid_chunk(ss, lm->rec)); assert(ss_valid_chunk(ss, lm->to)); } static void ss_log_map (ss_desc_t *ss, ss_rec_t *from, ss_cstr_t *data)
{ ss_log_map_t *lm; lm = (ss_log_map_t *) ss_log_buf_alloc(ss, sizeof(ss_log_map_t)); lm->op = SS_LOG_MAP; lm->rec = (ss_i_t) (((ptrdiff_t) from) - ss->firstChunk); lm->to = (ss_i_t) (((ptrdiff_t) data) - ss->firstChunk); assert(ss_valid_chunk(ss, lm->rec)); assert(ss_valid_chunk(ss, lm->to)); } static void ss_log_delete (ss_desc_t *ss, ss_cstr_t *key, ss_cstr_t *data)
{ ss_log_unmap_t *lu; lu = (ss_log_unmap_t *) ss_log_buf_alloc (ss, sizeof (ss_log_unmap_t) ) ; lu->op = SS_LOG_UNMAP;
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 15
lu->from = (ss_i_t) (((ptrdiff_t) key) - ss->firstChunk); lu->to = (ss_i_t) (((ptrdiff_t) data) - ss->firstChunk); assert(ss_valid_offset(ss, lu->from)); assert(ss_valid_chunk(ss, lu->to)); }
* * * * * *
*** Log File Reading ***
* * * * * * ************************************************************************* *************************************************************************/ #define READ_BUF_SIZE 64 * 1024 int ss_read_log (ss_desc_t *ss, const char *logFileName)
{
DINIT("ss_read_log"); int logFile; ss_log_desc_t logDesc; uint32_t t; int littleEndian; uint8_t logBuf[READ_BUF_SIZE]; ss_log_rec_t recBuf; ss_log_rec_t *rec; uint8_t *p; uint8_t *morep; ss_i_t need; ss_read_log_state_t ls; ssize_t left0; size_t left; ss_i_t be; int last; int status; int prevState; if ((logFile = open(logFileName, O_RDWR | OX_O_BINARY, 0)) < 0) { status = errno;
PRINTD(SS_DEBUG_ERR | SS_DEBUG_INIT, ("open %s failed\n", logFileName) ) ; return status; } if (read (logFile, (void *) SlogDesc, sizeof (logDesc) ) < 0) {
PRINTD (SS_DEBUG_ERR | SS_DEBUG_INIT, ( "descriptor read failed\n" )) ; status = errno; close (logFile) ; return status; } t = 0x11223344; littleEndian = (*((uint8_t *) &t) == 0x44); if ((littleEndian && logDesc. endian == SS_ENDIAN_BIG)
|| (! littleEndian && logDesc. endian == SS_ENDIAN_LITTLE) ) { PRINTD (SS_DEBUG_ERR | SS_DEBUG_INIT,
("wrong endian (%s)\n", littleEndian ? "big" : "little")); close (logFile) ;
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 16
return EINVAL; } if (logDesc . cellSize ! = sizeof ( ss_i_t) ) {
PRINTD ( SS_DEBUG_ERR | SS_DEBUG_INIT,
("wrong cell size (%d)\n", logDesc. cellSize) ) ; close (logFile) ; return EINVAL; } if (logDesc. chunkSize != ss->chunkSize) { PRINTD (SS_DEBUG_ERR | SS_DEBUG_INIT,
("wrong chunk size; %lu, expected %lu)\n", (unsigned long) logDesc.chunkSize, (unsigned long) ss->chunkSize) ) ; close (logFile) ; return EINVAL; } if (logDesc.chunks > ss->chunks) {
PRINTD (SS_DEBUG_ERR | SS_DEBUG_INIT,
("too many chunks; %lu, only have %lu\n", (unsigned long) logDesc. chunks, (unsigned long) ss->chunks) ) ; close (logFile) ; return EINVAL; }
ls.state = LOG_STATE_IDLE; ls.ss = ss; left = 0; status = 0; need = 0; last = 0; p = NULL; morep = NULL; while (ls.state != LOG_STATE_DONE) { prevState = ls.state; if (!left) { if ((left0 = read(logFile, logBuf, READ_BUF_SIZE)) < 0) { status = errno;
PRINTD (SS_DEBUG_ERR | SS_DEBUG_INIT,
("error reading: %s\n", strerror(status))); ls.state = LOG_STATE_DONE; continue; } left = left0; last = (left < READ_BUF_SIZE); p = logBuf; } if (last && left < sizeof(rec->op)) { if (ls.state == LOG_STATE_IDLE) { if (left) {
PRINTD (SS_DEBUG_ERR | SS_DEBUG_INIT,
("odd file size, ignoring final %d bytes\n", left)); } PRINTD(SS_DEBUG_INIT, ("end of file\n"));
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 17
} else {
PRINTD (SS_DEBUG_ERR | SS_DEBUG_INIT, ("truncated log file\n")); status = EINVAL; }
ls.state = LOG_STATE_DONE; continue; } switch (ls.state) { case LOG_STATE_IDLE: rec = (ss_log_rec_t *) p; switch (rec->op) { case SS_LOG_END: need = sizeof(ss_log_end_t); break; case SS_LOG_STRING: need = sizeof(ss_log_string_t) - sizeof(ss_i_t); break; case SS_LOG_NEW_MAP: need = sizeof(ss_log_new_map_t); break; case SS_LOG_MAP: need = sizeof(ss_log_map_t); break; case SS_LOG_UNMAP: need = sizeof(ss_log_unmap_t); break; default:
PRINTD (SS_DEBUG_INIT | SS_DEBUG_ERR,
("unknown op 0x%lx\n", (unsigned long) rec->op)); status = EINVAL; ls.state = LOG_STATE_DONE; continue; } if (left < need) { if (last) {
PRINTD (SS_DEBUG_ERR | SS_DEBUG_INIT,
("truncated %s record\n", ss_log_rec_names[rec->op]));
ls.state = LOG_STATE_DONE; continue; } bcopy(p, (void *) &recBuf, left); morep = ((uint8_t *) &recBuf) + left; need -= left; left = 0;
ls.state = LOG_STATE_NEED_MORE; } else { p += need; left -= need; status = ss_proc_log_rec(rec, &ls); } break; case LOG_STATE_NEED_MORE: rec = &recBuf; if (left < need) { PRINTD (SS_DEBUG_ERR | SS_DEBUG_INIT,
("truncated %s record\n", ss_log_rec_names[rec->op]));
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 18
close (logFile) ;
} bcopy(p, morep, need); p += need; left -= need; status = ss_proc_log_rec(&recBuf, &ls); break; case LOG_STATE_CSTR_CHUNKS: status = ss_log_cstr_next_chunk(*(ss_i_t *) p, &ls); left -= sizeof(ss_i_t); p += sizeof(ss_i_t); break; case LOG_STATE_CSTR_CHARS: be = ss_log_cstr_next_chars(p, left, &ls); p += be; left -= be; break; case LOG_STATE_ALIGN: if (left) { be = left; left = (left & ~(sizeof(ss_i_t) - 1)); be -= left; p += be;
}
ls.state = LOG_STATE_IDLE; break; default:
PRINTD (SS_DEBUG_INIT | SS_DEBUG_ERR,
("unknown log file reading state: 0x%x\n", ls.state)); abort(); } } if (status) {
PRINTD (SS_DEBUG_INIT | SS_DEBUG_ERR, ("error at 0x%lx: %s\n", (unsigned long) lseek(logFile, 0, SEEK_CUR) - left, strerror (status) ) ) ; } close (logFile) ,- return status; } static int ss_proc_log_rec (ss_log_rec_t *rec, ss_read_log_state_t *ls)
{
DINIT ( " ss_proc_log_rec" ) ; int status = 0 ; ss_desc_t * ss = ls->ss ; switch (rec->op) { case SS_LOG_END : ls->state = LOG_STATE_DONE; break ,- case SS_LOG_STRING : status = ss_log_start_string ( ls , & ( rec->string) ) ; break ,-
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 19
case SS_LOG_NEW_MAP: status = ss_log_proc_new_map(ls, & (rec->new_map) ) ; break; case SS_L0G_MAP: status = ss_log_proc_map(ls, & (rec->map) ) ; break; case SS_LOG_UNMAP: status = ss_log_proc_unmap(ls, & (rec->unmap) ) ; break; default:
PRINTD (SS_DEBUG_INIT | SS_DEBUG_ERR,
("unknown log file op read: 0x%lx\n", (unsigned long) rec->op));
status = EINVAL; ls->state = LOG_STATE_DONE; } ss->logRecsRead++; return status; } static int ss_log_start_string (ss_read_log_state_t *ls, ss_log_string_t *rec)
{ ls->cstr = NULL; ls->cstrPrev = &ls->cstr; ls->cstrLen = rec->len; ls->cstrChunks = ss_len_to_chunks (rec->len) ,- ls->cstrChunkIdx = 0; ls->state = LOG_STATE_CSTR_CHUNKS; return 0; } static int ss_log_proc_new_map (ss_read_log_state_t *ls, ss_log_new_map_t *rec)
{
DINIT( "ss_log_proc_new_map" ) ; ss_desc_t *ss = ls->ss; ss_cstr_t *data; ss_rec_t *ssRec; int status = 0; if (!(ssRec = (ss_rec_t *) ss_get_chunk(ss, rec->rec) ) ) { status = EINVAL; ls->state = LOG_STATE_DONE; } else if ( ! ss_valid_offset (ss, rec->from) ) {
PRINTD(SS_DEBUG_ERR | SS_DEBUG_INIT,
("bogus from 0x%lx\n", (unsigned long) rec->from));
status = EINVAL; ls->state = LOG_STATE_DONE; } else if (!ss_valid_chunk(ss, rec->to)) {
PRINTD (SS_DEBUG_ERR | SS_DEBUG_INIT,
("bogus to 0x%lx\n", (unsigned long) rec->from) ) ;
status = EINVAL; ls->state = LOG_STATE_DONE; } else { data = (ss_cstr_t *) (ss->firstChunk + rec->to);
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 20
bzero((void *) &ssRec->fields [1] ,
(SS_REC_FIELDS - 1) * sizeof (ssRec->fields [0] )) ; ssRec->next = NULL; ssRec->fields [0] = data;
*((ss_rec_t **) (ss->firstChunk + rec->from)) = ssRec; ss_inc_cstr_ref(data); ls->state = LOG_STATE_IDLE; }
return status; } static int ss_log_proc_map (ss_read_log_state_t *ls, ss_log_map_t *rec)
{
DINIT("ss_log_proc_map"); ss_desc_t *ss = ls->ss; ss_cstr_t *data; ss_rec_t *ssRec; ss_i_t i; int status = 0; if (!ss_valid_chunk(ss, rec->rec)) { PRINTD (SS_DEBUG_ERR | SS_DEBUG_INIT,
("bogus rec 0x%lx\n", (unsigned long) rec->rec)); status = EINVAL; ls->state = LOG_STATE_DONE; } else if (!ss_valid_chunk(ss, rec->to)) { PRINTD (SS_DEBUG_ERR | SS_DEBUG_INIT,
("bogus to 0x%lx\n", (unsigned long) rec->to)); status = EINVAL; ls->state = LOG_STATE_DONE; } else { data = (ss_cstr_t *) (ss->firstChunk + rec->to); ssRec = (ss_rec_t *) (ss->firstChunk + rec->rec); for (i = 0; i < SS_REC_FIELDS; i++) { if (!ssRec->fields[i]) { ssRec->fields[i] = data; ss_inc_cstr_ref(data); break; } } if (i == SS_REC_FIELDS) {
PRINTD (SS_DEBUG_ERR | SS_DEBUG_INIT,
("map 0x%lx full\n", (unsigned long) ssRec)); status = EINVAL; ls->state = LOG_STATE_DONE; } else { ls->state = LOG_STATE_IDLE; } } return status; } static int ss_log_proc_unmap (ss_read_log_state_t *ls, ss_log_unmap_t *rec)
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 21
{ DINIT("ss_log_proc_unmap"); ss_desc_t *ss = ls->ss; ss_cstr_t *key; ss_cstr_t *data; int status = 0; if (!ss_valid_chunk(ss, rec->from)) {
PRINTD (SS_DEBUG_ERR | SS_DEBUG_INIT,
("bogus from: 0x%lx\n", (unsigned long) rec->from)); status = EINVAL; ls->state = LOG_STATE_DONE; } else if (!ss_valid_chunk(ss, rec->to)) {
PRINTD (SS_DEBUG_ERR | SS_DEBUG_INIT,
("bogus to: 0x%lx\n", (unsigned long) rec->to)); status = EINVAL; ls->state = LOG_STATE_DONE; } else { key = (ss_cstr_t *) (ss->firstChunk + rec->from); data = (ss_cstr_t *) (ss->firstChunk + rec->to);
(void) ss_do_delete(ss, key, data); ls->state = LOG_STATE_IDLE; } return status; } static int ss_log_cstr_next_chunk (ss_i_t offset, ss_read_log_state_t *ls)
{ int status = 0; ss_desc_t *ss = ls->ss; ss_cstr_t *cstr; ss_cstr_t *cstr0; ss_i_t i; if (!(cstr = (ss_cstr_t *) ss_get_chunk(ss, offset))) { cstr = ls->cstr; for (i = 0; i < ls->cstrChunkIdx; i++) { cstr0 = cstr->cont; ss_free_chunk(ss, cstr); cstr = cstr0; } status = EINVAL; ls->state = LOG_STATE_DONE; } else { if (!ls->cstr) { ls->cstr = cstr; cstr->rec = 0; cstr->ref = 0; cstr->len = ls->cstrLen; }
*ls->cstrPrev = cstr; ls->cstrPrev = &cstr->cont; ls->cstrChunkIdx++; if (ls->cstrChunkIdx == ls->cstrChunks) { ls->cstrChunkIdx = 0; ls->cstrHash = 0;
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 22
ls->cstrChunk = ls->cstr; ls->cstrChunkLeft = SS_CSTR_DATA_SIZE; ls->state = LOG_STATE_CSTR_CHARS ; } } return status; } static ss_i_t
ss_log_cstr_next_chars (uint8_t *p, ss_i_t left, ss_read_log_state_t *ls)
{ ss_desc_t *ss = ls->ss; uint8_t *cp; ss_i_t be; cp = &((uint8_t *) ls->cstrChunk)[ss->chunkSize - ls->cstrChunkLeft]; if (ls->cstrLen > ls->cstrChunkLeft) { be = ls->cstrChunkLeft - sizeof(ls->cstr->cont); } else { be = ls->cstrLen; } if (be > left) { be = left; } ls->cstrHash = ss_hash(p, be, ls->cstrHash); bcopy(p, cp, be); ls->cstrChunkLeft -= be; ls->cstrLen -= be; if (!ls->cstrLen) { ls->cstrHash %= ss->hashSize;
INSQH(&(ss->hash[ls->cstrHash]), ls->cstr); ls->state = LOG_STATE_ALIGN; } if (!ls->cstrChunkLeft) { ls->cstrChunkLeft = ss->chunkSize; ls->cstrChunk = ls->cstrChunk->cont; } return be; }
/************************************************************************* ******.*******************************************************************
*** ***
*** Memory Chunk Management ***
*** ** * ************************************************************************* *************************************************************************/
static void ss_init_chunk_free (ss_desc_t *ss)
{
DINIT("ss_init_chunk_free"); ss_i_t i;
INITQ (&ss->chunkFreeQ) ;
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 23
for (i = 0; i < ss->chunks; i++) {
INSQT(&ss->chunkFreeQ, &((ss_cstr_t *) ss->firstChunk) [i] ) ;
} ss->chunksFree = ss->chunks;
PRINTD(SS_DEBUG_INIT | SS_DEBUG_CHUNK,
("%lu free\n", (unsigned long) ss->chunksFree) ) ; }
static void * ss_alloc_chunk (ss_desc_t *ss)
{
DINIT ( " ss_alloc_chunk" ) ; void *chunk; if (EMPTYQ(&ss->chunkFreeQ) ) { chunk = NULL; } else {
REMQH(&ss->chunkFreeQ, chunk); ss->chunksFree-- ; }
PRINTD (SS_DEBUG_CHUNK, ("alloced 0x%lx\n", (unsigned long) chunk)); return chunk; } static int ss_valid_offset (ss_desc_t *ss, ss_i_t offset)
{ if (offset < ss->chunks * ss->chunkSize) { return 1;
} return 0; } static int ss_valid_chunk (ss_desc_t *ss, ss_i_t chunkOffset)
{ if (!ss_valid_offset(ss, chunkOffset)
|| (chunkOffset & (ss->chunkSize - 1))) { return 0 ; } return 1; } static int ss_valid_free_chunk (ss_desc_t *ss, ss_i_t chunkOffset)
{ dq_t *chunk; ptrdiff_t offset; chunk = (dq_t *) (ss->firstChunk + chunkOffset); offset = ((ptrdiff_t) chunk->n) - ss->firstChunk; if (!ss_valid_chunk(ss, offset)
&& ( (dq_t * ) chunk->n) ! = &ss->chunkFreeQ) { return 0; } offset = ( (ptrdif f_t) chunk->p) - ss->f irstChunk;
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 24
if ( !ss_valid_chunk(ss, offset)
&& ((dq_t *) chunk->p) != &ss->chunkFreeQ) { return 0; } return 1; }
{
DINIT("ss_get_chunk"); dq_t *chunk; if (!ss_valid_chunk(ss, chunkOffset)) {
PRINTD (SS_DEBUG_ERR | SS_DEBUG_CHUNK,
("bogus chunk 0x%lx\n", (unsigned long) chunkOffset)); chunk = NULL; } else if (!ss_valid_free_chunk(ss, chunkOffset)) {
PRINTD (SS_DEBUG_ERR | SS_DEBUG_CHUNK,
("not free 0x%lx\n" , (unsigned long) chunkOffset)); chunk = NULL; } else { chunk = (dq_t *) (ss->firstChunk + chunkOffset);
REMQ(chunk) ; ss->chunksFree-- ; }
PRINTD (SS_DEBUG_CHUNK, ("got 0x%lx\n" , (unsigned long) chunk) ) ,- return chunk; } static void ss_free_chunk (ss_desc_t *ss, void *chunk)
{
DINIT("ss_free_chunk");
INSQH(&ss->chunkFreeQ, chunk); ss->chunksFree++;
PRINTD(SS_DEBUG_CHUNK, ( "freed 0x%lx\n" , (unsigned long) chunk)); }
/************************************************************************* ************************************************************************* *** ***
*** Chunked String Utilities ***
*** ***
************************************************************************* *************************************************************************/ static ss_i_t ss_len_to_chunks (ss_i_t len) { ss_i_t recs = 1; if (len > SS_CSTR_DATA_SIZE) { len -= SS_CSTR_DATA_SIZE - sizeof (ss_rec_t *); recs += (len + (SS_CSTR_CONT_DATA_SIZE - sizeof (ss_cstr_t *) ) )
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 25
/ (SS_CSTR_CONT_DATA_SIZE - sizeof (ss_cstr_t *)); if (len % (SS_CSTR_CONT_DATA_SIZE - sizeof (ss_cstr_t *))
<= sizeof (ss_cstr_t *)) { recs--; } } return recs; } static void ss_cstr_map (ss_cstr_t *cs, ss_cstr_map_f_t f, void *arg, int postDeref)
{ ss_i_t left; ss_i_t be; ss_i_t chunkBc; ss_i_t offset; uint8_t *p; ss_cstr_t *cs0; left = cs->len; offset = 0; chunkBc = SS_CSTR_DATA_SIZE; p = cs->data; cs0 = NULL; while (left) { if (left > chunkBc) { be = chunkBc - sizeof(cs->cont); } else { be = left; } if (!postDeref) { cs0 = cs->cont; } if ((*f)(cs, p, offset, be, arg)) { break; } if (postDeref) { cs0 = cs->cont; } offset += be; left -= be; cs = cs0; chunkBc = SS_CSTR_CONT_DATA_SIZE; p = SS_CSTR_CONT_DATA(cs); } } static ss_cstr_t * ss_new_cstr (ss_desc_t *ss, ndtp_str_t *s, ss_i_t hash)
{ ss_cstr_t *cs ; ss_new_cstr_helper_t nh; cs = ( ss_cstr_t * ) ss_alloc_chunk ( ss ) ; if (cs ) {
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 26
nh . s = s; nh.left = s->len; nh. status = 0; nh.ss = ss; cs->ref = 0; cs->rec = NULL; cs->cont = NULL; cs->len = s->len; ss_cstr_map(cs, ss_new_cstr_helper, &nh, 1) ; if (nh. status) {
/* Free the chunks that were allocated on failure */ cs->len = nh.lastOffset - 1; ss_cstr_map(cs, ss_free_cstr_chunk, ss, 0) ; cs = NULL; } else {
INSQH(&(ss->hash[hash]), cs); } } return cs; } static int ss_new_cstr_helper (ss_cstr_t *cs, uint8_t *p, ss_i_t offset, ss_i_t be, void *arg) { ss_new_cstr_helper_t *nh = (ss_new_cstr_helper_t *) arg; bcopy(&(nh->s->data[offset]), p, be); nh->left -= be; if (nh->left) { if (!(cs->cont = (ss_cstr_t *) ss_alloc_chunk(nh->ss))) { nh->status = ENOMEM; nh->lastOffset = offset; return 1; } } return 0;
} static void ss_inc_cstr_ref (ss_cstr_t *cs)
{ cs->ref++; } static void ss_dec_cstr_ref (ss_desc_t *ss, ss_cstr_t *cs)
{ if (!(--cs->ref) && !cs->rec) { ss_free_cstr(ss, cs); } }
static void
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 27
ss_free_cstr (ss_desc_t *ss, ss_cstr_t *cs) {
REMQ(cs); ss_rec_map(cs, ss_free_rec, ss); ss_cstr_map(cs, ss_free_cstr_chunk, ss, 0); } static int ss_free_cstr_chunk (ss_cstr_t *cs, uint8_t *p, ss_i_t offset, ss_i_t be, void *arg) { ss_desc_t *ss = (ss_desc_t *) arg; ss_free_chunk(ss, cs); return 0; } static int ss_free_rec (ss_rec_t *rec, ss_i_t recNum, ss_cstr_t *cs, ss_i_t fieldNum, void *arg) { ss_desc_t *ss = (ss_desc_t *) arg; if (!fieldNum) { ss_free_chunk(ss, rec); } return 0; } static int ss_cstr_eq (ndtp_str_t *s, ss_cstr_t *cs)
{ eq_helper_t eh; eh. s = s ; eh.eq = 1; ss_cstr_map(cs, ss_cstr_eq_helper, &eh, 0) ; return eh.eq; } static int ss_cstr_eq_helper (ss_cstr_t *cs, uint8_t *p, ss_i_t offset, ss_i_t be, void *arg) { eq_helper_t *eh = (eq_helper_t *) arg; if (memcmp (&(eh->s->data [offset] ) , p, be)) { eh->eq = 0; return 1 ; } return 0 ; }
/* Client must ensure that s can hold all the data in cs */ static void
Copyright 2000 OverX, Inc., All Rights Reserved. string_store.c Page 28
ss_cstr_to_str (ss_cstr_t *cs, ndtp_str_t *s) { s->len = cs->len; ss_cstr_map(cs, ss_cstr_to_str_helper, s, 0); } static int ss_cstr_to_str_helper (ss_cstr_t *cs, uint8_t *p, ss_i_t offset, ss_i_t be, void *arg) { ndtp_str_t *s = (ndtp_str_t *) arg; bcopy(p, &s->data[offset], be); return 0; }
/************************************************************************* ************************************************************************* *** * * *
*** Record Utilities ***
** * ** *
************************************************************************* ************************************************************************* static void ss_rec_map (ss_cstr_t *cs, ss_rec_map_f_t f, void *arg)
{
    ss_rec_t *rec;
    ss_rec_t *rec0;
    ss_i_t i;
    ss_i_t recNum;

    recNum = 0;
    for (rec = cs->rec; rec; rec = rec0) {
        rec0 = rec->next;
        for (i = 0; i < SS_REC_FIELDS; i++) {
            if ((*f)(rec, recNum, rec->fields[i], i, arg)) {
                return;
            }
            recNum++;
        }
    }
}
/************************************************************************* ************************************************************************* *** ***
*** Lookup ***
*** ***
************************************************************************* *************************************************************************/ int ss_lookup (ss_desc_t *ss, ndtp_str_t *key, ndtp_str_t *data, ss_i_t *totalLen)
{
DINIT ( " Ξ s_lookup " ) ; ss_cstr_t *keyE;
ss_i_t i; ss_copy_mapping_helper_t ch; ch.data = data; ch.left = data->len; ch. totalLen = 0; ch . index = 0 ; keyE = ss_find(ss, key, &i) ; if (keyE) { ss_rec_map(keyE, ss_copy_mapping, &ch) ,- ) else {
PRINTSD(SS_DEBUG_LOOKUP, key);
PRINT0D(SS_DEBUG_LOOKUP, ("not found\n" ) ) ; }
    *totalLen = ch.totalLen;
    return ch.index;
}

static int
ss_copy_mapping (ss_rec_t *rec, ss_i_t recNum, ss_cstr_t *cs, ss_i_t fieldNum,
                 void *arg)
{
    ss_copy_mapping_helper_t *ch = (ss_copy_mapping_helper_t *) arg;
    size_t s;

    if (cs) {
        s = NDTP_STR_SIZE(cs->len);
        ch->totalLen += s;
        if (s > ch->left) {
/* mapping doesn't fit in data; ensure that no more get copied */ ch->left = 0; } else {
(void) ss_cstr_to_str (cs, ch->data) ; ch->data = (ndtp_str_t *) ( ( (uint8_t *) ch->data) + s) ; ch->index++; ch->left -= s; } } return 0; } static ss_i_t ss_hash (uint8_t *s, ss_i_t len, ss_i_t h)
{
    ss_i_t g;
    ss_i_t i;
    uint8_t c;

    for (i = 0; i < len; i++) {
        c = s[i];
        h = (h << 4) + c;
        g = (h & SS_HASH_MASK);
        if (g) {
            h ^= g >> SS_HASH_SHIFT;
            h ^= g;
        }
    }
    return h;
}

static ss_cstr_t *
ss_find (ss_desc_t *ss, ndtp_str_t *s, ss_i_t *hp)
{
DINI ( " ss_find" ) ; ss_i_t h; dq_t *cs ; h = ss_hash(s->data, s->len, 0) % ss->hashsize; *hp = h;
PRINTSD (SS_DEBUG_LOOKUP, s) ; PRINTOD (SS_DEBUG_LOOKUP,
(" (h: Ox%lx, len: Ox%lx) \n" , (unsigned long) h, (unsigned long) s->len) ) ; for (cs = ss->hash[h] .n; cs != &ss->hash[h] ; cs = cs->n) {
PRINTD (SS_DEBUG_LOOKUP, ("checking 0x%lx\n" , (unsigned long) cs) ) if (((ss_cstr_t *) cs)->len == s->len && ss_cstr_eq(s, (ss_cstr_t *) cs) ) { PRINTD (SS_DEBUG_LOOKUP, ("found\n") ) ; return (ss_cstr_t *) cs; } }
PRINTD (SS_DEBUG_LOOKUP, ( "not found\n" ) ) ; return NULL;
************************************************************************* *************************************************************************
*** ***
*** Add Record To Store ***
* ** *** ************************************************************************* *************************************************************************/ int ss_add (ss_desc_t *ss, ndtp_str_t *key, ndtp_str_t *data)
{
DINIT("ss_add") ; ss_i_t keyHash; ss_cstr_t *keyE; εs_i_t dataHash; ss_cstr_t *dataE; ss_find_assoc_helper_t fh; int newKey = 0; int newData = 0; ss_rec t *rec = NULL;
    keyE = ss_find(ss, key, &keyHash);
    dataE = ss_find(ss, data, &dataHash);
PRINTSD(SS_DEBUG_ADD, key) ;
PRINTOD(SS_DEBUG_ADD, (" key (Ox%lx, h: Ox%lx, len: Ox%lx) \n" ,
(unsigned long) keyE, (unsigned long) keyHash,
(unsigned long) key->len) ) ; PRINTSD(SS_DEBUG_ADD, data) ; PRINTOD(SS_DEBUG_ADD, (" data (0x%lx, h: 0x%lx, len: 0x%lx)\n",
(unsigned long) dataE, (unsigned long) dataHash,
                (unsigned long) data->len));
    if (!keyE) {
        if (!(keyE = ss_new_cstr (ss, key, keyHash))) {
            PRINTD (SS_DEBUG_ERR | SS_DEBUG_ADD, ("key add_string failed\n"));
            assert(0);    /* MEMERR */
        }
        PRINTD (SS_DEBUG_ADD,
                ("key add_string %lx\n", (unsigned long) keyE));
        newKey = 1;
        if (!dataE && keyHash == dataHash && key->len == data->len
            && !memcmp(key->data, data->data, key->len)) {
            PRINTD (SS_DEBUG_ADD, ("data and key equivalents\n"));
            dataE = keyE;
        }
    }
    if (!dataE) {
        if (!(dataE = ss_new_cstr (ss, data, dataHash))) {
            PRINTD (SS_DEBUG_ERR | SS_DEBUG_ADD, ("data add_string failed\n"));
            assert(0);    /* MEMERR */
        }
        newData = 1;
        PRINTD (SS_DEBUG_ADD,
                ("data add_string %lx\n", (unsigned long) dataE));
    }
fh. found = NULL; fh . free = NULL; fh . anchor = & (keyE->rec) ; fh. data = dataE; ss_rec_map (keyE, ss_f ind_assoc , &fh) ; if ( fh . found) {
PRINTD ( SS_DEBUG_ADD , ( " already associated\n " ) ) ; return 0 ; } if ( fh . free) { fh. free->f ields [ fh . freeNum] = dataE; } else { if ( ! (rec = ( ss_rec_t * ) ss_alloc_chunk ( ss) ) ) {
PRINTD (SS_DEBUG_ERR | SS_DEBUG_ADD, ("record chunk alloc failed\n")); assert(O); /* MEMERR */ }
PRINTD (SS_DEBUG_ADD, ("new record chunk: 0x%lx\n" , (unsigned long) rec) ) ; rec->fields [0] = dataE;
        bzero ((void *) &rec->fields[1],
               (SS_REC_FIELDS - 1) * sizeof (rec->fields[0]));
        rec->next = NULL;
        *fh.anchor = rec;
    }
    if (newKey) {
        ss_log_cstr (ss, keyE);
    }
    if (newData) {
        ss_log_cstr (ss, dataE);
    }
    if (rec) {
        ss_log_new_map(ss, fh.anchor, rec, dataE);
    } else {
        ss_log_map(ss, fh.free, dataE);
    }
    ss_inc_cstr_ref (dataE);
    ss_sched_completion(ss);
    PRINTD (SS_DEBUG_ADD, ("done\n"));
    return 1;
}
static int
ss_find_assoc (ss_rec_t *rec, ss_i_t recNum, ss_cstr_t *cs, ss_i_t fieldNum,
               void *arg)
{
    ss_find_assoc_helper_t *fh = (ss_find_assoc_helper_t *) arg;

    if (cs) {
        if (cs == fh->data) {
            fh->found = rec;
            fh->foundNum = fieldNum;
            return 1;
        }
    } else if (!fh->free) {
        fh->free = rec;
        fh->freeNum = fieldNum;
    }
    if (fieldNum == SS_REC_FIELDS - 1) {
        fh->anchor = &(rec->next);
    }
    return 0;
}
************************************************************************* ************************************************************************* * ** ***
*** Delete Association From Store ***
 ***                                                                   ***
 *************************************************************************
 *************************************************************************/

int
ss_delete (ss_desc_t *ss, ndtp_str_t *key, ndtp_str_t *data)
{
DINIT ( "ss_delete" ) ; ss_cstr_t *keyE; ss_cstr_t *dataE; ss_i_t i; int notDone = 0; if ( ! (keyE = ss_find(ss, key, &i) ) ) { PRINTSD (SS_DEBUG_DELETE, key) ;
PRINTOD(SS_DEBUG_DELETE, (" (key) not found\n" ) ) ; } else if (! (dataE = ss_find(ss, data, &i) ) ) { PRINTSD (SS_DEBUG_DELETE, data) ,-
PRINTOD(SS_DEBUG_DELETE, (" (data) not found\n")); } else {
PRINTD (SS_DEBUG_DELETE,
("key 0x%lx, data 0x%lx\n" , (unsigned long) keyE, (unsigned long) dataE) ) ,- if (ss_do_delete(ss, keyE, dataE)) { ΞΞ_log_delete(ss, keyE, dataE); ss_sched_completion(ss) ,- notDone = 1; } } return notDone; }
/* Return 1 if an actual association was deleted */
static int
ss_do_delete (ss_desc_t *ss, ss_cstr_t *key, ss_cstr_t *data)
{
DINIT ( " ss_do_delete" ) ,- ss_rec_t *rec; ss_find_assoc_helper_t fh; ss_i_t i; fh. found = NULL; fh.free = NULL; fh. anchor = &(key->rec) ; fh.data = data; ss_rec_map(key, ss_find_assoc, &fh) ; if ( ! (rec = fh. found) ) {
PRINTD (SS_DEBUG_DELETE, ( "no mapping\n" ) ) ,- return 0 ; } rec->fields [ fh . foundNum] = 0 ; ss_dec_cstr_ref ( ss , data) ; for (i = 0 ; i < SS_REC_FIELDS ; i++ ) { if (rec->fields [i ] ) (
/ * Don ' t need to reclaim any more storage * / return 1 ; } )
    *(fh.anchor) = rec->next;
    ss_free_chunk(ss, rec);
    if (!key->rec && !key->ref) {
        ss_free_cstr (ss, key);
    }
    return 1;
}
/************************************************************************* *************************************************************************
* * * ***
*** Delete All Associations From Store ***
** * * * *
************************************************************************* *************************************************************************
/*
* This function's loop walks the hash table and deletes every mapping that it
* finds, one at a time. It keeps returning to the beginning of the hash chain
* after a mapping is deleted, rather than following a next pointer, because
* when the last mapping for a key is deleted, its next pointer will be junk,
* and furthermore, its successor may have been freed as well (since it might
* be the value that was just deleted) . We could do something cute like
* find the next key in the chain before we perform a delete, but our
* assumption is that the hash table is relatively unoccupied (which may or may
* not be the case, really, a hash occupancy of 5 on a database of a gazillion
* records is likely to be just fine, but it may make this algorithm run
* integer factors slower) */ void ss_empty (ss_desc_t *ss)
{ ss_cstr_t *keyE; ss_cstr_t *dataE; dq_t *head; int deleted,- int i ; for (i = 0; i < ss->hashSize; i++) { head = &(ss->hash[i] ) ; keyE = (ss_cstr_t *) head->n; while ((dq_t *) keyE != head) { if (keyE->rec) { dataE = find_one_value(keyE) ; assert (dataE) ; deleted = ss_do_delete(ss, keyE, dataE); assert (deleted) ; ss_log_delete(ss, keyE, dataE); keyE = (ss_cstr_t *) head->n; } else { keyE = (ss_cstr_t *) keyE->dq.n; } }
    }
}

ss_cstr_t *
find_one_value (ss_cstr_t *key)
{
    ss_cstr_t *data;
    ss_rec_t *rec;
    int i;

    for (rec = key->rec; rec; rec = rec->next) {
        for (i = 0; i < SS_REC_FIELDS; i++) {
            data = rec->fields[i];
            if (data) {
                return data;
            }
        }
    }
    return NULL;
}
/************************************************************************* *************************************************************************
* * * * * *
*** Utilities ***
 ***                                                                   ***
 *************************************************************************
 *************************************************************************/

static void
dumpBuf (uint8_t *buf, int bufSize)
{
    int i;

#define DUMP_WIDTH 16
    for (i = 0; i < bufSize; i++) {
        if (!(i % DUMP_WIDTH)) {
            printf ("\t");
        }
        printf ("%02x ", buf[i]);
        if ((i % DUMP_WIDTH) == DUMP_WIDTH - 1) {
            printf ("\n");
        }
    }
    if (i % DUMP_WIDTH) {
        printf ("\n");
    }
}
Copyπght 2000 OverX, Inc., All Rights Reserved. string_store.h Page 1
/ *
* Data types for OverX String Store
* Copyright 2000 OverX, Inc.
* This program source is the property of OverX, Inc. and contains
* information which is confidential and proprietary to OverX, Inc.. No
* part of this source may be copied, reproduced, disclosed to third parties,
* or transmitted in any form or by any means, electronic or mechanical
* for any purpose without the express written consent of OverX, Inc.. *
 * $Id: string_store.h,v 1.12 2000/02/09 18:01:59 steph Exp $
 */

#ifndef _STRING_STORE_H_
#define _STRING_STORE_H_

#include "ox_mach.h"    /* includes inttypes */
#include "dq.h"

#ifdef SS_RTAIO
#include <aio.h>
#endif

#include <stddef.h>
#include <sys/types.h>

#ifdef SS_THREAD_IO
#include "ox_thread.h"
#endif

#include "ndtp.h"       /* for ndtp_str_t */
/*
 * Memory manager chunk size (should be a power of 2)
 */
#define SS_CHUNK_SIZE 128
/*
* This is the default ratio of memory manager chunks per hash table entry.
* Currently, once a string store is initialized this number can't be changed.
 * Also, the hash table size ends up being pushed to the next prime.
 */
#define SS_CHUNKS_PER_HASH 2.0
/*
* Define this if the string store storage arena is to use 32 bit pointers */
#if OX_POINTER_SIZE == 32
#define SS_32BIT
#endif

#ifdef SS_32BIT
typedef uint32_t ss_i_t;
#define SS_HASH_MASK  0xf0000000
#define SS_HASH_SHIFT 24
#else
typedef uint64_t ss_i_t;
#define SS_HASH_MASK  0xf000000000000000
#define SS_HASH_SHIFT 56
#endif
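/*
 * Illustrative sketch only (an assumption, not part of the original header):
 * one plausible way an initializer could turn the parameters above into a
 * hash table size, i.e. one bucket per SS_CHUNKS_PER_HASH chunks, pushed up
 * to the next prime.  The next_prime() helper is hypothetical and not part
 * of this listing.
 */
#ifdef SS_SIZING_SKETCH
extern ss_i_t next_prime (ss_i_t n);    /* hypothetical helper */

static ss_i_t
ss_sketch_hash_size (ss_i_t chunks)
{
    ss_i_t buckets = (ss_i_t) (chunks / SS_CHUNKS_PER_HASH);

    return next_prime (buckets);
}
#endif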
/ *
* A string element in the string store. *
* If the actual string fits in SS_CSTR_DATA_SIZE bytes, the
* nextElt member may be used as string data storage. Otherwise,
* SS_CSTR_DATA_SIZE - sizeof (ss_hdr_t *) bytes are stored in the
* ss_cstr, and nextElt points to an ss_cstr_cont, with more data *
* There is no explicit ss_cstr_cont defined. It is just up to
* SS_CHUNK_SIZE starting at the base address, or
* or SS_CHUNK_SIZE-sizeof (e->nextElt) bytes and a nextElt pointer
* if the ss_cstr_cont is not the last one for a string *
* Reference counting is only performed for data strings. That is
* strings which are pointed by one of the ss_recs of a key string.
* Note that a string may be both a key and data, in which case
* ref != 0 && rec != NULL
 * A string may be reclaimed when ref == 0 && rec == NULL.
 */
#define SS_CSTR_DATA_SIZE (SS_CHUNK_SIZE \
    - (sizeof (dq_t) + sizeof (struct ss_rec *) + 2 * sizeof (ss_i_t)))

typedef struct ss_cstr {
    dq_t dq;                    /* free list & hash links */
    ss_i_t ref;                 /* reference count */
    struct ss_rec *rec;         /* record pointer (if elt is a key) */
    ss_i_t len;                 /* string length */
    uint8_t data[SS_CSTR_DATA_SIZE - sizeof (struct ss_cstr *)];
    struct ss_cstr *cont;
} ss_cstr_t;

#define SS_CSTR_CONT_DATA_SIZE (SS_CHUNK_SIZE)
#define SS_CSTR_CONT_DATA(e) ((uint8_t *) e)
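/*
 * Sketch only (assumptions, not part of the original header): the byte
 * capacities implied by the layout above.  A string that fits in
 * SS_CSTR_DATA_SIZE bytes lives entirely in its first chunk (the cont
 * pointer doubling as data), while longer strings keep
 * SS_CSTR_DATA_SIZE - sizeof (ss_cstr_t *) bytes in the first chunk and
 * chain continuation chunks holding
 * SS_CSTR_CONT_DATA_SIZE - sizeof (ss_cstr_t *) bytes each (the last
 * continuation may use its pointer slot for data as well).
 */
#ifdef SS_LAYOUT_SKETCH
static const size_t ss_sketch_first_chunk_bytes = SS_CSTR_DATA_SIZE;
static const size_t ss_sketch_cont_chunk_bytes =
    SS_CSTR_CONT_DATA_SIZE - sizeof (ss_cstr_t *);
#endif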
/*
* This is a piece of a record in the key->record mapping.
* Each field in the record is a pointer to a string store element
* The ~~p' ' field of hdr is used to point to a continuation ss_rec
* with additional associations.
 * record piece.
 */
#define SS_REC_FIELDS \
    ((SS_CHUNK_SIZE - sizeof (struct ss_rec *)) / sizeof (ss_cstr_t *))

typedef struct ss_rec {
    struct ss_rec *next;
    ss_cstr_t *fields[SS_REC_FIELDS];
} ss_rec_t;
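/*
 * Sketch only (an assumption, not part of the original header): counting the
 * data strings associated with a key by walking its ss_rec chain, the same
 * traversal that ss_rec_map() performs in string_store.c.
 */
#ifdef SS_REC_SKETCH
static ss_i_t
ss_sketch_count_assocs (const ss_cstr_t *key)
{
    const ss_rec_t *rec;
    ss_i_t i;
    ss_i_t n = 0;

    for (rec = key->rec; rec; rec = rec->next) {
        for (i = 0; i < SS_REC_FIELDS; i++) {
            if (rec->fields[i]) {
                n++;    /* occupied field == one key/data association */
            }
        }
    }
    return n;
}
#endif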
/*
* Log file buffer state
* Note that 0 is a reserved value of the transaction ID (tid), which
* means that no transaction is completed by writing out the
* current log buffer. This can happen if the log buffer only holds
 * part of a (very long) string.
 */
typedef struct ss_log_buf {
    ssize_t size;
    uint8_t *buf;
    size_t used;
    ss_i_t completions;
    int full;
    struct ss_desc *ss;
#ifdef SS_RTAIO
    aiocb_t aiocb;
#endif
#ifdef SS_THREAD_IO
    THR_T thread;
    MUT_T lock;
    COND_T writeReq;
    COND_T writeDone;
    volatile off_t filePos;
    volatile int status;
    volatile int flags;
#define LOG_BUF_FLAG_WRITING    0x1
#define LOG_BUF_FLAG_WAIT_WRITE 0x2
#endif
} ss_log_buf_t;
/*
 * Descriptor of a string store
 */
typedef void (*ss_callback_t) (void *arg, ss_i_t tid);

typedef struct ss_desc {
    ss_i_t arenasize;             /* input parameter to ss_init */
    double chunksPerHash;         /* input parameter to ss_init */
    ss_i_t chunkSize;             /* input parameter to ss_init */
    ss_callback_t callbackFunc;   /* input parameter to ss_init */
    void *callbackArg;            /* input parameter to ss_init */
    ssize_t logBufSize;           /* input parameter to ss_init */
    int logBufs;                  /* input parameter to ss_init */
    uint8_t *arena;
    ss_i_t chunks;
    dq_t *hash;
    ss_i_t hashSize;
    ptrdiff_t firstChunk;
    dq_t chunkFreeQ;
    ss_i_t chunksFree;
    int logFile;
    off_t filePos;
    int nextLogBuf;
    ss_log_buf_t *logBufDesc;

    /* Log file statistics */
    ss_i_t logRecsRead;
    ss_i_t logRecsWritten;
} ss_desc_t;
/*
* String store log file descriptor
 */
typedef struct ss_log_desc {
    /*
     * The first two fields are bytes so they can be read/written in
     * an endian independent way
     */
    uint8_t endian;
    uint8_t cellSize;
    uint8_t pad[6];
    ss_i_t chunkSize;    /* chunk size */
    ss_i_t chunks;       /* max chunk */
} ss_log_desc_t;
/*
 * Values for ss_log_desc.endian
 */
enum {
SS_ENDIAN_LITTLE,
SS_ENDIAN_BIG };
/*
* String store log file records */
/*
* Log file record opcodes */ enum {
SS_LOG_END, /* no rec, just the opcode, must be 0 */
SS_LOG_STRING,
SS_LOG_NEW_MAP, /* map using new rec */
SS_LOG_MAP,
SS_LOG_UNMAP
};

typedef struct ss_log_end {
    ss_i_t op;    /* opcode, LOG_END */
} ss_log_end_t;
/*
* This record is variable length, in two (related) pieces:
* 1) the size of chunks is such that it accomodates the whole
* string, whose length is len
* 2) the string data follows the chunks and is of length len
* Finally the record is always padded to a sizeof (ss_i_t) boundary *
* The log file reader uses intimate knowledge of the details of the
* layout of these records, so do not modify or rearrange them
* without also modifying ss_read_log. */ typedef struct ss_log_string { ss_i_t op; /* opcode, LOG_STRING */ ss_i_t len; ss_i_t chunks [1] ; } ss_log_string_t; typedef struct ss_log_new_map { ss_i_t op; /* opcode LOG_NEW_MAP */ ss_i_t from; / * from pointer ( in string or rec) * /
ss_i_t rec; /* rec pointer (new rec) */ ss_i_t to; /* to pointer (string) */
} ss_log_new_map_t; typedef struct ss_log_map { ss_i_t op; /* opcode, LOG_MAP */ ss_i_t rec; /* rec pointer (previously allocated rec) */ ss_i_t to; /* to pointer (string) */
} ss_log_map_t; typedef struct ss_log_unmap { ss_i_t op; ss_i_t from; /* from pointer (key cstr) */ ss_i_t to; /* to pointer (data cstr) */
} ss_log_unmap_t; typedef union ss_log_rec { ss_i_t op; ss_log_end_t end; ss_log_string_t string; ss_log_new_map_t new_map; ss_log_map_t map; ss_log_unmap_t unmap; } ss_log_rec_t;
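/*
 * Sketch only (assumptions, not part of the original header): the on-disk
 * footprint of a variable length SS_LOG_STRING record as described above,
 * i.e. the fixed fields, one ss_i_t per chunk backing the string (the chunk
 * count would come from something like ss_len_to_chunks() in string_store.c),
 * the raw string bytes, and padding out to an ss_i_t boundary.
 */
#ifdef SS_LOG_SKETCH
static size_t
ss_sketch_log_string_size (ss_i_t nchunks, ss_i_t len)
{
    size_t sz = offsetof (ss_log_string_t, chunks)
        + (size_t) nchunks * sizeof (ss_i_t)
        + (size_t) len;

    return (sz + sizeof (ss_i_t) - 1) & ~(sizeof (ss_i_t) - 1);
}
#endif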
/*
 * Call this function to initialize a string store
 */
int ss_init (ss_desc_t *ss);
/*
* Call this function to open a new log file for a string store
* Set ^dump' ' to dump the current state of the string store
* to the newly opened log file
* If everything succeeds, the current log file will be flushed and closed. */ int ss_new_log (ss_desc_t *ss, const char *logFileName, int dump);
/*
* Call this function to close and flush a log file */ int ss_close_log(ss_desc_t *ss) ;
/*
* Call this function to read a log file in to a string store */ int ss_read_log (ss_desc_t *ss, const char *logFileName) ;
/*
* Call this function to return all mappings to key *
* data points to a memory area of size dataLen into which the association
* strings should be copied
* When ss_lookup is called, the client shall set data->len to the
* size of the entire data area, including the data->len field itself
*
* It returns the size of the data that was required to hold all the
* associations in the database in * totalLen, and the number of
* association strings which were actually copied into the data area
* as the function return value. */ int ss_lookup(ss_desc_t *ss, ndtp_str_t *key, ndtp_str_t *data, ss_i_t *totalLen) ;
/*
* Call this function to add a mapping from key to data
* It returns true if the mapping was actually added (as opposed to
* it already being present) . */ int ss_add(ss_desc_t *ss, ndtp_str_t *key, ndtp_str_t *data) ;
/*
* Call this function to delete a mapping from key to data
* It returns true if a mapping was actually deleted (as opposed to not
* being present in the first place) */ int ss_delete (ss_desc_t *ss, ndtp_str_t *key, ndtp_str_t *data) ;
/*
* Call this function to delete every mapping in a store
* Currently, it does not provide any confirmation when the mapping
* deletes have been flushed to disk because we aren't associating a
* protocol message with it, so confirmation is not necessary.
 * This might be changed in the future.
 */
void ss_empty (ss_desc_t *ss);
/*
* Client calls this function when it can not do anything until
* some update transactions (add or delete) are completed (written to
* the log file) */ void ss_wait_update(ss_desc_t *ss) ;
/*
* Client calls this function to wait for all outstanding update transactions
* to be completed (written to the log file) */ void ss_wait_update_all (ss_desc_t *ss) ;
/*
 * Set this variable to enable debugging prints
 */
extern int ss_debug;

#define STRING_STORE_DEBUG
#define SS_DEBUG_ERR    0x01    /* debug error cases */
#define SS_DEBUG_INIT   0x02    /* debug initialization */
#define SS_DEBUG_TABLE  0x04    /* debug string table management */
#define SS_DEBUG_CHUNK  0x08    /* debug chunk management */
#define SS_DEBUG_ADD    0x10    /* debug adding mappings */
#define SS_DEBUG_LOOKUP 0x20    /* debug looking up mappings */
#define SS_DEBUG_DELETE 0x40    /* debug deleting mappings */
#define SS_DEBUG_LOG_WRITE 0x80 /* debug writing the log file */

#endif
Copyright 2000 OverX, Inc., All Rights Reserved. timestamp.h Page 1
/*
* OverX Timestamp (performance tuning) definitions
* Copyright 2000 OverX, Inc.
*
* This program source is the property of OverX, Inc. and contains
* information which is confidential and proprietary to OverX, Inc.. No
* part of this source may be copied, reproduced, disclosed to third parties,
* or transmitted in any form or by any means, electronic or mechanical
* for any purpose without the express written consent of OverX, Inc.. *
 * $Id$
 */

#ifndef _TIMESTAMP_H_
#define _TIMESTAMP_H_

#ifdef ndef
#include <sys/time.h>

static struct timeval ts_start;

#define TIMESTAMP_START gettimeofday(&ts_start, NULL)

#define TIMESTAMP_END(d) \
    do { \
        struct timeval ts_end; \
        gettimeofday(&ts_end, NULL); \
        (d) = (((double) (ts_end.tv_sec - ts_start.tv_sec)) \
               + (((double) (ts_end.tv_usec - ts_start.tv_usec)) / 1e6)); \
    } while (0);

#define TIMESTAMP_ENDP(S) \
    do { \
        struct timeval ts_end; \
        gettimeofday(&ts_end, NULL); \
        printf S; \
        printf (": %5.2f\n", \
                ((double) (ts_end.tv_sec - ts_start.tv_sec)) \
                + (((double) (ts_end.tv_usec - ts_start.tv_usec)) / 1e6)); \
    } while (0);
#else
#define TIMESTAMP_START
#define TIMESTAMP_END(d)
#define TIMESTAMP_ENDP(S)
#endif
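/*
 * Usage sketch only (an assumption, not part of the original header): how the
 * timing macros above are meant to bracket a region of interest.  With the
 * "#ifdef ndef" branch disabled, as in this listing, they compile to nothing.
 */
#ifdef TIMESTAMP_USAGE_SKETCH
static void
timed_region_example (void)
{
    double elapsed = 0.0;

    TIMESTAMP_START;
    /* ... work being measured ... */
    TIMESTAMP_END (elapsed);
    TIMESTAMP_ENDP (("measured region"));
    (void) elapsed;
}
#endif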
#endif

Claims

WE CLAIM:
1. A network distributed tracking wire transfer protocol comprising: a variable length identification string, the identification string for specifying the identity of an entity in a distributed data collection; and a variable length location string, the location string for specifying the network location of data associated with an entity in a distributed data collection; wherein a relationship between the identification string and the location string can be spontaneously and dynamically created and modified.
2. The network distributed tracking wire transfer protocol defined in claim 1 , wherein the protocol is application independent.
3. The network distributed tracking wire transfer protocol defined in claim 1 , wherein the protocol is organizationally independent.
4. The network distributed tracking wire transfer protocol defined in claim 1, wherein the protocol is geographically independent.
5. A system having a network distributed tracking wire transfer protocol for storing and identifying data with a distributed data collection, comprising: a data repository, the data repository for storing data in a distributed data collection; a client entity, the client entity for manipulating data in the distributed data collection; and a first server entity, the first server entity operative to locate data in the distributed data collection; wherein the client entity transmits an identifier string to the first server entity along with a client request and the first server entity provides at least one location string to the client entity in response thereto.
6. The system defined in claim 5, further comprising a second server entity coupled to the first server entity.
7. The system defined in claim 5, wherein the first server entity maps the identifier string received from the client entity to the at least one location string.
8. The system defined in claim 7, wherein the mapping is performed using a hash operation.
9. The system defined in claim 6, wherein the first server entity transmits the client request to the second server entity if the first server entity cannot provide the at least one location string to the client entity.
10. The system defined in claim 9, wherein the second server entity maps the identifier string received from the first server entity to the at least one location string.
11. The system defined in claim 10, wherein the second server entity transmits the at least one location string to the first server entity for transmission to the client entity.
12. A method for storing and retrieving tracking information over a network using a wire transfer protocol, comprising the steps of: providing a location string and an identification string, the location string for specifying the location of data associated with an entity in a distributed data collection and the identification string for specifying the identification of an entity in the distributed data collection; storing information at a data repository entity by associating an identification string with each particular stored unit of information and by mapping the identification string to at least one location string associated with the data repository entity, the identification string and the at least one location string for a particular unit of information being stored at a first server entity coupled to the data repository entity; transmitting a request from a client entity to the first server entity to retrieve at least one location string associated with a particular stored unit of information, the request including the identification string associated with the particular stored unit of information; and receiving the request at the first server entity and responding to the client entity by providing at least one location string associated with the particular stored unit of information to the client entity.
13. The method for storing and retrieving tracking information defined in claim 12, further comprising the step of transmitting the request to a second server entity prior to responding to the client entity, the second server entity coupled to the first server entity and having stored therewith the mapping of the identification string and the at least one location string for the particular unit of information.
14. The method for storing and retrieving tracking information defined in claim 13, wherein the second server entity responds to the client entity by providing the location string associated with the particular stored unit of information to the second client entity.
15. The method for storing and retrieving tracking information defined in claim 12, wherein the lengths of the identification string and the at least one location string are variable.
16. The method for storing and retrieving tracking information defined in claim 12, further comprising the step of spontaneously and dynamically manipulating the mapping of an identification string to a location string.
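As an illustrative sketch only, with assumed names and types rather than the NDTP wire format, the relationship recited in claims 1 and 12, a variable length identification string mapped to one or more variable length location strings that can be created and modified dynamically, might be pictured as follows.

#include <string.h>

/* Sketch: one identifier associated with a changeable set of location strings. */
typedef struct id_map_entry {
    const char *identifier;       /* variable length identification string */
    const char *locations[4];     /* variable length location strings */
    int count;
} id_map_entry;

/* Return the entry for an identifier, or NULL if no association exists. */
static const id_map_entry *
find_entry (const id_map_entry *table, int entries, const char *identifier)
{
    int i;

    for (i = 0; i < entries; i++) {
        if (strcmp (table[i].identifier, identifier) == 0) {
            return &table[i];
        }
    }
    return NULL;
}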
PCT/US2001/027383 2000-09-13 2001-09-04 Network distributed tracking wire transfer protocol WO2002023400A2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP01966551A EP1358576A2 (en) 2000-09-13 2001-09-04 Network distributed tracking wire transfer protocol
AU2001287055A AU2001287055A1 (en) 2000-09-13 2001-09-04 Network distributed tracking wire transfer protocol

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US09/661,222 US7103640B1 (en) 1999-09-14 2000-09-13 Network distributed tracking wire transfer protocol
US09/661,222 2000-09-13

Publications (2)

Publication Number Publication Date
WO2002023400A2 true WO2002023400A2 (en) 2002-03-21
WO2002023400A3 WO2002023400A3 (en) 2003-08-07

Family

ID=24652674

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/027383 WO2002023400A2 (en) 2000-09-13 2001-09-04 Network distributed tracking wire transfer protocol

Country Status (3)

Country Link
EP (1) EP1358576A2 (en)
AU (1) AU2001287055A1 (en)
WO (1) WO2002023400A2 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5913210A (en) * 1998-03-27 1999-06-15 Call; Charles G. Methods and apparatus for disseminating product information via the internet
US5950173A (en) * 1996-10-25 1999-09-07 Ipf, Inc. System and method for delivering consumer product related information to consumers within retail environments using internet-based information servers and sales agents
US5978773A (en) * 1995-06-20 1999-11-02 Neomedia Technologies, Inc. System and method for using an ordinary article of commerce to access a remote computer

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5978773A (en) * 1995-06-20 1999-11-02 Neomedia Technologies, Inc. System and method for using an ordinary article of commerce to access a remote computer
US5950173A (en) * 1996-10-25 1999-09-07 Ipf, Inc. System and method for delivering consumer product related information to consumers within retail environments using internet-based information servers and sales agents
US5913210A (en) * 1998-03-27 1999-06-15 Call; Charles G. Methods and apparatus for disseminating product information via the internet

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BEITZ A ET AL: "Service location in an open distributed environment" SECOND INTERNATIONAL WORKSHOP ON SERVICES IN DISTRIBUTED AND NETWORKED ENVIRONMENTS, WHISTLER, BC, CA, 5 - 6 June 1995, pages 28-34, XP010148068 IEEE COMPUT. SOC., LOS ALAMITOS, CA, US ISBN: 0-8186-7092-4 *

Also Published As

Publication number Publication date
AU2001287055A1 (en) 2002-03-26
EP1358576A2 (en) 2003-11-05
WO2002023400A3 (en) 2003-08-07

Similar Documents

Publication Publication Date Title
US7814170B2 (en) Network distributed tracking wire transfer protocol
US5946467A (en) Application-level, persistent packeting apparatus and method
US7672275B2 (en) Caching with selective multicasting in a publish-subscribe network
US9547726B2 (en) Virtual file-sharing network
Currid TCP offload to the rescue: Getting a toehold on TCP offload engines—and why we need them
US20050086384A1 (en) System and method for replicating, integrating and synchronizing distributed information
US20020016867A1 (en) Cluster event service method and system
Allcock et al. The globus extensible input/output system (xio): A protocol independent io system for the grid
US8352619B2 (en) Method and system for data processing
US10536560B2 (en) System and method for implementing augmented object members for remote procedure call
Thompson et al. Ndn-cnl: A hierarchical namespace api for named data networking
WO2003060712A2 (en) Method and system of accessing shared resources using configurable management information bases
GB2412771A (en) System for managing cache updates
Alwagait et al. DeW: a dependable web services framework
JP3628514B2 (en) Data transmission / reception method between computers
WO2002023400A2 (en) Network distributed tracking wire transfer protocol
JP2003345709A (en) Cache device and network system using the same
US20030149797A1 (en) Fast socket technology implementation using doors and memory maps
Seifert et al. SCI SOCKET-A fast socket implementation over SCI
Martignetti Shared memory crash cast: a low level implementation of Paxos supporting crash failures in shared memory with RDMA
Friday et al. Experiences of using generative communications to support adaptive mobile applications
Eisenhauer et al. The dataexchange library
Wang Design and Implementation of TCPHA
Balay A lightweight middleware architecture and evaluation of middleware performance
Anderson et al. The DASH network communication architecture

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PH PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2001966551

Country of ref document: EP

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWP Wipo information: published in national office

Ref document number: 2001966551

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 2001966551

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: JP