WO2000028437A9 - Directory protocol based data storage - Google Patents

Directory protocol based data storage

Info

Publication number
WO2000028437A9
Authority
WO
WIPO (PCT)
Prior art keywords
data
directory
facs
information
flow cytometry
Prior art date
Application number
PCT/US1999/025765
Other languages
French (fr)
Other versions
WO2000028437A1 (en)
Inventor
Lee Herzenberg
Wayne Moore
David Parks
Len Herzenberg
Vernon Oi
Original Assignee
Lumen
Priority date
Filing date
Publication date
Application filed by Lumen filed Critical Lumen
Priority to AU23440/00A priority Critical patent/AU2344000A/en
Publication of WO2000028437A1 publication Critical patent/WO2000028437A1/en
Publication of WO2000028437A9 publication Critical patent/WO2000028437A9/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30Data warehousing; Computing architectures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00Network arrangements, protocols or services for addressing or naming
    • H04L61/45Network directories; Name-to-address mapping
    • H04L61/4505Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4523Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using lightweight directory access protocol [LDAP]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/282Hierarchical databases, e.g. IMS, LDAP data stores or Lotus Notes
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/40ICT specially adapted for the handling or processing of patient-related medical or healthcare data for data related to laboratory analysis, e.g. patient specimen analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • This invention relates to the field of data storage and data storage file systems.
  • Databases are organized according to a data model that specifies the organizational structure of the database.
  • DDL data definition language
  • a database may contain one or more tables that are defined in a file called the data dictionary. Tables help keep the data in the database organized.
  • Figure 1 illustrates a table 100 that contains information about customers. Each table is designed to store a collection of data and comprises a number of rows 101-107. A row is separated into one or more columns 120-124, and each column has an associated name 140 and is designated to receive values. When data is placed into the table 100 it is placed in the appropriate column 120-124. For example, values 130-135 represent a series of customer identification numbers. These values are placed in column 120.
  • a record typically refers to a row that contains an item of data in one or more of the columns of the row. Each table may hold numerous records. When a row 101-107 is filled with data it typically represents a unique set of data. For example, if data is placed in columns 120-124 of row 101, that data is representative of the customer that has the customer identification number 130.
  • a disadvantage of the way database tables are organized is that their organizational schema is predetermined and fixed. As a result, current databases lack a flexible structure. For example, if a person using table 100 wanted to begin collecting other kinds of addressing information about a customer, such as the customer's work address or electronic mail address, a new column 206 to hold that information must be defined. To define a new column, a new table 200 that has an additional column 206 is created. Thus an inherent disadvantage of current database systems is that the user is locked into collecting the kind of information the table is pre-defined to hold. Table 100, for example, can only hold information pertaining to a customer's identification number, name, address, phone number, and fax number. To enter any other kind of information in table 100, a new column must be defined.
  • Every field in a table is assigned a value even if one does not exist. If data is entered into one of the columns in row 102, data must also be entered into all the remaining columns, even if only as a NULL value, zero, or some other placeholder.
  • For example, if only one column of row 102 receives data, the remaining columns in row 102 are assigned NULL values. Since values are assigned to every row in column 120, the remaining values of each row are filled with NULL values. This occurs regardless of whether additional information is actually entered into table 200. Once a row is filled with one piece of data, the remaining entries for that row are filled with some value. Placing values inside a table even when one is not supplied wastes memory and computing resources.
  • Data stored in a column (or columns) of a table can form the basis for a relationship between that table and another table in the database having a related column (or columns), as long as the other table has a related record.
  • For example, a customer table could be related to a customer orders table if the customer table contains a series of records having fields with the names "customer identification", "last name", "first name", "street address", "city", and "zip code", and the customer orders table has fields with the names "customer identification", "service provided", and "date service rendered". Since both of these tables share a field with the name "customer identification", the tables can both be related to the same customer.
  • Using a relationship between columns of two tables it is possible to join these two tables to provide a single table of information that contains instances of rows from one table combined with related rows from the other table.
  • Tables may be related via one-to-one, one-to-many, many-to-one, or many-to-many relationships.
  • In a one-to-one relationship, one row in one table is related to a single row in a second table and vice versa.
  • For example, a row in an employee table that contains information about an employee relates to a row in a salaries table that contains that employee's salary information. Since an employee typically earns only a single salary, there is a one-to-one relationship between an employee's employee table record and the employee's salary table record.
  • In a one-to-many relationship, a row in one table may be related to many rows in a second table, but each row in the second table matches only one row in the first table.
  • For example, a row in a state table that contains a state identifier and a state name can be related to multiple rows in the employee table, since a row in the employees table identifies only one state of residence.
  • A many-to-one relationship exists where many rows in one table match only one row in a second table, but each row in the second table may match many rows in the first table.
  • a primary key is a unique key within a table and uniquely identifies a row within the table.
  • a foreign key in a second table is comprised of the column(s) containing a first table's primary key information.
  • For example, an employee identifier (employeeID) can be assigned to each employee.
  • the employeeID can be used as a primary key for the employees table.
  • the employeeID can also be used as a foreign key in the salaries table.
  • the employees and salaries tables can then be joined on the employeeID columns in each table to make information from both tables available in a single record, as in the sketch below.
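  • The following is a minimal sketch of such a join, assuming a hypothetical JDBC data source and the employees/salaries schema described above; the connection URL, table names, and column names are illustrative only, not part of the invention.

```java
import java.sql.*;

// Hypothetical sketch: join the employees and salaries tables on their
// shared employeeID column so that information from both tables is
// available in a single record.
public class JoinExample {
    public static void main(String[] args) throws SQLException {
        // Placeholder JDBC URL; any SQL DBMS reachable via JDBC would do.
        try (Connection conn = DriverManager.getConnection("jdbc:example://localhost/hr");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT e.employeeID, e.name, s.salary " +
                 "FROM employees e JOIN salaries s ON e.employeeID = s.employeeID")) {
            while (rs.next()) {
                System.out.printf("%d %s %d%n",
                    rs.getInt("employeeID"), rs.getString("name"), rs.getInt("salary"));
            }
        }
    }
}
```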
  • a DBMS includes a Data Manipulation Language (DML) such as Structured Query Language (SQL).
  • DML Data Manipulation Language
  • SQL Structured Query Language
  • a DML provides set-oriented relational operations for manipulating data in the DBMS.
  • a DML requires a precise syntax that must be used to access and manipulate DBMS data.
  • a user must understand and use the DML's syntax. Instead of requiring each user that wishes to modify a DBMS' data to learn the DML's syntax, applications are written that provide an interface between the user and a DBMS' DML.
  • the present invention utilizes a directory access protocol as a means to both uniquely identify materials and to store within the directory itself data related to the materials.
  • the invention utilizes a directory access protocol not just for names, but for data as well.
  • the invention also utilizes the directory access protocol as the basis for XML name spaces for scientific data (e.g. genome and biological data sets) to facilitate data interchange and viewing using directory services and protocols.
  • An advantage of using a directory access protocol is the ease of search. Directories are designed to be searched quickly and efficiently, even when containing a large population of entries.
  • a specialized set of standard types and standard objects are defined to extend the use of the directory to other fields, and in one embodiment, to the field of biological data.
  • An advantage of the system is to be able to identify samples of biological material and all users associated with the biological material. For example, if multiple users purchased a particular biological sample from a company, and the company later wanted to contact all purchasers of that particular batch of that particular sample, it would be possible to do so using the directory protocol driven scheme of the present invention.
  • Figure 1 is an example of a database structure.
  • Figure 2 is a tree structure of a standard LDAP directory.
  • Figure 3 is a tree structure of a directory with extensions of an embodiment of the present invention.
  • FIG. 4 is a block diagram of a general computer system for implementing the present invention.
  • the present invention takes advantage of directory addressing protocols to store data instead of directory information.
  • the invention results in the ability to uniquely identify data.
  • the invention provides flexibility in storing data and associated parameters.
  • One embodiment of the invention is used to store biological data such as flow cytometry data.
  • one embodiment relates to the storage of data associated with a biological sample.
  • the storage of flow cytometry data associated with a biological sample is one such feature.
  • Flow cytometry is a technique for obtaining information about cells and cellular processes that operates by allowing a thin stream of a single cell suspension to flow through one or more laser beams and measuring the resulting light scatter and emitted fluorescence. It is a widely applicable technique and is widely used in basic and clinical science, especially immunology. Its importance is increased by the fact that it is also possible to sort fluorescent labeled live cells for functional studies with an instrument called the Fluorescence Activated Cell Sorter (FACS).
  • FACS Fluorescence Activated Cell Sorter
  • the present invention takes advantage of directory access protocols and systems to provide a manner of uniquely identifying biological samples such as flow cytometry data.
  • One directory protocol used in an embodiment of the invention is the Lightweight Directory Access Protocol (LDAP).
  • LDAP is a software protocol for enabling the location of organizations, individuals, and other resources such as files and devices in a network, whether on the Internet or on a corporate intranet.
  • LDAP is a "lightweight" (smaller amount of code) version of DAP (Directory Access Protocol), which is part of X.500, a standard for directory services in a network.
  • a directory tells you where in the network something is located.
  • DNS Domain Name System
  • LDAP makes it possible to search for an individual without knowing the domain.
  • An LDAP directory is organized in a simple "tree" hierarchy consisting of the following levels: the root directory, countries, organizations, organizational groups, and individuals.
  • This example tree structure of an LDAP directory is illustrated in Figure 2.
  • the topmost node of the tree is the root node 201.
  • the children of the root directory are country nodes 202.1 and 202.2.
  • Each country node can have child organization nodes such as organization nodes 203.1 and 203.2 (children of country node 202.2).
  • Each organization node can have child group nodes, such as nodes 204.1, 204.2, and 204.3, which are children of organization node 203.2.
  • Each group can have child nodes representing individuals, such as group node 204.3 having child nodes 205.1, 205.2, and 205.3. A concrete distinguished name for such an individual is sketched below.
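  • To make the hierarchy concrete, the following is a hypothetical distinguished name (dn) for an individual at the bottom of such a tree; all names are illustrative only.

```
# Each component of the dn names one level of the Figure 2 hierarchy:
# individual, organizational group, organization, country.
dn: cn=Jane Doe, ou=Immunology, o=Example University, c=US
```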
  • An LDAP directory can be distributed among many servers. Each server can have a replicated version of the total directory that is synchronized periodically.
  • An LDAP server is called a Directory System Agent (DSA).
  • DSA Directory System Agent
  • An LDAP server that receives a request from a user takes responsibility for the request, passing it to other DSAs as necessary, but ensuring a single coordinated response for the user.
  • The present invention extends LDAP protocols to make them usable not just as directories, but also to provide data itself.
  • the present invention takes advantage of the hierarchical levels of LDAP already established by the International Standards Organization (ISO) and uses those hierarchical levels to provide a first level of uniqueness to the biological sample to be named.
  • ISO International Standards Organization
  • objects such as monoclonal antibodies can be named relative to the unique distinguished name of an investigator or organization. That means that unique identifiers can be assigned to biological materials early in the scientific process, which facilitates professional communication, both informal and published. In the future, investigators who have this distinguished name can identify the material unambiguously via the unique name. If a directory service is maintained, an investigator can determine if the sample has been given an official name, if it has been shown to be equivalent to another entity, or if it has been cited in the literature.
  • the embodiment of the invention provides definitions and attributes that can be used to define biological samples.
  • the invention takes advantage of three parts of LDAP, the informational model, the functional model, and the namespace.
  • the information model defines entries which have a set of named attributes that can have one or more values and may be absent.
  • the ability to have absent attributes solves the problem of databases that require an entry in every field.
  • the invention can provide attributes that may only be rarely used with no worry about adding to overhead.
  • Each attribute has a name and a type and each type has a name and a syntax which is expressed in Abstract Syntax Notation One (ASN.1).
  • ASN.1 Abstract Syntax Notation One
  • Every entry must have an attribute objectClass which defines what attributes are possible and which are required and may have an attribute aci (for access control information) which the server uses to control access to the entry.
  • Object classes are hierarchical, i.e., a class can inherit attributes from a parent class and by defining new attributes extend its scope
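  • As one hedged illustration of such an extension, a new object class could be declared in RFC 2252-style schema syntax roughly as follows; the OIDs, and the reuse of the cloneName attribute discussed below, are hypothetical and not the formal schema of the invention.

```
# Hypothetical schema sketch: a cloneName attribute type and a
# monoclonalAntibody object class inheriting from top.
attributetype ( 1.1.2.1 NAME 'cloneName'
    EQUALITY caseIgnoreMatch
    SYNTAX 1.3.6.1.4.1.1466.115.121.1.15 SINGLE-VALUE )

objectclass ( 1.1.1.1 NAME 'monoclonalAntibody'
    SUP top STRUCTURAL
    MUST ( cloneName )
    MAY ( description $ seeAlso ) )
```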
  • the entries in a directory are organized hierarchically. That is to say that any entry may have one or more subentries so that the whole structure may be visualized as a tree.
  • rdn relative distinguished name
  • the functional model defines a set of operations which may be applied to a directory: read, list, search, add, modify, and delete, plus bind, unbind, and abandon, which are used to establish the user's credentials, end a connection to the server, and cancel a running query, respectively.
  • the search function starts from a root dn and finds all entities further down in the hierarchy which pass a search filter constructed from a group of comparisons including equals, less than, contains, sounds like, etc. applied to the attributes of the entity.
  • a search filter may test the objectClass attribute and return only entries of a particular type. Clients can specify searches which return all the attributes of each entry or only a selected set of attributes.
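  • A minimal sketch of such a search, written against the standard JNDI API discussed later in this document; the server URL, base dn, and the monoclonalAntibody/cloneName schema elements are assumptions for illustration.

```java
import java.util.Hashtable;
import javax.naming.*;
import javax.naming.directory.*;

// Hypothetical sketch: subtree search below a root dn with a filter that
// tests objectClass, returning only a selected set of attributes.
public class DirectorySearch {
    public static void main(String[] args) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://ldap.example.org:389");
        DirContext ctx = new InitialDirContext(env);

        SearchControls controls = new SearchControls();
        controls.setSearchScope(SearchControls.SUBTREE_SCOPE);
        controls.setReturningAttributes(new String[] { "cloneName", "description" });

        NamingEnumeration<SearchResult> results = ctx.search(
            "o=Example University, c=US",          // root dn of the search
            "(objectClass=monoclonalAntibody)",    // filter on entry type
            controls);
        while (results.hasMore()) {
            SearchResult r = results.next();
            System.out.println(r.getNameInNamespace() + " " + r.getAttributes());
        }
        ctx.close();
    }
}
```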
  • Monoclonal antibodies are distinguished by cloneName or clone which is unique within the parent entity which must be an investigator or organization.
  • Lymphocyte differentiation antigens form a thesaurus of the target specificities of monoclonal antibodies and would include, but not be limited to, the official CD names.
  • LDAP and X.500 define a set of standard types and standard objects mostly for describing persons and documents and more suitable for business than scientific use.
  • the present invention contemplates types added for scientific use, particularly real numbers and possibly dimensional units, so that scientifically relevant information could be conveniently stored in and accessed from directories.
  • the following are example sets of objects for the field of flow cytometry.
  • Figure 3 illustrates the extension of the LDAP tree structure with the object extensions identified above in Tables 1 through 7.
  • The scientific investigator (Table 1) can be at the individual level of the tree, such as individual 205.1.
  • the scientific instrument used by the investigator can be identified in a child node 206.
  • a publication associated with the work or experiment is at node 207. This node may be empty if the work is not published immediately.
  • the use of LDAP permits the system to include an object that may be absent without the need for filling it with null values.
  • a monoclonal antibody node 208 is defined, along with nodes 209, 210, and 211, corresponding to FACS instrument, FACS experiment, and FACS sample respectively.
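  • As a hedged illustration of how such entries might be serialized, the following LDIF sketch shows a monoclonal antibody named relative to an investigator's distinguished name, with a FACS sample below it; the attribute names and values are illustrative stand-ins for the formal definitions of Tables 1 through 7.

```
dn: cloneName=Anti-CD8-1, cn=Jane Doe, o=Example University, c=US
objectClass: monoclonalAntibody
cloneName: Anti-CD8-1

dn: facsSampleId=S-00042, cloneName=Anti-CD8-1, cn=Jane Doe, o=Example University, c=US
objectClass: facsSample
facsSampleId: S-00042
```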
  • the invention also utilizes the directory access protocol as the basis for XML name spaces for scientific data (e.g. genome and biological data sets) to facilitate data interchange and viewing using directory services and protocols.
  • XML extensible markup language
  • HTML hypertext markup language
  • HTML describes the geometry and appearance of a page of data, in effect creating holes or slots in which data is inserted.
  • a user might be presented with a page that includes recognizable information, such as name, address, and phone number. But to HTML, the data is simply text to display.
  • XML provides a protocol where the type of data being used can be identified. XML can do this in part using predefined "schemas" that can be used to understand the type of data being transmitted. If a standard schema is used, the data need only include a reference to the schema, which need not travel with the data. If a custom schema is used, it can be sent before or after the data, or explicit directions to the location of the schema can be provided.
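  • A minimal sketch of data carrying such a schema reference is shown below; every name and URL in it is hypothetical.

```xml
<!-- The namespace identifies the type of data being used, and the
     schemaLocation is a reference to the schema, which need not travel
     with the data. -->
<facs:sample xmlns:facs="urn:example:facs-schema"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="urn:example:facs-schema http://example.org/facs.xsd">
  <facs:cloneName>Anti-CD8-1</facs:cloneName>
  <facs:eventCount>10000</facs:eventCount>
</facs:sample>
```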
  • An embodiment of the invention can be implemented as computer software in the form of computer readable code executed on a general purpose computer such as computer 400 illustrated in Figure 4, or in the form of bytecode class files running on such a computer.
  • a keyboard 410 and mouse 411 are coupled to a bi-directional system bus 418. The keyboard and mouse are for introducing user input to the computer system and communicating that user input to processor 413. Other suitable input devices may be used in addition to, or in place of, the mouse 411 and keyboard 410.
  • I/O (input/output) unit 419 coupled to bi-directional system bus 418 represents such I/O elements as a printer, A/V (audio/video) I/O, etc.
  • Computer 400 includes a video memory 414, main memory 415 and mass storage 412, all coupled to bi-directional system bus 418 along with keyboard 410, mouse 411 and processor 413.
  • the mass storage 412 may include both fixed and removable media, such as magnetic, optical or magnetic optical storage systems or any other available mass storage technology.
  • Bus 418 may contain, for example, thirty-two address lines for addressing video memory 414 or main memory 415.
  • the system bus 418 also includes, for example, a 32-bit data bus for transferring data between and among the components, such as processor 413, main memory 415, video memory 414 and mass storage 412. Alternatively, multiplexed data/address lines may be used instead of separate data and address lines.
  • the processor 413 is a microprocessor manufactured by Motorola, such as the 680X0 processor or a microprocessor manufactured by Intel, such as the 80X86, or Pentium processor, or a SPARC microprocessor from Sun Microsystems, Inc.
  • Main memory 415 is comprised of dynamic random access memory (DRAM).
  • Video memory 414 is a dual-ported video random access memory. One port of the video memory 414 is coupled to video amplifier 416.
  • the video amplifier 416 is used to drive the cathode ray tube (CRT) raster monitor 417.
  • Video amplifier 416 is well known in the art and may be implemented by any suitable apparatus. This circuitry converts pixel data stored in video memory 414 to a raster signal suitable for use by monitor 417.
  • Monitor 417 is a type of monitor suitable for displaying graphic images.
  • Computer 400 may also include a communication interface 420 coupled to bus 418.
  • Communication interface 420 provides a two-way data communication coupling via a network link 421 to a local network 422.
  • ISDN integrated services digital network
  • communication interface 420 provides a data communication connection to the corresponding type of telephone line, which comprises part of network link 421.
  • LAN local area network
  • communication interface 420 provides a data communication connection via network link 421 to a compatible LAN.
  • Wireless links are also possible.
  • communication interface 420 sends and receives electrical, electromagnetic or optical signals which carry digital data streams representing various types of information.
  • Network link 421 typically provides data communication through one or more networks to other data devices.
  • network link 421 may provide a connection through local network 422 to local server computer 423 or to data equipment operated by an Internet Service Provider (ISP) 424.
  • ISP 424 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the "Internet" 425.
  • Internet 425 uses electrical, electromagnetic or optical signals which carry digital data streams.
  • the signals through the various networks and the signals on network link 421 and through communication interface 420, which carry the digital data to and from computer 400, are exemplary forms of carrier waves transporting the information.
  • Computer 400 can send messages and receive data, including program code, through the network(s), network link 421, and communication interface 420.
  • remote server computer 426 might transmit a requested code for an application program through Internet 425, ISP 424, local network 422 and communication interface 420.
  • the received code may be executed by processor 413 as it is received, and/or stored in mass storage 412, or other non-volatile storage for later execution. In this manner, computer 400 may obtain application code in the form of a carrier wave.
  • Application code may be embodied in any form of computer program product.
  • a computer program product comprises a medium configured to store or transport computer readable code, or in which computer readable code may be embedded.
  • Some examples of computer program products are CD-ROM disks, ROM cards, floppy disks, magnetic tapes, computer hard drives, servers on a network, and carrier waves.
  • the computer systems described above are for purposes of example only.
  • An embodiment of the invention may be implemented in any type of computer system or programming or processing environment.
  • a benefit of the directory protocol based approach of the present invention is access control.
  • in conventional database systems, access control is limited to a table or view granularity.
  • the invention also permits easy replication of databases, with the possibility of automatic and synchronous replication. It also permits a true federated approach to data storage.
  • Flow cytometry [1] is a technique for obtaining information about cells and cellular processes by allowing a thin stream of a single cell suspension to "flow" through one or more laser beams and measuring the resulting light scatter and emitted fluorescence. Since there are many useful ways of rendering cells fluorescent, it is a widely applicable technique and is very important in basic and clinical science, especially immunology. Its importance is increased by the fact that it is also possible to sort fluorescent labeled live cells for functional studies with an instrument called the Fluorescence Activated Cell Sorter (FACS). At our FACS facility alone, we have processed millions of samples in the last 15 years.
  • ISO International Standards Organization
  • X.500 [3] is the core of a set of standards adopted by the International Standards Organization (ISO) beginning in 1988, which define what may simply be called directory service.
  • a directory is fundamentally a database. Directories were originally defined in order to allow users and their agents to find information about people, typically their telephone number but possibly including postal address, e-mail address and other information. This was extended to include documents, groups of users and network accessible resources such as printers and, more recently, databases.
  • Three parts of the standard are of particular interest: the information model, the functional model and the namespace.
  • the X.500 information model is very powerful and flexible.
  • the standard defines entries which have a set of named attributes that can have one or more values and may be absent.
  • Each attribute has a name and a type and each type has a name and a syntax which is expressed in Abstract Syntax Notation One (ASN.1).
  • ASN.1 Abstract Syntax Notation One
  • A number of syntaxes such as case exact string, case ignore string, telephone number, integer, distinguished name and binary are recognized. Every entry must have an attribute objectClass which defines what attributes are possible and which are required, and may have an attribute aci (for access control information) which the server uses to control access to the entry.
  • Object classes are hierarchical, i.e., a class can inherit attributes from a parent class and by defining new attributes extend its scope.
  • the entries in a directory are organized hierarchically. That is to say that any entry may have one or more subentries so that the whole structure may be visualized as a tree.
  • rdn relative distinguished name
  • the functional model defines a set of operations which may be applied to a directory: read, list, search, add, modify, delete (which are pretty much self-explanatory) and bind, unbind and abandon, which are used to establish the user's credentials, end a connection to the server and cancel a running query, respectively.
  • the search function starts from a root dn and finds all entities further down in the hierarchy which pass a search filter constructed from the "usual suspects", i.e., equal, less than, contains, sounds like, etc., applied to the attributes of the entity.
  • a search filter may of course test the objectClass attribute and return only entries of a particular type. Clients can specify searches which return all the attributes of each entry or only a selected set of attributes.
  • DAP Directory Access Protocol
  • OSI Open System Interconnect
  • Unfortunately, one X.500 function known as referral was not included in LDAP v2. Referral allows one DSA to return to the client a referral which directs the client to try again on a different DSA. An LDAP v2 server is supposed to follow all referrals on behalf of the client and not return them to the client at all.
  • LDAP v2 [5] was proposed to the Internet Engineering Task Force (IETF) as a draft standard but was not adopted due to its technical limitations. This led to the effort to define a more acceptable version. Also in this period the utility of standalone LDAP servers, i.e., servers which implemented the information and functional models directly rather than relying on a higher tier of X.500 servers, became clear.
  • LDAP v3 [6] addresses the problems discussed above and was adopted by IETF in 1998 as a proposed standard for read access only. The IETF feels that the authentication mechanisms are inadequate for update access but has allowed the standard to proceed for read access when some other means of updating is used (see also Hodges [7]).
  • A familiar metaphor for directory service is the rolodex or a box of 3x5 cards.
  • directory servers manage smallish packets of information (a directory entry or card) associated with named persons or organizations that can record a diverse set of attributes.
  • Directory service is not simply a billion card rolodex, however, because the servers don't just maintain the information; they will search through it for you and return only selected information. Servers can also suggest other servers (referrals) to enlist in the effort, i.e., you may end up searching several directories to get a result but not need to be aware of this.
  • Directory servers do not perform the join operation that relational databases use to combine information from different tables. Instead they offer increased flexibility in representing and searching for information.
  • An attribute of an entry in a directory may be missing or have multiple values. While it is possible to represent multiple values in relational form, it requires introducing new tables and joins, i.e., substantial overhead and complexity, so it is generally not done unless it is necessary. Missing values are usually supported in relational databases but usually require storing a special missing data value.
  • the low overhead for missing and multiple values in a directory makes it much easier to accommodate rarely used attributes and occasional exceptions such as persons with multiple telephone numbers. Directories are organized and searched hierarchically. Again it is possible to do this with SQL stored procedures and temporary tables, but it is awkward.
  • a directory in many ways is an object oriented database.
  • the difference between directory service and a traditional OODB is that a directory associates attributes with objects but not methods, and that binding to the attributes is done at runtime as a lookup operation rather than at compile time.
  • the latter consideration is similar to the relationship of interpreted BASIC to compiled higher level languages, with analogous benefits (to the programmer and user) of simplicity, flexibility and rapid development, and costs (to the computer) in performance.
  • Frames are a data structure commonly used in artificial intelligence shells. The key feature of frames is that they inherit properties from their parents. Directory entries do not do this because objectClasses inherit attributes but not attribute values from their parents. However, this functionality can easily be implemented on the client side.
  • a more flexible scheme would be to define an entry of class aiFrame to include a dn-valued attribute aiParentFrame and to trace that, as sketched below. Eventually it might be beneficial to move this to the server side, either by defining an LDAP extension or by defining a new ancestor scope option for the search function.
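  • A minimal client-side sketch of this tracing, using the JNDI API described below; the aiFrame/aiParentFrame names come from the text, while everything else is illustrative.

```java
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.Attributes;
import javax.naming.directory.DirContext;

// Hypothetical sketch: resolve an attribute for an aiFrame entry by
// following the dn-valued aiParentFrame attribute up the ancestor chain
// until some ancestor supplies a value.
public class FrameLookup {
    static Object inheritedValue(DirContext ctx, String dn, String attrName)
            throws NamingException {
        while (dn != null) {
            Attributes attrs = ctx.getAttributes(dn,
                new String[] { attrName, "aiParentFrame" });
            Attribute a = attrs.get(attrName);
            if (a != null) {
                return a.get();                       // value found at this level
            }
            Attribute parent = attrs.get("aiParentFrame");
            dn = (parent == null) ? null : (String) parent.get();  // climb one level
        }
        return null;                                  // no ancestor defines it
    }
}
```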
  • URLs Uniform Resource Locators
  • DNS Domain Name System
  • LDAP Lightweight Directory Access Protocol
  • Standalone servers and LDAP to X.500 gateways are available from several sources.
  • LDAP client libraries are available for the C language from Univ. Michigan and Netscape and for the Java language from Sun and Netscape.
  • LDAP is a standard which is directly utilized by the clients, and all clients should be able to talk to all servers.
  • SQL standardization has more to do with transportability of programmers and database schema than interoperability of databases
  • the X.500 information model is extremely flexible, and search filters provide a powerful mechanism for selecting entries, at least as powerful as SQL and probably more powerful than a typical OODB.
  • the standard defines an extensibleObject which can have any attribute, and some standalone LDAP implementations permit relaxed schema checking, which in effect makes any object extensible. Since an attribute value may be a distinguished name, directory entries can make arbitrary references to one another, i.e., across branches of the directory hierarchy or between directories.
  • Some LDAP and X.500 servers permit fine grained access control. That is to say that access controls can be placed on individual entries, whole subtrees (including the directory itself) and even individual attributes if necessary. This level of control is not available in most existing databases.
  • Referrals mean that one server which cannot resolve a request may refer the user to another server or servers which may be able to do so. During a search operation any referrals encountered are returned with the entries located, and the user (or client) has the option of continuing the search on the servers indicated. This allows federation of directories, which means that multiple LDAP/X.500 servers can present to the user a unified namespace and search results even though they are at widely separated locations and the implementations may actually be very different.
  • JNDI Java Naming and Directory Interface
  • Java Naming and Directory Interface (JNDI) [12] is a standard extension to the Java language introduced by Sun. It includes an abstract implementation of name construction and parsing which encompasses the X.500 namespace among others, and an abstract directory that is essentially the X.500 information and functional models. Specific implementations (service providers [13]) are available for LDAP, Network Information Service (NIS) and even the computer's own file system.
  • JNDI removes many of the limitations of LDAP as an OODB by providing a standard way to identify the Java class corresponding to a directory entity and instantiate it at runtime. It is also possible to store serialized Java objects as attribute values. Sun has proposed a set of standard attributes and objectClasses to
  • st stateOrProvinceName
  • Monoclonal antibodies are distinguished by cloneName or clone, which is unique within the parent entity, which must be an investigator or organization.
  • Lymphocyte differentiation antigens form a thesaurus of the target specificities of monoclonal antibodies and would include, but not be limited to, the official CD names.
  • X.500 defines a sparse set of standard types and standard objects, mostly for describing persons and documents and more suitable for business than scientific use. However, if types were added for scientific use, particularly real numbers and possibly dimensional units, much scientifically relevant information could be conveniently stored in and accessed from directories.
  • A minimal set of objects for the field of flow cytometry is presented to lend concreteness to the discussion. A fuller and formal definition will follow.
  • NISO National Information Standards Organization
  • ASID Accessing, Searching and Indexing Directories
  • IETF Internet Engineering Task Force
  • the objectClass scientificPublication should have optional multi-valued attributes reference and citation which are distinguished names. When the publisher establishes the record they will fill in the reference with the dn of another scientificPublication which this one references. An indexing service would buy the rights to replicate the raw data and, when new data appeared, update the citations in its copy, then serve the result as "value added" to its customers.
  • Fluorescence- Activated Flow Cytometry was initially developed because of the needs of cellular immunologists to distinguish functional lymphocyte populations. Subsequent to the development of hybridoma (monoclonal) antibodies, trillions of cells have been analyzed, sorted, and categorized using flow cytometry. What started as an immunologist's research tool is used today in molecular and cellular research by both clinical and basic research investigators. This is a short list of the diverse projects supported by flow cytometry:
  • Section I Part A contains a description of the significance of flow cytometry in helping advance our understanding of the immune system.
  • This flow cytometry digital library is targeted at two major audiences. The first is the flow cytometry user community involved in the diverse range of research areas listed above; and the second is the digital library development community involved in developing the infrastructures of other digital libraries. We believe the innovative use of directory services as Card Catalogs that refer to other data sources can be generalized and used to link diverse data collections. In addition there may be a third group consisting of individuals interested in our work on clustering and developing ways to describe cell populations.
  • the project is divided into three parts:
  • We plan to introduce library services in two phases. At the end of Phase I we will do a controlled release of a testbed application that will access core library features. These features include a Card Catalog of user and experiment information and a Central Data Archive containing instrument data. In Phase II we will add new search procedures with which to query the library, and add an antibody dictionary and an antigen thesaurus to the Card Catalog. We will also respond to user feedback from the Phase I testbed release.
  • FIG. 1 is a diagram of this design specification.
  • Investigators use workspaces in the FACS Desk framework to plan and organize their experiments and results. This framework makes it easy to run experiments, retrieve data, and use other FACS Desk application modules or other third-party desktop applications to analyze and visualize their data.
  • the accumulated FACS Desk Data Archive consists of all the experiments and data from the Stanford Flow Cytometry User Group. It is a library of flow cytometry data that can be accessed by Stanford users having a FACS Desk account. When new users want access to this library, they require new accounts, which results in a need for more systems and increases the accounting administration load. We foresaw that the Web version of FACS Desk would only exacerbate these problems. Remote users accessing the library would further burden our computer systems and network bandwidth.
  • FIG. 2 is a diagram of our Internet Application. The key to this new design is an innovative use of directory services not only as a user directory, but also as a Card Catalog for searching and browsing other data sources.
  • The FACS Desk user shown in the middle of the diagram in Figure 2 is a data author.
  • the digital library user shown at the top of the diagram is the data reader.
  • the reader is looking for information that is authored (or owned) by other users.
  • the reader expects the library service to provide access to diverse data collections.
  • We will describe a digital library infrastructure that is very analogous to visiting a library and searching for references in the library's card catalog. When a visitor finds a card of interest, the next step is to find the reference in the local library's stacks or use the library's service to access stacks in other remote library collections.
  • Phase I activities include defining an evolving Recommended Data Standard that remains backward compatible with existing flow cytometry data formats. We will seek input from key players in the field of flow cytometry, and anticipate this process will be an ongoing evolution.
  • the data attributes described in the evolving Standard define the initial schema for the directory service.
  • the directory service and data archive will then be populated with user and experiment data from all of our Consortium Members. We plan to use the University of California at San Diego's Super Computing Resources through the National Partnership for Advanced Computational Infrastructure Program as a Central Archive for all instrument data files [5].
  • the directory service which we will refer to as the Card Catalog, will be distributed and replicated to each participating site using the inherent functionality of directory services. Users will access the Card Catalog using a Web browser with JAVA plug-ins.
  • Figure 3 is a logical network diagram of the controlled release. Users will be able to access, search, and browse the Card Catalog and then view or download data for analysis using third-party applications. Phase I activities include exploring economic models to support and maintain the digital library beyond the grant period. We expect to receive feedback on these as well as usability issues during the controlled release.
  • the data archived by the Stanford FACS Desk software suite and used at the University of Iowa, Fox Chase Cancer Center, and Stanford (as well as sites in Japan and Germany) is readily exported to the digital library.
  • the combined library from the three US sites alone is near a terabyte of data.
  • Today over ten thousand flow cytometry instruments are used in basic research and clinical settings generating hundreds of gigabytes of data daily.
  • the emergence of directory services in the computing industry derives from the need to provide users with complete and transparent access to network resources and other network users.
  • the primary role of directory services is to translate network names to network addresses and to provide a unified naming space (schema) for all network entities.
  • LDAP is a simple mechanism for Internet clients to query and manage a directory service.
  • a directory service is basically an arbitrary database of hierarchical attribute/value pairs. Such databases are generally X.500-compliant.
  • X.500 is a directory service specification supported by the International Organization for Standardization (ISO) [7] and the Consultative Committee for International Telephony and Telegraphy (CCITT) [8].
  • the Internet Activities Board (IAB) also has published RFC (Request for Comments) 1006 specifying the deployment of X.500 over TCP/IP networks.
  • Card Catalogs containing "bits and bytes" of metadata and data abstractions can be distributed and replicated by federated directory services. Searching the Card Catalog will quickly determine whether something exists. When a card of interest is found, the card refers the user or application to another data source, which might be a file server or a relational database or off-line data.
  • the Consortium Members and the individuals involved in defining the Recommended Data Standard include the most likely candidates to support the flow cytometry digital library after the grant period is over. During Phase I we will direct a business intern to put together several business models based on input from both Consortium Members and participants developing the Recommended Data Standard. We expect that several models will be economically feasible.
  • This Central Resource might also provide a Master Directory Service that distributes and replicates subscribed subdirectories to federated local and regional services. We need this Central Resource because searching the existing print and electronic literature asking, "Has this been done before?" or "Has anyone done a similar study on another patient cohort?", cannot provide accurate answers.
  • a Central Flow Cytometry Resource would maximize the use of flow cytometry data and enhance collaboration between investigators. At least for the period supported by this grant, we intend to use the San Diego Supercomputer Center as this exemplary Central Resource. An added benefit for doing this is it enables other investigators to mine this large data source using novel statistical strategies.
  • the Phase I testbed application core is the infrastructure for the Flow Cytometry Digital Library. Using this core, we provide access to the Library's Card Catalog, where individual Cards may refer to data sources either in an SQL server or a file server. Requested data is delivered as MIME types and transported as JARs (see Section I Part ...). This scenario describes a general solution for providing distributed access and an efficient means to capture and search for information in digital libraries. Phase I is complete when we build an exemplary Central Resource (the Public Library) for Flow Cytometry data.
  • In Phase I we provide the capability to search for data using experiment-centric attributes. This is a significant improvement over what is available today.
  • In Phase II we attempt to improve our capability to do meaningful searches. We have divided this challenge into three parts. The first is to develop computer-assisted methods to find cell populations in n-dimensional data; the second is to describe these cell populations in a way that is machine-understandable. This is a high-risk undertaking since searching for populations in n-dimensional data is fundamentally a search for clusters. The third part is to build an antibody directory and an antigen thesaurus to encourage the use of a common vocabulary and thereby improve the reliability of library searches. These features will be included in the Card Catalog. Some of these Phase II development activities overlap Phase I.
  • CD8dim means dim fluorescence staining with fluorescence-tagged anti-CD8 antibody, which loosely translates to a low cell-surface CD8 antigen density [15].
  • Cell populations are also described by their functional phenotype, such as “killer cells;” and inescapably they are described using both cell-surface and functional phenotypes, such as “CD8+ killer cells.”
  • The biological significance of n-dimensional measurements using flow cytometry is described in Section I Part A.
  • There are two problems: the first is numerically finding the populations; the second is visualizing them.
  • Should this be too difficult, it would still be significant to provide this functionality examining lower-dimensional data after some pre-selection by user or machine interaction.
  • the Solution is an Interplay of Statistics and Computation
  • a mixture of log-concave densities can be shown to be always of the form exp(g(x) + c|x|^2), where g is a concave function and c > 0.
  • the logic of this model is very amenable to detecting cell populations:
  • the statistical decisions for example can be based on likelihood ratio tests.
  • the computational problems can be reduced to methods such as Delaunay triangulation.
  • the maximum log likelihood estimate is piecewise linear over this triangulation. This is attractive for visualization purposes, because surfaces are usually displayed on a computer as piecewise linear functions. Additional research will explore how to further simplify this representation in order to transmit it faster over the Internet. Clearly, a good approximation to the surface would only require a fraction of the observations in the triangulation.
  • Classification of populations could be based on properties of log (density), which is known to be concave, such as skewness, curvature etc. This could provide the basis to numerically describe cell populations.
  • Phase II feasibility may be determined initially using a limited number of data sets.
  • NPACI National Partnership for Advanced Computational Infrastructure
  • Our partnership assures that we not only have a committed Central Resource for the Flow Cytometry Digital Library but that we have the computing power needed to test new data analysis procedures on large volumes of data.
  • the Digital Library also ensures other investigators access to "real" data in order to explore other novel methods to extract information and insights. Since the entire library infrastructure is built using Internet Standards other investigators and commercial vendors may build their own unique solutions to finding and naming cell populations.
  • Phase II activities include compiling a reagent dictionary and an antigen thesaurus as part of the Card Catalog (i.e., the directory service). We foresee at least two interfaces to this part of the directory service. General library users (diagrammed in Figure 1) will access this information using Web browsers with JAVA plug-ins or Web pages with embedded ODBC links or XML DTDs, while flow cytometry instrument users will have access using applications like the FACS Desk Protocol Editor. Examples of the directory service schemata for antibodies and antigens are provided in Section I Part B. Phase II activities also include gathering additional input on these schemata.
  • the Card Catalog may contain any "bits and bytes" abstracted from other data sources. We envision that future applications will include “special cards.”
  • the first candidate for "special cards” may be abstracted descriptions of cell populations from raw flow cytometry data. This would enable searching the "literature” for cell populations rather than searching for the use of particular antibodies or an appropriate combination of keywords.
  • the cards are flexible enough to accommodate differences and scalable enough to include extensions.
  • In addition to numerical data, scientific data sets need to contain a great deal of additional information that allows the numerical data to be integrated into a larger experimental context. DICOM has an elaborate object hierarchy and specifies ways for moving it about. HDF is at least compatible with implementing such a hierarchy, either using vgroups and/or annotations. Historically, the lack of such hierarchical structure was a major criticism of FCS from the beginning and influential in our decision not to use CDF some time later. An exciting new possibility is storing (or replicating) this information separately in LDAP or JNDI directories.
  • MIME headers and content can be parsed by simple rules, which allows lightweight applications to parse and retrieve the information they need and ignore information they don't need or understand. (Historically, another major criticism of FCS was the failure of the HEADER, ANALYSIS, and OTHER segments to achieve this.) MIME headers are text, so that knowledgeable humans can read and interpret them. This facilitates development and maintenance of lightweight applications. MIME is flexible enough to encompass complex applications. Various implementations based on MIME are widely available on many platforms. It is widely and heavily used on the Internet. Software for parsing MIME headers exists on any system which implements SMTP (e-mail) or HTTP (World Wide Web).
  • SMTP e-mail
  • HTTP World Wide Web
  • MIME content can be reliably, and in some cases securely, transported by the standard protocols of the Internet: FTP, SMTP, HTTP, HTTPS, etc. It is even possible to send MIME messages containing binary data through text-based e-mail systems.
  • JARs are a MIME flavored standard advanced by Sun and JavaSoft to implement secure and efficient transport of Java applets and their resources to clients on the Internet. They combine MIME content with manifest and signature files, which provide packaging and error detection as well as optional compression and signature verification for either individual elements or the whole contents. JARs are based on the popular and widely available ZIP format. (NASA maintains a public archive of freeware programs to read and write ZIP files on many machines. ZIP is expected to become a documented API in the Windows operating systems.) JAR implementations are freely available on the Internet as part of the Java Software Development Kit. It is also incorporated into Netscape's product suites, which are free to educational and non-profit users.
  • the MIME standard defines an open-ended set of content types. I will specify several new content types specialized for statistical and cytometry data types for which existing types appear insufficient. In addition, I will define additional semantics that can be used with some existing types to enhance their utility for cytometry applications.
  • LDAP defines a simple text encoding LDIF which can be used to transport directory trees and sub trees.
  • a text type is chosen so that power users and implementers will be able to read the files for development and maintenance.
  • the volume of the annotations is not likely to be so large as to cause problems, and these files can be substantially compressed using the standard ZIP algorithms.
  • Each change of an attribute value constitutes an event that specifies an attribute, a new value, the time (UTC) and an agent identifier.
  • the agent field indicates the source of the change; for example, it should indicate whether the change was initiated by the operator or by an auto-calibration utility, auto-sampler or some other experimental sequencing apparatus.
  • Time and agent data in journal files can be compressed by storing delta times, i.e., differencing, storing only the changes between agents, and prefix compression of the attributes and agents, as sketched below.
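  • A minimal sketch of the differencing idea, in Java for concreteness; the class and method names are illustrative only.

```java
// Hypothetical sketch: store journal event times as deltas from the
// previous event rather than as absolute values, and restore them with
// a running sum.
public class DeltaTimes {
    static long[] encode(long[] times) {
        long[] deltas = new long[times.length];
        long prev = 0;
        for (int i = 0; i < times.length; i++) {
            deltas[i] = times[i] - prev;   // store the change, not the value
            prev = times[i];
        }
        return deltas;
    }

    static long[] decode(long[] deltas) {
        long[] times = new long[deltas.length];
        long prev = 0;
        for (int i = 0; i < deltas.length; i++) {
            prev += deltas[i];             // running sum restores the value
            times[i] = prev;
        }
        return times;
    }
}
```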
  • Data from samples with local concentrations, for example chromosome data, or small to medium data sets of very high resolution, might be compressed by the methods discussed under the multi-variate histogram types. Histograms from very large data sets might be compressed by differencing. Single-variable histograms at reasonable resolution are not so large that compression is very important, because the data transfer time is small compared to the connection setup time.
  • compression of the zeros may yield significant additional compression. It would also be desirable for the algorithm to choose the code at run time, based on the sample size and number of bins, on the basis of a theoretical analysis of this relationship. The final version of this algorithm will be codified for the standard.
  • list mode data from whole cells is generally not compressible to a useful degree (Bigos).
  • time data in list mode may be compressed by run compression or differencing. Apart from the bit packing it requires, this proposal does not support any type of compression for list mode data.
  • Nested loops in the pack and unpack routines will be most efficient if the inner loop is the longer.
  • the inner loop will be fastest if the bit size is constant while it executes; a bit-packing sketch with this loop structure follows.
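A minimal Java sketch of bit packing with the recommended loop structure appears below. The layout chosen here (parameters packed column by column, most significant bit first) is an assumption made for illustration, not a prescription of the standard:

```java
// Sketch only: the short outer loop runs over parameters, each with its own
// fixed bit size; the long inner loop runs over events while the bit size
// is held constant, as recommended above.
public class BitPackSketch {
    // Pack data[parameter][event] into a single bit stream, column by column.
    public static byte[] pack(int[][] data, int[] bitSizes) {
        long totalBits = 0;
        for (int p = 0; p < data.length; p++) totalBits += (long) bitSizes[p] * data[p].length;
        byte[] out = new byte[(int) ((totalBits + 7) / 8)];
        long pos = 0;
        for (int p = 0; p < data.length; p++) {        // outer loop: parameters
            int bits = bitSizes[p];                    // constant inside the inner loop
            for (int e = 0; e < data[p].length; e++) { // inner loop: events (the long one)
                writeBits(out, pos, data[p][e], bits);
                pos += bits;
            }
        }
        return out;
    }

    private static void writeBits(byte[] buf, long start, int value, int bits) {
        for (int i = bits - 1; i >= 0; i--, start++) { // most significant bit first
            if (((value >>> i) & 1) != 0) {
                buf[(int) (start >>> 3)] |= (byte) (0x80 >>> (int) (start & 7));
            }
        }
    }

    public static void main(String[] args) {
        int[][] data = {{1, 2, 3, 4}, {10, 11, 12, 13}}; // two parameters, four events
        byte[] packed = pack(data, new int[] {10, 12});  // 10-bit and 12-bit parameters
        System.out.println(packed.length + " bytes");    // 88 bits -> 11 bytes
    }
}
```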
  • data should be transmitted in column-major form. Computation, permutation, and projection of flow data sets are facilitated if they are manipulated on a column-wise basis. That is to say, the data for each parameter is treated as a homogeneous array of values that may be accessed independently of the other measurements.
  • Row-major order is more natural during data collection and other real-time use but would be less efficient for transmission, storage, and analysis. Column-major order may seem awkward for real-time use, but aside from reasonable buffering (essential in network applications anyway) it does not impose other restrictions or performance penalties on live displays. A column-major layout is sketched below.
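As an illustration of column-major storage, here is a minimal Java sketch in which each parameter is an independent array and a row (one cell's measurements) is assembled on demand for real-time display. The class and method names are invented for this example:

```java
// Sketch only: each parameter occupies its own contiguous array, so
// per-parameter computations and projections touch contiguous memory,
// while a row view can still be built cheaply for live displays.
public class ColumnMajorListMode {
    private final String[] parameterNames;
    private final int[][] columns;   // columns[parameter][cell]

    public ColumnMajorListMode(String[] names, int cells) {
        parameterNames = names;
        columns = new int[names.length][cells];
    }

    public String name(int p) { return parameterNames[p]; }

    public int[] column(int p) { return columns[p]; }  // independent access per parameter

    public int[] row(int cell) {                       // one cell's measurements
        int[] r = new int[columns.length];
        for (int p = 0; p < columns.length; p++) r[p] = columns[p][cell];
        return r;
    }

    public static void main(String[] args) {
        ColumnMajorListMode d = new ColumnMajorListMode(new String[] {"FSC", "SSC", "FL1"}, 1000);
        d.column(0)[0] = 42;                 // write into the first parameter's column
        System.out.println(d.row(0)[0]);     // prints: 42
    }
}
```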
  • the MIME type multipart is designed to transmit a series of MIME content items as a unit. It is fairly simple to implement and widely used, but not in itself secure or absolutely reliable.
  • JARs are a new standard designed for secure and reliable transmission over the Internet. They provide reliable transport and optional compression, with the possibility of digitally signing individual content items or the whole collection.
  • a competing Microsoft technology (CABinets) seems less suited for cytometry use at this point because it is not widely accepted, is largely MS-specific, and is not as freely available. This may not be true for all users and could change.
  • JNDI For access from Java, JNDI provides most of the API necessary to access the annotations.
  • a service provider (which actually carries out JNDI requests) is available for LDAP, and there are experimental implementations based on the host computer's file system or its main memory. Service providers which can look into JAR files and FCS files can and should be developed. Since JNDI allows federated namespaces, it would then be possible to have quite powerful (though not necessarily fast) directory service locally without a true LDAP server; a minimal JNDI access sketch follows. A utility to import an LDIF file into a suitable JNDI directory would also be useful.
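A minimal sketch of reading annotations through JNDI with the standard LDAP service provider follows. The server URL is a placeholder, and the entry name reuses the example distinguished names appearing later in this document:

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.Attributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

// Sketch only: connect to an LDAP server via the JNDI LDAP service provider
// and read the attributes of one annotation entry.
public class JndiAnnotationSketch {
    public static void main(String[] args) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://directory.example.org:389"); // placeholder URL

        DirContext ctx = new InitialDirContext(env);
        Attributes attrs = ctx.getAttributes(
            "protocol=1234, ou=Shared FACS Facility, o=Stanford University");
        System.out.println(attrs.get("objectClass"));
        ctx.close();
    }
}
```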
  • One Dimensional Histogram - table/isac-univariate; Two Dimensional Histogram - table/isac-bivariate; List Mode Data - table/isac-list-mode-data
  • DICOM is arguably the better standard technically. It uses object-oriented design principles and has a well-defined model of the data objects. However, it was developed (by radiologists and equipment manufacturers) in a clinical setting and has a heavy emphasis on interfacing with Picture Archiving and Communications Systems (PACS) and HIS/RIS (Hospital/Radiology Information Systems).
  • the data model is also heavily clinically oriented in design. For example, you can specify the patient's mother's maiden name and their health insurance status, but such concepts as a species, inbred line, cell culture, or sea water sample as the "patient" are not available.
  • the standard does allow for inclusion of flow data in a technically clean way. However, all the existing types are image types of various sorts. It is unlikely that typical DICOM clients will have any knowledge of how to manipulate flow data. Of course, given sufficient motivation (on the part of clinicians) the standard does allow for this in the future. Something of this sort will clearly be necessary if flow cytometry is to become clinically important, because it will then be necessary to interact with HIS.
  • DICOM contains a scheme for generating unique universal identifiers for some of its modeled objects. This allows efficient coding and facilitates consistency by central management of the object model. This makes a great deal of sense when working with the large health care bureaucracy but is unrealistic for basic science, where the models are still being developed and are diverse and fluid.
  • HDF was not object-oriented by design. Some work on suitable object models to encapsulate it has been done for C++ and Java. Nonetheless, the HDF model does allow for a clean representation of the proposed cytometry object model, so the lack of OO principles in HDF itself need not be a barrier to interoperability.
  • this attribute should distinguish the individual. It should be unique at least relative to the protocol, preferably relative to the project or institution.
  • sample source should distinguish the samples from this individual.
  • The following is a draft (November 3, 1998) of a chapter for a 2-volume set entitled Automating 21st Century Science, edited by Richard Lysakowski and colleagues.
  • ELNs must be designed with the recognition that data are only useful when collected and annotated so they can be viewed within the context of the experiment and study in which they were generated.
  • ELNs must incorporate three related functions. First, they must provide simple and reliable ways to electronically define a specific experiment within an overall study, in order to create the context for data collection. Second, they must provide a non-volatile pointer or link between the experiment definition and the data being collected, so that the data can always be interpreted in its appropriate context and the context can always find its data. Finally, they must provide mechanisms for electronically storing findings - analyses and interpretations of data - within the context of the experiment and the overall study.
  • the bench scientist When the bench scientist does an experiment, it is usually part of a larger study aimed at testing a particular theory, developing a particular product or defining the characteristics of a particular process. Often several scientists will collaborate in the study, with one or more being involved in the analysis and interpretation of the study data rather than in the bench work that generated it. The aims of the study dictate the kinds of experiments to be done, the instrumentation to be used and the kinds of data to be collected. The bench scientist translates this into a series of experiments, the details for each being recorded initially as a plan of action often referred to as the experiment protocol and the data for each being recorded and interpreted in the context of the information in the protocol.
  • Protocols for experiments specify the samples and reagents that will be put in the test tubes, the planned incubation time and conditions, the specific instruments that will be used for data collection and any instrumentation settings unique to the experiment. In addition, they contain information recorded to enable data interpretation, including the relationship of the experiment to the overall study, the origin(s) of samples, the origin(s) of reagents, and notes concerning any anomalies that occurred during sample addition or incubation.
  • experiment protocols are constructed and entered into the scientist's paper notebook before the experiment begins. They are usually displayed on the bench as the test-tube additions are made and are brought along during data collection for final annotation concerning instrumentation conditions and data collection anomalies.
  • the protocol is sometimes used as a template in which data read from instrument dials is directly recorded in association with the protocol information for the sample.
  • This simple system, the cradle from which contemporary laboratory notebook practice developed, is ideal in that it juxtaposes protocol information and experiment data.
  • Although labor intensive, it maximally facilitates interpretation of the data in the context of the experiment in which it was collected.
  • Protocols still tend to be entered into paper notebooks, but sample and subject descriptions are often in files or electronic spreadsheets.
  • Most data acquisition instruments are supplied with digital output systems, but these usually interface to dedicated computers that are often alien to the scientists.
  • Although database and file management systems abound, mastering their intricacies is beyond what most bench scientists are willing (or able) to attempt.
  • file naming, file transfer and file organization fall to the scientists, who eke out their living in an electronic Tower of Babel. Is it any wonder that they often find it easier to print everything and paste (or scan) it into the notebook than to wrestle with bringing the relevant information together on line?
  • the basic ELN unit is the Data Collection Session (DCS), during which a particular instrument is used to collect data from samples treated according to a particular protocol.
  • DCS Data Collection Session
  • Studies typically consist of one or more experiments, the goals for each being defined by the overall design for the study.
  • data collected at the experiment level must be appropriately annotated with information about the samples and treatments in the study just as data collected in each DCS must be annotated with information about sample treatment, instrumentation, etc. Therefore, to be useful, the ELN must provide the mechanisms for annotation and integration of information and data at all levels in the study.
  • the information flow for a single DCS in a multi-experiment study can be visualized as a descent and subsequent ascent through a series of levels, each of which is responsible for handling certain protocol or study information.
  • each level acquires and retains specific information, e.g., overall protocol for the DCS, individual sample and reagent descriptions, instrumentation set up, etc.
  • data is collected by the instrument.
  • the information "retained" at each level is successively joined to the data set so that it can ultimately be interpreted and integrated at the study level.
  • the experiment level provides for entry of protocol data, collection and storage of the data, permanent association of the protocol information and the collected data, long-term data storage, ready retrieval of stored data, specialized computation and display algorithms and, most important, specification of computations and display of computed data in the context of the initially-entered protocol information, i.e., with graph axes and table column heads automatically assigned on the basis of reagents used for the sample for which data is being displayed.
  • FACS Fluorescence-Activated Cell Sorters and analyzers
  • Fulwyler devised a method for introducing particles into the stream so they would be individually encapsulated in droplets.
  • the analytic capabilities of the FACS became progressively more important as functional subsets became well characterized and knowledge about individual lymphocyte (and other cell) subsets increased. While sorting and testing the functions of newly-recognized subsets is still a major part of FACS work, the use of the FACS analytic capabilities to determine subset representation in patients with HIV or other diseases, in experimental animals undergoing various treatments, or in cultures of genetically or physiologically modified cells now occupies center stage in most laboratories. Thus, the need for methods to facilitate the storage, retrieval, processing and display of FACS data has grown steadily as the technology has become more widespread.
  • FACS instruments, the data they generate, and the software that processes it
  • FACS instruments measure cell-associated fluorescence and light scatter for individual cells passing single file, in a laminar flow stream, past a set of light detectors.
  • the cell-associated fluorescence is commonly due to "staining" (incubation) with fluorochrome-coupled reagents (monoclonal antibodies or other proteins) that bind specifically to molecules on or in cells. Alternatively, it can be generated by staining with fluorogenic reagents that enter cells and either are, or become, fluorescent as the result of internal enzymatic or chemical reactions.
  • the light scatter measurements provide an index of the size and granularity of the cell. At present, up to 5,000 cells can be analyzed per second.
  • each cell passes the detectors, it is illuminated by lasers and emits and scatters light.
  • the detectors are set to measure the light emitted at particular wavelengths or scattered at particular angles.
  • the signals generated in each of the detectors are processed, digitized, and joined to create the set of measurements that are recorded individually for each cell by the data collection system.
  • This "list mode" data recording can be thought of as a two-dimensional table in which the number of columns are defined by the number of parameters measured (fluorescence colors and light scatters) and the number of rows are defined by the number of cells for which data was taken (specified by the FACS user).
  • Modern commercial FACS software includes many of the innovative data processing, gating, and display strategies originally demonstrated in the PDP-11 software. However, it also maintains the PDP-11 single-user (rather than time-share) approach and provides very little data management capability, leaving the protocol entry, data storage, gate storage, processed data storage, archiving, and data retrieval largely to the biologist. The lack of significant third-party support for these crucial operations over the years has unfortunately left most biologists bereft of the computer-accessible legacy of FACS data and information that could have been built from their work.
  • biologists commonly expect that entry, storage and management of extensive annotation information will force them to waste expensive, often limited, time at an instrument that sits idle while they "diddle" with the computer. Further, they are not inclined to waste precious time learning how to enter annotation data and extract the information they need at a later time.
  • biologists communicate poorly with software developers and tend to be cooperative only when they truly believe that the system being built will make their work easier and more productive. Basically, this means that successful ELN development requires that developers recognize and remove bottlenecks that biologists may not even recognize are interfering with work. Once this "magic" is accomplished, the product will become an integral part of the biologists' tool kit, and life without it will be unimaginable.
  • FACS/Desk a well-used ELN prototype
  • Each user communicates with the FACS/Desk system through a personal, password-protected "Desk" assigned when the user enters the system.
  • the non-procedural user interface that Moore introduced for this communication foreshadows today's "point and click" GUIs.
  • the Desk displays an alphabetical list of the protocols and experiments already created by the user. Simple keystrokes allow the user to add new protocols, to collect new data, or to analyze data that has already been collected.
  • the common FACS/Desk archive, also accessible from the personal Desk, provides a repository for retrievable experiments that users no longer wish to keep on individual Desks.
  • FACS/Desk is built with a protocol editor that prompts users to enter descriptive experimental data (e.g., sample names, reagents, and fluorescence reporter groups). Protocols are created prior to initiating data collection. Data collection is controlled through a second GUI, generated from the experiment protocol, that enables the user to access annotation information, to determine the number of cells for which to collect data, and to initiate data collection for each sample.
  • the collection GUI also signals the permanent association of the annotation information with the list mode data once collection terminates.
  • FACS/Desk stores annotation information and list mode data in separate, pointer-linked files so that sample and reagent descriptions can be maintained on line when the data is stored to tape. This information, available through the individual user Desks, is used to locate and retrieve stored data. In addition, it is available through the FACS/Desk analysis GUI, where it is used to specify analyses and to label analysis output, e.g., axes in graphs (plots) and columns in tables during data analysis.
  • the FACS/Desk analysis package takes advantage of the client/server architecture and enables users to specify a set of analyses and submit them for batch processing. The user is then free to specify and submit more analyses or to terminate the FACS/Desk session. Submitted analyses are queued and processed in the order they are received. Results of the analyses are returned to the submitting user's desk and stored permanently in association with the experiment. In addition, results are sent to the print queue if printing was specified. Minutes, months or years later, the user can re-open his or her desk to view results, submit additional analyses, call for additional printing, etc.
  • the user's Desk within the overall FACS/Desk system provides the elements essential to an ELN.
  • FACS/Desk innovations include, e.g., extension of data collection and analysis capacity to up to 16 parameters; advanced instrument calibration and standardization, fluorescence compensation and data collection capabilities that make the archived data comparable between, as well as within, FACS runs; network access for analysis of FACS/Desk data; and Macintosh-based access for data analysis and display.
  • Although FACS/Desk is an antique by some standards, it is still running at Stanford and several other sites and will continue to do so until, as indicated above, all of its current features can be replaced with modern equivalents.
  • FlowJo operates best in conjunction with FACS/Desk, since it lacks an independent data annotation and collection system. However, it is much in demand outside our laboratory because its data handling features are markedly better than those provided by current commercial systems. Thus, it has been fitted with a mechanism for reading data acquired by commercial FACS instruments and is now distributed by TreeStar Software and Becton-Dickinson Immune Systems.
  • the use of Fluorescence-Activated Cell Sorters in research and medicine continues to expand as new applications are developed and older applications become standard practice. To meet the challenges generated by this expansion, we have already begun using recently released Internet tools to create a "FACS Data Web" intended to facilitate collection, analysis and interpretation of data from FACS studies and to enable integration of that data with relevant information acquired with other methodologies. In essence, this system will create an ELN centered on FACS data but potentially extensible to most biomedical experimentation.
  • the experiment planning modules will utilize semantic models to link experiments to data sources and other information relevant to protocol design, experiment execution, and subsequent data analysis, e.g., previous FACS data; reagent information; patient, animal, or cell line databases; and, clinical laboratory and medical record data from a clinical trial.
  • the data entry and collection modules will enable standardization, storage and archiving of FACS data annotated with the protocol and execution information necessary for retrieving it and for specifying, displaying, and permanently recording analysis results.
  • the data analysis and visualization modules will include novel statistical approaches to data visualization and visualization capabilities utilizing graphics browser facilities, e.g., Computer Graphics Metafile (CGM) and Virtual Reality Modeling Language (VRML).
  • CGM Computer Graphics Metafile
  • VRML Virtual Reality Modeling Language
  • the DataWeb software focuses on providing an automated solution for the storage of protocol information and its use in data interpretation.
  • FACS/Desk, our current system, has already implemented and proven the utility of providing a protocol editor through which a modicum of basic information can be entered to help manage and interpret the voluminous data collected in FACS experiments.
  • the DataWeb extends this system to include semantic models that enable entry and use of protocol information for the collection, archiving, display, and interpretation of FACS data, and in the association of FACS data with Web-accessible information from other sources.
  • the DataWeb is designed as a distributed system that can take advantage of the potential inherent in collecting, storing, retrieving and analyzing data via the Internet.
  • This Directory Service approach, which provides fine-grained access control and enables use of locally-controlled data servers that can be federated to provide global access, effectively removes many of the disadvantages of storing data and metadata in relational databases.
  • ELN design in the 21st century.
  • Directory Service as defined by the ISO X.500 and IETF LDAP standards, is rapidly becoming an essential infrastructure component of corporate and governmental intranets as well as the wider Internet.
  • LDAP implementations are quite competent databases in their own right and can be exploited for many other purposes. This technology may be particularly useful for information storage and exchange in the biological and medical sciences and in other areas that similarly deal with very large name spaces (i.e., many discrete named elements) that are difficult to serve with current approaches.
  • FACS Flow Cytometry
  • Directory Service advantages include 1) global service capable of providing the same information to everyone in the world; 2) fine-grained access control; 3) federated servers that need not be located within a single organization; and 4) compatible client software that is widely available and runs on "lightweight clients" (e.g., PCs and Macs).
  • individual Directory Services can be maintained by each Human Genome Project Group as part of a National Federation of Directory Services. These Directories could be replicated and redistributed by network resources located at the National Laboratories and/or the National Super Computing Centers to facilitate internetwork access by the general scientific community. Alternately, Internet 2 resources could act as the National Directories of Directories.
  • Directory Services can integrate readily with relational databases, object databases and other data sources, they offer the potential for developing a "knowledge portal" capable of rapidly directing users to data that might otherwise be difficult to find. Further, because Directory Services can be federated, they provide an infrastructure that can be locally maintained and globally accessed.
  • Appendix A includes a paper by Wayne Moore that presents a technical view of the overall principles underlying the Directory Service (LDAP) approach we propose. Moore's paper illustrates this approach with examples from Flow Cytometry; however, he has also developed tables with "distinguished name" specifications for LDAP servers illustrating how Human Genome information and information about scientific publications can be served (see Appendix A).
  • LDAP Directory Service
  • PubMed can be viewed as a Directory of Directories, i. e., a Directory of Journal Directories. If the National Library of Medicine (NLM) supported a "Recommended Journal Directory Standard," PubMed could be replaced with a Federated Scientific Journal Directory in which publishers independently ran local Directory Services that would be accessed through the PubMed Federated Directory.
  • NLM National Library of Medicine
  • the NLM is best suited to maintain this central directory service, both because it is already established in this role and because the search and naming mechanisms it has developed (MeSH and scientific and medical thesauri) can be readily incorporated into the directory schema. Citation indexing is also easily incorporated into the Directory Structure.
  • the schema developed for these directory services can provide the basis for defining XML name spaces and DTDs.
  • LDAP LDAP supports fine-grained access control
  • each publisher participating in the PubMed Federation would be able to assign access privileges (to titles, authors, abstracts, etc.) as desired.
  • publishers could allow search access to any or all information but require subscriptions to see commercially valuable material (e.g., entire published manuscripts).
  • a National Directory of Engineers could be created to provide authentication for access to specific information levels or specific directories.
  • the San Diego Super Computing Center, which is supported by the National Science Foundation (NSF), has recently agreed to house the entire FACS Digital Library Archive described in our proposal. The Center will not provide funding for our development work but will house our archive and will provide consultation necessary for its establishment.
  • the objectClass scientificPublication should have optional multi-valued attributes reference and citation, which are distinguished names.
  • When the publisher establishes the record, they will fill in the reference with the dn of another scientificPublication which this one references.
  • An indexing service would buy the rights to replicate the raw data, update the citations in its copy when new data appeared, and then serve the result as "value added" to its customers. A sketch of creating such an entry follows.
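A minimal sketch of creating such an entry through JNDI follows. The objectClass and the multi-valued reference attribute are those proposed above; the server URL, the distinguished names, and the article naming attribute are invented placeholders:

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.BasicAttribute;
import javax.naming.directory.BasicAttributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

// Sketch only: create a scientificPublication entry whose multi-valued
// reference attribute holds the distinguished names of the works it cites.
public class PublicationEntrySketch {
    public static void main(String[] args) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://journal-directory.example.org:389"); // placeholder
        DirContext ctx = new InitialDirContext(env);

        BasicAttributes attrs = new BasicAttributes();
        attrs.put("objectClass", "scientificPublication");
        BasicAttribute refs = new BasicAttribute("reference");  // multi-valued dn attribute
        refs.add("article=123, journal=Cytometry, o=Publisher A");  // placeholder dns
        refs.add("article=456, journal=Genomics, o=Publisher B");
        attrs.put(refs);

        ctx.createSubcontext("article=789, journal=Cytometry, o=Publisher A", attrs);
        ctx.close();
    }
}
```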

Abstract

A directory access protocol is used as a means to both uniquely identify materials and to store within the directory itself data related to the materials (Fig. 1). The directory access protocol is used not just for names (101), but for data as well (100). The invention utilizes the directory access protocol as the basis for XML name spaces for scientific data to facilitate data interchange and viewing using directory services and protocols. An advantage of using a directory access protocol is the ease of search. Directories are designed to be searched quickly and efficiently, even when containing a large population of entries. A specialized set of standard types and standard objects are defined to extend the use of the directory to other fields, and in one embodiment, to the field of biological data. An advantage of the system is the ability to identify samples of biological material and all users associated with the biological material.

Description

DIRECTORY PROTOCOL BASED DATA STORAGE
BACKGROUND OF THE INVENTION
1. FIELD OF THE INVENTION
This invention relates to the field of data storage and data storage file systems.
Portions of the disclosure of this patent document contain material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office file or records, but otherwise reserves all copyright rights whatsoever.
2. BACKGROUND ART
Current database systems do not operate efficiently when the database becomes large. Current databases also lack the flexibility to be changed to allow the addition of new types of data to a database entry. Problems with existing database systems can be understood by first reviewing database systems.
When data is collected and stored on a computer in an organized manner, that collection of data is called a database. In an effort to make the data stored in databases easily retrievable, databases are organized according to a predetermined structure. Unfortunately, once the underlying structure of the database is implemented, the process of changing it is cumbersome. To add new relationships to a database, the structure of the entire database must often be redefined. As a result, current database models inherently lack flexibility.
Database Organization
Databases are organized according to a data model that specifies the organizational structure of the database. A variety of different data models exist and each organizes data in a different manner. Examples of data models include the relational model, the object oriented model and the physical data model.
Once a data model is chosen the overall design of the database is implemented using that model. This overall design of the database is often referred to as the database schema and is defined by using a special language called a data definition language (DDL).
A database may contain one or more tables that are defined in a file called the data dictionary. Tables help keep the data in the database organized. Figure 1 illustrates a table 100 that contains information about customers. Each table is designed to store a collection of data and is comprised of a number of rows 101- 107. A row is separated into one or more columns 120-124 and each column has an associated name 140 and is designated to receive values. When data is placed into the table 100 it is placed in the appropriate column 120-124. For example, values 130-135 represent a series of customer identification numbers. These values are placed in column 120. A record typically refers to a row that contains an item of data in one or more of the columns of the row. Each table may hold numerous records. When a row 101-107 is filled with data it typically represents a unique set of data. For example, if data were placed in columns 120-124 of row 101 that data is representative of the customer that has the customer identification number 130.
A disadvantage of the way database tables are organized is that the organizational schema is predetermined and fixed. As a result, current databases lack a flexible structure. For example, if a person using table 100 wanted to begin collecting other kinds of addressing information about a customer, such as the customer's work address or electronic mail address, a new column 206 to hold that information is required and must be defined. To define a new column, a new table 200 that has an additional column 206 is created. Thus an inherent disadvantage of current database systems is that the user is locked into collecting the kind of information the table is pre-defined to hold. Table 100, for example, can only hold information pertaining to a customer's identification number, a customer's name, a customer's address, a customer's phone number, and a customer's fax number. To enter any other kind of information in Table 100, a new column must be defined.
Another disadvantage of current database systems is that every field in a table is assigned a value even if one does not exist. Referring now to Table 200 in Figure 1, if data is entered into one of the columns in row 102, data must also be entered into all the remaining columns. When no real information exists to input into a column, some other value, such as a NULL value or zero, is entered instead. For example, if the value "Bob" is placed in column 121 of row 102 and the value "14 Main St" is placed in column 122 of row 102, the remaining columns in row 102 are assigned NULL values. Since values are assigned to every row in column 120, the remaining values of each row are filled with NULL values. This occurs regardless of whether additional information is actually entered into Table 200. Once a row is filled with one piece of data, the remaining entries for that row are filled with some value. Placing values inside a table even when one is not supplied wastes memory and computing resources.
Data stored in a column (or columns) of a table can form the basis for a relationship between that table and another table in the database having a related column (or columns). For example, the customer table could be related to a customer orders table if the customer table contains a series of records having fields with the names "customer identification", "last name", "first name", "street address", "city", and "zip code" and the customer orders table has fields with the names "customer identification", "service provided", and "date service rendered." Since both of these tables share a field with the name "customer identification", the tables are both related to the same customer. Using a relationship between columns of two tables, it is possible to join the two tables to provide a single table of information that contains instances of rows from one table combined with related rows from the other table.
Tables may be related via one-to-one, one-to-many, many-to-one, or many-to-many relationships. In a one-to-one relationship, one row in one table is related to a single row in a second table and vice versa. For example, a row in an employee table that contains information about an employee relates to a salaries table that contains the employee's salary information. Since an employee typically earns only a single salary, there is a one-to-one relationship between an employee's employee table record and the employee's salary table record.
In a one-to-many relationship, a row in one table may be related to many rows in a second table, but each row in the second table matches only one row in the first table. For example, a state table that contains a state identifier and a state name can be related to multiple rows in the employee table. However, a row in the employees table identifies only one state of residence, for example. Conversely, a many-to-one relationship exists where many rows in one table match only one row in a second table, but each row in the second table may match many rows in the first table. To relate two tables, it is necessary to identify one or more columns that are common to both tables. These columns are typically referred to as keys. A primary key is a unique key within a table and uniquely identifies a row within the table. A foreign key in a second table is comprised of the column(s) containing a first table's primary key information. For example, in the employee table, an employee identifier (employeelD) can be assigned to uniquely identify each employee. The employeelD can be used as a primary key for the employees table. The employeelD can also be used as a foreign key in the salaries table. The employees and salaries tables can be joined by the employeelD columns in each table to have information from both tables available in a single record.
Applications are developed to provide a user with the ability to facilitate access and manipulation of the data contained in a DBMS. A DBMS includes a Data Manipulation Language (DML) such as Structured Query Language (SQL). A DML provides set-oriented relational operations for manipulating data in the DBMS. However, a DML requires a precise syntax that must be used to access and manipulate DBMS data. To use a DML, a user must understand and use the DML's syntax. Instead of requiring each user that wishes to modify a DBMS' data to learn the DML's syntax, applications are written that provide an interface between the user and a DBMS' DML.
Problems with current databases
There are a number of problems associated with current databases. These include the need for applications developers to know the structure of the database before generating applications that access the database, the limited ability to add new types of data to a database without changing the structure of the database, and lack of flexibility in searching and maintenance. In addition, although current databases may provide rules for uniqueness of entries within the database, there is no adequate scheme for ensuring global uniqueness of entries between databases.
Lack of uniqueness
In some cases, it would be desirable to be able to store a data item and have it be unique, or have it uniquely represent a unique real-world counterpart. For example, in the biological sciences, systematic nomenclature of materials and samples has been an essential tool. However, the method by which formal names are adopted has been through professional meetings or governmental bodies and has not changed for centuries. Unwanted delays are introduced by the need to wait for naming conventions to be established. Currently, there is no effective way to provide temporary or permanent unique names for biological samples without meetings or governmental involvement. In addition, should a name be changed after its use has begun, there is no effective way to automatically cross-reference new and old names.
Thus a problem with prior art database schemes is the inflexibility of data formats and the lack of uniqueness in database tables.
SUMMARY OF THE INVENTION
The present invention utilizes a directory access protocol as a means to both uniquely identify materials and to store within the directory itself data related to the materials. The invention utilizes a directory access protocol not just for names, but for data as well. The invention also utilizes the directory access protocol as the basis for XML name spaces for scientific data (e.g. genome and biological data sets) to facilitate data interchange and viewing using directory services and protocols. An advantage of using a directory access protocol is the ease of search. Directories are designed to be searched quickly and efficiently, even when containing a large population of entries. A specialized set of standard types and standard objects are defined to extend the use of the directory to other fields, and in one embodiment, to the field of biological data. An advantage of the system is the ability to identify samples of biological material and all users associated with the biological material. For example, if multiple users purchased a particular biological sample from a company, and the company later wanted to contact all purchasers of that particular batch of that particular sample, it would be possible to do so using the directory protocol driven scheme of the present invention.
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is an example of a database structure.
Figure 2 is a tree structure of a standard LDAP directory.
Figure 3 is a tree structure of a directory with extensions of an embodiment of the present invention.
Figure 4 is a block diagram of a general computer system for implementing the present invention.
DETAILED DESCRIPTION OF THE INVENTION
In the following description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. It will be apparent, however, to one skilled in the art, that the present invention may be practiced without these specific details. In other instances, well-known features have not been described in detail in order not to unnecessarily obscure the present invention.
The present invention takes advantage of directory addressing protocols to store data instead of directory information. The invention results in the ability to uniquely identify data. In addition, the invention provides flexibility in storing data and associated parameters. One embodiment of the invention is used to store biological data such as flow cytometry data.
Biological Sample Data
Although the present invention has equal application to the storage of any data type, one embodiment relates to the storage of data associated with a biological sample. In particular, the storage of flow cytometry data.
Flow cytometry is a technique for obtaining information about cells and cellular processes that operates by allowing a thin stream of a single cell suspension to flow through one or more laser beams and measuring the resulting light scatter and emitted fluorescence. It is a widely applicable technique and is widely used in basic and clinical science, especially immunology. Its importance is increased by the fact that it is also possible to sort fluorescent labeled live cells for functional studies with an instrument called the Fluorescence Activated Cell Sorter (FACS). Several thousand medical and biological laboratories at locations throughout the world currently use flow cytometry instruments to count or study the properties of different types of cells coresident in blood or other organs.
Flow cytometry has always been computerized because without computers the data analysis would be infeasible. As flow cytometry has matured, the importance of combining flow data with data from other sources has become clear, as has the need for multi site collaborations, particularly for clinical research. This leads to the need to develop methods for naming or identifying flow cytometry samples, reagents and instruments (among other things) and in maintaining a shared repository of information about the samples.
Flow cytometry was revolutionized in the late 1970s with the introduction of monoclonal antibodies that could be coupled to a fluorochrome and used as FACS reagents. However, nomenclature for these reagents has been inconsistent, in spite of the fact that monoclonals are useful precisely because they can be uniquely and accurately named, i.e., the antibody produced by a clone is always the same whereas naturally produced sera are highly variable.
Directory Protocol
The present invention takes advantage of directory access protocols and systems to provide a manner of uniquely identifying biological samples such as flow cytometry data. One directory protocol used in an embodiment of the invention is the Lightweight Directory Access Protocol (LDAP). LDAP is a software protocol for enabling the location of organizations, individuals, and other resources such as files and devices in a network, whether on the Internet or on a corporate intranet. LDAP is a "lightweight" (smaller amount of code) version of DAP (Directory Access Protocol), which is part of X.500, a standard for directory services in a network.
In a network, a directory tells you where in the network something is located. On TCP/IP networks (including the Internet), the Domain Name System (DNS) is the directory system used to relate the domain name to a specific network address (a unique location on the network). However, sometimes the domain name is not known. In that case, LDAP makes it possible to search for an individual without knowing the domain.
One example of an LDAP directory is organized in a simple "tree" hierarchy consisting of the following levels:
1. The "root" directory (the starting place or the source of the tree), which branches out to
2. Countries, each of which branches out to
3. Organizations, which branch out to
4. Organizational units (divisions, departments, and so forth), which branch out to (include an entry for)
5. Individuals (which includes people, files, and shared resources such as printers)
This example tree structure of an LDAP directory is illustrated in Figure 2. The parent node of the tree is the root node 201. The children of the root directory are country nodes 202.1 and 202.2. Each country node can have child organization nodes such as organization nodes 203.1 and 203.2 (children of country node 202.2).
Below the organization level are organization group nodes such as nodes 204.1, 204.2, and 204.3, which are children of organization node 203.2. Each group can have children nodes representing individuals, such as group node 204.3 having children nodes 205.1, 205.2, and 205.3.
An LDAP directory can be distributed among many servers. Each server can have a replicated version of the total directory that is synchronized periodically. An LDAP server is called a Directory System Agent (DSA). An LDAP server that receives a request from a user takes responsibility for the request, passing it to other DSAs as necessary, but ensuring a single coordinated response for the user.
The present invention contemplates extensions and modifications to LDAP protocols to make them usable not just as directories, but to also provide data itself. The present invention takes advantage of hierarchical levels of LDAP already established by the International Standards Organization (ISO) and uses those organizations to provide a first level of uniqueness to the biological sample to be named.
Embodiment of Invention
Using LDAP, objects such as monoclonal antibodies can be named relative to the unique distinguished name of an investigator or organization. That means that unique identifiers can be assigned to biological materials early in the scientific process, facilitating professional communication, both informal and published. In the future, investigators who have this distinguished name can identify the material unambiguously via the unique name. If a directory service is maintained, an investigator can determine if the sample has been given an official name, if it has been shown to be equivalent to another entity, or if it has been cited in the literature.
This embodiment of the invention provides definitions and attributes that can be used to define biological samples. The invention takes advantage of three parts of LDAP: the information model, the functional model, and the namespace.
The information model defines entries, which have a set of named attributes that can have one or more values and may be absent. The ability to have absent attributes solves the problem of databases that require an entry in every field. The invention can provide attributes that may only be rarely used with no worry about adding to overhead. Each attribute has a name and a type, and each type has a name and a syntax which is expressed in Abstract Syntax Notation One (ASN.1). By default the types case exact string, case ignore string, telephone number, integer, distinguished name, and binary are recognized. Every entry must have an attribute objectClass, which defines what attributes are possible and which are required, and may have an attribute aci (for access control information) which the server uses to control access to the entry. Object classes are hierarchical, i.e., a class can inherit attributes from a parent class and, by defining new attributes, extend its scope.
The entries in a directory are organized hierarchically. That is to say that any entry may have one or more subentries, so that the whole structure may be visualized as a tree. At every node each subentry is identified by a value of one of its attributes, called a relative distinguished name (rdn), which must be unique within its level, for example "uid=wmoore". A distinguished name of a subentry is defined by concatenating its rdn with the dn of its parent entry, which is likely itself to be a compound name, for example "uid=wmoore, ou=Shared FACS Facility, o=Stanford University".
The functional model defines a set of operations which may be applied to a directory: read, list, search, add, modify, and delete, plus bind, unbind, and abandon, which are used to establish the user's credentials, end a connection to the server, and cancel a running query, respectively.
The search function starts from a root dn and finds all entities further down in the hierarchy which pass a search filter constructed from a group of tests (equal, less than, contains, sounds like, etc.) applied to the attributes of the entity. A search filter may test the objectClass attribute and return only entries of a particular type. Clients can specify searches which return all the attributes of each entry or only a selected set of attributes. A minimal search sketch is given below.
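The following minimal Java (JNDI) sketch illustrates such a search: it starts from a root dn, applies a filter that tests objectClass, and requests only a selected set of attributes. The server URL is a placeholder, the objectClass name monoclonalAntibody is invented for this example, and the clone and specificity attributes are the ones proposed elsewhere in this description:

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

// Sketch only: subtree search from a root dn with an objectClass filter,
// returning just two attributes of each matching entry.
public class DirectorySearchSketch {
    public static void main(String[] args) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://directory.example.org:389"); // placeholder

        DirContext ctx = new InitialDirContext(env);
        SearchControls controls = new SearchControls();
        controls.setSearchScope(SearchControls.SUBTREE_SCOPE);
        controls.setReturningAttributes(new String[] {"clone", "specificity"});

        NamingEnumeration<SearchResult> results = ctx.search(
            "ou=Shared FACS Facility, o=Stanford University",       // root dn
            "(&(objectClass=monoclonalAntibody)(specificity=CD23))", // filter
            controls);
        while (results.hasMore()) {
            System.out.println(results.next().getNameInNamespace());
        }
        ctx.close();
    }
}
```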
Distinguished Names
A distinguished name is a comma-separated list of attribute-value pairs and is read from right to left (usually represented as a string). If a value contains special characters, such as commas, it must be quoted, and in any case initial and final white space around attributes or values is ignored. For example, "cn=Wayne Moore, ou=Genetics Department, o=Stanford University".
Location names have as their root (rightmost) component the countryName or c attribute, with the value being one of the ISO standard two-letter country codes, for example c=US. Such names can be further restricted by specifying a stateOrProvinceName, abbreviated st, and a locality, abbreviated l, for example "l=San Francisco, st=California, c=US". Organizational names have as their root the name (registered with ISO) of a recognized organization and may be further qualified with one or more organizational units, for example "ou=Department of Genetics, ou=School of Medicine, o=Stanford University".
Domain names as used by the Domain Name Service (DNS) are represented with the dc attribute, for example, "dc=Darwin, dc=Stanford, dc=EDU".
Names of persons. There are two conventions for naming people. The older uses the commonName or cn attribute of the Person objectClass, but these are not necessarily unique. Some directories use the userId or uid attribute of inetOrgPerson, which is unique. Since uniqueness is important for scientific applications, the latter will be used. The remainder of a person's dn is usually either an organizational or geographic name, for example "uid=wmoore, o=Stanford University" or "cn=Wayne Moore, l=San Francisco, st=California, c=US". A sketch of parsing such names follows.
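For taking such names apart in Java, the standard javax.naming.ldap.LdapName class can be used; it parses the comma-separated attribute-value pairs and handles quoting of special characters. A minimal sketch (the dn reuses an example from this document):

```java
import javax.naming.InvalidNameException;
import javax.naming.ldap.LdapName;
import javax.naming.ldap.Rdn;

// Sketch only: parse a distinguished name into its relative distinguished
// names. Note that getRdns() lists components starting from the root
// (rightmost) component, here o=Stanford University.
public class DnSketch {
    public static void main(String[] args) throws InvalidNameException {
        LdapName dn = new LdapName("uid=wmoore, ou=Shared FACS Facility, o=Stanford University");
        for (Rdn rdn : dn.getRdns()) {
            System.out.println(rdn.getType() + " = " + rdn.getValue());
        }
    }
}
```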
Naming Extensions
The following examples illustrate extensions that could apply to flow cytometry data in one embodiment of the present invention.
Gene loci, for example, "locus=Igh-1, o=Professional Society" or "locus=New, cn=Leonard Herzenberg, ou=Department of Genetics, ou=School of Medicine, o=Stanford University".
Gene alleles, for example, "allele=a, locus=Igh-1, o=Professional Society" or "allele=1, locus=127, ou=Department of Genetics, o=Stanford University". CD antigens, for example, "specificity=CD23, o=Human Leukocyte Differentiation Workshop".
New nomenclature schema
The following schemas arose from work on storing information about flow cytometry data in directories.
Monoclonal antibodies are distinguished by cloneName or clone which is unique within the parent entity which must be an investigator or organization.
Lymphocyte differentiation antigens, a thesaurus of the target specificities of monoclonal antibodies. Would include but not be limited to the official CD names.
FACS instruments are distinguished by the cytometer attribute which must be unique with respect to the organization parent, for example, "cytometer=Flasher II, ou=Shared FACS Facility, o=Stanford University".
FACS experiments are distinguished by the protocolIdentifier or protocol attribute, which must be unique with respect to the parent, which may be a person, an instrument, or an organization, or some combination, e.g., "protocol=1234, cytometer=Flasher, uid=Moore, ou=Shared FACS Facility, o=Stanford University".
FACS samples are distinguished by a protocolCoordinate, which must be unique within the parent FACS experiment, e.g., "coord=A12a, protocol=12345, cytometer=Mollusk, ou=Shared FACS Facility, o=Stanford University".
Biological Object Schema
LDAP and X.500 define a set of standard types and standard objects, mostly for describing persons and documents, that are more suitable for business than scientific use. However, the present invention contemplates types added for scientific use, particularly real numbers and possibly dimensional units, so that scientifically relevant information could be conveniently stored in and accessed from directories. The following are example sets of objects for the field of flow cytometry.
Table 1: Scientific Investigator
Table 2: Scientific Instrument
Table 4: Monoclonal antibodies
Table 6: FACS experiments
Table 7: FACS sample
(The contents of these tables appear only as images in the original document.)
Figure 3 illustrates the extension of the LDAP tree structure with the object extensions identified above in Tables 1 through 7. Table 1, the scientific investigator, can be at the individual level of the tree, such as individual 205.1. The scientific instrument used by the investigator can be identified in a child node 206. A publication associated with the work or experiment is at node 207. This node may be empty if the work is not published immediately. The use of LDAP permits the system to include an object that may be absent without the need for filling it with null values.
For flow cytometry specific items, a monoclonal antibody node 208 is defined, along with nodes 209, 210, and 211, corresponding to FACS instrument, FACS experiment, and FACS sample respectively.
The invention also utilizes the directory access protocol as the basis for XML name spaces for scientific data (e.g. genome and biological data sets) to facilitate data interchange and viewing using directory services and protocols. XML (extensible markup language) is a language used to describe information, or more accurately, to make information self-describing. Traditionally, web pages are built using HTML. HTML (hypertext markup language) describes the geometry and appearance of a page of data, in effect creating holes or slots in which data is inserted. However, there is no direct communication of the data that appears on the page in the HTML description. A user might be presented with a page that includes recognizable information, such as name, address, and phone number. But to HTML, the data is simply text to display.
XML, on the other hand, provides a protocol where the type of data being used can be identified. XML can do this in part using predefined "schemas" that can be used to understand the type of data being transmitted. If a standard schema is used, the data need only include a reference to the schema, which need not travel with the data. If a custom schema is used, it can be sent before or after the data, or explicit directions to the location of the schema can be provided.
Embodiment of Computer Execution Environment (Hardware)
An embodiment of the invention can be implemented as computer software in the form of computer readable code executed on a general purpose computer such as computer 400 illustrated in Figure 4, or in the form of bytecode class files running on such a computer. A keyboard 410 and mouse 411 are coupled to a bi-directional system bus 418. The keyboard and mouse are for introducing user input to the computer system and communicating that user input to processor 413. Other suitable input devices may be used in addition to, or in place of, the mouse 411 and keyboard 410. I/O (input/output) unit 419 coupled to bi-directional system bus 418 represents such I/O elements as a printer, A/V (audio/video) I/O, etc.
Computer 400 includes a video memory 414, main memory 415 and mass storage 412, all coupled to bi-directional system bus 418 along with keyboard 410, mouse 411 and processor 413. The mass storage 412 may include both fixed and removable media, such as magnetic, optical or magneto-optical storage systems or any other available mass storage technology. Bus 418 may contain, for example, thirty-two address lines for addressing video memory 414 or main memory 415. The system bus 418 also includes, for example, a 32-bit data bus for transferring data between and among the components, such as processor 413, main memory 415, video memory 414 and mass storage 412. Alternatively, multiplexed data/address lines may be used instead of separate data and address lines.
In one embodiment of the invention, the processor 413 is a microprocessor manufactured by Motorola, such as the 680X0 processor or a microprocessor manufactured by Intel, such as the 80X86, or Pentium processor, or a SPARC microprocessor from Sun Microsystems, Inc. However, any other suitable microprocessor or microcomputer may be utilized. Main memory 415 is comprised of dynamic random access memory (DRAM). Video memory 414 is a dual-ported video random access memory. One port of the video memory 414 is coupled to video amplifier 416. The video amplifier 416 is used to drive the cathode ray tube (CRT) raster monitor 417. Video amplifier 416 is well known in the art and may be implemented by any suitable apparatus. This circuitry converts pixel data stored in video memory 414 to a raster signal suitable for use by monitor 417. Monitor 417 is a type of monitor suitable for displaying graphic images.
Computer 400 may also include a communication interface 420 coupled to bus 418. Communication interface 420 provides a two-way data communication coupling via a network link 421 to a local network 422. For example, if communication interface 420 is an integrated services digital network (ISDN) card or a modem, communication interface 420 provides a data communication connection to the corresponding type of telephone line, which comprises part of network link 421. If communication interface 420 is a local area network (LAN) card, communication interface 420 provides a data communication connection via network link 421 to a compatible LAN. Wireless links are also possible. In any such implementation, communication interface 420 sends and receives electrical, electromagnetic or optical signals which carry digital data streams representing various types of information.
Network link 421 typically provides data communication through one or more networks to other data devices. For example, network link 421 may provide a connection through local network 422 to local server computer 423 or to data equipment operated by an Internet Service Provider (ISP) 424. ISP 424 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the "Internet" 425. Local network 422 and Internet 425 both use electrical, electromagnetic or optical signals which carry digital data streams. The signals through the various networks and the signals on network link 421 and through communication interface 420, which carry the digital data to and from computer 400, are exemplary forms of carrier waves transporting the information.
Computer 400 can send messages and receive data, including program code, through the network(s), network link 421, and communication interface 420. In the Internet example, remote server computer 426 might transmit a requested code for an application program through Internet 425, ISP 424, local network 422 and communication interface 420.
The received code may be executed by processor 413 as it is received, and/or stored in mass storage 412 or other non-volatile storage for later execution. In this manner, computer 400 may obtain application code in the form of a carrier wave.
Application code may be embodied in any form of computer program product. A computer program product comprises a medium configured to store or transport computer readable code, or in which computer readable code may be embedded. Some examples of computer program products are CD-ROM disks, ROM cards, floppy disks, magnetic tapes, computer hard drives, servers on a network, and carrier waves. The computer systems described above are for purposes of example only. An embodiment of the invention may be implemented in any type of computer system or programming or processing environment.
A benefit of the directory protocol based approach of the present invention is access control. In prior art databases, access control is limited to table or view granularity. With a directory structure, individual entries or even individual attributes can carry access control. The invention also permits easy replication of databases, including the possibility of automatic and synchronous replication, and it permits a true federated approach to data storage.
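For example, the sketch below grants a single collaborator read access to a single experiment entry via JNDI. The aci value shown follows the Netscape Directory Server style, and the DNs are taken from the examples in the appendix; both should be treated as illustrative assumptions rather than a mandated format.

import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.*;
import java.util.Hashtable;

public class EntryLevelAccessControl {
    public static void main(String[] args) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://directory.example.org"); // hypothetical
        env.put(Context.SECURITY_PRINCIPAL, "cn=Directory Manager");   // hypothetical
        env.put(Context.SECURITY_CREDENTIALS, "secret");
        DirContext ctx = new InitialDirContext(env);

        // Netscape-style aci value: read/search on one entry only, not on a
        // whole table or view as in a relational database.
        String aci = "(targetattr=\"*\")(version 3.0; acl \"collaborator read\"; "
            + "allow (read, search, compare) "
            + "userdn=\"ldap:///uid=wmoore,ou=Shared FACS Facility,o=Stanford University\";)";
        ctx.modifyAttributes(
            "protocol=1234,cytometer=Flasher,uid=Moore,"
                + "ou=Shared FACS Facility,o=Stanford University",
            new ModificationItem[] {
                new ModificationItem(DirContext.ADD_ATTRIBUTE,
                    new BasicAttribute("aci", aci))
            });
        ctx.close();
    }
}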
Thus, a scheme for directory based protocol data storage has been described in conjunction with one or more specific embodiments. The invention is defined by the claims and their full scope of equivalents. Aspects of the invention are described in the attached Appendix.
Appendix A
A DIRECTORY OF BIOLOGICAL MATERIALS
WAYNE A. MOORE
Genetics Department, Beckman Center B007, Stanford University,
Stanford, CA 94305-5318, USA
Systematic nomenclature has been an essential tool in biology since its emergence as a modern science. However, the method by which formal or official names are adopted, namely meetings of professional or governmental bodies, has not changed since Linnaeus. The last decade has seen rapid advances in the standardization (X.500, LDAP) and implementation of computerized directory services, including a global system of distinguished names. This paper is a proposal that the biomedical community adopt X.500 as a standard for the machine representation of biological names. Adherence to such a standard would permit the sharing of essential information about research materials through directories. Adoption of unique names for biological materials facilitates collaboration by enabling investigators to exchange (via e-mail or electronic publication) unique identifiers for materials. An actively maintained directory of such materials would provide collaborators and future investigators with access to the primary data referenced by the literature, information about changes in nomenclature (for example, adoption of a standard name by a professional society), and references, citations or hyperlinks to later work on the material. We are implementing such a directory of flow cytometry samples and the monoclonal antibody reagents used to prepare them. A minimal set of names and objects drawn from this effort is provided here as a concrete example.
1 Introduction
Flow cytometry [1] is a technique for obtaining information about cells and cellular processes by allowing a thin stream of a single cell suspension to "flow" through one or more laser beams and measuring the resulting light scatter and emitted fluorescence. Since there are many useful ways of rendering cells fluorescent, it is a widely applicable technique and is very important in basic and clinical science, especially immunology. Its importance is increased by the fact that it is also possible to sort fluorescently labeled live cells for functional studies with an instrument called the Fluorescence Activated Cell Sorter (FACS). At our FACS facility alone, we have processed millions of samples in the last 15 years.
Flow cytometry has always been computerized, because without computers the data analysis would be infeasible. As flow cytometry has matured, the importance of combining flow data with data from other sources has become clear, as has the need for multi-site collaborations, particularly for clinical research. This led to our interest in developing methods for naming or identifying flow cytometry samples, reagents and instruments (among other things), and in maintaining a shared repository of information about the samples.
Flow cytometry was revolutionized in the late 1970s with the introduction of monoclonal antibodies [2] that could be coupled to a fluorochrome and used as FACS reagents. However, nomenclature for these reagents has been a hodgepodge, in spite of the fact that monoclonals are useful precisely because they can be uniquely and accurately named, i.e., the antibody produced by a clone is always the same, whereas naturally produced sera are highly variable. Our work in capturing the experimental semantics of FACS experiments made it clear that we needed at least a local nomenclature, and underscored the value of a global nomenclature for FACS data and monoclonal antibodies, which are useful in many fields besides flow cytometry.
There are many existing nomenclatures in biology and medicine that provide uniqueness by specifying a central registry, usually mediated by a professional society. Instead, to ensure uniqueness without global meetings, International Standards Organization (ISO) X.500 directory servers [3] achieve uniqueness with distinguished names (dn) that are assigned hierarchically. ISO defines country names and registers organization names, e.g., "c=US" and "o=Stanford University" respectively. Governmental or non-governmental organizations then define how relative distinguished names are handed out, e.g., by state "st=California, c=US" or by organizational unit "ou=Genetics Department, o=Stanford University".
It is easy to represent traditional standard names within the X.500 standard: distinguished names simply make them relative to the organization which defines them. Objects such as monoclonal antibodies can be named relative to the unique distinguished name of an investigator or organization. That means that unique identifiers can be assigned to biological materials early in the scientific process, and thus facilitate professional communication, both informal and published. Later, investigators who have this distinguished name can identify the material unambiguously and, if a directory service is maintained, determine whether it has been given an official name, whether it has been shown to be equivalent to another entity, or whether it has been cited in the literature. Thus I propose here, both for flow cytometry and as a general practice in biocomputing, the use of X.500 nomenclature. At the Stanford Shared FACS Facility we are constructing a testbed for these concepts applied to flow cytometry, based on commercial LDAP directory servers.
2 Background
2.1 Directories: X.500, LDAP v2 and v3
X.500 [3] is the core of a set of standards adopted by the International Standards Organization (ISO) beginning in 1988, which defines what may simply be called directory service. A directory is fundamentally a database. Directories were originally defined in order to allow users and their agents to find information about people, typically their telephone number but possibly including postal address, e-mail address and other information. This was extended to include documents, groups of users and network accessible resources such as printers and, more recently, databases. Three parts of the standard are of particular interest: the information model, the functional model and the namespace.
The X.500 information model is very powerful and flexible. The standard defines entries which have a set of named attributes that can have one or more values and may be absent. Each attribute has a name and a type, and each type has a name and a syntax which is expressed in Abstract Syntax Notation One (ASN.1). By default the types case exact string, case ignore string, telephone number, integer, distinguished name and binary are recognized. Every entry must have an attribute objectClass, which defines what attributes are possible and which are required, and may have an attribute aci (for access control information), which the server uses to control access to the entry. Object classes are hierarchical, i.e., a class can inherit attributes from a parent class and, by defining new attributes, extend its scope.
The entries in a directory are organized hierarchically. That is to say that any entry may have one or more subentries, so that the whole structure may be visualized as a tree. At every node each subentry is identified by a value of one of its attributes called a relative distinguished name (rdn), which must be unique within its level, for example "uid=wmoore". A distinguished name of a subentry is defined by concatenating its rdn with the dn of its parent entry, which is likely to be itself a compound name, for example "uid=wmoore, ou=Shared FACS Facility, o=Stanford University". These distinguished names are the namespace mandated by X.500.
The functional model defines a set of operations which may be applied to a directory: read, list, search, add, modify, delete (which are pretty much self explanatory), and bind, unbind and abandon, which are used to establish the user's credentials, end a connection to the server and cancel a running query, respectively. The search function starts from a root dn and finds all entities further down in the hierarchy which pass a search filter constructed from the "usual suspects", i.e., equal, less than, contains, sounds like, etc., applied to the attributes of the entity. A search filter may of course test the objectClass attribute and return only entries of a particular type. Clients can specify searches which return all the attributes of each entry or only a selected set of attributes.
The protocol defined in X.500 for accessing the Directory Service Agent (DSA) is called the Directory Access Protocol (DAP), and it runs on the Open System Interconnect (OSI) protocol stack, which is in its own right an ISO standard. This fact, as well as the complexity of the security mechanisms and abstract attribute encoding of the full protocol, made it difficult to implement DAP on lightweight clients, i.e., PCs and Macs.
The complexity of an X.500 directory client led to a desire for "X.500 lite", or a Lightweight Directory Access Protocol [4, 5] (LDAP), which would run on the TCP/IP protocol stack that is widely available on lightweight clients. LDAP adopts the X.500 data model essentially intact. It simplifies the functional model by collapsing the read, list and search functions into a single search function with object, one level or sub tree scope, respectively. It handles distinguished names as strings rather than the structured objects that DAP uses, which transfers the responsibility for parsing them to the server. Conversely, most of the responsibility for interpreting the attribute values reverts to the client. This results in some loss of robustness (because of weaker type checking) but relieves the client of the need to parse abstractly (ASN.1) defined objects. LDAP returns the results as individual packets, which allows lightweight clients to process result sets which they cannot store in memory. LDAP does not include much of the elaborate security and authentication mechanisms used by DAP, and also simplifies the search constraints to the maximum number of entries to return and the maximum time to spend searching.
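The correspondence between the X.500 operations and the LDAP search scopes can be sketched with JNDI, the Java interface discussed below. The server URL and base dn are examples, not fixed values.

import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.*;
import java.util.Hashtable;

public class ScopeDemo {
    public static void main(String[] args) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://ldap.example.org"); // hypothetical server
        DirContext ctx = new InitialDirContext(env);

        SearchControls controls = new SearchControls();
        // X.500 read = OBJECT_SCOPE, list = ONELEVEL_SCOPE, search = SUBTREE_SCOPE
        controls.setSearchScope(SearchControls.SUBTREE_SCOPE);
        controls.setCountLimit(100);    // LDAP's simplified size constraint
        controls.setTimeLimit(30000);   // and time constraint, in milliseconds
        controls.setReturningAttributes(new String[] {"cn", "telephoneNumber"});

        NamingEnumeration<SearchResult> results = ctx.search(
            "o=Stanford University", "(objectClass=person)", controls);
        while (results.hasMore()) {
            System.out.println(results.next().getNameInNamespace());
        }
        ctx.close();
    }
}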
Unfortunately, one X.500 function known as referral was not included in LDAP v2. This allows one DSA to return to the client a referral, which directs the client to try again on a different DSA. An LDAP v2 server is supposed to follow all referrals on behalf of the client and not return them to the client at all.
LDAP v2 [5] was proposed to the Internet Engineering Task Force (IETF) as a draft standard but was not adopted due to its technical limitations. This led to the effort to define a more acceptable version. Also in this period, the utility of stand-alone LDAP servers, i.e., servers which implement the information and functional models directly rather than relying on a higher tier of X.500 servers, became clear.
LDAP v3 [6] addresses the problems discussed above and was adopted by the IETF in 1998 as a proposed standard for read access only. The IETF feels that the authentication mechanisms are inadequate for update access but has allowed the standard to proceed for read access when some other means of updating is used (see also Hodges [7]).
In spite of the IETF reservations, this version has rapidly gained wide acceptance. All the major mail clients (Netscape, Outlook, Eudora, etc.) support it, and stand-alone LDAP servers are available from several vendors (Novell, Netscape, Lotus/IBM, Innosoft, etc.), as are X.500 gateways (Sun, Microsoft, etc.). It includes the concept of referrals and restores some, but not all, of the authentication and validation mechanisms of DAP. It also includes a well defined syntax for encoding distinguished names [8], attribute values [9] and search filters [10] as strings.
2.2 Existing technologies
The most familiar example of directory service is the rolodex or a box of 3x5 cards. Like card files, directory servers manage small-ish packets of information (a directory entry or card) associated with named persons or organizations that can record a diverse set of attributes. Directory service is not simply a billion card rolodex, however, because the servers don't just maintain the information; they will search through it for you and return only selected information. Servers can also suggest other servers (referrals) to enlist in the effort, i.e., you may end up searching several directories to get a result but not need to be aware of this.
Directory servers do not perform the join operation that relational databases use to combine information from different tables. Instead they offer increasing flexibility in representing and searching for information. An attribute of an entry in a directory may be missing or have multiple values. While it is possible to represent multiple values in relational form, it requires introducing new tables and joins, i.e., substantial overhead and complexity, so it is generally not done unless it is necessary. Missing values are usually supported in relational databases but usually require storing a special missing data value. The low overhead for missing and multiple values in a directory makes it much easier to accommodate rarely used attributes and occasional exceptions, such as persons with multiple telephone numbers. Directories are organized and searched hierarchically. Again, it is possible to do this with SQL stored procedures and temporary tables, but it is awkward.
A directory is in many ways an object oriented database (OODB). The difference between directory service and a traditional OODB is that a directory associates attributes with objects but not methods, and that binding to the attributes is done at runtime as a lookup operation rather than at compile time. The first means that you can retrieve arbitrary data from an object, but the only functions you can perform on it are the search, add, modify, delete, etc. defined by LDAP. The latter consideration is similar to the relationship of interpreted BASIC to compiled higher level languages, with analogous benefits (to the programmer and user) of simplicity, flexibility and rapid development, and costs (to the computer) in performance.
Frames are a data structure commonly used in artificial intelligence shells. The key feature of frames is that they inherit properties from their parents. Directory entries do not do this, because objectClasses inherit attributes but not attribute values from their parents. However, this functionality can easily be implemented on the client side. One simple scheme is to first look for the attribute in the named frame and, if it is not present, strip off the rdn and look for the attribute in the frame named by the parent dn (if it has objectClass=aiFrame). A more flexible scheme would be to define an entry of class aiFrame to include a dn valued attribute aiParentFrame and to trace that. Eventually it might be beneficial to move this to the server side, either by defining an LDAP extension or by defining a new ancestor scope option for the search function.
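A minimal client-side sketch of the first scheme, in Java with JNDI, follows; it omits the objectClass=aiFrame test for brevity, and the method name is of course arbitrary.

import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.DirContext;
import javax.naming.ldap.LdapName;

public class FrameLookup {
    // Look for the attribute on the named entry; if it is absent, strip the
    // leading rdn and retry on the parent dn, as described above.
    static Attribute inheritedAttribute(DirContext ctx, String dn, String attrId)
            throws NamingException {
        LdapName name = new LdapName(dn);
        while (!name.isEmpty()) {
            Attribute a = ctx.getAttributes(name, new String[] {attrId}).get(attrId);
            if (a != null) {
                return a; // found on this frame
            }
            name.remove(name.size() - 1); // strip the leaf rdn, move to the parent
        }
        return null; // not defined anywhere along the chain
    }
}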
Uniform Resource Locators (URL) are the internet standard for locating information. For most protocols they are based in the Domain Name System (DNS), which identifies individual computers on the IP network. This presents problems when more than one computer offers access to the resource or the computer serving the resource changes with time. Distinguished names avoid this problem and may be served by many computers, i.e., directory entries may be replicated or cached for reliability or performance, and the responsible servers may change over time.
2.3 Benefits of directories
A major advantage of LDAP is the availability of LDAP servers and client toolkits. Standalone servers and LDAP to X.500 gateways are available from several sources. LDAP client libraries are available for the C language from the Univ. of Michigan and Netscape, and for the Java language from Sun and Netscape. Furthermore, LDAP is a standard which is directly utilized by the clients, and all clients should be able to talk to all servers. In contrast, SQL standardization has more to do with transportability of programmers and database schema than interoperability of databases.
The X.500 information model is extremely flexible, and search filters provide a powerful mechanism for selecting entries, at least as powerful as SQL and probably more powerful than a typical OODB. The standard defines an extensibleObject which can have any attribute, and some standalone LDAP implementations permit relaxed schema checking, which in effect makes any object extensible. Since an attribute value may be a distinguished name, directory entries can make arbitrary references to one another, i.e., across branches of the directory hierarchy or between directories. Some LDAP and X.500 servers [11] permit fine grained access control. That is to say that access controls can be placed on individual entries, whole sub trees (including the directory itself) and even individual attributes if necessary. This level of control is not available in most existing databases.
Referrals mean that one server which cannot resolve a request may refer the user to another server or servers which may be able to do so. During a search operation, any referrals encountered are returned with the entries located, and the user (or client) has the option of continuing the search on the servers indicated. This allows federation of directories, which means that multiple LDAP/X.500 servers can present to the user a unified namespace and search results even though they are at widely separated locations and the implementations may actually be very different.
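In JNDI (described next), the client's referral behavior is a single environment setting; the server URL here is, again, only an example.

import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import java.util.Hashtable;

public class ReferralDemo {
    public static void main(String[] args) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://ldap.example.org"); // hypothetical
        // "follow" chases referrals automatically, so several federated
        // directories appear to the user as a single namespace; "ignore"
        // and "throw" are the other standard settings.
        env.put(Context.REFERRAL, "follow");
        DirContext ctx = new InitialDirContext(env);
        ctx.close();
    }
}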
2.4 Java Naming and Directory Interface
The Java Naming and Directory Interface [12] (JNDI) is a standard extension to the Java language introduced by Sun. It includes an abstract implementation of name construction and parsing which encompasses the X.500 name space among others, and an abstract directory that is essentially the X.500 information and functional models. Specific implementations (service providers [13]) are available for LDAP, the Network Information Server (NIS) and even the computer's own file system.
JNDI removes many of the limitations of LDAP as an OODB by providing a standard way to identify the Java class corresponding to a directory entity and instantiate it at runtime. It is also possible to store serialized Java objects as attribute values. Sun has proposed a set of standard attributes and objectClasses for representing Java objects in a directory.
3 Naming
3.1 X.500 Distinguished Names
When represented as a string (essentially always with LDAP), a distinguished name is a comma separated list of attribute value pairs and is read from right to left. If a value contains special characters such as commas, it must be quoted, and in any case initial and final white space around attributes or values is ignored. For example, "cn=Wayne Moore, ou=Genetics Department, o=Stanford University".
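These parsing rules can be demonstrated with the LdapName class from JNDI; the second, quoted value below is invented only to show the escaping of a comma.

import javax.naming.InvalidNameException;
import javax.naming.ldap.LdapName;
import javax.naming.ldap.Rdn;

public class DnDemo {
    public static void main(String[] args) throws InvalidNameException {
        LdapName dn = new LdapName(
            "cn=Wayne Moore,ou=Genetics Department,o=Stanford University");
        // Rdns are indexed right to left: index 0 is the root-most component.
        for (Rdn rdn : dn.getRdns()) {
            System.out.println(rdn.getType() + " = " + rdn.getValue());
        }
        // A value containing a comma is escaped when rendered as a string.
        Rdn quoted = new Rdn("o", "Example, Inc."); // invented organization
        System.out.println(quoted); // prints o=Example\, Inc.
    }
}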
Location names have as their root (right-most) component the countryName or c attribute, with the value being one of the ISO standard two letter country codes, for example c=US. Such names can be further restricted by specifying a stateOrProvinceName, abbreviated st, and a locality, abbreviated l, for example "l=San Francisco, st=California, c=US".
Organizational names have as their root the name (registered with ISO) of a recognized organization, and may be further qualified with one or more organizational units, for example "ou=Department of Genetics, ou=School of Medicine, o=Stanford University".
Domain names as used by the Domain Name Service (DNS) are represented with the dc attribute, for example, "dc=Darwin, dc=Stanford, dc=EDU".
Names of persons. There are two conventions for naming people. The older uses the commonName or cn attribute of the Person objectClass, but these are not necessarily unique. Some directories use the userId or uid attribute of inetOrgPerson, which is unique. Since uniqueness is important for scientific applications, the latter will be used. The remainder of a person's dn is usually either an organizational or geographic name, for example "uid=wmoore, o=Stanford University" or "cn=Wayne Moore, l=San Francisco, st=California, c=US".
3.2 Encapsulating and extending existing nomenclatures
The following examples are chosen because they are referenced by the flow cytometry objects introduced below.
Gene loci, for example, "locus=Igh-1, o=Professional Society" or "locus=New, cn=Leonard Herzenberg, ou=Department of Genetics, ou=School of Medicine, o=Stanford University".
Gene alleles, for example, "allele=a, locus=Igh-1, o=Professional Society" or "allele=1, locus=127, ou=Department of Genetics, o=Stanford University".
CD antigens, for example, "specificity=CD23, o=Human Leukocyte Differentiation Workshop".
References in the scientific literature have essentially achieved the benefits of distinguished names without an explicit central authority. However, representing them as distinguished names will facilitate mechanical processing. For example, "title=A Directory of Biological Materials, volume=1999, o=Pacific Symposium on Biocomputing". A true directory of such literature references would be of obvious value.
3.3 New nomenclature schema
The following schemas arose from work on storing information about flow cytometry data in directories.
Monoclonal antibodies are distinguished by cloneName or clone, which is unique within the parent entity, which must be an investigator or organization.
Lymphocyte differentiation antigens are a thesaurus of the target specificities of monoclonal antibodies. This would include, but not be limited to, the official CD names.
FACS instruments are distinguished by the cytometer attribute, which must be unique with respect to the organization parent, for example, "cytometer=Flasher II, ou=Shared FACS Facility, o=Stanford University".
FACS experiments are distinguished by the protocolIdentifier or protocol attribute, which must be unique with respect to the parent, which may be a person, an instrument or an organization, or some combination, e.g., "protocol=1234, cytometer=Flasher, uid=Moore, ou=Shared FACS Facility, o=Stanford University".
FACS samples are distinguished by a protocolCoordinate, which must be unique within the parent FACS experiment, e.g., "coord=A12a, protocol=12345, cytometer=Mollusk, ou=Shared FACS Facility, o=Stanford University".
4 Biological Object Schema
X.500 defines a sparse set of standard types and standard objects, mostly for describing persons and documents, that are more suitable for business than scientific use. However, if types were added for scientific use, particularly real numbers and possibly dimensional units, much scientifically relevant information could be conveniently stored in and accessed from directories. The following minimal set of objects for the field of flow cytometry is presented to lend concreteness to the discussion. A fuller and formal definition will follow.
Table 1: Scientific Investigator
[Table content available only as an image in the source document.]
[Further table content (caption lost in extraction) available only as an image in the source document.]
Table 4: Monoclonal antibodies
[Table content available only as an image in the source document.]
Table 5: FACS instrument
[Table content available only as an image in the source document.]
Table 6: FACS experiments
[Table content available only as an image in the source document.]
[Table 7 (FACS sample) content available only as an image in the source document.]
5 Conclusion
This paper examines the problem of computer-assisted communications in flow cytometry in particular, and biology in general, from the point of view of the emerging standards for computerized directory service. Following Schulze-Kremer [14]: "To improve the current situation of non-unified and ambiguous vocabulary, the only solution is to develop a core of commonly agreeable definitions, and using these, to implement user interfaces to and between databases." As an example of how this goal can be accomplished, I have outlined how X.500 directory services, accessed via LDAP from lightweight clients, can be used to create and manage a unique namespace in the flow cytometry domain. We plan to produce a concrete and useful implementation of a directory of the FACS experiments and sample data collected at Stanford, the National Institutes of Health, Fox Chase Cancer Center and the University of Iowa. We also plan to create a registry of monoclonal antibodies based on input from the manufacturers and other interested parties such as the Human Leukocyte Differentiation Workshop. This work will be proposed for standardization to the National Information Standards Organization (NISO), a non-profit organization accredited by the American National Standards Institute (ANSI) for information standards development, or to the working group on Accessing, Searching and Indexing Directories (ASID) of the Internet Engineering Task Force (IETF), which is responsible for internet standards activity.
The wide use and importance of flow cytometry in basic and clinical science today means that our directory will rapidly become a significant resource for the field. In addition, this project will make the primary data from flow cytometry and monoclonal antibody production available to the wider biomedical community, as is already done for gene sequence data. We believe that there are many other fields and instrument methodologies for which this would be a great benefit.
Acknowledgments
This work was supported by grants NLM04836 from the National Library of Medicine and CA42509 from the National Cancer Institute, both in the National Institutes of Health.
References
1. D. R. Parks, "Flow Cytometry Instrumentation and Measurements" in Handbook of Experimental Immunology, Blackwell Scientific, (1996)
2. V. T. Oi, P. P. Jones, J. Goding and L. A. Herzenberg, Advances in Microbiol. (1978)
3. "Data Communications Networks Directory", Recommendations X.500- X.521, Volume VIII, IXth Plenary Assembly, Melbourne, (CCITT, 1988)
4. T. A. Howes, "The Lightweight Directory Access Protocol: X.500 Lite", CITI Technical Report 95-8 (1995)
5. W. Yeong, T. Howes, and S. Kille, "Lightweight Directory Access Protocol," RFC 1777, (1995)
6. M. Wahl, T. Howes, S. Kille "Lightweight Directory Access Protocol (v3)." RFC 2251 (1997)
7. J. Hodges, "An LDAP Roadmap & FAQ", Kings Mountain Systems, (1998), URL http://www.kingsmountain.com/ldapRoadmap.shtml
8. S. Kille, "A String Representation of Distinguished Names", RFC 1779, (1995)
9. M. Wahl, A. Coulbeck, T. Howes and S. Kille, "Lightweight Directory Access Protocol (v3): Attribute Syntax Definitions", RFC 2252, (1997)
10. T. Howes, "The String Representation of LDAP Search Filters", RFC 2254, (1997)
11. "Netscape Directory Server Administrator's Guide", Netscape, (1997)
12. "JNDI: Java Naming and Directory Interface", Sun Microsystems, (1998)
13. "JNDI SPI: Java Naming and Directory Service Provider Interface", Sun Microsystems, (1998)
14. S. Schulze-Kremer, "Ontologies for Molecular Biology", Pacific Symposium on Biocomputing 3:693-704 (1998)
Monoclonal Antibody Card
dn: clone=A.F.6-78, ou=Pharmingen, o=Becton-Dickinson Immunocytometry Systems
objectClass: MonoclonalAntibody
clone: A.F.6-78
o: Becton-Dickinson Immunocytometry Systems
ou: Pharmingen
cn: Anti IgH-6b, Anti IgM1
specificity: allele=b, locus=Igh-6, o=IUIS standard notation group
creatorDn: uid=Stall, ou=Genetics, ou=School of Medicine, o=Stanford University
manufacturer: ou=Pharmingen, o=Becton-Dickinson Immunocytometry Systems
Investigator Card
dn: uid=LenHerz, ou=Genetics, ou=School of Medicine, o=Stanford University
objectClass: ScientificInvestigator, inetOrgPerson, organizationalPerson, person
uid: LenHerz
cn: Len Herzenberg
ou: Genetics, School of Medicine
o: Stanford University
professionalName: Leonard A. Herzenberg
professionalSpeciality: Genetics, Immunology, Cell Sorting
professionalAffiliation: National Academy of Sciences, American Association of Immunology

FACS Sample Card
[Card content available only as an image in the source document.]
Genus Card
[Card content available only as an image in the source document.]
Species Card
dn: 'H. sapiens'
objectClass: Species
alias: species=sapiens, genus=Homo
dn: species=sapiens, genus=Homo
REF: ldap://shgp.stanford.edu/species=sapiens, genus=Homo
dn: species=sapiens, genus=Homo
UID: ldap://shgp.stanford.edu/species=sapiens, genus=Homo
Radiation Hybrid Panel Card
[Card content available only as an image in the source document.]
Sequence Tag Site Card
[Card content available only as an image in the source document.]
Radiation Hybrid Clone Card
[Card content available only as an image in the source document.]
Radiation Hybrid Map Card
[Card content available only as an image in the source document.]
In addition, two new attribute syntaxes seem called for: one for comparing dimensionful quantities sensibly (e.g., feet vs. cm), and another for approximate searching of sequence attributes. LDAP is short on numerical types; I don't know about X.500. A sequence syntax would be new for both, of course. The standard (and the Netscape server) allow for such extensions.
JOURNAL Access
The objectClass=scientificPublication should have optional multi-valued attributes reference and citation, which are distinguished names. When the publisher establishes the record, they will fill in the reference with the dn of another scientificPublication which this one references. An indexing service would buy the rights to replicate the raw data and, when new data appeared, update the citations in its copy, then serve the result as "value added" to its customers.
A Digital Library of Flow Cytometry Data and Cell Phenotypes
A. Project Summary
We have conceptualized a broadly applicable digital library architecture that is both flexible and extensible and can be implemented using current Internet Standards and existing software application tools. In our design, we advocate separating metadata from data and focusing searches on the metadata. The metaphor we employ is storing "information about information" in Card Catalogs that represent information in diverse Library Collections. We propose to build the Card Catalog using distributed and replicated directory services that then refer to primary data in data sources such as file servers and databases. Within this design we take advantage of several directory service features:
• We can maintain information context;
• We can cross-index entries that refer to different data sources; and
• We can distribute timely information.
We believe this infrastructure is applicable to cataloging diverse data sources both scientific and non-scientific, and are intent on creating an exemplary digital library based on this conceptualization.
Our research interests are focused on building a digital library of flow cytometry and cell phenotype data. Whereas the Genome Project technologies are directed at dissecting genotype diversity, flow cytometry technologies are directed at dissecting phenotype diversity. Scientists doing research in almost every biological science use flow cytometry as an invaluable tool in their daily research. There are flow cytometers used for clinical research in nearly every medium to large hospital in the country. We are intent on building a flow cytometry digital library that can be cross-referenced to other biological data sources (such as the Genome Project). And after we have completed this project we anticipate that any scientist having discovered an unusual peripheral blood phenotype (among tens of thousands of scientists using flow cytometry) can ask, "Has anyone else seen this before?" and expect a reasonably accurate answer.
C. Project Description
Specific Goals and Objectives
Our long-standing goal is to build an open framework to manage the entire life cycle of flow cytometry data. This life cycle begins when an investigator first thinks of and plans an experiment. It continues as the investigator executes the experiment and stores, retrieves, analyzes, and re-analyzes the experiment's data. At any point, collaborators and other investigators retrieve and analyze the data as well. Eventually the primary investigator may publish findings based on this data. Traditionally this is where the data life cycle ends and data are sequestered in laboratory notebooks and microfiche. In contrast, we believe the data life cycle continues.
Our current and previous grants (from the National Library of Medicine) support our work on the input side of this life cycle. This support includes developing an intelligent protocol editor, an effective Local Area Network (LAN) infrastructure for data collection and storage, and desktop data analysis applications. This software suite is called FACS (Fluorescence-Activated Cell Sorter) Desk.
In preparing for this grant application, we have brought together a Consortium made up of leading academic flow cytometry research groups and the leading commercial cytometry vendor to design and build an innovative infrastructure for the output side of this data life cycle. This will be a digital library that extends the traditional model of the data life cycle. Our current academic partners include the cytometry research groups at Stanford, Fox Chase Cancer Center [1], the University of Iowa [2], the Stanford University Shared FACS (Fluorescence-Activated Cell Sorter) Facility, the Center for Biologics Evaluation and Research at the NIH [3], and an innovative group from the Stanford Statistics Department. Becton Dickinson Immunocytometry Systems, the leading commercial vendor of flow cytometry instruments and reagents, is our industry partner [4]. We propose to complete the digital library in two overlapping phases over the next five years.
Target Audience
Fluorescence-Activated Flow Cytometry was initially developed because of the needs of cellular immunologists to distinguish functional lymphocyte populations. Subsequent to the development of hybridoma (monoclonal) antibodies, trillions of cells have been analyzed, sorted, and categorized using flow cytometry. What started as an immunologist's research tool is used today in molecular and cellular research by both clinical and basic research investigators. This is a short list of the diverse projects supported by flow cytometry:
• Lymphoid Cell Phenotyping and Subset Analysis
• Hematopoietic Stem Cell Characterization and Sorting
• Monitoring and Evaluation in AIDS Clinical Trials
• Leukemia Phenotyping and Monitoring
• Reporter Gene Expression
• Cell Activation
• Apoptosis Studies in Cancer Biology
• Cell Physiology Studies
• DNA Content and Cell Cycle Analysis
• Tumor Cell Identification and Aneuploidy Evaluation
• Chromosome Analysis and Sorting
[1] A letter of commitment from Dr. Randy Hardy at the Fox Chase Cancer Center is included in the Appendices.
[2] A letter of commitment from Dr. Morris Dalley at the University of Iowa is included in the Appendices.
[3] A letter of commitment from Dr. Gerald Marti at the Center for Biologics Evaluation and Research at the NIH is included in the Appendices.
[4] A letter of commitment from Becton Dickinson Immunocytometry Systems is included in the Appendices.
• X-Y Sperm Discrimination and Sorting
• Isolation of Fetal Cells in Maternal Circulation
• Various Studies of Bacteria and Yeast
• Drosophila Embryo Cell Sorting
• Plankton (Nanoplankton) Analysis in Marine Biology
• Plant Protoplast Studies
• Water Quality Monitoring
Section I Part A contains a description of the significance of flow cytometry in helping advance our understanding of the immune system.
This flow cytometry digital library is targeted at two major audiences. The first is the flow cytometry user community involved in the diverse range of research areas listed above; and the second is the digital library development community involved in developing the infrastructures of other digital libraries. We believe the innovative use of directory services as Card Catalogs that refer to other data sources can be generalized and used to link diverse data collections. In addition there may be a third group consisting of individuals interested in our work on clustering and developing ways to describe cell populations.
Project Plan
The project is divided into three parts:
1. Designing and building the library,
2. Developing new ways to search the library, and
3. Adding a dictionary and thesaurus to support a common vocabulary.
We plan to introduce library services in two phases. At the end of Phase I we will do a controlled release of a testbed application that will access core library features. These features include a Card Catalog of user and experiment information and a Central Data Archive containing instrument data. In Phase II we will add new search procedures with which to query the library, and add an antibody dictionary and an antigen thesaurus to the Card Catalog. We will also respond to user feedback from the Phase I testbed release.
Background
Our approach to creating a flow cytometry digital library is unique, because we view the cytometry library as an integral part of the entire data life cycle. During the past year, we redesigned and began building a Web version of our existing FACS Desk application. Figure 1 is a diagram of this design specification. Investigators use workspaces in the FACS Desk framework to plan and organize their experiments and results. This framework makes it easy to run experiments, retrieve data, and use other FACS Desk application modules or other third-party desktop applications to analyze and visualize their data.
The accumulated FACS Desk Data Archive consists of all the experiments and data from the Stanford Flow Cytometry User Group. It is a library of flow cytometry data that can be accessed by Stanford users having a FACS Desk account. When new users want access to this library, they require new accounts, which results in a need for more systems and increases the accounting administration load. We foresaw that the Web version of FACS Desk would only exacerbate these problems. Remote users accessing the library would further burden our computer systems and network bandwidth.
In order to transform FACS Desk into a real Internet Application, we have made significant changes in our original design specifications. We wanted an easy way to open access to the library to users outside the Stanford Community, and similarly we wanted an easy way for Stanford users to access other flow cytometry libraries. Figure 2 is a diagram of our Internet Application. The key to this new design is an innovative use of directory services not only as a user directory, but also as a Card Catalog for searching and browsing other data sources.
The FACS Desk user shown in the middle of the diagram in Figure 2 is a data author. The digital library user shown at the top of the diagram is the data reader. The reader is looking for information that is authored (or owned) by other users. The reader expects the library service to provide access to diverse data collections. We will describe a digital library infrastructure that is very analogous to visiting a library and searching for references in the library's card catalog. When a visitor finds a card of interest, the next step is to find the reference in the local library's stacks or use the library's services to access stacks in other remote library collections.
The TestBed Application Core - Phase I
Phase I activities include defining an evolving Recommended Data Standard that remains backward compatible with existing flow cytometry data formats. We will seek input from key players in the field of flow cytometry, and anticipate this process will be an ongoing evolution. The data attributes described in the evolving Standard define the initial schema for the directory service. The directory service and data archive will then be populated with user and experiment data from all of our Consortium Members. We plan to use the University of California at San Diego's Super Computing Resources through the National Partnership for Advanced Computational Infrastructure Program as a Central Archive for all instrument data files [5]. The directory service, which we will refer to as the Card Catalog, will be distributed and replicated to each participating site using the inherent functionality of directory services. Users will access the Card Catalog using a Web browser with JAVA plug-ins. We plan to do a controlled release of this testbed application at all of our Consortium Members' sites. Figure 3 is a logical network diagram of the controlled release. Users will be able to access, search, and browse the Card Catalog and then view or download data for analysis using third-party applications. Phase I activities also include exploring economic models to support and maintain the digital library beyond the grant period. We expect to receive feedback on these as well as usability issues during the controlled release.
An Extensible Recommended Standard for Data Collection
The necessity to build an easily searched and distributed card catalog presents an opportunity to create an extensible Flow Cytometry Data Standard. Except for cytometry sites supported by Stanford's FACS Desk software suite, little or no experiment contextual data (experiment metadata) is recorded when flow cytometry data are collected and saved to disk. Data about experiments are recorded in notebooks, and users use computer data file and directory names to index their data.
Presently the Cytometry User and Developer Community is considering including contextual experiment data within the data file format called FCS3.0 [6]. There are two problems associated with this proposal: the first is that experiment metadata is sequestered within the data file; and the second is that changes are precluded until FCS4.0 is specified. This approach to a Flow Cytometry Data Standard inextricably mixes software and biological requirements. We strongly recommend that the software requirements needed to assure a robust file format be considered separately from the formal definition of metadata schemata.
Our strategy is to store experiment metadata on Cards (in the Card Catalog). This may be in addition to, or instead of, storing the metadata within the data file (to remain compatible with existing Standards), but this approach enables us to support any extended schemata. Separately, we advocate using several existing Internet standards as guidelines for file formats. This would enhance interoperability and transportability rather than support a single all-purpose exclusive Flow Cytometry Standard. Section I Part C contains our rationale for using both the established MIME (Multipurpose Internet Mail Extensions) standard and the emerging JARs (JAVA Archive files) standard.
Scope and Methodology
In Phase I Consortium Members will draft an initial Recommended Data Standard. Using this as a starting point we will actively solicit input from the following groups:
• Key Cytometry Leaders from both Research and Clinical Laboratories;
[5] A letter of intent from Dr. Russ Altman, Molecular Science Thrust leader for the National Partnership for Advanced Computational Infrastructure (NPACI) grant from the National Science Foundation to U.C. San Diego and the San Diego Supercomputer Center, is included in the Appendices.
[6] Data File Standard for Flow Cytometry, Version 3.0 <http://nucleus.immunol.washington.edu/ISAC/fcs3/FCS3.html>
• Commercial Vendors and Third-party Application Developers;
• Journal Editors;
• Major Pharmaceutical Laboratories; and
• Key Individuals from the Food and Drug Administration.
We will solicit input in the following areas:
• Metadata Schemata;
• Data Validation, Authentication, and Transport;
• Computing Infrastructure Requirements;
• Copyright; and
• FDA Requirements.
Copyright and FDA Requirements are listed separately because they are exceptional requirements.
Anticipated Results
We expect a plethora of suggestions on what to include and exclude in the Recommended Standard, but we are prepared to maintain an evolving Recommended Data Standard. The inherent flexibility and extensibility of directory services enables us to accommodate this approach. We anticipate that most users will be more concerned with how to salvage and organize their existing data. This is because, while a few large organizations have jury-rigged their own solutions, nearly all existing flow cytometry data is archived on tapes with print indices. So while a general utility to salvage existing data is needed, we prefer that commercial vendors fill this gap.
The data archived by the Stanford FACS Desk software suite and used at the University of Iowa, Fox Chase Cancer Center, and Stanford (as well as sites in Japan and Germany) is readily exported to the digital library. The combined library from the three US sites alone is near a terabyte of data. Today over ten thousand flow cytometry instruments are used in basic research and clinical settings generating hundreds of gigabytes of data daily.
How important is archiving all this data? We believe it is as important as archiving the nucleic acid sequence data in the Human Genome Project, and for the same reasons. Where the Genome Project focuses on technologies to define and archive genotypes, research using flow cytometry focuses on defining cellular and humoral phenotypes. Furthermore, research using flow cytometry is conducted on more organisms than the Genome Project supports. Eventually genotype and phenotype libraries need to be linked. In fact we believe this can be done using our Card Catalog. We provide a glimpse of how this can be done in our technical description of directory services in Section I Part B.
A National Directory of Flow Cytometry Users
In order to manage potentially tens of thousands of users, we plan to create a National Directory of Flow Cytometry Users using the Lightweight Directory Access Protocol (LDAP) and X.500-compliant directory services. A thorough technical description of our use of LDAP and X.500 directories is provided in Section I Part B.
The emergence of directory services in the computing industry derives from the need to provide users with complete and transparent access to network resources and other network users. The primary role of directory services is to translate network names to network addresses and to provide a unified naming space (schema) for all network entities.
Why LDAP Directories
LDAP is a simple mechanism for Internet clients to query and manage a directory service. A directory service is basically an arbitrary database of hierarchical attribute/value pairs. Such databases are generally X.500-compliant directories. X.500 is a directory service specification supported by the International Organization for Standardization (ISO [7]) and the Consultative Committee for International Telephony and Telegraphy (CCITT [8]). The Internet Activity Board (IAB) also has published RFC (Request for Comments [9]) 1006, specifying the deployment of X.500 over TCP/IP networks.
Extended Attributes for Experiments
Our use of directory services to create a National Directory of Flow Cytometry Users is in line with the Internet industry's directory goals. However, we have extended the concept of a unified "naming space" for network entities to include what we call an individual user's workspace. A flow cytometer user's workspace contains information about experiments, data, and results. Essentially each workspace entry is unique and congruous with the X.500 directory naming space. X.500 also specifies that, in addition to the published attribute/value pairs, developers may designate extended attributes to meet the need for specific contextual information. In our case we intend to define additional object classes (types of attributes) for laboratory data. For example, additional object classes include experiments, samples, and instruments. Specific experiment attributes are described in Section I Part B. The specified structure of the directory hierarchy supports the concept that the user "owns" the experiment information. This enables us to certify information "ownership" [10] using the X.509 authentication framework.
Cards and Card Catalogs
We think of X.500 and X.500-like directories as electronic card catalogs, and object classes or combinations of object classes as electronic cards. When we extend this idea to workspaces and experiment object classes, we visualize Card Catalogs containing "bits and bytes" of metadata and data abstractions that can be distributed and replicated by federated directory services. Searching the Card Catalog will quickly determine whether something exists. When a card of interest is found, the card refers the user or application to another data source, which might be a file server, a relational database or off-line data.
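A hedged sketch of that interaction, again in Java with JNDI, follows; the facsExperiment, reagentClone, and dataLocation names are hypothetical schema elements invented for the example.

import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.*;
import java.util.Hashtable;

public class CardLookup {
    public static void main(String[] args) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://catalog.example.org"); // hypothetical catalog
        DirContext ctx = new InitialDirContext(env);

        // Find cards for experiments that used a given antibody clone.
        SearchControls controls = new SearchControls();
        controls.setSearchScope(SearchControls.SUBTREE_SCOPE);
        controls.setReturningAttributes(new String[] {"dataLocation"});
        NamingEnumeration<SearchResult> cards = ctx.search(
            "ou=Shared FACS Facility,o=Stanford University",
            "(&(objectClass=facsExperiment)(reagentClone=A.F.6-78))", controls);

        while (cards.hasMore()) {
            Attribute location = cards.next().getAttributes().get("dataLocation");
            if (location != null) {
                // The card holds only metadata; it refers the client to a
                // file server, database, or off-line archive for the data.
                System.out.println("data source: " + location.get());
            }
        }
        ctx.close();
    }
}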
An XML Flow Cytometry Markup Language Specification
An added benefit of using X.500 directories is that the attributes and values we define for flow cytometry data and experiment metadata are directly useful as definitions of XML elements. The hierarchical trees of directory services are nearly identical to XML Document Type Definition (DTD) structures. In fact, the requirements for X.500 Directory namespaces are identical to the requirements for XML namespaces, which are discussed in the World Wide Web Consortium Working Draft 18-May-1998 (WD-xml-names-19980518) [11]. Since we expect both Netscape and Microsoft to release XML-aware Web Browsers within the next two years, we will focus our efforts on developing the directory services structure and data archives before formalizing a flow cytometry XML Markup Language Specification. We will address this specification near the end of Phase I activities.
Distributed Card Catalogs and Local, Regional, and Central Library Stacks
Card Catalog Access
The user in our Phase I scenario is a flow cytometry user. Library access will be limited to authenticated users in the directories of all of the Consortium Members' flow cytometry sites. We need to determine what general access means and what access categories to support by the end of the grant period. The fine-grained access control provided by X.500 directories will support flexible Library Card Privileges.
Searching and browsing the Card Catalog
We will provide JAVA plug-ins for Web browsers to view and search the Card Catalog. Additional views may be supported using Web pages with ODBC or XML. Phase I search capabilities will be restricted to attributes specified
7 <http://www.iso.ch/> 8 <http://www.itu.int/>
9 <http://www.ietf.org/home.html>
10 This conceptually enables us to provide copyright information on request.
11 <http://www.w3.org/TR/WD-xml-names.html>
in the most current Recommended Data Standard. We suspect these will include experiment and sample parameters and attributes such as antibody reagents used. Users will be able to download data for use in desktop software applications.
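As an illustration of what an attribute-restricted Card Catalog search might look like programmatically, a hedged JNDI sketch follows; the filter and result attribute names (facsExperiment, antibodyReagent, dataSourceURL) are assumptions standing in for whatever the Recommended Data Standard finally specifies.

    // Sketch only: searching the Card Catalog for experiment cards that
    // used a given antibody reagent. Attribute names are assumptions.
    import java.util.Hashtable;
    import javax.naming.*;
    import javax.naming.directory.*;

    public class CardSearch {
        public static void main(String[] args) throws NamingException {
            Hashtable env = new Hashtable();
            env.put(Context.INITIAL_CONTEXT_FACTORY,
                    "com.sun.jndi.ldap.LdapCtxFactory");
            env.put(Context.PROVIDER_URL, "ldap://directory.example.org:389");
            DirContext ctx = new InitialDirContext(env);

            SearchControls controls = new SearchControls();
            controls.setSearchScope(SearchControls.SUBTREE_SCOPE);

            NamingEnumeration results = ctx.search("o=Consortium",
                    "(&(objectClass=facsExperiment)(antibodyReagent=anti-CD8*))",
                    controls);
            while (results.hasMore()) {
                SearchResult card = (SearchResult) results.next();
                // Each card refers the user to the actual data source
                System.out.println(card.getName() + " -> "
                        + card.getAttributes().get("dataSourceURL"));
            }
            ctx.close();
        }
    }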
The Library Stacks
In Phase I the San Diego Supercomputer Center (at the University of California, under the auspices of the National Partnership for Advanced Computer Infrastructure (NPACI) - Molecular Sciences Application Thrust12) will support a Central Consortium Library Stack13. This Central Resource will consist of the combined Stacks from all participating Consortium Members.
We imagine there will be Regional and Central Libraries containing both private and public stacks (closed and open data collections). The system allows data owners to determine access privileges. In one use scenario, a library user searches the Card Catalog and discovers an interest in a data collection located in a private (closed) stack. Information in the card catalog enables the user to contact the data owner and negotiate access privileges, perhaps as a collaborator. This example illustrates the fine-grained access control made possible by the Card Catalog. It also exemplifies a preferred use scenario in which an author at least provides information that data exists, albeit with restricted access.
Another use scenario mirrors the precedents set by individuals involved in the Genome Project. They established the practice of mandating public disclosure of experimental data (e.g., nucleic acid sequences) when information about the data is published. We anticipate a similar mandate by several editorial boards of journals that publish flow cytometry data14.
Viewing and Using the Library
We anticipate several views of the Card Catalog. As previously mentioned, one view will use Web browsers with JAVA plug-ins and display our Card Catalog metaphor. The FACS Desk Protocol Editor presents another view; it provides a data input interface. Web pages using LDAP-ODBC drivers or XML DTDs will provide other views. Our open architecture enables anyone to build his or her own special views and applications to access the Card Catalog.
Economic and Business Models
The Consortium Members and the individuals involved in defining the Recommended Data Standard include the most likely candidates to support the flow cytometry digital library after the grant period is over. During Phase I we will direct a business intern to put together several business models based on input from both Consortium Members and participants developing the Recommended Data Standard. We expect that several models will be economically feasible.
Much as everyone expects the emergence of global e-mail address spaces, there will emerge directories of science namespaces - the first being this National Flow Cytometry Namespace with its experiment and biology namespaces. Another will be the Genome namespace. Public and private stacks will emerge, exactly as intra-, extra-, and Internets have emerged. The federated directories will glue them together.
Different library stacks will contain both private and public stacks, similar to Web sites supported by pharmaceutical companies, universities, and research institutes. Probably some stacks will provide public access to parts of collections in hopes of attracting paid subscriptions to private data collections. The use of open Internet standards will support the interoperability of all these possibilities.
And yet despite all these possibilities, we believe there still should be a grant-supported Central Data Archive for Flow Cytometry. This Central Resource might also provide a Master Directory Service that distributes and replicates
12 An NPACI letter of commitment is included in the Appendices.
13 We refer to data sources, whether a file server or a relational database, as Library Stacks in keeping with our Card Catalog metaphor.
14 The Editorial Board of the International Immunology Journal has already adopted presentation requirements for flow cytometry data, and several other Editorial Boards are considering similar Standards. In Phase I activities, we will include Journal Editors in our discussions of our evolving flow cytometry Recommended Standards.
subscribed subdirectories to federated local and regional services. We need this Central Resource because searching the existing print and electronic literature with questions such as "Has this been done before?" or "Has anyone done a similar study on another patient cohort?" cannot provide accurate answers.
Published manuscripts in print journals reporting results based on experiments using flow cytometry describe a minuscule subset of the actual data generated. There are innumerable data sets sequestered in this publication process. In addition, the research focus and experience of individual investigators determine what information is abstracted and published from data sets. Other information in the same data sets is neglected in this process.
A Central Flow Cytometry Resource would maximize the use of flow cytometry data and enhance collaboration between investigators. At least for the period supported by this grant, we intend to use the San Diego Supercomputer Center as this exemplary Central Resource. An added benefit of doing this is that it enables other investigators to mine this large data source using novel statistical strategies.
Searching and Visualization - Phase II
The Phase I testbed application core is the infrastructure for the Flow Cytometry Digital Library. Using this core, we provide access to the Library's Card Catalog, where individual Cards may refer to data sources in either an SQL database or a file server. Requested data is delivered as MIME types and transported as JARs (see Section I Part ). This scenario describes a general solution for providing distributed access and an efficient means to capture and search for information in digital libraries. Phase I is complete when we build an exemplary Central Resource (the Public Library) for Flow Cytometry data.
In Phase I, we provide the capability to search for data using experiment-centric attributes. This is a significant improvement over what is available today. In Phase II we attempt to improve our capability to do meaningful searches. We have divided this challenge into three parts. The first is to develop computer-assisted methods to find cell populations in n-dimensional data; the second is to describe these cell populations in a way that is machine-understandable. This is a high-risk undertaking since searching for populations in n-dimensional data is fundamentally a search for clusters. The third part is to build an antibody directory and an antigen thesaurus to encourage the use of a common vocabulary and thereby improve the reliability of library searches. These features will be included in the Card Catalog. Some of these Phase II development activities overlap Phase I.
Most flow cytometry experiments are attempts to characterize and define cell populations, whether they are cells of the human immune system, the newly discovered nanoplankton, or genetically-tagged Drosophila embryos. Our particular research interests are cells involved in human and mouse immune systems. In the human immune system, cells are categorized on the basis of more than 166 human CD antigens, which in different combinations characterize distinct functional and developmental immune cell populations. These cellular phenotypes are even more complex because distinct cell populations also express different amounts of the same cell-surface antigen. This is seen in the published literature as cell descriptions such as CD4+CD8dim cells, where CD8dim means dim fluorescence staining with fluorescence-tagged anti-CD8 antibody, which loosely translates to a low cell-surface CD8 antigen density15. Cell populations are also described by their functional phenotype, such as "killer cells;" and inescapably they are described using both cell-surface and functional phenotypes, such as "CD8+ killer cells." These descriptions make it difficult to search any print or digital literature with confidence. Searches are restricted to "What antigen names are included in the manuscript?" and "What antibodies were used?" rather than "What studies included this cell population?" We need methods to identify and describe these cell populations that are uniform and machine-understandable.
Visualizing N-dimensional Measurements in Two-dimensions
The biological significance of n-dimensional measurements using flow cytometry is described in Section I Part A. There are two aspects to finding distinct cell populations in n-dimensional measurements: the first is numerically finding the populations; the second is visualizing them. Ideally, we want an automatic procedure that enumerates structures (populations) and assigns numbers to these structures in raw n-dimensional data, without requiring the viewing of two-dimensional projections (gates) for visualization. Should
15 These descriptions are even more arbitrary due to the use of instruments with different detection sensitivities and the use of similar reagents with different fluorescence qualities. Flow cytometry practitioners have made great efforts to standardize the art.
Section C this be too difficult, it would still be significant to provide his functionality examining lower-dimension data after some pre-selection by user or machine interaction. Minimally we should find two-dimensional projections (gates) that show structure (populations) and present this information to the user for further interaction.
Requirements
Since n-dimensional flow cytometry data sets are very large, we need a computationally efficient representation of the density surface to support interactive visualization. The usual statistical approach to searching for mixtures of populations is to search for modes or local maxima in the density; however, experienced flow cytometry experts find structure in two-dimensional visualizations that does not correspond to modes. Hence a novel statistical procedure is required that uses a new way to find structure (populations). This procedure must be computationally efficient, because it may have to search many two-dimensional projections. Since we prefer an automatic procedure, the procedure should not have to rely on subjective parameters to be specified by the user (e.g., smoothing parameters, bandwidths, etc.).
The Solution is an Interplay of Statistics and Computation
The cell-surface antigen distributions measured using flow cytometry are clearly not normally distributed. There are theoretical arguments that distributions in biologically consistent cell populations should be log-concave (i.e., the log of the density is a concave function). A technical description of our approach is presented in Section I Part D. It generalizes approaches based on normal distributions, but is restrictive enough to allow good statistical results and computational implementation. It also seems very promising for other applications. Almost no work has been done on this type of statistics since the basic properties of log-concave densities were explored in the 1960s.
A mixture of log-concave densities can be shown to be always of the form exp(g(x) + c|x|^2), where g is a concave function and c > 0. A test of whether one or several structures (populations) are present is a test of whether c = 0. The logic of this model is very amenable to detecting cell populations:
• If a density has multiple modes, then some of its contours are not convex.
• If some of its contours are not convex, then it is not log-concave.
As none of the reverse implications are true, testing for log-concavity can find structure that cannot be found by looking at modes. (An alternate approach, albeit weaker, would be to test for convexity of the contours).
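Restated in notation (a sketch; here ℓ denotes the log likelihood of the observed events, and the likelihood ratio form anticipates the statistical decisions discussed below):

    \[
    f(x) \;=\; \exp\bigl(g(x) + c\,\lVert x\rVert^{2}\bigr),
    \qquad g \text{ concave},\ c \ge 0
    \]
    \[
    H_{0}\colon c = 0 \ \text{(a single log-concave population)}
    \quad\text{vs.}\quad
    H_{1}\colon c > 0,
    \qquad
    \Lambda \;=\; 2\Bigl[\,\sup_{g,\,c}\ell(g,c) \;-\; \sup_{g}\ell(g,0)\Bigr]
    \]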
The theoretical insight that is provided by this approach is extremely advantageous. Only a concave function and a constant have to be fit. The concavity puts a strong restriction on g that makes maximum likelihood estimation possible without any subjective smoothing parameters. For example, one can show that estimating a log-concave density by maximum likelihood has a unique solution, where the log is piecewise linear with knots at the observed data points in one dimension. In higher dimensions, the Delaunay triangulation arises.
The statistical decisions, for example, can be based on likelihood ratio tests. The computational problems can be reduced to methods such as Delaunay triangulation. The maximum log likelihood estimate is piecewise linear over this triangulation. This is attractive for visualization purposes, because surfaces are usually displayed on a computer as piecewise linear functions. Additional research will explore how to further simplify this representation in order to transmit it faster over the Internet. Clearly, a good approximation to the surface would only require a fraction of the observations in the triangulation.
Classification of populations (according to biological developmental or activation stages) could be based on properties of log(density), which is known to be concave, such as skewness, curvature, etc. This could provide the basis to numerically describe cell populations.
Role for Supercomputing
Phase II feasibility may be determined initially using a limited number of data sets. However, we foresee the need to let this research loose on large subsets of the hundreds of gigabits of data included in the Phase I testbed application. The need for the computing power necessary to accomplish this is another reason we have established a partnership with the National Partnership for Advanced Computer Infrastructure (NPACI), Molecular Sciences Thrust Program. Our partnership assures that we not only have a committed Central Resource for the Flow Cytometry Digital
Library, but that we have the computing power needed to test new data analysis procedures on large volumes of data. As a Central Resource, the Digital Library also ensures other investigators access to "real" data in order to explore other novel methods to extract information and insights. Since the entire library infrastructure is built using Internet Standards, other investigators and commercial vendors may build their own unique solutions to finding and naming cell populations.
A Directory of Biological Materials - Phase II
The diversity and quality of biological reagents, particularly monoclonal (hybridoma) antibodies, are critical to experimental results. The clone identity of the antibody and its particular preparation and conjugation with a fluorescence reporter group (and its identity and preparation) must be recorded if an experiment is to be reproduced. The other face of this coin is the antigen target of monoclonal antibody reagents. Most important are the human CD antigens and their equivalents in other species. Like any biological naming system, the history of antigen discovery in multiple species creates a plethora of aliases for most antigens.
More Cards
Phase II activities include compiling a reagent dictionary and an antigen thesaurus as part of the Card Catalog (i.e., the directory service). We foresee at least two interfaces to this part of the directory service. General library users (diagrammed in Figure 1) will access this information using Web browsers with JAVA plug-ins or Web pages with embedded ODBC links or XML DTDs; while flow cytometry instrument users will have access using applications like the FACS Desk Protocol Editor. Examples of the directory service schemata for antibodies and antigens are provided in Section I Part B. Phase II activities also include gathering additional input on these schemata.
The importance of adding antibody and antigen cards to the Catalog is that they provide a single source, a single namespace, encouraging use of a unified scientific nomenclature16. Other laboratories and third-party vendors may access the same dictionary and thesaurus by LDAP-enabling their own applications and Web pages. Using this common vocabulary will significantly increase the probability that searches within realms using the Card Catalog will return good results.
Citation Index
Throughout our efforts we have focused on managing the entire life cycle of flow cytometry data. Part of this data life cycle includes manuscripts based on flow cytometry experiments. In turn, the life cycle includes manuscripts referring to other manuscripts. In essence, the data life cycle includes bibliographic references and referrals (or a citation index). If there is enough interest, the Card Catalog can include these attributes17.
Special Cards
The Card Catalog may contain any "bits and bytes" abstracted from other data sources. We envision that future applications will include "special cards." The first candidate for "special cards" may be abstracted descriptions of cell populations from raw flow cytometry data. This would enable searching the "literature" for cell populations rather than searching for the use of particular antibodies or an appropriate combination of keywords.
In Conclusion
Our current and previous efforts have always viewed The Challenge as managing the entire life cycle of flow cytometry data. We have singularly maintained a user-centric focus, where the user has always been the bench scientist, and not the principal investigator or the project leader or the research director. The process begins with designing experiments, includes running a flow cytometer, collecting and storing data, and analyzing and publishing results. We view this process as managing a flow cytometry laboratory notebook. We do not superimpose management layers on this process nor do we focus on managing results (permuted data) and reports. We do focus
16 At the same time, the cards are flexible enough to accommodate differences and scalable enough to include extensions.
17 Example schemata described in Section I Part B include manuscript attributes.
Figure 2 - Internet Application: Digital Library
Figure 3 - Federated Card Catalogs
B. Cytometry on the (In/Ex)t(er/ra)net: Scientific Data
The "Scientific" Data Model
Scientific data sets are a chimera of large volumes of simply structured numerical data and modest volumes of primarily textual annotation information with very complex logical structure. The strategies and tactics for dealing with these two components are very different.
Arrays of Numbers
In this context, images, univariate and multivariate histograms, and list mode flow data are all represented as numerical arrays of various dimension and size. There are relatively few choices for storing numerical data arrays, and the issues are mainly tactical ones. In some specific cases substantial compression is possible; in others it is not.
There are several existing application and image content types with which it will be useful to remain interoperable. Although it is officially discouraged (and requires an RFC), I think that a full MIME content type to handle these data sets is justified. There are sufficient scientific visualization and statistical discovery tools, etc., which can manipulate "scientific" data to make it beneficial to separate it from generic application data. Scientific data sets should be treated as a media type such as audio or video.
Attribute Hierarchies
In addition to numerical data, scientific data sets need to contain a great deal of additional information that allows the numerical data to be integrated into a larger experimental context. DICOM has an elaborate object hierarchy and specifies ways for moving it about. HDF is at least compatible with implementing such a hierarchy, either using vgroups and/or annotations. Historically, the lack of such hierarchical structure was a major criticism of FCS from the beginning and influential in our decision not to use CDF some time later. An exciting new possibility is storing (or replicating) this information separately in LDAP or JNDI directories.
Instrument Journal
Documenting the behavior of scientific instruments requires, in addition to fixed attributes, i.e., values not changing in time, attributes which are set or measured at particular times and which may take on many different values during an analytical procedure, either for scientific or technical reasons. We will define a simple text format, which we refer to as an "instrument journal," intended to deal (reasonably) efficiently with this case.
MIME Proposal
Rather than propose a single all-purpose standard, we advocate refinements of and guidelines for using several existing standards with the goal of enhancing interoperability and transportability. We propose using one well-established standard, MIME, and a newly proposed standard, JARs.
Why MIME
MIME headers and content can be parsed by simple rules, which allow lightweight applications to parse and retrieve the information they need and ignore information they don't need or understand. (Historically, another major criticism of FCS was the failure of the HEADER, ANALYSIS, and OTHER segments to achieve this.) MIME headers are text, so knowledgeable humans can read and interpret them. This facilitates development and maintenance of lightweight applications. MIME is flexible enough to encompass complex applications. Various implementations based on MIME are widely available on many platforms. It is widely and heavily used on the Internet. Software for parsing MIME headers exists on any system that implements SMTP (e-mail) or HTTP (World Wide Web).
MIME content can be reliably, and in some cases securely, transported by the standard protocols of the Internet: FTP, SMTP, HTTP, HTTPS, etc. It is even possible to send MIME messages containing binary data through text-based e-mail systems.
Why JARs
JARs are a MIME-flavored standard advanced by Sun and JavaSoft to implement secure and efficient transport of Java applets and their resources to clients on the Internet. They combine MIME content with manifest and signature files, which provide packaging and error detection as well as optional compression and signature verification for either individual elements or the whole contents. JARs are based on the popular and widely available ZIP format. (NASA maintains a public archive of freeware programs to read and write ZIP files on many machines. ZIP is expected to become a documented API in the Windows operating systems.) JAR implementations are freely available on the Internet as part of the Java Software Development Kit. It is also incorporated into Netscape's product suites, which are free to educational and non-profit users.
New MIME content types for cytometry
The MIME standard defines an open-ended set of content types. I will specify several new content types specialized for statistical and cytometry data types for which existing types appear insufficient. In addition, I will define additional semantics that can be used with some existing types to enhance their utility for cytometry applications.
Annotations - text/ldap-interchange
LDAP defines a simple text encoding, LDIF, which can be used to transport directory trees and subtrees. A text type is chosen so that power users and implementers will be able to read the files for development and maintenance. For Web applications the volume of the annotations is not likely to be so large as to cause problems, and these files can be substantially compressed using the standard ZIP algorithms.
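Purely for concreteness, a fragment of the sort of LDIF text such a part might carry; every attribute name below is a hypothetical stand-in for the schemata of Section I Part B:

    dn: cn=exp-001,ou=Workspace,cn=A User,o=Consortium
    objectClass: top
    objectClass: facsExperiment
    experimentTitle: CD4/CD8 subset staining
    antibodyReagent: anti-CD8, PE conjugate
    dataSourceURL: ftp://stacks.example.org/exp-001/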
Instrument Journal - text/instrument-journal
A text type is chosen for the same reasons given above. Each change of an attribute value constitutes an event that specifies an attribute, a new value, the time (UTC), and an agent identifier. The agent field indicates the source of the change; for example, it should indicate whether the change was initiated by the operator or by an auto-calibration utility, auto-sampler, or some other experimental sequencing apparatus. Time and agent data in journal files can be compressed by storing delta times, i.e., differencing, storing only the changes between agents, and prefix compression of the attributes and agents.
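The exact syntax remains to be defined; purely as an illustration of the event model (time, agent, attribute, new value), a few hypothetical journal lines might read:

    1998-11-03T17:42:08Z  operator        FL2-PMT-voltage  612
    1998-11-03T17:42:08Z  operator        sample           12345.A1
    1998-11-03T17:55:30Z  auto-calibrate  FL2-PMT-voltage  618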
One Dimensional Histogram - scientific/univariate-histogram
Data from samples with local concentrations, for example chromosome data, or small to medium data sets of very high resolution, might be compressed by the methods discussed under the multivariate histogram types. Histograms from very large data sets might be compressed by differencing. Single variable histograms at reasonable resolution are not so large that compression is very important, because the data transfer time is small compared to the connection setup time.
Two Dimensional Histogram - scientific/bivariate-histogram
A scientific type is chosen to allow content handlers to return the histogram as tables or matrices to spreadsheets, visualization, and statistical analysis programs. An image type also makes some sense in theory, but we think it is less likely to be useful in practice. Two or more dimensional histogram data is highly compressible. Because the number of bins goes up as a power of the resolution while data collection time goes up linearly with total count, the cells-per-bin ratio is low in a large histogram. Therefore most of the bin counts are zero or small numbers. Very large numbers are also rare, because if many bins had large counts the total sample size would be huge. The current implementation uses a variable word length code to store histograms and already achieves about an order of magnitude compression on our typical data. (A fact which has significant implications for cytometry Web application design.) We are conducting additional investigations to further refine this method. It appears that run compression of the zeros may yield significant additional compression. It would also be desirable for the algorithm to choose the code at run time based on the sample size and number of bins, on the basis of a theoretical analysis of this relationship. The final version of this algorithm will be codified for the standard.
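The variable word length code of the current implementation is not reproduced here; the following is a minimal sketch combining the two ideas under investigation (a variable-length integer code plus run compression of zero bins), not the algorithm that will be codified for the standard:

    // Sketch only: compress sparse histogram bin counts with a zero-run
    // marker plus a simple variable-length integer code. A zero byte
    // marks a run of empty bins; literal counts are always nonzero.
    import java.io.ByteArrayOutputStream;

    public class HistogramCodec {
        // Encode a non-negative count, 7 bits per byte, high bit = "more"
        static void writeVarInt(ByteArrayOutputStream out, int v) {
            while (v >= 0x80) {
                out.write((v & 0x7F) | 0x80);
                v >>>= 7;
            }
            out.write(v);
        }

        public static byte[] encode(int[] bins) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (int i = 0; i < bins.length; ) {
                if (bins[i] == 0) {
                    int run = 0;                 // count a run of empty bins
                    while (i < bins.length && bins[i] == 0) { run++; i++; }
                    writeVarInt(out, 0);         // marker: a zero run follows
                    writeVarInt(out, run);       // length of the run
                } else {
                    writeVarInt(out, bins[i++]); // small counts fit in one byte
                }
            }
            return out.toByteArray();
        }
    }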
List Mode Data - scientific/cytometry-list-mode-data
A scientific type is chosen to allow content handlers to return the list mode data as a table to spreadsheet, visualization, and statistical analysis programs. List mode data from whole cells is generally not compressible to a useful degree (Bigos). In particular cases, time data in list mode may be compressed by run compression or differencing. This proposal does not support any type of compression for list mode data other than the bit packing it requires.
Nested loops in the pack and unpack routines will be most efficient if the inner loop is the longer. The inner loop will be fastest if the bit size is constant while it executes. Therefore data should be transmitted in column major form. Computation, permutation, and projection of flow data sets are facilitated if they are manipulated on a column-wise basis. That is to say, the data for each parameter is treated as a homogeneous array of values that may be accessed independently of the other measurements. Row major order is more natural during data collection and other real time use, but would be less efficient for transmission, storage, and analysis. Column major order may seem awkward for real time use, but aside from reasonable buffering (essential in network applications anyway) it does not impose other restrictions or performance penalties on live displays.
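A minimal sketch of such column-major bit packing, with a constant bit size inside the inner loop; the 10-bit value size in the example is an illustrative assumption:

    // Sketch only: pack one parameter column of list mode data into a
    // bit stream. The bit size is constant while the inner loop runs.
    public class ColumnPacker {
        public static byte[] pack(int[] column, int bitsPerValue) {
            byte[] out = new byte[(column.length * bitsPerValue + 7) / 8];
            int bitPos = 0;
            for (int i = 0; i < column.length; i++) { // one column at a time
                int v = column[i];
                for (int b = bitsPerValue - 1; b >= 0; b--) {
                    if (((v >> b) & 1) != 0) {
                        out[bitPos >> 3] |= (byte) (0x80 >>> (bitPos & 7));
                    }
                    bitPos++;
                }
            }
            return out;
        }

        public static void main(String[] args) {
            int[] fl1 = {512, 13, 1023, 0};  // one parameter's event values
            byte[] packed = pack(fl1, 10);   // 10 bits per value (assumed)
            System.out.println(packed.length + " bytes"); // 40 bits -> 5 bytes
        }
    }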
FCS Compatibility - application/FCS
The society should specify and register a MIME type to represent its existing standard, FCS. Typically this would be "application/FCS" or "application/org.isac-fcs." An application type seems required.
Compound Types
Two alternatives are proposed for combining cytometry content items possibly with other content into a compound document for transport or storage.
Multi-part MIME Encoding
The MIME multipart type is designed to transmit a series of MIME content items as a unit. It is fairly simple to implement and widely used, but not in itself secure or absolutely reliable.
Java Archive (JAR) Encoding
JARs are a new standard designed for secure and reliable transmission over the Internet. It provides reliable transport and optional compression, with the possibility of digitally signing individual content items or the whole collection. A competing Microsoft technology (CABinets) seems to be less suited for cytometry use at this point because it is not widely accepted, is largely MS-specific, and is not as freely available. This may not be true for all users and could change.
A new MIME type, multipart/scientific-archive, might be useful if extensions to JARs are defined for scientific use.
Java Binding and Implementation
In order to facilitate evaluation of this proposal, we are providing a set of abstract interfaces which embody these concepts and are planning a preliminary implementation. Clarity of exposition rather than complete efficiency will be the main goal of this implementation, but the methods chosen are computationally efficient for moderate amounts of data, e.g., a single instrument run or medium size experiment.
Annotation Information
For access from Java, JNDI provides most of the API necessary to access the annotations. A service provider (which actually carries out JNDI requests) is available for LDAP, and there are experimental implementations based on the host computer's file system or its main memory. Service providers which can look into JAR files and FCS files can and should be developed. Since JNDI allows federated namespaces, it would then be possible to have quite a powerful (though not necessarily fast) directory service locally without a true LDAP server. A utility to import an LDIF file into a suitable JNDI directory would also be useful.
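A hedged sketch of that import utility, handling only minimal LDIF (a dn: line, attribute: value lines, and blank-line record separators; base64 values, changetypes, and line continuations are omitted):

    // Sketch only: import minimal LDIF records into a JNDI directory.
    import javax.naming.NamingException;
    import javax.naming.directory.*;
    import java.io.*;

    public class LdifImport {
        public static void load(BufferedReader in, DirContext ctx)
                throws IOException, NamingException {
            String dn = null, line;
            Attributes attrs = new BasicAttributes(true);
            while ((line = in.readLine()) != null) {
                if (line.length() == 0) {          // blank line ends a record
                    if (dn != null) ctx.bind(dn, null, attrs);
                    dn = null;
                    attrs = new BasicAttributes(true);
                    continue;
                }
                int colon = line.indexOf(':');
                if (colon < 0) continue;           // skip malformed lines
                String id = line.substring(0, colon);
                String value = line.substring(colon + 1).trim();
                if (id.equalsIgnoreCase("dn")) {
                    dn = value;
                } else {
                    Attribute a = attrs.get(id);
                    if (a == null) attrs.put(id, value);
                    else a.add(value);             // multi-valued attribute
                }
            }
            if (dn != null) ctx.bind(dn, null, attrs);
        }
    }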
Numeric Information
A Java package to handle numerical or scientific data needs to be developed. The design of HDF is in many ways a good beginning, but starting afresh with object-oriented design principles seems warranted. This package should emulate JNDI and define separate API and SPI so that a variety of implementations and representations of scientific data can interoperate. The package should, at minimum, support the following operations (a possible Java shape is sketched after this list):
• Access each data value.
• Access a slice of the full data.
• Iterate over index values (and slices).
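A minimal sketch of the interface shape this list suggests; all names (ScientificArray, rank, slice, indices) are placeholders, not the package's actual API:

    // Sketch only: one possible shape for the numeric data API.
    public interface ScientificArray {
        int rank();                           // number of dimensions
        int size(int dimension);              // extent along one dimension
        double get(int[] index);              // access each data value
        ScientificArray slice(int dimension,
                              int position);  // access a slice of the full data
        java.util.Iterator indices();         // iterate over index values
    }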
One Dimensional Histogram - table/isac-univariate
Two Dimensional Histogram - table/isac-bivariate
List Mode Data - table/isac-list-mode-data
Instrument Journal - text/instrument-journal
• Find the value of a specific attribute at a specified time;
• Find the complete state at a specified time, i.e., the values of all attributes at that time;
• Advance the state information to the next distinguished time point;
• Restore the state to the previous distinguished time point;
• Enumerate the values of all or a subset of attributes at a given time;
• Enumerate all the times an attribute or a group of attributes changed and their values;
• Enumerate the events between two time points.
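One possible Java shape for these journal operations (a sketch; every type and method name below is a placeholder):

    // Sketch only: the instrument journal operations as a Java interface.
    import java.util.Date;
    import java.util.Enumeration;
    import java.util.Map;

    public interface InstrumentJournal {
        String valueAt(String attribute, Date time); // value of one attribute
        Map stateAt(Date time);                      // all attribute values
        Date advance();                              // next distinguished time
        Date restore();                              // previous distinguished time
        Enumeration changes(String attribute);       // change times and values
        Enumeration events(Date from, Date to);      // events between two times
    }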
FCS Compatibility - application/FCS
Convert to multi-part MIME, combining annotation constructed from the TEXT segment and zero, one, or more cytometry data content parts constructed from the DATA segment.
Convert directory information and a numerical array to an FCS file (to the extent possible).
Discussion
Biases
This proposal is Web-centric. We regard this as a virtue because it focuses on achievable goals and delivering real services to an existing and rapidly growing community. It does lack the ideological purity of a universal standard. Our current work is also Java-centered but is not a priori biased toward any specific computer architecture or language. Indeed, the vast majority of current MIME, ZIP, and JAR implementations are probably in C or C++.
Contrast with DICOM and HDF
DICOM is arguably the better standard technically. It uses object-oriented design principles and has a well-defined model of the data objects. However, it was developed (by radiologists and equipment manufacturers) in a clinical setting and has a heavy emphasis on interfacing with Picture Archiving and Communications Systems (PACS) and HIS/RIS (Hospital/Radiology Information Systems). The data model is also heavily clinically oriented in design. For example, you can specify the patient's mother's maiden name and their health insurance status, but concepts such as "patient" species, inbred line, cell culture, or sea water sample are not available. The standard does allow for inclusion of flow data in a technically clean way. However, all the existing types are image types of various sorts. It is unlikely that typical DICOM clients will have any knowledge of how to manipulate flow data. Of course, given sufficient motivation (on the part of clinicians) the standard does allow for this in the future. Something of this sort
will clearly be necessary if flow cytometry is to become clinically important, because it will then be necessary to interact with HIS.
DICOM contains a scheme for generating unique universal identifiers for its modeled objects. This allows efficient coding and facilitates consistency by central management of the object model. This makes a great deal of sense when working with the large health care bureaucracy, but is unrealistic for basic science, where the models are still being developed and are diverse and fluid.
HDF was not object-oriented by design. Some work on suitable object models to encapsulate it has been done for C++ and Java. Nevertheless, the HDF model does allow for a clean representation of the proposed cytometry objects model, so the lack of OO principles in HDF itself need not be a barrier to interoperability.
As mentioned above, various MIME and JAR compliant implementations are freely available. This is also true of HDF, because NCSA maintains an Internet archive of implementations for the most popular operating systems and architectures. A public domain beta implementation of DICOM is available from UCSD for a variety of platforms, but while DICOM is fairly mature and considered a "standard," the available implementations are not. (This of course will change.)
Implementing a simple program to collect, manipulate, or analyze data with little or no code library support is possible with MIME but not possible with either DICOM or HDF. If implemented properly, i.e., consistently with the cytometry objects model, data from such simple systems can easily be imported into more advanced systems.
Importance of Interoperability
An important (and achievable) goal for the field of cytometry is that users of all these formats be able to convert their data between them conveniently when necessary. In the context of the Web, this conversion facility can be delivered by data servlets. Interoperability with existing standards also facilitates the combination of flow and image cytometry data with other modalities for analysis of higher level experiments and analysis within a broader perspective. For example, information from genetic or protein databases, time series, or clinical trials may need to be combined with statistical summaries of gated flow data or processed image data for complete analysis of an experiment.
Gateways
We intend to provide FCS -> Internet and Internet -> FCS gateways. We are considering the feasibility of Internet -> JMP. HDF -> Internet is quite feasible.
Cytometry Objects Proposal
Interoperation of various scientific data analysis and visualization systems, including those specifically designed for cytometry, will be enhanced if the sample annotations follow a consistent abstract data model. We propose the following model for the documentation of cytometry data sets. This proposal does not specify exact implementations, but rather that all compliant implementations specify a well defined invertible relationship between implementation dependent objects and the model objects. Bindings consistent with the practice in DICOM should be implementable; partial consistency with FCS is possible. I have provided preliminary MIME and FCS bindings, primarily for concreteness and to facilitate discussion of the proposed model.
Organization FCS=$ORG
Identifies an organization with which the instrument, investigator, or operator is affiliated.
Investigator FCS=$EXP
Identifies the investigator responsible for the data. If .Investigator.Organization is null, use .Organization.
Operator FCS=$OPR
Identifies the operator responsible for the instrument during data collection. If .Operator.Organization is null, use .Organization. If .Operator is null, use .Investigator.
Sample Source
Identifies the source of the material analyzed, for example a blood draw, tissue sample, or water sample.
Sampled Individual
When the sample is taken from an individual human or animal that may be sampled more than once, e.g., by repeated blood draws, this attribute should distinguish the individual. It should be unique at least relative to the protocol, preferably relative to the project or institution.
In this case the sample source should distinguish the samples from this individual.
When a human subject is involved, this attribute should not contain identifying personal information, so that ordinary data can be exchanged without compromising privacy. A separate database relating the experimental subjects to actual persons should be maintained securely. (What does the FDA say?)
Instrument FCS=$CYT
Uniquely identifies the instrument with which the data were collected.
If .Instrument.Organization is null, use .Organization.
Protocol FCS=$PRO
Distinguishes this session on the instrument from all other sessions on the same instrument.
<Instrument>.<Protocol> must identify the session uniquely.
Prepared Sample FCS=$SMNO
Identifies the well, tube, or slide which was sampled by the instrument during a given data collection operation, relative to the current protocol.
The relationships between Sample Sources and Prepared Samples are what we generally refer to as "protocol information." In addition to documenting the data collection procedures, these attributes are critical for joining the results of flow analysis with other data.
Data Sample or Replicate
When data are collected from the same prepared sample multiple times, this attribute uniquely identifies the replicates.
<Instrument>.<Protocol>.<Prepared Sample>.<Replicate> must identify a data set uniquely. For example, the current archive implementation represents Flasher.12345.A1.a as "Flasher 12345 A1a" in the user interface.
The following is a draft (November 3, 1998) of a chapter for a 2-volume set entitled Automating 21st Century Science, edited by Richard Lysakowski and colleagues.
Data Annotation: The Heart of the Laboratory Notebook
Lee Herzenberg, Wayne Moore, David Parks, Len Herzenberg and Vernon Oi (Order? Zahava?) Genetics Department, Stanford University Medical School, Stanford, CA 94305
1. Introduction
Although there are nearly as many different definitions and descriptions of Electronic Laboratory Notebooks (ELNs) as there are ELN designers and users, virtually all current designs tend to approach the central problem of data management and storage of data from the perspective of the laboratory manager. Few, if any, address the annotation and data access problems that plague the bench scientist who collects the data the ELN is intended to manage. However, since uncontextualized (raw) data cannot be interpreted without access to information about the samples, reagents, methods and instrumentation used to generate it, automation of the data capture and annotation process should be central to all ELNs. Furthermore, since experiments are usually planned, executed, analyzed and interpreted at different sites and often by different people, ELNs must provide global mechanisms that link and serve the information that accumulates as an experiment progresses. Thus, focus on the way bench scientists ply their trade, and how they transfer information to each other and to their managers, is essential if ELNs are to reach full potential as management tools.
In essence, ELNs must be designed with the recognition that data are only useful when collected and annotated so they can be viewed within the context of the experiment and study in which they were generated.
This means that ELNs must incorporate three related functions. First, they must provide simple and reliable ways to electronically define a specific experiment within an overall study, in order to create the context for data collection. Second, they must provide a non-volatile pointer or link between the experiment definition and the data being collected, so that the data can always be interpreted in its appropriate context and the context can always find its data. Finally, they must provide mechanisms for electronically storing findings - analyses and interpretations of data - within the context of the experiment and the overall study.
These functions are simple in theory. However, the voluminous data and experiment context information that bench scientists collect, and the instrumentation these scientists use, are so complex and varied that the construction of an ELN capable of serving scientists working in unrelated disciplines is truly a task for the 21st century. Indeed, even conceptualizing the needs of such scientists requires a specialized interaction between software engineers and working scientists. We are well aware that the model ELN discussed in this chapter owes its life to the fortuitous partnership developed within our laboratory, in which biologists whose cell analysis work demands extensive computer support were working side by side with electronic and software engineers dedicated to providing that support. It happened this way. The biologists understood what information they wanted but not how to obtain it or whether it was possible to obtain. The engineers understood what could be done and how to do it, but didn't know what was worth doing. Together, the two groups tumbled through implementations and evaluations to develop a software and instrumentation package that, although still rudimentary, would support state-of-the-art experiments in molecular biology, cell biology, immunology and medicine. This package, which currently serves several hundred scientists, models the conceptual advances in overall ELN design discussed here.
ELN: The basic need
When the bench scientist does an experiment, it is usually part of a larger study aimed at testing a particular theory, developing a particular product or defining the characteristics of a particular process. Often several scientists will collaborate in the study, with one or more being involved in the analysis and interpretation of the study data rather than in the bench work that generated it. The aims of the study dictate the kinds of experiments to be done, the instrumentation to be used and the kinds of data to be collected. The bench scientist translates this into a series of experiments, the details for each being recorded initially as a plan of action often referred to as the experiment protocol and the data for each being recorded and interpreted in the context of the information in the protocol.
Protocols for experiments specify the samples and reagents that will be put in the test tubes, the planned incubation time and conditions, the specific instruments that will be used for data collection and any instrumentation settings unique to the experiment. In addition, they contain information recorded to enable data interpretation, including the relationship of the experiment to the overall study, the origin(s) of samples, the origin(s) of reagents, and notes concerning any anomalies that occurred during sample addition or incubation.
In general, experiment protocols are constructed and entered into the scientist's paper notebook before the experiment begins. They are usually displayed on the bench as the test-tube additions are made and are brought along during data collection for final annotation concerning instrumentation conditions and data collection anomalies. In addition, because data is still collected manually with some instruments, the protocol is sometimes used as a template in which data read from instrument dials is directly recorded in association with the protocol information for the sample. This simple system, the cradle from which contemporary laboratory notebook practice developed, is ideal in that it juxtaposes protocol information and experiment data. Thus, although labor intensive, it maximally facilitates interpretation of the data in the context of the experiment in which it was collected.
The first wave of automated instrumentation disrupted this natural relationship between protocol information and hand-recorded notebook data. At first, data output came in printed form that required only a little deviation from past notebook practice. Some scientists simply copied the printed data into appropriate columns in the notebook. Others, worrying about introducing errors during the copying process, just pasted (literally) the raw data printout beneath the protocol in the paper notebook. When additional computations were required (e.g., to convert raw data to standard units), the data from the printouts were usually typed into a calculator or computational program that generated a printout, which was then pasted into the notebook below the raw data.
Even this minimal separation between protocol information and data output introduced difficulties in constructing tables that re-associate the yin and yang (protocol and data) of the experiment. However, these difficulties pale in comparison with the current situation. Protocols still tend to be entered into paper notebooks, but sample and subject descriptions are often in files or electronic spreadsheets. Most data acquisition instruments are supplied with digital output systems, but these usually interface to dedicated computers that are often alien to the scientists. In addition, although database and file management systems abound, mastering their intricacies is beyond what most bench scientists are willing (or able) to attempt. Thus, file naming, file transfer and file organization fall to the scientists, who eke out their living in an electronic Tower of Babel. Is it any wonder that they often find it easier to print everything and paste (or scan) it into the notebook than to wrestle with bringing the relevant information together on line?
Graphical representations, the heart of scientific data display, pose additional problems since axes and other elements in these representations must be labeled with sufficient information to make the data interpretable. Originally, graphs were drawn and labeled by hand. Usually, the axes were fully labeled before the data points were put on the graph. The introduction of spreadsheets and other computer programs that render data graphically, however, once again disrupted the natural connection between protocol information and data output. Some connection remains when data are typed into modern spreadsheet programs, since axis and curve labels are usually assigned from column headings entered with the data. However, when data is reported by instruments that produce graphical output, labels are mainly instrument-centric. Some manufacturers provide methods (usually slow and painful) for replacing the instrument-generated labels, but handwritten replacement on paper output is necessary in most cases.
In sum, the automation of laboratory equipment has far outstripped the innovative development of ELNs that support the recording and contextual interpretation of data collected with these instruments. While protocols today are still constructed and entered into paper notebooks, most laboratory data is reported in printed or electronic form. Data are mainly identified by the position of the sample in the readout system array and must be manually associated with the protocol information that gives meaning to the data. Thus, over the years, the bench scientist has basically traded the laborious task of hand data collection and entry for the dubious pleasure of finding ways to re-associate protocol information with data collected by instruments that basically ignore the problem.
ELN: The complete need
The bench scientists' difficulties in linking protocol information with instrument-acquired data are the Achilles heel of the ELN. There is little point in developing ELNs that manage the output of groups of scientists if that output is flawed by inadequate annotation of the data that each of the scientists collects. Unless ELNs provide the means for automation of protocol development and data collection as an integrated process, they are bound to fail under the "garbage in, garbage out" rule. Therefore, to succeed, ELNs must provide adequate mechanisms for automating the annotation of data collected as experiments proceed. This essentially means devising a system to capture protocol and study information and to appropriately associate that information with the data that is collected.
Most ELN designers have shied away from this task, largely on the basis of it being too undefined. The diversity of instruments used to collect data would seem a sufficient caution, let alone the diversity of experiments that can be done. However, from the standpoint of the bench scientist with long experience in the laboratory, the specifics may differ but the underlying process - the way information flows - is the same for nearly all studies in today's laboratories.
In essence, the basic ELN unit is the Data Collection Session (DCS), during which a particular instrument is used to collect data from samples treated according to a particular protocol. Experiments commonly have multiple protocols and/or multiple DCS, either because the same or similarly-treated subjects are sampled at intervals or because a single set of samples is treated according to different protocols and analyzed with different instruments. Studies typically consist of one or more experiments, the goals for each being defined by the overall design for the study. To be useful at the study level, data collected at the experiment level must be appropriately annotated with information about the samples and treatments in the study just as data collected in each DCS must be annotated with information about sample treatment, instrumentation, etc. Therefore, to be useful, the ELN must provide the mechanisms for annotation and integration of information and data at all levels in the study.
The information flow for a single DCS in a multi-experiment study (see Figure 1) can be visualized as a descent and subsequent ascent through a series of levels, each of which is responsible for handling certain protocol or study information. During descent, each level acquires and retains specific information, e.g., overall protocol for the DCS, individual sample and reagent descriptions, instrumentation set up, etc. At the lowest level, data is collected by the instrument. During ascent, the information "retained" at each level is successively joined to the data set so that it can ultimately be interpreted and integrated at the study level.
Figure 1 - An Anatomy of Laboratory Research (figure labels include: Study; Patents, Publications; Answers, Interpretations)
To date, instrumentation manufacturers have typically dealt with the lower levels in Figure 1 while ELN efforts have typically been directed toward the upper reaches. The middle ground, where the key action takes place, is left in limbo — more specifically, in the hands of the bench scientist responsible for making the connections that turn raw data into findings.
Strangely, scientists rarely seem conscious of this problem. They have dealt with it throughout their training and see it largely as "part of the real estate." Therefore, they only ask for solutions when the data load is too great or the nature of the data analysis is such that hand-linking protocol information and analytic output seriously impacts the ability to do productive work. Even then, perhaps because it is difficult to imagine how to input the relevant protocol information in useful form, scientists commonly ask only for partial solutions. Thus, it is not surprising that this problem has escaped the attention of most instrumentation companies and ELN designers.
Our laboratory, however, has been virtually forced to deal with these issues. The FACS instruments that we developed are central to our biological and medical studies. However, from the beginning, the bench scientists in our laboratory had difficulty dealing with the numerous and voluminous multi-parameter data sets that these instruments generate. They were constantly struggling to keep track of the data they collected, since it had to be stored on digital media that could not be pasted into notebooks. Furthermore, they were severely limited by the difficulties involved in integrating protocol information with data for individual samples, e.g., to appropriately label graphical output.
On the other hand, our engineering group was committed to improving the instrumentation to increase sample processing speed and to enable more measurements per cell. The net result, we knew, would be a continued increase in the size and complexity of the data sets and hence in the difficulties in interpreting the data. Thus, to avoid eventual stalemate, we committed the resources necessary to automate the collection, annotation, storage, retrieval, analysis and display of this complex data. The resultant system, described in some detail below, constitutes a working model of an ELN that, although primitive, serves many of the needs of bench scientists who use instrumentation to gather data.
FACS/Desk: a working ELN prototype
Our current ELN software (if it can be dignified by that name) provides a first-pass solution to the problems involved in associating protocol information with output data in a readily accessible, online form. This system (which we call FACS/Desk) was designed to meet the challenge of organizing access to flow cytometry data files, which are too big and too numerous to handle in the normal paper-notebook manner. It solves what could be considered the minimal ELN problem. At the experiment level, it provides for entry of protocol data, collection and storage of the data, permanent association of the protocol information and the collected data, long-term data storage, ready retrieval of stored data, specialized computation and display algorithms and, most important, specification of computations and display of computed data in the context of the initially- entered protocol information, i.e., with graph axes and table columns heads automatically assigned on the basis of reagents used for the sample for which data is being displayed.
This FACS/Desk system has served us well, and continues to do so even today. However, we have recently begun to build its replacement, which will incorporate modern browser-based data access and a wide variety of other features. In addition, with this new design, we have begun to address the larger ELN issues involved in drawing together data from the series of related experiments that together constitute a study. Discussing our approach to these problems, and the model system in which prototype implementations are now being evaluated, requires a short digression to explain the history, operation and uses of the instrumentation that constitutes the driving force behind our work.
A brief history of FACS instrumentation and software
Several thousand medical and biological laboratories at locations throughout the world currently use flow cytometry instruments (Fluorescence-Activated Cell Sorters and analyzers, aka FACS) to count or study the properties of different types of cells co-resident in blood or other organs. Blood samples from HIV-infected people, for example, are routinely analyzed with these instruments to monitor CD4 T cell counts, an index of the progress of the infection and the effectiveness of therapy. Cells from influenza-infected mice are studied for clues to immune responses to the infection. Cells from fruit flies are examined for novel gene expression patterns. Genes are introduced in human or mouse cells in culture and the cells expressing the genes are sorted and studied. Cancer cells are typed according to their physical features and the kinds of cell surface molecules they have. The list is as endless and varied as the studies done to improve human health or to understand the molecular mechanisms operating in plant and animal cells.
The idea of building an instrument that could serve these varied needs originated in our laboratory in the early 1960s. At that time, the roles that lymphocytes play in the immune system were just beginning to be understood. Our principal research effort was directed towards distinguishing human and mouse lymphocyte subsets and determining the function of the various subsets that we and others identified. However, the available methodology was extremely unsatisfactory for this purpose. In essence, we could use fluorescence microscopy to visualize distinct lymphocyte subsets in cell suspensions that had been incubated with various anti-lymphocyte antibodies coupled to fluorescent dyes. These antibodies individually recognize (specifically bind to) distinct surface molecules (e.g., proteins) that are selectively expressed on various lymphocyte subsets. Therefore, we could identify and count subsets of cells according to the fluorescent-labeled antibodies they bound, but we had no way of isolating one lymphocyte subset from another. To unambiguously distinguish the functions of these subsets, we clearly needed the ability to sort viable cells identifiable as belonging to particular subsets by the fluorescent-labeled antibodies they bound.
In search of a solution to this problem, Len Herzenberg (the head of our Stanford University laboratory) went to visit Mack Fulwyler, a researcher at Los Alamos who had built an instrument to characterize radioactive particles present in the lungs of mice and rats. To separate the particles according to size, Fulwyler was using sorting technology developed by Dick Sweet, a Stanford engineer unknown at the time to Herzenberg. Sweet's method, later to become the technological principle behind the inkjet printer and current cell sorting systems, puts an electrostatic charge on individual, very tiny droplets breaking off from a stream a short distance below the tip of the nozzle from which the stream emanates. Just below this "break-off point," an electric field steers the charged droplets to desired locations while passing uncharged droplets without deviation. Fulwyler devised a method for introducing particles into the stream so they would be individually encapsulated in droplets. In addition, he developed an interrogation system that would identify the droplets that contained cells and would selectively charge the particle-containing droplets so they would fall into collection vessels.
Herzenberg was excited by the possibility that the Fulwyler system could be used for sorting fluorescent-labeled cells; however, he realized that there were substantial barriers to adapting the instrument for this purpose. For example, fluorescence detectors would have to be added to enable measurement of the amount of cell-associated fluorescence for each of the cells in a sample. In addition, methods would have to be created to define and enter fluorescence "gates" to indicate which cells (drops) should be sorted. Fulwyler was not interested in adding either fluorescence detection or fluorescence-activated cell sorting to his instrument; however, he readily gave Herzenberg a copy of the plans for the Los Alamos machine and encouraged him to build the fluorescence-based system himself.
Herzenberg returned to Stanford somewhat chagrinned, since he was a biologist rather than an engineer. Nevertheless, he managed to organize and lead an engineering development group that built the first Fluorescence-Activated Cell Sorter (FACS) and successfully transferred the technology (circa 1970) to Becton-Dickinson Immunocytometry Systems, a leading builder of FACS instruments in today's market. Herzenberg put the first commercial FACS instrument into routine use for biologists at Stanford as soon as it was built. Later, through a set of fortuitous meetings, he connected directly with Sweet, who took over leadership of the group. At that time, as now, the group included hardware engineers Dick Stovel and Tom Nozaki, software designer Wayne Moore, and physicist/biologist David Parks (the current group leader). Herzenberg continued to work with the FACS development group to expand the power of the instrumentation. In addition, with his laboratory colleagues, he improved the specificity with which subsets could be identified by introducing monoclonal antibodies as fluorochrome-coupled reagents, and conducted a series of landmark biology and immunology studies that made FACS analysis and sorting central to work in biology and medicine.
Perhaps not surprisingly, the analytic capabilities of the FACS became progressively more important as functional subsets became well characterized and knowledge about individual lymphocyte (and other cell) subsets increased. While sorting and testing the functions of newly-recognized subsets is still a major part of FACS work, the use of the FACS analytic capabilities to determine subset representation in patients with HIV or other diseases, in experimental animals undergoing various treatments, or in cultures of genetically or physiologically modified cells now occupies center stage in most laboratories. Thus, the need for methods to facilitate the storage, retrieval, processing and display of FACS data has grown steadily as the technology has become more widespread.
In 1983, the development group exacerbated this need by introducing a new FACS instrument capable of analyzing and sorting cells marked with four different fluorochrome-coupled antibodies, distinguishable by the absorption and emission properties of the fluorescent dyes coupled to the antibodies. The initial biology studies conducted with this instrument, which foreshadowed our recently-developed 12-parameter FACS, brought home the realization that the speed and ease with which data could be generated with multiparameter FACS instruments was far greater than the speed with which that data could be integrated into a usable notebook. In essence, the biologists started sinking under the data load.
Fortunately, we (notably Wayne Moore) had predicted this impasse and were already building software to enable acquisition of sufficient information about the context of the experiments to permit collection, storage, retrieval and analysis of data with minimal reference to "paper" notes, e.g., for post-hoc entry of axis labels. The result of this effort, the FACS/Desk program suite inaugurated by the end of 1983, was an extremely successful proof of principle; so successful, in fact, that the biologists using the FACS in our facility have forced continued revisions of this initial prototype and steadfastly refused to allow it to be retired until it can be replaced by software that provides the same or greater ELN functionality.
FACS instruments: the data they generate and the software that processes it
In essence, FACS instruments measure cell-associated fluorescence and light scatter for individual cells passing single file, in a laminar flow stream, past a set of light detectors. The cell-associated fluorescence is commonly due to "staining" (incubation) with fluorochrome-coupled reagents (monoclonal antibodies or other proteins) that bind specifically to molecules on or in cells. Alternatively, it can be generated by staining with fluorogenic reagents that enter cells and either are, or become, fluorescent as the result of internal enzymatic or chemical reactions. The light scatter measurements provide an index of the size and granularity of the cell. At present, up to 5,000 cells can be analyzed per second.
As each cell passes the detectors, it is illuminated by lasers and emits and scatters light. The detectors are set to measure the light emitted at particular wavelengths or scattered at particular angles. The signals generated in each of the detectors are processed, digitized, and joined to create the set of measurements that are recorded individually for each cell by the data collection system. This "list mode" data recording can be thought of as a two-dimensional table in which the number of columns is defined by the number of parameters measured (fluorescence colors and light scatters) and the number of rows is defined by the number of cells for which data was taken (specified by the FACS user).
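For concreteness, the table structure can be sketched in a few lines of code. The class below is our own illustration (the ListModeData name and the parameter labels are invented, and this is not any FACS file format); it simply captures the row-per-event, column-per-parameter layout just described.

// Minimal sketch of "list mode" FACS data: one row per cell (event),
// one column per measured parameter (fluorescence colors, light scatters).
// Class and field names are illustrative only.
public class ListModeData {
    private final String[] parameterNames; // column labels, e.g. "FSC", "SSC", "FL1"
    private final int[][] events;          // events[row][column] = digitized signal

    public ListModeData(String[] parameterNames, int numberOfEvents) {
        this.parameterNames = parameterNames;
        this.events = new int[numberOfEvents][parameterNames.length];
    }

    public int parameterCount() { return parameterNames.length; }
    public int eventCount()     { return events.length; }

    // Record the full set of measurements for one cell as it passes the detectors.
    public void recordEvent(int row, int[] measurements) {
        System.arraycopy(measurements, 0, events[row], 0, measurements.length);
    }

    public int value(int row, String parameter) {
        for (int c = 0; c < parameterNames.length; c++) {
            if (parameterNames[c].equals(parameter)) return events[row][c];
        }
        throw new IllegalArgumentException("Unknown parameter: " + parameter);
    }
}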
The earliest FACS software processed the list mode data immediately after acquisition terminated and produced a volatile screen representation that had to be photographed to be stored. This was replaced initially by a minimal computer system that enabled labeling and storage of the processed analysis output, i.e., the screen representation. Later, Moore (in his first project with the group) replaced this awkward configuration with a PDP-11 system that permitted storage and subsequent "offline" processing of the list mode data.
This system, primitive by today's standards, represented a quantum leap in the ease with which biologists could deal with analyzing FACS data and sorting subsets of cells based on the analytic data. It had a visual interface that allowed the user to define and manipulate subsets containing cells with similar fluorescence and light scatter properties (indicating similar expression of one or more cell surface or internal marker molecules). In essence, the user was given the ability to define "gates" that specify the bounds for each of the parameters measured (fluorescence color; light scatter) and thus specify the characteristics of cells to be treated as a subset. Once a set of gates was established (there can be many sets), they could be used to count or sort the cells within a given subset or to determine the mean or median expression of particular markers on the cells within the subset.
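The gating concept itself is compact enough to capture in code. The sketch below is our own illustration, with invented parameter names and thresholds; it treats a gate as a set of per-parameter bounds and tests whether a single event falls inside all of them.

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a FACS "gate": lower/upper bounds on measured
// parameters; an event belongs to the subset if every bounded parameter
// falls within range. Names and thresholds are invented for the example.
public class Gate {
    private final Map<String, int[]> bounds = new HashMap<>(); // name -> {lo, hi}

    public Gate bound(String parameter, int lo, int hi) {
        bounds.put(parameter, new int[] {lo, hi});
        return this;
    }

    public boolean contains(Map<String, Integer> event) {
        for (Map.Entry<String, int[]> b : bounds.entrySet()) {
            Integer v = event.get(b.getKey());
            if (v == null || v < b.getValue()[0] || v > b.getValue()[1]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // A hypothetical scatter gate for a lymphocyte-like subset.
        Gate lymphocytes = new Gate().bound("FSC", 200, 600).bound("SSC", 0, 300);
        Map<String, Integer> cell = new HashMap<>();
        cell.put("FSC", 420);
        cell.put("SSC", 150);
        System.out.println(lymphocytes.contains(cell)); // true
    }
}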
Modern commercial FACS software includes many of the innovative data processing, gating, and display strategies originally demonstrated in the PDP-11 software. However, it also maintains the PDP-11 single-user (rather than time-share) approach and provides very little data management capability, leaving protocol entry, data storage, gate storage, processed-data storage, archiving, and data retrieval largely to the biologist. The lack of significant third-party support for these crucial operations over the years has unfortunately left most biologists bereft of the computer-accessible legacy of FACS data and information that could have been built from their work.
In contrast, because our experience with the PDP-11 software demonstrated that many FACS experiments were repeated simply because investigators lost track of where they had stored data, we were acutely aware that efficient FACS usage requires central protocol entry, data storage, and computerized methods for retrieving and processing the stored data. This led Moore to move from the PDP-11 to a more capable, time-sharing system (VAX 11-780) and to begin development of the FACS/Desk software, a bare-bones model of the ELN we currently envision.
Designing an ELN from scratch
The problems Moore faced in beginning what we would now call an ELN design effort were legion. There was no vocabulary to describe the kind of technological solutions he was considering, and precious little technology with which to work. The computer mouse and window-based systems had yet to be developed. Bitmap screens were known but priced far out of range for a project of this type. Databases were minimally competent and available only for large, mainframe systems. Disk space was commonly limited to less than 100 MB and processors operated at a snail's pace. Even the VAX 11-780, Moore's eventual platform choice, was still to be announced.
Adding to these woes, Moore had to design his system to meet the needs of bench scientists who viewed computers as alien beasts rather than as potential workhorses. The input from these scientists (computer illiterates, in today's jargon) gave Moore a general idea of what needed to be built, but virtually no idea of what it should look like or how it should "talk" to the user. Mainly, he had to determine the users' needs by examining their handwritten notebooks and listening to their frustrations in not being able to record or find aspects of the data or its associated annotation. Fortunately, over time, he was able to apply the information he acquired to the building of a system whose capabilities were sufficiently advanced to entice the biologists to learn how to use it.
Even today, fundamental problems interfere with the development of computer systems that facilitate data annotation by bench scientists. Often, experiments and technologies are so complicated that although scientists themselves intuitively understand how particular experiments are organized, they cannot make this organization comprehensible enough to enable software designers to produce innovative information management systems that would provide the necessary functions. Instead, the scientists often approach the problem by pointing to existing, familiar technology and requesting specific modifications that they feel could solve their problems. However, when the engineers make these modifications, the biologists frequently discover that although they got what they asked for, they did not get what they need. Frustration ensues on both sides, and creative design suffers.
In addition, biologists commonly expect that entry, storage and management of extensive annotation information will force them to waste expensive, often limited time at an instrument that sits idle while they "diddle" with the computer. Further, they are not inclined to waste precious time learning how to enter annotation data and extract the information they need at a later time. Thus, as Moore learned when he began developing his system, biologists communicate poorly with software developers and tend to be cooperative only when they truly believe that the system being built will make their work easier and more productive. Basically, this means that successful ELN development requires that developers recognize and remove bottlenecks that biologists may not even recognize are interfering with their work. Once this "magic" is accomplished, the product will become an integral part of the biologists' tool kit, and life without it will be unimaginable.
FACS/Desk: a well-used ELN prototype
The earliest FACS/Desk version, which was released ca. 1983 and ran on VMS (version 2) with a 60-megabyte disk, is basically still in operation today (albeit running on much-improved hardware). Its client/server architecture takes charge of FACS list mode data as it is collected and manages access to the data thereafter. All data are maintained by a central storage system that has an immediately accessible on-line cache and a tape library that contains all data collected in the system. Data are moved to and from tape automatically. The on-line cache is governed by a least-recently-used algorithm that removes the oldest data when users call for analyses based on off-line data.
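The cache policy described here is the classic least-recently-used scheme. In modern Java it can be sketched in a handful of lines with LinkedHashMap; the class below illustrates the policy only and is not FACS/Desk's actual VMS implementation.

import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the on-line cache policy: an access-ordered map that evicts the
// least-recently-used entry when the cache overflows. This illustrates the
// eviction rule, not the real tape-backed storage system.
public class OnlineCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public OnlineCache(int capacity) {
        super(16, 0.75f, true); // true = access order, the LRU ingredient
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // When a user pulls off-line data into a full cache, the entry
        // touched longest ago is the one moved out (here: simply dropped).
        return size() > capacity;
    }
}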
Each user communicates with the FACS/Desk system through a personal, password-protected "Desk" assigned when the user enters the system. The non-procedural user interface that Moore introduced for this communication foreshadows today's "point and click" GUIs. At log-on, the Desk displays an alphabetical list of the protocols and experiments already created by the user. Simple keystrokes allow the user to add new protocols, to collect new data, or to analyze data that has already been collected. The common FACS/Desk archive, also accessible from the personal Desk, provides a repository for retrievable experiments that users no longer wish to keep on individual Desks.
To capture experiment descriptions and other annotation information without unnecessarily burdening the user, FACS/Desk is built with a protocol editor that prompts users to enter descriptive experimental data (e.g., sample names, reagents, and fluorescence reporter groups). Protocols are created prior to initiating data collection. Data collection is controlled through a second GUI, generated from the experiment protocol, that enables the user to access annotation information, to determine the number of cells for which to collect data, and to initiate data collection for each sample. The collection GUI also signals the permanent association of the annotation information with the list mode data once collection terminates.
FACS/Desk stores annotation information and list mode data in separate, pointer-linked files so that sample and reagent descriptions can be maintained on line when the data is stored to tape. This information, available through the individual user Desks, is used to locate and retrieve stored data. In addition, it is available through the FACS/Desk analysis GUI, where it is used to specify analyses and to label analysis output, e.g., axes in graphs (plots) and columns in tables during data analysis.
The FACS/Desk analysis package takes advantage of the client/server architecture and enables users to specify a set of analyses and submit them for batch processing. The user is then free to specify and submit more analyses or to terminate the FACS/Desk session. Submitted analyses are queued and processed in the order they are received. Results of the analyses are returned to the submitting user's desk and stored permanently in association with the experiment. In addition, results are sent to the print queue if printing was specified. Minutes, months or years later, the user can re-open his or her desk to view results, submit additional analyses, call for additional printing, etc. Thus, with respect to FACS experiments and data, the user's Desk within the overall FACS/Desk system provides the elements essential to an ELN.
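Viewed in today's terms, this is a producer/consumer batch queue. The sketch below is our own (the AnalysisRequest type and its fields are invented); it shows the shape of the submit-and-walk-away workflow, not the actual VAX implementation.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the FACS/Desk batch model: users submit analyses and move on;
// a server-side worker processes them strictly in arrival order.
public class AnalysisBatchServer {
    record AnalysisRequest(String userDesk, String experiment) {}

    private final BlockingQueue<AnalysisRequest> queue = new LinkedBlockingQueue<>();

    public void submit(AnalysisRequest request) {
        queue.add(request); // the user is immediately free to submit more or log off
    }

    public void runWorker() throws InterruptedException {
        while (true) {
            AnalysisRequest next = queue.take(); // FIFO: processed in order received
            // ... compute statistics, render plots ...
            // Results would be stored on the submitting user's Desk,
            // permanently associated with the experiment, and optionally printed.
            System.out.println("Finished analysis for " + next.userDesk());
        }
    }
}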
Additions to the basic FACS/Desk
In later versions, Moore and other software development engineers have introduced a series of FACS/Desk innovations, e.g., extension of data collection and analysis capacity to up to 16 parameters; advanced instrument calibration, standardization, fluorescence compensation and data collection capabilities that make the archived data comparable between, as well as within, FACS runs; network access for analysis of FACS/Desk data; and Macintosh-based access for data analysis and display. These capabilities serve the needs of a large number of biologists and medical scientists at Stanford. Therefore, although FACS/Desk is an antique by some standards, it is still running at Stanford and several other sites and will continue to do so until, as indicated above, all of its current features can be replaced with modern equivalents.
Fortunately, Moore built substantial flexibility into FACS/Desk. For example, although provision for data collection for up to 16 parameters seemed excessive at the time, the Stanford team has now developed a prototype "high-dimension" FACS system (instrumentation and software) that simultaneously measures up to ten distinct fluorescence-tagged surface molecules on individual cells (in addition to the standard two scatter measurements). With this extended FACS capability, biologists in our laboratory have been able to resolve previously unrecognized subsets of the overall CD4 and CD8 T cell subsets in human peripheral blood. Commercial FACS instruments are widely used to determine overall CD4 T cell levels as an index of HIV disease progression. Our "high-dimension" FACS studies reveal additional changes in T cell representation as HIV disease progresses, demonstrating that certain subsets are lost while others increase from barely detectable to relatively high frequencies. In addition, in studies with peripheral blood T cells from atopic and leprosy-infected subjects, we have shown that the changes in the frequencies of the finely-resolved subsets account for the principal differences in cytokine production in these diseases. In contrast to the extensibility of the FACS/Desk data collection and storage capabilities, the FACS/Desk analytic package was too limited to support the "high-dimension" FACS work. Thus, we initiated development of a new analytic package some time ago by building a Macintosh-based prototype that is now in routine operation in our laboratory. This prototype, which follows Moore's general design, was built by Adam Triester and Mario Roederer and named "FlowJo."
FlowJo operates best in conjunction with FACS/Desk, since it lacks an independent data annotation and collection system. However, it is much in demand outside our laboratory because its data handling features are markedly better than those provided by current commercial systems. Thus, it has been fitted with a mechanism for reading data acquired by commercial FACS instruments and is now distributed by TreeStar Software and Becton-Dickinson Immunocytometry Systems.
2.3 The Future
The use of Fluorescence-Activated Cell Sorters in research and medicine continues to expand as new applications are developed and older applications become standard practice. To meet the challenges generated by this expansion, we have already begun using recently released Internet tools to create a "FACS Data Web" intended to facilitate collection, analysis and interpretation of data from FACS studies and to enable integration of that data with relevant information acquired with other methodologies. In essence, this system will create an ELN centered on FACS data but potentially extensible to most biomedical experimentation.
Basically, we plan to build a JAVA-based, Internet-accessible FACS DataWeb with integrated modules for planning FACS experiments (protocols) and for collecting, archiving, retrieving, analyzing and displaying data in the context of information entered at the planning (protocol) stage. The experiment planning modules will utilize semantic models to link experiments to data sources and other information relevant to protocol design, experiment execution, and subsequent data analysis, e.g., previous FACS data; reagent information; patient, animal, or cell line databases; and clinical laboratory and medical record data from a clinical trial. The data entry and collection modules will enable standardization, storage and archiving of FACS data annotated with the protocol and execution information necessary for retrieving it and for specifying, displaying, and permanently recording analysis results. Finally, the data analysis and visualization modules will include novel statistical approaches to data visualization and visualization capabilities utilizing graphics browser facilities, e.g., Computer Graphics Metafile (CGM) and Virtual Reality Modeling Language (VRML).
We plan to implement the FACS DataWeb in JAVA to support a three-tiered client/server World Wide Web-based architecture, making it available to FACS users at other sites via the World Wide Web (see figure 2). Later, we hope to embed the FACS DataWeb in a broader DataWeb that integrates FACS data with Web-available cellular immunology data and genetic and medical resources, including genome and other widely-used databases.
2.4 Conclusion
The DataWeb software focuses on providing an automated solution for the storage of protocol information and its use in data interpretation. FACS/Desk, our current system, has already implemented and proven the utility of providing a protocol editor through which a modicum of basic information can be entered to help manage and interpret the voluminous data collected in FACS experiments. The DataWeb extends this system to include semantic models that enable entry and use of protocol information for the collection, archiving, display, and interpretation of FACS data, and for the association of FACS data with Web-accessible information from other sources. Since the complexity of FACS protocols and data interpretation exercises virtually all of the features required to collect, store, and interpret data from other instrumentation, the DataWeb should be readily extendible into a full-fledged ELN that can integrate with instrumentation and conceptualization software to provide the functionality necessary for the overall support of scientific and medical research.
As its name implies, the DataWeb is designed as a distributed system that can take advantage of the potential inherent in collecting, storing, retrieving and analyzing data via the Internet. To this end, we have designed novel strategies for serving data based on the use of Directory Services, such as those provided by LDAP servers, to uniquely identify data and allow broad searches based on the kinds of metadata that the protocol editor discussed here is designed to capture and associate with raw data during the data acquisition process. This Directory Service approach, which provides fine-grained access control and enables use of locally-controlled data servers that can be federated to provide global access, effectively removes many of the disadvantages of storing data and metadata in relational databases. Thus, we view it as central to ELN design in the 21st century.
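To make the retrieval side concrete, here is a minimal client-side sketch using Java's standard JNDI LDAP provider. The server URL, search base, and attribute values are hypothetical; the FACSSample objectClass and instrument attribute follow the sample card shown in Appendix A below.

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

// Sketch of a client-side search against a federated FACS Directory Service.
// The server URL, search base, and filter values are hypothetical.
public class FacsDirectorySearch {
    public static void main(String[] args) throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://facs-directory.example.edu");

        DirContext ctx = new InitialDirContext(env);
        SearchControls controls = new SearchControls();
        controls.setSearchScope(SearchControls.SUBTREE_SCOPE);

        // Find every sample collected on a given instrument; this metadata
        // was captured by the protocol editor at acquisition time.
        String filter = "(&(objectClass=FACSSample)(instrument=Flasher2))";
        NamingEnumeration<SearchResult> results =
            ctx.search("ou=School of Medicine,o=Stanford University", filter, controls);
        while (results.hasMore()) {
            System.out.println(results.next().getNameInNamespace());
        }
        ctx.close();
    }
}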
3. Figure Legends
To be completed.
Update
IIS-98-17413,
Vernon Oi, P.I.
Title: A Digital Library of Flow Cytometry Data and Cell Phenotypes
Since we submitted our proposal in July 1998, we have made substantial progress both in detailing and implementing a functional LDAP system for our own application (a test LDAP server will be on line shortly) and in organizing a site to house our proposed FACS Digital Library (San Diego Super Computer Center). In addition, as indicated below, we have begun extending the scope of our proposed LDAP Directory Service to other disciplines (e.g., Human Genome project, PubMed).
Not surprisingly, these ideas have begun to attract attention locally and at the national level. Wayne Moore, the senior software designer on the project, recently gave two informal Stanford seminars. Each resulted in a potentially fruitful collaboration (one with David Cox, co-head of the Stanford Human Genome Project, and a second with Russel Altman, head of the Stanford Riboweb Project). In addition, Moore has been invited to speak at a plenary session of the National Federation of Abstracting and Information Services (NFAIS) Metadiversity conference ("Responding to The Grand Challenge for Biodiversity Information Management Through Metadata," November 8-12, 1998, Natural Bridges, VA).
Moore's abstract for the NFAIS Conference succinctly summarizes the progress in our thinking:
Perspectives on information management on the Internet.
Directory Service, as defined by the ISO X.500 and IETF LDAP standards, is rapidly becoming an essential infrastructure component of corporate and governmental intranets as well as the wider Internet. Although conceived in terms of traditional telephone number and e-mail address directories, current LDAP implementations are quite competent databases in their own right and can be exploited for many other purposes. This technology may be particularly useful for information storage and exchange in the biological and medical sciences and in other areas that similarly deal with very large name spaces (i.e., many discrete named elements) that are difficult to serve with current approaches.
Our motivation for this project derives from a necessity to maintain and serve data from Flow Cytometry (FACS) instruments, which are used world-wide in basic science and medicine. In perhaps their best known use, FACS instruments provide the ability to monitor changes in CD4 T cell counts as HIV disease progresses. We currently maintain an archive of over 200GB of FACS data, collected mainly in basic science studies at Stanford over the last 15 years. We plan to implement LDAP service for this data, which will be collected locally, stored at the San Diego Super Computer Center, and made available to FACS users at Stanford and elsewhere over the Internet. This universal accessibility, which mirrors the intentionally broad availability of genomics data, provides a general model for facilitating the electronic interchange of scientific data and the publication of scientific findings.
In this presentation, I will discuss the advantages of using Directory Service vs. a more traditional relational database approach for these purposes. Directory Service advantages include 1) global service capable of providing the same information to everyone in the world; 2) fine-grained access control; 3) federated servers that need not be located within a single organization; and 4) compatible client software that is widely available and runs on "lightweight clients" (e.g., PCs and Macs). I will illustrate this discussion with examples from our laboratory's work in lymphocyte biology and Flow Cytometry and from a variety of other areas, including genetics, genomics, taxonomy, museum and scholarly collections, electronic publication and scientific literature index services.
Additional Interest in our Digital Libraries Concept - A Directory of Directories
1. Human Genome Project. After an informal presentation at our recent Genetics Department Retreat, the Stanford Human Genome Project Group expressed significant interest in our concepts of information directories (taking advantage of LDAP and directory services). A letter of collaboration from David Cox and Rick Meyers, Directors of the Stanford Human Genome Center, is included in Appendix B. Dr. Cox was specifically interested in solving problems associated with replicating and distributing data currently in relational databases. Our LDAP approach offers significant advantages for this purpose. Not only can directory services be readily replicated and distributed across the Internet; more significantly, they can be created as a federation of directories spanning the Internet. This would require that a directory schema be conceptualized, agreed upon, and created by members of the multiple National Genome Centers to provide a National Human Genome Directory accessible to the general scientific community. Moore has already created a partial example of such a schema (see Appendix A) and is working with Dr. Cox to set up a demonstration server to test this approach.
Ultimately, individual Directory Services can be maintained by each Human Genome Project Group as part of a National Federation of Directory Services. These Directories could be replicated and redistributed by network resources located at the National Laboratories and/or the National Super Computing Centers to facilitate Internet access by the general scientific community. Alternatively, Internet 2 resources could act as the National Directories of Directories.
2. Medical Information Sciences. A small informal presentation of our directory concepts to the Stanford Bioinformatics Group stirred interest in comparing the features, benefits, and performance of directory services, used either alone or in addition to relational database "solutions," for storing and accessing information. Dr. Russel Altman has committed to comparing these approaches using the ribosomal protein structure and function database that he has developed (see collaboration letter in Appendix B).
Directory Services as a common entry to other data sources.
Because Directory Services can integrate readily with relational databases, object databases and other data sources, they offer the potential for developing a "knowledge portal" capable of rapidly directing users to data that might otherwise be difficult to find. Further, because Directory Services can be federated, they provide an infrastructure that can be locally maintained and globally accessed.
Directories of National and International Scientific Nomenclature, and their Associated Databases, provide the basis for the unique name spaces required to create unambiguous directories that can be used for this purpose. An international scientific board can be created to develop and maintain the Internet Standard Recommendation on Scientific Data Schemata. Appendix A includes a paper by Wayne Moore that presents a technical view of the overall principles underlying the Directory Service (LDAP) approach we propose. Moore's paper illustrates this approach with examples from Flow Cytometry; however, he has also developed tables with "distinguished name" specifications for LDAP servers illustrating how Human Genome information and information about scientific publications can be served (see Appendix A).
PubMed and Directory Services
PubMed can be viewed as a Directory of Directories, i.e., a Directory of Journal Directories. If the National Library of Medicine (NLM) supported a "Recommended Journal Directory Standard," PubMed could be replaced with a Federated Scientific Journal Directory in which publishers independently ran local Directory Services that would be accessed through the PubMed Federated Directory. The NLM is best suited to maintain this central directory service, both because it is already established in this role and because the search and naming mechanisms it has developed (MeSH and scientific and medical thesauri) can be readily incorporated into the directory schema. Citation indexing is also easily incorporated into the Directory Structure. In addition, the schema developed for these directory services can provide the basis for defining XML name spaces and DTDs.
Since LDAP supports fine-grained access control, each publisher participating in the PubMed Federation would be able to assign access privileges (to titles, authors, abstracts, etc.) as desired. In fact, publishers could allow search access to any or all information but require subscriptions to see commercially valuable material (e.g., entire published manuscripts). In addition, a National Directory of Scientists could be created to provide authentication for access to specific information levels or specific directories.
If, as we propose, Directory Services become established as the mechanism for storing raw and interpreted data in the laboratory, published views of data could readily be linked to the underlying experimental findings, e.g., published FACS plots could be linked to primary FACS data. Thus, publishers could require that laboratory data referred to in publications be put "on file" much the same way as publishers now require that sequence data be entered into the sequence database before publication.
Additional FACS Directory sites.
The San Diego Super Computing Center, which is supported by the National Science Foundation (NSF), has recently agreed to house the entire FACS Digital Library Archive described in our proposal. The Center will not provide funding for our development work but will house our archive and will provide consultation necessary for its establishment.
We have also accepted two additional FACS sites (in Tokyo and Salamanca) for participation in the FACS Directory Trial.
Figure 1

Figure 2
To date, instrumentation manufacturers have typically dealt with the lower levels in Figure 1, while ELN efforts have been directed toward the upper reaches. The middle ground, where the key action takes place, is left in limbo; more specifically, in the hands of the bench scientist responsible for making the connections that turn raw data into findings.
Strangely, scientists rarely seem conscious of this problem. They have dealt with it throughout their training and see it largely as "part of the real estate." Therefore, they only ask for solutions when the data load is too great or the nature of the data analysis is such that hand-linking protocol information and analytic output seriously impacts the ability to do productive work. Even then, perhaps because it is difficult to imagine how to input the relevant protocol information in useful form, scientists commonly ask only for partial solutions. Thus, it is not surprising that this problem has escaped the attention of most instrumentation companies and ELN designers.
Our laboratory, however, has been virtually forced to deal with these issues. The FACS instruments that we developed are central to our biological and medical studies. However, from the beginning, the bench scientists in our laboratory had difficulty dealing with the numerous and vo-

Figure 3 - Federated Card Catalogs
Figure 2 - Internet Application / Digital Library
Radiation Hybrid Clone Card

Radiation Hybrid Map Card
In addition, two new attribute syntaxes seem called for: one for comparing dimensionful quantities sensibly (e.g., feet and cm), and the other for approximate searching of sequence attributes. LDAP is short on numerical syntaxes; I don't know about X.500. A sequence syntax would be new for both, of course. The standard (and the Netscape server) allow for such extensions.
JOURNAL Access

The objectClass scientificPublication should have optional multi-valued attributes, reference and citation, which are distinguished names. When the publisher establishes a record, it fills in reference with the dn of each scientificPublication that the new record references. An indexing service would buy the rights to replicate the raw data, update the citations in its copy when new data appeared, and then serve the result as "value added" to its customers. A minimal sketch of how a publisher might establish such a record follows.
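The JNDI code below is our own illustration of that workflow; the server URL, the DN forms, and the attribute values are assumptions layered on the proposal, not an implemented service.

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.directory.BasicAttribute;
import javax.naming.directory.BasicAttributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

// Sketch: a publisher establishes a scientificPublication record whose
// multi-valued "reference" attribute holds the distinguished names of the
// publications it cites. DNs and the server URL are hypothetical.
public class PublishArticle {
    public static void main(String[] args) throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://journal-directory.example.org");

        BasicAttributes attrs = new BasicAttributes();
        attrs.put("objectClass", "scientificPublication");
        BasicAttribute reference = new BasicAttribute("reference");
        reference.add("article=12, volume=3, o=Some Other Journal");
        reference.add("article=7, volume=9, o=Yet Another Journal");
        attrs.put(reference);

        DirContext ctx = new InitialDirContext(env);
        ctx.createSubcontext("article=42, volume=5, o=Example Journal", attrs);
        ctx.close();
        // An indexing service replicating this data would later add matching
        // "citation" values to its own copy as new referencing articles appear.
    }
}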
Species Card

dn: 'H. sapiens'
objectClass: Species
alias: species = sapiens, genus = Homo
REF: ldap://shgp.stanford.edu/species = sapiens, genus = Homo
UID: ldap://shgp.stanford.edu/species = sapiens, genus = Homo

Radiation Hybrid Panel Card

Sequence Tag Site Card

FACS Sample Card

dn: Coord = Ala, Protocol = 1234, Instrument = Flasher2, uid = LenHerz, d = Genetics, ou = School of Medicine, o = Stanford University
objectClass: FACSSample
Coord: Ala
protocol: 1234
instrument: Flasher2
uid: LenHerz
ou: School of Medicine
o: Stanford University
title: T cell subsets in HIV+ subjects
description: Stains for PBMC subsets
sampleLabel: Subject #2 (pid = 234)
investigatorDn: uid = LenHerz, d = Genetics, ou = School of Medicine, o = Stanford University
instrumentDn: Instrument = Flasher2, ou = Shared FACS facility, o = Stanford University
dateCollected: September 15, 1998
startTime: 14:22
endTime: 14:23
numberOfMeasurements: 12
numberOfEvents: 200000
URL: ftp://Curie.Stanford.EDU/Flasher2/LenHerz/1234/Ala.FCS

Genus Card
Monoclonal Antibody Card

dn: clone = A.F.6-78, ou = Pharmingen, o = Becton-Dickinson Immunocytometry Systems
objectClass: MonoclonalAntibody
clone: A.F.6-78
o: Becton-Dickinson Immunocytometry Systems
ou: Pharmingen
cn: Anti IgH-6b, Anti IgM1
specificity: allele = b, locus = Igh-6, o = IUIS standard notation group
creatorDn: uid = Stall, d = Genetics, ou = School of Medicine, o = Stanford University
manufacturer: ou = Pharmingen, o = Becton-Dickinson Immunocytometry Systems
Investigator Card

dn: uid = LenHerz, d = Genetics, ou = School of Medicine, o = Stanford University
objectClass: ScientificInvestigator, inetOrgPerson, organizationalPerson, person
uid: LenHerz
cn: Len Herzenberg
d: Genetics
ou: School of Medicine
o: Stanford University
professionalName: Leonard A. Herzenberg
professionalSpeciality: Genetics, Immunology, Cell Sorting
professionalAffiliation: National Academy of Sciences, American Association of Immunology
5. Conclusion
This paper examines the problem of computer-assisted communications in flow cytometry in particular, and biology in general, from the point of view of the emerging standards for computerized directory service. Following Schulze-Kremer [14]: "To improve the current situation of non-unified and ambiguous vocabulary, the only solution is to develop a core of commonly agreeable definitions, and using these, to implement user interfaces to and between databases." As an example of how this goal can be accomplished, I have outlined how X.500 directory services, accessed via LDAP from lightweight clients, can be used to create and manage a unique namespace in the flow cytometry domain.
Table 4: Monoclonal antibodies

Table 5: FACS instrument

Table 6: FACS experiments

Claims

1. A directory of biological data comprising: a plurality of nodes comprising distinguished names; at least one extension of one of said plurality of nodes comprising an identifier of a biological sample.
PCT/US1999/025765 1998-11-06 1999-11-05 Directory protocol based data storage WO2000028437A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU23440/00A AU2344000A (en) 1998-11-06 1999-11-05 Directory protocol based data storage

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10746598P 1998-11-06 1998-11-06
US60/107,465 1998-11-06

Publications (2)

Publication Number Publication Date
WO2000028437A1 WO2000028437A1 (en) 2000-05-18
WO2000028437A9 true WO2000028437A9 (en) 2000-09-21

Family

ID=22316750

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/025765 WO2000028437A1 (en) 1998-11-06 1999-11-05 Directory protocol based data storage

Country Status (2)

Country Link
AU (1) AU2344000A (en)
WO (1) WO2000028437A1 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6947953B2 (en) * 1999-11-05 2005-09-20 The Board Of Trustees Of The Leland Stanford Junior University Internet-linked system for directory protocol based data storage, retrieval and analysis
WO2001090951A2 (en) * 2000-05-19 2001-11-29 The Board Of Trustee Of The Leland Stanford Junior University An internet-linked system for directory protocol based data storage, retrieval and analysis
US7085773B2 (en) 2001-01-05 2006-08-01 Symyx Technologies, Inc. Laboratory database system and methods for combinatorial materials research
US6658429B2 (en) * 2001-01-05 2003-12-02 Symyx Technologies, Inc. Laboratory database system and methods for combinatorial materials research
US7991827B1 (en) 2002-11-13 2011-08-02 Mcafee, Inc. Network analysis system and method utilizing collected metadata
US8645424B2 (en) 2007-12-19 2014-02-04 Sam Stanley Miller System for electronically recording and sharing medical information
US8782062B2 (en) 2009-05-27 2014-07-15 Microsoft Corporation XML data model for remote manipulation of directory data
WO2016118591A1 (en) 2015-01-20 2016-07-28 Ultrata Llc Implementation of an object memory centric cloud
EP3248097B1 (en) 2015-01-20 2022-02-09 Ultrata LLC Object memory data flow instruction execution
US9886210B2 (en) 2015-06-09 2018-02-06 Ultrata, Llc Infinite memory fabric hardware implementation with router
WO2017100281A1 (en) 2015-12-08 2017-06-15 Ultrata, Llc Memory fabric software implementation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5418942A (en) * 1989-07-06 1995-05-23 Krawchuk; Kenneth V. System and method for storing and managing information

Also Published As

Publication number Publication date
AU2344000A (en) 2000-05-29
WO2000028437A1 (en) 2000-05-18

Similar Documents

Publication Publication Date Title
Lacroix et al. Bioinformatics: managing scientific data
US6947953B2 (en) Internet-linked system for directory protocol based data storage, retrieval and analysis
Malet et al. A model for enhancing Internet medical document retrieval with “medical core metadata”
US20030233365A1 (en) System and method for semantics driven data processing
Kennedy et al. Scientific names are ambiguous as identifiers for biological taxa: their context and definition are required for accurate data integration
Robinson et al. Updating the Read Codes: user-interactive maintenance of a dynamic clinical vocabulary
CA2418475A1 (en) Integrated multidimensional database
Shaker et al. The biomediator system as a tool for integrating biologic databases on the web
WO2000028437A9 (en) Directory protocol based data storage
Court et al. Virtual Fly Brain—An interactive atlas of the Drosophila nervous system
Bandrowski et al. A hybrid human and machine resource curation pipeline for the Neuroscience Information Framework
WO2001090951A2 (en) An internet-linked system for directory protocol based data storage, retrieval and analysis
Cannon et al. Non-curated distributed databases for experimental data and models inneuroscience
Shah et al. Annotation and query of tissue microarray data using the NCI Thesaurus
Rübel et al. Methods for specifying scientific data standards and modeling relationships with applications to neuroscience
Eckman A Practitioner's Guide to Data Management and Data Integration in Bioinformatics.
US7657417B2 (en) Method, system and machine readable medium for publishing documents using an ontological modeling system
Hui et al. HIWAS: enabling technology for analysis of clinical data in XML documents
Farmerie et al. Biological workflow with BlastQuest
Hsu et al. Knowledge-mediated retrieval of laboratory observations.
White Linking Biodiversity Databases Preparing Species Diversity Information Sources by Assembling, Merging and Linking Databases
García-Remesal et al. ARMEDA II: supporting genomic medicine through the integration of medical and genetic databases
Krzyzanowski et al. Using semantic web technologies to power LungMAP, a molecular data repository
Schäfer et al. Graph4Med: a web application and a graph database for visualizing and analyzing medical databases
Seep et al. From Planning Stage Towards FAIR Data: A Practical Metadatasheet For Biomedical Scientists

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
AK Designated states

Kind code of ref document: C2

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: C2

Designated state(s): GH GM KE LS MW SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

COP Corrected version of pamphlet

Free format text: PAGES 1/4-4/4, DRAWINGS, REPLACED BY NEW PAGES 1/4-4/4; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase