EP1101176A1 - Method and apparatus for a storage architecture with an improved information storage and retrieval system in a shared file environment

Method and apparatus for a storage architecture with an improved information storage and retrieval system in a shared file environment

Info

Publication number
EP1101176A1
EP1101176A1 EP99940859A EP99940859A EP1101176A1 EP 1101176 A1 EP1101176 A1 EP 1101176A1 EP 99940859 A EP99940859 A EP 99940859A EP 99940859 A EP99940859 A EP 99940859A EP 1101176 A1 EP1101176 A1 EP 1101176A1
Authority
EP
European Patent Office
Prior art keywords
partition
data
column
user
record
Legal status
Withdrawn
Application number
EP99940859A
Other languages
English (en)
French (fr)
Inventor
Scott Wlaschin
Current Assignee
Enfish Technology Inc
Original Assignee
Enfish Technology Inc
Priority claimed from US 09/128,922 (patent US6182121B1)
Application filed by Enfish Technology Inc
Publication of EP1101176A1


Definitions

  • the present invention relates generally to a method and apparatus for storing, retrieving, and distributing various kinds of data. More specifically, the present invention relates to a physical storage architecture for a shared file environment and method for using such.
  • Client-server systems designed for desktop information management and local area networks uniformly use one of the first two approaches described above. These approaches tend to provide an imbalanced load on the server and typically require locking of the shared files on the remote server which further hampers performance. In addition, the files resident on the server typically require a connection to the client and thus updates may not occur without such a connection. The first two approaches also tend to be relatively slow for updates as updates must be synchronized in real-time.
  • The present invention overcomes the limitations of the prior art by providing a flexible, efficient and fast physical storage system that combines the advantages of asynchronous replication with the need for direct access to central data. It is designed to be used as a file system that allows users to share files on networks and across different storage media such as hard drives, CD-ROMs and WORM drives.
  • a physical storage system must store data items, such as a database record, in a non-volatile memory until such time as an application requires access to such data. This process typically involves 'flattening' the contents of data items and writing them to the storage medium.
  • the storage medium is generally divided into fixed size blocks, each of which has a location.
  • The first restriction is that each data item be a fixed length.
  • the second restriction is that only the most recent version of each data item need be stored.
  • Prior art storage systems generally operate according to one or both of these restrictions.
  • a block of memory is found that is large enough to hold a data item, which is then written to that block.
  • the other items in the block are reorganized to free up the maximum amount of space, ready for another data item.
  • a new block is created only when no existing block has enough space for a new data item.
  • Prior art systems do not readily support variable length data and previous versions of a data item are not available, so that no 'undo' function is available to the user. Further, the prior art methods may not be used in conjunction with append-only media such as write-once read-many (WORM) disks. As will be described, the present invention overcomes the limitations of prior art storage systems by providing a system that easily supports variable length data items without erasing older versions of data items while occupying a relative minimum of disk space.
  • databases comprised a simple "flat file" with an associated index.
  • Application programs, as opposed to the database program itself, managed the relationships between these files, and a user typically performed queries entirely at the application program level.
  • The introduction of relational database systems shifted many tasks from application programs to database programs.
  • the currently existing database management systems comprise two main types, those that follow the relational model and those that follow the object oriented model.
  • the relational model sets out a number of rules and guidelines for organizing data items, such as data normalization.
  • a relational database management system (RDBMS) is a system that adheres to these rules.
  • RDBMS databases require that each data item be uniquely classified as a particular instance of a "relation".
  • Each set of relations is stored in a distinct "table".
  • Each row in the table represents a particular data item, and each column represents an attribute that is shared over all data items in that table.
  • The pure relational model places a number of restrictions on data items. For example, a data item cannot have attributes other than the columns defined for its table. Further, an item cannot point directly to another item. Instead, "primary keys" (unique identifiers) must be used to reference other items.
  • these restrictions cause RDBMS databases to include a large number of tables that require a relatively large amount of time to search. Further, the number of tables occupies a large amount of computer memory.
  • The object oriented database model, derived from the object-oriented programming model, is an alternative to the relational model. As in the relational model, each data item must be classified uniquely as belonging to a single class, which defines its attributes. Key features of the object-oriented model are: 1) each item has a unique system-generated object identification number that can be used for exact retrieval; 2) different types of data items can be stored together; and 3) predefined functions or behavior can be created and stored with a data item.
  • Both the relational and object oriented models share important limitations with regard to data structures and searching. Both models require data to be input according to a defined field structure and thus do not completely support full text data entry. Although some databases allow records to include a text field, such text fields are not easily searched. The structural requirements of current databases require a programmer to predefine a structure, and subsequent data entry must conform to that structure. This is inefficient where it is difficult to determine the structure of the data that will be entered into a database.
  • word and image processors that allow unstructured data entry do not provide efficient data retrieval mechanisms and a separate text retrieval or data management tool is required to retrieve data.
  • the current information management systems do not provide the capability of integrating full text or graphics data entry with the searching mechanisms of a database.
  • The present invention overcomes the limitations of both the relational database model and object oriented database model by providing a database with increased flexibility, faster search times and smaller memory requirements and that supports text attributes. Further, the database of the present invention does not require a programmer to preconfigure a structure to which a user must adapt data entry. Many algorithms and techniques are required by applications that deal with these kinds of information.
  • the present invention provides for the integration, into a single database engine, of support for these techniques, and shifts the programming from the application to the database, as will be described below.
  • the present invention also provides for the integration, into a single database, of preexisting source files developed under various types of application programs such as other databases, spreadsheets and word processing programs.
  • the present invention allows users to control all of the data that are relevant to them without sacrificing the security needs of a centralized data repository.
  • the distributed storage system of the present invention provides a method and apparatus for storing, retrieving, and sharing data items across multiple physical storage devices that may not always be connected with one another.
  • the distributed storage system of the present invention comprises one or more 'partitions' on distinct storage devices, with each partition comprising a group of associated data files which in turn contain a collection of data items, each of which can be accessed individually.
  • Partitions can be of various types. Journal partitions may be written to by a user and contain the user's updates to shared data items. In the preferred embodiment, journal partitions reside on a storage device associated with a client computer in a client-server architecture. Other types of partitions, library and archive partitions, may reside on storage devices associated with a server computer in a client-server architecture.
  • the data items on the journal partitions of the various clients may, at various times, be merged into a data item resident within a new, consolidated partition. If two or more clients attempt to update or alter data related to the same data item, the system resolves the conflict between the clients to determine which updates, if any, should be stored in the consolidated partition.
  • the merge operation may occur at various time intervals or be event driven.
  • the consolidated partition can optionally be merged into the library partition, which maintains a shared version of a data item.
  • the archive partition stores older versions of data items from the library partition.
  • journal partitions can share the same library and archive partitions, which provides a means for providing the data items in the library and archive partitions as shared, while allowing the data items in the journal partition to have a local version, independent of data items in other journals or shared data items.
  • the journal partition of the present invention comprises a series of objects that are written sequentially to physical memory.
  • the journal partition stores older versions of objects such that a user may retrieve data that had been changed.
  • the objects correspond to data items, such as a record in a database or a text file.
  • a table is stored to track the location of objects within the journal partition.
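  • The journal behaviour described above (sequential writes, retained older versions, and a table that tracks object locations) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name `Journal`, its methods, and the in-memory list standing in for physical memory are all hypothetical.

```python
class Journal:
    """Append-only journal: objects are written sequentially, never overwritten."""

    def __init__(self):
        self._log = []        # sequential storage; stands in for physical memory
        self._locations = {}  # item id -> list of log positions, oldest first

    def write(self, item_id, value):
        """Append a new version of the item and record where it was stored."""
        self._log.append((item_id, value))
        self._locations.setdefault(item_id, []).append(len(self._log) - 1)

    def read(self, item_id, versions_back=0):
        """Return the newest version, or an older one if versions_back > 0."""
        positions = self._locations[item_id]
        return self._log[positions[-1 - versions_back]][1]


journal = Journal()
journal.write("memo", "first draft")
journal.write("memo", "second draft")
newest = journal.read("memo")                     # latest version of the item
previous = journal.read("memo", versions_back=1)  # the version it replaced
```

  • Because nothing is overwritten, an "undo" amounts to reading one position further back in the location table.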
  • The present invention improves upon prior art information search and retrieval systems by employing a flexible, self-referential table to store data.
  • the table of the present invention may store any type of data, both structured and unstructured, and provides an interface to other application programs such as word processors that allows for integration of all the data for such application programs into a single database.
  • the present invention also supports a variety of other features including hypertext.
  • the table of the present invention comprises a plurality of rows and columns. Each row has an object identification number (OID) and each column also has an OID.
  • a row corresponds to a record and a column corresponds to an attribute such that the intersection of a row and a column comprises a cell that may contain data for a particular record related to a particular attribute.
  • a cell may also point to another record.
  • column definitions are entered as rows in the table and the record corresponding to a column contains various information about the column. This renders the table self referential and provides numerous advantages, as will be discussed in this
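  • One way to picture the self-referential table is sketched below. It assumes a dictionary-of-dictionaries layout and a simple counter for OIDs; the class `Table`, its methods, and the column labels "Name" and "Employed By" are hypothetical and chosen only for illustration.

```python
from itertools import count


class Table:
    """Rows and columns both carry OIDs; a column is itself defined by a row."""

    def __init__(self):
        self.rows = {}           # row OID -> {column OID: cell value}
        self._oids = count(1)    # assumption: a plain counter stands in for OIDs
        # Bootstrap: the "Label" column is a row that labels itself.
        self.label_col = next(self._oids)
        self.rows[self.label_col] = {self.label_col: "Label"}

    def add_row(self, cells=None):
        oid = next(self._oids)
        self.rows[oid] = dict(cells or {})
        return oid

    def add_column(self, label):
        """Defining a column means adding a row that describes it."""
        return self.add_row({self.label_col: label})

    def set_cell(self, row_oid, col_oid, value):
        if col_oid not in self.rows:
            raise KeyError("column is not defined in the table")
        self.rows[row_oid][col_oid] = value


table = Table()
name_col = table.add_column("Name")
employer_col = table.add_column("Employed By")
acme = table.add_row({name_col: "Acme"})
scott = table.add_row({name_col: "Scott"})
table.set_cell(scott, employer_col, acme)  # a cell may point to another record
```

  • Since column definitions live in the same structure as ordinary records, adding an attribute is just another row insertion and needs no separate schema change.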
  • the present invention includes an index structure to allow for rapid searches. Text from each cell is stored in a key word index which itself is stored in the table.
  • The text cells include pointers to the entries in the key word index and the key word index contains pointers to the cells. This two-way association provides for extended queries.
  • the invention further includes weights and filters for such extended queries.
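  • The two-way association between cells and key word index entries can be sketched as a pair of mappings. The sample cells, the whitespace tokenizer, and the function names below are assumptions made for illustration, and `extended_search` shows only one plausible reading of an "extended query" (following a word to its cells, then to the other words in those cells).

```python
from collections import defaultdict

# Hypothetical cell contents, keyed by (record, column).
cells = {
    ("rec1", "notes"): "storage partition design",
    ("rec2", "notes"): "partition merge schedule",
    ("rec3", "notes"): "merge conflict rules",
}

index = defaultdict(set)  # key word -> cells that contain it
cell_words = {}           # cell -> key words it contains (the reverse pointers)

for cell, text in cells.items():
    words = set(text.split())
    cell_words[cell] = words
    for word in words:
        index[word].add(cell)


def search(word):
    """Direct lookup: cells containing the word."""
    return index.get(word, set())


def extended_search(word):
    """Follow word -> cells -> their other words -> further cells."""
    hits = set(search(word))
    for cell in search(word):
        for related in cell_words[cell]:
            hits |= index[related]
    return hits


direct = search("storage")
extended = extended_search("storage")
```

  • Here the direct search finds only the first record, while the extended search also reaches the second record through the shared word "partition".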
  • the present invention includes a thesaurus and knowledge base that enhances indexed searches.
  • The thesaurus is stored in the table and allows a user to search for synonyms and concepts and also provides a weighting mechanism to rank the relevance of retrieved records.
  • An application support layer includes a word processor, a password system, hypertext and other functions.
  • the novel word processor of the present invention is integrated with the table of the present invention to allow cells to be edited with the word processor.
  • the table may be interfaced with external documents which allows a user to retrieve data from external documents according to the enhanced retrieval system of the present invention.
  • FIG. 1 is a functional block diagram illustrating one possible computer system incorporating the teachings of the present invention.
  • FIG. 2 is a block diagram illustrating the partition structure of the present invention in a client-server architecture.
  • FIG. 3 illustrates the linkage between the partitions of FIG. 2 and shows how files are transferred from one partition to another.
  • FIG. 4a illustrates the structure of an appendable list data item that may exist within more than one partition.
  • FIG. 4b illustrates the structure of an appendable text data item that may exist within more than one partition.
  • FIG. 5 is a flow chart for reading and writing data items according to the teachings of the present invention.
  • FIG. 6 is an illustration of an operation for merging files located in a journal partition to a file located in a library partition.
  • FIG. 7 is a flow chart illustrating the sequence of steps of the present invention for writing data to a consolidation file.
  • FIG. 8 is a flow chart illustrating the sequence of steps of the present invention for consolidating the consolidation file.
  • FIG. 9 is a flow chart illustrating the sequence of steps of the present invention for merging the consolidation file into a library file.
  • FIG. 10 illustrates the structure of a journal partition file in the preferred embodiment.
  • FIG. 11 illustrates the structure of an object stored in the journal partition.
  • FIG. 12 is a flow chart for inserting, updating, and deleting data items from the journal file.
  • FIG. 13 illustrates the "sentinel" feature of the present invention for storing tables that map objects stored in the journal file to blocks of physical memory.
  • FIG. 14 is a block diagram illustrating the main components of the present invention.
  • FIG. 15 illustrates the table structure of the database of the present invention.
  • FIG. 16 is a flow chart for a method of computing object identification numbers (OID's) that define rows and columns in the table of FIG. 15.
  • FIG. 17 is a part of the table of FIG. 15 illustrating the column synchronization feature of the present invention.
  • FIG. 18 is a flow chart for a method of searching the table of FIG. 15.
  • FIG. 18A is a flow chart for synchronizing columns of the table of FIG. 15.
  • FIG. 18B illustrates the results of column synchronization.
  • FIG. 19A illustrates a reference within one column to another column.
  • FIG. 19B illustrates an alternate embodiment for referring to another column within a column.
  • FIG. 20 illustrates a "Record Contents" column of the present invention that indicates which columns of a particular record have values.
  • FIG. 21 illustrates a folder structure that organizes records.
  • the folder structure is stored within the table of FIG. 15.
  • FIG. 22 illustrates the correspondence between cells of the table of FIG. 15 and a sorted key word index.
  • FIG. 23 illustrates the "anchors" within a cell that relate a word in a cell to a key word index record.
  • FIG. 24 illustrates key word index records stored in the table of FIG. 15.
  • FIG. 25 illustrates the relationship between certain data records and key word index records.
  • FIG. 26 illustrates the relationship of FIG. 25 in graphical form.
  • FIG. 27A illustrates an extended search in graphical form.
  • FIG. 27B illustrates a further extended search in graphical form.
  • FIG. 28 illustrates the thesaurus structure of the present invention stored in the table of FIG. 15.
  • FIG. 29 illustrates prior art hypertext.
  • FIG. 30 illustrates the hypertext features of the present invention.
  • FIG. 31A illustrates a character and word box structure of the word processor of the present invention.
  • FIG. 31B illustrates the word and horizontal line box structure of the word processor of the present invention.
  • FIG. 31C illustrates the vertical box structure of the word processor of the present invention.
  • FIG. 32 illustrates the box tree structure of the word processor of the present invention.
  • FIG. 33A illustrates the results of a prior art sorting algorithm.
  • FIG. 33B illustrates the results of a sorting algorithm according to the present invention.
  • FIG. 34 illustrates the correspondence between cells of the table of FIG. 15 and a sorted date index.
  • the manipulations performed are often referred to in terms, such as adding or comparing, which are commonly associated with mental operations performed by a human operator. No such capability of a human operator is necessary, or desirable in most cases, in any of the operations described herein which form part of the present invention; the operations are machine operations.
  • Useful machines for performing the operations of the present invention include general purpose digital computers or other similar digital devices. In all cases there should be borne in mind the distinction between the method operations in operating a computer and the method of computation itself.
  • the present invention relates to method steps for operating a computer in processing electrical or other (e.g., mechanical, chemical) physical signals to generate other desired physical signals.
  • the present invention also relates to apparatus for performing these operations.
  • This apparatus may be specially constructed for the required purposes or it may comprise a general purpose computer as selectively activated or reconfigured by a computer program stored in the computer.
  • the algorithms presented herein are not inherently related to a particular computer or other apparatus.
  • various general purpose machines may be used with programs written in accordance with the teachings herein, or it may prove more convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these machines will appear from the description given below.
  • a data item as referred to herein corresponds to a discrete element of data that a user may wish to access.
  • a data item may comprise a particular record of a database or a particular field within a record of a database.
  • a data item may comprise a word processing file or any other type of file.
  • a data object as referred to herein stores a version of a data item. Different versions of the same data item may be stored in different data objects. For example, an original version of a text file and an updated version will be stored in two different data objects that each correspond to the same data item, the actual text file.
  • a domain describes the type of a particular data item and is used consistently with the terminology in the copending Application entitled “Method and Apparatus for Improved Information Storage and Retrieval System” filed February 3, 1995, Serial No. 08/383,752.
  • a particular data item may be of the text, number or Boolean domains, or a user defined domain.
  • the present invention discloses methods and apparatus for data storage, manipulation and retrieval. Although the present invention is described with reference to specific block diagrams, and table entries, etc., it will be appreciated by one of ordinary skill in the art that such details are disclosed simply to provide a more thorough understanding of the present invention. It will therefore be apparent to one skilled in the art that the present invention may be practiced without these specific details.
  • FIG. 1 illustrates an information storage and retrieval system structured in accordance with the teachings of the present invention.
  • the information storage and retrieval system includes a computer 23 which comprises four major components. The first of these is an input/output (I/O) circuit 22, which is used to communicate information in
  • The computer 23 includes a central processing unit (CPU) 24 coupled to the I/O circuit 22 and to a memory 26.
  • Also shown in FIG. 1 is a keyboard 30 for inputting data and commands into computer 23 through the I/O circuit 22, as is well known.
  • a CD ROM 34 is coupled to the I/O circuit 22 for providing additional programming capacity to the system illustrated in FIG. 1.
  • Additional devices may be coupled to the computer 23 for storing data, such as magnetic tape drives, buffer memory devices, and the like.
  • a device control 36 is coupled to both the memory 26 and the I/O circuit 22, to permit the computer 23 to communicate with multi-media system resources. The device control 36 controls operation of the multi-media resources to interface the multi-media resources to the computer 23.
  • A display monitor 43 is coupled to the computer 23 through the I/O circuit 22.
  • A cursor control device 45 includes switches 47 and 49 for signaling the CPU 24 in accordance with the teachings of the present invention.
  • The cursor control device 45 (commonly referred to as a "mouse") permits a user to select various command modes, modify graphic data, and input other data utilizing switches 47 and 49. More particularly, the cursor control device 45 permits a user to selectively position a cursor 39 at any desired location on a display screen 37 of the display 43.
  • the cursor control device 45 and the keyboard 30 are examples of a variety of input devices which may be utilized in accordance with the teachings of the present invention. Other input devices, including for example, trackballs, touch screens, data gloves or other virtual reality devices may also be used in conjunction with the invention as disclosed herein.
  • the present invention comprises two main components.
  • the first component is a distributed file architecture that permits two or more users to access a common file.
  • the second component is the physical storage system within the local computer 23 that supports variable length data items and maintains previous versions of the data items.
  • the Specification will discuss these components in turn.
  • FIG. 2 illustrates an overview of the physical storage architecture of the present invention.
  • The computer 23, commonly known as a client, communicates with a remote computer 56, commonly known as a server, that contains database files and other files that the computer 23 and other computers may access.
  • the transmission of data items between physical locations can occur over any network communication system, including, but not limited to: TCP/IP, Novell IPX, and NetBEUI.
  • The packaging protocol used to transmit data items may be any standard scheme for transmitting data, including, but not limited to: file transfer protocols such as FTP, modem transfer protocols such as ZMODEM, email protocols such as SMTP, the Hypertext Transfer Protocol (HTTP), and so on.
  • the present invention divides the physical storage system into partitions where each physical device contains at least one partition.
  • Each partition comprises one or more associated data files.
  • the client computer 23 includes a journal partition 58 stored on the disk 32 while the server 56 includes a library partition 60 and an archive partition 62 that reside on the same or different storage devices within the server 56.
  • FIG. 2 illustrates one type of architecture structured according to the teachings of the present invention.
  • the library partition 60 may reside on a CD-ROM and the journal partition on a client computer 23.
  • all three partitions may reside on the client computer 23.
  • The journal partition is on a network server, one library partition is on the same server, and a second library partition is connected remotely over the Internet.
  • a particular list of linked partitions is called a 'partition chain', as illustrated by partitions 58, 60 and 62 in FIG. 2.
  • a partition chain may contain any number of partitions, including one.
  • the partition 58 nearest the user is called the 'update partition' and must be a journal partition and is the only partition in the chain that can be updated directly.
  • the other partitions, 60 and 62 are 'remote partitions' and are read-only partitions such that they can be read from but not written to directly.
  • Partitions may be classified according to various types, depending upon the function of the partition.
  • a journal partition such as the partition 58 comprises at least one append-only journal file as will be described more fully below.
  • a library partition such as the partition 60, stores a 'packed' version of the journal partition, containing only a single version of each data item.
  • An archive partition such as the partition 62, stores multiple historical versions of the data. Other types of partitions are possible.
  • journal, library and archive partitions are linked together as illustrated in FIG. 2.
  • Updates to files are not written directly to the library partition 60. Instead, updates are stored in the journal partition 58 immediately and then provided to the server 56 and merged into the library partition 60 at some later time.
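  • The routing rules of a partition chain (only the update partition is written to, while the remote partitions stay read-only) can be sketched as below. The classes `Partition` and `PartitionChain` are hypothetical, and the read rule, in which the nearest partition holding an item supplies its value, is an assumption consistent with the layering described in this text rather than a quoted algorithm.

```python
class Partition:
    """One partition: a named collection of data items."""

    def __init__(self, name, items=None, writable=False):
        self.name = name
        self.items = dict(items or {})
        self.writable = writable  # only a journal partition is writable


class PartitionChain:
    """Ordered partitions, nearest to the user first."""

    def __init__(self, partitions):
        if not partitions or not partitions[0].writable:
            raise ValueError("the update partition must be a writable journal")
        self.partitions = list(partitions)

    def read(self, item_id):
        # Assumption: the nearest partition holding the item supplies its value.
        for partition in self.partitions:
            if item_id in partition.items:
                return partition.items[item_id]
        raise KeyError(item_id)

    def write(self, item_id, value):
        # Updates never touch the remote partitions directly.
        self.partitions[0].items[item_id] = value


journal = Partition("journal 58", writable=True)
library = Partition("library 60", {"report": "v1"})
archive = Partition("archive 62", {"report": "v0", "memo": "m0"})
chain = PartitionChain([journal, library, archive])

before = chain.read("report")  # served by the library
chain.write("report", "v2")    # stored in the journal only
after = chain.read("report")   # now served by the journal
```

  • The library copy is left untouched until a later merge, which is what lets the server-side files remain unlocked while clients update.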
  • FIG. 3 illustrates the linkage between the journal partition 58, the library partition 60 and the archive partition 62.
  • a journal file 70 residing within the journal partition 58 includes various data objects, for example database records, and the file may also contain unused memory.
  • The journal file 70 is packed and consolidated into a new consolidation file 70 which can be inserted in the partition chain between the now empty journal files and the library file.
  • the consolidated journal file may be optionally packed and then stored in a library file 72 stored within the library partition 60.
  • The server 56 may write the library file 72 to an archive file 74, stored within the archive partition 62.
  • the archive file 74 contains multiple versions of the same data object.
  • the library may contain a large data item such as an item which stores a list of 10,000 pointers to objects or a large text document. In these cases, updating the value of the data item would cause an unnecessary duplication of the data.
  • The physical storage system of the present invention supports 'appendable' data items, which distribute the storage of their contents across multiple partitions.
  • An appendable data item keeps track of the changes to the original data and stores only the changes to the original data in the journal.
  • FIGS. 4a and 4b show two implementations of appendable items for a list and text data, respectively.
  • FIG. 4a illustrates a list data item, which comprises an original list stored in a remote partition and additions to and removals from the list stored in a local partition.
  • the original list is a read only list and any updates must be written to the update list. Changes might be stored as identification numbers to add to the original list, and identification numbers to remove from the original list.
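  • An appendable list of this kind can be sketched as a read-only original plus locally stored additions and removals. The class `AppendableList` and the sample identification numbers are hypothetical; the sketch assumes identification numbers are unique within the list.

```python
class AppendableList:
    """Read-only original list plus locally stored additions and removals."""

    def __init__(self, original):
        self._original = tuple(original)  # remote partition: never modified
        self._added = []                  # local partition: IDs to add
        self._removed = set()             # local partition: IDs to remove

    def add(self, oid):
        self._removed.discard(oid)        # re-adding cancels an earlier removal
        if oid not in self._original and oid not in self._added:
            self._added.append(oid)

    def remove(self, oid):
        if oid in self._added:
            self._added.remove(oid)
        elif oid in self._original:
            self._removed.add(oid)

    def current(self):
        """Combine the original list with the local changes."""
        kept = [oid for oid in self._original if oid not in self._removed]
        return kept + self._added


pointers = AppendableList([101, 102, 103])
pointers.add(104)
pointers.remove(102)
view = pointers.current()
```

  • Only the two small change sets need to be stored locally, however long the original list is.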
  • FIG. 4b illustrates a text data item stored as an appendable list 82, which comprises original text stored in a remote partition and additions and deletions from the text stored in a local partition.
  • the original text is stored such that it is read only text and any updates must be written to the local partition.
  • the changes might be stored as a series of editing actions such as insertions, deletions, and formatting actions.
  • appendable data items are advantageous. They allow the storage requirements for updates to be minimized since the original information need not be stored in the local partition. Further, they reduce synchronization problems since the local partition stores only the changes to the original data and not the original data itself. Finally, the use of appendable data items allows read-only media such as CD-ROMs and one-way electronic publishing services to be annotated.
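  • For text, the locally stored changes can be modelled as a sequence of editing actions replayed over the read-only original. The tuple encoding of the actions and the function `apply_edits` are assumptions for illustration; formatting actions are omitted.

```python
original_text = "The quick fox"  # stands in for read-only text on a remote partition

# Local partition: editing actions as (action, position, argument).
# For "insert" the argument is the text; for "delete" it is a character count.
edits = [
    ("insert", 10, "brown "),
    ("delete", 0, 4),
]


def apply_edits(text, actions):
    """Replay the stored editing actions, in order, over the original text."""
    for action, position, argument in actions:
        if action == "insert":
            text = text[:position] + argument + text[position:]
        elif action == "delete":
            text = text[:position] + text[position + argument:]
        else:
            raise ValueError(f"unknown editing action: {action}")
    return text


current_text = apply_edits(original_text, edits)
```

  • The original string is never altered, which mirrors how a CD-ROM document could be annotated from a local journal.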
  • Multiple partition chains can share some or all of the same library and archive partitions, which provides a means for providing the data items in the library and archive partitions as shared, while allowing the data items in the journal partition to have a local version, independent of data items in other journals or shared data items.
  • The various journal files can be consolidated to various degrees, by excluding certain data items from consolidation. Furthermore, the consolidation file itself can act as a new partition, without being merged into the prior library partition.
  • The partition chain for user X contains a data item A1 in its journal file.
  • The partition chain for user Y contains a data item B1 in its journal file.
  • Both chains contain a library partition C which has a data item C1, and an older version of data item A1, called A2.
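  • The two chains in this example can be modelled with Python's standard `collections.ChainMap`, which looks a key up in each mapping in order and directs writes to the first one. Treating A1 and A2 as two versions of one item keyed "A" is an assumption of this sketch, as is the use of `MappingProxyType` to make the shared library read-only.

```python
from collections import ChainMap
from types import MappingProxyType

# Shared library partition C: item C (version C1) and an older version of item A.
library_c = MappingProxyType({"C": "C1", "A": "A2"})

journal_x = {"A": "A1"}  # user X's journal holds the newer version of item A
journal_y = {"B": "B1"}  # user Y's journal holds item B

chain_x = ChainMap(journal_x, library_c)
chain_y = ChainMap(journal_y, library_c)

x_sees_a = chain_x["A"]   # X's journal masks the library's older version
y_sees_a = chain_y["A"]   # Y falls through to the shared library version
x_view = dict(chain_x)    # everything visible to user X
y_view = dict(chain_y)    # everything visible to user Y
```

  • Both users read the same library object, yet each sees a data set personalised by whatever sits in the layer nearest to them.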
  • the various layers can serve various purposes, for example, there could be a four layer system, with the layers defined as follows: the first layer can be personalized for a single user, the next layer can contain shared information for a workgroup, the succeeding layer can contain shared information for a company-wide information system, and the final layer can contain publicly shared information.
  • the current invention provides for the advantageous ability for users to exchange subsets of their data items with each other by using the layer method.
  • a user X can create a new partition (A, say) based on a subset of his data, and transmit it to another user Y, using any standard data transfer system as described above.
  • User Y can then insert the partition A into his partition chain with the result that all of the data items in Partition A immediately and transparently appear to be part of user Y's data set, unless a particular data item was masked by a 'higher' layer.
  • This ability can be advantageously applied to such requirements as: synchronizing users who are not sharing a centralized system, shipping updates and annotations to a read-only published medium such as a CD-ROM, and gathering and consolidating information from distinct sources.
  • The system provides the consolidated contents of the journal partition 58 to the library partition 60 according to clock intervals or the occurrence of events.
  • The user of the system may define those conditions that trigger such a merge operation, such as when the journal partition 58 contains a specified amount of data or when a certain number of transactions have occurred since the most recent merge operation.
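  • A user-defined merge trigger of this kind can be sketched as a small policy object. The class `MergePolicy`, its field names, and the default thresholds are hypothetical values chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class MergePolicy:
    """User-defined conditions that trigger a merge (thresholds are illustrative)."""

    max_journal_bytes: int = 1_000_000
    max_transactions: int = 500
    max_seconds: float = 3600.0

    def should_merge(self, journal_bytes, transactions_since_merge, seconds_since_merge):
        """Merge when any one condition is met: size, transaction count, or elapsed time."""
        return (
            journal_bytes >= self.max_journal_bytes
            or transactions_since_merge >= self.max_transactions
            or seconds_since_merge >= self.max_seconds
        )


policy = MergePolicy(max_journal_bytes=4096, max_transactions=10, max_seconds=60.0)
idle = policy.should_merge(journal_bytes=100, transactions_since_merge=2, seconds_since_merge=5.0)
busy = policy.should_merge(journal_bytes=100, transactions_since_merge=10, seconds_since_merge=5.0)
```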
  • FIG. 6 illustrates a merge operation, where a plurality of data items 120, 122 and 124 in different locations within the journal partition 58 are copied and the copies provided to the library partition 60 where they are consolidated and merged with the other data in the library partition 60.
  • the data items may be compressed according to a data compression algorithm.
  • FIG. 7 is a flow chart for a merge operation.
  • data is written to a file in the journal partition 58.
  • the system determines whether the device on which the library partition 60 resides may be written to. If the device is a read-only device such as a CD-ROM, then the merge process cannot occur, and the routine halts at block 146. Otherwise, the system branches to block 148 where data is provided to the library partition 60 from the journal partition 58.
  • the system determines whether other journal files need to be merged. If so, the system branches back to block 148. Otherwise, the system consolidates multiple data items from the journal partition 58 into a single consolidation file and the file is merged into the library file, as illustrated in block 152. Subsequently, the routine exits, as illustrated in block 154.
  • FIG. 8 is a flow chart for the consolidation procedure.
  • the routine initializes and a new 'consolidated' file is created which will eventually contain all the data from the journal files. For each journal file in turn, and each data item in each journal file, the routine attempts to add the data item to the consolidation file.
  • the routine determines whether another version of the data item from another source, usually a device associated with a different user, already exists within a consolidation file. If not, the new data is added to the consolidation file at block 184 and the routine exits at block 186. If another version of the data item from another source already exists within the consolidation file, block 162 branches to block 164 and the conflict between the versions is resolved by applying the rules specified by the user or specified by the type of data object.
  • the conflict may be resolved by a merge operation. For example, two changes to a text document that do not overlap can be merged. If the routine resolved the conflict in block 164, block 166 branches to block 174 where the new data is merged with the data from another source using the method defined by the user or object type (domain). The system then retrieves the next item at block 182.
  • block 166 branches to block 168 where the system determines whether the new item or the item from another source will be stored. If the new item wins the conflict and will thus be stored, block 168 branches to block 176 where the item from another source is removed from the consolidation file and a message provided to the user that created the item from another source to inform that user that the data item will not be stored. Subsequently, the routine branches to block 182.
  • the winner of the conflict may be determined by a number of rules, including but not limited to, which item had the most recent timestamp or the higher value. Alternatively, the routine may give priority to the user with a higher status journal file or the user who entered the information.
  • block 168 branches to 170 where the system determines whether the item from another source wins the conflict. If so, block 170 branches to block 178 and the new data item is removed from the consolidation file and a message provided to the user that created the new data item and the routine branches to block 182. Finally, if neither the new item nor the item from another source wins the conflict, both data items are removed from the consolidation file and a message provided to the users of both items. Subsequently, the routine branches to block 182.
  • the old journal files can be cleared, ready to be updated again. If the library file must be left unchanged, the routine can halt and the consolidation file can be inserted in the partition chain as a new library partition, creating a new 'layer' as described above. Otherwise, the consolidation file can be, in turn, merged with the library file.
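The consolidation procedure of FIG. 8 can be sketched as follows. This is an illustration under stated assumptions: the names `Item` and `consolidate` are hypothetical, the "most recent timestamp wins" rule is only one of the rules the text mentions, and the returned notices stand in for the messages sent to users whose version is not stored. A later item from the same source is assumed to supersede the earlier one without conflict.

```python
from dataclasses import dataclass


@dataclass
class Item:
    item_id: str
    value: str
    timestamp: int
    source: str  # user or device that wrote the item


def consolidate(journals, try_merge=None):
    """Build a consolidation mapping from several journals.

    Returns (consolidated, notices); notices lists (source, item_id) pairs
    for users whose version was not stored.
    """
    consolidated, notices = {}, []
    for journal in journals:
        for new in journal:
            old = consolidated.get(new.item_id)
            if old is None or old.source == new.source:
                consolidated[new.item_id] = new  # no conflict: add or supersede
                continue
            merged = try_merge(old, new) if try_merge else None
            if merged is not None:
                consolidated[new.item_id] = merged  # conflict resolved by merging
            elif new.timestamp > old.timestamp:
                consolidated[new.item_id] = new  # new item wins
                notices.append((old.source, new.item_id))
            elif new.timestamp < old.timestamp:
                notices.append((new.source, new.item_id))  # existing item wins
            else:
                del consolidated[new.item_id]  # neither wins: drop both
                notices += [(old.source, new.item_id), (new.source, new.item_id)]
    return consolidated, notices


journal_x = [Item("A1", "x-val", 10, "X")]
journal_y = [Item("A1", "y-val", 20, "Y"), Item("B1", "b", 5, "Y")]
consolidated, notices = consolidate([journal_x, journal_y])
```

Passing a `try_merge` callable that returns a merged `Item` (or `None` when the changes overlap) corresponds to the merge branch at block 174.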
  • FIG. 9 is a flow chart for merging a consolidation file with a library file.
  • the routine determines whether an older version of the data item already exists in the library file, and, if not, the routine branches to block 194. Otherwise, block 192 determines whether the older version is to be preserved. If so, the older version is transferred to the archive file as illustrated in block 196. If the older version is not to be preserved, it is deleted, and block 192 branches to block 194.
  • the system determines whether the new item comprises an appendable record that must be stored with its parent. If so, the new data item is merged with the existing older version using the merge method defined by the domain and the routine exits at block 202. According to the present invention, data may be merged from multiple sources, none of which need to be connected to the device upon which the library partition resides. If the new item is not an appendable record, block 194 branches to block 198 and the new data item is added to the library file, overwriting any older version. At block 198, as an option, the old version may be archived. Subsequently, the routine exits at block 202.
  • FIG. 5 is a flow chart for reading data according to the teachings of the present invention.
  • the system first searches any local journal partitions and then searches remote partitions in order, such as the library partition and then the archive partition, to find a data item with the identification number of the data item being read. If the system cannot locate any data items, block 92 branches to block 94 and a "NULL" or default value is returned.
  • block 92 branches to block 96, and the system determines whether the data item is a "tombstone," that is, whether the particular data item has been deleted. If so, the system branches to block 94. Otherwise, at block 98 the system determines whether the append flag of the item is set and, if not, the system returns the data item as shown in block 100. If the append flag is set, indicating an appendable item, at block
  • the system searches other partitions to find other data items with the same identification number. If no parent data item is found, the system branches from block 104 to block 106 where the system indicates an error since an append flag implies that a parent data item exists.
  • the system determines whether the parent data item's append flag is set at block 108, which indicates whether the parent has its own parent. If so, the routine branches back to block 102 where the next partition, in order, is searched. When the routine locates all related data items, they are merged into one item and returned, as illustrated in block 110.
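The read routine of FIG. 5 can be sketched as a walk over the partitions in order. The names `Stored` and `read_item` are hypothetical, each partition is assumed to be a simple mapping, and merging is assumed to mean concatenating the parent's text before the appended text; the actual merge method is defined by the domain.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stored:
    value: str = ""
    tombstone: bool = False  # True when the item records a deletion
    append: bool = False     # True when the item extends a parent in a later partition


def read_item(chain, item_id) -> Optional[str]:
    """Search partitions in order (journal, library, archive) for item_id."""
    pieces = []
    for partition in chain:
        item = partition.get(item_id)
        if item is None:
            continue  # this partition holds no version of the item
        if not pieces and item.tombstone:
            return None  # most recent version is a deletion marker
        pieces.append(item.value)
        if not item.append:
            # Reached an item with no parent: merge oldest first and return.
            return "".join(reversed(pieces))
    if pieces:
        raise LookupError("append flag set but no parent item found")
    return None  # item not found anywhere: NULL/default value


journal = {"doc": Stored(" world", append=True), "gone": Stored(tombstone=True)}
library = {"doc": Stored("hello"), "gone": Stored("old text")}
```

With these two partitions, `read_item([journal, library], "doc")` merges the appended text with its parent, while `"gone"` reads as deleted even though the library still holds an older value.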
  • One of the advantages of the current invention is that reading and writing of data items does not require all partitions to be present at all times. If data is mainly read from (and always written to) the journal partition, then some or any of the library and archive partitions can be absent without detrimental effect.
  • the current invention provides for processing data over an unreliable or completely disconnected partition chain.
  • a user may have a journal partition and small, local library partition on a laptop computer. At certain times, the user can connect to a network to access a much larger master library partition and archive partition, but can also disconnect from the network and still be able to enter and retrieve data items successfully.
  • While the user is connected to the network, consolidation of the journal file may occur, if desired, after which a new journal and library file can replace the prior versions.
  • the journal partition 58 can store variable length data items, such as a free text database record, on a storage medium such that prior versions of these same items can be retained.
  • FIG. 10 illustrates the structure of the journal partition 58.
  • the journal partition may reside on the mass memory 32 of FIG. 1.
  • the memory that includes the journal partition 58 is divided into physical storage device blocks 250, 252 and 254.
  • Data objects, including data objects 256, 258 and 262, are stored within the blocks 250, 252 and 254.
  • journal partition 58 may include older versions of the same data object.
  • a data object 256 may comprise a database cell including information about a company's employees and data object 258 may represent that cell after a user has updated it.
  • the system creates a new block when needed and the system stores a table 260 that relates objects to their respective blocks. The table 260 is updated each time an object is written to a block.
  • FIG. 11 shows the contents of the object 262.
  • the object 262 comprises five fields: a status field 264, an identifier field 266, a data field 268, a pointer field 270 and a timestamp field 272.
  • the object 262 need not contain all of the other fields and the status field 264 contains flags that indicate those fields that an object contains.
  • the data field 268 stores data such as text and numbers corresponding to the object 262 and the pointer field 270 contains a pointer to a prior version of the object 262.
  • the timestamp field 272 indicates when the object 262 was created and the identifier field 266 contains a number identifying the object 262 that is used in the table 260.
  • deleting a data item must be handled specially.
  • a special marker called a 'tombstone' is written to the journal partition to signify that the data item was deleted.
  • the tombstone comprises an object with a data field that has no value and a special status flag is set to show that the item is a tombstone.
  • the tombstone object stores a pointer to an object that contains the last version of the data item to be deleted.
  • the most recent version of the data item is retrieved by looking up the appropriate block in the table 260. Once the most recent version of a data item has been retrieved by retrieving the item's associated most recent object, prior versions can be retrieved by using the pointer stored within the retrieved object.
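The version chain described above can be sketched with an append-only list of objects and a lookup table from identifier to the address of the most recent object. The names `JournalObject` and `Journal` are hypothetical, a list index stands in for a disk address, a counter stands in for the timestamp, and a Python dict stands in for the table 260.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JournalObject:
    identifier: int
    data: Optional[str]
    prior: Optional[int]  # address of the previous version, if any
    timestamp: int
    tombstone: bool = False


class Journal:
    def __init__(self):
        self.objects = []  # append-only; the list index plays the role of an address
        self.table = {}    # identifier -> address of the most recent object
        self._clock = 0    # stand-in for a real timestamp source

    def _append(self, identifier, data, tombstone=False):
        self._clock += 1
        obj = JournalObject(identifier, data, self.table.get(identifier),
                            self._clock, tombstone)
        self.objects.append(obj)
        self.table[identifier] = len(self.objects) - 1
        return self.table[identifier]

    def write(self, identifier, data):
        """Insert or update: both simply append a new object."""
        return self._append(identifier, data)

    def delete(self, identifier):
        """Deletion appends a tombstone that points at the last version."""
        return self._append(identifier, None, tombstone=True)

    def versions(self, identifier):
        """Yield the most recent object first, then follow the prior pointers."""
        address = self.table.get(identifier)
        while address is not None:
            obj = self.objects[address]
            yield obj
            address = obj.prior


j = Journal()
j.write(7, "v1")
j.write(7, "v2")
j.delete(7)
history = list(j.versions(7))
```

Here `history` starts with the tombstone and then lists "v2" and "v1", which is how prior versions remain retrievable after an update or a deletion.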
  • a user may wish to discard older versions of the data items. This is done by copying the desired data items, generally the most recent, to another file, and discarding the original file.
  • FIG. 12 is a flow chart for inserting items to the journal partition 58, updating items within the journal partition 58 and deleting data items from the journal partition 58. According to the routine illustrated in FIG. 12, inserting, updating and deleting are performed by a similar method and status flags indicate the difference between the actions.
  • All three operations include writing a new object to the journal partition 58.
  • the new object includes the updated data and points to the previous version, as previously described.
  • a tombstone object is written to the journal partition 58 indicating the deleted data item, as previously described.
  • an insert operation begins and branches to block 282, where the prior address flag is set to FALSE since the insertion of an item implies that there is no prior address to point to.
  • the system stores the address of the object containing the item to be updated and sets its prior address flag to TRUE as shown in block 302.
  • the routine sets the "tombstone" flag to TRUE and the "data value" flag to FALSE, indicating that there is no data in the object being written, and that the object being written implies the deletion of a data item, as shown in block 316.
  • the system then writes the new object to the journal partition 58.
  • the routine may process the new object according to various options. For example, at block 284, the routine determines whether it will store an object identifier in the object identifier field. Storing the identifier is not necessary for retrieval, but can be used to recover data in case of file corruption. If the identifier is not to be stored, block 284 branches to block 304 and the identifier flag is set OFF. Block 304 branches to block 286 where the status flags are written to the journal partition 58.
  • the routine determines whether the identifier flag is TRUE. If so, the system branches to block 306 and the identifier is written to the journal partition 58. The system then branches to block 290, to determine whether the value flag is TRUE. If so, the system writes the data value to the journal partition 58. Similarly, at block 292, the routine determines whether the prior address flag is TRUE. If so, the system branches to block 310 and the prior address is written to the pointer field in the new data object created in the journal partition 58. The system then branches to block 294, to determine whether the timestamp flag is TRUE. If so, the system writes the timestamp to the timestamp field of the new object created in the journal partition 58.
  • the table 260 is updated to reflect the new location of the data item on disk corresponding to the new object written to the journal partition 58.
  • This approach allows for various options. For example, for all items, it is optional to store the identifier. If the identifier, timestamp, and prior pointer are not stored, the required storage size of the data item is minimal.
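The flag-driven write of FIG. 12 can be sketched as an encoder that emits a status byte followed by only those fields whose flags are set. The byte layout here (one status byte, 8-byte little-endian integers, a 4-byte length prefix for the value) and the names `Status`, `encode`, and `decode` are assumptions made for illustration; the source does not specify an on-disk format.

```python
import struct
from enum import IntFlag


class Status(IntFlag):
    IDENTIFIER = 1
    VALUE = 2
    PRIOR = 4
    TIMESTAMP = 8
    TOMBSTONE = 16


def encode(identifier=None, value=None, prior=None, timestamp=None, tombstone=False):
    """Write the status flags first, then each optional field that is present."""
    flags, body = Status(0), b""
    if tombstone:
        flags |= Status.TOMBSTONE
        value = None  # a tombstone carries no data value
    if identifier is not None:
        flags |= Status.IDENTIFIER
        body += struct.pack("<Q", identifier)
    if value is not None:
        flags |= Status.VALUE
        body += struct.pack("<I", len(value)) + value
    if prior is not None:
        flags |= Status.PRIOR
        body += struct.pack("<Q", prior)
    if timestamp is not None:
        flags |= Status.TIMESTAMP
        body += struct.pack("<Q", timestamp)
    return bytes([int(flags)]) + body


def decode(buf):
    """Read the status byte, then only the fields it announces."""
    flags, pos = Status(buf[0]), 1
    out = {"tombstone": Status.TOMBSTONE in flags}
    if Status.IDENTIFIER in flags:
        (out["identifier"],) = struct.unpack_from("<Q", buf, pos)
        pos += 8
    if Status.VALUE in flags:
        (size,) = struct.unpack_from("<I", buf, pos)
        out["value"] = bytes(buf[pos + 4 : pos + 4 + size])
        pos += 4 + size
    if Status.PRIOR in flags:
        (out["prior"],) = struct.unpack_from("<Q", buf, pos)
        pos += 8
    if Status.TIMESTAMP in flags:
        (out["timestamp"],) = struct.unpack_from("<Q", buf, pos)
    return out


minimal = encode(value=b"hi")  # no identifier, pointer, or timestamp stored
full = encode(identifier=9, value=b"hi", prior=3, timestamp=100)
```

In this layout the minimal object costs only the status byte, the length prefix, and the data itself, which illustrates why omitting the optional fields keeps storage small.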
  • the structure of the table 260 is a standard extendible hash table data structure.
  • the table 260 is updated every time a new object is written to the journal partition 58. Since the table 260 may become quite large, writing it to non-volatile memory on every update is avoided; instead, a checkpoint approach is used whereby the table 260 is saved at certain user-defined intervals. For example, a user may specify that the table should be saved after every 50 updates.
  • FIG. 13 illustrates "sentinel" data objects.
  • "Sentinel" objects 350 and 352 each contain a timestamp and pointers to tables 354 and 356, respectively.
  • the tables 354 and 356 comprise versions of the table 260 and are stored in non-volatile memory when the "sentinel" objects are written to the journal partition 58.
  • the system need only reconstruct the table 260 since the actual data is already stored on the journal partition 58.
  • Reconstructing the table 260 can start from the last valid sentinel, rather than from the beginning of the file, which greatly increases the speed of recovery.
  • in the routine for reconstructing the table 260, the most recent "sentinel" object is located by reading backward from the end of the journal. This will be the last point at which the table 260 was still valid.
  • the sentinel will contain a pointer to the disk file that stores the table 260 and the table 260 may then be loaded from this file. If the table 260 is missing or damaged, the routine then attempts to find a "sentinel" object check point that is earlier in the journal file. This process continues until a valid "sentinel" object is found or the beginning of the journal file is reached.
  • journal partition 58 is read, starting at the next object written to the journal partition 58 after the
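The sentinel-based recovery can be sketched as a backward scan for the latest usable checkpoint followed by a forward replay. The journal is modelled here as a list of `("data", item_id)` and `("sentinel", table_key)` entries, and `saved_tables` maps a table key to a snapshot; both representations and the name `recover_table` are hypothetical, and a missing key stands in for a missing or damaged table file.

```python
def recover_table(journal, saved_tables):
    """Rebuild the item -> address table from the last valid sentinel onward."""
    start, table = 0, {}
    # Read backward from the end of the journal to find a usable sentinel.
    for pos in range(len(journal) - 1, -1, -1):
        kind, payload = journal[pos]
        if kind == "sentinel" and payload in saved_tables:
            table = dict(saved_tables[payload])  # load the checkpointed table
            start = pos + 1
            break
    # Replay every object written after that sentinel (or the whole journal).
    for pos in range(start, len(journal)):
        kind, payload = journal[pos]
        if kind == "data":
            table[payload] = pos
    return table


journal = [
    ("data", "a"), ("data", "b"), ("sentinel", "t1"),
    ("data", "a"), ("sentinel", "t2"), ("data", "c"),
]
saved = {"t1": {"a": 0, "b": 1}}  # the snapshot for "t2" is missing
recovered = recover_table(journal, saved)
```

Because the snapshot for the newest sentinel is unavailable, the sketch falls back to the earlier one and replays only the entries written after it; with no usable snapshot at all it replays the journal from the beginning.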
  • the physical arrangement of cells on the storage medium is flexible and can be adjusted for various requirements. For example, if a particular column or set of columns is searched regularly, the cells comprising these columns may be kept adjacent to each other in a partition. Alternatively, the cells may be separated from the main storage file and stored in a distinct file, called a 'stripe'.
  • a special storage technique can be used.
  • the contents of the data item are stored in a special location, distinct from the journal partition. This location may be reused every time the reconstructible data item is written which saves memory space and time.
  • the journal then contains a pointer to this external location, instead of the actual data itself. If, for some reason, the external location is missing or damaged, the data item can be reconstructed using an appropriate method.
  • FIG. 14 is a block diagram of the information storage and retrieval system of the present invention.
  • the present invention includes an internal database 400 that further includes a record oriented database 401 and a free-text database 402.
  • Database 400 may receive data from a plurality of external sources 403, including word processing documents 404, spreadsheets 405 and database files 406.
  • the present invention includes an application support system that interfaces the external sources 403 with the database 400.
  • a plurality of indexes 406 including a keyword index 407 and other types of indexes such as phonetic, special sorting for other languages, and market specific such as chemical, legal and medical, store sorted information provided by the database 400.
  • a knowledge system 408 links information existing in the indexes 406.
  • FIG. 14 is for conceptual purposes and, in actuality, the database 400, the indexes 406 and the knowledge system 408 are stored in the same table, as will be described more fully below.
  • This Specification will first describe the structure and features of the database 400. Next, the Specification will describe the index 406 and its implementation for searching the database 400. The
  • FIG. 15 illustrates the storage and retrieval structure of the present invention.
  • the storage and retrieval structure of the present invention comprises a table 409.
  • the structure of the table 409 is a logical structure and not necessarily a physical structure.
  • the memories 26 and 32 configured according to the teachings of the present invention need not store the table 409 contiguously.
  • the table 409 further comprises a plurality of rows 410 and a plurality of columns 420.
  • a row corresponds to a record while a column corresponds to an attribute of a record and the defining characteristics of the column are stored in a row 411.
  • the intersection of a row and a column comprises a particular cell.
  • Each row is assigned a unique object identification number (OID) stored in column 420 and each column also is assigned a unique OID, indicated in brackets and stored in row 411.
  • row 410 has an OID equal to 1100 while the column 422 has an OID equal to 101.
  • the OID's for both rows and columns may be used as pointers and a cell 412 may store an OID. The method for assigning the OID's will also be discussed below.
  • each row may include information in each column. However, a row need not, and generally will not, have data stored in every column.
  • row 410 corresponds to a company as shown in a cell 413. Since companies do not have titles, cell 414 is unused.
  • the type of information associated with a column is known as a "domain".
  • Standard domains supported in most database systems include text, number, date, and Boolean.
  • the present invention includes other types of domains such as the OID domain that points to a row or column.
  • the present invention further supports "user-defined" domains, whereby all the behavior of the domain can be determined by a user or programmer. For example, a user may configure a domain to include writing to and reading from a storage medium and handling operations such as equality testing and comparisons.
  • individual cells may be accessed according to their row and column OID's. That is, the cell (or intersection between the rows and columns) is the unit of storage and management, not the record (or row) as is standard in the prior art.
  • Using the cell as the unit of storage improves many standard data management operations that previously required the entire object or record. Such operations include versioning, security, hierarchical storage management, appending to remote partitions, printing, and other operations.
  • Each column has an associated column definition, which determines the properties of the column, such as the domain of the column, the name of the column, whether the column is required and other properties that may relate to a column.
  • the table 409 supports columns that include unstructured, free text data.
  • the column definition is stored as a record in the table 409 of FIG. 15.
  • the "Employed By" column 415 has a corresponding row 416.
  • the addition of rows that correspond to columns renders table 409 self-referential. New columns may be easily appended to table 409 by creating a new column definition record. The new column is then immediately available for use in existing records.
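The cell-oriented, self-referential table can be sketched as a sparse mapping keyed by (row OID, column OID), in which a column is defined by a row that shares its OID. The class `CellTable` is hypothetical. The OIDs 1100 and 101 come from the description above; the OIDs 100, 110, and 2000 and the employee label are invented for illustration.

```python
class CellTable:
    """Sparse table whose unit of storage is the cell, not the row."""

    def __init__(self):
        self.cells = {}  # (row_oid, column_oid) -> value

    def set(self, row, col, value):
        self.cells[(row, col)] = value

    def get(self, row, col, default=None):
        return self.cells.get((row, col), default)

    def columns_of(self, row):
        """Column OIDs for which this row actually stores a value."""
        return sorted(c for (r, c) in self.cells if r == row)


LABEL = 100        # hypothetical OID of the label column
TYPE = 101         # OID of the type column given in the description
EMPLOYED_BY = 110  # hypothetical OID of the "Employed By" column

table = CellTable()
# Column definitions are ordinary rows whose OID equals the column's OID.
table.set(LABEL, LABEL, "Label")
table.set(TYPE, LABEL, "Type")
table.set(EMPLOYED_BY, LABEL, "Employed By")
# A company record and an employee record that points at it by OID.
table.set(1100, LABEL, "DEXIS")
table.set(2000, LABEL, "Example Employee")
table.set(2000, EMPLOYED_BY, 1100)
```

Adding a new column in this sketch only requires writing one more definition row; existing records can then store a value under the new column OID without any restructuring.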
  • Dates can be specified numerically and textually.
  • An example of a numerical date is "11/6/67" and an example of a textual date is "November 6, 1967."
  • Textual entries are converted to dates using standard algorithms and lookup tables.
  • a date value can store both original text and the associated date to which the text is converted, which allows the date value to be displayed in the format in which it was originally entered.
  • Numeric values are classified as either a whole number (Integer) or fractional number.
  • Integers are stored as variable length structures, which can represent arbitrarily large numbers. All data structures and indexes use this format which ensures that there are no limits in the system.
  • Fractional numbers are represented by a <numerator/denominator> pair of variable length integers. As with dates, a numeric value can store both the original text ("4 1/2 inches") and the associated number (4.5). This allows the numeric value to be redisplayed in the format in which it was originally entered.
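The numerator/denominator representation can be illustrated with Python's standard `fractions.Fraction`, which is built on arbitrary-precision integers. The names `NumericValue` and `parse_mixed` are hypothetical, and the parser handles only a deliberately narrow grammar (whole numbers and simple fractions followed by optional unit text).

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class NumericValue:
    original: str     # text exactly as the user entered it
    number: Fraction  # exact numerator/denominator pair


def parse_mixed(text: str) -> NumericValue:
    """Parse text such as '4 1/2 inches' into an exact fraction."""
    total = Fraction(0)
    for token in text.split():
        try:
            total += Fraction(token)
        except ValueError:
            break  # stop at the first non-numeric token, e.g. a unit
    return NumericValue(text, total)


value = parse_mixed("4 1/2 inches")
huge = Fraction(10**40, 3)  # no fixed-width limit on either integer
```

Keeping `original` next to `number` lets the value be redisplayed as entered while comparisons and arithmetic use the exact fraction.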
  • a record can be associated with a "record type".
  • the record type can be used simply as a category, but also can be used to determine the behavior of records.
  • the record type might specify certain columns that are required by all records of that type and, as with columns, the type definitions are stored as records in the table 409.
  • column 422 includes the type definition for each record.
  • the column 422 stores pointers to rows defining a particular column type.
  • the row 416 is a "Field" type column and contains a pointer in cell 417 to a row 418 that defines "Field" type columns.
  • the "Type Column" 422 of the row 418 points to a type called "Type," which is defined in a row 419.
  • "Type" has a type column that points to itself.
  • Record types may constrain the values that a record of that type may contain. For example, the record type "Person" may require that records of type "Person" have a valid value in the "Name" column, the "Phone" column, and any other columns.
  • the type of a record is an attribute of the record and thus may change at any time.
  • a template is a special record that comprises a description and a list of fields.
  • a Contact template might consist of the fields: "First Name", "Last Name", "Phone". Templates are used for conveniently labeling a group of fields.
  • a template can be used for various purposes, such as determining which fields to use when editing a record, printing a record, searching for a record, or exporting a record. More sophisticated templates can contain layout information that is used when editing or printing a record.
  • any record can contain any field, so any template can be applied to any record, even if some fields may be "inappropriate" for the type of record. For example, when printing a collection of records, the same template may be applied to all the records in the collection, even if the records are of different types.
  • a record can be associated with a particular "default template". This template is used as the default when editing, printing, and exporting the record.
  • FIG. 16 is a flow chart of the method for assigning OID's.
  • CPU 24 that is running the database program stored in the memory 26 requests a timestamp from the operating system.
  • the system determines whether the received timestamp is identical to a previous timestamp. If the timestamps are identical, block 432 branches to block 434 and a tiebreaker is incremented to resolve the conflict between the identical timestamps.
  • the system determines whether the tiebreaker has reached its limit, and, if so, the system branches to block 430 to retrieve a new time stamp. Otherwise, the system branches to block 440 where the system requests a session identification which is unique to the user session.
  • the session identification is derived from the unique serial number of the application installed on the user's machine.
  • the session identification may be used to determine the type of object. For example, dates are independent of any particular machine, and so an OID for a date may have a fixed session identification.
  • the system requests a session identification which is unique to the user session.
  • OID OID-length indicating which type of OID to be used may be embedded in the header of each database.
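The OID assignment of FIG. 16 can be sketched as a generator that combines a timestamp, a tiebreaker, and a session identification. The class `OidGenerator`, the tuple layout of the result, the default tiebreaker limit, and the use of `time.time_ns` as the timestamp source are all assumptions made for illustration.

```python
import time


class OidGenerator:
    def __init__(self, session_id, tiebreaker_limit=255, clock=time.time_ns):
        self._session_id = session_id
        self._limit = tiebreaker_limit
        self._clock = clock
        self._last = None  # previous timestamp
        self._tie = 0      # tiebreaker for identical timestamps

    def next_oid(self):
        while True:
            stamp = self._clock()
            if stamp != self._last:
                # Fresh timestamp: reset the tiebreaker.
                self._last, self._tie = stamp, 0
                break
            if self._tie < self._limit:
                # Identical timestamp: increment the tiebreaker.
                self._tie += 1
                break
            # Tiebreaker reached its limit: loop to request a new timestamp.
        return (stamp, self._tie, self._session_id)


# A scripted clock makes the behaviour reproducible for the example.
stamps = iter([5, 5, 5, 6])
gen = OidGenerator(session_id=42, tiebreaker_limit=1, clock=lambda: next(stamps))
first, second, third = gen.next_oid(), gen.next_oid(), gen.next_oid()
```

With the scripted clock, the second OID reuses timestamp 5 with the tiebreaker raised to 1, and the third has to wait for timestamp 6 because the tiebreaker limit was reached.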
  • OID domains are used to store OID's, which are pointers to other records. An efficient query can use these OID's to go directly to another record, rather than searching through columns.
  • the present invention includes a novel technique for determining an OID from the textual description. Conversion from text to an OID may also be necessary when a user is entering information into a record. For example, in FIG. 15, the user may be entering information in the "Employed By" column 415, and wish to specify the text "DEXIS" and have it converted to OID #1100. For this purpose, special columns are required that provide a definition for how the search and conversion is performed.
  • FIG. 18 is a flow chart for searching the table 409 configured according to the structure illustrated in FIG. 17.
  • a user enters text through the keyboard 30 or mouse 45 for a particular column that the user wishes to search.
  • the system retrieves the search path for the column to be searched from the information stored in column 451 as illustrated in FIG. 17.
  • a cell 453 in the row 416 contains the search path information for the "Employed By" column 415 of FIG. 15.
  • the search path information for the "Employed By" field indicates that the folders called " ⁇ contacts" and " ⁇ departments" should be searched for a company with the label "DEXIS."
  • the system searches table 409 according to the retrieved search path information.
  • the routine searches for a record that has an entry in the label column 423 of FIG. 15 that is the same as the text being searched for, and is of the same class, as indicated in column 422 of FIG. 15. Folders will be further described below.
  • the system determines whether it has found any items matching the user's search text. If no items have been found, at block 458, the system prompts the user on the display screen 37 to create a new record. If the user wishes to create a new record, control passes to block 462 and the system creates a new record. At block 464, the OID of the new record is returned. If the user does not wish to create a new record, a "NIL" string is returned, as shown at block 460.
  • the system determines whether it has found more than one item, as illustrated in block 466. If only one item has been located, its OID is returned at block 468. If more than one item has been located, the system displays the list of items to the user at block 470 and the user selects a record from the list. At block 472, the OID of the selected record is returned, which, in the above example, is #1100, the OID of the record for the company "DEXIS."
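The text-to-OID conversion of FIG. 18 can be sketched as a filtered search followed by the three outcomes described above: no match, one match, or several matches. The function `text_to_oid`, the dictionary layout of `records`, and the `choose` and `create` callbacks (standing in for the prompts shown to the user) are hypothetical; the folder names are simplified and the OIDs other than 1100 are invented.

```python
def text_to_oid(records, text, wanted_type, search_path, choose=None, create=None):
    """Resolve a label typed by the user into the OID of a matching record."""
    matches = sorted(
        oid
        for oid, rec in records.items()
        if rec["folder"] in search_path
        and rec["label"] == text
        and rec["type"] == wanted_type
    )
    if not matches:
        # Nothing found: optionally create a new record, else return nothing.
        return create(text) if create else None
    if len(matches) == 1:
        return matches[0]
    # Several candidates: let the caller (the user) pick one.
    return choose(matches) if choose else matches[0]


records = {
    1100: {"label": "DEXIS", "type": "Company", "folder": "contacts"},
    1200: {"label": "DEXIS", "type": "Person", "folder": "contacts"},
    1300: {"label": "Acme", "type": "Company", "folder": "archive"},
}
path = ["contacts", "departments"]
found = text_to_oid(records, "DEXIS", "Company", path)
```

Here `found` is 1100: the record of type "Person" with the same label is filtered out by the type check, and records outside the search path are never considered.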
  • various features may be added to the search mechanism as described with reference to FIG. 18. For example, further restrictions may be added to the search; the search may be relaxed by allowing prefix matching or fuzzy matching instead of strict matching; and the search may be widened by using the "associative search" techniques described below.
  • Records may have interrelationships and it is often desirable to maintain consistency between interrelated records.
  • a record including data for a company may include information regarding employees of that company, as illustrated in row 410 of FIG. 15.
  • the employees that work for that company may have a record that indicates, by a pointer, their employer, as illustrated by row 421 of FIG. 15.
  • Thus, the employee column of a company should point to employees whose employer column points to that company.
  • the present invention includes a synchronization technique to ensure that whenever interrelated records are added or removed, the interrelationships between the columns are properly updated.
  • the system synchronizes interrelated records by adding a "Synchronize With" column 455 to the table 409 as illustrated in FIG. 17. Since the value in the columns defines the relatedness between records, the rows 416 and 457 corresponding to columns contain information within the "Synchronize With" column 455 that indicates which other columns are to be synchronized with the columns corresponding to rows 416 and 457.
  • the "Employed By" column 415 is synchronized with the "Employees" column by an OID pointer in the "Synchronize With" column 455 to the "Employees" column, represented by row 457.
  • the "Employees" column is synchronized with the "Employed By" column 415 by a pointer in the "Synchronize With" column 455 to the "Employed By" column 415, represented by row 416.
  • the "Employee" column of the previous employer is updated to eliminate the pointer to the ex-employee and, correspondingly, the addition of the employee in the "Employed By" field of the new employer. Synchronization may need to occur whenever a column is changed, whether by addition or subtraction of a reference to another column, or when entire records are added or eliminated from the table 409.
  • FIG. 18A is a flow chart for synchronizing records when a user adds or deletes a record.
  • the system makes a backup of the original list of references to other rows, which are simply the OID's of those other rows, so that it can later determine which OID's have been added or removed. Only these changes need to be synchronized.
  • the system generates a new list of references by adding or deleting the specified OID.
  • the system determines whether the relevant column is synchronized with another column. If it is not, then the system branches to block 467 and the update is complete. If the column is synchronized with another column, the system determines whether it is already in a synchronization routine. If this were not done, the routine would get into an endless recursive loop. If the system is already in a synchronization routine, the system branches to block 471 and the update is complete.
  • the system performs actual synchronization.
  • the system finds an OID that has been added or subtracted from the column (C1) of the record (R1) being altered.
  • the system retrieves the record (R2) corresponding to the added or subtracted OID at block 474.
  • the system determines the synchronization column (C2) of the column (C1) at block 475 and locates that field in the record (R2) for the added or subtracted OID. For example, if an employee is fired from a job, and the employer's "Employees" field changed accordingly, the system would look up the value of the "Synchronize With" column 455 for the "Employees" column which is contained in the cell 459 as illustrated in FIG. 17.
  • the system locates the "Employed By" field of the record for the fired employee.
  • the located cell, (R2:C2) is updated by adding or subtracting the OID.
  • the "Employed By" field of the employee would be changed to no longer point to the previous employer by simply removing the employer's OID from that field.
  • the system branches back to block 473 to update any other OID additions or subtractions. If the system has processed all of the OIDs, then the routine exits as illustrated at blocks 477 and 478.
  • FIG. 18B illustrates the results of column synchronization of the "Employed By" field and the "Employees" field. As shown, the pointers in the records of these two columns are consistent with one another.
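The synchronization routine described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the in-memory `table`, the `SYNC_WITH` mapping and the function `update_refs` are hypothetical names, and a single module-level flag stands in for the "already in a synchronization routine" check.

```python
# Hypothetical in-memory model: table[record_oid][column_name] -> set of OIDs.
table = {
    "emp1": {"Employed By": {"co1"}},
    "emp2": {"Employed By": {"co1"}},
    "co1": {"Employees": {"emp1", "emp2"}},
}

# Assumed "Synchronize With" value for each column.
SYNC_WITH = {"Employees": "Employed By", "Employed By": "Employees"}

_in_sync = False  # guard that prevents endless recursion


def update_refs(record, column, add=(), remove=()):
    """Add or remove OIDs in one cell, then mirror the change in the synced column."""
    global _in_sync
    cell = table[record].setdefault(column, set())
    original = set(cell)              # backup of the original reference list
    cell |= set(add)                  # generate the new list of references
    cell -= set(remove)
    sync_col = SYNC_WITH.get(column)
    if sync_col is None or _in_sync:
        return                        # not synchronized, or already synchronizing
    _in_sync = True
    try:
        for oid in cell - original:   # OIDs that were added
            update_refs(oid, sync_col, add=[record])
        for oid in original - cell:   # OIDs that were removed
            update_refs(oid, sync_col, remove=[record])
    finally:
        _in_sync = False


# The employer record drops a fired employee; the employee record follows.
update_refs("co1", "Employees", remove=["emp1"])
```

Only the difference between the backup and the new list is propagated, which mirrors the statement that only the changes need to be synchronized.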
  • a column may contain within it a reference to another column in the same record.
  • a "name" column may contain a reference to both a "first name" and a "last name" column. The value of the "name" column can then be reconstructed from the values of the other two columns.
  • FIGS. 19A and 19B illustrate two possible implementations for reconstructing a value from one or more columns within the same record.
  • FIG. 19A illustrates a table 480 that includes a "First Name" column 482, a "Last Name" column 484 and a "Name" column 486.
  • a record 226 for "John Smith" has the first name "John" in the "First Name" column 482 and the last name "Smith" in the column 484.
  • FIG. 19B employs a variant of the referencing scheme illustrated in FIG. 19A.
  • FIG. 19B illustrates a table 481 that includes a "First Name" column 483, a "Last Name" column 485 and a "Name" column 487.
  • a record 489 for "John Smith" has the first name "John" in the "First Name" column 483 and the last name "Smith" in the column 485.
  • the name field 487A returns the text "The name is John Smith" by referencing the fields by defined variables 'fn' and 'ln' as shown in column 487.
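The two referencing schemes can be illustrated with a short Python sketch. The dictionary layout, the function names and the use of `string.Template` are assumptions made for illustration; the variable names `fn` and `ln` stand for the defined variables bound to the first-name and last-name columns.

```python
import string

record = {"First Name": "John", "Last Name": "Smith"}

# First scheme: the "Name" column simply lists the columns it refers to.
name_refs = ["First Name", "Last Name"]


def name_from_refs(rec):
    """Rebuild the value by concatenating the referenced columns."""
    return " ".join(rec[col] for col in name_refs)


# Second scheme: variables are bound to columns and embedded in text.
bindings = {"fn": "First Name", "ln": "Last Name"}
template = string.Template("The name is $fn $ln")


def name_from_template(rec):
    """Substitute each variable with the value of the column it is bound to."""
    return template.substitute({var: rec[col] for var, col in bindings.items()})
```

In both cases the "Name" cell stores no text of its own for the person, so a change to either source column is reflected the next time the value is reconstructed.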
  • the table 409 illustrated in FIG. 15 includes a "RecordContents" column that indicates those columns within which a particular record has stored values.
  • FIG. 20 illustrates the table 409 with a "RecordContents" column 427 that includes pointers to the columns containing values for a particular record.
  • the "RecordContents" column 427 for row 410 has pointers to the column 423 and a column 425 but does not have a pointer to the column 415 because the row 410 does not have a value for the column 415.
  • the "RecordContents" column 427 has a defining row 429.
  • the cell containing the record contents can be versioned, providing the ability to do record versioning.
  • the table 409 may include a data type defined as a folder.
  • FIG. 21 illustrates the structure of a folder which includes a "Parent Folder” column 490 and a "Folder Children” column 492.
  • a folder has a corresponding record.
  • a folder entitled “Contacts” has a corresponding row 494 as illustrated in FIG. 10.
  • the "Folder Children" column 492 of the "Contacts" folder includes pointers to those records that belong to the folder.
  • those records that belong to a folder include a pointer to that folder in the "Parent Folder" column 490.
  • a particular record can belong to any number of folders.
  • the folder structure illustrated in FIG. 21 facilitates searching. As previously described, a column may be searched according to a folder specified in the column definition. If a folder is searched, the system accesses the record corresponding to the folder and then searches all of the records pointed to by that folder. Further, the synchronization feature described above may be used to generate the list of items in a folder. For example, in FIG. 21, the "Parent Folder" and "Folder Children" columns may be synchronized.
  • Certain folders are defined such that their contents are automatically determined. This is an “index” folder or “query” folder, depending on the type of definition.
  • An index folder contains all records that contain a valid value in a certain field.
  • a "People" folder may be defined such that it automatically contains all records of type "Person".
  • similarly, a folder may be defined that automatically contains a) all records of type "Person" that also b) contain the word "California".
  • Such automatic folders facilitate the use of the system by automatically filing and organizing records of interest, without requiring an explicit action from the user.
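One way to picture automatically determined folders is as stored predicates that are evaluated against the records. The sketch below is an assumption-laden illustration: the record layout and the helpers `index_folder`, `query_folder` and `folder_children` are hypothetical and do not come from the specification.

```python
records = [
    {"oid": 1, "type": "Person", "notes": "Lives in California"},
    {"oid": 2, "type": "Person", "notes": "Lives in Oregon"},
    {"oid": 3, "type": "Company", "notes": "Based in California"},
]


def index_folder(field):
    """Folder definition: every record holding a non-empty value in `field`."""
    return lambda rec: bool(rec.get(field))


def query_folder(*predicates):
    """Folder definition: every record that satisfies all given predicates."""
    return lambda rec: all(p(rec) for p in predicates)


def folder_children(folder, recs):
    """Evaluate a folder definition and return the OIDs it contains."""
    return [r["oid"] for r in recs if folder(r)]


people = query_folder(lambda r: r.get("type") == "Person")
california_people = query_folder(
    lambda r: r.get("type") == "Person",
    lambda r: "california" in r.get("notes", "").lower(),
)
```

Because membership is computed from the definition, a newly added record is filed without any explicit action from the user.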
  • the present invention includes an indexing system that provides for rapid searching of text included in any cell in table 409.
  • Each key phrase is extracted from a cell and stored in a list format according to a predefined hierarchy. For example, the list may be alphabetized, providing for very rapid searching of a particular name.
  • FIG. 22 illustrates the extraction of text from the table 409 to a list 495.
  • the list 495 is shown separately from the table 409 for purposes of illustration but, in a currently preferred embodiment of the present invention, list 495 comprises part of table 409.
  • List 495 stores cell identification numbers such as cell identification number 495A for each word in the list where a cell identification number may be of the format <record OID, column OID>.
  • the word "Ventura” occurs in cells 496, 497 and 498 that correspond to different rows and different columns.
  • the word “Ventura” in the list 495 contains a pointer, or cell identification number, to cells 496, 497 and 498.
  • each cell stores the references to the key phrases within it using "anchors".
  • an anchor contains a location (such as the start and stop offset within the text), and an identification number. Both the text and the anchor are stored in the cell 496.
  • Other kinds of domains also support anchors. For example, graphical images support the notion of "hot spots" where the anchor position is a point on the image.
  • each key phrase is stored as a record in the database and the OID of the record equals the identification number described with reference to FIG. 23.
  • One column stores the name of the key phrase and another stores the list of cell identification numbers that include that phrase.
  • Key phrases may have comments of their own, which may also be indexed.
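The word list and the per-cell anchors can be sketched as two small dictionaries. This is a simplified illustration, assuming single words as key phrases and made-up OIDs such as `"r1"` and `"c_city"`; in the described system each key phrase is itself a record with its own OID.

```python
import re
from collections import defaultdict

# Hypothetical cells keyed by a cell identification number (record OID, column OID).
cells = {
    ("r1", "c_city"): "Ventura",
    ("r2", "c_notes"): "Met in Ventura last year",
    ("r3", "c_address"): "12 Main St, Ventura",
}

index = defaultdict(list)    # key phrase -> cell identification numbers
anchors = defaultdict(list)  # cell id -> (start offset, stop offset, key phrase)

for cell_id, text in cells.items():
    for m in re.finditer(r"[A-Za-z]+", text):
        word = m.group().lower()
        if cell_id not in index[word]:
            index[word].append(cell_id)
        anchors[cell_id].append((m.start(), m.end(), word))

# An alphabetized list allows rapid lookup of a particular word.
sorted_words = sorted(index)
```

The index answers "which cells contain this word", while the anchors answer "which key phrases occur in this cell, and where", giving the pointers in both directions.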
  • the sorted list 495 as illustrated in FIG. 22 is stored as a Folder, as illustrated in FIG. 24.
  • a cell identification field 491 maintains the cells that include the term corresponding to that record.
  • the "Parent Folder" column 490 for each of the terms on the list 495 indicates that the parent folder is an index with a title “Natural.”
  • the "Natural” folder has a row 499 that has pointers in the "Folder Children” column 492 to all of the terms in the list 495.
  • the "Natural" folder corresponds to an index sorted by a specific type of algorithm.
  • Computer programs generally sort using a standard collating sequence such as ASCII.
  • the present invention provides an improvement over this type of sorting and the improved sorting technique corresponds to the "Natural" folder. Records in the "Natural" folder are sorted according to the following rules:
  • a key phrase may occur at more than one point in the list.
  • 1a. Key phrases may be permuted and stored under each permutation. For example: "John Smith" can be stored under "John" and also under "Smith". Noise words such as "a" and "the" are ignored in the permutation.
  • 1b. Key phrases which are numeric or date oriented may be stored under each possible location.
  • each modified key phrase is used to determine the position of a reference to the main key phrase record, and an entry is made in the folder accordingly. For example, '1ST JOHN SMITH THE' is stored between '1' and "2", while "FIRST
  • FIG. 33a illustrates the results of a prior art sorting algorithm while FIG. 33b illustrates the results of a sorting algorithm according to the present invention.
  • Extracting the key phrases: to generate a sorted list, the system must first extract the key phrases or words from the applicable cells. Various combinations of key phrase extraction can be used.
  • the "key phrase" anchors are automatically updated whenever the text in the cell changes. This is advantageous as it removes the necessity for manually linking cells when data changes.
  • each field can define how it is text-indexed. For example, the "Last name" field may be analyzed so that its entire contents are considered important, and indexed, while the "Phone" field may be analyzed so that its entire contents are considered unimportant, and never indexed.
  • the combination of structured information and text allows various combinations of key phrase extraction to be used differently for each individual cell in a record, allowing the user to select the most suitable combination.
  • certain fields can be defined such that links based on key phrases are automatically created for data in those fields, without user intervention.
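A per-field extraction policy of this kind might look like the following sketch. The policy names (`"whole"`, `"never"`, `"words"`), the `FIELD_POLICY` table and the noise-word set are assumptions chosen for illustration only.

```python
NOISE_WORDS = {"a", "the"}

# Hypothetical per-field indexing policy.
FIELD_POLICY = {
    "Last name": "whole",  # entire contents form one key phrase
    "Phone": "never",      # contents are never indexed
    "Notes": "words",      # individual words, minus noise words
}


def extract_key_phrases(field, text):
    """Return the key phrases to index for one cell, according to its field."""
    policy = FIELD_POLICY.get(field, "words")
    if policy == "never":
        return []
    if policy == "whole":
        return [text]
    return [w for w in text.split() if w.lower() not in NOISE_WORDS]
```

Running the extractor again whenever a cell's text changes is one simple way to keep the key phrase links current without manual linking.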
  • the date indexing scheme is very similar to the text indexing scheme as previously described. Important dates are extracted from the text and added to an "Important Date" list. Each important date is represented by an "Important Date" record. The "Important Date" records are stored in an "Important Dates" folder, which is sorted by date.
  • the important dates are extracted from the text.
  • the system may search for numeric dates, such as "4/5/94", or date-oriented text, such as "Tomorrow", "next Tuesday" or
  • FIG. 34 illustrates the correspondence between cells of the table of FIG. 15 and a sorted date index.
  • Important Date records are assigned special predetermined OIDs since they always have the same identity in any system. Assigning predetermined OIDs to dates allows Important Dates to be shared across systems.
  • the predetermined OID is generated by using a special session identification number that signifies that the OID is an Important Date. In this case, the timestamp represents the value of the Important Date itself, not the time that it was created.
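The idea of a predetermined OID for a date can be sketched as follows. The two-part `OID` tuple, the reserved value `IMPORTANT_DATE_SESSION` and the use of the date's ordinal as the timestamp are all assumptions for illustration; the specification states only that a special session identification number is used and that the timestamp carries the date's value.

```python
from datetime import date
from typing import NamedTuple

IMPORTANT_DATE_SESSION = 0xFFFF  # hypothetical reserved session number


class OID(NamedTuple):
    session: int    # identifies the creating session, or the reserved value
    timestamp: int  # for Important Dates: the date's own value


def important_date_oid(d: date) -> OID:
    """Derive the OID from the date itself, so every system derives the same one."""
    return OID(IMPORTANT_DATE_SESSION, d.toordinal())


a = important_date_oid(date(1994, 4, 5))
b = important_date_oid(date(1994, 4, 5))  # computed independently, same identity

# Sorting the OIDs sorts the "Important Dates" folder by date.
folder = sorted([important_date_oid(date(1995, 1, 1)), a])
```

Because the identifier is a pure function of the date, two systems that never exchanged data still agree on which record represents a given day.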
  • a sorted key word list is generated from the text in cells and the list is stored in a folder whose records point to the text cells.
  • the association between the list of records with text and the list of key phrases is two-way since the cells that include text point to the key words.
  • FIG. 25 illustrates this two way correspondence. Each record can point to multiple key phrases, and each key phrase can point to multiple records.
  • FIG. 26 is a graphical representation of the two way association between records and the key word list.
  • Each record such as record 501 in a plurality of records 500 may point to one or more important word entries such as word entry 510.
  • each important word entry may point to one or more records.
  • a single level search involves starting at one node such as 502 (on either side of the graph) and following the links such as link 504 to the node on the other side such as node 512. For example, a user may wish to find the records including the word "Shasta.” First, the important word index would be accessed to find the word "Shasta" and the records pointed to by this word would then be retrieved.
  • FIG. 27A illustrates this concept.
  • the term "Shasta" may correspond to a dog with extraordinary intelligence such that in one record, "Shasta" is described as a dog and in another record, "Shasta" is described as a genius.
  • the system locates cell 508, "Shasta" in the "Important Words” folder 503 which points to records such as records 505 and 507 including the word “Shasta.”
  • records 505 and 507 pointed to contain pointers to the "Important Words” list for each indexed word in the record. Since "Shasta” appears with “dog” and “genius” in the records, these words are retrieved by the system.
  • FIG. 27B illustrates an additional level of searching.
  • the word "genius" may occur in records referring to Dirac, and the word "dog" may be associated with "Checkers," such that the multilevel search illustrated in FIG. 27B results in a retrieval of "Dirac" and "Checkers" when provided with the word "Shasta."
  • a relevance ranking can be created based on weights associated with each link and type of key word, and the records can be displayed in order of descending relevance.
  • the relevance is based on the distance from all nodes. In this way, only nodes which are near all the initial nodes will have a high relevance. Many other relevance rankings apart from distance may be used.
  • filters can be used to constrain the links that are followed.
  • the search may be filtered such that only the type "Person" is listed such that, in the above example, Shasta will be associated with Dirac but not Checkers.
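The single-level and multilevel searches can be modeled as a breadth-first walk over the two-way links between records and important words. The sketch below is illustrative only: the record names, the `record_type` table and the choice of link distance as the sole relevance measure are assumptions, and the type filter is applied to the listed results rather than to the links followed.

```python
from collections import deque

# Two-way links: each record points to its important words ...
record_words = {
    "rec_shasta_dog": ["shasta", "dog"],
    "rec_shasta_genius": ["shasta", "genius"],
    "rec_dirac": ["dirac", "genius"],
    "rec_checkers": ["checkers", "dog"],
}
record_type = {
    "rec_shasta_dog": "Dog",
    "rec_shasta_genius": "Dog",
    "rec_dirac": "Person",
    "rec_checkers": "Dog",
}
# ... and each important word points back to its records.
word_records = {}
for rec, words in record_words.items():
    for w in words:
        word_records.setdefault(w, []).append(rec)


def search(start_word, max_links, type_filter=None):
    """Follow up to max_links links from a word; list records nearest first."""
    start = ("word", start_word)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        kind, name = node
        if dist[node] == max_links:
            continue
        if kind == "word":
            neighbours = [("record", r) for r in word_records.get(name, [])]
        else:
            neighbours = [("word", w) for w in record_words[name]]
        for n in neighbours:
            if n not in dist:
                dist[n] = dist[node] + 1
                queue.append(n)
    hits = [
        (d, name)
        for (kind, name), d in dist.items()
        if kind == "record"
        and (type_filter is None or record_type[name] == type_filter)
    ]
    return [name for d, name in sorted(hits)]
```

With one link the search returns only the records that contain "Shasta"; with three links it also reaches the Dirac and Checkers records through "genius" and "dog", and the "Person" filter keeps Dirac alone.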
  • a currently preferred embodiment of the present invention includes a knowledge base and thesaurus to further improve searching capabilities.
  • Each important word record (term) included within the thesaurus contains a pointer to a "concept” record.
  • Each concept record contains pointers to other concept records, and to the terms that are included within the bounds of that concept.
  • Fig. 28 illustrates the structure of the thesaurus.
  • Table 409 includes a "Parent Concept” column 522, a "Concept Name” column 524, a “Synonyms” column 526, a "More Specific Terms” column 528, a “More General Terms” column 530 and a "See Also” column 532.
  • a concept record 520 defines the concept "IBM” and the Synonyms column 526 points to records that are synonymous with IBM, a record 521 with a label field with the value "IBM” and a record 523 with a label field with the value "International Business Machines.”
  • the records 521 and 523 have pointers in the "parent concept” field that point to the parent concept record 520.
  • the thesaurus structure illustrated in FIG. 28 provides for greater flexibility than exact synonyms.
  • the "More Specific Terms” column 528 of the concept record 520 associated with "IBM” points to a concept record 525 associated with the IBM PC with an assigned weight of 100%, where the weight percentage reflects the similarity between the initial term “IBM” and the related term “IBM PC.”
  • the "More General Terms” column 530 of the concept record 520 associated with “IBM” points to a concept record 527 associated with Computer Companies with an assigned weight of 60%.
  • the "See also" column 532 points to a record 529 associated with the concept "Microsoft" with a weight of 70%, where the weight percentage reflects the similarity between the initial term "IBM" and the related term "Microsoft."
  • the Thesaurus illustrated in FIG. 28 enhances the searching mechanisms previously described with reference to FIGS. 25-27B.
  • the system first locates the record associated with a key word and locates the parent concept record pointed to by the key word record. The system may then follow some or all of the pointers in the columns 526, 528, 530 and 532 and return the OIDs stored in the "Concept Name" column 524.
  • any other columns may be used to extend the knowledge and information stored therein.
  • the system can store any kind of relationship, including relationships other than thesaural relationships, between key phrases, concepts and other records.
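A thesaurus lookup along these lines can be sketched with a dictionary of concept records. The field names, the `expand` function and the weight of 1.0 given to synonyms are assumptions for illustration; the weights for the related concepts are the ones given in the IBM example.

```python
# Hypothetical concept records keyed by concept name.
concepts = {
    "IBM": {
        "synonyms": ["IBM", "International Business Machines"],
        "more_specific": [("IBM PC", 1.00)],
        "more_general": [("Computer Companies", 0.60)],
        "see_also": [("Microsoft", 0.70)],
    },
}

# Each term record points to its parent concept.
term_to_concept = {"IBM": "IBM", "International Business Machines": "IBM"}


def expand(term, min_weight=0.0):
    """Return (related term, weight) pairs reachable from a term's parent concept."""
    concept = concepts[term_to_concept[term]]
    related = {s: 1.0 for s in concept["synonyms"]}  # assumed weight for synonyms
    for column in ("more_specific", "more_general", "see_also"):
        for name, weight in concept[column]:
            if weight >= min_weight:
                related[name] = weight
    # Highest weight first; ties broken alphabetically.
    return sorted(related.items(), key=lambda kv: (-kv[1], kv[0]))
```

A search for "International Business Machines" can then be widened to "IBM", "IBM PC" and, depending on the weight threshold, "Microsoft" or "Computer Companies".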
  • the database of the present invention has been described without reference to its interface with applications that may use the invention as their primary storage and retrieval system.
  • the present database includes an interface to support applications programs.
  • Components in the application support system include external document support, hypertext, document management and workflow, calendaring and scheduling, security and other features.
  • the present invention includes various user interface components that have been developed to provide full access to the structure of the database of the present invention. In particular, a new kind of structured word processor will be presented. The Specification will describe each component of the application support system separately.
  • the present invention supports indexing of external documents.
  • the table 409 stores the filenames of documents, such as word processor documents, where the contents of the files are not directly stored in the database.
  • the document names may be stored in a column with a specialized "External Document" domain.
  • the external documents may reside in the mass memory 32 or on a multi-source that interfaces with the system through device control 36.
  • an external document is converted into a plain text format.
  • Key phrases are then extracted as previously described.
  • fields in the text can be determined and mapped to fields within the database.
  • a "Memo" document may contain the text: "To: John Smith. From: Mary Doe".
  • This text can be mapped to the fields called "to" and "from", and the values of these fields set accordingly.
  • the analysis of the text in this way can be changed for different types of external documents such as memos, legal documents, spread sheets, computer source code and any other type of document.
  • a start and stop point within the text is determined.
  • a list of anchors of the format previously described, <start, stop, key phrase>, is generated by the parser and stored within the table 409 under the external document domain.
  • the stored anchors are overlaid on top of the document such that it appears that the external document has been marked with hypertext.
  • the corresponding anchor is determined from the various start and stop coordinates.
  • the OID of the key phrase corresponding to the anchor is stored within the anchor, and can be used for the purposes of retrieving the key phrase record or initiating a query as previously described.
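The field mapping and anchor generation for an external document can be sketched as below. The regular expression, the `KEY_PHRASES` table and its OIDs, and the helper names are assumptions for illustration; a real parser would differ for each document type.

```python
import re

memo = "To: John Smith. From: Mary Doe"

# Hypothetical key phrase records: phrase -> OID of the key phrase record.
KEY_PHRASES = {"John Smith": 101, "Mary Doe": 102}


def map_fields(text):
    """Map "To:" and "From:" markers in a memo to database fields."""
    fields = {}
    for m in re.finditer(r"(To|From):\s*([^.]+)", text):
        fields[m.group(1).lower()] = m.group(2).strip()
    return fields


def find_anchors(text):
    """Build <start, stop, key phrase OID> anchors without altering the document."""
    found = []
    for phrase, oid in KEY_PHRASES.items():
        for m in re.finditer(re.escape(phrase), text):
            found.append((m.start(), m.end(), oid))
    return sorted(found)


def anchor_at(anchor_list, offset):
    """Find the key phrase OID under a character offset, such as a click position."""
    for start, stop, oid in anchor_list:
        if start <= offset < stop:
            return oid
    return None
```

The anchors live in the database rather than in the file, so the external document stays untouched while still appearing to be marked with hypertext.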
  • the present invention supports conventional Hypertext.
  • Prior art Hypertext systems typically associate a region of text with a pointer to another record, as illustrated in FIG. 29. This creates a "hard-coded" link between the source and the target. When a user clicks on the source region, the target record is loaded and displayed. If the target record is absent, the hypertext jump will fail, possibly with serious consequences.
  • each hypertext region such as region 540 is associated with a key phrase, not a normal record.
  • all the records 542 associated with the key phrase are retrieved and ranked using any of the associative search techniques previously described.
  • the application can then display on the display screen 37 either the highest ranked item such as item 543, or present all the retrieved items and allow the user to pick the one to access.
  • the user may want to access a single "default" item such as item 543.
  • This item can be determined automatically, by picking the item at the top of the dynamically generated list, or manually, by letting the user pick the item explicitly and then preserving this choice in the anchor itself.
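Resolving a hypertext region through its key phrase, rather than through a fixed target, can be sketched as follows. The anchor dictionary, the `ranked_search` callback and the sample data are hypothetical; any of the associative search techniques could supply the ranking.

```python
def resolve(anchor, ranked_search):
    """Pick the record to display for a hypertext anchor.

    anchor: dict with a "key_phrase" and an optional user-chosen "default" OID.
    ranked_search: callable returning record OIDs for a phrase, best first.
    """
    candidates = ranked_search(anchor["key_phrase"])
    if not candidates:
        return None              # nothing to show, but no broken hard link either
    default = anchor.get("default")
    if default in candidates:
        return default           # manual choice preserved in the anchor
    return candidates[0]         # automatic choice: top of the ranked list


# Hypothetical ranked results for one key phrase.
RESULTS = {"Shasta": ["rec_a", "rec_b", "rec_c"]}


def lookup(phrase):
    return RESULTS.get(phrase, [])
```

If the stored default has since been deleted, the jump falls back to the current top-ranked record instead of failing.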
  • the database of the present invention includes the ability to annotate and comment on any record, either by adding "annotation" fields to an existing record, or by creating a new "annotation" record that points to the original record.
  • the annotation mechanism of the present invention is fully integrated into the database, and is available for indexing, to have hypertext, to be placed in folders, and so on.
  • the database of the present invention includes a novel Structured Word Processor that may be used in conjunction with table 409.
  • FIG. 31A illustrates three character boxes 551, 553 and 555 concatenated to form a word box 550.
  • FIG. 31B illustrates four word boxes 552, 554, 556 and the word box 550 combined to form a horizontal line box 557.
  • Horizontal boxes are used for words and other text tokens that are spaced horizontally inside another box, such as a line (or column width).
  • FIG. 31C illustrates the combination of the horizontal line box 557 with another horizontal line box 559 to form a vertical box 558. Vertical boxes are used for paragraphs and other objects that are spaced vertically inside other boxes, such as page height.
  • Boxes may be attached to other boxes with "glue.”
  • the glue can stretch or shrink, as needed. For example, in a justified sentence, the white space between words is stretched to force the words to line up at the right edge of the column.
  • Glue can be used for between-character (horizontal) spacing, between-word (horizontal) spacing including “tab” glue, that "sticks” to tab markings.
  • Glue may also be used for between-line (vertical) spacing and between-paragraph (vertical) spacing.
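Stretching and shrinking glue to justify a line can be sketched as below. The `Box` and `Glue` classes and the proportional distribution rule are assumptions for illustration; the specification says only that glue can stretch or shrink as needed.

```python
from dataclasses import dataclass


@dataclass
class Box:
    width: float


@dataclass
class Glue:
    natural: float  # preferred width
    stretch: float  # how readily the glue grows
    shrink: float   # how far the glue may shrink


def justify(items, line_width):
    """Return the width assigned to each glue so the line fills line_width."""
    boxes = sum(i.width for i in items if isinstance(i, Box))
    glues = [i for i in items if isinstance(i, Glue)]
    slack = line_width - (boxes + sum(g.natural for g in glues))
    if slack >= 0:
        total = sum(g.stretch for g in glues)
        return [g.natural + (slack * g.stretch / total if total else 0.0) for g in glues]
    total = sum(g.shrink for g in glues)
    ratio = max(slack / total, -1.0) if total else 0.0  # never shrink past the limit
    return [g.natural + ratio * g.shrink for g in glues]


line = [Box(30), Glue(5, 2, 1), Box(20), Glue(5, 2, 1), Box(25)]
```

For the sample line, a column width of 91 widens each space from 5 to 8 units, while a width of 84 narrows each space to 4.5 units.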
  • each word and field definition is converted into boxes.
  • the system organizes these boxes into a tree structure of line boxes and paragraph boxes, as illustrated in Fig. 32.
  • the record structure hierarchy 560 represents the record structure of the table 409 where a record 562 corresponds to a row in the table 409, and record 562 includes a plurality of attributes, including attribute 564, that correspond to the columns of the table 409.
  • the attributes may include a variety of items.
  • the attribute 564 includes text, represented by block 566, field references represented by block 568 and other items as shown.
  • the layout hierarchy 570 comprises a document 572 which in turn comprises a plurality of pages such as page 574.
  • Page 574 comprises a plurality of paragraphs including paragraphs 576 and 578 and paragraph 576 comprises a plurality of lines, including lines 577 and 579.
  • Paragraph 578 includes line 579.
  • the word processor of the present invention allows the document 572 to be inserted into the record 562 by providing a plurality of boxes, including boxes 565, 567 and 569, common to both the record structure hierarchy 560 and the layout hierarchy 570.
  • the box 565 corresponds to part of the line 577 and comprises part of text 566 of attribute 564.
  • the box 567 corresponds to part of the line 579 and may comprise a field reference as indicated by block 568.
  • the shared box structure as illustrated in FIG. 32 allows any type of word processing document to interface with any record in the table 409.
  • each box is kept as a bitmap, and its height and width are known, so the system displays the tree structure 571 by displaying all of the bitmaps corresponding to the boxes in the tree. If the tree is changed, for example, by adding a new word, only the new word box and a relatively small number of adjacent boxes need be recalculated. Similarly, line breaks or restructuring of a paragraph does not alter most of the word boxes, which may be reused, and only the lineboxes need be recalculated.
  • a user may click a cursor on a part of the text. The system locates the word box or glue that is being edited by recursively descending through the tree structure 571.
  • the word processor supports multiple fonts and special effects such as subscripts, dropcaps and other features including graphic objects.
  • a word in a different font than a base font is in a different box and may have a different height from other boxes on a line.
  • the height of a linebox is the height of the largest wordbox within it. Effects within a word can be handled by breaking a word into subboxes with no glue between them. Again, the height of a wordbox is the height of the largest box within it.
  • Graphic objects such as bitmaps, may be treated and formatted as a fixed width box.
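The size rules and the recursive descent used to find the box under the cursor can be sketched with a small tree. The `Node` class and its fields are hypothetical, and glue is left out for brevity, so boxes are packed edge to edge.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    kind: str  # "word", "line" or "paragraph"
    width: float = 0.0
    height: float = 0.0
    children: List["Node"] = field(default_factory=list)


def layout(node: Node) -> Node:
    """Compute sizes bottom-up: lines pack horizontally, paragraphs vertically."""
    if not node.children:
        return node
    for child in node.children:
        layout(child)
    if node.kind == "paragraph":
        node.width = max(c.width for c in node.children)
        node.height = sum(c.height for c in node.children)
    else:
        node.width = sum(c.width for c in node.children)
        node.height = max(c.height for c in node.children)  # tallest box wins
    return node


def hit(node: Node, x: float, y: float) -> Optional[Node]:
    """Descend recursively to the leaf box containing the point (x, y)."""
    if not (0 <= x < node.width and 0 <= y < node.height):
        return None
    if not node.children:
        return node
    offset = 0.0
    for child in node.children:
        if node.kind == "paragraph":
            found = hit(child, x, y - offset)
            offset += child.height
        else:
            found = hit(child, x - offset, y)
            offset += child.width
        if found is not None:
            return found
    return None


w1, w2, w3 = Node("word", 30, 10), Node("word", 20, 14), Node("word", 40, 10)
line1, line2 = Node("line", children=[w1, w2]), Node("line", children=[w3])
para = layout(Node("paragraph", children=[line1, line2]))
```

Adding a word changes only its own box and the sizes of its ancestors, which is why most word boxes can be reused when a paragraph is reflowed.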
  • the word processor of the present invention may be used to edit records in the table 409.
  • the text associated with each field in a record can be considered a "paragraph" for the purposes of inter-field spacing, text flow within a field, and other formatting parameters. Storing all the fields in the same way during text-editing allows the movement of text and "flow" to appear natural.
  • the text being edited is divided into fields, with each field corresponding to a column in the underlying database.
  • the positions and sizes of the attributes are not fixed but are dynamic and all the features of a word-processor such as fonts and embedded graphics are available to edit the record fields.
  • the word processor of the present invention allows existing fields to be added by typing the prefix of a field name and pressing a button. The system then completes the rest of the field name automatically.
  • the word processor of the present invention supports other database features. For example, new fields can be created by a user by using a popup dialog box. Similarly, references to other records or important words can be added by a dialog box. With particular regard to the table 409 of the present invention, OID references may support fields within other fields and a particular field within other fields supports the use of "templates," where a template is a list of field references embedded in text.
  • templates allow a user to build dynamic forms quickly and easily without having to use complicated form drawing tools.
  • Any set of fields or layout can be saved and reused as a named template.
  • Such templates preserve all the formatting and layout of the fields, but contain no actual field data themselves. Instead they contain empty field references (as described above) which act as placeholders for data in the word-processor.
  • the user interface for the word processor of the present invention allows a user to switch between two modes of data entry.
  • the word-processor of the present invention is used for flexible entry into one record at a time, while a columnar view is used for entering data in columns. The user can switch back and forth between these two views with no loss of data and switching from the word processor to the columnar view will cause the fields that were entered in the single item to become the columns to be displayed in the columnar view.
  • a particular use of the present invention is to support the creation, automatic indexing, organization and retrieval of "structured" email.
  • Structured email is a variant of traditional email that stores fields and other structural information in the body of the message.
  • the present invention supports the flexible data entry needed for email through the use of the word processor and templates, and brings to the management, indexing and retrieval of such email all the advantages described herein.
  • the password should not be made of common words, because an aggressor can use a brute force approach and a dictionary to guess the password;
  • a password should never be written down or embedded into a login script and should always be interactive.
  • a user's identity is determined through an extensive question and answer session.
  • the responses to certain personal questions very quickly identify the user with high accuracy. Even an accurate mimic will eventually fail to answer correctly if the question and answer session is prolonged.
  • sample questions might be: "What is your favorite breakfast cereal?"; "Where were you in April 1990?"; "What color is your toothbrush?".
  • These questions are wide ranging and hard to mimic.
  • the correct responses are natural English sentences, with an extremely large solution space, so that a brute force approach is unlikely to be successful.
  • the user creates the list of questions and corresponding answers, which are then stored. Because the user has complete control over the questions, the user may find the process of creating the questions and answers enjoyable, and as a result, change the questions and answer list more frequently, further enhancing system security.
  • a user creates a list of 50-100 questions and answers that are encrypted and stored.
  • the questions can be entirely new, or can be based on a large database of interesting questions.
  • the system randomly selects one of the questions related to that user and presents the question to the user.
  • the user then types in a response, which is matched against the correct answer.
  • the matching can be fuzzy and associative, as described above. If the response matches correctly, access is allowed.
  • more security may be provided by repeatedly asking questions until a certain risk threshold is reached. For example, if the answer to "What color is your toothbrush?" is the single word "Red", then brute force guessing may be effective in this one case. In this scenario, repeatedly asking questions will diminish the probability of brute force success.
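The question-and-answer login with a risk threshold can be sketched as follows. Everything here is an illustrative assumption: the sample answers, the fuzzy match via `difflib`, the crude `guess_probability` estimate and the threshold value. Answers are held in plain text for brevity, whereas the description calls for them to be encrypted.

```python
import difflib
import random

# Hypothetical question and answer list created by the user.
QA = {
    "What color is your toothbrush?": "Red",
    "What is your favorite breakfast cereal?": "Plain oatmeal with honey",
    "Where were you in April 1990?": "Hiking near Mount Shasta",
}


def matches(response, answer, cutoff=0.8):
    """Fuzzy comparison that tolerates case and small typing differences."""
    a, b = response.lower().strip(), answer.lower().strip()
    return difflib.SequenceMatcher(None, a, b).ratio() >= cutoff


def guess_probability(answer):
    """Assumed chance that a blind guess hits this answer (single words are weak)."""
    return 0.1 if len(answer.split()) == 1 else 0.001


def authenticate(ask, risk_threshold=1e-4, rng=random):
    """Ask randomly ordered questions until the residual risk is low enough."""
    risk = 1.0
    questions = list(QA)
    rng.shuffle(questions)
    for question in questions:
        if not matches(ask(question), QA[question]):
            return False
        risk *= guess_probability(QA[question])
        if risk <= risk_threshold:
            return True
    return False  # ran out of questions before reaching the threshold
```

A single correct "Red" leaves the estimated risk at 0.1, so the loop continues with further questions, which is the effect the risk threshold is meant to achieve.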

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
EP99940859A 1998-08-04 1999-08-03 Method and apparatus for a storage architecture having an improved information storage and retrieval system in a shared file environment Withdrawn EP1101176A1 (de)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US09/128,922 US6182121B1 (en) 1995-02-03 1998-08-04 Method and apparatus for a physical storage architecture having an improved information storage and retrieval system for a shared file environment
PCT/US1999/017551 WO2000025235A1 (en) 1996-04-10 1999-08-03 Method and apparatus for a physical storage architecture having an improved information storage and retrieval system for a shared file environment
US128922 2002-04-24

Publications (1)

Publication Number Publication Date
EP1101176A1 true EP1101176A1 (de) 2001-05-23

Family

ID=22437628

Family Applications (1)

Application Number Title Priority Date Filing Date
EP99940859A Withdrawn EP1101176A1 (de) 1998-08-04 1999-08-03 Method and apparatus for a storage architecture having an improved information storage and retrieval system in a shared file environment

Country Status (3)

Country Link
EP (1) EP1101176A1 (de)
JP (1) JP2002528821A (de)
AU (1) AU5463599A (de)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10452631B2 (en) 2017-03-15 2019-10-22 International Business Machines Corporation Managing large scale association sets using optimized bit map representations

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO0025235A1 *

Also Published As

Publication number Publication date
JP2002528821A (ja) 2002-09-03
AU5463599A (en) 2000-05-15

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) EPC to a published international application that has entered the European phase. Free format text: ORIGINAL CODE: 0009012
17P Request for examination filed. Effective date: 20010206
AK Designated contracting states. Kind code of ref document: A1. Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE
AX Request for extension of the European patent. Free format text: AL;LT;LV;MK;RO;SI
STAA Information on the status of an EP patent application or granted EP patent. Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN
18D Application deemed to be withdrawn. Effective date: 20030301