WO2001033419A2 - Access by content based computer system - Google Patents

Access by content based computer system

Info

Publication number
WO2001033419A2
Authority
WO
WIPO (PCT)
Prior art keywords
focuser
words
steps
content
focus
Prior art date
Application number
PCT/IB2000/001697
Other languages
French (fr)
Other versions
WO2001033419A3 (en)
WO2001033419A9 (en)
Inventor
Jean Poncet
Jean François Xavier MIGNON
Patrick Constant
Original Assignee
Jean Poncet
Mignon Jean Francois Xavier
Patrick Constant
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jean Poncet, Mignon Jean Francois Xavier, Patrick Constant filed Critical Jean Poncet
Priority to JP2001535842A priority Critical patent/JP2004500628A/en
Priority to AU11710/01A priority patent/AU1171001A/en
Priority to EP00973169A priority patent/EP1252585A2/en
Publication of WO2001033419A2 publication Critical patent/WO2001033419A2/en
Publication of WO2001033419A9 publication Critical patent/WO2001033419A9/en
Publication of WO2001033419A3 publication Critical patent/WO2001033419A3/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification

Definitions

  • This invention relates to content and context accessible computer systems.
  • the restaurant, the piano bar and the spare ribs are the primary memories. One is not primarily concerned about the date of the event.
  • This invention comprises a content-accessible method and system for operation of a computer.
  • the three main parts of this invention include, first, a method for defining, classifying and indexing content; second, a method for designating all real numbers, including integers, such that they can be arranged easily in a monotonic fashion; and third, a fast, linear method of sorting through the content according to the associated monotonic real numbers, including integers, to access contents.
  • Fully Optimized Content Using Computer System is designed as a linguistic package that is simple enough to be handled by the average user and fast enough to cope with network speeds. It is an indexing and searching system, which combines simplicity, speed, and efficiency and is based on users of the system knowing what information they want to get, as opposed to knowing an address where some information might be stored.
  • Figure 1 shows the functional architecture of the content accessed computer system
  • Figure 2A shows naming of the monotonic numbering sequence
  • Figure 2B shows a finer detail for the naming of the monotonic numbering sequence
  • Figure 3 illustrates part of the linear sorting technique
  • Figure 4 shows the L-shaped repository files.
  • This invention comprises a content-accessible method and system for operation of a computer.
  • the main operating parts of this invention are shown in Figure 1, including, first, a method for defining, classifying and indexing content (Focusers 101); second, a method for designating all real numbers, including integers, such that they can be arranged easily in a monotonic fashion (coding, 102); and third, a fast, linear method of sorting 103 through the content according to the associated monotonic real numbers, including integers.
  • the system is designed to meet the user's twofold needs: (1) ease of use and being free to express the request as the user likes (Query 104), (2) getting pertinent and immediate information 105.
  • the content accessible information functionally resides in a repository 106.
  • Point 1 is mainly a matter of ergonomics.
  • a simple dialogue box and a button asking for some "advanced" search solve this aspect.
  • the possibility of searching by "themes” is added to this.
  • a theme is defined by words, expressions or texts and is used to get pertinent information.
  • Point 2 is achieved by systems using semantics: the meaning of words.
  • Prior systems trying to utilize semantics ended up with a heavy computational overhead.
  • although the processes described here fall into the field of semantics, they are very far from what is currently used in this field, which has not yet proved to be useful.
  • the FOCUS system's ergonomics are simple.
  • A one text field
  • B one radio button "theme”
  • C one button “advanced search”
  • D one button "search by meaning”.
  • the text field allows the use of a convention to "force” groups of words as opposed to a mere sequence of words (such as using quotes to define the group of words).
  • Advanced search deals with Boolean logic applied to search components. Search by meaning is done through a file or a list of files corresponding to a given theme. These files could be a list of words.
  • a "theme” is a list of simple words chosen by the user as that user's expression of a given concept, along with the frequency in the given texts. This list of simple words can then be sorted by pertinence using the frequency deviance from a reference.
  • the "theme” radio button indicates that the word or expression to be searched for is to be considered as a theme.
  • the themes are pre-computed, thus allowing very fast semantic search.
  • Results can be displayed in three forms. The first is using the title of the document when it can be identified. The second is using some sort of automatic abstract (first sentences or first expressions, or first most deviant expressions in the most pertinent paragraphs in the file). The third is displaying the text itself, in which case the viewing program might highlight pertinent paragraphs and words and so allow an easier navigation through the hits.
  • references themselves are to be recorded as contents. They are part of the items of the computer system that carry a meaning. For instance, for each filename, one will record the complete path along with all of its components: drive ID, directory names, directory extensions, file name, file extension and every word in these.
  • Instant retrieval means building a repository with direct access to all its information contents. Having linear characteristics means that from the first step on, namely, the sorting routine that feeds the repository, every process must be linear on any set of data available at one time. Being independent of the amount of information to monitor allows one to avoid the use of multiple repository files as well as the copy of the repository to update it.
  • Real-time processing means handling of file manipulation such as updating, renaming, moving and deleting as they occur without interruption or break in the time line.
  • Intelligent filtering, as part of the FOCUS system, is not simply giving a list of wanted or unwanted terms. Elsewhere, however, it is usually undertaken with techniques carrying a heavy computational burden that could require several minutes to decide upon the fate of a single sentence. The approach here is very different. The semantic configuration only requires the user to give examples of texts that are meant to be representative of the "theme" desired to be defined.
  • the "feeding" of FOCUS can be done in a variety of ways. For instance, receiving data from a network can trigger the analysis of its content. Updating a database record can trigger the analysis as well. Updating a word processor file can also trigger the analysis.
  • The concept of Focusers
  • a Focuser is defined as a set of the following elements: (1) its name, (2) a list of words or expressions that represents the Focuser, (3) its language description (French, English, ...), as an option but not required, (4) several parameters that control its existence in a paragraph, or its list of words or expressions: (a) Number of words or expressions in a paragraph, fewer of which found in a paragraph indicates that the paragraph is not relevant for the Focuser. This number is used for the detection of a Focuser. (b) Number of words or expressions in a paragraph below which the paragraph might be relevant and above which the paragraph is really relevant for the Focuser. (c) Threshold of pertinence: below this number, the word does not belong to the Focuser. This threshold is used to build a Focuser.
  • Building Focusers from a text
  • A specific expressions that are very relevant;
  • B synonyms (which can be automatically proposed by the system);
  • C words or expressions, which should be excluded from the Focuser (forced zero pertinence value);
  • D words or expressions, which discriminate the Focuser, i.e. they are automatically excluded from the Focuser (negative pertinence value);
  • E a word that is accepted, excluding all expressions containing the word.
  • the Focuser is then represented by some of the words that have the biggest value.
  • the threshold can be chosen by the program or the user.
  • a very simple example of a text for a Focuser is the following: "This is about horses, horse, horseback cavalry, @mounted_troupes. horse." Where "@mounted_troupes" is a very relevant expression.
  • the Focuser itself would look like the list of words where non-pertinent words have been removed: Cavalry, horse, horses, horseback, @mounted_troupes
  • a Focuser does not mean anything per se but takes all its meaning when it is compared to a text that has paragraphs. So, a Focuser is said to be recognized when a certain number of words or expressions pertaining to this Focuser are recognized in the same paragraph. This number is a parameter of the Focuser, and its value is usually around 3. Expressions that have been manually entered in the Focuser have a value of 3 times the value of a single word in the Focuser.
  • Automatic routing/filtering using content of a text
  • a "profile" is a set of positive and negative Focusers. Whenever a mail or an HTML text or any text goes through the filter, all defined Focusers are detected (see paragraph "Detection of a Focuser") and compared to the profile: if a negative Focuser has been recognized, the text is rejected; else, if there is at least one positive Focuser, the text is accepted. Any combination of positive and negative Focusers can be used.
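As a concrete illustration, here is a minimal Python sketch of the detection and routing rules just described. The class layout, tokenization and names are assumptions; only the detection count of "around 3" and the 3x weight of manually entered expressions come from the text.

```python
# Minimal sketch of Focuser detection and profile routing (illustrative only).
import re

DETECTION_COUNT = 3      # "its value is usually around 3"
EXPRESSION_WEIGHT = 3    # manual expressions weigh 3x a single word

class Focuser:
    def __init__(self, name, words, expressions=()):
        self.name = name
        self.words = {w.lower() for w in words}
        self.expressions = [e.lower() for e in expressions]

    def recognized_in(self, paragraph: str) -> bool:
        text = paragraph.lower()
        tokens = set(re.findall(r"\w+", text))
        score = len(self.words & tokens)
        score += EXPRESSION_WEIGHT * sum(e in text for e in self.expressions)
        return score >= DETECTION_COUNT

def route(text: str, positive, negative) -> str:
    """Reject on any negative Focuser, accept on any positive one."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    if any(f.recognized_in(p) for f in negative for p in paragraphs):
        return "rejected"
    if any(f.recognized_in(p) for f in positive for p in paragraphs):
        return "accepted"
    return "undecided"

horses = Focuser("horses", ["cavalry", "horse", "horses", "horseback"],
                 expressions=["mounted troupes"])
print(route("The cavalry rode horses and went horseback all day.", [horses], []))
# -> accepted (three Focuser words recognized in one paragraph)
```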
  • the semantic analysis process qualifies every paragraph according to predefined Focusers. Instead of just recording that qualification along with the data one can use it to route (or stop) the incoming information when particular themes are associated to particular destinations. On the other hand, all this can be kept in a central repository, each user being configured to have an implicit Boolean AND with his themes. This procedure would allow some network manager to monitor what is routed (or stopped), and where it is routed (stopped).
  • the algorithm is the following: get all paragraphs containing this word, then extract the concept from the concatenation of those paragraphs. It is very time consuming to do this for all words, so, for example, one can limit this kind of extraction to expressions that contain at least two meaningful words. So, the user can ask for a given expression as such, or for this expression as a concept.
  • the information stored in the automaton is only the address to the information which is, for example, stored in a file.
  • the <separator> and <type of information> must be small (usually one letter) in order to be as compact as possible.
  • semantic networks, thesauri, lexicons, etc. are represented this way provided they can be contained in memory.
  • Basic routines of dictionary manipulation can be implemented in silicon, either on a general processor based controller or on a dedicated RISC chip.
  • Automatic viewpoint of a text using deviance
  • the Focuser is said to be a "viewpoint" on the text.
  • a viewpoint simply gives the most deviant words or expressions that are both in the text and the Focuser.
  • Expressions do not need to be explicitly in the Focuser because the deviance of the expression is the compounded deviance of the words in the expression.
  • a compounded deviance is the sum of the deviances of the single words contained in the expression divided by their number. If no specific viewpoint is given, the viewpoint is simply the Focuser built from the text itself.
  • Recognizing a particular language could be done by comparing all the terms of a sentence to all the dictionaries of the planet coded in every possible character code page. It is easy to see that this method is too heavy. A much faster and roughly as accurate solution is to characterize languages by the statistic distribution of n-uplets.
  • the elements to be sorted are simply the n-uplet starting on the first byte, the one starting on the second byte, and so on. Then one counts the duplicates and compares the result to a pre-built n-uplets database.
  • This database is built by applying the same process to a text large enough to be representative. A sample text on the order of magnitude of 1 Mbyte is considered sufficient to build the database.
  • This database consists of n-uplets and their corresponding frequency for each combination language/code page.
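A rough sketch of this idea, assuming n = 3 and a simple L1 distance between frequency profiles; the real reference profiles would be built once from roughly 1 Mbyte of representative text per language/code-page pair, exactly as the running text is profiled:

```python
# Sketch of language/code-page recognition by n-uplet statistics (n = 3 here).
from collections import Counter

def profile(data: bytes, n: int = 3) -> Counter:
    """Relative frequency of the n-uplet starting at every byte position."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values()) or 1
    return Counter({k: v / total for k, v in counts.items()})

def distance(p: Counter, q: Counter) -> float:
    return sum(abs(p[k] - q[k]) for k in set(p) | set(q))

def guess_language(data: bytes, references: dict) -> str:
    p = profile(data)
    return min(references, key=lambda lang: distance(p, references[lang]))

# Toy references; real ones come from large sample texts per language/code page.
references = {
    "en/latin-1": profile(b"the quick brown fox jumps over the lazy dog " * 40),
    "fr/latin-1": profile(b"portez ce vieux whisky au juge blond qui fume " * 40),
}
print(guess_language(b"the dog jumps over the fox", references))  # en/latin-1
```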
  • a content accessible computer considers every content to have three parameters (1) a reference (2) the content itself (3) a position.
  • a reference is a string of bytes that indicates to the hosting software how to get to the content. This can be a file name, an address in memory, a field in the database record, a file in an archive, etc. References are preceded by a code indicating which software will be asked for the content: the filing system for a file name, an archiving program for an archived reference, a database management system for a record, etc. The rest of the string is an argument that is passed to the host, such as SQL request for a database manager. The coding used is host dependent.
  • a third aspect is position. This is not a physical position in the computer's memory or the computer's storage area. This position refers to something meaningful to the user. In a text file, for example, this position can be the number of the chapter, the number of the section, the number of the paragraph, etc. In a database, this may be the number of the table, the number of a record, the number of a field, the number of a paragraph, the number of a sentence, and so on. In a medical scanner's image, this position could be XYZ coordinates, and the red-green-blue RGB false color contents. Position is primarily recorded to provide proximity management.
  • This technology provides a way to identify the content of a document before opening it, which can be a relatively lengthy operation on large documents accessed via a network.
  • the technology of glimpses uses the idea of an abstract of the document, but creates it automatically from the content of the whole document.
  • since the linguistics of a FOCUS extracts groups of words, it is relatively easy to select the groups that are most characteristic of a document by comparing them to a statistical analysis of a given body of data.
  • a Focuser is a text file giving a list of terms, synonyms and expressions of how the "Focuser owner" would speak of something when speaking to another person on a given subject. For any set of text that is considered pertaining to a given domain, this domain can be described in a Focuser and the result can be used as above to give a more accurate reference for this domain.
  • a dynamic Focuser is a Focuser that is not compiled and created at filtering time but one which is analyzed and created in real time on one or several Focus Repositories.
  • Computing a Focuser is not restricted to the time a document is analyzed. In fact, it can be done any time, even when the user asks a question.
  • the format can be anything. The user may type a series of words and/or expressions, or type real sentences, or better, grab entire sentences and paragraphs from any document, in particular those which have been found through another query on their FOCUS.
  • the first job of the system is to make a query of all the words in this text, then to sort them according to the documents and the positions they are in. This function is greatly helped if the numbers of the paragraphs are stored in the FOCUS' memory; if not, this technique reverts to what was known as "clustering", which, by the way, is linguistic nonsense. Allowing entire paragraphs to be used as a REAL query with an acceptable response time is a breakthrough in NATURAL LANGUAGE computing. Even more, these queries can be intermixed (Boolean) with "ordinary" queries on text, images, music, etc.
  • this operation is fast enough to be carried out interactively.
  • this definition (the Focuser) and its results (the selected paragraphs) can be recorded by the FOCUS.
  • This operation again, involves no access to the original documents and can be appropriate for a network environment (such as a Web portal).
  • a request on the expressions taken in the glimpses of a given document can quickly provide a set of similar documents (textual, images, music, etc.). There are two ways of achieving this.
  • variable length coding needs the length to be part of the code.
  • Computer data are always a string of bits (or bytes).
  • the first (upper) bits in the first byte of the string indicate the length of the string in bytes.
  • all remaining bits are used to indicate the value. This is hence optimal.
  • the space taken is the absolute minimum and there is no limit to the numbers (when all 8 bits of the first byte are set, coding goes on the second byte).
  • the strings are binary monotonic, which allows sorting and order testing without decoding. Routines to code and decode are straightforward.
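One plausible realization of such a coding is sketched below. The exact bit layout is not fixed by the text; a UTF-8-like layout is assumed here, where the count of leading 1-bits in the first byte gives the number of continuation bytes. Encoding each value in its minimal length is what makes plain bytewise comparison match numeric order.

```python
# Sketch of a length-prefixed, binary-monotonic integer coding (assumed layout).

def encode(n: int) -> bytes:
    assert n >= 0
    for extra in range(8):                    # 0..7 continuation bytes here
        value_bits = 7 - extra + 8 * extra    # value bits at this length
        if n < (1 << value_bits):
            prefix = (0xFF << (8 - extra)) & 0xFF   # 'extra' leading 1-bits
            body = n.to_bytes(extra + 1, "big")
            return bytes([prefix | body[0]]) + body[1:]
    raise ValueError("coding continues on the second byte for larger values")

def decode(s: bytes) -> int:
    extra = 0
    while extra < 8 and (s[0] << extra) & 0x80:   # count leading 1-bits
        extra += 1
    first = s[0] & (0xFF >> (extra + 1))          # mask off the length marker
    return int.from_bytes(bytes([first]) + s[1:extra + 1], "big")

for a, b in [(0, 1), (127, 128), (300, 70000)]:
    assert decode(encode(a)) == a and decode(encode(b)) == b
    assert (encode(a) < encode(b)) == (a < b)     # sorts without decoding
```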
  • This coding is independent of the processor's representation of numbers (big- or little-endian). It can be based on other units, such as 16 bit words, when computers agree on their 16 bit binary representation. If the numerical fields in a database comply with the MFPC coding, then all fields can be sorted as binary values without dynamic collation for numerical fields. For numerical information, that is, real numbers including integers,
  • FOCUS does not put any limitation either as to the number of digits in the mantissa or the value of the exponent.
  • the real numbers can be divided according to the sign of the mantissa 201 and of the exponent 202. The overall value depends on these classifications.
  • Figure 2B shows a further classification where, for example, plus infinity 204 and minus infinity 205 are separated out; plus one 206, minus one 207 and zero 208 are also separated out. These are not really required, but it doesn't hurt.
  • in the first row, the mantissa 201 has a plus sign 209 and the exponent 202 also has a plus sign 210; this case is coded as D 211 in this system.
  • in the next case, the mantissa 201 is plus 212, the exponent is minus 213, and the code is C 214.
  • intrinsically, C 214 is less than D 211 (that is, D 211 is greater than C 214).
  • the next case has the mantissa 201 as minus 216, the exponent 202 as plus 217, and is coded as B 218.
  • B 218 is less than C 214, however, it is greater than the code A 219, which has a mantissa 201 of minus 220 and an exponent of minus 221.
  • consider the mantissa and exponent signs. In the first case the mantissa is positive but the exponent is negative: the number is one over something that gets very big, so it shrinks as the exponent's magnitude grows. In the next-to-last row the mantissa is minus and the exponent is positive, so the number is negative and its magnitude gets very big. So, in the F 251 and B 218 areas, the larger the absolute value of the exponent, the smaller the value of the number. Therefore, one complements the exponent to some base: for example, 12 in base 10 would become 99 minus 12, which equals 87. The absolute value of the mantissa is then coded likewise, to some base.
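The following sketch builds a sortable string key from these ideas: one class byte for the mantissa/exponent sign combinations (zero kept apart), a complemented exponent where a larger |exponent| means a smaller number, and complemented mantissa digits for negatives. The letter assignment, fixed-width exponent fields and the ':' terminator are assumptions chosen here so that plain string comparison matches numeric order; the patent's own letter assignment is the one of Figure 2A.

```python
# Sketch of a monotonic key for decimal reals (fields sized for |exp| <= 9999).
from decimal import Decimal

EXP_BASE = 9999   # complement base for the exponent field

def sort_key(x: Decimal) -> str:
    if x == 0:
        return "C"                                  # zero, its own class
    sign, digits, exp = x.normalize().as_tuple()
    e = exp + len(digits)                           # x = +/- 0.digits * 10**e
    mant = "".join(map(str, digits))
    if sign == 0 and e > 0:                         # x >= 1
        return "E" + f"{e:04d}" + mant
    if sign == 0:                                   # 0 < x < 1: e complemented
        return "D" + f"{EXP_BASE + e:04d}" + mant
    comp = "".join(str(9 - int(d)) for d in mant) + ":"  # ':' sorts above '9'
    if e > 0:                                       # x <= -1: e complemented
        return "A" + f"{EXP_BASE - e:04d}" + comp
    return "B" + f"{abs(e):04d}" + comp             # -1 < x < 0

vals = [Decimal(v) for v in
        ("-500", "-2", "-0.55", "-0.5", "-0.005", "0", "0.05", "0.5", "3", "120")]
assert sorted(vals, key=sort_key) == sorted(vals)   # byte order = numeric order
```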
  • One way of multiplexing numbers is to take numbers written in base N and write them in base N+1, so that the highest figure of base N+1 never appears inside a number. If one is encoding in base 2, one has zeros and ones; written in base 3 (N+1 with N equal to 2), a number looks like 1001..., and the figure 2 can then be placed between numbers (1001...2...1001...), serving strictly as a separator. Smaller values of N waste less space, so N equal to 2 or 3 is preferred. It turns out that 2 is better for coding speed and 3 for decoding speed, so the actual number used is a toss-up.
  • Linear Sorting
  • the value will be the pointer to the byte; if the entry already has a value, the end pointer is set to the new byte while the byte pointer of the previous end value is made to point to the new byte.
  • at the end of the scan, the values recorded in the byte pointers are read out in the collation order for that particular rank.
  • Figure 3 shows the situation with an array of starting pointers 301, an array of ending pointers 302 and array of pointers to the next 303.
  • the first 3 ranks or letters XYZ 304 have been sorted and found alike and now the next letter in the XYZ... series, namely, A 305, is being examined.
  • the routine processes the bytes of a given rank as a list.
  • the next rank will only be processed when there is more than one item in a rank list. This can be known using any kind of flagging for end of input string.
  • the next rank does not need to be R plus 1, but can be any rank, in any order desired.
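A sketch of the rank-by-rank sort with the three pointer arrays of Figure 3 (starting pointers 301, ending pointers 302, pointers to the next 303). A 257th bucket plays the role of the end-of-string flag, and a deeper rank is examined only when a rank list holds more than one item; the recursive structure is an illustrative choice.

```python
# Sketch of the linear, rank-by-rank sort using chained bucket lists.

def msd_sort(items: list[bytes], rank: int = 0) -> list[bytes]:
    if len(items) <= 1:
        return items
    start = [None] * 257          # first item of each bucket (256 = exhausted)
    end = [None] * 257            # last item of each bucket, for O(1) append
    nxt = [None] * len(items)     # "pointer to the next" per item
    for i, item in enumerate(items):
        b = item[rank] if rank < len(item) else 256   # end-of-string flag
        if start[b] is None:
            start[b] = end[b] = i
        else:                     # previous end now points to the new item
            nxt[end[b]] = i
            end[b] = i
    out = []
    for b in [256] + list(range(256)):     # exhausted strings collate first
        bucket, i = [], start[b]
        while i is not None:
            bucket.append(items[i])
            i = nxt[i]
        if b == 256 or len(bucket) <= 1:   # deeper rank only when needed
            out.extend(bucket)
        else:
            out.extend(msd_sort(bucket, rank + 1))
    return out

data = [b"XYZA", b"XYZB", b"XY", b"A", b"XYZAB"]
assert msd_sort(data) == sorted(data)
```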
  • Date/duration coding based on the MFPC
  • A date or a duration can be expressed as a number of years, months, days, hours, minutes, seconds and fractions of a second, or any combination of these. Using MFPC on all these values allows one to represent any length in time, no matter how small or big. For dates, all values but years and fractions of a second have limited ranges and can thus be represented as simple binary integers. In a repository, it is advised to record the full date or duration along with the individual components, so that asking for "the month of May" will not end in a "token search".
  • Repository - Logical Structure
  • An optimal solution providing the fastest access to any individual item utilizes "L" shaped strings.
  • An L shaped string is an optional "horizontal string" and a "vertical string" of at least one byte, each byte of the vertical string being associated with a pointer to the next L string. For instance, to store "Albert", see the L-shaped repository files of Figure 4.
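A sketch of the structure, with a Python dict standing in for the vertical byte-to-pointer pairs and a 0x00 end-of-word marker added as an assumption; the real repository stores disk addresses, not object pointers.

```python
# Sketch of an L string: a horizontal prefix run plus vertical branching bytes.

class LString:
    def __init__(self, horizontal: bytes = b""):
        self.horizontal = horizontal   # shared prefix run
        self.vertical = {}             # byte value -> next LString

    def insert(self, word: bytes):
        word += b"\x00"                # end-of-word marker (assumed)
        node = self
        while word:
            h, n = node.horizontal, 0
            while n < len(h) and n < len(word) and h[n] == word[n]:
                n += 1
            if n < len(h):             # split the horizontal run in two
                tail = LString(h[n + 1:])
                tail.vertical = node.vertical
                node.horizontal = h[:n]
                node.vertical = {h[n]: tail}
            word = word[n:]
            if not word:
                return
            if word[0] not in node.vertical:      # a byte never duplicates
                node.vertical[word[0]] = LString(word[1:])
                return
            node, word = node.vertical[word[0]], word[1:]

    def contains(self, word: bytes) -> bool:
        word, node = word + b"\x00", self
        while True:
            if not word.startswith(node.horizontal):
                return False
            word = word[len(node.horizontal):]
            if not word:
                return True
            if word[0] not in node.vertical:
                return False
            node, word = node.vertical[word[0]], word[1:]

repo = LString()
for w in (b"Albert", b"Alfred", b"Alberta"):
    repo.insert(w)                     # "Al" run is shared, never duplicated
assert all(repo.contains(w) for w in (b"Albert", b"Alfred", b"Alberta"))
assert not repo.contains(b"Alber")     # only whole stored words match
```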
  • In order to record textual information uniformly, FOCUS uses the Unicode standard codification. This is a 2-byte encoding scheme that covers all the glyphs used on the planet and is open ended. L strings never duplicate a given character. In order to handle badly accented words or inconsistently spelled or written words, more than one representation can be considered. For example, the number 12 can be written as a sequence of characters "twelve", or as a sequence "12", or as a numerical value 12. There is nothing in FOCUS that prevents using the three representations at once. The same thing may be done with a compound word such as weekend. This can be stored as "weekend", "week-end", "week" and "end", all at the same address.
  • Automatic splitting of repository when disk access becomes critical
  • the problem here is how to split a repository in the first place.
  • One of the utilities used with a FOCUS is sequential reading of its repository. This program is used to build a new repository on the second disk up to the point where it is half the original. Then it goes on reading but writes the second half to another file. Then the original is erased. This is preferable to keeping the original as the second half and erasing the data stored in its first half, since, if other splits are needed, that would result in an unnecessarily empty file.
  • the number and names of repositories are stored in the "basic" repository (the one that starts with the lowest strings) as a special code (the lowest) which keeps the whole management inside the FOCUS.
  • Intercepting operating system calls on file openings, file closings, writing to files, etc. allows one to build a chronological log of events, which in turn allows one to decide which file is to be declared deleted, moved, renamed or updated. Unless some smarter (OS dependent) method can be used on updated files, an update is to be considered as a deletion followed by a creation. Partial updates will generally only stem from a new analysis (semantic, for instance) producing new parameters and canceling others. If the reference concerns a field in a database, it is only necessary to have automatic logging of editing on this database, the logging file containing all necessary parameters. An easy format for these logs is ODBC. More generally, changes and updates are better taken into account at their very generation.
  • "Full text" engines are generally unable to record non-textual information. There are two ways of doing this: deciding on one or more "prefixes" and deciding on one or more "suffixes". For instance, if textual information is meant to be prefixed by "0", then numeric information can be prefixed by "1", dates by "2", MIDI by "3", etc. If textual information is a file content, one can decide on suffix "0". If this textual information is a parameter for the filing system, then the postfix is "1". If it is a filename, the "1" postfix can be appended by a "0", and so on. The choice of putting a prefix or postfix is dependent upon the access that will be used for retrieval.
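A small sketch of key construction under the example tag values given above; the helper and its argument names are hypothetical.

```python
# Sketch of type-tagged repository keys using the example prefixes/postfixes.

PREFIX = {"text": "0", "number": "1", "date": "2", "midi": "3"}
POSTFIX = {"file content": "0", "filing parameter": "1", "filename": "10"}

def make_key(kind, value, role=None):
    k = PREFIX[kind] + value
    if kind == "text" and role is not None:
        k += POSTFIX[role]           # the role postfix applies to textual data
    return k

print(make_key("text", "albert", "file content"))   # 0albert0
print(make_key("text", "report", "filename"))       # 0report10
print(make_key("number", "42"))                     # 142
```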
  • Feeding FOCUS can be done in a variety of ways. For instance, receiving data from a network can trigger the analysis of its content. Updating a database record can trigger the analysis as well. Updating a word processor file can also trigger the analysis. Whichever process is used, all FOCUS wants is to get a list of references to be taken into account in a dedicated file. One can point out immediately that the analysis process can also decide not to store or keep the data, such as for network filtering. But it may also flag the data as being "unacceptable" and store it anyway to allow the network manager to trace unacceptable information.
  • The FOCUS sorting routine already uses the FOCUS repository format, so that users can access recent data very quickly. But multiplying these files on a particular physical disk drive means degrading the system's performance on queries and updates (particularly deletions). So the final step is to "merge" these temporary repositories into a single final one on every physical drive. Although this process is relatively fast, it can be carried out in low-priority background (FOCUS controls its own background priorities). Depending on applications, however, temporary repositories can also be kept instead of merged (if they each carry one day of incoming data, and if they are to be erased after a given number of days, for instance).
  • references are recorded in the repository and given an ID (or handle) number (unlimited integer).
  • the cross-reference (from ID number to reference) is also recorded.
  • When a reference is deleted, its cross-reference is first marked as deleted; the deletion of its entries then occurs, and at the end of this deletion process the cross-reference is marked as free.
  • FOCUS is not concerned with meanings associated to data; only users are.
  • the current approach here is to process the first rank and store all classes that are guaranteed to fit in RAM for a final processing. This is based on the fact that a class must be smaller than the amount of RAM available, proportioned between what has already been read and what is left to be read, assuming that all subsequent data will have the same proportion of that class.
  • the next rank is used for classification and the process goes on.
  • classes are read in the collating order and classification is thoroughly performed. This occurs only once.
  • the first subclass is loaded along with the class before the processing is undertaken. All classified data are then flushed onto the disk, data is reorganized in RAM and the next subclass is loaded. And so on.
  • the temporary files will use slightly more storage than the final result, but no more than the product of the number of temporary files by the cluster size.
  • the amount of RAM devoted to the process is not critical: variations of only about 20% have been observed. Organizing the RAM for the sorting routine as described here is just one of the ways it could be implemented. At one given moment, there might be a maximum of three lists in RAM: the list of items itself, the list of files from previous loads, and the list of files being stored with the current load. Loading a buffer of data is thus done between the end of the first list and the beginning of its data structure (pointers).
  • This third list is made up in the following way. Let us consider that one is in the first rank and that the byte value is B. Let us also consider that the size of the B class is OK to fit in a file which already exists (this is known by reading the second list, as it is sorted). Then all entries of the B class will be appended to the B file. The B list is skipped. If the B file is created here, the first entry of the B class will be given a length of 1 byte and will point to the "B+1" (according to collation) class. When all items have been sent to their respective files, their sequence has been replaced by the second list. These "file surnames" are then moved to append the first list and their pointers are moved before the first list's pointers. The two lists are then joined and sorted. Filenames are given in ascending order such as "0000", "0001", ...; their surnames are the letters they contain that have already been classified. These names can be appended to the surname, taking into account that the actual data is 4 bytes (in the example) longer than the string to be sorted. Reading a datum gives direct access to the corresponding filename. If doublets are to be skipped, this is done as soon as they are detected (the first occurrence points directly to the next item, bypassing the doublet).
  • Swapping data between RAM and disk implies some clustering scheme.
  • FOCUS will accept any cluster size above 4K bytes (a minimum based on an 8-bit data analysis and small repositories) which is convenient for the user and compatible with its longest item (roughly twice as much).
  • For simplicity's sake, the implementation arbitrarily restricts the cluster size to a power of 2, which is generally what operating systems' swapping schemes use anyway.
  • This clustering scheme brings fragmentation of the repository so that L strings can point either to another L string in the same cluster or to another L string in another cluster, and this can be indicated by a flag bit.
  • subclusters are a power-of-1/2 fraction of a cluster (1/2, 1/4, 1/8, ...).
  • the general address of an L string thus becomes: cluster, subcluster, and byte address.
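A sketch of packing such an address into one integer, with a flag bit for same-cluster pointers as mentioned above; the field widths are arbitrary assumptions for illustration.

```python
# Sketch of packing (flag, cluster, subcluster, byte) into a single address.

CLUSTER_BITS, SUB_BITS, BYTE_BITS = 32, 4, 12   # assumed: 4K clusters, 16 subclusters

def pack(same_cluster: bool, cluster: int, sub: int, byte: int) -> int:
    addr = int(same_cluster) << (CLUSTER_BITS + SUB_BITS + BYTE_BITS)
    return addr | (cluster << (SUB_BITS + BYTE_BITS)) | (sub << BYTE_BITS) | byte

def unpack(addr: int):
    byte = addr & ((1 << BYTE_BITS) - 1)
    sub = (addr >> BYTE_BITS) & ((1 << SUB_BITS) - 1)
    cluster = (addr >> (SUB_BITS + BYTE_BITS)) & ((1 << CLUSTER_BITS) - 1)
    flag = bool(addr >> (CLUSTER_BITS + SUB_BITS + BYTE_BITS))
    return flag, cluster, sub, byte

assert unpack(pack(True, 1234, 5, 678)) == (True, 1234, 5, 678)
```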
  • the composite structure of FOCUS strings implies two subsequent fields.
  • the length of a string is the length of the data, after which numerical values contain their own length (see "Unlimited monotonic integer numbers of minimum length").
  • since the feeding is done in such a way that the IDs are monotonic, and positions within IDs as well, it is enough to sort on the content only, provided "duplicates" are kept in their original order.
  • FOCUS is an operating system by itself, and as such has its own filing system.
  • This filing system could be used by any operating system with great profit, but it also allows us to think of putting the system directly on chips aboard the disk controllers.
  • This filing system uses no FAT (File Allocation Table), yet it is easy to make ultra-safe and allows all the usual optimizations to be carried out relatively simply. Instead of centralizing cluster allocation, the clusters are chained among themselves.
  • an empty file is a chain of clusters.
  • the chain means that each cluster has a pointer to the previous and the next one. If no previous or no next cluster exists, the pointer is set to any arbitrary value that cannot be a valid logical/physical cluster's address (usually 0).
  • This chain - called the "free-chain" - can either be initialized over the whole file, if its size is known at the time of creation of the file, or dynamically expanded by one or more clusters when the need arises.
  • Allocating a cluster is simply taking it from the free-chain to put it in another chain built on the same principle. De-allocating a cluster is giving it back to the free-chain.
  • the first cluster (root) can be used to store the starting and ending points of these chains if there is a limited number of them. If not, this recording itself will use the chaining technique. All of the usual functions of a filing system can be readily implemented. De-fragmentation, for instance, is done by copying the last clusters into free-chain ones, and truncating the file when the free-chain is a continuous set of clusters at the end of the file.
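A toy sketch of the free-chain idea, with pointer value 0 marking "no neighbour" as described; allocation simply moves a cluster from the free-chain into another chain, and de-allocation gives it back.

```python
# Sketch of chained-cluster allocation without a FAT.

class ClusterFile:
    def __init__(self, n_clusters: int):
        # cluster 0 is the root; clusters 1..n-1 start on the free-chain
        self.prev = [0] * n_clusters
        self.next = [0] * n_clusters
        for c in range(1, n_clusters):
            self.prev[c] = c - 1 if c > 1 else 0
            self.next[c] = c + 1 if c + 1 < n_clusters else 0
        self.free_head = 1 if n_clusters > 1 else 0

    def allocate(self) -> int:
        c = self.free_head
        if c == 0:
            raise MemoryError("expand the free-chain by one or more clusters")
        self.free_head = self.next[c]
        if self.free_head:
            self.prev[self.free_head] = 0
        self.prev[c] = self.next[c] = 0
        return c

    def free(self, c: int):
        self.prev[c], self.next[c] = 0, self.free_head
        if self.free_head:
            self.prev[self.free_head] = c
        self.free_head = c

f = ClusterFile(8)
a, b = f.allocate(), f.allocate()
f.free(a)
assert f.allocate() == a        # the free-chain hands cluster a back
```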
  • Multiprocessing
  • Data Bases: referencing a data base content
  • the multi-processing technique makes no call to the operating system and is thus OS independent.
  • Mail boxes between processors are used; they are dedicated file names that can even carry parameters. This not only allows synchronizing in and out of several processes, but also allows sending data to them with a file explorer type program. Although the same control could be obtained by running the command anew with new parameters, this explorer-type program control technique allows one to achieve a "control panel" of the process by just looking at the filenames in some directory. For instance, setting a debug level can be controlled by filenames like "debug.005".
  • FOCUS uses the Unicode standard codification. This is a two-byte encoding scheme that covers all the glyphs used on the planet and is open ended. This nevertheless doubles the size of textual information. The way the information is stored in the repository optimizes this space: L strings never duplicate a given character.
  • a "word” can have more than one representation.
  • the number twelve can be written as the sequence of characters “twelve”, as the sequence "12" or as a numerical value 12.
  • a FOCUS will store the whole word along with its components, such as in the "weekend” example above.
  • Grammatically based addressing systems
  • Locating a word in a text can be done in several ways. Let us mention three of them: a/ The "physical" address, i.e.
  • a co-occurrence is defined as a link between two words. This link is simply the fact that those two words appear together in the same paragraph. Studies have shown that simple word co-occurrences do not give interesting results because of their semantic ambiguities. To the contrary, groups of words or expressions are semantically rich enough to be used with a co-occurrence network. The basic function that co-occurrences perform is providing a list of expressions from a given request. Those expressions are the ones that appear most often with this word, and all the co-occurrences between all those expressions are given for any graphical representations. All co-occurrence calculations are based upon the results from the given request.
  • a request can be a word, an expression or any Boolean combination of words or expressions.
  • Co-occurrences is an old idea in linguistics, meaning that there may be interesting information stemming from the words or expressions most frequently associated to a given term.
  • if n is the number of terms, ordinary co-occurrence processes are in n to the power 3. This means that they are impractical on large bases, and even with large RAM memories, as the CPU time is tremendous. Strangely enough, most co-occurrence systems use what is known as "clustering". This technique consists of choosing an arbitrary value (usually around 2K bytes) within which co-occurrences will be computed. This value has obviously no linguistic meaning.
  • a FOCUS can record the paragraph numbers of the terms it analyses. Then the computation of co-occurrences can be done on a much more significant linguistic basis. Furthermore, if a FOCUS stores the cross-references of the paragraphs with their associated words, the whole process can be rendered linear.
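A sketch of this linear computation, assuming the repository can hand back, for each term, the paragraphs it occurs in, and for each paragraph, its terms; counting is then one pass over the paragraphs that answer the request, not a cubic scan.

```python
# Sketch of paragraph-based co-occurrences via stored cross-references.
from collections import Counter, defaultdict

def build_index(paragraphs: list[list[str]]):
    term_to_paras = defaultdict(set)
    for pid, terms in enumerate(paragraphs):
        for t in terms:
            term_to_paras[t].add(pid)
    return term_to_paras

def cooccurrences(request: str, term_to_paras, paragraphs, top: int = 5):
    hits = term_to_paras.get(request, set())
    counts = Counter(t for pid in hits for t in paragraphs[pid] if t != request)
    return counts.most_common(top)

paras = [["horse", "cavalry", "charge"],
         ["horse", "saddle"],
         ["cavalry", "charge", "trumpet"]]
idx = build_index(paras)
print(cooccurrences("horse", idx, paras))   # cavalry, charge, saddle with counts
```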
  • When the user starts typing a word, the window echoes his typing to show that the FOCUS "understands" what is going on. Conversely, when the window stays empty, it shows that some of the characters typed had no echo in the FOCUS' knowledge.
  • the suggested display in the right window is a sampling of words beginning with the letters before the cursor, but with a different next letter for each word displayed. For instance, if the user's cursor is after the letters "BE" and the FOCUS has stored the words BEE, BEEF, BENEATH, BENIGN, we will only show BEE and BENEATH, simply because screen space is limited and that is enough for the user to know that he won't find BEFORE, as no word beginning with BEF is displayed.
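A sketch of that display rule over a sorted word list; the function name is hypothetical.

```python
# Sketch of the "one word per distinct next letter" display sampling.
import bisect

def sample(words: list[str], prefix: str) -> list[str]:
    """words must be sorted; return one word per distinct next letter."""
    out, seen = [], set()
    i = bisect.bisect_left(words, prefix)
    while i < len(words) and words[i].startswith(prefix):
        nxt = words[i][len(prefix):len(prefix) + 1]   # '' if the word is the prefix
        if nxt not in seen:
            seen.add(nxt)
            out.append(words[i])
        i += 1
    return out

words = sorted(["BEE", "BEEF", "BENEATH", "BENIGN"])
print(sample(words, "BE"))   # ['BEE', 'BENEATH']: nothing starts with BEF
```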
  • The substitution engine used to handle alphabetical ligatures can also be used for Optical Character Recognition errors. Given an engine that does not make too many mistakes, i.e. to the point where the output is easily readable (our current reference is FineReader V4), common OCR mistakes such as the confusion between "rn" and "m" can be declared in the data for the substitution engine. Searching for "modern" will then also find misspellings like "modem". This feature, which is to be triggered by some menu option or any other suitable means, allows one to keep valid output from OCR engines without having to do any manual corrections and yet be able to find pertinent data in them.
  • Java applet + JavaScript structure
  • the current CGI (HTML) mode of operation on the Web has longer response times when compared to those of a FOCUS. It was thus necessary to design a new way to preserve as much interactivity as we could. Using a Java applet to handle the query dialog does this. Every time a key is pressed, and assuming there is no pending query, only a few characters are sent on the network.
  • the FOCUS' answer can be kept very small (2K, for instance), so that once again the load of the network is minimal.
  • using a CGI script would require that over 10 times more characters be sent to redraw a whole page.
  • Some products are able to extract image signatures.
  • Accessing a musical sequence or a video sequence can be done through the industry standard SMPTE (Society of Motion Picture and Television Engineers) time code.
  • the value of that code at the beginning of the sequence is what one can use as a positioning parameter.
  • all these types of data usually do not occur alone but they are accompanied by textual information.
  • For a movie, readily available textual information is the script and dialogue (sub-titles).
  • Musical information can be stored by laying a MIDI track of the melody along with the video sequence (any ordinarily skilled keyboardist can do so at once, so cost is not a big issue here).
  • Running FOCUS on these textual data and using the SMPTE code as an access method, one can allow direct access to any given sequence using those textual and musical parameters.
  • Searching through image signatures
  • a domain is simply the subset that is the response to a query. The same query will obviously produce different domains as the database evolves and changes. So a domain will be defined simply by the query itself, applied, from then on, to everything analyzed by the FOCUS.
  • a domain definition can be done "dynamically" by a Focuser and then stored to be used as a reference. Domains are needed to describe the field of action of such things as Focusers and synonyms (thesauri). Their primary use is to play the role of "categories" or "taxonomies". Given a term, FOCUS knows which domain definitions this term is used in, making it possible to present these domains to the user so that he can restrict his query to them.
  • Cross-references can be used for document names, sizes, languages, etc., for field names in databases, every item that is generic to a set of data. It is especially mandatory to use them if this item can be changed. For fixed data, not using cross-references only means spoiling space, but there may be applications that need it. Cross-references are also used in our implementation of co-occurrences to store which words are in which paragraph.
  • the immediate answer (that would nevertheless handle the 24M molecules patented today) is to describe all possible representations of the network (i.e. starting with every node) in a hierarchical way, which provides only one possible representation for each node. If the nodes and links are represented by a numerical code, the uniqueness can be granted simply by ordering the different ways of navigating through the network on a numerical sorting. Loops are simply special links (we suggest giving them the lowest values after the "end of string" value) which jump to a node described by the number of links backward to a common node, and the number of links forward (in the description) to the looping node. Special groups of nodes and links can easily be represented as particular node values. Recorded networks are entered as described.
  • Querying networks simply consists of choosing a non-ambiguous node to start with (the farthest from any ambiguous node is preferred), describing the network under all its possible forms (according to the groups of nodes it may contain) and running the corresponding request on the FOCUS. Because of the fuzzy definition of certain groups of nodes, all the answers to the query to the FOCUS may not be answers to the user's query; it is thus necessary to select the right ones.
  • the 24M molecules already registered would involve a 90Gb repository, and the access is as usual: pertinent and not more than one hundredth of a second away...
  • Embedding FOCUS in existing applications
  • a FOCUS allows one to revisit all computer applications as they exist today. It is not simply a new function that can be called from an application as an external routine; it can really be EMBEDDED in - part of - all existing programs.
  • FOCUS can be used to replace the existing limited access to contents.
  • FOCUS will give immediate access to all filenames containing that word. No more navigation through unending directory tree structures. Assuming a number of people are connected to a network, agreeing on some - even limited - conventions on directory names and/or architecture will enable them to focus at once on the others' directories.
  • Targets are identified by a single letter and consist either of files or applications to which the selected items will be sent just by typing the associated letter. Contextual menus and/or using the Ctrl or Alt keys along with pressing the letter key allow the user to monitor the "target": changing the name of the file, the target application, the various options associated with the sending of the data, etc. Basically, if a target is a filename, all information sent to that file will simply be appended.
  • Update
  • The first solution is to use a "phantom table" with the same structure as the original one, this phantom table being periodically dumped as for the initial analysis with the same parameters.
  • the FOCUS will handle the historicity of the records by deleting previous occurrences, as it does with all other documents.
  • the second solution is more complex but also more efficient. It consists of reading the internal logging format of DBMSes to achieve the same result. It must be noted that the first solution implies a slight modification of all applications using the DBMS, but no knowledge of its internal coding, while the second does not imply any interference with the applications, but requires knowledge of their internal structure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

Content-accessible method and system for operation of a computer. The three main parts of this invention include, first, a method for defining, classifying and indexing content; second, a method for designating all real numbers such that they can be arranged easily in a monotonic fashion; and third, a fast, linear method of sorting through the content according to the associated monotonic real numbers. The problem of ergonomics is solved by a simple dialogue box and a button asking for some 'advanced' search. The possibility of searching by 'themes' is added to this. A theme is defined by words, expressions or texts and is used to get pertinent information. The system uses semantics: the meaning of words. When searching for a theme, one does a Boolean OR on the words of the theme, or a selection of the most pertinent of them associated with the proximity constraint: at least N words must be in the same paragraph. The results are then sorted by pertinence to reduce the noise. FOCUS is not a full text engine as it detects groups of words, word roots, synonyms, 'concepts' (Focusers) and stores all these in its repository. Every 'knowledge' is extracted on data input and stored in the FOCUS repository. Analysis implies identification of the data format, decoding of it, detecting language on textual information, running the linguistic procedures and storing the result according to FOCUS input format.

Description

A REVERSED COMPUTER SYSTEM BASED ON ACCESS BY CONTENT INSTEAD OF ACCESS BY ADDRESS AND ITS FULLY OPTIMIZED
IMPLEMENTATION
This application claims the benefit of provisional application with Serial
Number 60/161,579 that was filed October 26, 2000.
FIELD OF INVENTION
This invention relates to content and context accessible computer systems.
BACKGROUND
The majority of current computer systems are not content-accessible; rather they are address-accessible, typically by a pathway, e.g., disk drive, folder, perhaps sub-folder, and file name. An example of an address might be "C:\Program_Files\Office\WordProcessor\Document_3042_version_5". An example of a content addressable subject might be "my letter to Smith of last week".
Until now, computer processors and computer systems alike have operated on the basis of "give the computer an address, and the computer will give the content of the address back". This is exactly the opposite of how the human mind apparently performs. This reverse process may account for the basic difficulty people have when using computers: humans and computers process information in an opposite manner. One does not ask another "do you remember the 14th of
April 1972 at 7:34pm?" Instead, one is more likely to ask "do you remember when we were at this restaurant with the piano bar where we had spare ribs?"
The restaurant, the piano bar and the spare ribs are the primary memories. One is not primarily concerned about the date of the event.
Rudimentary attempts at content addressing for computers have been initiated. The whole structure of computer processing, however, needs to be reconsidered. It is difficult to build content addressing on a foundation that has evolved from the inverted principles, relatively speaking, of specific addressing of files and of similar aspects of information arrangement.
Building a content accessible computing system, in an optimal mode of application, requires controlling content from its very source, i.e. data coming from a network or being stored on a disk by a program. Optimizing every building block of this system makes it fast enough to handle and process the content flow into the system. Making such a system portable and robust implies designing it to be independent of hardware, memory and item sizes as well as operating systems. The content accessible system, itself, is an operating system.
This is exemplified by the following: no computer user is really surprised when asked for a "file name" when "saving" a text file. But in a good content addressable system, this question will not arise. Rather than retrieving the file by a name that nobody knows and even the file creator may have forgotten one would simply ask for "my letter to Smith of last week".
Today, the realization of a good content addressable system is described in terms of implementing a software package under an existing operating system. One can make use of the existing physical storage methods and some software for access to files and the system clock. These are the only system resources one needs to implement the content accessible system. This ensures total portability.
SUMMARY OF THE INVENTION
This invention comprises a content-accessible method and system for operation of a computer. The three main parts of this invention include, first, a method for defining, classifying and indexing content; second, a method for designating all real numbers, including integers, such that they can be arranged easily in a monotonic fashion; and third, a fast, linear method of sorting through the content according to the associated monotonic real numbers, including integers, to access contents.
Fully Optimized Content Using Computer System (FOCUS) is designed as a linguistic package that is simple enough to be handled by the average user and fast enough to cope with network speeds. It is an indexing and searching system which combines simplicity, speed, and efficiency, and is based on users of the system knowing what information they want to get, as opposed to knowing an address where some information might be stored.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other features and advantages of the invention will be more apparent from the following detailed description wherein:
Figure 1 shows the functional architecture of the content accessed computer system;
Figure 2A shows naming of the monotonic numbering sequence; Figure 2B shows a finer detail for the naming of the monotonic numbering sequence;
Figure 3 illustrates part of the linear sorting technique; and Figure 4 shows the L-shaped repository files.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
The following description is of the best mode presently contemplated for carrying out the invention. This description is not to be taken in a limiting sense, but is merely made for the purpose of describing the general principles of the invention. The scope of the invention should be determined with reference to the claims.
This invention comprises a content-accessible method and system for operation of a computer. Functionally, the main operating parts of this invention are shown in Figure 1, including, first, a method for defining, classifying and indexing content (Focusers 101); second, a method for designating all real numbers, including integers, such that they can be arranged easily in a monotonic fashion (coding, 102); and third, a fast, linear method of sorting 103 through the content according to the associated monotonic real numbers, including integers.
The system is designed to meet the user's twofold needs: (1) ease of use and being free to express the request as the user likes (Query 104), (2) getting pertinent and immediate information 105. The content accessible information functionally resides in a repository 106. These last two points define what may be considered an ideal information retrieval system. Point 1 is mainly a matter of ergonomics. A simple dialogue box and a button asking for some "advanced" search solve this aspect. The possibility of searching by "themes" is added to this. A theme is defined by words, expressions or texts and is used to get pertinent information. Point 2 is achieved by systems using semantics: the meaning of words. Prior systems trying to utilize semantics ended up with a heavy computational overhead. Although the processes described further fall into the field of semantics, they are very far from what is currently used in this field, which has not yet proved to be useful.
The FOCUS system's ergonomics are simple: (A) one text field, (B) one radio button "theme", (C) one button "advanced search", (D) one button "search by meaning". The text field allows the use of a convention to "force" groups of words as opposed to a mere sequence of words (such as using quotes to define the group of words). Advanced search deals with Boolean logic applied to search components. Search by meaning is done through a file or a list of files corresponding to a given theme. These files could be a list of words.
A "theme" is a list of simple words chosen by the user as that user's expression of a given concept, along with the frequency in the given texts. This list of simple words can then be sorted by pertinence using the frequency deviance from a reference. When searching for a theme, one does a Boolean OR on the words of the theme, or a selection of the most pertinent of them associated with the proximity constraint: at least N words must be in the same paragraph. N is at least 2 and usually 3. The results are then sorted by pertinence to reduce the noise. A paragraph is anything separated by something recognized as a paragraph separator (this code exists in Unicode, but it has to be analyzed for simple ASCII texts). To have any meaning, a paragraph should have some minimal length.
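As an illustration of the theme query just described, here is a minimal Python sketch: a Boolean OR over the theme's words, constrained so that at least N of them occur in the same paragraph, with hits then sorted by pertinence. Scoring by the count of distinct theme words found is an illustrative stand-in for the pertinence sort.

```python
# Sketch of a theme search with the "at least N words per paragraph" constraint.
import re

def theme_search(theme: set[str], paragraphs: list[str], n: int = 3):
    hits = []
    for pid, para in enumerate(paragraphs):
        words = set(re.findall(r"\w+", para.lower()))
        found = theme & words
        if len(found) >= n:                 # proximity constraint
            hits.append((len(found), pid))  # pertinence = matched word count
    hits.sort(reverse=True)                 # most pertinent paragraphs first
    return [pid for _, pid in hits]

theme = {"horse", "cavalry", "saddle", "charge"}
paras = ["The cavalry mounted each horse and began the charge.",
         "A saddle was mentioned once.",
         "Nothing relevant here."]
print(theme_search(theme, paras))   # [0]: only the first paragraph qualifies
```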
The "theme" radio button indicates that the word or expression to be searched for is to be considered as a theme. The themes are pre-computed, thus allowing very fast semantic search. Results can be displayed in three forms. The first is using the title of the document when it can be identified. The second is using some sort of automatic abstract (first sentences or first expressions, or first most deviant expressions in the most pertinent paragraphs in the file). The third is displaying the text itself, in which case the viewing program might highlight pertinent paragraphs and words and so allow an easier navigation through the hits.
It should be noted that references themselves are to be recorded as contents. They are part of the items of the computer system that carry a meaning. For instance, for each filename, one will record the complete path along with all of its components: drive ID, directory names, directory extensions, file name, file extension and every word in these.
Next, one must decide what is an information. One can state that information has a twofold nature: it is a set of data one is looking for and also the way one looks for that set of data. In a text, a word can be an information, as one can use words to find words. But a text is also about themes and concepts. So, one then extracts themes and concepts from text. In languages like German where most words are compound, even part of a word is an information. Knowing that the word "being" is a noun or a verb is also an information. To retrieve information one uses the following steps: (A) Extracting or computing information from the files stored in the system. This will be the information one will be able to look for. It depends on the state of the art of concept extraction software. (B) Storing this information in such a way that it can be instantly retrieved. (C) Using a storage process that has performances independent of the mass of data to handle. (D) Providing a retrieval system to access all elements stored. (E) Doing all this as close to a real-time as possible.
Instant retrieval means building a repository with direct access to all its information contents. Having linear characteristics means that from the first step on, namely, the sorting routine that feeds the repository, every process must be linear on any set of data available at one time. Being independent of the amount of information to monitor allows one to avoid the use of multiple repository files as well as the copying of the repository to update it. Real-time processing means handling file manipulations such as updating, renaming, moving and deleting as they occur, without interruption or break in the time line.
Fully Optimized Content Using Computer System (FOCUS)
Intelligent filtering, as part of the FOCUS system, is not simply a matter of giving a list of wanted or unwanted terms. It is, however, usually undertaken with computationally heavy techniques that could require several minutes to decide upon the fate of a single sentence. The approach here is very different. The semantic configuration only requires the user to give examples of texts that are meant to be representative of the "theme" desired to be defined.
The "feeding" of FOCUS can be done in a variety of ways. For instance, receiving data from a network can trigger the analysis of it's content. Updating a database record can trigger the analysis as well. Updating a word processor file can also trigger the analysis. The concept of Focusers.
A Focuser is defined as a set of the following elements: (1) its name, (2) a list of words or expressions that represents the Focuser, (3) its language description (French, English, ...), as an option but not required, (4) several parameters that control its existence in a paragraph, applied to its list of words or expressions: (a) the number of words or expressions below which a paragraph is not relevant for the Focuser; this number is used for the detection of a Focuser; (b) the number of words or expressions in a paragraph below which the paragraph might be relevant and above which the paragraph is really relevant for the Focuser; (c) a threshold of pertinence, below which a word does not belong to the Focuser; this threshold is used to build a Focuser.
Building Focusers from a text
It is possible to build a Focuser automatically from a text by utilizing the known general frequency of words in a given language. With this information, one can calculate a "pertinence", which is a number representing the uniqueness of the word found in the paragraph. This number is the ratio: frequency of the word in the text / general frequency. So one can order all the words from the source text by decreasing pertinence values. Words having the biggest value are the most pertinent for this paragraph.
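A minimal sketch of this construction (Python, as in the other sketches of this document; the reference frequency table is assumed to be given):

    from collections import Counter

    # Pertinence = (frequency of the word in the text) / (general frequency).
    def build_focuser(text, general_freq, threshold=2.0):
        words = text.lower().split()
        counts = Counter(words)
        pertinence = {w: (c / len(words)) / general_freq.get(w, 1e-6)
                      for w, c in counts.items()}
        # keep only the words above the threshold, most pertinent first
        return [w for w in sorted(pertinence, key=pertinence.get, reverse=True)
                if pertinence[w] >= threshold]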
It is also possible to manually enhance the Focuser by creating specific information: (A) specific expressions that are very relevant; (B) synonyms (which can be automatically proposed by the system); (C) words or expressions which should be excluded from the Focuser (forced zero pertinence value); (D) words or expressions which discriminate the Focuser, i.e. they are automatically excluded from the Focuser (negative pertinence value); (E) a word that is accepted, excluding all expressions containing the word. The Focuser is then represented by some of the words that have the biggest values. The threshold can be chosen by the program or the user. A very simple example of a text for a Focuser is the following: "This is about horses, horse, horseback cavalry, @mounted_troupes. horse.", where "@mounted_troupes" is a very relevant expression. The Focuser itself would look like the list of words from which non-pertinent words have been removed: cavalry, horse, horses, horseback, @mounted_troupes
Detection of a Focuser
A Focuser does not mean anything per se but takes all its meaning when it is compared to a text that has paragraphs. So, a Focuser is said to be recognized when a certain number of words or expressions pertaining to this Focuser are recognized in the same paragraph. This number is a parameter of the Focuser, and its value is usually around 3. Expressions that have been manually entered in the Focuser have a value of 3 times the value of a single word in the Focuser.
Automatic routing/filtering using content of a text
Routing/filtering are defined through a "profile". A "profile" is a set of positive and negative Focusers. Whenever a mail or an HTML text or any text goes through the filter, all defined Focusers are detected (see the paragraph "Detection of a Focuser") and compared to the profile: if a negative Focuser has been recognized, the text is rejected; else, if there is at least one positive Focuser, the text is accepted. Any combination of positive and negative Focusers can be used.
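The following sketch (Python; the data layout is an assumption) combines the detection rule above with profile routing:

    # A Focuser fires on a paragraph when the weighted count of its recognized
    # terms reaches its detection parameter (usually around 3); manually entered
    # expressions (marked "@" here) weigh 3 times a single word.
    def detect(paragraph, focuser, needed=3):
        text = paragraph.lower()
        words = set(text.split())
        score = 0
        for term in focuser["terms"]:
            if term.startswith("@"):
                if term[1:].replace("_", " ") in text:
                    score += 3
            elif term in words:
                score += 1
        return score >= needed

    # Profile routing: any negative Focuser rejects; else a positive one accepts.
    def route(text, profile):
        paragraphs = text.split("\n\n")
        for f in profile["negative"]:
            if any(detect(p, f) for p in paragraphs):
                return "rejected"
        for f in profile["positive"]:
            if any(detect(p, f) for p in paragraphs):
                return "accepted"
        return "undecided"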
The semantic analysis process qualifies every paragraph according to predefined Focusers. Instead of just recording that qualification along with the data, one can use it to route (or stop) the incoming information when particular themes are associated with particular destinations. On the other hand, all this can be kept in a central repository, each user being configured to have an implicit Boolean AND with his themes. This procedure would allow a network manager to monitor what is routed (or stopped), and where it is routed (or stopped).
When filtering applications come to a "firewall", monitoring data escaping the firewall is mandatory if one wants the firewall to be efficient. By being able to check what was stopped that should not have been, what was let pass that should have been stopped, and what is still undecided upon, the firewall controller can tune his firewall according to the attempts to defeat it by people who want their information to pass through. Of course, when the controller has enough evidence that a particular provider is sending only undesired data, he can also declare its identification (IP, etc.) to be rejected, which will still be handled by FOCUS, but will speed the process a lot.
Automatic building of Focusers utilizing text in a repository
One can automatically build a concept associated with any word of text in a repository. The algorithm is the following: get all paragraphs containing this word, then extract the concept from the concatenation of these paragraphs. It is very time consuming to do this for all words, so, for example, one can limit this kind of extraction to expressions that contain at least two meaningful words. The user can then ask for a given expression as such or for this expression as a concept.
Representation of semantic networks, thesauri, lexicons with automaton
One can represent any kind of information linked to a string in memory the following way: <word><separator><type of information><separator><information>
If the information is small enough, one can keep it directly in the automaton; otherwise the information stored in the automaton is only the address of the information, which is, for example, stored in a file. The <separator> and <type of information> must be small (usually one letter) in order to be as compact as possible. Thus, semantic networks, thesauri, lexicons, etc. are represented this way provided they can be contained in memory.
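A sketch of this packing (the separator and type codes chosen here are arbitrary one-byte values):

    SEP = "\x01"   # one-byte separator; the type codes below are equally arbitrary

    def pack(word, info_type, info):
        return f"{word}{SEP}{info_type}{SEP}{info}"

    def unpack(entry):
        return tuple(entry.split(SEP, 2))   # (word, type of information, information)

    entry = pack("being", "g", "noun|verb")   # "g" standing here for grammatical data
    assert unpack(entry) == ("being", "g", "noun|verb")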
Semantic analysis/filtering implementation on silicon
Basic routines of dictionary manipulation can be implemented in silicon, either on a general processor based controller or on a dedicated RISC chip.
Automatic viewpoint of a text using deviance
It is possible to give the most pertinent words or expressions in a text in relation to a Focuser. The Focuser is said to be a "viewpoint" on the text. To produce a viewpoint, one simply gives the most deviant words or expressions that are both in the text and in the Focuser. Expressions do not need to be explicitly in the Focuser because the deviance of an expression is the compounded deviance of the words in the expression. A compounded deviance is the sum of the deviances of the single words contained in the expression divided by their number. If no specific viewpoint is given, the viewpoint is simply the Focuser built from the text itself.
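A sketch of the compounded deviance and of a viewpoint (all names are illustrative):

    # Compounded deviance: sum of the deviances of the single words of an
    # expression, divided by their number.
    def compounded_deviance(expression, deviance):
        words = expression.split()
        return sum(deviance.get(w, 0.0) for w in words) / len(words)

    # The viewpoint keeps the most deviant expressions present in the text.
    def viewpoint(expressions, deviance, top=10):
        ranked = sorted(expressions,
                        key=lambda e: compounded_deviance(e, deviance),
                        reverse=True)
        return ranked[:top]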
Automatic language and code page recognition
Recognizing a particular language could be done by comparing all the terms of a sentence to all the dictionaries of the planet coded in every possible character code page. It is easy to see that this method is too heavy. A much faster and roughly as accurate solution is to characterize languages by the statistical distribution of n-uplets. Experience shows that getting the statistical distribution of quadruplets gives a rather good identification of the combination language/code page. This is done very straightforwardly using the sorting routine. The elements to be sorted are simply the n-uplet starting on the first byte, the one starting on the second byte, and so on. Then one counts the duplicates and compares the result to a pre-built n-uplet database. This database is built by applying the same process to a text large enough to be representative. A sample text of the order of magnitude of 1 Mbyte is considered sufficient to build the database. This database consists of n-uplets and their corresponding frequency for each combination language/code page.
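A sketch of this statistical recognition (Python; Counter stands in for the sort-and-count of duplicates):

    from collections import Counter

    def quadruplets(data: bytes):
        return (data[i:i + 4] for i in range(len(data) - 3))

    # One profile per (language, code page) pair, built from a ~1 Mbyte sample.
    def build_profile(sample: bytes):
        return Counter(quadruplets(sample))

    # Score the text against every profile and keep the best total.
    def recognize(text: bytes, profiles):
        scores = {pair: sum(prof[q] for q in quadruplets(text))
                  for pair, prof in profiles.items()}
        return max(scores, key=scores.get)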
When comparing a text to a database, adding the frequency of each occurrence of the n-uplets found in the text for each pair language/code page and choosing the highest number gives the language/code page. This method is valid for recognizing a language and a character code page for a given paragraph. There is another method, which is more precise, given a specific character code page (in other words, one needs to be able to recognize the words of the text). This method is based on the use of a small dictionary of frequent words for each language to be recognized, and is able to recognize the language in a single sentence. Each word in this small dictionary is associated with one or several languages. Then each language pertinence is incremented when a word pertaining to this language is recognized. The highest language pertinence is then chosen.
Content accessible computer
A content accessible computer considers every content to have three parameters: (1) a reference, (2) the content itself, (3) a position. A reference is a string of bytes that indicates to the hosting software how to get to the content. This can be a file name, an address in memory, a field in a database record, a file in an archive, etc. References are preceded by a code indicating which software will be asked for the content: the filing system for a file name, an archiving program for an archived reference, a database management system for a record, etc. The rest of the string is an argument that is passed to the host, such as an SQL request for a database manager. The coding used is host dependent.
Now one considers content. Every content in a computer is a string of bytes and the length of the string. Typically these bytes are 8 bits to a byte. A third aspect is position. This is not a physical position in the computer's memory or the computer's storage area. This position refers to something meaningful to the user. In a text file, for example, this position can be the number of the chapter, the number of the section, the number of the paragraph, etc. In a database, this may be the number of the table, the number of a record, the number of a field, the number of a paragraph, the number of a sentence, and so on. In a medical scanner's image, this position could be XYZ coordinates, and the red-green-blue (RGB) false color contents. Position is primarily recorded to provide proximity management.
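A minimal sketch of this triple (the field types are assumptions; the position would in practice be a multiplexed coded value):

    from dataclasses import dataclass

    @dataclass
    class Content:
        reference: bytes   # host-dependent: file name, SQL request, archive entry...
        content: bytes     # the string of bytes itself (with its length)
        position: int      # logical position meaningful to the user:
                           # chapter/paragraph, table/record/field, XYZ and RGB, etc.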
Semantics glimpses
This technology provides a way to identify the content of a document before opening it, which can be a relatively lengthy operation on large documents accessed via a network. The technology of glimpses uses the idea of an abstract of the document, but creates it automatically from the content of the whole document. As the linguistics of a FOCUS extract groups of words, it is relatively easy to select the groups that are most characteristic of a document by comparing them to a statistical analysis of a given body of data.
For instance, for plain text, the body used as a reference can be ordinary English, French, etc. taken from a large plain text repository of a FOCUS. When analyzing databases, field names are often very far from ordinary spoken language and tend to be systematically selected for the glimpses. It is easy to analyze all of the database and then select, on the corresponding FOCUS, the reference list of current words and expressions. A Focuser is a text file giving a list of terms, synonyms and expressions reflecting how the "Focuser owner" would speak of something when speaking to another person on a given subject. For any set of texts considered as pertaining to a given domain, this domain can be described in a Focuser and the result can be used as above to give a more accurate reference for this domain.
Once a FOCUS has been fully implemented, this redefinition and analysis can be made locally with respect to the FOCUS' repository, with no access to the documents themselves. These technologies are thus suitable for a network environment as well. When a document name is selected or displayed, it is easy to display the associated glimpses that have been recorded in the FOCUS. Browsing through these glimpses allows the user to quickly identify whether a document is interesting or not for his current query. Any expression of the glimpse can be used in a subsequent query as well.
dynamic Focusers
A dynamic Focuser is a Focuser that is not compiled and created at filtering time but one which is analyzed and created in real time on one or several Focus Repositories. As above, a Focuser is a text file giving a list of terms, synonyms and expressions of how the "Focuser owner" would speak of something when speaking to another person on a given subject. Computing a Focuser is not restricted to the time a document is analyzed. In fact, it can be done any time, even when the user asks a question. The format can be anything. The user may type a series of words and/or expressions, or type real sentences, or better, grab entire sentences and paragraphs from any document, in particular those which have been found through another query on their FOCUS.
The first job of the system is to make a query of all the words in this text, then to sort them according to the documents and the positions they are in. This function is greatly helped if the numbers of the paragraphs are stored in the FOCUS' memory; if not, this technique reverts to what was known as "clustering", which, by the way, is linguistic nonsense. Allowing entire paragraphs to be used as a REAL query with an acceptable response time is a breakthrough in NATURAL LANGUAGE computing. Even more, these queries can be intermixed (Boolean) with "ordinary" queries on text, images, music, etc.
Using the linear sorting routine, this operation is fast enough to be carried interactively.
Then, according to the accuracy parameter (roughly the number of words that must occur in any given paragraph), a selection of the "winning paragraphs" is computed. This parameter can be monitored by way of a slider or "+" and "-" buttons, and the result can be quickly recomputed until a convenient number of documents is selected. The result of the query is then sorted by pertinence.
If the "dynamic Focuser" is satisfactory to the user and if the user thinks of using it again, this definition (the Focuser) and its results (the selected paragraphs) can be recorded by the FOCUS. This operation, again, involves no access to the original documents and can be appropriate for a network environment (such as a Web portal).
The same technology can be used to update Focusers (whether they are dynamic or "recorded" in the FOCUS). The only extraneous step is to erase the occurrences of the previous definition of the Focuser, which again involves no access to the original documents. The fact that it is never necessary to refer back to the documents simply stems from the fact that the FOCUS has a much better "knowledge" of these documents than the documents themselves, so to speak. In fact, a FOCUS will typically store over 15 times more information than the number of words in a given document. All these updates are done in "real time".
reverse Focusers for e-commerce
In fact this application of Focusers is not restricted to E-commerce, but E-commerce provides a good example. On an E-commerce portal, the vocabulary describing the items to be sold is usually the merchant's. This makes navigation very hard for a customer who does not necessarily know this language. Classification is often not very helpful, as the same article can be thought of as being in both "sports" and "leisure", for instance. The FOCUS' solution to that dilemma is fairly simple. First, start the portal with whatever descriptions are available, but make them Focusers. When a customer types a word in, request from the FOCUS all Focusers containing this word and display the answers as if they were "categories". If the user does not find the category he was thinking of, he may rephrase his query (or quit!) and the game starts again.
If all the users' queries are simply logged in any proprietary format, this logging provides the E-commerce manager with two types of interesting information. Of course, he will have the "profile" of his customers (and he can make Focusers out of them if they are consistent enough), but he will also have a "profile" of the way people speak of what they want for each item on sale. These profiles can be used to update the Focusers describing the items, and the FOCUS will quickly learn how to answer customers' queries in their own words (keeping them from quitting...). This process shows how a standard FOCUS can be applied.
use of glimpses to detect similar documents
A request on the expressions taken from the glimpses of a given document can quickly provide a set of similar documents (textual, images, music, etc.). There are two ways of achieving this.
One is to run new requests on the glimpses themselves using tokens; the other is to run a request on each of the expressions. The answers have to be sorted out using any parameters the user thinks appropriate, and the output is then ranked by similarity.
automatic duplicate detection
It is highly improbable that two different documents can have the same glimpse (this would imply that the 30 most typical expressions are the same for the two documents).
It is very easy to run a separate application that just queries the glimpses and outputs a list of those that belong to more than one file. But it is also possible to issue "duplicate warnings" automatically when storing the document parameters in the FOCUS' repository: if a glimpse already exists, then this might be a duplicate. Of course these techniques only provide "duplicate candidates". Even the fact that two files have the same length and date is not enough to declare them "duplicates". On the other hand, this detection is independent of the format: a text processor's document duplicating an HTML one will produce the same glimpses. Ultimately, some human being (directly or through an application program) will have to decide what to do with the "duplicates".
Open-ended, Unlimited And Variable Length Coded Numbers
monotonic floating point coding (MFPC)
A variable length coding needs the length to be part of the code. Computer data are always a string of bits (or bytes). The first upper bits in the first byte of the string indicate the length of the string in bytes: let us code it in the upper bits of the first byte and say that the position of the first reset bit (0) gives the length of the string. In this particular coding, all remaining bits are used to indicate the value. This is hence optimal. The space taken is the absolute minimum and there is no limit to the numbers (when all 8 bits of the first byte are set, coding goes on in the second byte). Moreover, the strings are binary monotonic, which allows sorting and order testing without decoding. Routines to code and decode are straightforward. This coding is independent of the processor's representation of numbers (big or little endian). It can be based on other units, such as 16 bit words, when computers agree on their 16 bit binary representation. If the numerical fields in a database comply with the MFPC coding, then all fields can be sorted as binary values without dynamic collation for numerical fields. For numerical information, that is, real numbers including integers,
FOCUS does not put any limitation either on the number of digits in the mantissa or on the value of the exponent. As seen in Figure 2A, the real numbers can be divided according to the sign of the mantissa 201 and of the exponent 202. The overall value depends on these classifications. Figure 2B shows a further classification where, for example, plus infinity 204 and minus infinity 205 are separated out; plus one 206, minus one 207 and zero 208 are also separated out, although these are not really required (but they do no harm). In Figure 2A one sees the mantissa 201 in the first row with a plus sign 209, the exponent 202 also with a plus sign 210, and this is coded as D 211 in this system. In the next case, the mantissa 201 is plus 212, the exponent minus 213, and the code is C 214. C 214 is less than D 211, or D 211 is greater than C 214, intrinsically. The next case has the mantissa 201 as minus 216, the exponent 202 as plus 217, and is coded as B 218. Now B 218 is less than C 214; however, it is greater than the code A 219, which has a mantissa 201 of minus 220 and an exponent of minus 221.
As one sees in Figure 2B, looking at the fourth row down, which has the mantissa plus 212, the exponent minus 213 and the code F 251, as well as the eighth row down, which has the mantissa minus 216, the exponent plus 217 and the code B 218, one has to realize that the larger the absolute value of the exponent, the smaller the value of the number. In the first case the mantissa is positive, but the number is something raised to a growing negative power, that is, one over something getting very big, so the number gets very small. In the next-to-last row the mantissa is minus, so it is a negative number, and the exponent is positive, so the number is something raised to a very big power: it becomes a very big negative number. So in the F 251 and B 218 areas, the larger the absolute value of the exponent, the smaller the value of the number. Therefore, one may want to complement the exponent to some base. For example, 12 in base 10 would become 99 minus 12 equals 87. The absolute value of the mantissa is then likewise coded to some base. The various bases to which these components are coded are really irrelevant as long as they are the same for all the coded elements of the numbers. An unlimited real number can thus be coded into a sign combination character, which is (Figure 2B) A 265, B 218, C 266, D 267, E 264, F 251, G 263, H 262, I 261, an unlimited monotonic coding of the exponent or its complement, and the string of characters for the mantissa.
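By way of illustration, a minimal sketch of the byte-unit coding described under the MFPC heading above, for unsigned integers (Python; limited here to eight bytes, whereas the scheme itself continues on the next byte when the first is all ones):

    # The run of set bits at the top of the first byte gives the length of the
    # string; the first reset bit ends it; all remaining bits carry the value.
    def mfpc_encode(n: int) -> bytes:
        assert n >= 0
        length = 1
        while n >= (1 << (7 * length)):       # L bytes carry 7*L value bits
            length += 1
        assert length <= 8                    # sketch limited to 8-byte strings
        prefix = (0xFF << (9 - length)) & 0xFF
        out = bytearray(n.to_bytes(length, "big"))
        out[0] |= prefix                      # (length-1) set bits, then a reset bit
        return bytes(out)

    def mfpc_decode(s: bytes) -> int:
        length = 1
        while s[0] & (0x80 >> (length - 1)):  # count the leading set bits
            length += 1
        n = s[0] & (0xFF >> length)           # value bits of the first byte
        for b in s[1:length]:
            n = (n << 8) | b
        return n

    # Binary monotonic: the encodings sort like the numbers, without decoding.
    assert mfpc_encode(200) < mfpc_encode(1000)
    assert mfpc_decode(mfpc_encode(1000)) == 1000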
One can write any real number using unlimited integer numbers of minimum length for the exponent, while the mantissa can be a number of any length. One way of multiplexing numbers is to take numbers written in base N and store them in base N+1, so that the extra figure never appears inside a number. If one encodes in base 2, one has only zeros and ones; storing in base 3 (N+1 with N=2), one would have 1001..., then a 2, then some other number 1001..., and so on, so that the higher figure serves strictly as a separator. Smaller values of N waste less space, so an N of 2 or 3 is preferred. It turns out that 2 is better for coding speed and 3 for decoding speed, so the actual number used is a toss-up.
Linear Sorting
When one does linear sorting, one basically has to sort a list in some order. The classification can be done at the binary level using a micro-value of the chosen component that one wants to sort. For bytes one has eight bits, or two to the eighth power or 256 values; when one uses sixteen bit words one has 64K values, and so on. When one classifies a list of bytes, along with the bytes one carries an integer pointing to the next byte, and along with the available values one carries an array of starting and ending pointers. Byte pointers are initialized in some way, typically first item, second item, third item, etc. If a byte of value X is read, the starting and ending pointers of X will be updated. If neither is used yet, both will point to the byte; if they already have a value, the ending pointer will point to the byte, while the byte pointer of the previous ending value will be made to point to the new byte. At the end of the scan, the values are stored by following the byte pointers in the order of the collation for that particular rank.
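As a sketch only (Python; dictionaries stand in for the fixed 256-entry arrays), one classification pass might look like this:

    # One classification pass for a given rank: each byte value chains its
    # items through start/end pointers and a "next" array, as in Figure 3.
    def classify_rank(items, rank):
        start, end = {}, {}
        nxt = [None] * len(items)
        for i, item in enumerate(items):
            value = item[rank] if rank < len(item) else -1   # -1 flags end of string
            if value not in start:
                start[value] = end[value] = i
            else:
                nxt[end[value]] = i        # previous end now points to the new item
                end[value] = i
        order = []
        for value in sorted(start):        # read classes back in collation order
            i = start[value]
            while i is not None:
                order.append(i)
                i = nxt[i]
        return order                       # indices of the items, classified

A full linear sort re-applies the same pass, rank by rank, to every class that still holds more than one item.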
Figure 3 shows the situation with an array of starting pointers 301, an array of ending pointers 302 and an array of pointers to the next item 303. In Figure 3, the first 3 ranks or letters XYZ 304 have been sorted and found alike, and now the next letter in the XYZ... series, namely A 305, is being examined. The routine processes the bytes of a rank as a list. The next rank will only be processed when there is more than one item in a rank list. This can be known using any kind of flagging for the end of the input string. The next rank does not need to be R plus 1, but can be any rank in any order that is desired.
date/duration coding based on the MFPC
A date or a duration can be expressed as a number of years, months, days, hours, minutes, seconds and fractions of a second, or any combination of these. Using MFPC on all these values allows one to represent any length in time, no matter how small or big. For dates, all values but years and fractions of a second have limited values and can thus be represented as simple binary integers. In a repository, it is advised to record the full date or duration along with the individual components, so that asking for "the month of May" will not end with a "token research".
Repository - Logical Structure
An optimal solution providing the fastest access to any individual item utilizes "L" shaped strings. An L shaped string is an optional "horizontal string" and a "vertical string" of at least one byte, each byte of the vertical string being associated with a pointer to the next L string. For instance, to store "Albert", "Bogart" and "Bowie", the L strings will be the following (see Figure 4). The first string "AB" 401 is purely vertical; "LBERT" 402 is apparently horizontal. In fact "LBER" 403 is a horizontal string and "T" 404 is a vertical string with its pointer. The next string has "OGW" 405 as a full L shaped string. A "0" 406 is used to indicate that the string points to nothing more (i.e., it is an end of string mark).
textual information
In order to record textual information uniformly, FOCUS uses the Unicode standard codification. This is a 2 byte-encoding scheme that covers all the glyphs used on the planet and is open ended. L strings never duplicate a given character. In order to handle badly accented words or inconsistently spelled or written words, more than one representation can be considered. For example, the number 12 can be written as a sequence of characters "twelve", as a sequence "12" or as a numerical value 12. There is nothing in FOCUS that prevents using the three representations at once. The same thing may be done with a compound word such as weekend. This can be stored as "weekend", "week-end", "week" and "end", all at the same address.
automatic splitting of repository when disk access becomes critical
When disk accesses become a bottleneck, it is usual to replicate the repository, thus inducing heavy synchronization procedures and, more generally, losing any possibility of maintaining a real-time application. Under a FOCUS, and due to the algorithm of the sorting routine being used, it is almost natural to "split" a repository instead of replicating it. It is easy to deduce from the sorting algorithm how to send the beginning of the sort to one file and then switch to another.
The problem here is how to split a repository in the first place. One of the utilities used with a FOCUS is sequential reading of its repository. This program is used to build a new repository on the second disk up to the point where it holds half the original. Then it goes on reading, but writes the second half to yet another file. Then the original is erased. This is preferable to keeping the original as the second half and erasing the data stored in its first half: if other splits were needed, that would result in an unnecessarily empty file. The number and names of repositories are stored in the "basic" repository (the one that starts with the lowest strings) under a special code (the lowest), which keeps the whole management inside the FOCUS.
An advantageous feature of this scheme is that the more repositories one is obliged to have for the number of users' accesses, the shorter the time will be to run a backup procedure. When backing up, the watchdog keeps on working but the feeding process is stopped so that the copy involves a known situation. Then the repositories are copied. The more there are, the smaller they will be and the shorter the backup time.
Repository (Index) Management
incremental/decremental index updating w/o copying the index
Usually, updates to a repository trigger a periodical rebuild of the repository itself. This implies that the space made available must be twice as much as what is needed. The FOCUS routines do their own "garbage collection", which eliminates the need to copy a repository to get rid of its slack.
caching indexes for real-time indexing
Not sorting the input data directly into the final repository allows a faster sort and a faster production of a "cache" repository that can be quickly accessed, thus providing almost real-time access to recorded or received information.
use of monotonic access methods on index files including numbers and dates/durations
The fact that different types of data have a monotonous coding allows the internal engine to have a very limited set of functions. Typically: (1) looking FOR a particular item, (2) looking FROM a particular item, (3) looking for the NEXT item. An important feature of FOCUS is that when an item is not found, the next item available is sent to the requesting program along with a flag telling that the request was unsuccessful. This allows the calling program to decide on the next call to be made (see the substitution process).
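A sketch of these three functions over a sorted in-memory index (Python's bisect standing in for the repository's L string traversal; names are illustrative):

    import bisect

    # FOR / FROM / NEXT over a sorted index; on a miss, the next available
    # item is returned along with a flag telling the request failed.
    def look_for(index, key):
        i = bisect.bisect_left(index, key)
        if i < len(index) and index[i] == key:
            return True, index[i]
        return False, (index[i] if i < len(index) else None)

    def look_from(index, key):
        return index[bisect.bisect_left(index, key):]    # items at or after key

    def look_next(index, key):
        i = bisect.bisect_right(index, key)
        return index[i] if i < len(index) else None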
generalized indexing of databases
This is where references to table/record replace references to file names.
automatic references monitoring for real-time indexing
Intercepting operating system calls on file openings, file closings, writing to files, etc. (depending on the processes used by a given OS) allows one to build a chronological log of events, which in turn allows one to decide which file is to be declared deleted, moved, renamed or updated. Unless some smarter method (OS dependent) can be used on updated files, an update is to be considered as a deletion followed by a creation. Partial updates will generally only stem from a new analysis (semantic, for instance) producing new parameters and canceling others. If the reference concerns a field in a database, it is only necessary to have automatic logging of editing on this database, the logging file containing all necessary parameters. An easy format for these logs is ODBC. More generally, changes and updates are better taken into account at their very generation. Most processors allow some kind of intervention at this level. In the worst case, one has to wait until the change is reflected by one of the processes on which one has an access method (waiting for the modification to be sent to a file, for instance).
non-textual extensions on indexing engines
"Full text" engines are generally unable to record non-textual information. There are two ways of doing this: deciding on one or more "prefixes" and deciding on one or more "suffixes". For instance if a textual information is meant to be prefixed by "0" then numeric information can be prefixed by "1", dates by "2", MIDI by "3", etc. If textual information is a file content, one can decide on suffix "0". If this textual information is a parameter for the filing system, then the postfix " 1 ". If it is a filename, the " 1 " postfix can be appended by a "0", and so on. The choice of putting a pre or post fix is dependent upon the access that will be used for retrieval. It is perfectly possible to record duplicated data, one using a prefix, the other a postfix, both referring to the same item. Note: as the Unicode standards provide for two sets of 32 values that are considered out of the standard (0 to 31 and 128 to 143), neither prefix nor postfix are defined for textual information. Codes for the first pre and post fixes are chosen from one of these 64 values. Should an improbable variety of codes arise, it is always possible to multiplex the value 143 to allow the extension.
FOCUS Sequence
Feeding FOCUS can be done in a variety of ways. For instance, receiving data from a network can trigger the analysis of its content. Updating a database record can trigger the analysis as well. Updating a word processor file can also trigger the analysis. Whichever process is used, all FOCUS wants is to get a list of references to be taken into account in a dedicated file. One can point out immediately that the analysis process can also decide not to store or keep the data, such as for network filtering. But it may also flag the data as being "unacceptable" and store it anyway to allow the network manager to trace unacceptable information.
On textual data, the analysis takes linguistics and semantics into account so that "concepts" can be identified on the fly. As seen above, these "concepts" can be used to route the data flow and/or can be stored along with the other information analyzed. Even when dealing with text, a FOCUS is not a full text engine, as it detects groups of words, word roots, synonyms and "concepts" (Focusers) and stores all these in its repository. This is where "knowledge management" comes into the picture: at the reception of data and NOT at the time of the user's request. Every piece of "knowledge" that can be extracted by state-of-the-art technology is extracted on data input and stored in the FOCUS repository, where it is immediately accessible. For real-time input (network incoming data) the reference is given along with its output. When the input is simply a notification of something that happened, the analysis is performed.
Analysis implies identification of the data format, decoding of it, detecting the language of textual information, running the linguistic procedures and storing the result according to the FOCUS input format. NOTE: general commercial data format identification is outside the scope of this document. Many commercial products are available for this purpose. But extracting embedded structure indications in formats like HTML, SGML and XML is part of FOCUS. At the end of these processes the FOCUS input files are ready to be processed. The problem is that putting this information in a repository involves sorting it. One will see how the sorting process is linear in spite of what mathematicians demonstrated, but whatever routine is used for sorting, RAM based techniques are orders of magnitude faster than disk based routines. So if many input files are available at the same time, FOCUS will concatenate them to feed the sorting routine. If a single file is available, it will be sorted at once to provide immediate access to its data by FOCUS' query programs.
The output of the FOCUS sorting routine already uses the FOCUS repository format, so that users can access recent data very quickly. But multiplying these files on a particular physical disk drive means degrading the system's performance on queries and updates (particularly deletions). So the final step is to "merge" these temporary repositories into a single final one on every physical drive. Although this process is relatively fast, it can be carried out in low-priority background (FOCUS controls its own background priorities). Depending on applications, however, temporary repositories can also be kept instead of merged (if they each carry one day of incoming data, and if they are to be erased after a given number of days, for instance).
References are recorded in the repository and given an ID (or handle) number (an unlimited integer). The cross-reference (from ID number to reference) is also recorded. When a reference is deleted, its cross-reference is marked as being deleted, further deletions of its entries occur, and at the end of this deletion process the cross-reference is marked as being free. These free cross-references are used by the reference allocation routine. Users may want to record other cross-references they are interested in, and they are free to do so: FOCUS is not concerned with meanings associated with data, only users are.
When the word "unlimited" is used in this text, it means that the FOCUS system does NOT force any limitation on the part of the user or the programmer. Application programs will nevertheless agree upon some limitation of their own. For instance FOCUS could store the number pi with one million decimals, but there is very little sense to do so as no one looking for pi is ready to type in the 1,000,000 decimals anyway. Also, while a FOCUS has no limitation of its own, choosing a limit can have some impact on the dimensioning of some buffers or clusters. The fact that the sorting routines can be run with as little as IK of RAM to handle terabytes of information does not mean it is clever to do so. And, anyway, devoting just .IK for the buffering will not allow to sort items larger than IK. A FOCUS is obviously limited by the amount of RAM and disk space available on a given system, but if will cope with any size of RAM and disk space however large they are.
If the input data are larger than what can fit into memory, the current approach here is to process the first rank and store all classes that are guaranteed to fit in RAM for a final processing. This is based on the fact that a class must be smaller than the amount of RAM available, proportioned between what has already been read and what is left to be read, assuming that all subsequent data will have the same proportion of that class. When a class is larger than what is mentioned above, the next rank is used for classification and the process goes on. When everything has been processed, classes are read in the collating order and classification is thoroughly performed. This occurs only once. When a class has been further split, the first subclass is loaded along with the class before the processing is undertaken. All classified data are then flushed onto the disk, data are reorganized in RAM and the next subclass is loaded. And so on.
This process is completely coherent and completely linear. It is furthermore absolutely optimized, as NO unnecessary byte will be considered, and all bytes are only taken into account ONCE. When ends of byte strings are not given by a special value (which is the case in FOCUS), the lengths of the strings must be used. Keeping track of the first and last values used in a rank proves to double the speed of the procedure. Using bit maps of the values used for the collating can further enhance the speed if written in assembly language.
This technique of "un-merging" uses a temporary storage equivalent to the final one: current techniques need extra storage roughly equivalent to the size of the input data. As temporary files can be deleted as soon as they are read into
RAM, writing their content back in the final file will take the same space. In fact, as most operating systems store their files in clusters, the temporary files will use slightly more storage than the final result, but no more than the product of the number of temporary files them by the cluster size. The amount of RAM devoted to the process is not critical. Variations of about 20% only have been observed. Organizing the RAM for the sorting routine as here, is just one of the ways it could be implemented. At one given moment, there might be a maximum of three lists in RAM: the list of items itself, the list of files from previous loads, and the list of files being stored with the current load. Loading a buffer of data is thus done between the end of the first list and the beginning of its data structure (pointers). Loading ceases when the size of the data plus the size of the pointers overflows the available RAM. This third list is made up in the following way. Lets us consider that one is in the first rank and that the byte value is B. Lets us also consider that the size of the B class is OK to fit in a file which already exists (this is known by reading the second list as it is sorted). Then all entries of the B class will be appended to the B file. The B list is skipped. If the B file is created here, the fist entry of the B class will be given a length of 1 byte and will point to the "B+1 " (according to collation) class. When all items have been sent to their respective files, their sequence has been replaced by the second list. These "filesurnames" are then moved to append the first list and their pointers are moved before the first list's pointers. The two lists are then joined and sorted. Filenames are given in ascending order such as "0000", "0001", ..., their
"surnames" are the letters they contain that have already been classified. These names can be appended to the surname taking into account that the actual data is 4 bytes (in the example) longer than the string to be sorted. Reading a data gives a direct access to the corresponding filename. If doublets are to be skipped, it is done as soon as they are detected (the fist occurrence points directly to the next item, bypassing the doublet).
The only issue here comes when doublets in excessive numbers are to be stored. They need a dedicated file with the same surname as another file containing the doublet plus some appended strings. To make the differentiation, one can store the length of the surname as double the length, adding one if the surname leads to a "normal" file. Besides this, doublets may have a special storage format (storing "her blue eyes" one million times in a temporary file is kind of stupid...). Nevertheless, they will have to be expanded for the final output. If RAM space is an issue, storage of the "filesurnames" can be done in a separate file. But again, sorting one thousand tera records with just 1K of RAM is obviously silly and also very unlikely: as far as one can see in the personal computer's history, the ratio of disk space to RAM has always been around 1,000:1.
Repository logical structure (already described above)
physical storage
Swapping data between RAM and disk implies some clustering scheme. By design, FOCUS will accept any cluster size above 4K bytes (a minimum based on an 8 bit data analysis and small repositories) which is convenient for the user and compatible with its longest item (roughly twice as much). For simplicity's sake, in this implementation the cluster size has been arbitrarily restricted to a power of 2, which is generally what operating systems' swapping schemes use anyway. This clustering scheme brings fragmentation of the repository, so that L strings can point either to another L string in the same cluster or to another L string in another cluster, and this can be indicated by a flag bit.
Done just like this, one would end up with a pretty empty repository: when a string ends by asking for another L string allocation that cannot be done in the same cluster, a new cluster will be initialized which will only carry this end of string and which, most frequently, will never be completed by any other. To take an example, storing the word "man" and being obliged to "export" the final "n" may lead us to complete that "n" with words like "mankind", "manhood", "manipulation". But if one were to export the final letter of any of these other words, it is clear that chances are it will not be complemented, as far as the English language is concerned.
In practice, as soon as the repository is a few megabytes long, the slack quickly soars over 50%. So a finer approach is needed: subclusters. To keep things simple, subclusters are a power-of-1/2 fraction of a cluster. The general address of an L string thus becomes: cluster, subcluster, and byte address.
Whenever the length of an L string changes, it might be moved to a subcluster of a different size. In the FOCUS repository, all numbers (lengths, addresses) are coded using the unlimited monotonous (i.e., monotonic) integer numbers of minimum length technique. Space allocation is done using the dynamic FATless allocation (described below).
structure of FOCUS contents
Current FOCUS L strings build the following string of bytes: content, ID of reference, numerical value. The last two values are coded as unlimited monotonous integer numbers of minimum length. The first is the ID or handle associated with the reference used to access the content, and the value is a multiplexed number used for proximity purposes. Cross-reference FOCUS L strings are made of a cross-reference code, the ID, and the reference. Here, the "content" is the whole string.
The composite structure of FOCUS strings implies two subsequent fields. The length of a string is the length of the data, after which the numerical values contain their own length (see unlimited monotonous integer numbers of minimum length). In fact, if the feeding is done in such a way that the IDs are monotonous, and positions within IDs also, it is enough to sort on the content only, provided "duplicates" are kept in their original order.
dynamic FATless allocation for filing systems
FOCUS is an operating system by itself, and as such has its own filing system. This filing system could be used by any operating system with great profit, but it also allows one to think of putting the system directly on chips aboard the disk controllers. This filing system uses no FAT (File Allocation Table), yet it is easy to make ultra safe and it allows all the usual optimizations to be carried out relatively simply. Instead of centralizing cluster allocation in a table, the chains are stored in the clusters themselves.
The basic idea is that an empty file is a chain of clusters. The chain means that each cluster has a pointer to the previous and the next one. If no previous or no next cluster exists, the pointer is set to any arbitrary value that cannot be a valid logical/physical cluster's address (usually 0). This chain - called the "free- chain" - can either be initialized over the whole file if its size is known at the time of creation of the file, or dynamically expanded by one or more clusters when the need arises.
Allocating a cluster is simply taking it from the free-chain to put it in another chain built on the same principle. De-allocating a cluster is giving it back to the free-chain. The first cluster (root) can be used to store the starting and ending points of these chains if there is a limited number of them. If not, this recording itself will use the chaining technique. All of the usual functions of a filing system can be readily implemented. De-fragmentation, for instance, is done by copying the last clusters into the free-chain ones, and truncating the file when the free-chain is a continuous set of clusters at the end of the file.
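A sketch of the free-chain mechanics (an in-memory Python stand-in; a real implementation stores the prev/next pointers inside the clusters on disk):

    # Every cluster carries prev/next pointers; 0 means "no cluster".
    class Cluster:
        def __init__(self, addr):
            self.addr, self.prev, self.next = addr, 0, 0

    class ChainedFile:
        def __init__(self, n_clusters):
            self.clusters = {a: Cluster(a) for a in range(1, n_clusters + 1)}
            for a in range(1, n_clusters):          # initialize the free-chain
                self.clusters[a].next = a + 1
                self.clusters[a + 1].prev = a
            self.free_head = 1

        def allocate(self):
            a = self.free_head
            if a == 0:
                raise MemoryError("free-chain empty: extend the file by new clusters")
            nxt = self.clusters[a].next
            if nxt:
                self.clusters[nxt].prev = 0
            self.free_head = nxt
            self.clusters[a].prev = self.clusters[a].next = 0
            return a                                # ready to join another chain

        def free(self, a):
            # give the cluster back to the head of the free-chain
            self.clusters[a].prev, self.clusters[a].next = 0, self.free_head
            if self.free_head:
                self.clusters[self.free_head].prev = a
            self.free_head = a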
Multiprocessing Data Bases
referencing a data base content
When indexing databases, a structure is implicitly given which can be used as data for indexing purposes. Typically, knowing that a word is in table T, record R, field F, paragraph P will bring much more precise access to this word. This information can be stored along with the word itself using a postfix coding technique (also used for textual structural information). The information retrieved by a query can be used to build a database of coincidences. This capability is greatly enhanced by concept extracting techniques used in textual information as described below.
There are several ways to qualify items recorded in the repository, such as belonging to a database, being in a title, etc. These qualifications can be recorded and further used to build databases out of unstructured data. Merging of structured and unstructured data can be readily achieved in FOCUS. The most interesting use of this is on versions that monitor structure in structured data. Types of coding here are irrelevant: the important choice concerns the order in which these elements are stored. Multiple storage with different orders can be used if multiple access paths are to be used.
The multi-processing technique makes no calls to the operating system and is thus OS independent. Mailboxes between processes are used; these are dedicated file names that can even carry parameters. This not only allows synchronizing in and out of several processes, but also allows sending data to them with a file explorer type program. Although the same control could be obtained by running the command anew with new parameters, this explorer-type control technique allows one to achieve a "control panel" of the process by just looking at the filenames in some directory. For instance, setting a debug level can be controlled by filenames like "debug.005".
When synchronizing processes over some kind of pipeline chain, it is enough that the first process outputs to "1out.tmp", renaming it to "1out.go" when the file is complete. Then the second process, looking for that particular file name, will be triggered. Getting the two processes to run in parallel along with the processing of many files simply implies numbering the synchronizing files, such as "1out001.go". The second process can easily take up these files in increasing order. When processes are not resident, the coordination has to be devoted to a command program, which will fire the different processes according to the existence of their input files and other synchronizing considerations. When processes are resident, the command program can still be used to have a centralized scheduler.
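A sketch of this convention (Python; the exact file names are an assumption, reconstructed from the example above):

    import os, time

    def produce(data: bytes, n: int):
        tmp = f"1out{n:03d}.tmp"
        with open(tmp, "wb") as f:
            f.write(data)
        os.rename(tmp, f"1out{n:03d}.go")   # the rename signals a complete file

    def consume():
        n = 1
        while True:
            name = f"1out{n:03d}.go"
            if os.path.exists(name):
                with open(name, "rb") as f:
                    data = f.read()
                os.remove(name)
                yield data                  # files are taken up in increasing order
                n += 1
            else:
                time.sleep(0.1)             # a resident scheduler could avoid polling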
Application to Textual Information
text per se: optimized Unicode storage
To be able to record textual information uniformly, FOCUS uses the Unicode standard codification. This is a two-byte encoding scheme that covers all the glyphs used on the planet and is open ended. This nevertheless doubles the size of textual information. The way the information is stored in the repository optimizes this space: L strings never duplicate a given character.
While the Unicode standard provides for a unique codification for all glyphs on this planet, no available system deals with the problem of badly accentuated words or inconsistently ligatured characters. One handles all this along with usual tokens to represent unknown characters. The substitution processor (item #23) also takes care of misspelled and broken words.
A "word" can have more than one representation. The number twelve can be written as the sequence of characters "twelve", as the sequence "12" or as a numerical value 12. There is nothing in a FOCUS that prevents using the three representations at once. The same situation holds for compound words such as "weekend". This can be stored as "weekend", "week-end", "week" and "end", all at the same address. For languages like German or Swedish where most words are compound, A FOCUS will store the whole word along with its components, such as in the "weekend" example above. grammatically based addressing systems Locating a word in a text can be done in several ways. Let us mention three of them: a/ The "physical" address, i.e. the number of the byte in the file where the first letter of the word is, b/ the page number, line number and word number on the line where the word is, c/ the book number, chapter number, paragraph number, sentence number and word number. The first gives no information about the fact that two consecutive words may be in the same sentence. The second implies that end-of-lines be recorded in the file or that one has the information about how long a line is and which wrapping technique is used. The third relies on a grammatical analysis. Using the "multiplexing " integer coding technique, all these technologies amount to a single value representing the address. Which means that other locating schemes can be freely designed. Over this, semantic information can be attached to a paragraph or a sentence, all this being done at the time the text is fed to the repository. Compound words can be :.tored along with their components if one wants to be independent of delimiters, mistypings, or able to access text through the components themselves. If grammatical engines are used, numbers and dates can be stored under textual and numerical forms together without difficulty. graphic rendering of arithmetical/logical expressions People dealing with mathematical formulas use to cope with the usual representation of mathematical expressions using pairs of parenthesis. People who have to express Boolean expressions have usually no practice with this form of codification, which makes it hard for them to express complex conditions. By displaying the expression in a two-dimensional surface, one can get rid of the necessity or pairing the parenthesis and give a much better visibility on complex expressions. This display is also relatively intuitive and people get used to it rather quickly.
Items are entered on a series of lines, and if a line is more indented than the previous one, then it opens a parenthesis at the end of the expression formed by the lines above it. Every new line can have its own Boolean operator shown on some buttons (AND, OR, NOT). In the prototype of FOCUS, one also assumes that an implicit Boolean OR links all items on one particular line. NOTE: Boolean logic can be applied in two ways, and the way used by computer scientists is confusing for lay people. When a database user looks for the records of "Brown" AND "Smith", he has to enter the formula "Brown" OR "Smith" (the two worlds are dual to one another). The Boolean operators mentioned above are the operators that common people would use.
co-occurrences
A co-occurrence is defined as a link between two words. This link is simply the fact that those two words appear together in the same paragraph. Studies have shown that simple word co-occurrences do not give interesting results because of their semantic ambiguities. To the contrary, groups of words or expressions are semantically rich enough to be used with a co-occurrence network. The basic function that co-occurrences provide is a list of expressions from a given request. Those expressions are the ones that appear most often with this word, and all the co-occurrences between all those expressions are given for any graphical representation. All co-occurrence calculations are based upon the results of the given request. A request can be a word, an expression or any Boolean combination of words or expressions.
Co-occurrence is an old idea in linguistics, meaning that there may be interesting information stemming from the words or expressions most frequently associated with a given term.
If n is the number of terms, ordinary co-occurrence processes are in n to the power 3. This means that they are impractical on large bases, and even on large RAM memories, as the CPU time is tremendous. Strangely enough, most co-occurrence systems use what is known as "clustering". This technique consists of choosing an arbitrary value (usually around 2K bytes) within which co-occurrences will be computed. This value obviously has no linguistic meaning. A FOCUS can record the paragraph numbers of the terms it analyses. Then the computation of co-occurrences can be done on a much more significant linguistic basis. Furthermore, if a FOCUS stores the cross-references of the paragraphs with their associated words, the whole process can be rendered linear. Computing a co-occurrence within a FOCUS is simply sending a second query on the paragraph numbers given by the first one. Then, sorting the result (a linear operation as far as FOCUS is concerned), one easily gets the most frequent words and expressions associated with the query. Under a FOCUS, co-occurrences are no longer associated with a single term, but with whole queries as well.
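A sketch of this linear computation (the data structures, a word-to-paragraph index and paragraph-to-words cross-references, are hypothetical stand-ins for the repository):

    from collections import Counter

    # First query: paragraph numbers of the query terms. Second query: the
    # words of those paragraphs, read from the stored cross-references.
    def cooccurrences(query_terms, paragraph_index, paragraph_words, top=20):
        paragraphs = set().union(*(paragraph_index.get(t, set())
                                   for t in query_terms))
        counts = Counter()
        for p in paragraphs:
            counts.update(w for w in paragraph_words[p] if w not in query_terms)
        return counts.most_common(top)     # most frequent associated terms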
the "right window"
The key feature of the user's interface to a FOCUS interactive search function is what is called "the right window", as it appears in the upper right corner of the interface of the prototype and the first versions sold. On the "web" interface (the Java applet), this "right window" is partly transferred to the bottom window (glimpses along with file references). The main use of this window is to interact with the user for every character typed.
When the user starts typing a word, the window echoes his typing to show that the FOCUS "understands" what is going on. Conversely, when the window stays empty, it shows that some of the characters typed have no echo in the FOCUS' knowledge.
As long as the user is within a word (i.e. the cursor is just after a letter), the suggested display in the right window is a sampling of words beginning with the letters before the cursor, but with a different next letter for each word displayed. For instance, if the user's cursor is after the letters "BE" and the FOCUS has stored the words BEE, BEEF, BENEATH and BENIGN, we will only show BEE and BENEATH, simply because screen space is limited and that is enough for the user to know that he won't find BEFORE, as no word beginning with BEF is displayed.
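A minimal sketch of this sampling rule over a sorted vocabulary (function and variable names are assumptions); it reproduces the BEE/BENEATH behaviour described above:

```python
# Minimal sketch of the "right window" sampling rule: from a sorted
# vocabulary, show one word per distinct letter following the typed
# prefix, so the user sees which continuations exist.

import bisect

def sample_continuations(vocabulary, prefix):
    """vocabulary: sorted list of indexed words. Returns one sample word
    for each distinct next letter after the prefix."""
    samples = []
    i = bisect.bisect_left(vocabulary, prefix)
    while i < len(vocabulary) and vocabulary[i].startswith(prefix):
        word = vocabulary[i]
        samples.append(word)
        if len(word) == len(prefix):        # the prefix itself is a word
            i += 1
            continue
        next_letter = word[len(prefix)]
        # skip every other word sharing the same next letter
        i = bisect.bisect_right(vocabulary, prefix + next_letter + "\uffff", i)
    return samples

words = ["BEE", "BEEF", "BENEATH", "BENIGN"]
print(sample_continuations(words, "BE"))    # ['BEE', 'BENEATH']
```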
When the user has finished a word, i.e. he has entered a space to move to the next word, we suggest that the FOCUS display all possible orthographies of the word (mixing cases, accents and ligatures, if these options are set) along with the number of documents and of occurrences within these documents. Finally, when the user is done entering a request's text and asks for the answer, for each document reference displayed, the right window will display the "glimpses" of this document.

ocr type ligatures
The "substitution" engine used to handle alphabetical ligatures can also be used for Optical Character Recognition errors. Given an engine that does not make too many mistakes, i.e. to the point where the output is easily readable (our current reference is FineReader V4), common OCR mistakes such as the confusion between "rn" and "m" can be declared in the data for the substitution engine. Searching for ."modern" will then also find misspellings like "modem". This feature which is to be triggered by some menu option or any other suitable means allows one to keep valid output from OCR engines without having to do any manual corrections and yet be able to find pertinent data in them. iava applet + javascript structure
java applet + javascript structure

The current CGI (HTML) mode of operation on the Web has longer response times than those of a FOCUS. It was thus necessary to design a new way to preserve as much interactivity as possible. Using a Java applet to handle the query dialog does this. Every time a key is pressed, and assuming there is no pending query, only a few characters are sent over the network. Advantageously, the FOCUS' answer can be kept very small (2K, for instance), so that once again the load on the network is minimal. In contrast, using a CGI script would require that over 10 times more characters be sent to redraw a whole page. Similarly, when the query is complete and the FOCUS sends the list of documents along with their glimpses, only the data are sent and the formatting is done locally by the JavaScript: again a tenfold improvement over the CGI approach. The key point in all this is to control the dialogue between the applet and the server so that no query is sent as long as another is currently in the pipeline.
Application To Multimedia Information

multimedia

Apart from text, data currently stored on a computer include sounds and images. Two-byte codes that are not valid Unicode values can be used as prefixes for data referring to sounds and images. In fact, apart from textual data, one can also store numbers as numerical values and dates or durations as a number of time units. All these data types can be characterized by a different prefix. This allows one to keep all data within a single repository, guaranteeing the speed of both storing and searching. As one has decided to use the Unicode standard for coding text, one knows that several two-byte codes cannot be valid Unicode values. So the "prefix" for text in the prototypal implementation of FOCUS is simply any potentially valid Unicode word.

automatic text and image correlation
Some products are able to extract image signatures. One can place these signatures in a FOCUS after a linear transformation. If one takes an image database with texts related to the images, one can generate a number of correlations between words, expressions, Focusers and image signatures by a statistical approach. After that, when one extracts signatures from a new image, it is possible to propose words, expressions and Focusers for this new image. That is an approach to automatic image description and text generation. In a text-based document, correlating legends (when they exist) with images is another way to achieve this. If none exist, getting the themes out of the surrounding text provides the same output. What has been said of images can be extended to sound, music, drawings, video, etc.

application to sound

As far as human beings are concerned, sounds can be plotted in a three-dimensional vectorial space: speech, music, noise. Indexing speech as such has to wait for efficient speech-to-text converters, which are not available now. Music already has its "textual" aspect in the form of MIDI files, which carry the "names" of the notes to be played along with the texts of the lyrics. That is discrete data, as opposed to the "continuous" data recorded on a magnetic tape. Searching for MIDI phrases can be achieved in much the same way as text is searched for by its words. When it comes to music in the form of "wave" files, there are "wave-to-MIDI" converters already available, but they are restricted to monophonic music for the time being. There is no indication that polyphonic converters could be available soon, but when they are, one is back to the scheme of storing MIDI information. In wave-to-MIDI conversion, it is also possible to identify the timbre of the instrument. Noise can be an unwanted part of a signal and as such is generally processed by "denoising" devices. But noise can also be a special effect; as such, it is input for analyzing software. If one considers the soundtrack of an opera, all three dimensions are present at the same time, and one is very far from being able to identify all these components. But again, as soon as appropriate analyzing software is available, one is ready to store the elements that can be provided as output from the analysis program(s).

indexing MIDI files

Note values in a MIDI file are coded on 7 bits. As the search is going to be done without reference to the actual pitch of the notes, what is to be recorded is simply the difference between successive notes. So the first recorded value will always be zero, which means one can skip it. Musically, this means that everything is being chromatically transposed, say into C. The notes being coded on 7 bits only, differences between two notes can always be recorded in one single byte.
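A minimal sketch of this interval coding (note numbers are standard MIDI values; the helper name is an assumption):

```python
# Minimal sketch of the transposition-invariant coding described above:
# only intervals between successive MIDI note values (7-bit, 0-127) are
# stored, so a melody matches regardless of the key it was played in.

def to_intervals(notes):
    """notes: MIDI note numbers. Returns successive differences; each
    fits in a single signed byte since notes are 7-bit values."""
    return [b - a for a, b in zip(notes, notes[1:])]

# The same phrase played in C and in D yields the same interval string:
in_c = [60, 62, 64, 60]   # C D E C
in_d = [62, 64, 66, 62]   # D E F# D
assert to_intervals(in_c) == to_intervals(in_d) == [2, 2, -4]
print(to_intervals(in_c))
```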
Now that one knows that coding is possible, the problem is to choose what one codes. There is a need for a software program that tries to identify the melody channel (usually monophonic) and cuts the melody into musical phrases. As most people only recall the beginning of a verse or of a bridge, this slicing program can be relatively rudimentary at first. Searching for a particular MIDI file involves playing a few notes on a keyboard or singing them into a microphone equipped with pitch-to-MIDI conversion. As the information has been transposed, perfect pitch is not needed. People who sing out of tune will be helped by elementary fuzzy logic using the substitution engine.

indexing music through MIDI
Using "wave-to-MIDI" to convert monophonic sound tracks to MIDI data, equipment and programs are commercially available. creating a standard in the music industry
This applies to direct-to-disk recording (D2D), CDs, DVDs, etc. At the production level, once music has been mixed, wave-to-MIDI conversion is no longer practicable. It is nevertheless rather inexpensive to have somebody play the melody on a MIDI instrument along with the music. One can then run this through a FOCUS and provide the production with a repository. Every musical phrase then becomes accessible.

application to images
There is one category of images that readily allows some kind of discrete access: those produced by CAD systems or vector drawing programs such as Corel Draw, AutoCad, DesignCad, etc. In these images, all elements are discrete and their structure is defined, even if not publicly documented. Text in these images is often kept in computerized text form, and all information relative to image components is available. The position of some meaningful point (such as the center, for a circle) can be used as the position parameter.
Several attempts have been made to analyze still images, providing either the capacity to discover a special pattern in a particular graphic field, or the capacity to characterize some of the features of an image, such as the overall color composition, the types of textures in the picture, etc. These programs give binary bit patterns that one can record in the repository. The fastest search results are achieved if these bit patterns have some kind of ordering in them, so that only portions of the repository have to be scanned to answer a query. For video sequences, the process is very much the same, as those sequences are cut into homogeneous parts, each of which is characterized by its first image, all subsequent images being determined as deviations from the first one. Seen like this, a film is merely represented by a limited number of pictures, and one is then brought back to the problem of still images. Some indications referring to movement can also be extracted from video sequences.
Accessing a musical sequence or a video sequence can be done through the industry-standard SMPTE (Society of Motion Picture and Television Engineers) time code. The value of that code at the beginning of the sequence is what one can use as a positioning parameter. Now, all these types of data usually do not occur alone; they are accompanied by textual information. For a movie, readily available textual information is the script and dialogue (subtitles). Musical information can be stored by laying a MIDI track of the melody along with the video sequence (any ordinarily skilled keyboardist can do so at once, so cost is not a big issue here). Running FOCUS on these textual data and using the SMPTE code as an access method, one can allow direct access to any given sequence using those textual and musical parameters.

searching through image signatures
This application presupposes that an external technology is able to extract meaningful signatures from any given image, such as color, composition, texture, etc. Currently, these technologies tend to use parameters and look for a match through convolutions or other CPU-consuming techniques. A FOCUS uses these technologies very differently. In fact, the general idea is to build "graphic Focusers". Out of a collection of images selected because they are "rather blue", it is easy to search for the parameters that are rather constant across "blue" images and to build a Focuser characterizing a blue image by the average value (or a bracket) of the corresponding parameters; the same operation detects any kind of texture, composition, etc. When analyzing a new image, its set of parameters is compared to the values of every "graphic Focuser", much as is done for text, and these values are recorded along with the image references and their original parameters (in case they can be used). Access is then granted through a user interface, much like a graphic editor, to a set of pictures. It is still pertinent and still immediate.
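A minimal sketch of building such a "graphic Focuser" from hand-selected images; the signature format, threshold and names are assumptions for illustration:

```python
# Minimal sketch of a "graphic Focuser": from images hand-picked as,
# e.g., "rather blue", keep the signature parameters that stay nearly
# constant, recorded as a bracket around their average value.

from statistics import mean, pstdev

def build_graphic_focuser(signatures, max_spread=0.1):
    """signatures: list of dicts {parameter_name: value in [0, 1]}.
    Returns {parameter: (low, high)} for the parameters that are
    roughly constant over the selected images."""
    focuser = {}
    for param in signatures[0]:
        values = [s[param] for s in signatures]
        d = pstdev(values)
        if d <= max_spread:                       # rather constant
            m = mean(values)
            focuser[param] = (m - 2 * d, m + 2 * d)   # bracket of values
    return focuser

def matches(focuser, signature):
    """A new image matches if every bracketed parameter falls inside."""
    return all(low <= signature[p] <= high for p, (low, high) in focuser.items())

blue_images = [{"hue": 0.62, "saturation": 0.70}, {"hue": 0.60, "saturation": 0.40}]
blue = build_graphic_focuser(blue_images)
print(matches(blue, {"hue": 0.61, "saturation": 0.9}))   # True
```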
Using The Repository Structure For Dynamically Managing Structured Data

domains
A domain is simply the subset that is the response to a query. The same query will obviously produce different domains as the database evolves and changes. So a domain will be defined simply by the query itself, applied from then on to everything analyzed by the FOCUS. A domain definition can be done "dynamically" by a Focuser and then stored to be used as a reference. Domains are needed to describe the field of action of such things as Focusers and synonyms (thesauri). Their primary use is to play the role of "categories" or "taxonomies". Given a term, FOCUS knows which domain definitions this term is used in, allowing it to present these domains to the user so that he can restrict his query to one of them. For instance, typing "cell" may produce a choice between "spreadsheet" and "biology". The notion of domain is wider than the notion of Focuser, as a domain can be defined by one or more Focusers but also by many more query items (such as date brackets, directories, URLs, etc.).

cross-references
This again (like any other non-textual element) is given a special code in a FOCUS. It is extensively used every time an indirect access is to be done. For instance, a given word is meant to occur in the document "mydoc". This document name can vary (through a "rename" or "move" operation). It is thus very unwise to code it as "mydoc" along with the word. Instead, "mydoc" is given an ID (usually a number) which "points" to the actual name ("mydoc" for the time being). If the document is moved or renamed, the ID need not change; only the cross-reference to that ID is updated. Cross-references can be used for document names, sizes, languages, etc., for field names in databases, and for every item that is generic to a set of data. It is especially mandatory to use them if this item can change. For fixed data, not using cross-references only means wasting space, but there may be applications that need that. Cross-references are also used in our implementation of co-occurrences to store which words are in which paragraph.
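A minimal sketch of this indirection (class and method names are assumptions):

```python
# Minimal sketch of the cross-reference indirection described above:
# the repository stores a stable numeric ID next to each word; only the
# ID -> name table changes when a document is renamed or moved.

class CrossReferences:
    def __init__(self):
        self._next_id = 0
        self._names = {}              # ID -> current document name

    def register(self, name):
        self._next_id += 1
        self._names[self._next_id] = name
        return self._next_id          # store this ID in the repository

    def rename(self, doc_id, new_name):
        self._names[doc_id] = new_name    # repository entries untouched

    def resolve(self, doc_id):
        return self._names[doc_id]

refs = CrossReferences()
doc = refs.register("mydoc")          # word entries point at this ID
refs.rename(doc, "mydoc-2000")        # a "rename"/"move" touches one row only
print(refs.resolve(doc))              # mydoc-2000
```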
Application To Information Represented By A List Of Networks

chemistry and other networks

A specialized type of data has been studied for storage in a FOCUS: images of chemical molecules. In fact, what is to be handled is networks with different nodes and different links.
The immediate answer (which would nevertheless handle the 24 million molecules patented today) is to describe all possible representations of the network (i.e. starting with every node) in a hierarchical way, which provides only one possible representation for each starting node. If the nodes and links are represented by a numerical code, uniqueness can be granted simply by ordering the different ways of navigating through the network in a numerical sorting. Loops are simply special links (we suggest giving them the lowest values after the "end of string" value) which jump to a node described by the number of links backward to a common node, and the number of links forward (in the description) to the looping node. Special groups of nodes and links can easily be represented as particular node values. Recorded networks are entered as described. Querying networks simply consists of choosing a non-ambiguous node to start with (the farthest from any ambiguous node is preferred), describing the network under all its possible forms (according to the groups of nodes it may contain) and running the corresponding request against the FOCUS. Because of the fuzzy definition of certain groups of nodes, not all the answers returned by the FOCUS may be answers to the user's query; it is thus necessary to select the right ones.
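A minimal sketch of the unique-representation idea for small networks; the numerical codes and names are assumptions, and loop links and node groups are omitted for brevity:

```python
# Minimal sketch of the unique-representation scheme: enumerate one
# hierarchical description per starting node, encode it as a sequence
# of numerical node/link codes, and keep the smallest sequence as the
# canonical key used for storing and querying the network.

def describe(nodes, edges, start):
    """nodes: {id: node_code}; edges: {id: {neighbor_id: link_code}}.
    Returns one numerical description, walking neighbors code-first."""
    seen, out = {start}, [nodes[start]]

    def walk(at):
        for nbr in sorted(edges[at], key=lambda n: (edges[at][n], nodes[n])):
            if nbr not in seen:
                seen.add(nbr)
                out.extend([edges[at][nbr], nodes[nbr]])
                walk(nbr)

    walk(start)
    return tuple(out)

def canonical(nodes, edges):
    """Smallest description over all starting nodes."""
    return min(describe(nodes, edges, s) for s in nodes)

# Hypothetical 3-node molecule: codes 6 = C, 8 = O; link 1 = single bond.
nodes = {0: 6, 1: 6, 2: 8}
edges = {0: {1: 1}, 1: {0: 1, 2: 1}, 2: {1: 1}}
print(canonical(nodes, edges))   # same tuple whichever node is listed first
```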
The 24 million molecules already registered would involve a 90 GB repository, and access is, as usual, pertinent and not more than one hundredth of a second away.
Embedding Focus In Existing Applications

As already said, a FOCUS allows one to revisit all computer applications as they exist today. It is not simply a new function that can be called from an application as an external routine; it can really be EMBEDDED in - be part of - all existing programs. In a file explorer, for instance, FOCUS can be used to replace the existing limited access to contents. But also, if the user can type a word that refers to a filename, FOCUS will give immediate access to all filenames containing that word. No more navigation through unending directory tree structures. Assuming a number of people are connected to a network, agreeing on some - even limited - conventions on directory names and/or architecture will enable them to focus at once on each other's directories.
In a word processing, spreadsheet or database application, being able to run the Focusing process on the current paragraph or text field will allow the user to get at once all related paragraphs in all the data he has access to. Clicking on any word, expression or cell can be enough to call in all related data as well. Calling the FOCUS' retrieval routine can even be used not only to display the information found, but also to feed existing applications with the result.

the "targets" system
Instead of using clipboard copy or drag-and-drop techniques, which always make it necessary to go from one "window" to another, a "Targets" system can be used by the viewing routines associated with the query interface. Targets are identified by a single letter and consist either of files or of applications to which the selected items will be sent just by typing the associated letter. Contextual menus and/or using the Ctrl or Alt keys along with pressing the letter key allow the user to manage the "target": changing the name of the file, the target application, the various options associated with the sending of the data, etc. Basically, if a target is a filename, all information sent to that file will simply be appended. This not only saves a lot of time compared to the current copy/drag-and-drop techniques, but it allows the user to dispatch the results of his queries to predefined batches (think of a journalist building up his references for a number of different articles at once). A host of batch applications is also available as the output of a FOCUS query: sending all paragraphs found to a particular processing unit. Such a system is already what FOCUS uses for automatic Focusation. This capability can be associated with the "target" technique: sending all "hits" of the current file to the target file or application, or even all hits of the query to a target.

RDBMS links
The basic access to data on ordinary computers is done through "databases", the most elaborate being "relational", hence the abbreviation RDBMS for Relational DataBase Management Systems. This approach, once thought of as an easier way to access data, now proves to be very painful, heavy and slow in its handling of data. For a FOCUS, structure is simply another type of content, like tree structures for directories. How a FOCUS is fed with DBMS data is demonstrated as follows. Optionally, but preferably, some "administrator" may choose which fields are to be recorded in the FOCUS. Then an initial analysis is done, which can consist simply of dumping the database in an ASCII, Unicode or SGML format, which is in turn analyzed by the FOCUS (as is done with usual text documents, for instance). On small databases (a few hundred megabytes), and if real-time updating is not a concern, this procedure can be repeated periodically. But if the base is several terabytes large or if real-time response is wanted, an update procedure is necessary. In every case, an "alias" file will redirect the references stored by the FOCUS to the original record/document, just as is done for a portal or an access to a network document. We will assume that the intermediate format is XML (but HTML, SGML and all structured text formats are equivalent for this purpose). Selecting data: from the base structure, the administrator will select the fields to be analyzed, indicating whether the field name is to be kept or changed into an XML tag which FOCUS will associate with its content.
Initial analysis: as noted above, it is a simple dump of the selected fields of the base. Update: the first solution is to use a "phantom table" with the same structure as the original one, this phantom table being periodically dumped as for the initial analysis, with the same parameters. The FOCUS will handle the historicity of the records by deleting previous occurrences, as it does with all other documents. The second solution is more complex but also more efficient. It consists of reading the internal logging format of the DBMS to achieve the same result. It must be noted that the first solution implies a slight modification of all applications using the DBMS, but no knowledge of its internal coding, while the second does not imply any interference with the applications, but requires knowledge of its internal structure.
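A minimal sketch of the initial-analysis dump described above; the schema, field list and the FOCUS entry point are hypothetical:

```python
# Minimal sketch of the initial analysis step: selected fields of each
# record are dumped as XML fragments which the FOCUS then analyzes like
# any other document, with the record key serving as the alias back to
# the original row.

import sqlite3
from xml.sax.saxutils import escape

SELECTED_FIELDS = ["name", "city"]      # chosen by the administrator

def dump_table(db_path, table, key="id"):
    """Yield one XML fragment per record, tagged with the field names.
    Table and field names come from the trusted administrator selection."""
    con = sqlite3.connect(db_path)
    cols = ", ".join([key] + SELECTED_FIELDS)
    for row in con.execute(f"SELECT {cols} FROM {table}"):
        rec_id, values = row[0], row[1:]
        fields = "".join(
            f"<{f}>{escape(str(v))}</{f}>"
            for f, v in zip(SELECTED_FIELDS, values))
        # the record key becomes the alias used to reach the original row
        yield f'<record table="{table}" id="{rec_id}">{fields}</record>'

# for fragment in dump_table("crm.db", "customers"):
#     feed_to_focus(fragment)           # hypothetical FOCUS entry point
```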
While the invention herein disclosed has been described by means of specific embodiments and applications thereof, numerous modifications and variations could be made thereto by those skilled in the art without departing from the scope of the invention set forth in the claims.

Claims

It is claimed that:
1. A method for a content-accessible system of operation for a computer comprising the steps of: (a) defining;
(b) classifying; and
(c) indexing content.
2. The method as in claim 1 further comprising the step of: designating all real numbers in a scheme; wherein said numbers can be arranged easily in a monotonic fashion.
3. The method as in claim 1 further comprising the step of: sorting through the content, according to a sorting method that is linear and fast.
4. A method for a content-accessible system of operation for a computer comprising the step of: designating all real numbers according to a scheme; wherein said numbers can be arranged easily in a monotonic fashion.
5. The method as in claim 4 further comprising the step of sorting through the content, according to the associated monotonic real numbers; wherein said sorting is linear and fast.
6. A method for a content-accessible system of operation for a computer comprising the step of: sorting through the content, according to a sorting method that is linear and fast.
7. A method for a content-accessible system of operation for a computer comprising the steps of : (a) designating all real numbers by a scheme; wherein said numbers can be arranged easily in a monotonic fashion;
(b) sorting through the content, according to the associated monotonic real numbers; wherein said sorting is linear and fast.
8. A method for a content-accessible system of operation for a computer comprising the steps of :
(a) defining; classifying; and indexing content;
(b) sorting through the content, according to a sorting method that is linear and fast.
9. A method for a content-accessible system of operation for a computer comprising the steps of:
(a) defining; classifying; and indexing content;
(b) designating all real numbers; wherein said numbers can be arranged easily in a monotonic fashion.
10. A method for a content-accessible system of operation for a computer comprising the steps of : (a) defining; classifying; and indexing content;
(b) designating all real numbers; wherein said numbers can be arranged easily in a monotonic fashion;
(c) sorting through the content, according to the associated monotonic real numbers; wherein said sorting is linear and fast.
11. The method as in claim 10 further comprising the steps of:
(a) utilizing a Focuser; further comprising the steps of:
(b) utilizing an L-shaped string further comprising: an optional horizontal string; and a vertical string of at least one byte;
(c) pointing to the next L string with at least a byte of associated vertical string.
12. The method as in claim 11 further comprising the step of: utilizing defragmentation to carry out software garbage collection efficiently.
13. The method as in claim 11 further comprising the steps of:
(a) utilizing subclusters; (b) defining subclusters as a power of 1/2 fraction of a cluster;
(c) addressing, in general, an L string with the successive: cluster, subcluster and byte address.
14. The method as in claim 11 further comprising the steps of:
(a) coding, in the Focus repository, all numbers including lengths and addresses, as monotonic integer numbers;
(b) coding said integers according to said monotonic real numbers.
15. The method as in claim 14 further comprising the steps of:
(a) building a string of bytes for a Focuser L-string;
(b) constructing said string with: content, ID of reference, numerical value.
16. The method as in claim 15 further comprising the steps of: (a) utilizing two numerical fields, data and the length of the data plus length of data field;
(b) coding said numerical fields with monotonic integer numbers;
(c) utilizing monotonic ID's and monotonic positions within IDs; sorting on the content.
17. The method as in claim 16 further comprising the steps of:
(a) selecting an empty file as a chain of clusters;
(b) utilizing two pointers in each cluster; (c) pointing one pointer to a previous cluster;
(d) pointing a second pointer to a next cluster;
(e) setting a pointer to an arbitrary, non-valid address value if no next or previous cluster exists.
18. The method as in claim 17 further comprising the steps of:
(a) initializing an empty chain of clusters as a free chain;
(b) initializing said free chain over the whole file if its size is known at the time of creation of the file;
(c) expanding a free chain dynamically by one or more clusters as needed;
(d) allocating from the free-chain to another chain;
(e) de-allocating a cluster by giving it back to the free-chain;
(f) storing starting and ending points of each chain in a first cluster; wherein said first cluster is a root;
(g) de-fragmenting by copying last clusters into free-chain ones; and truncating the file when the free-chain is a continuous set of clusters at the end of said file.
19. The method as in claim 10 further comprising the steps of:
(a) utilizing a multi-processing technique which makes no calls to an operating system; wherein said multiprocessing technique is operating system independent;
(b) utilizing mailboxes between processors; (c) dedicating file names for said mailboxes that carry parameters.
20. The method as in claim 19 further comprising the steps of:
(a) synchronizing in and out of a plurality of processors;
(b) sending data to a plurality of processors with a file explorer type program;
(c) achieving a control panel of a processor by looking at filenames in a directory with a file explorer type program.
21. The method as in claim 20 further comprising the steps of:
(a) synchronizing processes over a pipeline chain;
(b) outputting first process outputs to "1out.tmp";
(c) renaming it to "1out.go" when the file is complete; (d) looking for that particular file name;
(e) triggering the second process when encountering the file name;
(f) synchronizing files to be numbered such as "1out001.go";
(g) getting the two processes in parallel along with the processing of many files.
22. The method as in claim 21 further comprising the steps of:
(a) devoting coordination to a command program when processes are not resident;
(b) firing the different processes according to the existence of their input files; (c) utilizing the command program as a centralized scheduler when processes are resident.
23. The method as in claim 22 further comprising the steps of: (a) indexing databases; (b) utilizing a structure implicitly given for indexing purposes;
(c) storing data structure information for a word;
(d) utilizing "word is in table T, record R, field F, paragraph P" for said information;
(e) storing said information along with the word itself using a postfix coding procedure.
24. The method as in claim 10 further comprising the steps of: (a) specifying a name for a Focuser; (b) associating a list of words and expressions that represents the Focuser;
(c) specifying the language used by the Focuser;
(d) specifying several parameters of the Focuser, further comprising: (1) indicating a number of words or expressions, fewer of which found in a paragraph indicates that the paragraph is not relevant for said Focuser;
(2) examining a number of words or expressions in a paragraph below which the paragraph might be relevant and above which the paragraph is really relevant for the Focuser;
(3) thresholding for pertinence; below this number, the word does not belong to the Focuser.
25. The method as in claim 24 further comprising the steps of: (a) building a Focuser from a text by utilizing the known general frequency of words in a given language;
(b) defining a pertinence number as the ratio: frequency of the word in the text divided by general frequency of the word in the language as a whole;
(c) calculating a "pertinence" number; (d) representing the uniqueness of the word found in the paragraph;
(e) ordering all the words from the source text by decreasing pertinence values;
(f) denoting words having the biggest pertinence value as the most pertinent for a paragraph.
26. The method as in claim 24 further comprising the step of: building automatically, a Focuser, by utilizing pertinence numbers of words in a text.
27. The method as in claim 24 further comprising the steps of:
(a) enhancing, manually, the Focuser;
(b) creating specific information; wherein said information includes: (1) specific expressions that are very relevant; (2) synonyms (which can be automatically proposed by the system); (3) words or expressions which should be excluded from the Focuser (forced zero pertinence value); (4) words or expressions which discriminate against the Focuser, i.e. they are automatically excluded from the Focuser (negative pertinence value); (5) a word that is accepted, excluding all expressions containing the word.
28. The method as in claim 27 further comprising the steps of: (a) recognizing a parametric number of words or expressions pertaining to a
Focuser, in the same paragraph; (b) detecting, thereby, said Focuser.
29. The method as in claim 28 further comprising the step of: setting said parametric number of words or expressions pertaining to a Focuser as three (3).
30. The method as in claim 28 further comprising the steps of:
(a) defining routing through a profile; wherein a profile is a set of positive and negative Focusers;
(b) defining filtering through a profile; wherein a profile is a set of positive and negative Focusers;
(c) rejecting text if a negative Focuser has been recognized; accepting text if at least one positive Focuser is detected.
31. The method as in claim 25 further comprising the steps of:
(a) building automatically a concept associated with any word in a text repository;
(b) following an algorithm.
32. The method as in claim 31 further comprising the steps of:
(a) collecting all paragraphs in a text containing a word;
(b) extracting a concept from the concatenation of the paragraphs.
33. The method as in claim 32 further comprising the step of: limiting said extraction to expressions that contain at least two meaningful words.
34. The method as in claim 33 further comprising the steps of: (a) identifying a document before opening it; further comprising:
(b) utilizing one or more Focusers;
(c) applying one or more Focusers to a document in a repository of said one or more Focusers; (d) producing a glimpse of said document; wherein said glimpse is an abstract of said document produced by said one or more Focusers.
35. The method as in claim 34 further comprising the steps of: (a) analyzing and creating in real-time on one or more Focus repositories; (b) compiling and creating a dynamic Focuser; wherein said dynamic Focuser is produced in real-time.
36. The method as in claim 35 further comprising the step of: splitting a repository; wherein the overflowed part can be subsequently split without leaving unnecessary empty disk space.
PCT/IB2000/001697 1999-10-26 2000-10-26 Access by content based computer system WO2001033419A2 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2001535842A JP2004500628A (en) 1999-10-26 2000-10-26 Reverse computer system based on content access instead of address access and optimal implementation thereof
AU11710/01A AU1171001A (en) 1999-10-26 2000-10-26 A reversed computer system based on access by content instead of access by address and its fully optimized implementation
EP00973169A EP1252585A2 (en) 1999-10-26 2000-10-26 A reversed computer system based on access by content instead of access by address and its fully optimized implementation

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US16157999P 1999-10-26 1999-10-26
US60/161,579 1999-10-26

Publications (3)

Publication Number Publication Date
WO2001033419A2 true WO2001033419A2 (en) 2001-05-10
WO2001033419A9 WO2001033419A9 (en) 2002-08-29
WO2001033419A3 WO2001033419A3 (en) 2003-05-15

Family

ID=22581785

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2000/001697 WO2001033419A2 (en) 1999-10-26 2000-10-26 Access by content based computer system

Country Status (4)

Country Link
EP (1) EP1252585A2 (en)
JP (1) JP2004500628A (en)
AU (1) AU1171001A (en)
WO (1) WO2001033419A2 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1996023265A1 (en) * 1995-01-23 1996-08-01 British Telecommunications Public Limited Company Methods and/or systems for accessing information
WO1998009229A1 (en) * 1996-08-30 1998-03-05 Telexis Corporation Real time structured summary search engine
WO1998041934A1 (en) * 1997-03-17 1998-09-24 British Telecommunications Public Limited Company Re-usable database system
WO1999021108A1 (en) * 1997-10-21 1999-04-29 British Telecommunications Public Limited Company Information management system
WO1999048026A1 (en) * 1998-03-16 1999-09-23 Siemens Aktiengesellschaft System and method for searching on inter-networked computers with stocks of information using software agents

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GIGER H P: "CONCEPT BASED RETRIEVAL IN CLASSICAL IR SYSTEMS" PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL. (SIGIR). GRENOBLE, JUNE 13 - 15, 1988, NEW YORK, ACM, US, vol. CONF. 11, 13 June 1988 (1988-06-13), pages 275-289, XP000295044 *
LAWRENCE S ET AL: "Inquirus, the NECI meta search engine" COMPUTER NETWORKS AND ISDN SYSTEMS, NORTH HOLLAND PUBLISHING. AMSTERDAM, NL, vol. 30, no. 1-7, 1 April 1998 (1998-04-01), pages 95-105, XP004121436 ISSN: 0169-7552 *

Also Published As

Publication number Publication date
JP2004500628A (en) 2004-01-08
WO2001033419A3 (en) 2003-05-15
EP1252585A2 (en) 2002-10-30
AU1171001A (en) 2001-05-14
WO2001033419A9 (en) 2002-08-29


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
ENP Entry into the national phase in:

Ref country code: JP

Ref document number: 2001 535842

Kind code of ref document: A

Format of ref document f/p: F

WWE Wipo information: entry into national phase

Ref document number: 2000973169

Country of ref document: EP

AK Designated states

Kind code of ref document: C2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: C2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

COP Corrected version of pamphlet

Free format text: PAGES 1/4-4/4, DRAWINGS, REPLACED BY NEW PAGES 1/3-3/3; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

WWP Wipo information: published in national office

Ref document number: 2000973169

Country of ref document: EP

WWW Wipo information: withdrawn in national office

Ref document number: 2000973169

Country of ref document: EP