WO2001098941A1 - Method of sentence analysis - Google Patents

Method of sentence analysis Download PDF

Info

Publication number
WO2001098941A1
Authority
WO
WIPO (PCT)
Prior art keywords
buffer
relational
syntactic
sentence
retrieval
Prior art date
Application number
PCT/AU2001/000731
Other languages
French (fr)
Inventor
Simon Dennis
Original Assignee
The University Of Queensland
Priority date
Filing date
Publication date
Application filed by The University Of Queensland filed Critical The University Of Queensland
Priority to AU2001265700A priority Critical patent/AU2001265700A1/en
Publication of WO2001098941A1 publication Critical patent/WO2001098941A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Abstract

To analyse a sentence, the sentence is first segmented into a sequence of tokens and each token is placed in a processing buffer. The sentence is then subjected to lexical retrieval whereby each token is replaced by a vector of paradigmatic associations for each word from a lexical long term memory. The segmented sentence is also copied into a syntactic buffer and into a relational buffer. A syntactic probe is constructed and used to retrieve syntactic traces from a syntactic long term memory and to iteratively update the syntactic probe. Similarly, a relational probe is constructed and used to retrieve relational traces from a relational long term memory.

Description

METHOD OF SENTENCE ANALYSIS
The invention relates to the general field of computer-based information retrieval. In particular it relates to the extraction of information from databases of textual material. Specifically, the invention provides a method of sentence analysis to extract relational information, and of accessing the relational information to answer questions.
BACKGROUND TO THE INVENTION
It has been hypothesized that humans learn language by acquiring the syntagmatic (within sentence) and paradigmatic (between sentence) associations among words. For instance, the fact that the word "green" is often followed by the word "grass" will lead to the formation of a syntagmatic association between "green" and "grass". Similarly, the fact that the word "green" often fills the same slot in a sentence as the word "red" (e.g. the balloon was green, the balloon was red) will lead to the formation of a paradigmatic association between "green" and "red".
Sentence interpretation may then be construed as a memory retrieval problem involving three critical memory systems: syntactic, relational and lexical. A syntactic trace is the set of syntagmatic associations within a sentence. A relational trace is the set of paradigmatic associations activated by a sentence. A lexical trace is the set of paradigmatic associates of a word across sentences. The sentence to be interpreted can be used as a probe to a long-term memory containing sentences to which the system has been exposed, and the retrieved information can form constraints on working-memory resolution.
The representation and extraction of information from structured databases is a well-understood problem with well-developed and highly successful solutions. However, as general information retrieval mechanisms, structured databases have a number of inherent limitations. The application of database techniques both for constructing information stores and for querying these stores requires substantial expertise, limiting the accessibility of these techniques to expert users. In addition, database methodologies typically require the information technology professional to create a predetermined organization for the information domain of interest. Altering this organization once the store has been constructed can be difficult. In addition, the organization chosen when establishing the store can place limitations on the ways in which that information can be accessed, leading to inflexibility as retrieval needs evolve.
The majority of electronically stored information, particularly that available on line, comes in the form of open text. Open text requires only language skills to construct, which allows for more general participation in the development of information bases. However, open text is difficult to query. Current information retrieval methods typically require the user to supply a set of keywords that are used to specify the general information domain. Documents related to that domain are returned and it is up to the user to trawl through these documents looking for the specific information they require. This can be a time-consuming and error-fraught process. This approach, however, is used almost universally by common search engines to extract information from databases accessible by the World Wide Web, and much of the known prior art is primarily suitable for supporting this style of retrieval operation.
For instance, reference may be had to United States patent number 5056021 in the name of Ausborn. The patent describes a method of abstracting concepts from natural language using a word-by-word categorization process. Many of these retrieval systems utilize co-occurrence information (syntagmatic information) to build semantic representations of words and documents. For instance, US patent number 4839853 assigned to Bell Communications Research Inc. describes a system for extracting latent semantic structure from corpora to support the representation and retrieval of textual objects. Similarly, United States patent number 5406480 and US patent number 5099425, both assigned to Matsushita Electric Industrial Co., Ltd., describe methods for extracting semantic representations from co-occurrence information.
A solution to the difficulties outlined above is to allow users to supply their queries as natural language questions. The retrieval mechanism must then take each question and return just that information required to answer it. Current approaches to question-answer systems fall into one of two categories. Either they use symbolic techniques to parse each sentence rigorously, extracting the content in accordance with syntactic and semantic grammars, and are consequently easily derailed by language examples that were not anticipated by the designers. Or they act like more general information retrieval engines, using a set of heuristics to isolate in the document base a specific piece of text (word, phrase or sentence) that might contain the content of interest. In these systems, there is no attempt to code the deep structure of each sentence. United States patent number 5331554, assigned to Ricoh Corporation, describes the application of such a method to the question-answering task using a tree-based pattern recognition approach.
Thus the known prior art either relies on a designer generated abstract grammar that makes it susceptible to unexpected input, or is limited to the extraction of simple information that is evident in the surface form of the text, thus missing relational structure that can be expressed in many surface forms. A more complete method is required that can construct lexical, syntactic and relational memories, which may then be used for extracting answers to natural language questions.
DISCLOSURE OF THE INVENTION
In one form, although it need not be the only or indeed the broadest form, the invention resides in a method of sentence analysis including the steps of: segmenting the sentence to be analysed; initialising a number of buffers for storing word vectors; lexical retrieval; syntactic retrieval and resolution; and relational retrieval and resolution.
The step of segmenting the sentence involves segmenting the sentence into a sequence of individual words. Suitably the sentence is segmented into a sequence of tokens which include root words, prefixes or suffixes, and punctuation. Suitably the step of initializing buffers includes initializing an input buffer and a processing buffer with the sequence of individual words or tokens representing the sentence. Initially each word is stored as the representation of itself. A syntactic buffer and a relational buffer are also established.
The step of lexical retrieval includes retrieving a vector of paradigmatic associations for each word from a lexical memory storage means and replacing each word in the processing buffer with the corresponding vector. Suitably, syntactic retrieval includes the steps of constructing a syntactic probe in the syntactic buffer from the contents of the processing buffer and using the probe to retrieve syntactic traces from a syntactic memory means, to update the syntactic probe. The updated probe is a weighted average of the retrieved syntactic traces. Syntactic resolution includes the step of modifying the contents of the processing buffer to reflect the constraints in the syntactic buffer. An optimization technique is suitably applied.
Suitable optimization techniques include gradient descent, conjugate gradient, simplex, and alignment. Suitably, relational retrieval includes the steps of constructing a relational probe in the relational buffer from the contents of the processing buffer and the input buffer. The probe is used to retrieve relational traces from a relational memory means to update the relational probe. The updated probe is a weighted average of the retrieved relational traces. Relational resolution includes the step of modifying the contents of the processing buffer to reflect the constraints in the relational buffer. An optimization technique is suitably applied.
Suitable optimization techniques include gradient descent, conjugate gradient and simplex. In a further form, the invention resides in a method of retrieving the answer to a natural language question, the method including the steps of: a) segmenting the question to be answered; b) initializing an input buffer and a processing buffer with the segmented question; c) performing lexical retrieval; d) performing syntactic retrieval and resolution; and e) performing relational retrieval and resolution; said steps c) to e) being performed one or more times until the processing buffer contains the answer to the question.
It will be appreciated that the steps c) to e) may be performed in any order and as many times as necessary to analyze the sentence, answer the question, or retrieve the desired text.
In a yet further form the invention resides in a method of establishing an ensemble of facts from a text document including the steps of: a) organising the text document into a collection of sentences; b) segmenting each sentence; c) initialising a number of buffers for storing word vectors, including an input buffer, a processing buffer, a syntactic buffer and a relational buffer; d) lexical retrieval; e) syntactic retrieval and resolution; f) relational retrieval and resolution; g) performing steps b) to f) until the relational buffer contains a fact in the form of a matrix; h) repeating step g) for all sentences in the document to compile the ensemble of facts. The method may further include the step of performing steps a) to h) for a first document to produce a first ensemble of facts, performing steps a) to h) for a second document to produce a second ensemble of facts, and comparing the first and second ensembles of facts to establish a similarity measure between the first and second documents. The ensembles of facts may be compared by determining a dot product for each fact of the first document with each fact of the second document, and calculating an average of the maximum dot products. Alternatively, a cosine of the angle between each fact of the first document and each fact of the second document may be calculated, and an average of the maximum cosines determined.
BRIEF DETAILS OF THE DRAWINGS
To assist in understanding the invention, preferred embodiments will now be described with reference to the following figures in which:
FIG 1 shows the architecture of a sentence analysis engine; and
FIG 2 shows a flowchart of a document comparison process.
DETAILED DESCRIPTION OF THE DRAWINGS
In the drawings, like reference numerals refer to like parts. In FIG 1 is shown the architecture of a sentence analysis engine 1 for working the method. The engine consists of syntactic long-term memory system 2 coding syntactic traces (between slot, per sentence), relational long-term memory system 3 coding relational traces (within slot, per sentence) and lexical long-term memory system 4 coding lexical traces (within slot, across sentences).
An input sequence buffer 5 contains the sentence to be processed and a processing sequence buffer 6 contains the current interpretation of each of the words in the input buffer. Each slot in the processing buffer is a vector each component of which represents a possible token (word, punctuation symbol or affix) and will contain the set of possible paradigmatic associates of the input word.
The results of retrieval from the syntactic long term memory are stored in a syntactic buffer 7. A similar relational buffer 8 stores the result of retrieval from the relational long term memory. Sentence analysis follows the same basic process for building a knowledge base of sentence structures or for using the knowledge base to answer questions. Processing of a sentence involves the following steps:
1. Text Preprocessing: Firstly, the text representing the sentence or question must be segmented. Words and punctuation marks are separated into tokens. Performance of the model can be improved by also segmenting the affixes from each root word (e.g. ing, ly, un). Note that unlike other information retrieval techniques the affixes are retained as tokens because they aid syntactic interpretation.
2. Initialize Input and Processing Buffers: The sequence of tokens derived during step one is copied into the input and processing buffers, so that the initial interpretation of each word is the word itself.
3. Lexical Retrieval: The vector of paradigmatic associates of each of the input words is retrieved from Lexical Memory and added to the corresponding processing buffer slot.
4. Syntactic Probe Construction: Using the current contents of the processing buffer a syntactic probe is formed (see below for details) and placed in the Syntactic Buffer.
5. Syntactic Retrieval: The probe in the Syntactic Buffer is used to retrieve syntactic traces from syntactic memory. The probe is updated from the retrieved traces.
6. Syntactic Resolution: An optimization technique is used to alter the processing buffer representations to reflect the constraints stored in the Syntactic Buffer.
7. Relational Probe Construction: Using the current contents of the input and processing buffers a relational probe is formed (see below for details) and placed in the Relational Buffer.
8. Relational Retrieval: The probe in the Relational Buffer is used to retrieve relational traces from relational memory. The probe is updated from the retrieved traces.
9. Relational Resolution: Again, an optimization technique is used to alter the processing buffer representations to reflect the constraints stored in the Relational Buffer.
Having outlined the main process, it remains to elucidate how each of the operations occurs: memory retrieval; lexical, syntactic and relational trace construction; and syntactic and relational resolution.
The memory retrieval mechanism is based on the Minerva II model of episodic memory [Hintzman, D.L. (1984); MINERVA 2: A simulation model of human memory; Behavior Research Methods, Instruments, & Computers 16 (2), 96-101]. In Minerva II, memories are stored as vectors. Retrieval involves comparing a retrieval probe, also represented as a vector, against each of the traces in memory using a dot product operation.
$$s_t = p \cdot T_t$$
where $s_t$ is the similarity of the $t$-th trace, $p$ is the probe vector and $T_t$ is the $t$-th trace.
A new probe is constructed by multiplying each trace by its similarity raised to a power ($\alpha$) and summing:

$$p' = \sum_t s_t^{\alpha}\, T_t$$

Typically, $\alpha$ is an odd number (usually 3 or 5) so that the multiplier retains the sign of the dot product. The retrieved vector thus formed can then be used to initiate further retrieval. To avoid probes that grow or shrink rapidly the trace multipliers are normalized:

$$p' = \sum_t \frac{s_t^{\alpha}}{\sum_{t'} s_{t'}^{\alpha}}\, T_t$$
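As a concrete illustration, the retrieval operation can be sketched in a few lines of Python. This is a minimal sketch, not the patented implementation: the storage of traces as rows of a matrix, the function name and the absolute-value normalisation of the multipliers are assumptions consistent with the description above.

```python
import numpy as np

def minerva_retrieve(probe, traces, alpha=3):
    """Minerva II style retrieval: compare the probe to every stored trace
    with a dot product, raise each similarity to an odd power (preserving
    its sign), normalize the multipliers, and return the weighted average
    of the traces as the updated probe."""
    sims = traces @ probe              # s_t = p . T_t for each trace (row)
    weights = sims ** alpha            # odd alpha keeps the sign
    total = np.abs(weights).sum()
    if total > 0:
        weights = weights / total      # normalize the trace multipliers
    return weights @ traces            # weighted average of the traces
```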
The method disclosed herein assumes that a syntactic trace is the set of syntagmatic associates within a sentence. Each sentence is represented by a word-by-word matrix. A 1 in a given row and column indicates that the row word is followed, at some later position in the sentence, by the column word.
For instance, the sentence "Mary is loved by John" would be coded with the matrix:
         Mary   is   loved   by   John
Mary       0     1     1      1     1
is         0     0     1      1     1
loved      0     0     0      1     1
by         0     0     0      0     1
John       0     0     0      0     0
Such a representation has important advantages in a domain such as sentence processing where embedded structure is the norm rather than the exception. For instance, if memory contains the sentence:
Mary is loved by John

Then this trace should be well matched by the probes:

The girl is loved by John

and

Mary who was sick is loved by John

despite the fact that the component words do not align. Using the syntagmatic representation of the sentence ensures that there will be overlap. "The girl is loved by John" is represented by:

         The   girl   is   loved   by   John
The       0     1     1     1      1     1
girl      0     0     1     1      1     1
is        0     0     0     1*     1*    1*
loved     0     0     0     0      1*    1*
by        0     0     0     0      0     1*
John      0     0     0     0      0     0

The components marked with an asterisk overlap with the original sentence representation.
Similarly, "Mary who was sick is loved by John" is represented by:

         Mary   who   was   sick   is   loved   by   John
Mary      0     1     1     1     1*    1*     1*    1*
who       0     0     1     1     1     1      1     1
was       0     0     0     1     1     1      1     1
sick      0     0     0     0     1     1      1     1
is        0     0     0     0     0     1*     1*    1*
loved     0     0     0     0     0     0      1*    1*
by        0     0     0     0     0     0      0     1*
John      0     0     0     0     0     0      0     0

Again, the asterisked components overlap with the representation of "Mary is loved by John".
Suppose that the input buffer 5 contains the sentence to be processed as an array of locally encoded vectors. For instance, the sentence "Mary is loved by John" is coded as:
         I1   I2   I3   I4   I5
Mary      1    0    0    0    0
is        0    1    0    0    0
loved     0    0    1    0    0
by        0    0    0    1    0
John      0    0    0    0    1

where each column $I_i$ is a locally encoded (one-hot) vector with a single 1 in the component corresponding to the $i$-th word of the sentence.
Then the syntactic trace can be expressed as:

$$M_{\text{forward}} = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} I_i\, I_j^T$$

where $n$ is the number of words in the sentence. We could also code the backward associations:

$$M_{\text{backward}} = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} I_j\, I_i^T$$

Note, however, that $M_{\text{forward}} = M_{\text{backward}}^T$ so there is no need to calculate this matrix separately.
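The trace construction lends itself to a short sketch in Python. The code below follows the reconstructed equations above; the precedence reading of the matrix (row word anywhere before column word) is taken as an assumption, and all names are chosen for illustration.

```python
import numpy as np

def local_vectors(words, vocab):
    """One locally encoded (one-hot) vector per word of the sentence."""
    vecs = np.zeros((len(words), len(vocab)))
    for i, w in enumerate(words):
        vecs[i, vocab.index(w)] = 1.0
    return vecs

def syntactic_trace(vecs):
    """M_forward = sum over pairs i < j of the outer product I_i I_j^T:
    a 1 wherever the row word precedes the column word."""
    n, v = vecs.shape
    m = np.zeros((v, v))
    for i in range(n - 1):
        for j in range(i + 1, n):
            m += np.outer(vecs[i], vecs[j])
    return m

vocab = ["Mary", "is", "loved", "by", "John"]
M = syntactic_trace(local_vectors(["Mary", "is", "loved", "by", "John"], vocab))
```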
After syntactic retrieval the syntactic buffer 7 contains a set of constraints on the representations in the processing buffer 6. To apply these constraints we need to convert the order-based sentence matrix into a sequence-based buffer. To do this, the method applies an optimization technique. A suitable optimization technique is the gradient descent procedure. An alternative approach is an alignment algorithm described in "Speech discrimination by dynamic programming" by Vintsyuk, T. K. in Cybernetics 4(1): 52-57 (1968). The following example uses the gradient descent procedure.
If $S_{ij}$ is the matrix of syntagmatic associations retrieved from memory then we need to minimize the cost function:

$$F = \sum_{i,j} \left( S_{ij} - E_{ij} \right)^2$$

where

$$E = \sum_{k=1}^{n-1} \sum_{l=k+1}^{n} b_k\, b_l^T$$

and $b_i$ is the processing buffer vector in the $i$-th location. To update the buffer vectors the following equation is used:

$$\Delta b_k = \varepsilon \left[ \sum_{l=1}^{k-1} (S - E)^T b_l + \sum_{l=k+1}^{n} (S - E)\, b_l \right]$$

where $\varepsilon$ is a rate of change parameter and the $b_k$ are constrained to be positive. Resolution continues until:

$$\max_k (\Delta b_k) < C$$

where $C$ is a parameter of the process.
C represents the balance between the degree to which the cost function is minimised and the processing time. A smaller C requires longer processing time, whereas a larger C trades accuracy for speed.
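The resolution loop can be sketched as follows. This is a hedged illustration built directly on the reconstructed update equation; the parameter values eps, C and the iteration cap are illustrative choices, not values taken from the patent.

```python
import numpy as np

def resolve_syntactic(buffers, S, eps=0.05, C=1e-3, max_iter=1000):
    """Gradient descent on F = sum_ij (S_ij - E_ij)^2, where E is the
    syntagmatic matrix implied by the current buffer vectors."""
    b = buffers.copy()                        # shape: (n_slots, vocab_size)
    n = len(b)
    for _ in range(max_iter):
        E = np.zeros_like(S)
        for k in range(n - 1):
            for l in range(k + 1, n):
                E += np.outer(b[k], b[l])     # E = sum_{k<l} b_k b_l^T
        D = S - E
        deltas = np.zeros_like(b)
        for k in range(n):
            for l in range(k):                # slots before k: b_k is column word
                deltas[k] += D.T @ b[l]
            for l in range(k + 1, n):         # slots after k: b_k is row word
                deltas[k] += D @ b[l]
        b = np.maximum(b + eps * deltas, 0.0) # keep the b_k positive
        if np.max(np.abs(deltas)) < C:        # stop when updates are small
            break
    return b
```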
By coding the order relations we now have a mechanism by which appropriate surface forms, independent of embedded structure, can be matched. To be able to answer questions, however, we need a method for extracting a representation of the relational content of the sentence that is independent of surface structure. The sentences:

Mary is loved by John
Ellen is loved by Bert

will match quite well using the order relations (is-loved, loved-by, is-by) even though they code quite different relational content, while:

Mary is loved by John
John loves Mary

do not match at all despite the fact that these sentences code the same relational content. In this case, what we would like is a way of associating fillers to their roles that is independent of the surface structure and that does not require us to specify in advance what the roles are. To see how this is achieved suppose the engine is trained on the following sentences:
Ann is loved by Bert
Josie is loved by Steve
Ellen is loved by Dave
Bert loves Ann
Steve loves Josie
Dave loves Ellen
Who does Bert love? Ann
Who does Steve love? Josie
Who does Dave love? Ellen
In each of these constructions there is a lover and a lovee. Furthermore, in all three constructions the same people fill the lover and lovee roles. Bert, Steve and Dave are the lovers and Ann, Josie and Ellen are the lovees. The role is represented as the distributed pattern of those words that fill the same surface slots in different constructions (paradigmatic associates). That is, we can call the distributed pattern where Bert, Steve and Dave are active the lover role vector and the distributed pattern in which Ann, Josie and Ellen are active the lovee role vector. To interpret the sentence "Mary is loved by John" we would code the order relations (as above) and retrieve from syntactic long-term memory. We would then find that Ann, Josie and Ellen have filled the "Mary" slot in similar surface structures previously, and Bert, Steve and Dave have filled the "John" slot. As a consequence, the following mappings would be formed:
{Ann, Josie, Ellen} -> Mary
{Bert, Steve, Dave} -> John
Furthermore, the same would be true independent of the surface form. "John loves Mary" produces the same bindings. The method takes advantage of the fact that questions (including their answers) are simply an alternative syntactic form of the relational content in a corresponding sentence. To ask the question "Who does John love?", the words would be instantiated in the Input and Processing Buffers. Then syntactic retrieval and resolution would be executed. As a consequence, Bert, Steve and Dave would be activated in the "John" slot and Ann, Josie and Ellen would be activated in the slot following the question mark. The mapping {Bert, Steve, Dave} -> John can then be used to retrieve the trace above. To fulfill the relational constraints imposed by this trace requires that there be a {Ann, Josie, Ellen} -> Mary mapping generated by the Input and Processing Buffers. This constraint can be resolved by increasing the activation of Mary in the final slot. In the process the question is answered.
In the preceding paragraphs, the relational traces were described as a set of mappings. Within the engine these sets of mappings are expressed as a sum of outer products with the following equation:

$$R = \sum_{i=1}^{n} b_i\, (I_i - b_i)^T$$

The term $I_i - b_i$ is used to ensure that only the components that are not well predicted (i.e. the implicit variables in the utterance) are coded in the relational representation, thus making it insensitive to surface form. Note that an alternative form of this equation, which has the same effect, is:

$$R = \sum_{i=1}^{n} b_i\, I_i^T - \operatorname{diag}\!\left(\sum_{i=1}^{n} b_i\, I_i^T\right)$$

For example, suppose the sentences listed above are stored in syntactic memory and we wish to find the relational representation for "Mary is loved by John". The input buffer will contain:
[the locally encoded vectors $I_1$ to $I_5$ for "Mary", "is", "loved", "by" and "John": the same one-hot coding shown earlier, now over the full training vocabulary]
If a syntactic probe is generated, used to retrieve from syntactic memory and then used to resolve the retrieved syntactic matrix in the processing buffer B we get:
[a resolved buffer in which slot $B_1$ contains Mary together with the distributed pattern {Ann, Josie, Ellen}, and slot $B_5$ contains John together with the distributed pattern {Bert, Steve, Dave}]

where $B_1$ is a distributed vector representing the lovee role and $B_5$ is a distributed vector representing the lover role. Then $R$ codes the mappings {Ann, Josie, Ellen} -> Mary and {Bert, Steve, Dave} -> John as a sum of outer products.
Relational traces constructed in this way can then be stored in relational memory to be later retrieved from partial relational cues.
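A sketch of the trace construction, treating the reconstruction of R above as given (the array layout and function name are assumptions for illustration):

```python
import numpy as np

def relational_trace(buffers, inputs):
    """R = sum_i b_i (I_i - b_i)^T: bind each slot's paradigmatic pattern
    (the buffer vector) to the components of the input word that the
    pattern does not already predict."""
    R = np.zeros((buffers.shape[1], inputs.shape[1]))
    for b_i, I_i in zip(buffers, inputs):
        R += np.outer(b_i, I_i - b_i)
    return R
```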
Again, we use, for example, gradient descent with an objective function equal to the sum of squared differences between the current and retrieved relational representations to resolve the Processing Buffer. If $R_{ij}$ is the matrix of paradigmatic associations retrieved from memory then we need to minimize:

$$F = \sum_{i,j} \left( R_{ij} - B_{ij} \right)^2$$

where

$$B = \sum_{i=1}^{n} b_i\, (I_i - b_i)^T$$

and $b_i$ is the processing buffer vector in the $i$-th location. To update buffer vectors, then, we use:

$$\Delta b_k = \varepsilon\, b_k (R - B)$$

where $\varepsilon$ is a rate of change parameter and the $b_k$ are constrained to be positive. Again, resolution continues until:

$$\max_k (\Delta b_k) < C$$

where $C$ is a parameter of the process.
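A matching sketch of relational resolution, again assuming the reconstructed update rule and using illustrative parameter values:

```python
import numpy as np

def resolve_relational(buffers, inputs, R, eps=0.05, C=1e-3, max_iter=1000):
    """Nudge each buffer vector by eps * b_k (R - B) until the largest
    update falls below C, keeping the buffer vectors positive."""
    b = buffers.copy()
    for _ in range(max_iter):
        B = np.zeros_like(R)
        for b_i, I_i in zip(b, inputs):
            B += np.outer(b_i, I_i - b_i)     # current relational matrix
        D = R - B
        deltas = b @ D                        # row k is b_k (R - B)
        b = np.maximum(b + eps * deltas, 0.0)
        if np.max(np.abs(deltas)) < C:
            break
    return b
```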
While the syntactic and relational memory mechanisms are sufficient to allow the extraction of relational representations of a sentence, a great many sentences would be required to ensure that each word had been seen within the many contexts in which it could appear. Performance can be enhanced by adding the distributed pattern of paradigmatic associates of a word to each slot prior to syntactic probe construction. Now, although a given word may not have appeared in a certain position within a sentence, a similar word contained in the input word's lexical trace can precipitate syntactic retrieval.
The lexical trace for word $W_j$ is defined as:

$$L_j = \sum_{\text{all sentences}} \;\; \sum_{i \,:\, I_i = W_j} b_i$$

where $b_i$ is the state of the buffer following syntactic resolution. Lexical retrieval involves adding the trace of the corresponding input word to each processing buffer location:

$$b_i = b_i + L_j \quad \text{such that} \quad I_i = W_j$$
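A sketch of lexical memory construction and retrieval under these definitions; the per-sentence data layout (a list of word-list/buffer-matrix pairs) is an assumption for illustration:

```python
import numpy as np

def build_lexical_memory(resolved_sentences):
    """L_w: the sum, across all sentences, of the post-resolution buffer
    states at every location whose input word was w."""
    lexical = {}
    for words, buffers in resolved_sentences:  # buffers: (n_slots, vocab)
        for word, b_i in zip(words, buffers):
            if word in lexical:
                lexical[word] = lexical[word] + b_i
            else:
                lexical[word] = b_i.copy()
    return lexical

def lexical_retrieval(words, buffers, lexical):
    """Add each input word's lexical trace to its processing buffer slot."""
    out = buffers.copy()
    for i, word in enumerate(words):
        if word in lexical:
            out[i] += lexical[word]
    return out
```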
To answer a question, lexical, syntactic and relational memory systems are constructed and the sentence-processing algorithm is applied with the question as input. Additional slots beyond the end of the question are included in the processing buffer to hold the answer. Note that the lexical long-term memory system 4 and syntactic long-term memory system 2 contain general linguistic knowledge and hence can be pre-trained on a large corpus (containing some answered questions). The specific information required to answer the question will be contained in the relational long-term memory system 3. Consequently, it must be trained on the text expected to contain the answer.
The current invention differs from prior art question-answer systems in that it produces a representation of the deep structure of a sentence but does not require an elaborate set of parsing or semantic rules. Instead, it extracts the lexical, syntactic and relational information required to interpret a passage automatically from large corpora of electronic text. As a consequence, it is able to incorporate a more complete knowledge base than is possible when system designers must code these components. Furthermore, the automated nature of knowledge acquisition makes it feasible to track and respond to changes in language use.
The inventors consider that the primary application of the method is in information retrieval, particularly extraction of information from a database in response to a natural language question. However, other applications of the method can include document filtering, document categorisation and document comparison. One particular application of document comparison is essay marking. The use of the invention in this application is shown in FIG 2.
A master document is divided into individual sentences and each sentence is analysed to create a relational representation in the form of a matrix. The ensemble of matrices is a representation of the facts that are expressed in the master document. The same process is applied to a target document to produce a second ensemble of facts. The ensembles are compared to determine a similarity measure that indicates the degree to which the target document contains the same facts as the master document.
Referring to FIG 2, the first step is to divide the document into a collection of sentences. This may be done, for example, by detecting punctuation. Other algorithms known in the field could also be used depending upon the specific situation. Each sentence is segmented into words or tokens and analysed in the manner described in detail above. It will be appreciated that a document can be divided into a collection of sentences before further analysis, or preferably, each sentence is analysed progressively before the next sentence is identified. For each sentence the process produces a matrix that defines a fact. If there are further sentences the process iterates until the ensemble of matrices (facts) is collated. Once the process has completed, a comparison is made between the ensemble of matrices for the target document and the ensemble of matrices of the master document. The outcome of the comparison is a similarity measure that provides an indication of the degree of overlap of facts in the target document compared to the master document.
To compare the documents, each fact in the master document is compared against each fact in the target document using a similarity metric such as the dot product (other alternatives include the cosine of the angle between the matrices and the Euclidean metric). The fact from the target document that has the maximal dot product with each fact from the master document is the one most likely to contain the same information. This dot product is then a measure of the likelihood that the target document contains the same fact as the master document. By taking the average of these maximal dot products an overall measure of the amount of factual content from the master document that appears in the target document is produced.
If the cosine of the angle between the matrices is used instead of the dot product, the similarity measure is the average of the maximum cosines. Other similarity measures, such as Minkowski distance (of which the Euclidean distance is one case), may also be used.
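The comparison reduces to a few lines of Python. The sketch below assumes each fact is stored as a NumPy matrix; the element-wise (Frobenius) dot product implements the matrix dot product described above, and the cosine variant simply normalises it.

```python
import numpy as np

def compare_documents(master_facts, target_facts, use_cosine=False):
    """Average, over the master document's facts, of the maximal match
    each fact achieves against any fact in the target document."""
    def score(m, t):
        dot = float(np.sum(m * t))                 # matrix dot product
        if not use_cosine:
            return dot
        norm = np.linalg.norm(m) * np.linalg.norm(t)
        return dot / norm if norm > 0 else 0.0     # cosine of the angle
    return float(np.mean([max(score(m, t) for t in target_facts)
                          for m in master_facts]))
```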
It will be appreciated that the master document must be processed following the flowchart of FIG 2 before a comparison can be made with a target document.
Throughout the specification the aim has been to describe the preferred embodiments of the invention without limiting the invention to any one embodiment or specific collection of features.

Claims

1. A method of sentence analysis including the steps of: segmenting the sentence to be analysed; initialising a number of buffers for storing word vectors; lexical retrieval; syntactic retrieval and resolution; and relational retrieval and resolution.
2. The method of claim 1 wherein the step of segmenting the sentence includes segmenting the sentence into a sequence of individual words or a sequence of tokens.
3. The method of claim 2 wherein the sentence is segmented into tokens which include root words, prefixes or suffixes, and punctuation.
4. The method of claim 1 wherein the step of initializing buffers includes initializing an input buffer and a processing buffer with the sequence representing the sentence.
5. The method of claim 4 wherein the step of initializing buffers further includes establishing a syntactic buffer and a relational buffer.
6. The method of claim 1 wherein the step of lexical retrieval includes the steps of: retrieving a vector of paradigmatic associations for each word from a lexical long-term memory; and replacing each word in the processing buffer with the corresponding vector.
7. The method of claim 1 wherein the step of syntactic retrieval includes the steps of: constructing a syntactic probe in a syntactic buffer from the contents of a processing buffer; and using the probe to retrieve syntactic traces from a syntactic long-term memory, to update the syntactic probe.
8. The method of claim 7 wherein the step of syntactic resolution includes the step of modifying the contents of the processing buffer to reflect the constraints in the syntactic buffer.
9. The method of claim 1 wherein the step of relational retrieval includes the steps of: constructing a relational probe in a relational buffer from the contents of a processing buffer and an input buffer; and using the relational probe to retrieve relational traces from a relational long term memory to update the relational probe.
10. The method of claim 9 wherein the step of relational resolution includes the step of modifying the contents of the processing buffer to reflect the constraints in the relational buffer.
11. A method of retrieving an answer to a natural language question, the method including the steps of: a) segmenting the question to be answered; b) initializing an input buffer and a processing buffer with the segmented question; c) performing lexical retrieval; d) performing syntactic retrieval and resolution; and e) performing relational retrieval and resolution; said steps c) to e) being performed one or more times until the processing buffer contains the answer to the question.
12. The method of claim 11 further including the step of establishing a syntactic buffer and a relational buffer.
13. The method of claim 11 wherein step c) includes the steps of: retrieving a vector of paradigmatic associations for each word from a lexical long-term memory; and replacing each word in the processing buffer with the corresponding vector.
14. The method of claim 11 wherein step d) includes the steps of: constructing a syntactic probe in a syntactic buffer from the contents of a processing buffer; using the probe to retrieve syntactic traces from a syntactic long-term memory, to update the syntactic probe; and modifying the contents of the processing buffer to reflect the constraints in the syntactic buffer.
15. The method of claim 11 wherein step e) includes the steps of: constructing a relational probe in a relational buffer from the contents of a processing buffer and an input buffer; using the relational probe to retrieve relational traces from a relational long term memory to update the relational probe; and modifying the contents of the processing buffer to reflect the constraints in the relational buffer.
16. A method of establishing an ensemble of facts from a text document including the steps of: a) organising the text document into a collection of sentences; b) segmenting each sentence; c) initialising a number of buffers for storing word matrices, including an input buffer, a processing buffer, a syntactic buffer and a relational buffer; d) lexical retrieval; e) syntactic retrieval and resolution; f) relational retrieval and resolution; g) performing steps b) to f) for each sentence until the relational buffer contains a fact in the form of a matrix; h) repeating step g) for all sentences in the document to compile the ensemble of facts.
17. The method of claim 16 wherein the input buffer and the processing buffer store word vectors.
18. The method of claim 16 further including the step of performing steps a) to h) for a first document to produce a first ensemble of facts, performing steps a) to h) for a second document to produce a second ensemble of facts, and comparing the first and second ensemble of facts to establish a similarity measure between the first and second document.
19. The method of claim 18 wherein the comparison is performed by determining a dot product for each fact of the first document with each fact of the second document, and calculating an average of the maximum dot products.
20. The method of claim 18 wherein the comparison is performed by determining a cosine of the angle between each fact of the first document with each fact of the second document, and calculating an average of the maximum cosines.
PCT/AU2001/000731 2000-06-20 2001-06-20 Method of sentence analysis WO2001098941A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2001265700A AU2001265700A1 (en) 2000-06-20 2001-06-20 Method of sentence analysis

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AUPQ8253 2000-06-20
AUPQ8253A AUPQ825300A0 (en) 2000-06-20 2000-06-20 Method of sentence analysis

Publications (1)

Publication Number Publication Date
WO2001098941A1 true WO2001098941A1 (en) 2001-12-27

Family

ID=3822320

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2001/000731 WO2001098941A1 (en) 2000-06-20 2001-06-20 Method of sentence analysis

Country Status (2)

Country Link
AU (1) AUPQ825300A0 (en)
WO (1) WO2001098941A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5878385A (en) * 1996-09-16 1999-03-02 Ergo Linguistic Technologies Method and apparatus for universal parsing of language
US5841895A (en) * 1996-10-25 1998-11-24 Pricewaterhousecoopers, Llp Method for learning local syntactic relationships for use in example-based information-extraction-pattern learning
EP1020803A1 (en) * 1997-03-04 2000-07-19 Ltd. Sintokogio Language analysis system and method
JPH10254880A (en) * 1997-03-10 1998-09-25 Atr Onsei Honyaku Tsushin Kenkyusho:Kk Language structure analyzer of example control type
WO2000011576A1 (en) * 1998-08-24 2000-03-02 Virtual Research Associates, Inc. Natural language sentence parser

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PATENT ABSTRACTS OF JAPAN *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100057710A1 (en) * 2008-08-28 2010-03-04 Yahoo! Inc Generation of search result abstracts
US8984398B2 (en) * 2008-08-28 2015-03-17 Yahoo! Inc. Generation of search result abstracts
CN111221976A (en) * 2019-11-14 2020-06-02 北京京航计算通讯研究所 Knowledge graph construction method based on bert algorithm model
CN113378568A (en) * 2020-03-09 2021-09-10 株式会社理光 Relationship extraction method, device, system and computer readable storage medium
CN112464654A (en) * 2020-11-27 2021-03-09 科技日报社 Keyword generation method and device, electronic equipment and computer readable medium
WO2023282781A1 (en) * 2021-07-06 2023-01-12 Публичное Акционерное Общество "Сбербанк России" Method and device for providing a service to a client by means of a virtual assistant

Also Published As

Publication number Publication date
AUPQ825300A0 (en) 2000-07-13

Similar Documents

Publication Publication Date Title
Brill et al. An overview of empirical natural language processing
US7035789B2 (en) Supervised automatic text generation based on word classes for language modeling
US7286978B2 (en) Creating a language model for a language processing system
US7546235B2 (en) Unsupervised learning of paraphrase/translation alternations and selective application thereof
US20030144832A1 (en) Machine translation system
Dennis A memory‐based theory of verbal cognition
Schwartz et al. Language understanding using hidden understanding models
Simske et al. Functional Applications of Text Analytics Systems
Alkahtani Building and verifying parallel corpora between Arabic and English
WO2001098941A1 (en) Method of sentence analysis
Wu et al. A survey on statistical approaches to natural language processing
US20230140480A1 (en) Utterance generation apparatus, utterance generation method, and program
Lee Natural Language Processing: A Textbook with Python Implementation
Lane et al. Interactive word completion for morphologically complex languages
CN113011141A (en) Buddha note model training method, Buddha note generation method and related equipment
Dandapat Part-of-Speech tagging for Bengali
Chen et al. A POST parser-based learner model for template-based ICALL for Japanese-English writing skills
Wen Text mining using HMM and PMM
KR102496958B1 (en) Book story data base generating method for reading evaluation
Ekpenyong et al. Agent-based framework for intelligent natural language interface
Kamm Active Learning for acoustic speech recognition modeling
McMahon Statistical language processing based on self-organising word classification
Reeve Integrating hidden markov models into semantic web annotation platforms
Henrich et al. LISGrammarChecker: Language Independent Statistical Grammar Checking
Ahmed Detection of foreign words and names in written text

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP