CA2244127C - Phrase recognition method and apparatus - Google Patents
- Publication number
- CA2244127C
- Authority
- CA
- Canada
- Prior art keywords
- text
- list
- phrase
- chunks
- phrases
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Lifetime
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/912—Applications of a database
- Y10S707/917—Text
- Y10S707/964—Database arrangement
- Y10S707/966—Distributed
- Y10S707/967—Peer-to-peer
- Y10S707/968—Partitioning
- Y10S707/99931—Database or file accessing
- Y10S707/99933—Query processing, i.e. searching
- Y10S707/99934—Query formulation, input preparation, or translation
- Y10S707/99935—Query augmenting and refining, e.g. inexact access
Abstract
A phrase recognition method breaks streams of text into text "chunks" and selects certain chunks as "phrases" useful for automated full text searching.
The phrase recognition method uses a carefully assembled list of partition elements (10) to partition the text into chunks, and selects phrases from the chunks according to a small number of frequency based definitions (20). The method can also incorporate additional processes such as categorization of proper names (30) to enhance phrase recognition. The method selects phrases quickly and efficiently, referring simply to the phrases themselves and the frequency with which they are encountered, rather than relying on complex, time-consuming, resource-consuming grammatical analysis, or on collocation schemes of limited applicability, or on heuristical text analysis of limited reliability or utility.
Description
WO 97/27552 PCT/US97/00212
PHRASE RECOGNITION METHOD AND APPARATUS
COPYRIGHT NOTICE: A portion of the disclosure (including all Appendices) of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the file or records of this international application or in the patent file or records of any national or regional phase patent application based on this international application, but the copyright owner reserves all other copyright rights whatsoever.
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to automated indexing of full-text documents to identify the content-bearing terms for later document retrieval. More specifically, the invention relates to computer-automated identification of phrases which are useful in representing the conceptual content of documents for indexing and retrieval.
2. Related Art
A type of content-bearing term is the "phrase", a language device used in information retrieval to improve retrieval precision. For example, the phrase "product liability" indicates a concept that neither of the two component words can fully express. Without this phrase, a retrieval process is unable to find the documents in which the concept is discussed.
In traditional Boolean retrieval systems, phrase recognition is not an issue. The systems are known as post-coordination indexing systems in that phrases can be discovered through examining the adjacency relationships among search words during the process of merging inverted lists associated with the words.
However, in modern information retrieval systems, the statistical distribution characteristics of index terms are crucial to the relevance ranking process, and it is desirable to recognize phrases and derive their statistical characteristics in advance. In addition, in fabricating hypertext databases, recognized phrases are necessary for hypertext linkage.
Known phrase recognition methods include three types: machine translation, statistical text analysis and heuristical text analysis.
First, machine translation's approach to recognizing phrases (known as compound structures) is to analyze part-of-speech tags associated with the words in an input text string, usually a sentence. Noun phrases and verb phrases are two examples of such phrases. Syntactical context and lexical relationships among the words are key factors that determine successful parsing of the text. In machine translation, the goal is not to find correct phrases, but to discover the correct syntactical structure of the input text string to support other translation tasks. It is infeasible to use this syntactical parsing method for processing commercial full-text databases; the method is inefficient and is in practical terms not scalable. Regarding machine translation, reference may be made to U.S. Patent Nos. 5,299,124, 5,289,375, 4,994,966, 4,914,590, 4,931,936, and 4,864,502.
The second method of analysis, statistical text analysis, has two goals: disambiguating part of speech tagging, and discovering noun phrases or other compound terms. The statistics used include collocation information or mutual information, i.e., the probability that a given pair of part-of-speech tags or a given pair of words tends to appear together in a given data collection. When a word has more than one part-of-speech tag associated with it in a dictionary, consulting the part of speech of the next word and calculating the probability of occurrence of the two tags would help select a tag. Similarly, a pair of words that often appear together in the collection is probably a phrase. However, statistical text analysis requires knowledge of collocation that can only be derived from a known data collection. Disadvantageously, the method is not suitable for processing unknown data. Regarding statistical text analysis, reference may be made to U.S. Patent Nos. 5,225,981, 5,146,405, and 4,868,750.
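As a rough illustration of the collocation statistics mentioned above, the following sketch computes pointwise mutual information for adjacent word pairs in a tiny token list. The helper name `pmi_for_pairs`, the sample tokens, and the base-2 logarithm are assumptions made for illustration only; the cited patents may compute their statistics differently.

```python
import math
from collections import Counter

def pmi_for_pairs(tokens):
    """Return {(w1, w2): PMI} for adjacent word pairs (hypothetical helper)."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in pair_counts.items():
        p_pair = count / n_pairs            # probability of the adjacent pair
        p1 = word_counts[w1] / n_words      # probability of the first word
        p2 = word_counts[w2] / n_words      # probability of the second word
        scores[(w1, w2)] = math.log2(p_pair / (p1 * p2))
    return scores

# Made-up sample collection, not taken from the source.
tokens = "product liability law covers product liability claims and law".split()
scores = pmi_for_pairs(tokens)
```

In this sample the pair ("product", "liability") scores higher than ("liability", "law"), but every score depends on counts taken from the collection itself, which is why the approach struggles with data it has not seen.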
The third method of analysis, heuristical text analysis, emphasizes textual pattern recognition. Textual patterns include any recognizable text strings that represent concepts, such as company names, peoples' names, or product names. For example, a list of capital words followed by a company indicator like "Limited" or "Corp" is an example pattern for recognizing company names in text. The heuristical text analysis method requires strong observation ability from a human analyst. Due to the limitation of humans' observation span, heuristical text analysis is only feasible for small subject domains (e.g., company name, product names, case document names, addresses, etc.). Regarding heuristical text analysis, reference may be made to U.S. Patent Nos. 5,410,475, 5,287,278, 5,251,316, and 5,161,105.
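The company-name pattern described above can be sketched with a regular expression. The indicator list, the function name `find_company_names`, and the sample sentence are hypothetical; this is only one way such a hand-built pattern might look.

```python
import re

# Hypothetical indicator list; a real analyst-built list would be much longer.
INDICATORS = r"(?:Limited|Ltd|Corp|Corporation|Inc)"

# One or more capitalized words followed by a company indicator.
COMPANY = re.compile(r"\b((?:[A-Z][A-Za-z]*\s+)+" + INDICATORS + r")\b\.?")

def find_company_names(text):
    """Return substrings that match the capitalized-words-plus-indicator pattern."""
    return [match.group(1) for match in COMPANY.finditer(text)]

sample = "The suit names Acme Widget Corp. and Northern Star Limited as defendants."
names = find_company_names(sample)
```

A pattern like this also picks up a leading capitalized word such as a sentence-initial "The" when it directly precedes a company name, which hints at why hand-built patterns tend to stay tied to narrow domains.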
Thus, machine translation methods, involving potentially complex grammatical analysis, are too expensive and too error-prone for phrase recognition. Statistical text analysis, being based on collocation and being purely based on statistics, is still expensive because of the required full scale of part-of-speech tagging and pre-calculating collocation information, and also has difficulties processing unknown data without the collocation knowledge. Finally, heuristical text analysis, relying on "signal terms", is highly domain-dependent and has trouble processing general texts.
Thus, there is a need in the art for a simple, time-efficient, resource-efficient, and reliable phrase recognition method for use in assisting text indexing or for forming a statistical thesaurus. It is desired that the phrase recognition method be applicable to both individual documents and to large collections of documents, so that the performance of not only real-time on-line systems, but also distributed and mainframe text search systems can be improved. It is also desired that the phrase recognition method have engineering scalability, and not be limited to particular domains of knowledge.
The present invention is directed to fulfilling these needs.
SUMMARY OF THE INVENTION
The present invention provides a phrase recognition method which breaks text into text "chunks" and selects certain chunks as "phrases" useful for automated full text searching. The invention uses a carefully assembled list of partition words to partition the text into the chunks, and selects phrases from the chunks according to a small number of frequency-based definitions. The invention can also incorporate additional processes such as categorization of proper names to enhance phrase recognition.
The invention achieves its goals quickly and efficiently, referring simply to the phrases and the frequency with which they are encountered, rather than relying on complex, time-consuming, resource-consuming grammatical analysis, or on collocation schemes of limited applicability, or on heuristical text analysis of limited reliability or utility.
Additional objects, features and advantages of the invention will become apparent when the following Detailed Description of the Preferred Embodiments is read in conjunction with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention is better understood by reading the following Detailed Description of the Preferred Embodiments with reference to the accompanying drawing figures, in which like reference numerals refer to like elements throughout, and in which:
FIG. 1 illustrates an exemplary hardware configuration on which the inventive phrase recognition method may be executed.
FIG. 2 illustrates another exemplary hardware environment in which the inventive phrase recognition method may be practiced.
FIG. 3 is a high level flow chart schematically indicating execution in an embodiment of the phrase recognition method according to the present invention.
FIG. 4A is a flow chart schematically indicating execution in a module for partitioning text and generating text chunks.
FIG. 4B is a flow chart indicating execution of a module for selecting phrases using the data memory structure diagram of FIG. 5.
FIG. 5 is a data memory structure diagram schematically illustrating data flow during the inventive phrase recognition method (FIGS. 3, 4A, 4B) and corresponding memory allocation for various types of data used in accordance with the process.
FIG. 6A is a flow chart of an optional processing module for consolidating with a thesaurus.
FIG. 6B is a flow chart of an optional processing module for processing phrases with prepositions.
FIG. 6C is a flow chart of an optional processing module for trimming phrases with their collection frequencies.
FIG. 6D is a flow chart of an optional processing module for categorizing proper names.
FIGS. 7A, 7B and 7C illustrate exemplary applications of the inventive phrase recognition method according to the present invention. In particular: FIG. 7A indicates a user's viewing of a document in accordance with a suitable on-line text search system, and invoking the inventive phrase recognition method to search for additional documents of similar conceptual content; FIG. 7B schematically illustrates implementation of the phrase recognition method in a batch phrase recognition system in a distributed development system; FIG. 7C schematically illustrates application of the inventive phrase recognition method in a batch process in a mainframe system.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
In describing preferred embodiments of the present invention illustrated in the drawings, specific terminology is employed for the sake of clarity. However, the invention is not intended to be limited to the specific terminology so selected, and it is to be understood that each specific element includes all technical equivalents which operate in a similar manner to accomplish a similar purpose.
The concept of the present invention is first described on a particular example of a text stream. Then, block diagrams and flow charts are described, which illustrate non-limiting embodiments of the invention's structure and function.
Very briefly, a preferred embodiment of the inventive method partitions an input text stream based on punctuation and vocabulary. As the method processes the text stream sequentially, it inserts partition symbols between words if certain punctuation exists, such as a comma, end of sentence, or change in capitalization. Further, each word encountered is checked against one or more vocabulary lists, and may be discarded and replaced by a partition symbol, based on the word and where it is encountered.
After the document is thus processed, a set of candidate terms and "phrases" (a series of non-partitioned words) is produced. In the preferred embodiment, solitary words (individual words immediately surrounded by partitions) are ignored at this point. The phrases are processed to determine which phrases occur with higher frequency. Preferably, shorter phrases which occur with higher frequency and as subsets of lower-frequency but lengthier phrases are also sought. A set of phrases meeting or exceeding a given threshold frequency is produced.
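A minimal sketch of the frequency-based selection summarized above follows, under simplifying assumptions: the function names `contains` and `select_phrases`, the default threshold of 2, and the rule for crediting a shorter chunk with the occurrences of a longer, below-threshold chunk that contains it are all illustrative choices, not the definitions the patent itself uses.

```python
from collections import Counter

def contains(words, sub):
    """True if the word list `sub` appears contiguously inside `words`."""
    k = len(sub)
    return any(words[i:i + k] == sub for i in range(len(words) - k + 1))

def select_phrases(chunks, threshold=2):
    """Select multi-word chunks whose frequency meets the threshold (simplified)."""
    # Solitary words are ignored at this point.
    counts = Counter(c for c in chunks if len(c.split()) > 1)
    totals = Counter(counts)
    for long_chunk, n in counts.items():
        if n >= threshold:
            continue
        long_words = long_chunk.split()
        # Credit shorter chunks found inside a longer, lower-frequency chunk.
        for short_chunk in counts:
            short_words = short_chunk.split()
            if len(short_words) < len(long_words) and contains(long_words, short_words):
                totals[short_chunk] += n
    return {phrase: n for phrase, n in totals.items() if n >= threshold}

# Made-up chunk list for illustration.
chunks = ["product liability", "product liability law", "visa",
          "product liability", "White House"]
selected = select_phrases(chunks)
```

Here "product liability" occurs twice on its own and once inside the lower-frequency chunk "product liability law", so it is the only phrase that survives the threshold in this sample.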
The inventive method is more easily understood with reference to a particular example.
As mentioned above, members of a list of words (including punctuation) serve as "break points" to form text "chunks" within input text. A first (rudimentary) list includes words that can be used as "stop words". The stop words usually carry little semantic information because they exist merely for various language functions. This list has a few hundred members and includes articles (e.g., "a", "the"), conjunctions (e.g., "and", "or"), adverbs (e.g., "where", "why"), prepositions (e.g., "of", "to", "for"), pronouns (e.g., "we", "his"), and perhaps some numeric items.
However, this first list is too short for the present phrase recognition method because it causes generation of a list of text chunks that are too long to allow efficient generation of desirable phrases. Additional stop words or other partition items are needed for reducing the size of the text chunks, so that more desirable phrases may be found.
The following example of text illustrates this problem. In this example, the text "chunks" are within square brackets, with the text chunks being separated by members of the list of stop words (break points):
[Citing] what is [called newly conciliatory comments] by the [leader] of the [Irish Republican Army]'s [political wing], the [Clinton Administration announced today] that it would [issue] him a [visa] to [attend] a [conference] on [Northern Ireland] in [Manhattan] on [Tuesday]. The [Administration] had been [leaning] against [issuing] the [visa] to the [official], [Gerry Adams], the [head] of [Sinn Fein], [leaving] the [White House caught] between the [British Government] and a [powerful bloc] of [Irish-American legislators] who [favored] the [visa]. (Parsed text based on rudimentary list)
Since desirable phrases include noun phrases (e.g., "ice cream"), adjective-noun phrases (e.g., "high school"), participle-noun phrases (e.g., "operating system"), and proper names (e.g., "White House"), it is safe to add adverbs (e.g., "fully") and non-participle verbs (e.g., "have", "is", "obtain") to the list of stop words to form an enhanced stop word list.
This enhanced stop word list allows the method to provide smaller text chunks, yet is still compact enough for efficient look-up by computer. With the enhanced list, the above example text is parsed into chunks and stop words as follows:
[Citing] what is [called] newly [conciliatory comments] by the [leader] of the [Irish Republican Army]'s [political wing], the [Clinton Administration] announced [today] that it would issue him a [visa] to attend a [conference] on [Northern Ireland] in [Manhattan] on [Tuesday]. The [Administration] had been [leaning] against [issuing] the [visa] to the [official], [Gerry Adams], the [head] of [Sinn Fein], [leaving] the [White House] caught between the [British Government] and a [powerful bloc] of [Irish-American legislators] who favored the [visa]. (Second parsed text based on enhanced list)
The theoretical justification of using this enhanced list derives from two sources.
A first justification is that this list only represents about 13% of unique words in a general English dictionary. For example, in the Moby dictionary of 214,100 entries, there are 28,408 words that can be put into the list. This fact ensures that semantic information in texts is maintained at a maximum level.
A second justification involves the lexical characteristics of these words. Most of the words bear little content. This second fact reduces the risk of losing semantic information in the text.
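The effect of moving from the rudimentary list to the enhanced list can be sketched on one clause of the example text. The two word sets below are tiny hypothetical excerpts chosen to reproduce the bracketing shown for that clause, and the function name `chunk` is likewise an assumption. The last line simply checks the "about 13%" figure from the quoted dictionary counts.

```python
import re

# Tiny illustrative excerpts; the lists described in the text are far larger.
RUDIMENTARY = {"the", "that", "it", "would", "him", "a", "to", "on"}
ENHANCED = RUDIMENTARY | {"announced", "issue", "attend"}

def chunk(text, stop_words):
    """Split text into chunks at stop words and basic punctuation (simplified)."""
    chunks, current = [], []
    for token in re.findall(r"[A-Za-z][A-Za-z'-]*|[.,;]", text):
        if token.lower() in stop_words or token in ".,;":
            if current:                      # close the chunk at a break point
                chunks.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        chunks.append(" ".join(current))
    return chunks

clause = ("the Clinton Administration announced today that it would issue "
          "him a visa to attend a conference on Northern Ireland")
coarse = chunk(clause, RUDIMENTARY)
fine = chunk(clause, ENHANCED)

# Share of dictionary entries eligible for the enhanced list.
share = 28_408 / 214_100
```

With the rudimentary set the first chunk is "Clinton Administration announced today"; with the enhanced set it shrinks to "Clinton Administration", and the verb-only chunks "issue" and "attend" disappear.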
The basic concept of the invention having been described, particular implementations of its structure and function are now presented.
As will readily be appreciated, the invention is preferably embodied as software, instruction codes capable of being executed by digital computers, including commercially available general purpose digital computers well known to those skilled in the art. The particular hardware on which the invention may be implemented varies with the particular desired application of the inventive phrase recognition method. Three examples of such applications of the phrase recognition method are described in greater detail below, with reference to FIGS. 7A, 7B, and 7C. Briefly, the dynamic recognition method involved in an on-line system (FIG. 7A) may be implemented in IBM 370 assembly language code. Alternatively, in a batch recognition system in a distributed development system (FIG. 7B), the phrase recognition method may be implemented on a SUN work station using the PERL script interpretive prototyping language. As a still further implementation, the inventive phrase recognition method may be implemented on an Amdahl AMD 5995-1400-a mainframe so that another batch phrase recognition system (FIG. 7C) may be realized. Of course, the scope of the invention should not be limited by these exemplary embodiments or applications.
Embodiments of the inventive phrase recognition method may be implemented as a software program including a series of executable modules on a computer system. As shown in FIG. 1, an exemplary hardware platform includes a central processing unit 110. The central processing unit interacts with a human user through a user interface 112. The user interface is used for inputting information into the system and for interaction between the system and the human user. The user interface 112 includes, for example, a video display 113 and a keyboard 115. A computer memory 114 provides storage for data and software programs which are executed by the central processing unit 110. Auxiliary memory 116, such as a hard disk drive or a tape drive, provides additional storage capacity and a means for retrieving large batches of information.
All components shown in FIG. 1 are of a type well known in the art. For example, the FIG. 1 system may include a SUN work station including the execution platform Sparc 2 and SUN OS Version 4.1.2, available from SUN MICROSYSTEMS of Sunnyvale, California. Of course, the system of the present invention may be implemented on any number of modern computer systems.
A second, more complex environment in which the inventive phrase recognition method may be practiced is shown in FIG. 2. In particular, a document search and retrieval system 30 is shown. The system allows a user to search a subset of a plurality of documents for particular key words or phrases. The system then allows the user to view documents that match the search request. The system 30 comprises a plurality of Search and Retrieval (SR) computers 32-35 connected via a high speed interconnection 38 to a plurality of Session Administrator (SA) computers 42-44.
Each of the SR's 32-35 is connected to one or more document collections 46-49, each containing text for a plurality of documents, indexes therefor, and other ancillary data. More than one SR can access a single document collection. Also, a single SR can be provided access to more than one document collection. The SR's 32-35 can be implemented using a variety of commercially available computers well known in the art, such as Model EX100 manufactured by Hitachi Data Systems of Santa Clara, California.
Each of the SA's 42-44 is provided access to data representing phrase and thesaurus dictionaries 52-54. The SA's 42-44 can also be implemented using a variety of commercially available computers, such as Models 5990 and 5995 manufactured by Amdahl Corporation of Sunnyvale California. The interconnection 38 between the SR's and the SA's can be any one of a number of two-way high-speed computer data interconnections well known in the art, such as the Model 7200-DX manufactured by Network Systems Corporation of Minneapolis, Minnesota.
Each of the SA's 42-44 is connected to one of a plurality of front end processors 56-58. The front end processors 56-58 provide a connection of the system 30 to one or more commonly available networks 62 for accessing digital data, such as an X.25 network, long distance telephone lines, and/or SprintNet. Connected to the network 62 are plural terminals 64-66 which provide users access to the system 30. Terminals 64-66 can be dumb terminals which simply process and display data inputs and outputs, or they can be one of a variety of readily available stand-alone computers, such as IBM or IBM-compatible personal computers. The front end processors 56-58 can be implemented by a variety of commercially available devices, such as Models 4745 and 4705 manufactured by the Amdahl Corporation of Sunnyvale California.
The number of components shown in FIG. 2 is for illustrative purposes only. The system 30 described herein can have any number of SA's, SR's, front end processors, etc. Also, the distribution of processing described herein may be modified and may in fact be performed on a single computer without departing from the spirit and scope of the invention.
A user wishing to access the system 30 via one of the terminals 64-66 will use the network 62 to establish a connection, by means well known in the art, to one of the front end processors 56-58. The front end processors 56-58 handle communication with the user terminals 64-66 by providing output data for display by the terminals 64-66 and by processing terminal keyboard inputs entered by the user. The data output by the front end processors 56-58 includes text and screen commands. The front end processors 56-58 support screen control commands, such as the commonly known VT100 commands, which provide screen functionality to the terminals 64-66 such as clearing the screen and moving the cursor insertion point. The front end processors 56-58 can handle other known types of terminals and/or stand-alone computers by providing appropriate commands.
Each of the front end processors 56-58 communicates bidirectionally, by means well known in the art, with its corresponding one of the SA's 42-44. It is also possible to configure the system, in a manner well known in the art, such that one or more of the front end processors can communicate with more than one of the SA's 42-44. The front end processors 56-58 can be configured to "load balance" the SA's 42-44 in response to data flow patterns. The concept of load balancing is well known in the art.
Each of the SA's 42-44 contains an application program that processes search requests input by a user at one of the terminals 64-66 and passes the search request information on to one or more of the SR's 32-35, which perform the search and return the results, including the text of the documents, to the SA's 42-44. The SA's 42-44 provide the user with text documents corresponding to the search results via the terminals 64-66. For a particular user session (i.e. a single user accessing the system via one of the terminals 64-66), a single one of the SA's 42-44 will interact with a user through an appropriate one of the front end processors 56-58.
Preferably, the inventive phrase recognition method is implemented in the session administrator SA computers 42-44, with primary memory being in the SA computer itself and further memory being illustrated within elements 52-54.
The principles on which the inventive method is based, and hardware systems and software platforms on which it may be executed, having been described, a preferred embodiment of the inventive phrase recognition method is described as follows.
FIG. 3 is a high level flow diagram of the phrase recognition method of the preferred embodiment.
Referring to FIG. 3, the invention uses a carefully assembled list of English words (and other considerations such as punctuation) in a Partition Word List (more generally referred to as a Partition Entity List, or simply Partition List) to partition one or more input text streams into many text chunks. This partitioning process is illustrated in block 10.
Block 20 indicates selection of phrases from among the chunks of text, according to frequency based definitions. A Phrase List, including the selected phrases, results from execution of the process in block 20. During the phrase selection process, solitary words (single-word chunks), as well as words from the decomposed phrases, can be maintained separate from the Phrase List as optional outputs for other indexing activities.
Details of processes 10 and 20 are described with reference to FIGS. 4A and 4B.
The invention can optionally incorporate one or more other processes, generically indicated as element 30. Such optional process may include categorization (examples described with reference to FIGS. 6A-6D) to enhance the list of recognized phrases.
FIG. 4A is a ffow chart of FIG. 3 module 10, for partitioning text and ~e~ ling text chunks.
W 097/27552 PCTrUS97100212 FIG. 4A shows how the method, given a piece of text, partitions the text ihtO many small text chunks. A critical component in this method is the Partition List (including words and pnn~ tion) whose members serve as break points to generate the text chunks.
As mentioned above, a Partition List ideally allows parsing of text into short phrases, but is itself still compact enough for efficient computer look-up during the parsing process. Preferably, the Partition List is generated using not only articles, conjunctions, adverbs, prepositions, pronouns, and numeric items, but also adverbs and verbs, to form an enhanced list.
The text partitioning process starts off with looking up encountered text in the Partition List (at block 101) and replacing every matched partition word or other partition entity with a partition tag such as " " (shown at block 102).
Additional partition tags are added into those text chunks at the point where there is a case change, either from lower case to upper case or vice versa (shown at block 103).
Block 104 indicates generation of the text chunk list which preserves the natural sequence of the chunks as encountered in the text.
The frequency information for each chunk in the list is collected by scanning the text chunks in their natural sequence. The first occurrence of each unique chunk in the sequence is registered as a new entry with its frequency as 1. Subsequent occurrences are registered by incrementing its frequency count by 1. This generation of occurrence frequencies in association with the respective chunks is indicated by block 105.
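By way of non-limiting illustration, blocks 101 through 105 might be sketched as follows. The sketch is written in Python, which the specification does not prescribe. The small `PARTITION_WORDS` set, the `TAG` character, the tokenizing pattern, and the function names are all assumptions made for this example; an actual Partition List, such as that of Appendix A, is far larger.

```python
import re
from collections import Counter

# Hypothetical, tiny stand-in for the Partition List (block 101).
PARTITION_WORDS = {"the", "of", "a", "an", "and", "in", "to", "on", "said", "is"}
TAG = "|"  # illustrative partition tag, chosen for this sketch only


def partition(text, partition_words=PARTITION_WORDS):
    """Split text into chunks, preserving their natural order (blocks 101-104)."""
    # Words (letters, apostrophes, hyphens) or single non-letter characters.
    tokens = re.findall(r"[A-Za-z][A-Za-z'\-]*|[^\sA-Za-z]", text)
    tagged = []
    prev_upper = None  # case of the previous kept word, None after a tag
    for tok in tokens:
        # Punctuation, digits and partition words become partition tags (block 102).
        if not tok[0].isalpha() or tok.lower() in partition_words:
            tagged.append(TAG)
            prev_upper = None
            continue
        is_upper = tok[0].isupper()
        # A case change between adjacent words adds a further tag (block 103).
        if prev_upper is not None and is_upper != prev_upper:
            tagged.append(TAG)
        tagged.append(tok)
        prev_upper = is_upper
    # Collect the words between tags into chunks (block 104).
    chunks, current = [], []
    for item in tagged:
        if item == TAG:
            if current:
                chunks.append(" ".join(current))
            current = []
        else:
            current.append(item)
    if current:
        chunks.append(" ".join(current))
    return chunks


def chunk_frequencies(chunks):
    """Register each unique chunk with its occurrence count (block 105)."""
    return Counter(chunks)
```

A `Counter` keeps entries in order of first occurrence, which corresponds to registering the first occurrence of each chunk as a new entry and incrementing its count thereafter.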
FIG. 4B is a flow chart of FIG. 3 module 20, illustrating details of a preferred process for selecting which chunks are phrases.
FIG. 5 is a data memory structure diagram showing how data may be arranged in memory for the process, and how data flows into and out of various steps of the process.
More specifically, the steps from FIG. 3 of text partitioning 10, phrase selection 20, and optional processing 30 (reproduced on the left side of FIG. 5) are illustrated in conjunction with an exemplary data memory structure diagram (on the right side of FIG. 5) to schematically illustrate data flow between major functional procedures and data structures.
The various lists shown in the exemplary memory blocks on the right side of FIG. 5 are understood to include list members in conjunction with their respective frequencies of occurrence.
The memory (broadly, any data storage medium such as RAM and/or magnetic disk and/or optical disk and/or other suitable computer readable medium) may be structured in memory blocks as schematically illustrated. A text stream file 300 and a Partition List 310 are used as inputs to the partitioning process 10 of the inventive phrase recognition method.
The partitioning process 10 provides a chunk list (understood as including corresponding chunk frequencies) 315. Chunk list 315 is used by the phrase selection process 20 of the inventive phrase recognition method.
The partitioning process produces various groupings of chunks, each with their respective frequencies of occurrence within the text stream. These groupings of chunks are illustrated on the right side of FIG. 5, with the understanding that the invention should not be limited to the particular memory structure so illustrated.
Specifically, lower case words (that is, single-word chunks) are in memory block 320, capitalized or "allcaps" single-word chunks are in memory block 325, a Proper Name List (preferably of greater than one word, each being capitalized or in allcaps) is in memory block 330, lower case phrases of greater than one word occurring more than once are in memory block 335, lower case phrases of greater than one word which were encountered only once are in memory block 345, and, optionally, acronyms are in memory block 350.
A synonym thesaurus in memory block 375 may be used in an optional process 30.
A phrase frequency list derived from a collection of plural documents, in which the phrase frequency throughout the collection is greater than a threshold, in memory block 380, may also be used in an optional processing procedure 30. Further, one or more special indicator lists, generally indicated as 385A-385E (company indicators, geographic names, product names, organization indicators, English first names, respectively, some of which are exemplified in the attached Appendices) may contribute to certain optional categorizing processes, and result in corresponding name lists (company names, geographic location names, product names, organization names, and English names) generally indicated as 390A-390E.
Referring again to FIG. 4B, after the text chunk list is produced, it is time to decide whether each chunk in the list is a phrase useful for representing conceptual content of documents. The inventive phrase recognition method uses the frequency information of two types of the partitioned text chunks (namely, the proper names in block 330 and the lower case phrases in blocks 335 and 345) to make final phrase selection decisions. Preferably, the invention focuses on lower case phrases of more than one word, or on proper names ("John Hancock", "United States").
Referring to FIGS. 4B and 5, at block 201, entries consisting of a solitary lower case word are not selected as phrases. Rejected entries are stored in memory block 320.
As shown at block 202, those chunks that include plural lower case words are determined to be phrases only if they occur at least twice in the text stream. These chunks are stored in memory block 335. Chunks not fitting these criteria are stored in block 345 for further processing.
For chunks consisting of a solitary upper case word (either the first letter being capitalized or "allcaps"), no phrase decision is made at this stage, as shown at block 203.
Such chunks are stored in memory block 325.
In block 204, chunks including plural upper case words are determined to be proper names and are stored in a Proper Name List in memory block 330.
Finally, other text chunks not fitting the previous criteria are simply discarded at this time, as indicated at block 205.
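The decisions of blocks 201 through 205 may be pictured with the following minimal Python sketch. The function name and the use of dictionaries keyed by memory block number are assumptions made for illustration only, and the case test on the first letter of each word is a simplification.

```python
def classify_chunks(freq):
    """Sort chunks into the memory blocks of FIG. 5 (blocks 201-205).

    freq maps each chunk to its frequency of occurrence.
    Returns a dict keyed by memory block number (320, 325, 330, 335, 345).
    """
    blocks = {320: {}, 325: {}, 330: {}, 335: {}, 345: {}}
    for chunk, count in freq.items():
        words = chunk.split()
        if not words:
            continue
        all_lower = all(w[0].islower() for w in words)
        all_upper = all(w[0].isupper() for w in words)
        if all_lower and len(words) == 1:
            blocks[320][chunk] = count       # block 201: solitary lower case word
        elif all_lower:
            target = 335 if count >= 2 else 345
            blocks[target][chunk] = count    # block 202: plural lower case words
        elif all_upper and len(words) == 1:
            blocks[325][chunk] = count       # block 203: decision deferred
        elif all_upper:
            blocks[330][chunk] = count       # block 204: proper name
        # block 205: any other chunk is discarded
    return blocks
```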
Next, block 206 examines the lower case phrases having a single occurrence from memory block 345. Each is examined for having one of its sub-phrases as part of an existing lower case phrase in the list. For efficiency, a sub-phrase may be defined to be the first or last two or three words in the phrase. When the existence of a sub-phrase is detected, it is merged into the corresponding phrase in the list in memory block 335, and its frequency count is updated. Otherwise, the lower case phrase is decomposed into individual words for updating the lower case word list in memory block 320 as an optional output.
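One possible reading of block 206 is sketched below in Python. It assumes that a sub-phrase "matches" when it appears as a whole-word sequence inside a phrase already held in memory block 335, and that the merged count is increased by the frequency of the single-occurrence phrase. Both points, and the function names, are assumptions of this sketch.

```python
def sub_phrases(phrase):
    """First and last two or three words of a phrase."""
    words = phrase.split()
    subs = []
    for n in (2, 3):
        if len(words) >= n:
            subs.append(" ".join(words[:n]))
            subs.append(" ".join(words[-n:]))
    return subs


def merge_singletons(repeated, singletons, lower_words):
    """Merge or decompose single-occurrence phrases (block 206).

    repeated    -- phrases of memory block 335 (updated in place)
    singletons  -- phrases of memory block 345
    lower_words -- lower case word list of memory block 320 (updated in place)
    """
    for phrase, count in singletons.items():
        match = next(
            (p for p in repeated
             if any(f" {s} " in f" {p} " for s in sub_phrases(phrase))),
            None,
        )
        if match is not None:
            repeated[match] += count             # merge into existing phrase
        else:
            for word in phrase.split():          # decompose into single words
                lower_words[word] = lower_words.get(word, 0) + count
    return repeated, lower_words
```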
As a result of this sub-phrase mapping in block 206, in our example the list is reduced to a list of lower case phrases and a list of proper names, both with their respective frequency counts:
[political wing, 2]
[Citing, 1]
[Irish Republican Army, 1]
[Clinton Administration, 1]
[Northern Ireland, 1]
[Manhattan, 1]
[Tuesday, 1]
[Administration, 1]
[Gerry Adams, 1]
[Sinn Fein, 1]
[White House, 1]
[British Government, 1]
The singleton upper case word could be used for referencing an existing proper name in the proper name list. To make the final frequency count accurate, the method makes one additional scan of the Proper Name List 330. It consolidates the upper case word that is either the first or the last word of a previously recognized proper name, and updates its frequency count. This use of upper case single words in memory block 325 to revise the Proper Name List 330 is indicated at block 207. The method stores the other upper case words in the upper case word list 325 as an optional output.
A special case of the singleton upper case word is that of the acronym. An acronym is defined either as a string of the first character of each word (which is neither a preposition nor a conjunction) in a proper name or as a string of the first character of each word in a proper name followed by a period. As indicated at block 208, when an acronym in memory block 325 maps to a proper name in the proper name list 330, the frequency count of the proper name is incremented, and the pair of the proper name and its acronym is copied into an acronym list 350 as an optional output.
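The acronym definition above lends itself to a short sketch. In the following Python example, the `SKIP_WORDS` set is a hypothetical, abbreviated list of prepositions and conjunctions, and incrementing the proper name by the acronym's own frequency is an assumption; the specification states only that the count is incremented.

```python
# Hypothetical short list of prepositions and conjunctions to skip.
SKIP_WORDS = {"of", "for", "in", "on", "and", "or"}


def acronym_forms(proper_name):
    """Return the plain and the period-separated acronym of a proper name."""
    initials = [w[0].upper() for w in proper_name.split()
                if w.lower() not in SKIP_WORDS]
    return {"".join(initials), "".join(c + "." for c in initials)}


def match_acronyms(upper_words, proper_names):
    """Map acronyms in block 325 to proper names in list 330 (block 208).

    Updates proper_names in place and returns the acronym list (block 350)
    as (proper name, acronym) pairs.
    """
    acronym_list = []
    for word, count in upper_words.items():
        for name in proper_names:
            if word in acronym_forms(name):
                proper_names[name] += count
                acronym_list.append((name, word))
                break
    return acronym_list
```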
In our example, this reference checking process further reduces the proper name list to the following:
[Irish Republican Army, 1]
[Clinton Administration, 2]
[Northern Ireland, 1]
[Gerry Adams, 1]
[Sinn Fein, 1]
[White House, 1]
[British Government, 1]
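The reference check of block 207 that yields the list above might be sketched as follows in Python. The function name is illustrative, and taking the first matching proper name when several share the same first or last word is an assumption of this sketch.

```python
def consolidate_upper_words(upper_words, proper_names):
    """Fold singleton upper case words into proper names (block 207).

    A word that is the first or last word of a recognized proper name
    increases that name's count; other words are returned as leftovers.
    """
    leftover = {}
    for word, count in upper_words.items():
        match = None
        for name in proper_names:
            parts = name.split()
            if word in (parts[0], parts[-1]):
                match = name
                break
        if match is not None:
            proper_names[match] += count
        else:
            leftover[word] = count
    return proper_names, leftover
```

Applied to the example, "Administration" is the last word of "Clinton Administration", whose count therefore rises from 1 to 2, while "Citing", "Manhattan" and "Tuesday" remain in the upper case word list.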
If no additional processing is necessary, this method concludes by combining the lower case phrase list in memory block 335 and the Proper Name List in memory block 330 into a single Phrase List 340 which is provided as the final output of the phrase selection process 20.
In another embodiment, the lower case phrases with frequency = 1 in memory block 345 are also included in the consolidation, in addition to the Proper Name List in memory block 330 and the lower case phrases having frequency greater than 1 in memory block 335. The choice of either including or excluding the lower case phrases in memory block 345 is determined by a frequency threshold parameter which determines the number of times a lower case phrase must be encountered before it is allowed to be consolidated into the final Phrase List.
The example shown in FIG. 5 has this threshold set to 2, so that those phrases encountered only once (in memory block 345) are excluded from the consolidated Phrase List 340. The dotted line extending downward from Phrase List 340 to include memory block 345 shows how lower case phrases encountered only once can be included in the Phrase List if desired, however.
In any event, the consolidation of memory blocks into a single Phrase List is indicated at block 209.
For this text stream example, the final Phrase List is as follows:
[political wing, 2]
[Irish Republican Army, 1]
[Clinton Administration, 2]
[Northern Ireland, 1]
[Gerry Adams, 1]
[Sinn Fein, 1]
[White House, 1]
[British Government, 1]
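The consolidation of block 209, including the frequency threshold parameter, may be illustrated by the following Python sketch. The function and parameter names are assumptions; `lower_phrases` is taken to hold the contents of memory blocks 335 and 345 together.

```python
def build_phrase_list(lower_phrases, proper_names, threshold=2):
    """Combine lower case phrases and proper names into one Phrase List (block 209).

    Lower case phrases are kept only if encountered at least `threshold` times;
    proper names are kept regardless of frequency.
    """
    phrase_list = {p: c for p, c in lower_phrases.items() if c >= threshold}
    phrase_list.update(proper_names)
    return phrase_list
```

With the threshold at 2, single-occurrence phrases are excluded, as in FIG. 5; setting it to 1 corresponds to the alternative embodiment that includes memory block 345.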
The invention envisions that optional processes are available for further enhancing the recognized phrases.
FIG. 6A is a flow chart of an optional processing module for consolidating with a synonym thesaurus.
Referring to FIG. 6A, the Phrase List can be further reduced with a synonym thesaurus, as indicated at block 301. The synonym thesaurus may be any suitable synonym thesaurus available from commercial vendors. As an example, the Phrase List may map "White House" to "Clinton Administration." Using a synonym thesaurus is risky because its contents may not reflect the intended conceptual content of the text, and therefore may cause mapping errors. For example, it would be problematic if a synonym thesaurus maps "Bill Clinton" to "White House", because the two terms are not always equivalent.
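As a rough illustration of block 301, the following Python sketch treats the thesaurus as a simple mapping from a phrase to a preferred synonym and adds the frequencies of merged entries together. The mapping format and the summing of counts are assumptions; a commercial thesaurus would have its own structure.

```python
def apply_thesaurus(phrase_list, thesaurus):
    """Reduce a Phrase List by mapping phrases to preferred synonyms (block 301)."""
    merged = {}
    for phrase, count in phrase_list.items():
        preferred = thesaurus.get(phrase, phrase)
        merged[preferred] = merged.get(preferred, 0) + count
    return merged
```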
FIG. 6B is a flow chart of an optional processing module for processing phrases with prepositions.
Referring to FIG. 6B, when a desirable lower case phrase contains one of a small set of prepositions (e.g., "right to counsel", "standard of proof"), the method takes the set out of the Partition List used for generating text chunks so that the phrase including the preposition has an opportunity to reveal itself as being part of a "good" phrase. This process is indicated as block 302.
Since it is statistically unlikely that any given occurrence of a preposition is in a "good" phrase, this optional process consumes substantial time for a relatively small increase in phrases, and is considered optional.
It is necessary to have another process to further examine the unqualified phrase in memory block 345 that contains one of the selected prepositions, to determine whether the sub-phrase on the left of the preposition or the sub-phrase on the right constitutes a valid phrase in the lower case phrase list in memory block 335. This process is illustrated as block 303.
As a result of process blocks 302, 303, memory block 335 may be updated.
FIG. 6C is a flow chart of an optional processing module for trimming phrases with their collection frequencies.
Referring to FIG. 6C, still another optional process is that of editing the list of the Proper Name List 330 and lower case phrases 335 with additional frequency information 380 gathered from a text collection of more than one document. The assumption here is that the more authors who use a phrase, the more reliable the phrase is for uniquely expressing a concept. In other words, a phrase occurring in more than one document is a "stronger" phrase than another phrase occurring only in a single document.
Here, the "collection frequency" of a phrase is the number of documents that contain the phrase. A collection frequency threshold (e.g., five documents) can be set to trim down those phrases whose collection frequencies are below the threshold, as indicated at block 304. Essentially, FIG. 6C trims the entire Phrase List 340, including entries from either memory block 330 or 335.
When collection frequency information is available (as illustrated by memory block 380), the minimum frequency requirement of two encounters for the lower case phrases within a text (see FIG. 5) can be lowered to one encounter. "Mistaken" phrases will be rejected when consulting the collection frequency information when considering multiple documents.
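The collection frequency trimming of block 304 may be sketched as follows in Python. The function names are illustrative, and the representation of each document as a list of its phrases is an assumption of this sketch.

```python
from collections import Counter


def collection_frequencies(documents):
    """Count, for each phrase, the number of documents that contain it."""
    frequencies = Counter()
    for phrases in documents:
        frequencies.update(set(phrases))  # each document counts once per phrase
    return frequencies


def trim_by_collection_frequency(phrase_list, frequencies, threshold=5):
    """Drop phrases whose collection frequency is below the threshold (block 304)."""
    return {p: c for p, c in phrase_list.items()
            if frequencies.get(p, 0) >= threshold}
```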
FIG. 6D is a flow chart of an optional processing module for categorizing proper names.
Referring now to FIG. 6D, after proper names are identified and are stored in the Proper Name List 330, it is possible to categorize them into new sets of pre-defined groups, such as company names, geographic names, organization names, peoples' names, and product names.
A list 385A of company indicators (e.g., "Co." and "Limited") is used for determining whether the last word in a proper name is such an indicator, and thereafter for categorizing it into the group of company names. Any word after this indicator is removed from the proper name.
With the knowledge of the company name, it may be useful to check the existence of the same company name in the list that does not have the indicator word. If the search is successful, the frequency count of the company name is updated. The recognized company names are kept in a Company Names list 390A as an optional output, as indicated at block 305.
Similarly, a list 385B of geographic names or a list 385C of product names may be used for looking up whether a proper name has a match and thereafter for categorizing it into the list of geographic names or a list of product names, respectively. The recognized geographic names or product names are kept in Geographic Location Names 390B or Product Names 390C lists as optional outputs, as indicated at blocks 306 and 307.
A list 385D of words that designate organizations is used for determining whether the first or the last word of a proper name is the indicator of organization, and thereafter for categorizing it into the group of organizations. The recognized organization names may be kept in an Organization Names List 390D as an optional output, as indicated at block 308.
Finally, a list 385E of English first names is used for determining whether the first word of a proper name is a popular first name and thereafter for categorizing it into the group of peoples' names. Any word before the first name is removed from the proper name. The more comprehensive the lists are, the more people names can be categorized properly. The recognized people names are kept in a separate English Names list 390E as an optional output for other indexing activities, as indicated at block 309.
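The categorization of blocks 305 through 309 might be sketched as below in Python. The indicator sets are short, partly hypothetical excerpts (the geographic entries in particular are invented for the example), and the order in which the categories are tried is an assumption, since FIG. 6D is not reproduced here.

```python
# Abbreviated, partly hypothetical stand-ins for lists 385A-385E.
COMPANY_INDICATORS = {"CO", "CO.", "CORP.", "INC", "INC.", "LTD", "LTD.", "LIMITED"}
GEOGRAPHIC_NAMES = {"NORTHERN IRELAND", "MANHATTAN"}
PRODUCT_NAMES = {"KLEENEX", "LOTUS"}
ORGANIZATION_INDICATORS = {"ADMINISTRATION", "ARMY", "HOUSE", "PARTY"}
FIRST_NAMES = {"ALICE", "BILL", "BOB"}


def categorize(proper_name):
    """Return (category, cleaned name); category is None if nothing matches."""
    words = proper_name.split()
    if not words:
        return None, proper_name
    upper = [w.upper() for w in words]
    # Company: keep the name up to and including the indicator (block 305).
    for i, word in enumerate(upper):
        if i > 0 and word in COMPANY_INDICATORS:
            return "company", " ".join(words[: i + 1])
    # Geographic and product names: whole-name look-up (blocks 306, 307).
    if proper_name.upper() in GEOGRAPHIC_NAMES:
        return "geographic", proper_name
    if proper_name.upper() in PRODUCT_NAMES:
        return "product", proper_name
    # Organization: indicator as first or last word (block 308).
    if upper[0] in ORGANIZATION_INDICATORS or upper[-1] in ORGANIZATION_INDICATORS:
        return "organization", proper_name
    # Person: drop any word before the first name (block 309).
    for i, word in enumerate(upper):
        if word in FIRST_NAMES:
            return "person", " ".join(words[i:])
    return None, proper_name
```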
Appendices A through E present an exemplary Partition List 310 and exemplary Special Indicator/Name lists 385A-385E.
The inventive method having been described above, the invention also encompasses apparatus (especially programmable computers) for carrying out phrase recognition.
Further, the invention encompasses articles of manufacture, specifically, computer readable memory on which the computer-readable code embodying the method may be stored, so that, when the code is used in conjunction with a computer, the computer can carry out phrase recognition.
Non-limiting, illustrative examples of apparatus which the invention envisions are described above and illustrated in FIGS. 1 and 2. Each constitutes a computer or other programmable apparatus whose actions are directed by a computer program or other software.
Non-limiting, illustrative articles of manufacture (storage media with executable code) may include the disk memory 116 (FIG. 1), the disk memories 52-54 (FIG. 2), other magnetic disks, optical disks, conventional 3.5-inch, 1.44MB "floppy" diskettes or other magnetic diskettes, magnetic tapes, and the like. Each constitutes a computer readable memory that can be used to direct the computer to function in a particular manner when used by the computer.
Those skilled in the art, given the preceding description of the inventive method, are readily capable of using knowledge of hardware, of operating systems and software platforms, of programming languages, and of storage media, to make and use apparatus for phrase recognition, as well as computer readable memory articles of manufacture which, when used in conjunction with a computer, can carry out phrase recognition. Thus, the invention's scope includes not only the method itself, but apparatus and articles of manufacture.
Applications of the phrase recognition method. The phrase recognition method described above can be used in a variety of text searching systems. These include, but need not be limited to, dynamic phrase recognition in on-line systems, batch phrase recognition in a distributed development system, and batch phrase recognition in a mainframe system. The following description of the applications of the inventive phrase recognition method is illustrative, and should not limit the scope of the invention as defined by the claims.
In an on-line system (OLS) envisioned as a target application for the inventive phrase recognition method, a user viewing a current document and entering a command to search for documents of similar conceptual content must wait for the phrase recognition process to be completed. Accordingly, the efficiency of the inventive phrase recognition method is important, as it allows reduced response time and uses minimal resources in a time-sharing environment.
According to the application of the invention in a given on-line system, the method processes the text in a single document in real time to arrive at a list of "good" phrases, namely, ones which can be used as accurate and meaningful indications of the document's conceptual content, and which can be used as similarly accurate and meaningful queries in subsequent search requests. In particular, according to a preferred application, the Phrase List derived from the single document is used to construct a new search description to retrieve additional documents with similar conceptual content to the first document.
This implementation of the phrase recognition method may, for example, be embedded in session administrator (FIG. 2) or other software which governs operation of the computer system on which the phrase recognition method is executed. Of course, the particular implementation will vary with the software and hardware environment of the particular application in question.
FIG. 7A indicates a user's viewing of a document in accordance with a suitable on-line text search system, and invoking the inventive phrase recognition method to search for additional documents of similar conceptual content. In particular, block 701 assumes a user viewing a given document enters a command (such as ".more") to retrieve more documents similar in conceptual content to the current one being viewed.
When the ".more" command is entered, control passes to block 702 which indicates retrieval of the document being viewed and passing it to the session administrator or other software which includes the inventive phrase recognition software.
Block 703 indicates execution of the inventive phrase recognition method on the text in the retrieved document. A candidate phrase list is generated, based on that single document.
Block 704 indicates how the candidate phrase list generated from the single document may be validated against an existing (larger) phrase dictionary. The static phrase dictionary may be generated as described below, with reference to the batch phrase recognition application in a distributed development system.
If a candidate phrase does not already exist in the phrase dictionary, the candidate phrase is broken down into its component words. Ultimately, a list of surviving phrases is chosen, based on frequency of occurrence.
At decision block 705, if at least a given threshold number of words or phrases (e.g., five words or phrases) is extracted, control passes from decision block 705 to block 708, described below.
If, however, the given threshold number of words or phrases is not extracted, control passes from decision block 705 along path 706 back to block 701, after displaying an error message at block 707 which indicates that the displayed current document could not successfully be processed under the ".more" command.
Block 708 indicates that the newly-added words or phrases are added to the search query which previously resulted in the user's viewing the current document. Block 709 indicates the system's displaying the new "combined" search query to the user. The user may edit the new query, or may simply accept the new query by pressing "enter".
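The flow of blocks 704 through 708 can be pictured with the following Python sketch. The function name, the return of a plain list of terms rather than a query in any particular search syntax, and the ordering of terms by frequency are all assumptions of this example.

```python
from collections import Counter


def expand_query_terms(candidates, phrase_dictionary, min_terms=5):
    """Validate candidate phrases and pick terms to add to a query.

    candidates        -- candidate phrase list with frequencies (block 703)
    phrase_dictionary -- set of known phrases (block 704)
    Returns terms ordered by frequency, or None when fewer than `min_terms`
    terms are extracted (decision block 705, error path 706/707).
    """
    terms = Counter()
    for phrase, count in candidates.items():
        if phrase in phrase_dictionary:
            terms[phrase] += count
        else:
            for word in phrase.split():  # break unknown phrase into words
                terms[word] += count
    if len(terms) < min_terms:
        return None
    return [term for term, _ in terms.most_common()]
```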
FIG. 7B schematically indicates implementation of the phrase recognition method in a batch phrase recognition system in a distributed development system.
In contrast to the implementation of the on-line system of FIG. 7A, in the application shown in FIG. 7B, the phrase recognition method is applied to a large collection of documents, and produces a list of phrases associated with the entire collection.
As mentioned above with reference to FIG. 7A, the phrase dictionary may be generated by this batch recognition process in the "distributed development domain" (DDD) when there is an abundance of idle system resources. When the on-line system then uses the resultant phrase dictionary, the phrase dictionary is essentially static, having been generated and modified outside the on-line sessions.
The FIG. 7B application takes longer to execute than the single-document phrase recognition process occurring in the dynamic phrase recognition in the on-line application.
Accordingly, the FIG. 7B process is preferably executed as a batch process at times when overall system usage is not impaired, such as overnight. In particular, the software implementation of the phrase recognition/phrase dictionary building process may be implemented on SUN work stations.
As a background to FIG. 7B, a developer's control file defines which documents, and/or which portions of the documents, should be processed in a given run. Block 723 indicates a filtering process which filters out documents and portions of documents which are not desired to contribute to the phrase dictionary, based on the control file. Block 724 indicates application of the inventive phrase recognition method to the documents and portions of documents which have passed through filter process 723.
The output of the phrase recognition process is a phrase list (PL) which, in the illustrated non-limiting embodiment, is stored as a standard UNIX text file on disk. In a preferred embodiment, single-word terms which are encountered are discarded, so that only multiple word phrases are included in the phrase list (PL).
For simplicity, each phrase is provided on a single line in the file. Block 725 indicates how the UNIX file is sorted using, for example, the standard UNIX sort utility, causing duplicate phrases to be grouped together. Block 725 also calculates the frequency of each of the grouped phrases.
If a given phrase occurs fewer than a given threshold number of times (e.g., five times as tested by decision block 727) it is discarded, as indicated by decision block 726.
Only phrases which have been encountered at least that threshold number of times survive to be included in the revised Phrase List, as shown in block 728.
The revised Phrase List is then transferred from the SUN work station to its desired destination for use in, for example, the on-line system described above. It may also be transferred to a mainframe computer using a file transfer protocol FTP, to be processed by a phrase dictionary building program and compiled into a production phrase dictionary.
This process is illustrated as block 729.
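The sorting, grouping and thresholding described for the batch phrase list may be approximated by the following Python sketch, given here in place of the UNIX sort utility named above. The function name and the representation of the file as a sequence of lines are assumptions of this example.

```python
from itertools import groupby


def revise_phrase_list(lines, threshold=5):
    """Sort phrases, group duplicates, and keep those meeting the threshold.

    lines -- one phrase per line, as in the phrase list (PL) file
    Returns (phrase, frequency) pairs in sorted order.
    """
    phrases = sorted(line.strip() for line in lines if line.strip())
    revised = []
    for phrase, group in groupby(phrases):   # duplicates are adjacent after sorting
        frequency = sum(1 for _ in group)
        if frequency >= threshold:
            revised.append((phrase, frequency))
    return revised
```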
Referring now to FIG. 7C, the application of the inventive phrase recognition method on a mainframe system is schematically illustrated. In the illustrated application, the phrase recognition method is implemented as a batch process in a production mainframe. The process involves a random sample of documents from a larger collection of documents, and produces a set of phrases for each document processed. The process is preferably executed when system resources are not otherwise in high demand, such as overnight. The process of FIG. 7C is especially useful for use with statistical thesauri.
As a background, it is assumed that phrases may be considered to be "related" to each other if they occur in the same document. This "relationship" can be exploited for such purposes as expanding a user's search query. However, in order to provide this ability, a large number of documents must first be processed.
Referring again to FIG. 7C, block 733 indicates the filtering of documents and portions thereof in accordance with specifications from a control file, in much the same manner as described with reference to FIG. 7B. Block 734 indicates the application of the inventive phrase recognition method to the documents which pass the filter. One set of terms (single words, phrases or both) is produced for each document and stored in respective suitably formatted data structure on a disk or other storage medium.
Further details of implementation of the applications of the inventive phrase recognition method depend on the particular hardware system, software platform, programming languages, and storage media being chosen, and lie within the ability of those skilled in the art.
The following Appendices are exemplary, illustrative, non-limiting examples of a Partition List and other lists which may be used with an embodiment of the phrase recognition method according to the present invention.
APPENDIX A
Example of PARTITION LIST
(On-Line System with News Data)
© Copyright 1995 LEXIS-NEXIS, a Division of Reed Elsevier Inc.
A BEING FRIDAY I'M
A.M BELOW FROM I'VE
ABOUT BETWEEN GET I.E
ABOVE BOTH GO IF
ACROSS BUT GOT IN
AFFECT BY HAD INTO
AGAIN COULD HARDLY IT
AGO DEC HAS ITS
ALL DECEMBER HAVE ITSELF
ALREADY DID HAVING JAN
ALTHOUGH DOE HENCE JUL
ALWAY DUE HER JULY
AN DURING HER'N JUN
AND E.G HERE JUNE
ANY EIGHT HEREBY LIKE
ANYBODY EIGHTEEN HEREIN MANY
ANYMORE ~l l ll~K HEREINAFTER MAR
ANYONE ELEVEN HEREINSOFAR MARCH
APR EVENTUALLY HEREON ME
APRIL EVER HERETO MIGHT
ARE EVERYBODY HEREWITH MINE
AROUND EVERYMAN HERN MON
ASIDE EVERYTHING HIC MORE
ASK EXCEPT HIM MUCH
AT FEB HIMSELF MUST
AUG FEBRUARY HIS MY
AUGUST FEW HIS'N MYSELF
AWAY FEWER HISSELF N.S
BAITH ~ HOC NANE
BE FIVE HOW NEITHER
BECAME FOR HOWEVER NEVERTHELESS
BEEN FOURTEEN I'D NINETEEN
BEFORE FRI I'LL NO
NOBLEWOMAN SEVEN THIRTEEN WHEREBY
NOBODY SEVENTEEN THIS WHEREEVER
NONE SEVERAL THOSE WHEREIN
NOR SHE THOU WHETHER
NOV SINCE THREE WHICHEVER
NOVEMBER SIR THROUGH WHICHSOEVER
NOW SIX THUR WHILE
O S ~ ;N THURSDAY WHO
OCTOBER SOME THY WHOM
OF SOMEBODY THYSELF WHOMSOEVER
OFF SOMEONE TILL WHOSE
OFTEN SOMETHING TO WHOSESOEVER
ONE SOMEWHERE TOMORROW WHOSOEVER
ONESELF SOONER TOO WHY
ONLY STILL TUE WILL
ONTO SUCCUSSION TUESDAY WITH
OTHER SUN TWENTY WITHOUT
OTHERWISE SUNDAY TWO WOULD
OUGHT TAKE UN YA
OUR TEN UNDER YE
OUR'N THAE UNLESS YES
OURSELF THAN UNTIL YESTERDAY
OURSELVE THAT UNTO YET
OUT THE UP YON
OVER THEE UPON YOU
P.M THEIR US YOUR
PERHAP THEIRSELF USE YOUR'N
QUIBUS THEIRSELVE VERY YOURSELF
QUITE THEM VIZ 1ST
REALLY THEN WE 11TH
SATURDAY THEREFROM WHATE'ER 16TH
SEE THEREOF WHATSOE'ER 18TH
SEEMED THEREON WHATSOEVER 19TH
APPENDIX B
Example of COMPANY INDICATOR LIST
© Copyright 1995 LEXIS-NEXIS, a Division of Reed Elsevier Inc.
BROS
BROS.
BROTHERS
CHARTERED
CHTD
CHTD.
CL.
CO
CO.
COMPANY
CORP.
CORPORATION
CP
CP.
GP
GP.
GROUP
INC
INC.
INCORP
INCORP.
INCORPORATED
INE
INE.
LIMITED
LNC
LNC.
LTD
LTD.
APPENDIX C
Example of PRODUCT NAME LIST
© Copyright 1995 LEXIS-NEXIS, a Division of Reed Elsevier Inc.
240sx Kleenex Taurus
300sx L.O.C. Tide
4-Runner Lexus Toshiba
7Up Linux Tums
Access Lotus Tylenol
Adobe Magnavox Windex
Altima Maxima Windows
Arid Mercedes Yashika
Avia Minolta Zoom
B-17 Mitsubishi
B17 Mustang
BMW Nike
Bayer Nikon
Blazer OS2
Bounty Oracle
Camary P100
Cannon P120
Chevy P133
Cirrus P60
Coke P75
Converse P90
Corvette Paradox
Etonic Pepsi
Excel Preparation-H
F-14 Puffs
F-15 Puma
F-16 Quicken
F-18 Rave
F-22 Reebok
F14 Rolaids
F16 Sable
F18 Sentra
F22 Seven-Up
Infinity Solaris
Ingres Sony
JVC Sprite
Jaguar Suave
Jeep Sun
Keds Sybase
APPENDIX D
Example of ORGANIZATION INDICATOR LIST
© Copyright 1995 LEXIS-NEXIS, a Division of Reed Elsevier Inc.
ADMINISTRATION SCHOOL
AGENCY SENATE
ARMY SOCIETY
ASSEMBLY TEAM
BOARD UNIVERSITY
BUREAU
CENTER
CHURCH
CLUB
COLLEGE
COMMISSION
COMMITTEE
CONGRESS
COUNCIL
COURT
CULT
DEPT
FACTION
FEDERATION
FOUNDATION
GUILD
HOSPITAL
HOUSE
INDUSTRY
LEAGUE
MEN
ORGANIZATION
PARLIAMENT
PARTY
REPUBLIC
APPENDIX E
Example of ENGLISH FIRST-NAME LIST
© Copyright 1995 LEXIS-NEXIS, a Division of Reed Elsevier Inc.
AARON ADRIANNE ALEXANDRINA ALPHONSUS
ABAGAIL ADRIEN ALEXEI ALTA
ABBIE ADRIENNE ALEXI ALTHEA
ABBY AERIEL ALEXIA ALTON
ABE AGATHA ALEXIS ALVA
ABEGAIL AGGIE ALF ALVAH
ABELARD AGNES ALFIO ALVIN
ABIGAIL AGNETA ALFORD ALYCE
ABNER AGUSTIN ALFRED AMALIA
ABRAHAM AHARON ALFREDA AMANDA
ACIE AILEEN ALGERNON AMBER
ACY AILEENE ALICE AMBROSE
ADA AILENE ALICIA AMBROSIA
ADAH AIME ALINE AMBROSIUS
ADALBERT AINSLEE ALISHA AMIE
ADALINE AINSLEY ALISON AMILE
ADAM AJA ALIX AMITY
ADDAM AL ALLAN AMON
ADDY ALAINE ALLEEN AMY
ADELA ALAN ALLEGRA ANA
ADELAIDE ALANAH ALLEN ANABEL
ADELBERT ALANNA ALLENE ANABELLE
ADELENE ALBERT ALLIE ANASTASIA
ADELINE ALBERTA ALLISON ANATOLY
ADELLA ALBIN ALLOYSIUS ANCIL
ADELLE ALDO ALLY ANDIE
ADNA ALEC ALMA ANDREAS
ADOLF ALECIA ALMETA ANDREE
ADOLPH ALECK ALMIRA ANDREI
ADOLPHUS ALENE ALMON ANDREJ
ADRIAN ALEXANDER ALOYSIUS ANDY
ADRIANE ALEXANDRA ALPHA ANETTA
ANETTE ARIC AUD BARNY
ANGELA ARICA AUDEY BARRETT
ANGELICA ARIEL AUDIE BARRY
ANGELINA ARISTOTLE AUDINE BART
ANGELIQUE ARLEEN AUDREY BARTON
ANGIE ARLEN AUDRIE BASIL
ANGUS ARLENE AUDRY BAYARD
ANITA ARLIE AUDY BEA
ANNA ARLINE AUGUST BEATRIX
ANNABEL ARLO AUGUSTINE BEAUREGARD
ANNABELLE ARMAND AUGUSTUS BEBE
ANNALEE ARMIN AURELIA BECCA
ANNELIESE ARNE AUSTEN BEE
ANNELISE ARNETT AUSTIN BELINDA
ANNEMARIE ARNEY AUTHER BELLA
ANNETTA ARNIE AUTRY BELLE
ANNICE ARON AVA BENEDICT
ANNIE ART AVERY BENJAMIN
ANNINA ARTE AVIS BENJI
ANNMARIE ARTEMIS AVITUS BENNETT
ANSELM ARTHUR AVRAM BENNO
ANSON ARTIE AXEL BENNY
ANTHONY ARTIS AZZIE BENTLEY
ANTOINE ARTY AZZY BERKE
ANTOINETTE ARVELL BABETTE BERKELEY
ANTON ARVIE BAILEY BERKELY
ANTONE ARVO BAIRD BERKLEY
ANTONETTE ARVON BALTHAZAR BERLE
ANTONI ASA BAMBI BERNARD
ANTONIO ASHER BARBARA BERNETTE
ANTONY ASHLEIGH BARBEE BERNHARD
AP ASHLEY BARBI BERNICE
APOLLO ASTER BARBIE BERNIE
ARA ASTRID BARNABAS BERRY
ARAM ATHENA BARNABUS BERT
ARBY ATHENE BARNABY BERTHA
ARCH ATTILIO BARNARD BERTHOLD
ARCHIE AUBRIE BARNETT BERTRAM
ARETHA AUBRY BARNEY BERTRAND
BERTRUM BRAINARD BUFORD CARLISLE
BERYL BRAINERD BUNNIE CARLTON
BESS BRANDI BUNNY CARLY
BESSIE BRANDY BURL CARLYLE
BETSEY BREK BURNETTA CAROL
BETSIE BRENARD BURNICE CAROLA
BETSY BRENDA BURREL CAROLANN
BETTE BRENDAN BURT CAROLE
BETTY BRET BURTRAM CAROLINE
BETTYE BRETT BUSTER CAROLYN
BEULAH BRIAN BUTCH CAROLYNN
BEVERLEE BRICE BYRON CARREN
BEVERLY BRIDGETT CAITLIN CARRIN
BEWANDA BRIDGETTE CAL CARROLL
BIFF BRIDIE CALE CARSON
BILL BRIGIT CALEB CARY
BILLY BRIJITTE CALLIE CARYN
BIRD BRITNY CALLY CAS
BJARNE BRITTANY CALVIN CASEY
BJORN BRITTNEY CAM CASI
BLAINE BROCK CAMERON CASPER
BLAIR BRODERICK CAMILE CASS
BLAKE BROOKE CAMILLA CASSANDRA
BLANCA BROOKS CAMILLE CASSIE
BLANCHE BRUNHILDA CANDI CATHARINE
BOB BRUNHILDE CANDICE CATHERINE
BOBBI BRUNO CANDIS CATHLEEN
BOBBIE BRYAN CANDUS CATHLENE
BONNIE BRYCE CANNIE CATHRYN
BONNY BRYON CARA CATHY
BOOKER BUBBA CAREN CEASAR
BORIS BUCK CAREY CEATRICE
BOYD BUD CARIN CECIL
BRACIE BUDDIE CARL CECILE
BRACK BUDDY CARLA CECILIA
BRAD BUEL CARLEEN CECILY
BRADLEY BUFFIE CARLETON CEFERINO
BRADLY BUFFY CARLINE CELESTE
CELESTINE CHRISTOPH CLEVON CORTNEY
CELIA CHRISTOPHER CLIFF CORY
CELINA CHRISTOS CLIFFORD COSMO
CESAR CHRISTY CLIFT COUNTEE
CHADWICK CHUCK CLINT COURTNEY
CHAIM CHUMLEY CLINTON COY
CHANCY CICELY CLIO CRAIG
CHANDLER CICILY CLITUS CRIS
CHARLEEN CINDY CLOVIA CRISPUS
CHARLENE CLAIR CLOVIS CRISSIE
CHARLES CLAIRE CLOYD CRISSY
CHARLESE CLARA CLYDE CRISTABEL
CHARLEY CLARE COLBERT CRYSTAL
CHARLIE CLARENCE COLE CURLESS
CHARLINE CLARICE COLEEN CURLY
CHARLISE CLARINA COLETTE CURT
CHARLOTTE CLARK COLITA CY
CHARLTON CLASSIE COLLEEN CYBIL
CHAS CLAUD COLLETTE CYBILL
CHASTITY CLAUDE COLLIN CYNDI
CHAUNCEY CLAUDELLE COLON CYNDY
CHELSIE CLAUDETTE CONNIE CYNTHIA
CHER CLAUDIA CONNY CYRIL
CHERI CLAUDINE CONRAD CYRILL
CHERIE CLAUDIUS CONROY CYRILLA
CHESTER CLAY CONSTANTIA DABNEY
CHET CLAYMON COOKIE DACIA
CHIP CLAYTON CORA DACIE
CHLOE CT FTO CORABELLE DAGMAR
CHRIS CLEMENT COREY DAISEY
CHRISSIE CLEMENTINE CORINE DAISY
CHRISSY CLEMENZA CORINNE DALE
CHRISTA CLENELL CORKIE DALTON
CHRISTABELLE CLEOPHUS CORNEAL DAMIEN
CHRISTAL CLEOTHA CORNELIA DAMION
CHRISTIAAN CLEOTIS CORNELIUS DAMON
CHRISTIAN CLETA CORRIE DAN
CHRISTIE CLETUS CORRINE DAN'L
CHRISTINE CLEVE CORRINNE DANA
CHRISTOFER CLEVELAND CORRY DANIEL
DANIELLA DEBBY DERL DOMENIC
DANIELLE DEBORA DERMOT DOMENICK
DANNA DEBORAH DERMOTT DOMER
DANNY DEBRA DERRALL DOMINIC
DANUTA DEDIE DERRICK DOMINICKA
DAPHNE DEE DERRY DOMINIQUE
DARBIE DEEANN DERWOOD DON
DARBY DEEANNE DESDEMONA DONALD
DARCEY DEIDRE DESIRE DONELLE
DARCI DEL DESIREE DONICE
DARCIE DELAINE DESMOND DONIS
DARCY DELANE DESMUND DONNA
DARIO DELBERT DEVORAH DONNELLE
DARIUS DELIA DEWANE DONNIE
DARLA DELL DEWAYNE DONNY
DARLEEN DELLA DEWEY DONOVAN
DARLINE DELMA DEXTER DORCAS
DARLYNE DELMAR DEZ DORCE
DARNELL DELMAS DIAHANN DOREEN
DAROLD DELMO DIANA DORI
DARREL DELNO DIANE DORIAN
DARRELL DELORES DIANNA DORIE
DARREN DELORIS DIANNE DORIENNE
DARRIN DELOY DICK DORINE
DARRYL DELTA DICKEY DORIS
DARYL DEMETRIUS DIDI DOROTHEA
DASHA DENARD DIEDRE DOROTHY
DAVE DENE DIERDRE DORRANCE
DAVEY DENICE DIETER DORRIS
DAVIDA DENIS DIMITRI DORTHIE
DAVIE DENISE DINA DORTHY
DAVY DENNIE DINAH DOSHIE
DAWN DENNIS DINO DOT
DEANDRA DENNYS DIRK DOTTY
DEANE DENORRIS DIXIE DOTY
DEANNA DEO DMITRI DOUG
DEANNE DEON DOLLIE DOUGIE
DEBBI DEREWOOD DOLORES DOUGLASS
DEBBIE DERICK DOM DOY
DOYLE EDISON ELISHA EMETT
DREW EDITA ELISSA EMIL
DRU EDITH ELIUS EMILE
DUAIN EDMOND ELIZA EMILIE
DUANE EDNA ELIZAR EMMA
DUB EDRIS ELKE EMMALINE
DUDLEY EDSEL ELLA EMMERY
DUEL EDUARD ELLE EMMET
DUFF EDWIN ELLERY EMMIE
DUFFY EDWINA ELLIE EMMOT
DUGALD EDY ELLIET EMMOTT
DUKE EDYTH ELLIOT EMMY
DULSA EFFIE ELLIS EMORY
DUNCAN EFRAIM ELLY ENDORA
DURWARD EGBERT ELLYN ENDRE
DURWOOD EGIDIO ELMA ENGELBERT
DUSTIN EILEEN ELMER ENID
DUSTY ELA ELMIRA ENISE
DWAIN ELAINE ELMO ENNIS
DWAINE ELAYNE ELNOR ENOCH
DWAYNE ELBA ELNORA ENOLA
DYLAN ELBERTA ELOISE EPHRAIM
DYNAH ELDA ELOUISE EPHRIAM
EARL ELDINE ELOY ERASMUS
EARLE ELDON ELRIC ERBIN
EARLINE ELE ELSA ERICA
EARNEST ELEANOR ELSBETH ERICH
EARNESTINE ELEANORA ELSIE ERICK
EARTHA ELEANORE ELTON ERIK
EBEN ELENA ELVERT ERIN
EBE~:G~K ELENORA ELVIE ERLAND
EBENEZER ELENORE ELVIN ERLE
EBERHARD ELEONORA ELVIRA ERMA
ED ELIAS ELVON ERNEST
EDD ELIC ELWOOD ERNESTINE
EDDIE ELIJAH ELY ERNESTO
EDDY ELINORE ELZA ERNIE
EDGER ELISABETH EMERIC ERROL
EDIE ELISE EMERY ERVAN
ERVEN EWIN FLOSSY GAGE
ERVIN EZEKIEL FLOYD GAIL
ERWIN EZRA FONDA GALE
ESAU FABIAN FONTAINE GALLON
ESMERELDA FABIEN FORD GARETH
ESTA FAIRLEIGH FORREST GARLAND
ESTEL FAITH FRAN GARNET
ESTELA FANNIE FRANCES GAROLD
ESTELLA FANNY FRANCIA GARRET
ESTER FARLEIGH FRANCIS GARRIE
ESTHA FARLEY FRANCOIS GARRY
ESTHER FARRAH FRANCOISE GARTH
ETHAN FARREL FRANK GARVIN
ETHELENE FARRIS FRANKLIN GASTON
ETHELINE FATIMA FRANKLYN GAVIN
ETHYL FAUN FRANKY GAY
ETIENNE FAWN FRANNIE GAYE
ETTIE FAYE FRANZ GAYLORD
EUDORA FELECIA FRANZI GEARY
EUFA FELICIA FRANZIE GEMMA
EUGENE FELICITY FRANZY GENA
EUGENIE FELIZ FREDA GENEVA
EUGENIO FERD FREDDIE GENEVIEVE
EULA FERDINAND FREDDY GENIE
EULALEE FERGUS FREDERICH GENNARO
EULOGIO FERREL FREDERIK GENNIFER
EUNACE FERRELL FREEMAN GENNY
EUNICE FERRIS FREIDA GENO
EUPHEMIA FIDELE FRIEDA GEO
EVA FILBERT FRITZ GEOFFREY
EVALEE FILIPPO FRONA GEORGE
EVAN FIONA FYODOR GEORGES
EVANDER FITZ GABBEY GEORGIA
EVE FLO GABE GEORGIE
EVELYN FLOR GABRIEL GEORGINA
EVERETT FLORA GABRIELE GERALD
EVERETTE FLORANCE GABRIELLE GERALDINE
EVITA FLORIDA GAEL GERD
EWALD FLOSSIE GAETANO GERDINE
GERHARD GLENN GUNTER HANNE
GERI GLENNA GUNTHER HANNES
GERMAIN GLENNIE GUS HANNIBAL
GERMAINE GLENNIS GUSSIE HANS
GEROLD GLENNON GUSSY HANSELL
GEROME GLORIA GUST HARLAN
GERRIE GLYN GUSTAF HARLEN
GERRIT GLYNDA GUSTAV HARLEY
GERRY GLYNIS GUSTAVE HARLIE
GERTA GLYNNIS GUSTOV HARMIE
GERTIE GODFREY GUY HARMON
GERTRUDE GODFRY GWEN HAROL
GEZA GODWIN GWENDA HAROLD
GIACOMO GOLDIE GWENDEN HARRIET
GIDEON GOLDY GWENDLYN HARRIETT
GIFFORD GOMER GWENDOLA HARRIETTA
GIGI GORDAN GWENDOLEN HARRIS
GILBERT GOTTFRIED GWENDOLY HARROLD
GILDA GRACE GWENDOLYN HARRY
GILES GRACIA GWENDOLYNE HARVEY
GILLIAN GRACIE GWENDY HARVIE
GINA GRAEME GWENETTA HATTIE
GINGER GRAHAM GWENETTE HATTY
GINNI GRANT GWENITH HAYDEE
GINNIE GRAYCE GWENN HAYDEN
GINO GREGG GWENNETH HAYLEY
GIORA GREGGORY GWENYTH HAYWOOD
GIOVANNA GREGORY GWYEN HAZEL
GIOVANNI GRETA GWYLLEN HEATHCLIFF
GISELLA GRETEL GWYNETH HECTOR
GISELLE GRIFF GWYNNE HEDDA
GIULIA GRIFFIN GYULA HEDDIE
GIUSEPPE GRIFFITH HAILEY HEDWIG
GIUSEPPINA GUENTHER HALLIE HEINRICH
GLADIS GUERINO HALLY HEINZ
GLADYCE GUIDO HAMISH HELAINE
GLADYS GUILLERMINA HAMPTON HELEN
GLENDA GUISEPPE HANNA HELENE
GLENDON GUNNER HANNAH HELGA
HELMUT HIRAM IKE ITA
HELMUTH HIRSCH ILA IVA
HELOISE HOBART ILAH IVAN
HENDRIK HOBERT ILAH IVANA
HENDRIKA HOLLEY ILENA IVAR
HENNIE HOLLI ILENE IVARS
HENNY HOLLIE ILSE IVETTE
HENRI HOLLIS IMELDA IVEY
HENRIETA HOLLISTER IMOGENE IVIE
HENRIETTE HOLLYANN INGA IVONNE
HENRIK HOLLYANNE INGE IVOR
HENRY HOMER INGGA IVORY
HENRYK HONEY INGRAM IVY
HERBERT HOPE IOLA IZZIE
HERBIE HORACE IONA IZZY
HERBY HORST IONE JAC
HERCULE HORTENSE IRA JACK
HERM HORTON IRENA JACKIE
HERMA HOSEA IRENE JACKSON
HERMAN HOWARD IRIS JACKY
HERMANN HOWIE IRL JACOB
HEROLD HUBERT IRVIN JACQUELIN
HERSCH HUEL IRVING JACQUELINE
HERSCHEL HUEY IRWIN JACQUELYN
HERSHEL HUGH ISAAC JACQUES
HESTHER HULDA ISABEL JAKE
HETTA HULDAH ISABELLA JAKOB
HETTIE HUMPHREY ISABELLE JAMES
HETTLE HUNTINGTON ISAC JAMEY
HEYWOOD HY ISADORA JAMYE
HEZEKIAH HYACINTH ISADORE JAN
HILARY HYMAN ISAIAH JANA
HILDA IAIN ISHAM JANE
HILDEGARD ICHABOD ISIAH JANEL
HILDEGARDE IDA ISIDOR JANELL
HILDRED IGGY ISIDORE JANELLE
HILDUR IGNATIUS ISMAEL JANET
HILLARY IGNAZ ISRAEL JANIE
HILLERY IGOR ISTVAN JANINE
JANIS JERMAIN JOETTE JUDY
JANISE JERMAINE JOEY JULEE
JANNA JEROL JOFFRE JULES
JAQUELYN JEROLD JOH JULIA
JARRETT JERRALD JOHANNES JULIANE
JARRYL JERRED JOHN JULIANNE
JARVIS JERRELD JOHNIE JULIE
JAS JERRELL JOHNNA JULIEN
JASON JERRIE JOHNNY JULIET
JASPER JERROLD JOJO JULIETTE
JAY JERRY JOLENE JULIUS
JAYME JERZY JON JUNE
JAYNE JESSE JONAS JUNIE
JEAN JESSICA JONATHAN JUNIOR
JEANETTE JESSIE JONATHON JUNIUS
JEANIE JETHRO JONELL JURGEN
JEANNETTE JETTIE JONNY JUSTICE
JEANNINE JEWEL JORDAN JUSTIN
JEBEDIAH JEWELL JORDON JUSTINE
JED JILL JOSEA KALVIN
JEFEREY JIM JOSEPH KAREN
JEFF JIMBO JOSEPHA KARIN
JEFFREY JIMME JOSEPHINE KARL
J~ JIMMIE JOSEY KARLA
JEMIMA JINNY JOSHUA KASEY
JEMMA JO JOSIAH KASPAR
JEMMIE JOAB JOSIE KASPER
JEMMY JOACHIM JOY KATE
JENNIFER JOANN JOYCEANN KATEY
JENNY JOANNA JOZEF KATHERINE
JENS JOANNE JOZSEF KATHERYN
JERALD JOB JUBAL KATHLEEN
JERE JOCILYN JUDAH KATHY
JEREMIAH JOCLYN JUDAS KATIE
JEREMIAS JODI JUDD KATY
JEREMY JODIE JUDE KAY
JERI JODY JUDI KAYE
JERIANE JOE JUDIE KAYLEEN
JERIE JOEL JUDITH KEENAN
KEISHA KORNELIA LAURENCE LENNIE
KEITH KORNELIUS LAURETA LENNY
KELLEY KRAIG LAURETTA LENORA
KELLI KRIS LAURETTE LENORE
KELLIE KRISTA LAURI LENWOOD
KELLY KRISTEN LAURIE LEO
KELSEY KRISTI LAURIEN LEOLA
KELVIN KRISTIE LAVELL LEON
KEN KRISTIN LAVERA LEONA
KENDELL KRISTOFFER LAVERNA LEONID
KENDRICK KRISTY LAVERNE LEONIDA
KENNETH KURT LAVINA LEONIDAS
KENNY KYLE LAVINIA LEONORA
KERI LACEY LAWRENCE LERA
KERMIT LACIE LDA LEROY
KERRIE LACY LEA LES
KERRY LADON LEAH LESLIE
KERVIN LALAH LEANE LETA
KERWIN LAMAR LEANN LETHA
KEVIN LAMARTINE LEANNE LETICIA
KIERSTEN LAMBERT LEATHA LETITIA
KILLIAN LANA LEEANN LETTY
KIM LANCE LEEANNE LEVERNE
KIMBER LANCELOT LEENA LEVERT
KIMBERLEE LANE LEESA LEVI
KIMBERLEY LARA LEFTY LEW
KIMBERLY LARENE LEIF LEWELL
KIP LARONE LEIGH LEWIS
KIRBY LARRIS LEILA LEXIE
KIRSTEN LARS LELAH LIANE
KIRSTI LASKA LELAND LIBBIE
KIRSTIE LASLO LELIA LIBBY
KIRSTIN LASZLO LELIO LIDA
KIT LATIMER LEMUEL LILA
KITTIE LAUANNA LEN LILAC
KITTY LAUNCIE LENA LILAH
KLAUS LAURA LENARD LILE
KONSTANTINOS LAUREL LENETTE LILIEN
KOOS LAUREN LENISE LILITH
LILLIA LONZIE LUCILLE MADALYN
LILLIAN LONZO LUCINDA MADDIE
LILLIE LONZY LUCIUS MADDY
LILLY LORA LUCRETIA MADELAINE
LIN LORAINE LUDLOW MADELENE
LINCOLN LORANE LUDWIG MADELINE
LINDA LORAY LUDWIK MADELYN
LINDSAY LORAYNE LUDY MADGE
LINK LOREN LUGENE MADONNA
LINNEA LORENA LUIGI MAE
LINNIE LORENE LUKE MAGDA
LINNY LORETA LULA MAGDALENA
LINUS LORI LULU MAGDALINE
LINVAL LORIN LUMMIE MAGGIE
LINWOOD LORNA LUNA MAGGY
LINZIE LORNE LURLEEN MAGNUS
LIONEL LORRAYNE LURLINE MAHALIA
LISA LOTHAR LUTHER MAIA
LISABETH LOTTIE LUZ MAIBLE
LISE LOU LUZERNE MAIJA
LIZ LOUIE LYLE MAL
LIZA LOUIS LYMAN MALCOLM
LIZABETH LOUISA LYN MALINDA
LIZZIE LOUISE LYNDA MALISSA
LLEWELLYN LOURETTA LYNN MALLORIE
LLOYD LOVELL LYNNA MALLORY
LLOYDA LOVETTA LYNNE MALORIE
LOELLA LOVETTE LYNNETTE MALORY
LOIS LOY LYSANDER MAME
LOLA LOYAL M'LINDA MAMIE
LOLETA LUANN MABEL MANDEE
LOLITA LUANNA MABELLE MANDI
LOLLY LUBY MAC MANDY
LON LUCAS MACE MANFRED
LONI LUCIAN MACIE MANICE
LONNA LUCIE MACK MANLEY
LONNY LUCILE MADALEINE MANNY
LONSO LUCILLA MADALINE MANSEL
MARABEL MARKIE MATTIE MELYNDA
MARC MARKOS MATTY MENDEL
MARCEL MARKUS MATTYE MERCEDEL
MARCELIN MARLA MAUD MERCEDES
MARCELL MARLENA MAUDE MERCY
MARCELLA MARLENE MAURA MEREDETH
MARCELLE MARLEY MAUREEN MEREDITH
MARCELLUS MARLIN MAURENE MERIDETH
MARCI MARLON MAUREY MERIDITH
MARCIE MARNEY MAURIE MERLE
MARCUS MARNIE MAURINE MERLIN
MARCY MARNY MAURY MERLYN
MARDA MARSDEN MAVIS MERREL
MARGE MARSHAL MAXCIE MERRILL
MARGEAUX MARSHALL MAXCINE MERRY
MARGERY MARTA MAXIE MERVIN
MARGI MARTEA MAXIM MERWIN
MARGO MARTI MAXIMILLIAN MERYL
MARGOT MARTICA MAXINE META
MARGRET MARTIE MAXWELL MEYER
MARGY MARTIKA MAY MIA
MARIAM MARTILDA MAYBELLE MIATTA
MARIAN MARTIN MAYDA MICAH
MARIANN MARTY MAYME MICHAEL
MARIANNE MARV MAYNARD MICHEL
MARIBEL MARVA MEAGAN MICHELE
MARIBELLE MARVIN MEG MICHELLE
MARIBETH MARY MEGAN MICK
MARIE MARYAM MEL MICKEY
MARIEL MARYANN MELANIE MICKIE
MARIETTA MARYANNE MELANY MICKY
MARILEE MARYFRANCES MELICENT MIKAEL
MARILYN MARYLOU MELINDA MIKAL
MARILYNN MASON MELISSA MIKE
MARINA MATE MELLICENT MIKEAL
MARION MATHEW MELODIE MILDRED
MARISSA MATHIAS MELODY MILES
MARIUS MATHILDA MELONIE MILICENT
MARJORIE MATILDA MELONY MILLARD
MARK MATTHEW MELVIN MILLIE
MARKEE MATTHIAS MELVYN MILLY
MILO MYRAH NERSES NORBERT
MILT MYRAL NESTOR NOREEN
MILTON MYREN NETTIE NORM
MIMI MYRNA NEVILLE NORMA
MINERVA MYRTLE NEWELL NORMAN
MINNIE NADENE NEWT NORRIS
MIRANDA NADIA NEWTON NORTON
MIRIAM NADINE NICHAEL NORVAL
MISSY NAMIE NICHOLAS NYLE
MISTY NAN NICK OBADIAH
MITCH NANCI NICKI OBED
MITCHEL NANCIE NICKIE OBEDIAH
MITZI NANETTE NICKY OCTAVE
MOE NANI NICODEMO OCTAVIA
MOLLIE NANNETTE NICODEMUS ODEL
MOLLY NANNI NICOL ODELL
MONICA NANNY NICOLAI ODIE
MONIQUE NAOMA NICOLAS ODIS
MONTE NAOMI NICOLE OGDEN
MONTGOMERY NAPOLEON NICOLETTE OKTAVIA
MONY NATALIE NICOLO OLAF
MORDECAI NATASHA NIGEL OLAN
MOREY NATASSIA NIKA OLEG
MORGAN NATHAN NIKE OLEN
MORRIS NATWICK NIKITA OLIN
MORT NAZARETH NIKITAS OLIVE
MORTIMER NEAL NIKKI OLIVER
MORTON NEALY NILE OLIVIA
MOSE NEDINE NILS OLLIE
MOSES NEIL NIMROD OLOF
MOZELLE NEILL NINA OMER
MULLIGAN NELDA NOAH ONAL
MURPHY NELLE NOEL ONNIK
MURRAY NELLIE NOLA OPAL
MURRELL NELLY NOLAN OPEL
MURRY NELS NONA OPHELIA
MYNA NENA NORA ORA
MYRA NERO NORAH ORAL
ORAN PATRICE PHINEAS RAYFORD
OREN PATRICIA PHOEBE RAYMOND
ORESTE PATRICK PHYLLIS RBT
ORIN PATSY PIA REAGAN
ORLAN PATTI PIER REBA
ORLEN PATTIE PIERCE REBECCA
ORLIN PATTY PIERINA RECTOR
ORLYN PAUL PIERRE REED
ORPHA PAULA PIERS REGAN
ORTON PAULINA PIETRO REGGY
ORTRUD PAULINE PLATO REGINA
ORVAL PAVEL POINDEXTER REGINALD
ORVID PEARCE POLLIE REGIS
ORVILLE PEARLENE PORTIA REINHARDT
OSBERT PEARLIE PRECY REINHOLD
OSCAR PEARLINE PRESTON REMI
OSGOOD PEARLY PRINCE REMO
OSSIE PEG PRISCILLA RENA
OSWALD PEGGI PRUDENCE RENATA
OTHA PEGGIE PRUE RENATE
OTHIS PEGGY PRUNELLA RENE
OTTIS PENNI QUEENIE RETA
OTTO PENNIE QUEENY RETHA
OVE PENNY QUENTIN REUBAN
OVETA PER QUINCY REUBEN
OZZIE PERCY RACHEL REUBIN
PADDIE PERRY RACHELLE REUBINA
PADDY PERSIS RAE REVA
PAGE PETE RAIFORD REX
PAM PETRA RAMONA REY
PAMELA PETRO RANDAL REYNALD
PANSY PETROS RANDALL REYNOLD
PAOLO PHEBE RANDI RHEA
PARNELL PHIDIAS RANDOLF RHODA
PARRY PHIL RANDOLPH RHONA
PASCAL PHILBERT RANDY RHONDA
PASCHAL PHILIP RANSOM RHYS
PAT PHILLIP RAQUEL RICH
PATIENCE PHILO RAY RICHARD
RICHELLE RONDA ROYAL SAMUAL
RICHIE RONDELL ROYCE SAMUEL
RICK RONETTE ROZALIA SANDI
RICKEY RONNETTE ROZALIE SANDIE
RICKIE RONNY RUBEN SANDY
RICKY ROOSEVELT RUBENA SANJA
RIKKI RORY RUBERT SARA
RILEY ROSALEE RUBEY SARAH
RISA ROSALIE RUBIN SCHUYLER
RITA ROSALIND RUBINA SCOT
RITCHIE ROSALINDA RUBY SCOTT
RITHA ROSALYN RUBYE SCOTTIE
ROBBI ROSAMOND RUDOLF SEAMUS
ROBBIE ROSAMONDE RUDOLPH SEAN
ROBBIN ROSAMUND RUDY SEBASTIAN
ROBBY ROSAMUNDE RUE SEFERINO
ROBERTA ROSANNE RUFUS SELENA
ROBIN ROSCOE RULOFF SELENE
ROBINA ROSE RUPERT SELINA
ROBINETTE ROSEANN RUSS SELMA
ROCCO ROSEBUD RUSTY SERENA
ROCHELLE ROSELIN RUTH SERINA
ROCKY ROSELYN RUTHANNA SETH
ROD ROSEMARIE RUTHANNE SEYMORE
RODERIC ROSEMUND RUTHLYN SHANE
RODERICH ROSEMUNDE RYAN SHANNON
RODERICK ROSENA SABA SHARI
RODGER ROSETTA SABINA SHARLENE
RODRICK ROSINA SABRINA SHARYN
ROGER ROSLYN SADIE SHAUN
ROLAND ROSS SAL SHAWN
ROLF ROSWELL SALLI SHAWNA
ROLLO ROULETTE SALLY SHEARL
ROMAN ROWENA SALLYE SHEBA
ROMEO ROWLAND SALOMON SHEENA
ROMULUS ROXANNE SAM SHEILA
RONA ROXY SAMMIE SHELDEN
RONALD ROY SAMMY SHELDON
SHELIA SMEDLEY SUSANA TELMA
SHELLEY SOFIE SUSANNA TENA
SHELLY SOL SUSANNAH TENCH
SHELTON SOLOMAN SUSANNE TERENCE
SHERIDAN SONDRA SUZAN TERRANCE
SHERIE SONIA SUZANNE TERRELL
SHERILYN SONJA SU7.F.T.T.F.~ TERRENCE
SHERL SONNY SIJ~ ; TERRI
SHERLEE SPARKY SUZIE TERRY
SHERMAN SPENCE SUZY TERRYL
SHERON SPENCER SVEN TESS
SHERREE SPENSER SWEN TESSIE
SHERRIE SPIROS SYBIL TEX
SHERRY SPYROS SYD THAD
SHERWIN STACEY SYDNEY THADDEUS
SHERYL STACI SYLVAN THADEUS
SHIRLE STACIE SYLVENE THEA
SHIRLEE STACY SYLVESTER THEDA
SHIRLEY STAN SYLVIA THEIMA
SI STANISLAW SYLVIE THELMA
SID STANLY TABATHA THEOBALD
SIDNEY STEFAN TABITHA THEODIS
SIEGFRIED ~ 1 TAD THEODOR
SIG STELLA TAFFY THEODORA
SIGMUND STEPHANIE TAMARA THEODORIS
SIGNE STEPHEN TAMI THEODOSIA
SIGURD STERLING TAMMI THEONE
SILAS STEVE TAMMIE THEORA
SILVIO STEVIE TANCRED THERESA
SIMEON STEWART TANIA THERESIA
SIMON STU TANJA THERESSA
SIMONE STUART TANYA THOM
SISSIE SUANNE TATE THOMASINA
SISSY SUE TATIANA THOR
SKEET SUELLEN TAUBA THORA
SKIPPIE SUMNER TED THORE
SKYLER SUNNY TEDDY THORVALD
SLIM SUSAN TEE THOS
THURMAN TRINA VERDEEN VIVIAN
THURMOND TRISH VERDELL VIVIANNE
THURSTAN TRISTAM VERGA VIVIEN
THURSTON TRISTAN VERGIL VLAD
TIFFANY TRIXY VERLIN VOL
TILLIE TROY VERLON VON
TILLY TRUDIE VERLYN VONDA
TIM TRUDY VERN VONNA
TIMO TWILA VERNARD WALDEMAR
TIMOTHY TWYLA VERNE WALDO
TINA TYCHO VERNEST WALDORF
TINO TYCHUS VERNESTINE WALLACE
TIPHANY TYRONE VERNIE WALLY
TITO UDIE VERNON WALT
TITUS UDY VERONICA WALTER
TOBIAS ULRICH VERSA WANDA
TODD UNA VI WARREN
TOLLIE URA VIC WASHINGTON
TOLLIVER URBAIN VICKI WAYEAN
TOLLY URIAS VICKIE WAYLEN
TOMMIE VACHEL VICTOR WAYMAN
TOMMY VADA VICTORIA WAYNE
TONEY VAL VIDAL WELDEN
TONI VALDA VIE WELDON
TONY VALENTINE VINCE WENDEL
TONYA VALENTINO VINCENT WENDELL
TOOTIE VALERIA VINNIE WENDI
TOVE VALERIE VINNY WENDY
TRACEY VANCE VIOLET WES
TRACI VANDER VIRACE WESLEY
TRACIE VANDY VIRDA WESLIE
TRACY VANESSA VIRGIE WESTLEY
TRAVIS VAUGHN VIRGINIA WIDDIE
TRENA VEDA VIRGINIUS WILBER
TRENT VELLA VITA WILBERFORCE
TREVOR VELMA VITINA WILBERT
TRICIA VEOLA VITO WILDA
TRILBY VERA VITTORIA WILEY
WILFORD ZALPH
WILFRED ZANE
WILHELM ZEB
WILHELMENA ZEBADIAH
WILHELMINA ZEBEDEE
WILHEMENA ZECHARIAH
WILHEMINA ZEF
WILLARD ZEFF
WILLIAM ZEKE
WILLIE ZELDA
WILLIS ZELIA
WILLMA ZELIG
WILLY ZELL
WILMA ZELLA
WILMAR ZELLE
WILMOT ZELMA
WINFRED ZENA
WINIFRED ZENITH
WINNIE ZENO
WINNY ZENON
WINONA ZEPHERY
WINSLOW ZETA
WINSTON ZETTA
WINTON ZILLA
WM ZILLAH
WOLFGANG ZINA
WOODIE ZITA
WOODRUFF ZOE
WOODY ZOLLIE
WYATT ZOLLY
WYLA ZORA
XAVIERA ZYGMUND
YARDLEY ZYGMUNT
YETTA
YOLANDA
YVES
YVETTE
YVONNE
ZACHARIA
ZACHARY
ZACK
Modifications and variations of the above-described embodiments of the present invention are possible, as appreciated by those skilled in the art in light of the above teachings. As mentioned, any of a variety of hardware systems, memory organizations, software platforms, and programming languages may embody the present invention without departing from its spirit and scope. Moreover, countless variations of the Partition List, company indicators, product names, organization indicators, English first name list, and resulting Phrase Lists, and the like, may be employed or produced while remaining within the scope of the invention. It is therefore to be understood that, within the scope of the appended claims and their equivalents, the invention may be practiced otherwise than as specifically described.
Without this phrase, a retrieval process is unable to find the documents in which the concept is discussed. In traditional Boolean retrieval systems, phrase recognition is not an issue. The systems are known as post-coordination indexing systems in that phrases can be discovered through examining the adjacency relationships among search words during the process of merging inverted lists associated with the words.
However, in modern information retrieval systems, the statistical distribution characteristics of index terms are crucial to the relevance ranking process, and it is desirable to recognize phrases and derive their statistical characteristics in advance. In addition, in fabricating hypertext databases, recognized phrases are necessary for hypertext linkage.
Known phrase recognition methods include three types: machine translation, statistical text analysis and heuristical text analysis.
First, machine translation's approach to recognizing phrases (known as compound structures) is to analyze part-of-speech tags associated with the words in an input text string, usually a sentence. Noun phrases and verb phrases are two examples of such phrases. Syntactical context and lexical relationships among the words are key factors that determine successful parsing of the text. In machine translation, the goal is not to find correct phrases, but to discover the correct syntactical structure of the input text string to support other translation tasks. It is infeasible to use this syntactical parsing method for processing commercial full-text databases; the method is inefficient and is in practical terms not scalable. Regarding machine translation, reference may be made to U.S. Patent Nos. 5,299,124, 5,289,375, 4,994,966, 4,914,590, 4,931,936, and 4,864,502.
The second method of analysis, statistical text analysis, has two goals: disambiguating part-of-speech tagging, and discovering noun phrases or other compound terms. The statistics used include collocation information or mutual information, i.e., the probability that a given pair of part-of-speech tags or a given pair of words tends to appear together in a given data collection. When a word has more than one part-of-speech tag associated with it in a dictionary, consulting the part of speech of the next word and calculating the probability of occurrence of the two tags would help select a tag. Similarly, a pair of words that often appear together in the collection is probably a phrase. However, statistical text analysis requires knowledge of collocation that can only be derived from a known data collection. Disadvantageously, the method is not suitable for processing unknown data. Regarding statistical text analysis, reference may be made to U.S. Patent Nos. 5,225,981, 5,146,405, and 4,868,750.
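By way of non-limiting illustration only, the collocation statistic discussed above may be sketched in Python as pointwise mutual information over adjacent word pairs. This sketch is not taken from any of the cited patents; the function name, the sample sentence, and the choice of a base-2 logarithm are assumptions made for illustration.

```python
import math
from collections import Counter


def pointwise_mutual_information(tokens):
    """Return {(w1, w2): PMI} for every adjacent word pair in `tokens`.

    PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), with all probabilities
    estimated from this one token list (the "known data collection").
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_pairs
        p_w1 = unigrams[w1] / n_words
        p_w2 = unigrams[w2] / n_words
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Hypothetical sample collection, used only to exercise the function.
tokens = "the white house said the white house would meet the press".split()
scores = pointwise_mutual_information(tokens)
for pair in [("white", "house"), ("the", "white")]:
    print(pair, round(scores[pair], 3))
```

In this sample the pair "white house" scores higher than "the white", because "the" also precedes other words. The sketch also makes the stated drawback concrete: every probability is estimated from the collection itself, so nothing can be scored before a collection has been counted.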
The third method of analysis, heuristical text analysis, emphasizes textual pattern recognition. Textual patterns include any recognizable text strings that represent concepts, such as company names, people's names, or product names. For example, a list of capitalized words followed by a company indicator like "Limited" or "Corp" is an example pattern for recognizing company names in text. The heuristical text analysis method requires strong observation ability from a human analyst. Due to the limitation of humans' observation span, heuristical text analysis is only feasible for small subject domains (e.g., company names, product names, case document names, addresses, etc.). Regarding heuristical text analysis, reference may be made to U.S. Patent Nos. 5,410,475, 5,287,278, 5,251,316, and 5,161,105.
Thus, machine translation methods, involving potentially complex grammatical analysis, are too expensive and too error-prone for phrase recognition. Statistical text analysis, being based on collocation and being purely based on statistics, is still expensive because of the required full scale of part-of-speech tagging and pre-calculating collocation information, and also has difficulties processing unknown data without the collocation knowledge. Finally, heuristical text analysis, relying on "signal terms", is highly domain-dependent and has trouble processing general texts.
Thus, there is a need in the art for a simple, time-efficient, resource-efficient, and reliable phrase recognition method for use in assisting text indexing or for forming a statistical thesaurus. It is desired that the phrase recognition method be applicable to both individual documents and to large collections of documents, so that the performance of not only real-time on-line systems, but also distributed and mainframe text search systems can be improved. It is also desired that the phrase recognition method have engineering scalability, and not be limited to particular domains of knowledge.
The present invention is directed to fulfilling these needs.
SUMMARY OF THE INVENTION
The present invention provides a phrase recognition method which breaks text into text "chunks" and selects certain chunks as "phrases" useful for automated full text searching. The invention uses a carefully assembled list of partition words to partition the text into the chunks, and selects phrases from the chunks according to a small number of frequency-based definitions. The invention can also incorporate additional processes such as categorization of proper names to enhance phrase recognition.
The invention achieves its goals quickly and efficiently, referring simply to the phrases and the frequency with which they are encountered, rather than relying on complex, time-consuming, resource-consuming grammatical analysis, or on collocation schemes of limited applicability, or on heuristical text analysis of limited reliability or utility.
Additional objects, features and advantages of the invention will become apparent when the following Detailed Description of the Preferred Embodiments is read in conjunction with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention is better understood by reading the following Detailed Description of the Preferred Embodiments with reference to the accompanying drawing figures, in which like reference numerals refer to like elements throughout, and in which:
FIG. 1 illustrates an exemplary hardware configuration on which the inventive phrase recognition method may be executed.
FIG. 2 illustrates another exemplary hardware environment in which the inventive phrase recognition method may be practiced.
FIG. 3 is a high level flow chart schematically indicating execution in an embodiment of the phrase recognition method according to the present invention.
FIG. 4A is a flow chart schematically indicating execution in a module for partitioning text and generating text chunks.
FIG. 4B is a flow chart indicating execution of a module for selecting phrases using the data memory structure diagram of FIG. 5.
FIG. 5 is a data memory structure diagram schematically illustrating data flow during the inventive phrase recognition method (FIGS. 3, 4A, 4B) and corresponding memory allocation for various types of data used in accordance with the process.
FIG. 6A is a flow chart of an optional processing module for consolidating with a thesaurus.
FIG. 6B is a flow chart of an optional processing module for processing phrases with prepositions.
FIG. 6C is a flow chart of an optional processing module for trimming phrases with their collection frequencies.
FIG. 6D is a flow chart of an optional processing module for categorizing proper names.
FIGS. 7A, 7B and 7C illustrate exemplary applications of the inventive phrase recognition method according to the present invention. In particular: FIG. 7A indicates a user's viewing of a document in accordance with a suitable on-line text search system, and invoking the inventive phrase recognition method to search for additional documents of similar conceptual content; FIG. 7B schematically illustrates implementation of the phrase recognition method in a batch phrase recognition system in a distributed development system; FIG. 7C schematically illustrates application of the inventive phrase recognition method in a batch process in a mainframe system.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
In describing preferred embodiments of the present invention illustrated in the drawings, specific terminology is employed for the sake of clarity. However, the invention is not intended to be limited to the specific terminology so selected, and it is to be understood that each specific element includes all technical equivalents which operate in a similar manner to accomplish a similar purpose.
The concept of the present invention is first described on a particular example of a text stream. Then, block diagrams and flow charts are described, which illustrate non-limiting embodiments of the invention's structure and function.
Very briefly, a preferred embodiment of the inventive method partitions an input text stream based on punctuation and vocabulary. As the method processes the text stream sequentially, it inserts partition symbols between words if certain punctuation exists, such as a comma, end of sentence, or change in capitalization. Further, each word encountered is checked against one or more vocabulary lists, and may be discarded and replaced by a partition symbol, based on the word and where it is encountered.
After the document is thus processed, a set of candidate terms and "phrases" (a series of non-partitioned words) is produced. In the preferred embodiment, solitary words (individual words immediately surrounded by partitions) are ignored at this point. The phrases are processed to determine which phrases occur with higher frequency. Preferably, shorter phrases which occur with higher frequency and as subsets of lower-frequency but lengthier phrases are also sought. A set of phrases meeting or exceeding a given threshold frequency is produced.
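By way of non-limiting illustration only, the frequency-based selection just outlined may be sketched in Python as follows. The specification's own frequency-based definitions are not reproduced in this passage, so the rules below are assumptions made for illustration: the function names, the default threshold of 2, and the rule that a shorter phrase is kept only when it occurs more often than every longer qualifying phrase containing it.

```python
from collections import Counter


def contains(longer, shorter):
    """True if `shorter` is a contiguous run of words inside the longer phrase."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def select_phrases(chunks, min_freq=2):
    """Return {phrase: frequency} for phrases meeting the threshold.

    `chunks` is a list of word lists, i.e. runs of non-partitioned words.
    """
    counts = Counter()
    for chunk in chunks:
        words = tuple(w.lower() for w in chunk)
        if len(words) < 2:
            continue  # solitary words are ignored at this point
        # Count the chunk itself and every shorter multi-word run inside it.
        for size in range(2, len(words) + 1):
            for start in range(len(words) - size + 1):
                counts[words[start:start + size]] += 1
    frequent = {p: c for p, c in counts.items() if c >= min_freq}
    selected = {}
    for phrase, count in frequent.items():
        # Assumed rule: drop a sub-phrase that is no more frequent than a
        # longer qualifying phrase that contains it.
        absorbed = any(
            contains(other, phrase) and other_count >= count
            for other, other_count in frequent.items()
        )
        if not absorbed:
            selected[" ".join(phrase)] = count
    return selected


# Hypothetical chunks, used only to exercise the function.
chunks = [
    ["Irish", "Republican", "Army"],
    ["Republican", "Army"],
    ["visa"],
    ["White", "House"],
    ["Irish", "Republican", "Army"],
]
print(select_phrases(chunks))
```

With these sample chunks, "irish republican army" (frequency 2) and the more frequent sub-phrase "republican army" (frequency 3) are retained, while "irish republican", "white house" and the solitary word "visa" are not.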
The inventive method is more easily understood with reference to a particular example.
As mentioned above, members of a list of words (including punctuation) serve as "break points" to form text "chunks" within input text. A first (rudimentary) list includes words that can be used as "stop words". The stop words usually carry little semantic information because they exist merely for various language functions. This list has a few hundred members and includes articles (e.g., "a", "the"), conjunctions (e.g., "and", "or"), adverbs (e.g., "where", "why"), prepositions (e.g., "of", "to", "for"), pronouns (e.g., "we", "his"), and perhaps some numeric items.
However, this first list is too short for the present phrase recognition method because it causes generation of a list of text chunks that are too long to allow efficient generation of desirable phrases. Additional stop words or other partition items are needed for reducing the size of the text chunks, so that more desirable phrases may be found.
The following example of text illustrates this problem. In this example, the text "chunks" are within square brackets, with the text chunks being separated by members of the list of stop words (break points):
[Citing] what is [called newly conciliatory comments] by the [leader] of the [Irish Republican Army]'s [political wing], the [Clinton Administration announced today] that it would [issue] him a [visa] to [attend] a [conference] on [Northern Ireland] in [Manhattan] on [Tuesday]. The [Administration] had been [leaning] against [issuing] the [visa] to the [official], [Gerry Adams], the [head] of [Sinn Fein], [leaving] the [White House caught] between the [British Government] and a [powerful bloc] of [Irish-American legislators] who [favored] the [visa]. (Parsed text based on rudimentary list)
Since desirable phrases include noun phrases (e.g., "ice cream"), adjective-noun phrases (e.g., "high school"), participle-noun phrases (e.g., "operating system"), and proper names (e.g., "White House"), it is safe to add adverbs (e.g., "fully") and non-participle verbs (e.g., "have", "is", "obtain") to the list of stop words to form an enhanced stop word list. This enhanced stop word list allows the method to provide smaller text chunks, yet is still compact enough for efficient look-up by computer. With the enhanced list, the above example text is parsed into chunks and stop words as follows:
[Citing] what is [called] newly [conciliatory comments] by the [leader] of the [Irish Republican Army]'s [political wing], the [Clinton Administration] announced [today] that it would issue him a [visa] to attend a [conference] on [Northern Ireland] in [Manhattan] on [Tuesday]. The [Administration] had been [leaning] against [issuing] the [visa] to the [official], [Gerry Adams], the [head] of [Sinn Fein], [leaving] the [White House] caught between the [British Government] and a [powerful bloc] of [Irish-American legislators] who favored the [visa]. (Second parsed text based on enhanced list)
The theoretical justification for using this enhanced list derives from two sources.
A first justification is that this list only represents about 13% of unique words in a general English dictionary. For example, in the Moby dictionary of 214,100 entries, there are 28,408 words that can be put into the list. This fact ensures that semantic information in texts is maintained at a maximum level.
A second justification involves the lexical characteristics of these words. Most of the words bear little content. This second fact reduces the risk of losing semantic information in the text.
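The "about 13%" figure follows directly from the two counts quoted above. A quick check, with variable names chosen here purely for illustration:

```python
# Counts quoted in the text for the Moby dictionary.
dictionary_entries = 214_100
partition_candidates = 28_408

# Share of dictionary words that may be placed in the enhanced stop word list.
share = partition_candidates / dictionary_entries
print(f"{share:.1%}")  # prints 13.3%
```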
The basic concept of the invention having been described, particular implementations of its structure and function are now presented. As will readily be appreciated, the invention is preferably embodied as software, instruction codes capable of being executed by digital computers, including commercially available general purpose digital computers well known to those skilled in the art. The particular hardware on which the invention may be implemented varies with the particular desired application of the inventive phrase recognition method. Three examples of such application of the phrase recognition method are described in greater detail below, with reference to FIGS. 7A, 7B, and 7C. Briefly, the dynamic recognition method involved in an on-line system (FIG. 7A) may be implemented in IBM 370 assembly language code. Alternatively, in a batch recognition system in a distributed development system (FIG. 7B), the phrase recognition method may be implemented on a SUN work station using the PERL script interpretive prototyping language. As a still further implementation, the inventive phrase recognition method may be implemented on an Amdahl AMD 5995-1400-a mainframe so that another batch phrase recognition system (FIG. 7C) may be realized. Of course, the scope of the invention should not be limited by these exemplary embodiments or applications.
Embodiments of the inventive phrase recognition method may be implemented as a software program including a series of executable modules on a computer system. As shown in FIG. 1, an exemplary hardware platform includes a central processing unit 110. The central processing unit interacts with a human user through a user interface 112. The user interface is used for inputting information into the system and for interaction between the system and the human user. The user interface 112 includes, for example, a video display 113 and a keyboard 115. A computer memory 114 provides storage for data and software programs which are executed by the central processing unit 110. Auxiliary memory 116, such as a hard disk drive or a tape drive, provides additional storage capacity and a means for retrieving large batches of information.
All components shown in FIG. 1 are of a type well known in the art. For example, the FIG. 1 system may include a SUN work station including the execution platform Sparc 2 and SUN OS Version 4.1.2, available from SUN MICROSYSTEMS of Sunnyvale, California. Of course, the system of the present invention may be implemented on any number of modern computer systems.
A second, more complex environment in which the inventive phrase recognition method may be practiced is shown in FIG. 2. In particular, a document search and retrieval system 30 is shown. The system allows a user to search a subset of a plurality of documents for particular key words or phrases. The system then allows the user to view documents that match the search request. The system 30 comprises a plurality of Search and Retrieval (SR) computers 32-35 connected via a high speed interconnection 38 to a plurality of Session Administrator (SA) computers 42-44.
Each of the SR's 32-35 is connected to one or more document collections 46-49, each containing text for a plurality of documents, indexes therefor, and other ancillary data. More than one SR can access a single document collection. Also, a single SR can be provided access to more than one document collection. The SR's 32-35 can be implemented using a variety of commercially available computers well known in the art, such as Model EX100 manufactured by Hitachi Data Systems of Santa Clara, California.
Each of the SA's 42-44 is provided access to data representing phrase and thesaurus dictionaries 52-54. The SA's 42-44 can also be implemented using a variety of commercially available computers, such as Models 5990 and 5995 manufactured by Amdahl Corporation of Sunnyvale California. The interconnection 38 between the SR's and the SA's can be any one of a number of two-way high-speed computer data interconnections well known in the art, such as the Model 7200-DX manufactured by Network Systems Corporation of Minneapolis, Minnesota.
Each of the SA's 42-44 is connected to one of a plurality of front end processors 56-58. The front end processors 56-58 provide a connection of the system 30 to one or more commonly available networks 62 for accessing digital data, such as an X.25 network, long distance telephone lines, and/or SprintNet. Connected to the network 62 are plural terminals 64-66 which provide users access to the system 30. Terminals 64-66 can be dumb terminals which simply process and display data inputs and outputs, or they can be one of a variety of readily available stand-alone computers, such as IBM or IBM-compatible personal computers. The front end processors 56-58 can be implemented by a variety of commercially available devices, such as Models 4745 and 4705 manufactured by the Amdahl Corporation of Sunnyvale California.
The number of components shown in FIG. 2 is for illustrative purposes only. The system 30 described herein can have any number of SA's, SR's, front end processors, etc. Also, the distribution of processing described herein may be modified and may in fact be performed on a single computer without departing from the spirit and scope of the invention.
A user wishing to access the system 30 via one of the terminals 64-66 will use the network 62 to establish a connection, by means well known in the art, to one of the front end processors 56-58. The front end processors 56-58 handle communication with the user terminals 64-66 by providing output data for display by the terminals 64-66 and by processing terminal keyboard inputs entered by the user. The data output by the front end processors 56-58 includes text and screen commands. The front end processors 56-58 support screen control commands, such as the commonly known VT100 commands, which provide screen functionality to the terminals 64-66 such as clearing the screen and moving the cursor insertion point. The front end processors 56-58 can handle other known types of terminals and/or stand-alone computers by providing appropriate commands.
Each of the front end processors 56-58 communicates bidirectionally, by means well known in the art, with its corresponding one of the SA's 42-44. It is also possible to configure the system, in a manner well known in the art, such that one or more of the front end processors can communicate with more than one of the SA's 42-44. The front end processors 56-58 can be configured to "load balance" the SA's 42-44 in response to data flow patterns. The concept of load balancing is well known in the art.
Each of the SA's 42-44 contains an application program that processes search requests input by a user at one of the terminals 64-66 and passes the search request information on to one or more of the SR's 32-35, which perform the search and return the results, including the text of the documents, to the SA's 42-44. The SA's 42-44 provide the user with text documents corresponding to the search results via the terminals 64-66. For a particular user session (i.e. a single user accessing the system via one of the terminals 64-66), a single one of the SA's 42-44 will interact with a user through an appropriate one of the front end processors 56-58.
Preferably, the inventive phrase recognition method is implemented in the session administrator SA computers 42-44, with primary memory being in the SA computer itself and further memory being illustrated within elements 52-54.
The principles on which the inventive method is based, and hardware systems and software platforms on which it may be executed, having been described, a preferred embodiment of the inventive phrase recognition method is described as follows.
FIG. 3 is a high level flow diagram of the phrase recognition method of the preferred embodiment. Referring to FIG. 3, the invention uses a carefully assembled list of English words (and other considerations such as punctuation) in a Partition Word List (more generally referred to as a Partition Entity List, or simply Partition List) to partition one or more input text streams into many text chunks. This partitioning process is illustrated in block 10.
Block 20 indicates selection of phrases from among the chunks of text, according to frequency based definitions. A Phrase List, including the selected phrases, results from execution of the process in block 20. During the phrase selection process, solitary words (single-word chunks), as well as words from the decomposed phrases, can be maintained separate from the Phrase List as optional outputs for other indexing activities.
Details of processes 10 and 20 are described with reference to FIGS. 4A and 4B.
The invention can optionally incorporate one or more other processes, generically indicated as element 30. Such optional processes may include categorization (examples described with reference to FIGS. 6A-6D) to enhance the list of recognized phrases.
FIG. 4A is a flow chart of FIG. 3 module 10, for partitioning text and generating text chunks.
FIG. 4A shows how the method, given a piece of text, partitions the text into many small text chunks. A critical component in this method is the Partition List (including words and punctuation) whose members serve as break points to generate the text chunks.
As mentioned above, a Partition List ideally allows parsing of text into short phrases, but is itself still compact enough for efficient computer look-up during the parsing process. Preferably, the Partition List is generated using not only articles, conjunctions, adverbs, prepositions, pronouns, and numeric items, but also adverbs and verbs, to form an enhanced list.
The text partitioning process starts off with looking up encountered text in the Partition List (at block 101) and replacing every matched partition word or other partition entity with a partition tag such as "
Additional partition tags are added into those text chunks at the point where there is a case change, either from lower case to upper case or vice versa (shown at block 103).
Block 104 indicates generation of the text chunk list which preserves the natural sequence of the chunks as encountered in the text.
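The partitioning steps just described (blocks 101 through 104) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the tag value `<BREAK>`, the tokenizing regular expression, and the tiny `PARTITION_LIST` are all hypothetical stand-ins for the real Partition List.

```python
import re

PARTITION_TAG = "<BREAK>"  # hypothetical placeholder for the partition tag

# Tiny illustrative subset of a Partition List (assumption, not the real list).
PARTITION_LIST = {"the", "of", "a", "to", "on", "in", "by", "that", "it", "would",
                  "him", "announced", "issue", "attend", ",", ".", "'s"}

def partition(text, partition_list=PARTITION_LIST):
    """Return the text chunks in their natural order of occurrence."""
    # Tokens: possessive 's, words (hyphens allowed inside), or single other characters.
    tokens = re.findall(r"'s|[A-Za-z]+(?:-[A-Za-z]+)*|[^\sA-Za-z]", text)
    tagged, prev_upper = [], None
    for tok in tokens:
        if tok.lower() in partition_list:      # blocks 101-102: tag partition entities
            tagged.append(PARTITION_TAG)
            prev_upper = None
            continue
        is_upper = tok[0].isupper()
        if prev_upper is not None and is_upper != prev_upper:
            tagged.append(PARTITION_TAG)       # block 103: tag at a case change
        tagged.append(tok)
        prev_upper = is_upper
    chunks, current = [], []
    for tok in tagged + [PARTITION_TAG]:       # block 104: collect chunks in order
        if tok == PARTITION_TAG:
            if current:
                chunks.append(" ".join(current))
            current = []
        else:
            current.append(tok)
    return chunks

print(partition("The Clinton Administration announced today that it would "
                "issue him a visa to attend a conference on Northern Ireland."))
# ['Clinton Administration', 'today', 'visa', 'conference', 'Northern Ireland']
```

With "caught" absent from the illustrative list, `partition("leaving the White House caught")` still separates "White House" from "caught" because of the case-change rule.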
The frequency information for each chunk in the list is collected by scanning the text chunks in their natural sequence. The first occurrence of each unique chunk in the sequence is registered as a new entry with its frequency as 1. Subsequent occurrences are registered by incrementing its frequency count by 1. This generation of occurrence frequencies in association with the respective chunks is indicated by block 105.
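The counting rule of block 105 maps naturally onto an insertion-ordered dictionary. The sketch below assumes the chunk list is a plain Python list of strings; the function name `count_chunks` is illustrative only.

```python
def count_chunks(chunks):
    """Map each unique chunk to its frequency, keeping first-occurrence order."""
    frequencies = {}
    for chunk in chunks:
        if chunk not in frequencies:
            frequencies[chunk] = 1   # first occurrence: new entry with frequency 1
        else:
            frequencies[chunk] += 1  # later occurrence: increment the count by 1
    return frequencies

print(count_chunks(["visa", "leader", "visa"]))  # {'visa': 2, 'leader': 1}
```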
FIG. 4B is a flow chart of FIG. 3 module 20, illustrating details of a preferred process for selecting which chunks are phrases.
FIG. 5 is a data memory structure diagram showing how data may be arranged in memory for the process, and how data flows into and out of various steps of the process. More specifically, the steps from FIG. 3 of text partitioning 10, phrase selection 20, and optional processing 30 (reproduced on the left side of FIG. 5) are illustrated in conjunction with an exemplary data memory structure diagram (on the right side of FIG. 5) to schematically illustrate data flow between major functional procedures and data structures. The various lists shown in the exemplary memory blocks on the right side of FIG. 5 are understood to include list members in conjunction with their respective frequencies of occurrence.
The memory (broadly, any data storage medium such as RAM and/or magnetic disk and/or optical disk and/or other suitable computer readable medium) may be structured in memory blocks as schematically illustrated. A text stream file 300 and a Partition List 310 are used as inputs to the partitioning process 10 of the inventive phrase recognition method. The partitioning process 10 provides a chunk list (understood as including corresponding chunk frequencies) 315. Chunk list 315 is used by the phrase selection process 20 of the inventive phrase recognition method.
The partitioning process produces various groupings of chunks, each with their respective frequencies of occurrence within the text stream. These groupings of chunks are illustrated on the right side of FIG. 5, with the understanding that the invention should not be limited to the particular memory structure so illustrated.
Specifically, lower case words (that is, single-word chunks) are in memory block 320, capitalized or "allcaps" single-word chunks are in memory block 325, a Proper Name List (preferably of greater than one word, each being capitalized or in allcaps) is in memory block 330, lower case phrases of greater than one word occurring more than once are in memory block 335, lower case phrases of greater than one word which were encountered only once are in memory block 345, and, optionally, acronyms are in memory block 350.
A synonym thesaurus in memory block 375 may be used in an optional process 30. A phrase frequency list derived from a collection of plural documents, in which the phrase frequency throughout the collection is greater than a threshold, in memory block 380, may also be used in an optional processing procedure 30. Further, one or more special indicator lists, generally indicated as 385A-385E (company indicators, geographic names, product names, organization indicators, English first names, respectively, some of which are exemplified in the attached Appendices) may contribute to certain optional categorizing processes, and result in corresponding name lists (company names, geographic location names, product names, organization names, and English names) generally indicated as 390A-390E.
Referring again to FIG. 4B, after the text chunk list is produced, it is time to decide whether each chunk in the list is a phrase useful for representing conceptual content of documents. The inventive phrase recognition method uses the frequency information of two types of the partitioned text chunks (namely, the proper names in block 330 and the lower case phrases in blocks 335 and 345) to make final phrase selection decisions. Preferably, the invention focuses on lower case phrases of more than one word, or on proper names ("John Hancock", "United States").
Referring to FIGS. 4B and 5, at block 201, entries consisting of a solitary lower case word are not selected as phrases. Rejected entries are stored in memory block 320.
As shown at block 202, those chunks that include plural lower case words are determined to be phrases only if they occur at least twice in the text stream. These chunks are stored in memory block 335. Chunks not fitting these criteria are stored in block 345 for further processing.
For chunks consisting of a solitary upper case word (either the first letter being capitalized or "allcaps"), no phrase decision is made at this stage, as shown at block 203. Such chunks are stored in memory block 325.
In block 204, chunks including plural upper case words are determined to be proper names and are stored in a Proper Name List in memory block 330.
Finally, other text chunks not fitting the previous criteria are simply discarded at this time, as indicated at block 205.
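The sorting rules of blocks 201 through 205 can be sketched as a single pass over the counted chunks. This is an illustration under assumptions: "upper case" is tested only on the first letter of each word, and the memory blocks are modeled as dictionaries keyed by the block numbers used in the text.

```python
def classify_chunks(frequencies):
    """Sort counted chunks into the memory blocks named in the text."""
    blocks = {320: {}, 325: {}, 330: {}, 335: {}, 345: {}}
    for chunk, freq in frequencies.items():
        words = chunk.split()
        if not words:
            continue
        all_lower = all(w[0].islower() for w in words)
        all_upper = all(w[0].isupper() for w in words)
        if all_lower and len(words) == 1:
            blocks[320][chunk] = freq            # block 201: solitary lower case word
        elif all_lower:
            target = 335 if freq >= 2 else 345   # block 202: phrase only if seen twice
            blocks[target][chunk] = freq
        elif all_upper and len(words) == 1:
            blocks[325][chunk] = freq            # block 203: decision deferred
        elif all_upper:
            blocks[330][chunk] = freq            # block 204: proper name
        # block 205: anything else is discarded
    return blocks

blocks = classify_chunks({"political wing": 2, "powerful bloc": 1, "visa": 3,
                          "Tuesday": 1, "White House": 1})
print(blocks[335], blocks[330])  # {'political wing': 2} {'White House': 1}
```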
Next, block 206 examines the lower case phrases having a single occurrence from memory block 345. Each is examined for having one of its sub-phrases as part of an existing lower case phrase in the list. For efficiency, a sub-phrase may be defined to be the first or last two or three words in the phrase. When the existence of a sub-phrase is detected, it is merged into the corresponding phrase in the list in memory block 335, and its frequency count is updated. Otherwise, the lower case phrase is decomposed into individual words for updating the lower case word list in memory block 320 as an optional output.
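A sketch of the sub-phrase mapping of block 206 follows. It rests on one reading of the text that should be treated as an assumption: a sub-phrase "is part of" an existing phrase when it appears as a contiguous word sequence inside that phrase. The function names are illustrative.

```python
def sub_phrases(phrase):
    """First and last two or three words of a phrase (the efficiency shortcut)."""
    words = phrase.split()
    subs = []
    for n in (3, 2):
        if len(words) >= n:
            subs.append(" ".join(words[:n]))
            subs.append(" ".join(words[-n:]))
    return subs

def merge_singletons(repeated, singletons, lower_words):
    """Fold once-seen phrases (block 345) into block 335 or, failing that, block 320."""
    for phrase, freq in singletons.items():
        match = next((known for sub in sub_phrases(phrase) for known in repeated
                      if f" {sub} " in f" {known} "), None)
        if match is not None:
            repeated[match] += freq              # merge and update the frequency count
        else:
            for word in phrase.split():          # decompose into individual words
                lower_words[word] = lower_words.get(word, 0) + freq
    return repeated, lower_words

repeated, words = merge_singletons({"operating system": 2},
                                   {"new operating system": 1, "powerful bloc": 1}, {})
print(repeated, words)
# {'operating system': 3} {'powerful': 1, 'bloc': 1}
```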
As a result of this sub-phrase mapping in block 206, in our example the list is reduced to a list of lower case phrases and a list of proper names, both with their respective frequency counts:
[political wing, 2]
[Citing, 1]
[Irish Republican Army, 1]
[Clinton Administration, 1]
[Northern Ireland, 1]
[Manhattan, 1]
[Tuesday, 1]
[Administration, 1]
[Gerry Adams, 1]
[Sinn Fein, 1]
[White House, 1]
[British Government, 1]
The singleton upper case word could be used for referencing an existing proper name in the proper name list. To make the final frequency count accurate, the method makes one additional scan of the Proper Name List 330. It consolidates the upper case word that is either the first or the last word of a previously recognized proper name, and updates its frequency count. This use of upper case single words in memory block 325 to revise the Proper Name List 330 is indicated at block 207. The method stores the other upper case words in the upper case word list 325 as an optional output.
A special case of the singleton upper case word is that of the acronym. An acronym is defined either as a string of the first character of each word (which is neither a preposition nor a conjunction) in a proper name or as a string of the first character of each word in a proper name followed by a period. As indicated at block 208, when an acronym in memory block 325 maps to a proper name in the proper name list 330, the frequency count of the proper name is incremented, and the pair of the proper name and its acronym is copied into an acronym list 350 as an optional output.
In our example, this reference checking process further reduces the proper name list to the following:
[Irish Republican Army, 1]
[Clinton Administration, 2]
[Northern Ireland, 1]
[Gerry Adams, 1]
[Sinn Fein, 1]
[White House, 1]
[British Government, 1]
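The reference checking of blocks 207 and 208 can be sketched as below. The set `SKIP_WORDS` is a small illustrative stand-in for "prepositions and conjunctions", and the function names are hypothetical; the logic credits a single upper case word to the first proper name it matches, either as an acronym or as that name's first or last word.

```python
SKIP_WORDS = {"of", "for", "and", "or", "in", "on", "to"}  # illustrative only

def acronym_forms(proper_name):
    """Both acronym styles described in the text, e.g. 'IRA' and 'I.R.A.'."""
    initials = [w[0] for w in proper_name.split() if w.lower() not in SKIP_WORDS]
    return {"".join(initials), "".join(c + "." for c in initials)}

def consolidate_upper_words(proper_names, upper_words):
    """Credit single upper case words (block 325) to proper names (block 330)."""
    acronyms, leftovers = {}, {}
    for word, freq in upper_words.items():
        for name in proper_names:
            parts = name.split()
            if word in acronym_forms(name):
                acronyms[name] = word             # block 208: optional acronym list 350
            elif word not in (parts[0], parts[-1]):
                continue                          # no reference to this name
            proper_names[name] += freq            # blocks 207-208: update the count
            break
        else:
            leftovers[word] = freq                # remains in upper case word list 325
    return proper_names, acronyms, leftovers

names = {"Irish Republican Army": 1, "Clinton Administration": 1, "White House": 1}
names, acronyms, rest = consolidate_upper_words(names, {"Administration": 1, "Tuesday": 1})
print(names["Clinton Administration"], rest)  # 2 {'Tuesday': 1}
```

As in the worked example, the solitary "Administration" raises the count of "Clinton Administration" from 1 to 2.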
If no additional processing is necessary, this method concludes by combining the lower case phrase list in memory block 335 and the Proper Name List in memory block 330 into a single Phrase List 340 which is provided as the final output of the phrase selection process 20.
In another embodiment, the lower case phrases with frequency = 1 in memory block 345 are also included in the consolidation, in addition to the Proper Name List in memory block 330 and the lower case phrases having frequency greater than 1 in memory block 335. The choice of either including or excluding the lower case phrases in memory block 345 is determined by a frequency threshold parameter which determines the number of times a lower case phrase must be encountered before it is allowed to be consolidated into the final Phrase List.
The example shown in FIG. 5 has this threshold set to 2, so that those phrases encountered only once (in memory block 345) are excluded from the consolidated Phrase List 340. The dotted line extending downward from Phrase List 340 to include memory block 345 shows how lower case phrases encountered only once can be included in the Phrase List if desired, however.
In any event, the consolidation of memory blocks into a single Phrase List is indicated at block 209.
For this text stream example, the final Phrase List is as follows:
[political wing, 2]
[Irish Republican Army, 1]
[Clinton Administration, 2]
[Northern Ireland, 1]
[Gerry Adams, 1]
[Sinn Fein, 1]
[White House, 1]
[British Government, 1]
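The consolidation of block 209, including the frequency threshold parameter, can be sketched as follows. The function name and argument layout are assumptions made for illustration; proper names are always kept, while lower case phrases must meet the threshold.

```python
def build_phrase_list(proper_names, repeated, singletons, threshold=2):
    """Combine memory blocks 335, 345 and 330 into a single Phrase List (340)."""
    lower_case = {**repeated, **singletons}
    # Lower case phrases must be encountered at least `threshold` times.
    phrase_list = {p: f for p, f in lower_case.items() if f >= threshold}
    phrase_list.update(proper_names)   # proper names are kept regardless of count
    return phrase_list

print(build_phrase_list({"White House": 1}, {"political wing": 2}, {"powerful bloc": 1}))
# {'political wing': 2, 'White House': 1}
```

Calling the same function with `threshold=1` corresponds to the alternative embodiment in which the once-only phrases of memory block 345 are included.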
The invention envisions that optional processes are available for further enhancing the recognized phrases.
FIG. 6A is a flow chart of an optional processing module for consolidating with a synonym thesaurus.
Referring to FIG. 6A, the Phrase List can be further reduced with a synonym thesaurus, as indicated at block 301. The synonym thesaurus may be any suitable synonym thesaurus available from commercial vendors. As an example, the Phrase List may map "White House" to "Clinton Administration." Using a synonym thesaurus is risky because its contents may not reflect the intended conceptual content of the text, and therefore may cause mapping errors. For example, it would be problematic if a synonym thesaurus maps "Bill Clinton" to "White House", because the two terms are not always equivalent.
FIG. 6B is a flow chart of an optional processing module for processing phrases with prepositions.
Referring to FIG. 6B, when a desirable lower case phrase contains one of a small set of prepositions (e.g., "right to counsel", "standard of proof"), the method takes the set out of the Partition List used for generating text chunks so that the phrase including the preposition has an opportunity to reveal itself as being part of a "good" phrase. This process is indicated as block 302.
Since it is statistically unlikely that any given occurrence of a preposition is in a "good" phrase, this optional process consumes substantial time for a relatively small increase in phrases, and is considered optional.
It is necessary to have another process to further examine the unqualified phrase in memory block 345 that contains one of the selected prepositions, to determine whether the sub-phrase on the left of the preposition or the sub-phrase on the right constitutes a valid phrase in the lower case phrase list in memory block 335. This process is illustrated as block 303.
As a result of process blocks 302, 303, memory block 335 may be updated.
FIG. 6C is a flow chart of an optional processing module for trimming phrases with their collection frequencies.
Referring to FIG. 6C, still another optional process is that of editing the Proper Name List 330 and lower case phrases 335 with additional frequency information 380 gathered from a text collection of more than one document. The assumption here is that the more authors who use a phrase, the more reliable the phrase is for uniquely expressing a concept. In other words, a phrase occurring in more than one document is a "stronger" phrase than another phrase occurring only in a single document.
Here, the "collection frequency" of a phrase is the number of documents that contain the phrase. A collection frequency threshold (e.g., five documents) can be set to trim down those phrases whose collection frequencies are below the threshold, as indicated at block 304. Essentially, FIG. 6C trims the entire Phrase List 340, including entries from either memory block 330 or 335.
When collection frequency information is available (as illustrated by memory block 380), the minimum frequency requirement of two encounters for the lower case phrases within a text (see FIG. 5) can be lowered to one encounter. "Mistaken" phrases will be rejected when consulting the collection frequency information when considering multiple documents.
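The collection frequency trimming of block 304 can be sketched as follows. Both helper names are hypothetical, and the input format (one list of recognized phrases per document) is an assumption; the default threshold of five documents is the example value given in the text.

```python
def collection_frequencies(per_document_phrases):
    """Number of documents in which each phrase appears at least once."""
    counts = {}
    for phrases in per_document_phrases:
        for phrase in set(phrases):   # count each document only once per phrase
            counts[phrase] = counts.get(phrase, 0) + 1
    return counts

def trim_by_collection_frequency(phrase_list, collection_frequency, threshold=5):
    """Drop phrases whose collection frequency falls below the threshold."""
    return {p: f for p, f in phrase_list.items()
            if collection_frequency.get(p, 0) >= threshold}

docs = [["white house", "sinn fein"], ["white house"], ["white house", "white house"]]
cf = collection_frequencies(docs)
print(trim_by_collection_frequency({"white house": 4, "sinn fein": 1}, cf, threshold=2))
# {'white house': 4}
```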
FIG. 6D is a flow chart of an optional processing module for categorizing proper names.
Referring now to FIG. 6D, after proper names are identified and are stored in the Proper Name List 330, it is possible to categorize them into new sets of pre-defined groups, such as company names, geographic names, organization names, peoples' names, and product names.
A list 385A of company indicators (e.g., "Co." and "Limited") is used for determining whether the last word in a proper name is such an indicator, and thereafter for categorizing it into the group of company names. Any word after this indicator is removed from the proper name.
With the knowledge of the company name, it may be useful to check the existence of the same company name in the list that does not have the indicator word. If the search is successful, the frequency count of the company name is updated. The recognized company names are kept in a Company Names list 390A as an optional output, as indicated at block 305.
Similarly, a list 385B of geographic names or a list 385C of product names may be used for looking up whether a proper name has a match and thereafter for categorizing it into the list of geographic names or a list of product names, respectively. The recognized geographic names or product names are kept in Geographic Location Names 390B or Product Names 390C lists as optional outputs, as indicated at blocks 306 and 307.
A list 385D of words that designate organizations is used for determining whether the first or the last word of a proper name is the indicator of organization, and thereafter for categorizing it into the group of organizations. The recognized organization names may be kept in an Organization Names List 390D as an optional output, as indicated at block 308.
Finally, a list 385E of English first names is used for determining whether the first word of a proper name is a popular first name and thereafter for categorizing it into the group of peoples' names. Any word before the first name is removed from the proper name. The more comprehensive the lists are, the more people names can be categorized properly. The recognized people names are kept in a separate English Names list 390E as an optional output for other indexing activities, as indicated at block 309.
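Two of the categorizing rules above (company indicators and first names) can be sketched together. This is an illustration under assumptions: the two small sets stand in for lists 385A and 385E, the indicator or first name is searched for anywhere in the proper name so that trailing or leading words can be removed, and the category labels are invented for the example.

```python
COMPANY_INDICATORS = {"Co.", "Limited"}   # the two examples given in the text
FIRST_NAMES = {"Gerry", "John"}           # illustrative stand-in for list 385E

def categorize(proper_name):
    """Return (category, cleaned proper name) for one entry of the Proper Name List."""
    words = proper_name.split()
    for i, word in enumerate(words):
        if word in COMPANY_INDICATORS:
            # Any word after the company indicator is removed.
            return "company", " ".join(words[:i + 1])
    for i, word in enumerate(words):
        if word in FIRST_NAMES:
            # Any word before the first name is removed.
            return "person", " ".join(words[i:])
    return "uncategorized", proper_name

print(categorize("Gerry Adams"))  # ('person', 'Gerry Adams')
print(categorize("White House"))  # ('uncategorized', 'White House')
```

Geographic, product, and organization names would follow the same look-up pattern against lists 385B through 385D.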
Appendices A through E present an exemplary Partition List 310 and exemplary Special Indicator/Name lists 385A-385E.
The inventive method having been described above, the invention also encompasses apparatus (especially programmable computers) for carrying out phrase recognition. Further, the invention encompasses articles of manufacture, specifically, computer readable memory on which the computer-readable code embodying the method may be stored, so that, when the code is used in conjunction with a computer, the computer can carry out phrase recognition.
Non-limiting, illustrative examples of apparatus which the invention envisions are described above and illustrated in FIGS. 1 and 2. Each comprises a computer or other programmable apparatus whose actions are directed by a computer program or other software.
Non-limiting, illustrative articles of manufacture (storage media with executable code) may include the disk memory 116 (FIG. 1), the disk memories 52-54 (FIG. 2), other magnetic disks, optical disks, conventional 3.5-inch, 1.44MB "floppy" diskettes or other magnetic diskettes, magnetic tapes, and the like. Each constitutes a computer readable memory that can be used to direct the computer to function in a particular manner when used by the computer.
Those skilled in the art, given the preceding description of the inventive method, are readily capable of using knowledge of hardware, of operating systems and software platforms, of programming languages, and of storage media, to make and use apparatus for phrase recognition, as well as computer readable memory articles of manufacture which, when used in conjunction with a computer, can carry out phrase recognition. Thus, the invention's scope includes not only the method itself, but apparatus and articles of manufacture.
Applications of the phrase recognition method
The phrase recognition method described above can be used in a variety of text searching systems. These include, but need not be limited to, dynamic phrase recognition in on-line systems, batch phrase recognition in a distributed development system, and batch phrase recognition in a mainframe system. The following description of the applications of the inventive phrase recognition method is illustrative, and should not limit the scope of the invention as defined by the claims.
In an on-line system (OLS) envisioned as a target application for the inventive phrase recognition method, a user viewing a current document and entering a command to search for documents of similar conceptual content must wait for the phrase recognition process to be completed. Accordingly, the efficiency of the inventive phrase recognition method is important, as it allows reduced response time and uses minimal resources in a time-sharing environment.
According to the application of the invention in a given on-line system, the method processes the text in a single document in real time to arrive at a list of "good" phrases, namely, ones which can be used as accurate and meaningful indications of the document's conceptual content, and which can be used as similarly accurate and meaningful queries in subsequent search requests. In particular, according to a preferred application, the Phrase List derived from the single document is used to construct a new search description to retrieve additional documents with similar conceptual content to the first document.
This implementation of the phrase recognition method may, for example, be embedded in session administrator (FIG. 2) or other software which governs operation of the computer system on which the phrase recognition method runs. Of course, the particular implementation will vary with the software and hardware environment of the particular application in question.
FIG. 7A indicates a user's viewing of a document in accordance with a suitable on-line text search system, and invoking the inventive phrase recognition method to search for additional documents of similar conceptual content. In particular, block 701 assumes a user viewing a given document enters a command (such as ".more") to retrieve more documents similar in conceptual content to the current one being viewed.
When the ".more" command is entered, control passes to block 702 which indicates retrieval of the document being viewed and passing it to the session administrator or other software which includes the inventive phrase recognition software.
Block 703 indicates execution of the inventive phrase recognition method on the text in the retrieved document. A candidate phrase list is generated, based on that single document.
Block 704 indicates how the candidate phrase list generated from the single document may be validated against an existing (larger) phrase dictionary. The static phrase dictionary may be generated as described below, with reference to the batch phrase recognition application in a distributed development system.
If a candidate phrase does not already exist in the phrase dictionary, the candidate phrase is broken down into its component words. Ultimately, a list of surviving phrases is chosen, based on frequency of occurrence.
At decision block 705, if at least a given threshold number of words or phrases (e.g., five words or phrases) is extracted, control passes from decision block 705 to block 708, described below.
If, however, the given threshold number of words or phrases is not extracted, control passes from decision block 705 along path 706 back to block 701, after displaying an error message at block 707 which indicates that the displayed current document could not successfully be processed under the ".more" command.
Block 708 indicates that the newly-added words or phrases are added to the search query which previously resulted in the user's viewing the current document. Block 709 indicates the system's displaying the new "combined" search query to the user. The user may edit the new query, or may simply accept the new query by pressing "enter".
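By way of illustration only, the flow of blocks 704 through 708 might be sketched as follows. The function name `build_more_query`, the representation of the query as a list of strings, and the choice to rank surviving terms by frequency with alphabetical tie-breaking are all assumptions made for this sketch; the description above does not prescribe them.

```python
from collections import Counter


def build_more_query(candidates, phrase_dictionary, previous_query, threshold=5):
    """Hypothetical sketch of blocks 704-708.

    candidates: dict mapping candidate phrase -> frequency in the document.
    phrase_dictionary: set of phrases already known to the static dictionary.
    previous_query: list of terms in the query that led to the document.
    Returns the combined query, or None when too few terms survive.
    """
    terms = Counter()
    for phrase, freq in candidates.items():
        if phrase in phrase_dictionary:
            # Block 704: the candidate is validated against the dictionary.
            terms[phrase] += freq
        else:
            # Unknown candidates are broken down into component words.
            for word in phrase.split():
                terms[word] += freq

    # Assumed selection rule: most frequent first, ties broken alphabetically.
    ranked = sorted(terms.items(), key=lambda kv: (-kv[1], kv[0]))
    survivors = [term for term, _ in ranked]

    # Block 705: require at least `threshold` extracted words or phrases.
    if len(survivors) < threshold:
        return None  # the caller would display the error of block 707

    # Block 708: add the new terms to the previous search query.
    return list(previous_query) + survivors
```

A caller would show the returned list to the user for editing or acceptance (block 709), and treat a `None` result as the error path back to block 701.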
FIG. 7B schematically indicates implementation of the phrase recognition method in a batch phrase recognition system in a distributed development system.
In contrast to the implementation of the on-line system of FIG. 7A, in the application shown in FIG. 7B, the phrase recognition method is applied to a large collection of documents, and produces a list of phrases associated with the entire collection.
As mentioned above with reference to FIG. 7A, the phrase dictionary may be generated by this batch recognition process in the "distributed development domain" (DDD) when there is an abundance of idle system resources. When the on-line system then uses the resultant phrase dictionary, the phrase dictionary is essentially static, having been generated and modified outside the on-line sessions.
The FIG. 7B application takes longer to execute than the single-document phrase recognition process occurring in the dynamic phrase recognition in the on-line application. Accordingly, the FIG. 7B process is preferably executed as a batch process at times when overall system usage is not impaired, such as overnight. In particular, the software implementation of the phrase recognition/phrase dictionary building process may be implemented on SUN work stations.
As a background to FIG. 7B, a developer's control file defines which documents, and/or which portions of the documents, should be processed in a given run. Block 723 indicates a filtering process which filters out documents and portions of documents which are not desired to contribute to the phrase dictionary, based on the control file. Block 724 indicates application of the inventive phrase recognition method to the documents and portions of documents which have passed through filter process 723.
The output of the phrase recognition process is a phrase list (PL) which, in the illustrated non-limiting embodiment, is stored as a standard UNIX text file on disk. In a preferred embodiment, single-word terms which are encountered are discarded, so that only multiple word phrases are included in the phrase list (PL).
For simplicity, each phrase is provided on a single line in the file. Block 725 indicates how the UNIX file is sorted using, for example, the standard UNIX sort utility, causing duplicate phrases to be grouped together. Block 725 also calculates the frequency of each of the grouped phrases.
If a given phrase occurs fewer than a given threshold number of times (e.g., five times as tested by decision block 727) it is discarded, as indicated by decision block 726.
Only phrases which have been encountered at least that threshold number of times survive to be included in the revised Phrase List, as shown in block 728.
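As a rough sketch of the sort, count, and threshold steps just described, the following treats the phrase list as an iterable of text lines, one phrase per line. The function name `revise_phrase_list` and its parameters are hypothetical; the embodiment itself relies on the standard UNIX sort utility rather than on code of this kind.

```python
from collections import Counter


def revise_phrase_list(lines, min_count=5):
    """Hypothetical sketch: group duplicate phrases, count them, and keep
    only those seen at least `min_count` times.

    lines: iterable of strings, one phrase per line (as in the UNIX file).
    Returns a sorted list of (phrase, frequency) pairs.
    """
    phrases = [line.strip() for line in lines if line.strip()]

    # Preferred embodiment: single-word terms are discarded.
    phrases = [p for p in phrases if len(p.split()) > 1]

    # Counting plays the role of sorting and grouping duplicates.
    counts = Counter(phrases)

    # Discard phrases below the threshold; sort for a stable listing.
    return sorted((p, n) for p, n in counts.items() if n >= min_count)
```

With the default threshold of five, a phrase appearing four times is dropped while one appearing five times survives into the revised Phrase List.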
The revised Phrase List is then transferred from the SUN work station to its desired destination for use in, for example, the on-line system described above. It may also be transferred to a mainframe computer using a file transfer protocol (FTP), to be processed by a phrase dictionary building program and compiled into a production phrase dictionary.
This process is illustrated as block 729.
Referring now to FIG. 7C, the application of the inventive phrase recognition method on a mainframe system is schematically illustrated. In the illustrated application, the phrase recognition method is implemented as a batch process in a production mainframe. The process involves a random sample of documents from a larger collection of documents, and produces a set of phrases for each document processed. The process is preferably executed when system resources are not otherwise in high demand, such as overnight. The process of FIG. 7C is especially useful for use with statistical thesauri.
As a background, it is assumed that phrases may be considered to be "related" to each other if they occur in the same document. This "relationship" can be exploited for such purposes as expanding a user's search query. However, in order to provide this ability, a large number of documents must first be processed.
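The notion of phrases being "related" through co-occurrence in the same document might be sketched as below. This is only an illustration under stated assumptions: the names `build_related` and `expand_query`, the use of raw co-occurrence counts as the relatedness score, and the `top_n` cutoff are not taken from the description.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_related(doc_term_sets):
    """Count, for every pair of terms, how many documents contain both."""
    related = defaultdict(Counter)
    for terms in doc_term_sets:
        for a, b in combinations(sorted(set(terms)), 2):
            related[a][b] += 1
            related[b][a] += 1
    return related


def expand_query(query_terms, related, top_n=2):
    """Append the `top_n` terms that co-occur most often with the query."""
    scores = Counter()
    for term in query_terms:
        for other, n in related.get(term, {}).items():
            if other not in query_terms:
                scores[other] += n
    # Highest co-occurrence first; ties broken alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query_terms) + [term for term, _ in ranked[:top_n]]
```

The counts only become meaningful once many documents have been processed, which is consistent with running the per-document extraction as a batch job.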
Referring again to FIG. 7C, block 733 indicates the filtering of documents and portions thereof in accordance with specifications from a control file, in much the same manner as described with reference to FIG. 7B. Block 734 indicates the application of the inventive phrase recognition method to the documents which pass the filter. One set of terms (single words, phrases or both) is produced for each document and stored in a respective suitably formatted data structure on a disk or other storage medium.
Further details of implementation of the applications of the inventive phrase recognition method depend on the particular hardware system, software platform, programming languages, and storage media being chosen, and lie within the ability of those skilled in the art.
The following Appendices are exemplary, illustrative, non-limiting examples of a Partition List and other lists which may be used with an embodiment of the phrase recognition method according to the present invention.
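To suggest how a Partition List of the kind shown in Appendix A could be used, the following minimal sketch splits text into chunks wherever a partition entity occurs and counts how often each chunk appears. The small `PARTITION_SAMPLE` set contains only a few entries copied from Appendix A; treating punctuation as a partition, the tokenizing regular expression, and the function name `chunk_text` are assumptions made for illustration.

```python
import re
from collections import Counter

# A handful of entries taken from the example Partition List (Appendix A).
PARTITION_SAMPLE = {
    "A", "ABOUT", "ALL", "AND", "ARE", "AT", "BE", "BY", "FOR", "FROM",
    "HAS", "IN", "IT", "OF", "THAT", "THE", "TO", "WITH",
}


def chunk_text(text, partition_list=PARTITION_SAMPLE):
    """Split text into chunks separated by partition entities.

    Returns a Counter mapping each chunk to its frequency of occurrence.
    """
    # Words (with internal apostrophes or hyphens) or single punctuation marks.
    tokens = re.findall(r"[A-Za-z0-9'-]+|[^\sA-Za-z0-9]", text)
    chunks, current = [], []
    for tok in tokens:
        # Assumption: punctuation separates chunks just as list entries do.
        is_partition = tok.upper() in partition_list or not tok[0].isalnum()
        if is_partition:
            if current:
                chunks.append(" ".join(current))
                current = []
        else:
            current.append(tok)
    if current:
        chunks.append(" ".join(current))
    return Counter(chunks)
```

For example, `chunk_text("The interest rate and the stock market: the interest rate has the attention of the Federal Reserve.")` yields the chunk "interest rate" with a count of two, alongside "stock market", "attention", and "Federal Reserve" with a count of one each.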
APPENDIX A
Example of PARTITION LIST
(On-Line System with News Data)
© Copyright 1995 LEXIS-NEXIS, a Division of Reed Elsevier Inc.
A BEING FRIDAY I'M
A.M BELOW FROM I'VE
ABOUT BETWEEN GET I.E
ABOVE BOTH GO IF
ACROSS BUT GOT IN
AFFECT BY HAD INTO
AGAIN COULD HARDLY IT
AGO DEC HAS ITS
ALL DECEMBER HAVE ITSELF
ALREADY DID HAVING JAN
ALTHOUGH DOE HENCE JUL
ALWAY DUE HER JULY
AN DURING HER'N JUN
AND E.G HERE JUNE
ANY EIGHT HEREBY LIKE
ANYBODY EIGHTEEN HEREIN MANY
ANYMORE EITHER HEREINAFTER MAR
ANYONE ELEVEN HEREINSOFAR MARCH
APR EVENTUALLY HEREON ME
APRIL EVER HERETO MIGHT
ARE EVERYBODY HEREWITH MINE
AROUND EVERYMAN HERN MON
ASIDE EVERYTHING HIC MORE
ASK EXCEPT HIM MUCH
AT FEB HIMSELF MUST
AUG FEBRUARY HIS MY
AUGUST FEW HIS'N MYSELF
AWAY FEWER HISSELF N.S
BAITH ~ HOC NANE
BE FIVE HOW N ~11~
BECAME FOR HOWEVER NEVERTHELESS
BEEN FOURTEEN I'D NINETEEN
BEFORE FRI I'LL NO
NOBLEWOMAN SEVEN THIRTEEN WHEREBY
NOBODY SEVENTEEN THIS WHEREEVER
NONE SEVERAL THOSE WHEREIN
NOR SHE THOU WHETHER
NOV SINCE THREE WHICHEVER
NOVEMBER SIR THROUGH WHICHSOEVER
NOW SIX THUR WHILE
O SIXTEEN THURSDAY WHO
OCTOBER SOME THY WHOM
OF SOMEBODY THYSELF WHOMSOEVER
OFF SOMEONE TILL WHOSE
OFTEN SOMETHING TO WHOSESOEVER
ONE SOMEWHERE TOMORROW WHOSOEVER
ONESELF SOONER TOO WHY
ONLY STILL TUE WILL
ONTO SUCCUSSION TUESDAY WITH
OTHER SUN TWENTY WITHOUT
OTHERWISE SUNDAY TWO WOULD
OUGHT TAKE UN YA
OUR TEN UNDER YE
OUR'N THAE UNLESS YES
OURSELF THAN UNTIL YESTERDAY
OURSELVE THAT UNTO YET
OUT THE UP YON
OVER THEE UPON YOU
P.M THEIR US YOUR
PERHAP THEIRSELF USE YOUR'N
QUIBUS THEIRSELVE VERY YOURSELF
QUITE THEM VIZ 1ST
REALLY THEN WE 11TH
SATURDAY THEREFROM WHATE'ER 16TH
SEE THEREOF WHATSOE'ER 18TH
SEEMED THEREON WHATSOEVER 19TH
APPENDIX B
Example of COMPANY INDICATOR LIST
© Copyright 1995 LEXIS-NEXIS, a Division of Reed Elsevier Inc.
BROS
BROS.
BROTHERS
CHARTERED
CHTD
CHTD.
CL.
CO
CO.
COMPANY
CORP.
CORPORATION
CP
CP.
GP
GP.
GROUP
INC
INC.
INCORP
INCORP.
INCORPORATED
INE
INE.
LIMITED
LNC
LNC.
LTD
LTD.
APPENDIX C
Example of PRODUCT NAME LIST
© Copyright 1995 LEXIS-NEXIS, a Division of Reed Elsevier Inc.
240sx Kleenex Taurus
300sx L.O.C. Tide
4-Runner Lexus Toshiba
7Up Linux Tums
Access Lotus Tylenol
Adobe Magnavox Windex
Altima Maxima Windows
Arid Mercedes Yashika
Avia Minolta Zoom
B-17 Mitsubishi
B17 Mustang
BMW Nike
Bayer Nikon
Blazer OS2
Bounty Oracle
Camary P100
Cannon P120
Chevy P133
Cirrus P60
Coke P75
Converse P90
Corvette Paradox
Etonic Pepsi
Excel Preparation-H
F-14 Puffs
F-15 Puma
F-16 Quicken
F-18 Rave
F-22 Reebok
F14 Rolaids
F16 Sable
F18 Sentra
F22 Seven-Up
Infinity Solaris
Ingres Sony
JVC Sprite
Jaguar Suave
Jeep Sun
Keds Sybase
APPENDIX D
Example of ORGANIZATION INDICATOR LIST
© Copyright 1995 LEXIS-NEXIS, a Division of Reed Elsevier Inc.
ADMINISTRATION SCHOOL
AGENCY SENATE
ARMY SOCIETY
ASSEMBLY TEAM
BOARD UNIVERSITY
BUREAU
CENTER
CHURCH
CLUB
COLLEGE
COMMISSION
COMMITTEE
CONGRESS
COUNCIL
COURT
CULT
DEPT
FACTION
FEDERATION
FOUNDATION
GUILD
HOSPITAL
HOUSE
INDUSTRY
LEAGUE
MEN
ORGANIZATION
PARLIAMENT
PARTY
REPUBLIC
APPENDIX E
Example of ENGLISH FIRST-NAME LIST
© Copyright 1995 LEXIS-NEXIS, a Division of Reed Elsevier Inc.
AARON ADRIANNE ALEXANDRINA ALPHONSUS
ABAGAIL ADRIEN ALEXEI ALTA
ABBIE ADRIENNE ALEXI ALTHEA
ABBY AERIEL ALEXIA ALTON
ABE AGATHA ALEXIS ALVA
ABEGAIL AGGIE ALF ALVAH
ABELARD AGNES ALFIO ALVIN
ABIGAIL AGNETA ALFORD ALYCE
ABNER AGUSTIN ALFRED AMALIA
ABRAHAM AHARON ALFREDA AMANDA
ACIE AILEEN ALGERNON AMBER
ACY AILEENE ALICE AMBROSE
ADA AILENE ALICIA AMBROSIA
ADAH AIME ALINE AMBROSIUS
ADALBERT AINSLEE ALISHA AMIE
ADALINE AINSLEY ALISON AMILE
ADAM AJA ALIX AMITY
ADDAM AL ALLAN AMON
ADDY ALAINE ALLEEN AMY
ADELA ALAN ALLEGRA ANA
ADELAIDE ALANAH ALLEN ANABEL
ADELBERT ALANNA ALLENE ANABELLE
ADELENE ALBERT ALLIE ANASTASIA
ADELINE ALBERTA ALLISON ANATOLY
ADELLA ALBIN ALLOYSIUS ANCIL
ADELLE ALDO ALLY ANDIE
ADNA ALEC ALMA ANDREAS
ADOLF ALECIA ALMETA ANDREE
ADOLPH ALECK ALMIRA ANDREI
ADOLPHUS ALENE ALMON ANDREJ
ADRIAN ALEXANDER ALOYSIUS ANDY
ADRIANE ALEXANDRA ALPHA ANETTA
ANETTE ARIC AUD BARNY
ANGELA ARICA AUDEY BARRETT
ANGELICA ARIEL AUDIE BARRY
ANGELINA ARISTOTLE AUDINE BART
ANGELIQUE ARLEEN AUDREY BARTON
ANGIE ARLEN AUDRIE BASIL
ANGUS ARLENE AUDRY BAYARD
ANITA ARLIE AUDY BEA
ANNA ARLINE AUGUST BEATRIX
ANNABEL ARLO AUGUSTINE BEAUREGARD
ANNABELLE ARMAND AUGUSTUS BEBE
ANNALEE ARMIN AURELIA BECCA
ANNELIESE ARNE AUSTEN BEE
ANNELISE ARNETT AUSTIN BELINDA
ANNEMARIE ARNEY AUTHER BELLA
ANNETTA ARNIE AUTRY BELLE
ANNICE ARON AVA BENEDICT
ANNIE ART AVERY BENJAMIN
ANNINA ARTE AVIS BENJI
ANNMARIE ARTEMIS AVITUS BENNETT
ANSELM ARTHUR AVRAM BENNO
ANSON ARTIE AXEL BENNY
ANTHONY ARTIS AZZIE BENTLEY
ANTOINE ARTY AZZY BERKE
ANTOINETTE ARVELL BABETTE BERKELEY
ANTON ARVIE BAILEY BERKELY
ANTONE ARVO BAIRD BERKLEY
ANTONETTE ARVON BALTHAZAR BERLE
ANTONI ASA BAMBI BERNARD
ANTONIO ASHER BARBARA BERNETTE
ANTONY ASHLEIGH BARBEE BERNHARD
AP ASHLEY BARBI BERNICE
APOLLO ASTER BARBIE BERNIE
ARA ASTRID BARNABAS BERRY
ARAM ATHENA BARNABUS BERT
ARBY ATHENE BARNABY BERTHA
ARCH ATTILIO BARNARD BERTHOLD
ARCHIE AUBRIE BARNETT BERTRAM
ARETHA AUBRY BARNEY BERTRAND
BERTRUM BRAINARD BUFORD CARLISLE
BERYL BRAINERD BUNNIE CARLTON
BESS BRANDI BUNNY CARLY
BESSIE BRANDY BURL CARLYLE
BETSEY BREK BURNETTA CAROL
BETSIE BRENARD BURNICE CAROLA
BETSY BRENDA BURREL CAROLANN
BETTE BRENDAN BURT CAROLE
BETTY BRET BURTRAM CAROLINE
BETTYE BRETT BUSTER CAROLYN
BEULAH BRIAN BUTCH CAROLYNN
BEVERLEE BRICE BYRON CARREN
BEVERLY BRIDGETT CAITLIN CARRIN
BEWANDA BRIDGETTE CAL CARROLL
BIFF BRIDIE CALE CARSON
BILL BRIGIT CALEB CARY
BILLY BRIJITTE CALLIE CARYN
BIRD BRITNY CALLY CAS
BJARNE BRITTANY CALVIN CASEY
BJORN BRITTNEY CAM CASI
BLAINE BROCK CAMERON CASPER
BLAIR BRODERICK CAMILE CASS
BLAKE BROOKE CAMILLA CASSANDRA
BLANCA BROOKS CAMILLE CASSIE
BLANCHE BRUNHILDA CANDI CATHARINE
BOB BRUNHILDE CANDICE CATHERINE
BOBBI BRUNO CANDIS CATHLEEN
BOBBIE BRYAN CANDUS CATHLENE
BONNIE BRYCE CANNIE CATHRYN
BONNY BRYON CARA CATHY
BOOKER BUBBA CAREN CEASAR
BORIS BUCK CAREY CEATRICE
BOYD BUD CARIN CECIL
BRACIE BUDDIE CARL CECILE
BRACK BUDDY CARLA CECILIA
BRAD BUEL CARLEEN CECILY
BRADLEY BUFFIE CARLETON CEFERINO
BRADLY BUFFY CARLINE CELESTE
CELESTINE CHRISTOPH CLEVON CORTNEY
CELIA CHRISTOPHER CLIFF CORY
CELINA CHRISTOS CLIFFORD COSMO
CESAR CHRISTY CLIFT COUNTEE
CHADWICK CHUCK CLINT COURTNEY
CHAIM CHUMLEY CLINTON COY
CHANCY CICELY CLIO CRAIG
CHANDLER CICILY CLITUS CRIS
CHARLEEN CINDY CLOVIA CRISPUS
CHARLENE CLAIR CLOVIS CRISSIE
CHARLES CLAIRE CLOYD CRISSY
CHARLESE CLARA CLYDE CRISTABEL
CHARLEY CLARE COLBERT CRYSTAL
CHARLIE CLARENCE COLE CURLESS
CHARLINE CLARICE COLEEN CURLY
CHARLISE CLARINA COLETTE CURT
CHARLOTTE CLARK COLITA CY
CHARLTON CLASSIE COLLEEN CYBIL
CHAS CLAUD COLLETTE CYBILL
CHASTITY CLAUDE COLLIN CYNDI
CHAUNCEY CLAUDELLE COLON CYNDY
CHELSIE CLAUDETTE CONNIE CYNTHIA
CHER CLAUDIA CONNY CYRIL
CHERI CLAUDINE CONRAD CYRILL
CHERIE CLAUDIUS CONROY CYRILLA
CHESTER CLAY CONSTANTIA DABNEY
CHET CLAYMON COOKIE DACIA
CHIP CLAYTON CORA DACIE
CHLOE CT FTO CORABELLE DAGMAR
CHRIS CLEMENT COREY DAISEY
CHRISSIE CLEMENTINE CORINE DAISY
CHRISSY CLEMENZA CORINNE DALE
CHRISTA CLENELL CORKIE DALTON
CHRISTABELLE CLEOPHUS CORNEAL DAMIEN
CHRISTAL CLEOTHA CORNELIA DAMION
CHRISTIAAN CLEOTIS CORNELIUS DAMON
CHRISTIAN CLETA CORRIE DAN
CHRISTIE CLETUS CORRINE DAN'L
CHRISTINE CLEVE CORRINNE DANA
CHRISTOFER CLEVELAND CORRY DANIEL
DANIELLA DEBBY DERL DOMENIC
DANIELLE DEBORA DERMOT DOMENICK
DANNA DEBORAH DERMOTT DOMER
DANNY DEBRA DERRALL DOMINIC
DANUTA DEDIE DERRICK DOMINICKA
DAPHNE DEE DERRY DOMINIQUE
DARBIE DEEANN DERWOOD DON
DARBY DEEANNE DESDEMONA DONALD
DARCEY DEIDRE DESIRE DONELLE
DARCI DEL DESIREE DONICE
DARCIE DELAINE DESMOND DONIS
DARCY DELANE DESMUND DONNA
DARIO DELBERT DEVORAH DONNELLE
DARIUS DELIA DEWANE DONNIE
DARLA DELL DEWAYNE DONNY
DARLEEN DELLA DEWEY DONOVAN
DARLINE DELMA DEXTER DORCAS
DARLYNE DELMAR DEZ DORCE
DARNELL DELMAS DIAHANN DOREEN
DAROLD DELMO DIANA DORI
DARREL DELNO DIANE DORIAN
DARRELL DELORES DIANNA DORIE
DARREN DELORIS DIANNE DORIENNE
DARRIN DELOY DICK DORINE
DARRYL DELTA DICKEY DORIS
DARYL DEMETRIUS DIDI DOROTHEA
DASHA DENARD DIEDRE DOROTHY
DAVE DENE DIERDRE DORRANCE
DAVEY DENICE DIETER DORRIS
DAVIDA DENIS DIMITRI DORTHIE
DAVIE DENISE DINA DORTHY
DAVY DENNIE DINAH DOSHIE
DAWN DENNIS DINO DOT
DEANDRA DENNYS DIRK DOTTY
DEANE DENORRIS DIXIE DOTY
DEANNA DEO DMITRI DOUG
DEANNE DEON DOLLIE DOUGIE
DEBBI DEREWOOD DOLORES DOUGLASS
DEBBIE DERICK DOM DOY
DOYLE EDISON ELISHA EMETT
DREW EDITA ELISSA EMIL
DRU EDITH ELIUS EMILE
DUAIN EDMOND ELIZA EMILIE
DUANE EDNA ELIZAR EMMA
DUB EDRIS ELKE EMMALINE
DUDLEY EDSEL ELLA EMMERY
DUEL EDUARD ELLE EMMET
DUFF EDWIN ELLERY EMMIE
DUFFY EDWINA ELLIE EMMOT
DUGALD EDY ELLIET EMMOTT
DUKE EDYTH ELLIOT EMMY
DULSA EFFIE ELLIS EMORY
DUNCAN EFRAIM ELLY ENDORA
DURWARD EGBERT ELLYN ENDRE
DURWOOD EGIDIO ELMA ENGELBERT
DUSTIN EILEEN ELMER ENID
DUSTY ELA ELMIRA ENISE
DWAIN ELAINE ELMO ENNIS
DWAINE ELAYNE ELNOR ENOCH
DWAYNE ELBA ELNORA ENOLA
DYLAN ELBERTA ELOISE EPHRAIM
DYNAH ELDA ELOUISE EPHRIAM
EARL ELDINE ELOY ERASMUS
EARLE ELDON ELRIC ERBIN
EARLINE ELE ELSA ERICA
EARNEST ELEANOR ELSBETH ERICH
EARNESTINE ELEANORA ELSIE ERICK
EARTHA ELEANORE ELTON ERIK
EBEN ELENA ELVERT ERIN
EBE~:G~K ELENORA ELVIE ERLAND
EBENEZER ELENORE ELVIN ERLE
EBERHARD ELEONORA ELVIRA ERMA
ED ELIAS ELVON ERNEST
EDD ELIC ELWOOD ERNESTINE
EDDIE ELIJAH ELY ERNESTO
EDDY ELINORE ELZA ERNIE
EDGER ELISABETH EMERIC ERROL
EDIE ELISE EMERY ERVAN
ERVEN EWIN FLOSSY GAGE
ERVIN I i.~.~;.K 11~1 FLOYD GAIL
ERWIN EZRA FONDA GALE
ESAU FABIAN FONTAINE GALLON
ESMERELDA FABIEN FORD GARETH
ESTA FAIRLEIGH FORREST GARLAND
ESTEL FAITH FRAN GARNET
ESTELA FANNIE FRANCES GAROLD
ESTELLA FANNY FRANCIA GARRET
ESTER FARLEIGH FRANCIS GARRIE
ESTHA FARLEY FRANCOIS GARRY
ESTHER FARRAH FRANCOISE GARTH
ETHAN FARREL FRANK GARVIN
ETHELENE FARRIS FRANKLIN GASTON
ETHELINE FATIMA FRANKLYN GAVIN
ETHYL FAUN FRANKY GAY
ETIENNE FAWN FRANNIE GAYE
ETTIE FAYE FRANZ GAYLORD
EUDORA FELECIA FRANZI GEARY
EUFA FELICIA FRANZIE GEMMA
EUGENE FELICITY FRANZY GENA
EUGENIE FELIZ FREDA GENEVA
EUGENIO FERD FREDDIE GENEVIEVE
EULA FERDINAND FREDDY GENIE
EULALEE FERGUS FREDERICH GENNARO
EULOGIO FERREL FREDERIK GENNIFER
EUNACE FERRELL FREEMAN GENNY
EUNICE FERRIS FREIDA GENO
EUPHEMIA FIDELE FRIEDA GEO
EVA FILBERT FRITZ GEOFFREY
EVALEE FILIPPO FRONA GEORGE
EVAN FIONA FYODOR GEORGES
EVANDER FITZ GABBEY GEORGIA
EVE FLO GABE GEORGIE
EVELYN FLOR GABRIEL GEORGINA
EVERETT FLORA GABRIELE GERALD
EVERETTE FLORANCE GABRIELLE GERALDINE
EVITA FLORIDA GAEL GERD
EWALD FLOSSIE GAETANO GERDINE
GERHARD GLENN GUNTER HANNE
GER~ GLENNA GUNTHER HANNES
GERMAIN GLENNIE GUS HANNIBAL
GERMAINE GLENNIS GUSSIE HANS
GEROLD GLENNON GUSSY HANSELL
GEROME GLORIA GUST HARLAN
GERRIE GLYN GUSTAF HARLEN
GERRIT GLYNDA GUSTAV HARLEY
GERRY GLYNIS GUSTAVE HARLIE
GERTA GLYNNIS GUSTOV HARMIE
GERTIE GODFREY GUY HARMON
GERTRUDE GODFRY GWEN HAROL
GEZA GODWIN GWENDA HAROLD
GIACOMO GOLDIE GWENDEN HARRIET
GIDEON GOLDY GWENDLYN HARRIETT
GIFFORD GOMER GWENDOLA HARRIETTA
GIGI GORDAN GWENDOLEN HARRIS
GILBERT GOTTFRIED GWENDOLY HARROLD
GILDA GRACE GWENDOLYN HARRY
GILES GRACIA GWENDOLYNE HARVEY
GILLIAN GRACIE GWENDY HARVIE
GINA GRAEME GWENETTA HATTIE
GINGER GRAHAM GWENETTE HATTY
GINNI GRANT GWENITH HAYDEE
GINNIE GRAYCE GWENN HAYDEN
GINO GREGG GWENNETH HAYLEY
GIORA GREGGORY GWENYTH HAYWOOD
GIOVANNA GREGORY GWYEN HAZEL
GIOVANNI GRETA GWYLLEN HEATHCLIFF
GISELLA GRETEL GWYNETH HECTOR
GISELLE GRIFF GWYNNE HEDDA
GIULIA GRIFFIN GYULA HEDDIE
GIUSEPPE GRIFFITH HAILEY HEDWIG
GIUSEPPINA GUENTHER HALLIE HEINRICH
GLADIS GUERINO HALLY HEINZ
GLADYCE GUIDO HAMISH HELAINE
GLADYS GUILLERMINA HAMPTON HELEN
GLENDA GUISEPPE HANNA HELENE
GLENDON GUNNER HANNAH HELGA
HELMUT HIRAM IKE ITA
HELMUTH HIRSCH ILA IVA
HELOISE HOBART ILAH IVAN
HENDRIK HOBERT ILEEN IVANA
HENDRIKA HOLLEY ILENA IVAR
HENNIE HOLLI ILENE IVARS
HENNY HOLLIE ILSE IVETTE
HENRI HOLLIS IMELDA IVEY
HENRIETA HOLLISTER IMOGENE IVIE
HENRIETTE HOLLYANN INGA IVONNE
HENRIK HOLLYANNE INGE IVOR
HENRY HOMER INGGA IVORY
HENRYK HONEY INGRAM IVY
HERBERT HOPE IOLA IZZIE
HERBIE HORACE IONA IZZY
HERBY HORST IONE JAC
HERCULE HORTENSE IRA JACK
HERM HORTON IRENA JACKIE
HERMA HOSEA IRENE JACKSON
HERMAN HOWARD IRIS JACKY
HERMANN HOWIE IRL JACOB
HEROLD HUBERT IRVIN JACQUELIN
HERSCH HUEL IRVING JACQUELINE
HERSCHEL HUEY IRWIN JACQUELYN
HERSHEL HUGH ISAAC JACQUES
HESTHER HULDA ISABEL JAKE
HETTA HULDAH ISABELLA JAKOB
HETTIE HUMPHREY ISABELLE JAMES
HETTLE HUNTINGTON ISAC JAMEY
HEYWOOD HY ISADORA JAMYE
HEZEKIAH HYACINTH ISADORE JAN
HILARY HYMAN ISAIAH JANA
HILDA IAIN ISHAM JANE
HILDEGARD ICHABOD ISIAH JANEL
HILDEGARDE IDA ISIDOR JANELL
HILDRED IGGY ISIDORE JANELLE
HILDUR IGNATIUS ISMAEL JANET
HILLARY IGNAZ ISRAEL JANIE
HILLERY IGOR ISTVAN JANINE
JANIS JERMAIN JOETTE JUDY
JANISE JERMAINE JOEY JULEE
JANNA JEROL JOFFRE JULES
JAQUELYN JEROLD JOH JULIA
JARRETT JERRALD JOHANNES JULIANE
JARRYL JERRED JOHN JULIANNE
JARVIS JERRELD JOHNIE JULIE
JAS JERRELL JOHNNA JULIEN
JASON JERRIE JOHNNY JULIET
JASPER JERROLD JOJO JULIETTE
JAY JERRY JOLENE JULIUS
JAYME JERZY JON JUNE
JAYNE JESSE JONAS JUNIE
JEAN JESSICA JONATHAN JUNIOR
JEANETTE JESSIE JONATHON JUNIUS
JEANIE JETHRO JONELL JURGEN
JEANNETTE JETTIE JONNY JUSTICE
JEANNINE JEWEL JORDAN JUSTIN
JEBEDIAH JEWELL JORDON JUSTINE
JED JILL JOSEA KALVIN
JEFEREY JIM JOSEPH KAREN
JEFF JIMBO JOSEPHA KARIN
JEFFREY JIMME JOSEPHINE KARL
J~ JIMMIE JOSEY KARLA
JEMIMA JINNY JOSHUA KASEY
JEMMA JO JOSIAH KASPAR
JEMMIE JOAB JOSIE KASPER
JEMMY JOACHIM JOY KATE
JENNIFER JOANN JOYCEANN KATEY
JENNY JOANNA JOZEF KATHERINE
JENS JOANNE JOZSEF KATHERYN
JERALD JOB JUBAL KATHLEEN
JERE JOCILYN JUDAH KATHY
JEREMIAH JOCLYN JUDAS KATIE
JEREMIAS JODI JUDD KATY
JEREMY JODIE JUDE KAY
JERI JODY JUDI KAYE
JERIANE JOE JUDIE KAYLEEN
JERIE JOEL JUDITH KEENAN
KEISHA KORNELIA LAURENCE LENNIE
KEITH KORNELIUS LAURETA LENNY
KFT.T FY KRAIG LAURETTA LENORA
KET.T.T KRIS LAURETTE LENORE
5 KFT.TTF. KRISTA LAURI LENWOOD
KELLY KRISTEN LAURIE LEO
KELSEY KRISTI LAURIEN LEOLA
KELVIN KRISTIE LAVELL LEON
KEN KRISTIN LAVERA LEONA
KENDELL KRISTOFFER LAVERNA LEONID
KENDRICK KRISTY LAVERNE LEONIDA
KENNETH KURT LAVINA LEONIDAS
KENNY KYLE LAVINIA LEONORA
KERI LACEY LAWRENCE LERA
KERMIT LACIE LDA LEROY
KERRIE LACY LEA LES
KERRY LADON LEAH LESLIE
KERVIN LALAH LEANE LETA
KERWIN LAMAR LEANN LETHA
KEVIN LAMARTINE LEANNE LETICIA
KIERSTEN LAMBERT LEATHA LETITIA
KILLIAN LANA LEEANN LETTY
KIM LANCE LEEANNE LEVERNE
KIMBER LANCELOT LEENA LEVERT
KIMBERLEE LANE LEESA LEVI
KIMBERLEY LARA LEFTY LEW
KIMBERLY LARENE LEIF LEWELL
KIP LARONE LEIGH LEWIS
KIRBY LARRIS LEILA LEXIE
KIRSTEN LARS LELAH LIANE
KIRSTI LASKA LELAND LIBBIE
KIRSTIE LASLO LELIA LIBBY
KIRSTIN LASZLO LELIO LIDA
KIT LATIMER LEMUEL LILA
KITTIE LAUANNA LEN LILAC
KITTY LAUNCIE LENA LILAH
KLAUS LAURA LENARD LILE
KONSTANTINOS LAUREL LENETTE LILIEN
KOOS LAUREN LENISE LILITH
LILLIA LONZIE LUCILLE MADALYN
LILLIAN LONZO LUCINDA MADDIE
LILLIE LONZY LUCIUS MADDY
LILLY LORA LUCRETIA MADELAINE
LIN LORAINE LUDLOW MADELENE
LINCOLN LORANE LUDWIG MADELINE
LINDA LORAY LUDWIK MADELYN
LINDSAY LORAYNE LUDY MADGE
LINK LOREN LUGENE MADONNA
LINNEA LORENA LUIGI MAE
LINNIE LORENE LUKE MAGDA
LINNY LORETA LULA MAGDALENA
LINUS LORI LULU MAGDALINE
LINVAL LORIN LUMMIE MAGGIE
LINWOOD LORNA LUNA MAGGY
LINZIE LORNE LURLEEN MAGNUS
LIONEL LORRAYNE LURLINE MAHALIA
LISA LOTHAR LUTHER MAIA
LISABETH LOTTIE LUZ MAIBLE
LISE LOU LUZERNE MAIJA
LIZ LOUIE LYLE MAL
LIZA LOUIS LYMAN MALCOLM
LIZABETH LOUISA LYN MALINDA
LIZZIE LOUISE LYNDA MALISSA
LLEWELLYN LOURETTA LYNN MALLORIE
LLOYD LOVELL LYNNA MALLORY
LLOYDA LOVETTA LYNNE MALORIE
LOELLA LOVETTE LYNNETTE MALORY
LOIS LOY LYSANDER MAME
LOLA LOYAL M'LINDA MAMIE
LOLETA LUANN MABEL MANDEE
LOLITA LUANNA MABELLE MANDI
LOLLY LUBY MAC MANDY
LON LUCAS MACE MANFRED
LONI LUCIAN MACIE MANICE
LONNA LUCIE MACK MANLEY
LONNY LUCILE MADALEINE MANNY
LONSO LUCILLA MADALINE MANSEL
MARABEL MARKIE MATTIE MELYNDA
MARC MARKOS MATTY MENDEL
MARCEL MARKUS MATTYE MERCEDEL
MARCELIN MARLA MAUD MERCEDES
MARCELL MARLENA MAUDE MERCY
MARCELLA MARLENE MAURA MEREDETH
MARCELLE MARLEY MAUREEN MEREDITH
MARCELLUS MARLIN MAURENE MERIDETH
MARCI MARLON MAUREY MERIDITH
MARCIE MARNEY MAURIE MERLE
MARCUS MARNIE MAURINE MERLIN
MARCY MARNY MAURY MERLYN
MARDA MARSDEN MAVIS MERREL
MARGE MARSHAL MAXCIE MERRILL
MARGEAUX MARSHALL MAXCINE MERRY
MARGERY MARTA MAXIE MERVIN
MARGI MARTEA MAXIM MERWIN
MARGO MARTI MAXIMILLIAN MERYL
MARGOT MARTICA MAXINE META
MARGRET MARTIE MAXWELL MEYER
MARGY MARTIKA MAY MIA
MARIAM MARTILDA MAYBELLE MIATTA
MARIAN MARTIN MAYDA MICAH
MARIANN MARTY MAYME MICHAEL
MARIANNE MARV MAYNARD MICHEL
MARIBEL MARVA MEAGAN MICHELE
MARIBELLE MARVIN MEG MICHELLE
MARIBETH MARY MEGAN MICK
MARIE MARYAM MEL MICKEY
MARIEL MARYANN MELANIE MICKIE
MARIETTA MARYANNE MELANY MICKY
MARILEE MARYFRANCES MELICENT MIKAEL
MARILYN MARYLOU MELINDA MIKAL
MARILYNN MASON MELISSA MIKE
MARINA MATE MELLICENT MIKEAL
MARION MATHEW MELODIE MILDRED
MARISSA MATHIAS MELODY MILES
MARIUS MATHILDA MELONIE MILICENT
MARJORIE MATILDA MELONY MILLARD
MARK MATTHEW MELVIN MILLIE
MARKEE MATTHIAS MELVYN MILLY
MILO MYRAH NERSES NORBERT
MILT MYRAL NESTOR NOREEN
MILTON MYREN ~1~;'1"1'1~; NORM
MIMI MYRNA NEVILLE NORMA
MINERVA MYRTLE NEWELL NORMAN
MINNIE NADENE NEWT NORRIS
MIRANDA NADIA NEWTON NORTON
MIRIAM NADINE NICHAEL NORVAL
MISSY NAMIE NICHOLAS NYLE
MISTY NAN NICK OBADIAH
MITCH NANCI NICKI OBED
MITCHEL NANCIE NICKIE OBEDIAH
MITZI NANETTE NICKY OCTAVE
MOE NANI NICODEMO OCTAVIA
MOLLIE NANNETTE NICODEMUS ODEL
MOLLY NANNI NICOL ODELL
MONICA NANNY NICOLAI ODIE
MONIQUE NAOMA NICOLAS ODIS
MONTE NAOMI NICOLE OGDEN
MONTGOMERY NAPOLEON NICOLETTE OKTAVIA
MONY NATALIE NICOLO OLAF
MORDECAI NATASHA NIGEL OLAN
MOREY NATASSIA NIKA OLEG
MORGAN NATHAN NIKE OLEN
MORRIS NATWICK NIKITA OLIN
MORT NAZARETH NIKITAS OLIVE
MORTIMER NEAL NIKKI OLIVER
MORTON NEALY NILE OLIVIA
MOSE NEDINE NILS OLLIE
MOSES NEIL NIMROD OLOF
MOZELLE NEILL NINA OMER
MULLIGAN NELDA NOAH ONAL
MURPHY NELLE NOEL ONNIK
MURRAY NELLIE NOLA OPAL
MURRELL NELLY NOLAN OPEL
MURRY NELS NONA OPHELIA
MYNA NENA NORA ORA
MYRA NERO NORAH ORAL
ORAN PATRICE PHINEAS RAYFORD
OREN PATRICIA PHOEBE RAYMOND
ORESTE PATRICK PHYLLIS RBT
ORIN PATSY PIA REAGAN
ORLAN PATTI PIER REBA
ORLEN PATTIE PIERCE REBECCA
ORLIN PATTY PIERINA RECTOR
ORLYN PAUL PIERRE REED
ORPHA PAULA PIERS REGAN
ORTON PAULINA PIETRO REGGY
ORTRUD PAULINE PLATO REGINA
ORVAL PAVEL POINDEXTER REGINALD
ORVID PEARCE POLLIE REGIS
ORVILLE PEARLENE PORTIA REINHARDT
OSBERT PEARLIE PRECY REINHOLD
OSCAR PEARLINE PRESTON REMI
OSGOOD PEARLY PRINCE REMO
OSSIE PEG PRISCILLA RENA
OSWALD PEGGI PRUDENCE RENATA
OTHA PEGGIE PRUE RENATE
OTHIS PEGGY PRUNELLA RENE
OTTIS PENNI QUEENIE RETA
OTTO PENNIE QUEENY RETHA
OVE PENNY QUENTIN REUBAN
OVETA PER QUINCY REUBEN
OZZIE PERCY RACHEL REUBIN
PADDIE PERRY RACHELLE REUBINA
PADDY PERSIS RAE REVA
PAGE PETE RAIFORD REX
PAM PETRA RAMONA REY
PAMELA PETRO RANDAL REYNALD
PANSY PETROS RANDALL REYNOLD
PAOLO PHEBE RANDI RHEA
PARNELL PHIDIAS RANDOLF RHODA
PARRY PHIL RANDOLPH RHONA
PASCAL PHILBERT RANDY RHONDA
PASCHAL PHILIP RANSOM RHYS
PAT PHILLIP RAQUEL RICH
PATIENCE PHILO RAY RICHARD
RICHELLE RONDA ROYAL SAMUAL
RICHIE RONDELL ROYCE SAMUEL
RICK RONETTE ROZALIA SANDI
RICKEY RONNETTE ROZALIE SANDIE
RICKIE RONNY RUBEN SANDY
RICKY ROOSEVELT RUBENA SANJA
RIKKI RORY RUBERT SARA
RILEY ROSALEE RUBEY SARAH
RISA ROSALIE RUBIN SCHUYLER
RITA ROSALIND RUBINA SCOT
RITCHIE ROSALINDA RUBY SCOTT
RITHA ROSALYN RUBYE SCOTTIE
ROBBI ROSAMOND RUDOLF SEAMUS
ROBBIE ROSAMONDE RUDOLPH SEAN
ROBBIN ROSAMUND RUDY SEBASTIAN
ROBBY ROSAMUNDE RUE SEFERINO
ROBERTA ROSANNE RUFUS SELENA
ROBIN ROSCOE RULOFF SELENE
ROBINA ROSE RUPERT SELINA
ROBINETTE ROSEANN RUSS SELMA
ROCCO ROSEBUD RUSTY SERENA
ROCHELLE ROSELIN RUTH SERINA
ROCKY ROSELYN RUTHANNA SETH
ROD ROSEMARIE RUTHANNE SEYMORE
RODERIC ROSEMUND RUTHLYN SHANE
RODERICH ROSEMUNDE RYAN SHANNON
RODERICK ROSENA SABA SHARI
RODGER ROSETTA SABINA SHARLENE
RODRICK ROSINA SABRINA SHARYN
ROGER ROSLYN SADIE SHAUN
ROLAND ROSS SAL SHAWN
ROLF ROSWELL SALLI SHAWNA
ROLLO ROULETTE SALLY SHEARL
ROMAN ROWENA SALLYE SHEBA
ROMEO ROWLAND SALOMON SHEENA
ROMULUS ROXANNE SAM SHEILA
RONA ROXY SAMMIE SHELDEN
RONALD ROY SAMMY SHELDON
SHELIA SMEDLEY SUSANA TELMA
SHELLEY SOFIE SUSANNA TENA
SHELLY SOL SUSANNAH TENCH
SHELTON SOLOMAN SUSANNE TERENCE
SHERIDAN SONDRA SUZAN TERRANCE
SHERIE SONIA SUZANNE TERRELL
SHERILYN SONJA SUZELLE TERRENCE
SHERL SONNY SIJ~ ; TERRI
SHERLEE SPARKY SUZIE TERRY
SHERMAN SPENCE SUZY TERRYL
SHERON SPENCER SVEN TESS
SHERREE SPENSER SWEN TESSIE
SHERRIE SPIROS SYBIL TEX
SHERRY SPYROS SYD THAD
SHERWIN STACEY SYDNEY THADDEUS
SHERYL STACI SYLVAN THADEUS
SHIRLE STACIE SYLVENE THEA
SHIRLEE STACY SYLVESTER THEDA
SHIRLEY STAN SYLVIA THEIMA
SI STANISLAW SYLVIE THELMA
SID STANLY TABATHA THEOBALD
SIDNEY STEFAN TABITHA THEODIS
SIEGFRIED ~ 1 TAD THEODOR
SIG STELLA TAFFY THEODORA
SIGMUND STEPHANIE TAMARA THEODORIS
SIGNE STEPHEN TAMI THEODOSIA
SIGURD STERLING TAMMI THEONE
SILAS STEVE TAMMIE THEORA
SILVIO STEVIE TANCRED THERESA
SIMEON STEWART TANIA THERESIA
SIMON STU TANJA THERESSA
SIMONE STUART TANYA THOM
SISSIE SUANNE TATE THOMASINA
SISSY SUE TATIANA THOR
SKEET SUELLEN TAUBA THORA
SKIPPIE SUMNER TED THORE
SKYLER SUNNY TEDDY THORVALD
SLIM SUSAN TEE THOS
THURMAN TRINA VERDEEN VIVIAN
THURMOND TRISH VERDELL VIVIANNE
THURSTAN TRISTAM VERGA VIVIEN
THURSTON TRISTAN VERGIL VLAD
TIFFANY TRIXY VERLIN VOL
TILLIE TROY VERLON VON
TILLY TRUDIE VERLYN VONDA
TIM TRUDY VERN VONNA
TIMO TWILA VERNARD WALDEMAR
TIMOTHY TWYLA VERNE WALDO
TINA TYCHO VERNEST WALDORF
TINO TYCHUS VERNESTINE WALLACE
TIPHANY TYRONE VERNIE WALLY
TITO UDIE VERNON WALT
TITUS UDY VERONICA WALTER
TOBIAS ULRICH VERSA WANDA
TODD UNA VI WARREN
TOLLIE URA VIC WASHINGTON
TOLLIVER URBAIN VICKI WAYEAN
TOLLY URIAS VICKIE WAYLEN
TOMMIE VACHEL VICTOR WAYMAN
TOMMY VADA VICTORIA WAYNE
TONEY VAL VIDAL WELDEN
TONI VALDA VIE WELDON
TONY VALENTINE VINCE WENDEL
TONYA VALENTINO VINCENT WENDELL
TOOTIE VALERIA VINNIE WENDI
TOVE VALERIE VINNY WENDY
TRACEY VANCE VIOLET WES
TRACI VANDER VIRACE WESLEY
TRACIE VANDY VIRDA WESLIE
TRACY VANESSA VIRGIE WESTLEY
TRAVIS VAUGHN VIRGINIA WIDDIE
TRENA VEDA VIRGINIUS WILBER
TRENT VELLA VITA WILBERFORCE
TREVOR VELMA VITINA WILBERT
TRICIA VEOLA VITO WILDA
TRILBY VERA VITTORIA WILEY
WILFORD ZALPH
WILFRED ZANE
WILHELM ZEB
WILHELMENA ZEBADIAH
WILHELMINA ZEBEDEE
WILHEMENA ZECHARIAH
WILHEMINA ZEF
WILLARD ZEFF
WILLIAM ZEKE
10 WILLIE ZET,T A
WILLIS ZELIA
WILLMA ZELIG
WILLY ZELL
WILMA ZELLA
WILMAR ZELLE
WILMOT ZELMA
WINFRED ZENA
WINIFRED ZENITH
WINNIE ZENO
WINNY ZENON
WINONA ZEPHERY
WINSLOW ZETA
WINSTON ZETTA
WINTON ZILLA
WM ZILLAH
WOLFGANG ZINA
WOODIE ZITA
WOODRUFF ZOE
WOODY ZOLLIE
WYATT ZOLLY
WYLA ZORA
XAVIERA ZYGMUND
YARDLEY ZYGMUNT
YETTA
YOLANDA
YVES
YVETTE
YVONNE
ZACHARIA
ZACHARY
ZACK
Modifications and variations of the above-described embodiments of the present invention are possible, as appreciated by those skilled in the art in light of the above teachings. As mentioned, any of a variety of hardware systems, memory organizations, software platforms, and programming languages may embody the present invention without departing from its spirit and scope. Moreover, countless variations of the Partition List, company indicators, product names, organization indicators, English first name list, and resulting Phrase Lists, and the like, may be employed or produced while remaining within the scope of the invention. It is therefore to be understood that, within the scope of the appended claims and their equivalents, the invention may be practiced otherwise than as specifically described.
Claims (30)
- CLAIMS:
A computer-implemented method of processing a stream of document text to form a list of phrases that may be used as index terms and search query terms in subsequently-performed full-text document searching, the method comprising:
partitioning the document text into plural chunks of document text, each chunk being separated by at least one partition entity from a partition list; and
selecting certain chunks as phrases of the phrase list, based on frequencies of occurrence of the chunks within the stream of document text.
- 2. The method of claim 1, wherein the partitioning step includes:
scanning a portion of the document text stream;
comparing the scanned portion of the document text stream to partition entities in the partition list;
substituting a partition tag for portions of the document text stream which match a partition entity;
generating a text chunk list;
scanning the text chunk list to determine a frequency of each text chunk in the text chunk list; and
revising the text chunk list to include the respective frequencies of occurrence in association with text chunks.
- 3. The method of claim 1, wherein the selecting step includes:
selecting the certain chunks as the phrases of the phrase list based only on the frequencies of occurrence of the chunks within the stream of document text and on a quantity of words within the chunks.
- 4. The method of claim 1, wherein:
a) the partitioning step includes:
a1) scanning a portion of the document text stream;
a2) comparing the scanned portion of the document text stream to partition entities in the partition list;
a3) substituting a partition tag for portions of the document text stream which match a partition entity;
a4) generating a text chunk list;
a5) scanning the text chunk list to determine a frequency of each text chunk in the text chunk list; and a6) revising the text chunk list to include the respective frequencies of occurrence in association with the text chunks; and b) the selecting step includes selecting the certain chunks as the phrases of the phrase list based only on the frequencies of occurrence of the chunks within the stream of document text and on a quantity of words within the chunks. - 5. The method of claim 1, wherein the selecting step includes:
excluding a chunk from being determined as a phrase if the chunk is a single word beginning with a lower case letter. - 6. The method of claim 1, wherein the selecting step includes:
determining a chunk as being a phrase if the chunk includes a plurality of words each constituting lower case letters only if the chunk occurs at least twice in the document text stream. - 7. The method of claim 1, wherein the selecting step includes:
determining a chunk as being a proper name if the chunk includes a plurality of words each having at least a first letter which is upper case. - 8. The method of claim 1, wherein the selecting step includes:
mapping a sub-phrase to a phrase. - 9. The method of claim 1, wherein the selecting step includes:
mapping single upper case words to their respective proper names. - 10. The method of claim 1, wherein the selecting step includes:
detecting presence of acronyms;
incrementing a count of a proper name corresponding to the respective detected acronyms; and copying the proper name and the acronym to an acronym list. - 11. The method of claim 1, wherein the selecting step includes:
combining a phrase list of lower case words with a phrase list of proper names. - 12. The method of claim 1, further comprising:
reducing the phrase list by consolidating phrases in the phrase list by using a synonym thesaurus. - 13. The method of claim 1, further comprising:
adding phrases to the phrase list by combining phrases which are separated in the document text stream only by prepositions. - 14. The method of claim 1, further comprising:
trimming the phrase list by eliminating phrases which occur in fewer than a threshold number of document text streams. - 15. The method of claim 1, further comprising:
categorizing proper names in the proper name list into groups based on corresponding group lists. - 16. An apparatus for processing a stream of document text to form a list of phrases that may be used as index terms and search query terms in subsequently-performed full-text document searching, the apparatus comprising:
means for partitioning the document text into plural chunks of document text, each chunk being separated by at least one partition entity from a partition list;
and means for selecting certain chunks as the phrases of the phrase list, based on frequencies of occurrence of the chunks within the stream of the document text. - 17. The apparatus of claim 16, wherein the partitioning means includes:
means for scanning a portion of the document text stream;
means for comparing the scanned portion of the document text stream to partition entities in the partition list;
means for substituting a partition tag for portions of the document text stream which match a partition entity;
means for generating a text chunk list;
means for scanning the text chunk list to determine a frequency of each text chunk in the text chunk list; and means for revising the text chunk list to include the respective frequencies of occurrence in association with the text chunks. - 18. The apparatus of claim 16, wherein the selecting means includes:
means for selecting certain chunks as the phrases of the phrase list, based only on the frequencies of occurrence of the chunks within the stream of document text and on a quantity of words within the chunks. - 19. The apparatus of claim 16, wherein:
a) the partitioning means includes:
a1) means for scanning a portion of the document text stream;
a2) means for comparing the scanned portion of the document text stream to partition entities in the partition list;
a3) means for substituting a partition tag for portions of the document text stream which match a partition entity;
a4) means for generating a text chunk list;
a5) means for scanning the text chunk list to determine a frequency of each text chunk in the text chunk list; and a6) means for revising the text chunk list to include the respective frequencies of occurrence in association with the text chunks; and b) the selecting means includes means for selecting the certain chunks as the phrases of the phrase list based only on the frequencies of occurrence of the chunks within the stream of document text and on a quantity of words within the chunks. - 20. A computer-readable memory which, when used in conjunction with a computer, can carry out a phrase recognition method to form a phrase list that may be used as index terms and search query terms in subsequent full-text document searching, the computer-readable memory comprising:
computer-readable code for partitioning document text into plural chunks of document text, each chunk being separated by at least one partition entity from a partition list; and computer-readable code for selecting certain chunks as the phrases of the phrase list based on frequencies of occurrence of the chunks within the stream of document text. - 21. The computer-readable memory of claim 20, wherein the computer-readable code for partitioning includes:
computer-readable code for scanning a portion of the document text stream;
computer-readable code for comparing the scanned portion of the document text stream to partition entities in the partition list;
computer-readable code for substituting a partition tag for portions of the document text stream which match a partition entity;
computer-readable code for generating a text chunk list;
computer-readable code for scanning the text chunk list to determine a frequency of each text chunk in the text chunk list; and computer-readable code for revising the text chunk list to include the respective frequencies of occurrence in association with the text chunks. - 22. The computer-readable memory of claim 20, wherein the computer-readable code for selecting includes:
computer-readable code for selecting certain chunks as the phrases of the phrase list based only on frequencies of occurrence of the chunks within the stream of document text and on a quantity of words within the chunks. - 23. The computer-readable memory of claim 20, wherein:
a) the computer-readable code for partitioning includes:
a1) computer-readable code for scanning a portion of the document text stream;
a2) computer-readable code for comparing the scanned portion of the document text stream to partition entities in the partition list;
a3) computer-readable code for substituting a partition tag for portions of the document text stream which match a partition entity;
a4) computer-readable code for generating a text chunk list;
a5) computer-readable code for scanning the text chunk list to determine a frequency of each text chunk in the text chunk list; and a6) computer-readable code for revising the text chunk list to include the respective frequencies of occurrence in association with the text chunks; and b) the computer-readable code for selecting includes computer-readable code for selecting the certain chunks as the phrases of the phrase list based only on the frequencies of occurrence of the chunks within the stream of document text and on a quantity of words within the chunks. - 24. A computer-implemented method of full-text, on-line searching, the method comprising:
a) receiving and executing a search query to display at least one current document;
b) receiving a command to search for documents having similar conceptual content to the current document;
c) executing a phrase recognition process to extract phrases allowing full text searches for documents having similar content to the current document, the phrase recognition process including the steps of:
c1) partitioning the document text into plural chunks of document text, each chunk being separated by at least one partition entity from a partition list; and c2) selecting certain chunks as the phrases, based on frequencies of occurrence of the chunks within the stream of document text; and d) automatically forming a second search query based at least on the phrases determined in the phrase recognition process so as to allow automated searching for documents having similar conceptual content to the current document. - 25. The method of claim 24, further comprising:
validating phrases recognized by the phrase recognition process against phrases in a phrase dictionary before automatically forming the second search query. - 26. The method of claim 24, further comprising:
displaying an error message if less than a threshold number of phrases are recognized for the current document. - 27. A computer-implemented method of forming a phrase list containing phrases which are indicative of conceptual content of each of a plurality of documents, which phrases are used as index terms or in document search queries formed after the phrase list is formed, the method comprising:
a) selecting document text from the plurality of documents;
b) executing a phrase recognition process including the steps of:
b1) partitioning the document text into plural chunks of document text, each chunk being separated by at least one partition entity from a partition list; and b2) selecting certain chunks as the phrases, based on frequencies of occurrence of the chunks within the stream of document text; and c) forming the phrase list, wherein the phrase list includes:
1) phrases extracted by the phrase recognition process; and 2) respective frequencies of occurrence of the extracted phrases. - 28. The method of claim 27, further comprising:
forming a modified phrase list having only those phrases whose respective frequencies of occurrence are greater than a threshold number of occurrences. - 29. The method of claim 27, further comprising:
forming a phrase dictionary based on the phrase list formed in the forming step. - 30. A computer-implemented method of forming phrase lists containing phrases that are indicative of conceptual content of documents, which phrases are used as index terms or in document search queries formed after the phrase list is formed, the method comprising:
a) selecting document text from a sampling of documents from among a larger collection of documents; and b) executing a phrase recognition process to extract phrases to form a phrase list for each document processed, the phrase recognition process including:
b1) partitioning the document text into plural chunks of document text, each chunk being separated by at least one partition entity from a partition list; and b2) selecting certain chunks as phrases of the phrase list based on frequencies of occurrence of the chunks within the stream of document text.
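For illustration only, the partitioning and selecting steps recited in claims 1, 2 and 5 to 7 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation disclosed in the specification: the contents of `PARTITION_WORDS`, the `|` partition tag, the tokenizing regular expression, and the function names `partition`, `chunk_frequencies` and `select_phrases` are all hypothetical. Digits and punctuation are simply treated as partition entities, and the handling of single upper case words (claim 9) and acronyms (claim 10) is omitted.

```python
import re
from collections import Counter

# Hypothetical partition list; the actual Partition List is not reproduced here.
PARTITION_WORDS = {"a", "an", "and", "the", "of", "in", "on", "is", "was",
                   "by", "to", "for", "with"}
PARTITION_TAG = "|"  # assumed tag substituted for each partition entity


def partition(text):
    """Substitute a tag for partition entities and return the text chunks in order."""
    # Words, or single non-letter characters (punctuation, digits).
    tokens = re.findall(r"[A-Za-z][A-Za-z'-]*|[^\sA-Za-z]", text)
    tagged = [
        PARTITION_TAG
        if (tok.lower() in PARTITION_WORDS or not tok[0].isalpha())
        else tok
        for tok in tokens
    ]
    chunks, current = [], []
    for tok in tagged:
        if tok == PARTITION_TAG:
            if current:  # a tag closes the chunk being built
                chunks.append(" ".join(current))
                current = []
        else:
            current.append(tok)
    if current:
        chunks.append(" ".join(current))
    return chunks


def chunk_frequencies(chunks):
    """Associate each distinct text chunk with its frequency of occurrence."""
    return Counter(chunks)


def select_phrases(chunks):
    """Apply frequency and word-count rules in the spirit of claims 5 to 7."""
    phrases, proper_names = {}, {}
    for chunk, freq in chunk_frequencies(chunks).items():
        words = chunk.split()
        if len(words) < 2:
            # Single words are not selected in this sketch (compare claim 5).
            continue
        if all(w[0].isupper() for w in words):
            # Every word capitalized: treat as a proper name (compare claim 7).
            proper_names[chunk] = freq
        elif all(w.islower() for w in words) and freq >= 2:
            # Lower case multi-word chunk seen at least twice (compare claim 6).
            phrases[chunk] = freq
    return phrases, proper_names


if __name__ == "__main__":
    sample = ("The phrase recognition method is fast. The phrase recognition "
              "method was developed by Lexis Nexis in Dayton.")
    print(select_phrases(partition(sample)))
```

In this sketch the lower case chunk "phrase recognition method" is kept because it occurs twice, the single-word chunks "fast", "developed" and "Dayton" are not selected, and "Lexis Nexis" is collected as a proper name regardless of its frequency.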
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US08/589,468 US5819260A (en) | 1996-01-22 | 1996-01-22 | Phrase recognition method and apparatus |
US08/589,468 | 1996-01-22 | ||
PCT/US1997/000212 WO1997027552A1 (en) | 1996-01-22 | 1997-01-21 | Phrase recognition method and apparatus |
Publications (2)
Publication Number | Publication Date |
---|---|
CA2244127A1 (en) | 1997-07-31 |
CA2244127C (en) | 2002-04-09 |
Family
ID=24358144
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CA002244127A Expired - Lifetime CA2244127C (en) | 1996-01-22 | 1997-01-21 | Phrase recognition method and apparatus |
Country Status (5)
Country | Link |
---|---|
US (1) | US5819260A (en) |
EP (1) | EP0883846A4 (en) |
AU (1) | AU1528897A (en) |
CA (1) | CA2244127C (en) |
WO (1) | WO1997027552A1 (en) |
Family Cites Families (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5225981A (en) * | 1986-10-03 | 1993-07-06 | Ricoh Company, Ltd. | Language analyzer for morphemically and syntactically analyzing natural languages by using block analysis and composite morphemes |
US5123103A (en) * | 1986-10-17 | 1992-06-16 | Hitachi, Ltd. | Method and system of retrieving program specification and linking the specification by concept to retrieval request for reusing program parts |
US4864502A (en) * | 1987-10-07 | 1989-09-05 | Houghton Mifflin Company | Sentence analyzer |
US4868750A (en) * | 1987-10-07 | 1989-09-19 | Houghton Mifflin Company | Collocational grammar system |
US4931936A (en) * | 1987-10-26 | 1990-06-05 | Sharp Kabushiki Kaisha | Language translation system with means to distinguish between phrases and sentence and number discriminating means |
US5146405A (en) * | 1988-02-05 | 1992-09-08 | At&T Bell Laboratories | Methods for part-of-speech determination and usage |
US4994966A (en) * | 1988-03-31 | 1991-02-19 | Emerson & Stern Associates, Inc. | System and method for natural language parsing by initiating processing prior to entry of complete sentences |
US4914590A (en) * | 1988-05-18 | 1990-04-03 | Emhart Industries, Inc. | Natural language understanding system |
JPH0743717B2 (en) * | 1989-02-06 | 1995-05-15 | 株式会社テレマティーク国際研究所 | Abstract sentence generator |
JPH077419B2 (en) * | 1989-06-30 | 1995-01-30 | シャープ株式会社 | Abbreviated proper noun processing method in machine translation device |
JPH03105566A (en) * | 1989-09-20 | 1991-05-02 | Hitachi Ltd | Summary preparing system |
JPH03122770A (en) * | 1989-10-05 | 1991-05-24 | Ricoh Co Ltd | Method for retrieving keyword associative document |
US5289375A (en) * | 1990-01-22 | 1994-02-22 | Sharp Kabushiki Kaisha | Translation machine |
US5481742A (en) * | 1990-05-04 | 1996-01-02 | Reed Elsevier Inc. | Printer control apparatus for remotely modifying local printer by configuration signals from remote host to produce customized printing control codes |
JPH04235672A (en) * | 1991-01-10 | 1992-08-24 | Sharp Corp | Translation machine |
US5251316A (en) * | 1991-06-28 | 1993-10-05 | Digital Equipment Corporation | Method and apparatus for integrating a dynamic lexicon into a full-text information retrieval system |
US5265065A (en) * | 1991-10-08 | 1993-11-23 | West Publishing Company | Method and apparatus for information retrieval from a database by replacing domain specific stemmed phases in a natural language to create a search query |
US5488725A (en) * | 1991-10-08 | 1996-01-30 | West Publishing Company | System of document representation retrieval by successive iterated probability sampling |
US5287278A (en) * | 1992-01-27 | 1994-02-15 | General Electric Company | Method for extracting company names from text |
JP3270783B2 (en) * | 1992-09-29 | 2002-04-02 | ゼロックス・コーポレーション | Multiple document search methods |
US5410475A (en) * | 1993-04-19 | 1995-04-25 | Mead Data Central, Inc. | Short case name generating method and apparatus |
JPH07182370A (en) * | 1993-12-22 | 1995-07-21 | Canon Inc | Text retrieval device |
US5745602A (en) * | 1995-05-01 | 1998-04-28 | Xerox Corporation | Automatic method of selecting multi-word key phrases from a document |
1996
- 1996-01-22 US US08/589,468 (patent US5819260A), not active: Expired - Lifetime
1997
- 1997-01-21 WO PCT/US1997/000212 (patent WO1997027552A1), active: Application Filing
- 1997-01-21 EP EP97901375A (patent EP0883846A4), not active: Ceased
- 1997-01-21 CA CA002244127A (patent CA2244127C), not active: Expired - Lifetime
- 1997-01-21 AU AU15288/97A (patent AU1528897A), not active: Abandoned
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10474702B1 (en) | 2014-08-18 | 2019-11-12 | Street Diligence, Inc. | Computer-implemented apparatus and method for providing information concerning a financial instrument |
Also Published As
Publication number | Publication date |
---|---|
US5819260A (en) | 1998-10-06 |
WO1997027552A1 (en) | 1997-07-31 |
EP0883846A1 (en) | 1998-12-16 |
CA2244127A1 (en) | 1997-07-31 |
EP0883846A4 (en) | 1999-04-14 |
AU1528897A (en) | 1997-08-20 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CA2244127C (en) | Phrase recognition method and apparatus | |
US6199067B1 (en) | System and method for generating personalized user profiles and for utilizing the generated user profiles to perform adaptive internet searches | |
US7856352B2 (en) | Method and system of presenting a document to a user | |
Craswell et al. | P@noptic Expert: Searching for experts not just for documents | |
US7209913B2 (en) | Method and system for searching and retrieving documents | |
US7089236B1 (en) | Search engine interface | |
US20010005847A1 (en) | Intelligent networked information sharing | |
CA2175187A1 (en) | Database search summary with user determined characteristics | |
Marcato | Personal Names in the Aramaic Inscriptions of Hatra | |
Feinglos | MEDLINE at BRS, DIALOG, and NLM: is there a choice? | |
Gershman | Knowledge-Based Parsing. | |
CN114706962A (en) | Information retrieval method and device and knowledge graph construction method and device | |
Muco | Change comes all too slowly in Albania | |
Lewis et al. | Retention of skill on the SAM Complex Coordinator | |
Bennett | Review essay: Folklorists and anthropologists | |
Raś | Information retrieval systems, an algebraic approach I | |
Fidel | Extracting knowledge for intermediary expert systems: the selection of search keys | |
Darie et al. | Modifying Data | |
Rossi | Individual names of the upper classes of Chieri (Turin) in the 16th century | |
WO2001079957A3 (en) | A method for creating content oriented databases and content files | |
Stupak | Corona Lexicon: Neologisms or Occasionalisms? | |
Kimura | The revolution and realignment of political parties in the Philippines (December 1985-January 1988): With a case in the province of Batangas | |
de Loupy et al. | Proper nouns thesaurus for document retrieval and question answering | |
Haskins | Black: Defiance, sensationalism and disaster | |
Naremore | Casting By | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | EEER | Examination request | |
 | MKEX | Expiry | Effective date: 20170123 |