WO1988004454A2 - Information retrieval system and method - Google Patents
- Publication number
- WO1988004454A2 (PCT/US1987/003143)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- texts
- group
- words
- word
- sub
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3341—Query execution using boolean model
Definitions
- the present invention relates to an information retrieval system and method which analyzes and summarizes the information contained in a group of texts and identifies similar words and word collections.
- Information retrieval is the process of selecting and presenting specific items from within a large and heterogeneous collection of texts, according to users' descriptions of the subjects in which they are interested.
- Some information retrieval systems index all the words appearing in all the texts; others index "keywords", which are descriptors assigned to each text by the text's author or by someone else. In both cases the user who wants to find a text does so by asking for a search on a particular word, or on logical (Boolean) combinations of words, or on words with some maximum distance (or similar relationship) between them in the texts, etc. In addition to requesting a specific word or words, most systems allow the user to search for a character string; e.g., LEXIS™ and DIALOG™.
- a typical search request on traditional systems generates a long list or a large collection of texts, all of which logically satisfy the search criterion but only a small percentage of which will actually be of use.
- the user is forced to expend much time and much energy winnowing (searching) through the texts found by the system, to pick out those truly relevant to his needs.
- the user either looks through the texts themselves, one by one, or looks through sequential listings of some part of the information available about each text: that is, the user may choose to review a sequential list of the titles of the texts, or abstracts of the texts, or lists of keywords of the texts, or the initial paragraphs of the texts, or the dates and origins of the texts, or some combinations of the above.
- the user is then given some method of specifying (usually by number) those texts for which he wishes fuller information, printouts, etc.
- DIALOG™ provides a user with the number of records (texts) satisfying the search request. The user can then request that any or all of the records be displayed and/or printed in any one of a number of formats containing varying amounts of information.
- a second method is generally used when the number of texts presented by an initial search is too large, or the original search criterion was too general, to make it practical for the user to look through sequential listings to pick out the texts he wants.
- This method is essentially an extension of the original Boolean search facility: the user can ask for additional searches to be made, and then can manipulate the additional lists of texts thus generated by requesting further lists to be created based on Boolean combinations of the preceding lists (e.g., the new list to include all the texts on list "A" and also on list "B" but to exclude any which appear on list "C", etc.).
- An example of this type of information retrieval system is DIALOG™, where the user can make additional search requests, and create new lists of texts based on Boolean combinations of preceding lists.
- In LEXIS™, a user can modify his/her search request in an effort to narrow down the number of cases (texts) developed from the initial search request.
- the present invention solves the problems described above, by making it possible for the user to see at a glance a break-down of the types of information contained in the texts selected by his initial request. From the generated display, the user can choose the texts which are relevant to his true interest both easily and quickly.
- the present invention also relates to a system and method for identifying words in a target word list which are similar to a source word, and/or for identifying phrases or sentences in a target population which are similar to a source phrase or sentence.
- Computer programs are used in a number of contexts to obtain words which are "similar” to some given source word, most notably in indexing and information retrieval programs and in spelling checkers.
- In indexing and retrieval programs, the purpose of such a search for "similar" words is to provide a more exhaustive list of terms related to the input word, such as plurals or forms modified by prefixes or suffixes.
- In spelling checkers, the purpose is to be able to make a suggestion as to the most likely word the user had intended, once a word is encountered which does not appear in the program's dictionary.
- In the case of spelling checkers, a more flexible approach is needed, since the user does not usually know that he has made a spelling mistake, nor does he know in advance the relationship between the way he thinks a word is spelled and the way it is spelled in fact. Most typically, spelling checkers locate "similar" words by first restricting the search to words beginning with the same letter as the misspelled word and then using a list of common spelling and typographical errors to find words which differ from the source word only by these letters.
- Some information retrieval programs use the phonetic approach also: along with a regular index of words (or of keywords) in their textbase, they create a parallel index in which those same words are represented phonetically. Search requests are then converted to phonetic format and the attempt is made to locate the search words' phonetic translation in the phonetic index.
- An example of this is the COMPUMARK™ system, which is used in searching for trademarks.
- the analyzing and summarizing aspect of the invention makes explicit the inherent relationships among a group of texts with associated keyword descriptions, by analyzing the keywords held in common by subgroups of texts within the overall group.
- the invention comes into play once a group of texts has been selected using standard search methodology — at the point at which the user would either have to make further guesses as to how to narrow down his search criterion, or would be presented with a sequence of texts that would then have to be "winnowed through.”
- the invention is a system and method of analyzing and of presenting the informational content of this group of texts, as a group.
- the user sees presented on a display medium (screen) the equivalent of an annotated "TABLE OF CONTENTS," organized as a standard outline or in some similarly graphic format, analyzing that group of texts into major subject areas, sub-categories, sub-sub-categories, etc.
- Each "TABLE OF CONTENTS" outline is dynamically generated in response to specific search requests, and constitutes a kind of "birds'-eye view" of the contents of the textbase in that subject area at that time.
- With an appropriate command, e.g., by moving a cursor and pressing a key, the user chooses the specific text or texts he wants to see according to the descriptions he sees on the table of contents, and with another appropriate command, e.g., a keystroke, brings those texts to the screen or sends them to be printed.
- the information retrieval system is coupled with a word processor, for convenience in entering texts into the textbase, and with an output screen presenting the results of the above analysis in traditional outline format.
- the output screen shows the categories and subcategories of subjects found to be included in the texts selected as a result of the original search request, to any desired level of detail.
- the user moves the screen cursor to point to a category on the outline for which he wants a more detailed break-down, and the process continues until individual texts are being referenced on the outline. Then by pressing a key the user can direct the system to send the chosen text to the printer or bring it to the screen.
- the significance of the invention is that the amount of time needed for the user to isolate texts of interest to him, from among groups of texts which satisfy his initial search request but are in fact irrelevant to him, is reduced by a large factor.
- the information retrieval process is made more convenient thereby; it is practical using this system to find specific texts with only the most minimal initial information as to how they may have been keyworded; and various practical constraints which have restricted the ways in which textbases needed to be organized in order to guarantee that stored information could be found again, can be relaxed.
- the invention also relates to a process which enables the computer to locate "similar" words in a manner more flexible and more exhaustive than any currently used technique, so far as we know.
- the invention does not require any specification by the user as to the relationship between the input word and the target words, nor does it rely on phonetic translation or any restrictive list of typical mistakes.
- the invention rather makes use of the actual structure of the word itself, and searches for words which have a similar structure or which include a similar structure as part of a larger structure. The invention is therefore able to locate a far more comprehensive list of "similar” words than is the case with other techniques.
- the structure of the input word is analyzed in terms of groups of letters, starting with letter pairs and working up to larger groups, and accords to any word in the target dictionary which contains these letter groupings a number of points determined by the size of the group and/or its location in the word. Words which are given a large number of points by the process are then presented to the user, in descending order of the number of points allocated, for his selection.
- For phrases and sentences the technique is identical, except that groups of words, rather than groups of letters, are compared.
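The scoring idea described above can be sketched as follows. This is a minimal illustration, not the patented procedure: the function names `group_score` and `rank_similar` are hypothetical, the weighting (points equal to the size of each shared group) is an assumption, and any bonus for the location of a group within the word is left out. Because the function works on any sequence, a string compares letter groups and a list of words compares word groups.

```python
def group_score(source, target, min_size=2):
    """Score `target` against `source` by the contiguous groups they share.

    Each distinct group of `source` that also occurs in `target` earns
    points equal to the group's size (an assumed weighting).
    """
    def groups(seq, size):
        # All distinct contiguous runs of `size` items in `seq`.
        return {tuple(seq[i:i + size]) for i in range(len(seq) - size + 1)}

    score = 0
    # Start with pairs and work up to groups as long as the source itself.
    for size in range(min_size, len(source) + 1):
        shared = groups(source, size) & groups(target, size)
        score += size * len(shared)
    return score


def rank_similar(source, candidates, min_score=1):
    """Return candidates scoring at least `min_score`, best first."""
    scored = [(group_score(source, c), c) for c in candidates]
    kept = [(s, c) for s, c in scored if s >= min_score]
    # Descending score; alphabetical order breaks ties deterministically.
    return [c for s, c in sorted(kept, key=lambda sc: (-sc[0], sc[1]))]
```

For example, `rank_similar("retrieval", ["retrieve", "dog", "retrieval"])` lists the exact match first, then "retrieve", and drops "dog" because it shares no letter pair with the source word.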
- One field of application for this invention is in information retrieval systems, where the user presents his search request in the form of a phrase or sentence, and texts are selected from the data-base and/or prioritized, according to the scores achieved when either their descriptions (keywords, title, abstract) or the texts themselves are evaluated according to this method. Since in information retrieval systems the typical search request finds many texts which are, in fact, irrelevant to the user, the invention, when employed to automatically winnow and/or prioritize texts can save time and trouble for users of the system.
- Figure 1 is a schematic view illustrating a stand-alone computer system wherein the present invention might be utilized
- Figure 2 is a schematic view illustrating a computer terminal system interconnected to a remote host computer system, the present invention being implemented in either or both computer systems;
- Figures 3A-G are schematic views illustrating file structures of an embodiment of an information retrieval system and method in accordance with the principles of the present invention
- Figure 4 is a schematic illustrating the addition of text to a textbase in accordance with the principles of the present invention.
- Figure 5 is a schematic view illustrating searching the textbase of Figure 2 in accordance with the principles of the present invention.
- Figure 6 is a schematic view illustrating locating of texts which match the search request
- Figure 7 is a schematic view illustrating analyzing the texts found in the search
- Figure 8 is a schematic view illustrating presentation of the analysis to a user.
- Figure 9 is a view illustrating a sample presentation of results at a display media
- Figure 10 is a schematic view of the process shown in Figure 7, but illustrating various additional features of analysis
- Figure 11 is a schematic view illustrating automatic keyword modification to groups of text
- Figure 12 is a schematic view illustrating the preparation of the search request prior to searching the textbase
- Figures 13A-B are logic flow designs of an embodiment of the present invention providing the ability to search for similar words
- Figure 14 is a schematic view illustrating an embodiment of the present invention for calculating the degree of similarity between two words
- Figure 15 is a schematic view illustrating the calculation of the similarity between two phrases or sentences or collections of words.
- Figure 16 is a schematic view illustrating the calculation of point scores used in calculating the similarity between two phrases or sentences or collections of words.
- One implementation of the invention is a program written in the C language, with some sections written in Assembly language.
- the implementation to be described runs on the IBM-PC and compatible microcomputers.
- the program is in principle easily transportable to a large family of micro, mini, and main-frame computers, and can be used in a multi-user environment.
- the program is loaded into the memory of a computer system 30 powered by a suitable power supply 31.
- the computer system 30 will include a user input device 32 such as a keyboard and/or mouse.
- the computer system will preferably include a storage device 33 for storage of text material.
- a printer 34 will be provided for hard copy printout of results and a display terminal 35 will be provided for display of the program analysis at the display terminal.
- the computer system 30 might be interconnected to a host computer system 36 by any number of different methods such as by telephone lines, a direct connect via a serial interface cable, a radio frequency (RF) interconnection, etc.
- the program of the present invention might be utilized in the computer system 30 and/or the host computer 36. In a multiuser environment, the program might be utilized from a dumb terminal 37 interconnected to the host computer 36.
- a text file 42 contains variable-size records 42a of the texts which have been saved in the textbase, there being a record for each text in the textbase. This information is ordinarily all kept in one file, though the possibility exists of splitting it into several smaller files if the physical limitations of the computer system being used prevent a single file of large enough size being maintained.
- a text pointer file 44 contains information as to where each individual text is located within the text file 42 itself. Space in the text file 42 is allocated as it becomes available (by old texts being deleted or updated); an ordered list is therefore necessary in order to locate the desired text at any time.
- Each record 44a of the text pointer file 44 includes a Text Number field assigning a unique number to each of the texts, a Location field specifying the text location, a Size field specifying the size of the text, and a Date field for providing information as to the date each text was last modified, for use when searching for texts which meet specific date criteria.
- a keyword file 46 contains variable-size records 46a listing every keyword which has been defined in the textbase.
- the keywords may have been defined in several different ways; for example, the author may define the keywords as the text is entered, the keywords may be defined automatically through the use of an automatic keywording feature as described below, text down-loaded from a commercial data base may have keywords already predefined, etc.
- a keynumber is allocated to each keyword on the basis of its position in the keyword file 46.
- An index of which texts contain which keywords is kept in text keyword file 48.
- This file contains a variable-size record 48a for each text in the textbase, the entries being in the form of the number of the text being referred to, followed by a list of the keynumbers of the keywords associated with that text, followed by an end marker to indicate the end of that list and then the entry for the next text.
- a text index file 50 includes a record 50a for each text providing an index to the location and size of the entry in the text keyword file 48 for each text.
- Free key file 52 and free text file 54 are lists of available space in files 48 and 42, respectively, so that space in those files can be reused as texts are deleted or updated.
- the files 52 and 54 and their associated records 52a, 54a have a structure similar to that of the text index file 50.
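As a rough illustration of the file layout described above, the sketch below models a text pointer record and the flat layout of the text-keyword file (text number, keynumbers, end marker, next entry). The names `TextPointer`, `encode_text_keywords`, `decode_text_keywords`, and the choice of `0` as the end marker are assumptions made for this example; the patent text does not give the actual encoding.

```python
from dataclasses import dataclass
from datetime import date

END_MARKER = 0  # assumed sentinel; keynumbers are taken to start at 1


@dataclass
class TextPointer:
    """One record of the text pointer file (field names are illustrative)."""
    text_number: int
    location: int   # offset of the text within the text file
    size: int       # size of the stored text
    modified: date  # date the text was last modified


def encode_text_keywords(entries):
    """Flatten {text_number: [keynumbers]} into a single integer stream."""
    stream = []
    for text_number, keynumbers in entries.items():
        stream.append(text_number)
        stream.extend(keynumbers)
        stream.append(END_MARKER)
    return stream


def decode_text_keywords(stream):
    """Rebuild {text_number: [keynumbers]} from the flat stream."""
    entries = {}
    it = iter(stream)
    for text_number in it:
        keys = []
        # Consume keynumbers from the same iterator until the end marker.
        for value in it:
            if value == END_MARKER:
                break
            keys.append(value)
        entries[text_number] = keys
    return entries
```

Variable-size entries of this kind are why a separate index (the text index file) is useful: without it, finding one text's keyword list would require scanning the stream from the start.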
- Adding Texts To The Textbase
- Figure 4 illustrates the process by which texts are created and saved in the textbase. Texts might be created at 56 by a word processor function associated with the program of the present invention, or by "importing" texts from files which have been created by other programs, such as by other word processor programs or texts developed from a data base search request.
- the user defines the keywords which he wants to use to describe that text, either by marking them in the text itself at 58 or by entering the keywords in a separate keyword list at 60. Keywords which the user has marked in the text are automatically scanned at 62 and added to the keyword list; both the text itself and the keyword list are available for editing throughout this process.
- a command e.g., presses a key
- a number is allocated to the text at 66, based on the next available position in the text pointer file 44.
- the keyword list for the text is then converted to keynumbers at 68, either by finding the existing keyword in the keyword file 46 or by adding a new keyword to the file 46.
- the position of a keyword in the keyword file 46 corresponds to the keyword number assigned to that keyword.
- the text itself, together with its keyword list, is added to the text file 42 and the text number and list of keynumbers added to the text-keyword file 48.
- the index files 44, 50, 52, and 54 are then updated with the appropriate information.
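The save procedure above can be sketched in memory as follows. This is a simplified illustration under stated assumptions: the class `Textbase` and its attributes are hypothetical stand-ins for the files, text numbers are simply allocated in sequence (the reuse of freed space is not modeled), and keynumbers start at 1.

```python
class Textbase:
    """In-memory stand-in for the keyword, text, and text-keyword files."""

    def __init__(self):
        self.keywords = []   # keyword file: position + 1 is the keynumber
        self.texts = {}      # text number -> text body
        self.text_keys = {}  # text number -> list of keynumbers

    def keynumber(self, keyword):
        """Find the keyword's number, adding the keyword if it is new."""
        try:
            return self.keywords.index(keyword) + 1
        except ValueError:
            self.keywords.append(keyword)
            return len(self.keywords)

    def add_text(self, body, keywords):
        """Allocate a text number, store the text, and record its keynumbers."""
        number = len(self.texts) + 1
        self.texts[number] = body
        self.text_keys[number] = [self.keynumber(k) for k in keywords]
        return number
```

Because a keynumber is just the keyword's position in the list, appending a new keyword never changes the numbers already stored for earlier texts.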
- the procedure by which the user searches the textbase to find a particular text or texts is illustrated in Figure 5.
- the user initially enters his search request at 72, in the form of the keyword or keywords which describe the information he is looking for. Boolean combinations of keywords may be used in the description to logically describe the set of texts which is being searched for. If the user has asked that similar words or pre-defined "equivalent" words be substituted into his search request, the substitution is made at this time. (This process is described below and in Figures 12 through 14.)
- the program searches the textbase at 74 to locate all texts which match the search request, as is shown in further detail in Figure 6.
- the program analyzes the set of texts which are found to satisfy the search request as shown in Figures 7 and 10, and at 78, the program displays the results of this analysis at the user's display terminal (screen) as shown in Figure 8.
- texts are selected by scanning the text-keyword file 48 for each keyword in the search request, and building a list of the texts which match the request.
- This list is constructed by taking from the search request each keyword in turn at 80, looking up its keynumber at 81 in the keyword file 46, and then scanning the text-keyword file 48 to find all texts which contain that keynumber at 82. The numbers of the texts are added to the list as they are found at 82.
- This list is then combined at 83 with the list of texts which had been found by previous iterations of this process, which dealt with keywords mentioned in earlier parts of the keyword request.
- the lists are combined according to the logical operation specified by the user.
- the process is repeated, the list produced by each successive iteration being combined with the list created by all previous iterations, until all the keywords in the search request have been dealt with.
- the program checks if the user has requested that the search be limited to texts created (or modified) within certain date limits. If the user has imposed no such date limits, at 86 the text selection is terminated. If the user has requested date limits, at 87 the listed texts are checked against dates stored in the text pointer file 44 and only those texts whose creation/modification dates fall within the limits are retained.
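The selection loop described above (look up each keyword, collect matching texts, combine with the running list, then apply date limits) might be sketched like this. The request format (a first keyword followed by `(operator, keyword)` pairs evaluated left to right), the operator names, and the function names are assumptions made for the example.

```python
from datetime import date

# Assumed operator names; "NOT" here means "and not".
OPS = {
    "AND": set.intersection,
    "OR": set.union,
    "NOT": set.difference,
}


def texts_with_key(text_keys, keynumber):
    """Scan the text-keyword index for texts containing `keynumber`."""
    return {t for t, keys in text_keys.items() if keynumber in keys}


def search(text_keys, keynumbers, first, rest=(), modified=None, date_limits=None):
    """Combine per-keyword lists left to right, then apply date limits."""
    found = texts_with_key(text_keys, keynumbers.get(first))
    for op, keyword in rest:
        matches = texts_with_key(text_keys, keynumbers.get(keyword))
        found = OPS[op](found, matches)
    if date_limits is not None:
        earliest, latest = date_limits
        found = {t for t in found if earliest <= modified[t] <= latest}
    return sorted(found)
```

A keyword that is not in the keyword list simply matches no texts, so `search(text_keys, keynumbers, "fruit", [("NOT", "apples")])` returns the texts keyworded "fruit" but not "apples".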
- the program then analyzes the set of texts which has been found and presents the results of that analysis.
- the process by which the analysis is carried out, and the manner in which the results are presented, will now be described.
- Analyzing The Texts Found In The Search
- the program analyzes the set of texts which has been found to match the initial search request, by means of the process shown in Figure 7.
- the program obtains the list of keynumbers associated with each text in the set. These lists are obtained by reading them from the text-keyword file 48.
- the lists for each text are then scanned at 90 and the number of texts in which each keynumber occurs is counted in order to identify the "criterion key” — the most frequently occurring keynumber, i.e., the keyword which is associated with the greatest number of texts in that set.
- the set of texts is then divided into two subsets at 92: the "right-group" containing all texts which are described by the "criterion key", and the "down-group" containing those texts which are not described by the criterion key.
- the "right-group” is thus a list of all the texts in the current set which include among their keywords the "criterion key”; all remaining texts from the current set are listed in the "down-group”.
- these two subsets are then in turn analyzed by the same process of finding the most commonly occurring keynumber and using it to split the set of texts; the two sections of the program at 90 and 92 being performed recursively until all the texts have been analyzed, or until such time as a decision is reached not to continue analysis further in either the "right" or the "down" direction.
- the full analysis is thus a set of nested processes; for a group of texts to be fully analyzed, the analyzing routines first split the initial group into two sub-groups, and then invoke themselves to handle the further analysis of each of the resulting sub-groups.
- the process proceeding to handle the new current group at 90 and at 92 may again be interrupted at 94 to handle yet another right group produced at 92 during this second iteration, and/or at 96 to handle analysis of the "down-group" produced at 92 during this second iteration.
- Every sub-group of the original group of texts is analyzed to the desired depth and a "tree" built out of the original list of texts.
- This tree is an analysis of the relationships among the various texts in terms of the keywords which describe them; it groups related texts together according to the similarities in their subject matter and locates all the texts in a structure of headings and sub-headings.
- the list or node in the "right” direction defines the texts which belong to the largest category from the set of texts which was input to the node, and the list or node in the "down” direction defines those texts not included in that largest category.
- the list of texts generated by the user's original search request and reading down from node to node, provides a listing of the major categories into which the original group of texts has been divided.
- This listing is automatically sorted into "order of importance" through the above procedure of selecting the successive "criterion keys”; the larger the group of texts described by any particular criterion key, the closer it will be to the top of the list.
- the tree provides a break-down of the original list of texts into its various subject matters, and can be extended to any desired level of detail.
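The recursive split into "right" and "down" groups can be sketched as below. This is an illustrative reading of the procedure, not the original C implementation: keywords are used directly instead of keynumbers, the `Node` class and `analyze` function are hypothetical names, ties between equally frequent keywords are broken alphabetically (an assumption), and the analysis always runs to full depth rather than stopping at a display limit.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    texts: list                     # text numbers covered by this node
    key: Optional[str] = None       # criterion key chosen for this node
    right: Optional["Node"] = None  # texts described by the criterion key
    down: Optional["Node"] = None   # texts not described by it


def analyze(texts, text_keys, used=frozenset()):
    """Build the right/down tree for `texts` (a list of text numbers)."""
    node = Node(texts=list(texts))
    # Count, for each keyword not already used as a heading, the number
    # of texts in which it occurs.
    counts = Counter(k for t in texts for k in set(text_keys[t]) if k not in used)
    if not counts:
        return node  # nothing left to split on
    key = min(counts, key=lambda k: (-counts[k], k))
    node.key = key
    right = [t for t in texts if key in text_keys[t]]
    down = [t for t in texts if key not in text_keys[t]]
    node.right = analyze(right, text_keys, used | {key})
    if down:
        node.down = analyze(down, text_keys, used)
    return node
```

Reading `down` links from the root gives the major categories in descending order of size, and following a `right` link gives the break-down within one category.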
- Control of the analysis 94, 96 is achieved either under interactive, user control or automatically on the basis of the number of texts already found and displayed.
- analysis to the "right” that is, more detailed analysis of a group of texts which are described by the "criterion key"
- the depth of analysis of that set is such that further analysis would take up too much space, making it impossible to show the "down-list” within the limits which have been set for the number of lines of analysis to display.
- Analysis "down” (those texts which are not described by the current "criterion key”) is terminated at 96 either when all texts have been shown, or on reaching a predetermined limit as to the number of lines to show.
- the user may control the analysis process by setting in advance the number of display lines at which he wishes automatic analysis to stop, or interactively by at each stage in the process deciding whether to further continue analysis either "right” or “down", and how far to continue it in either direction.
- the user may invoke various additional features affecting the procedure of analysis as generally illustrated in Figure 10.
- The results of the analysis procedure described above are presented to the user as a screen display, indicating the groups of texts which have been found and their relationships to each other, in the form of a "table of contents" of headings, sub-headings, and texts.
- the program then checks if there is a "right-node" associated with the current node (such a right node will have been produced by the analysis if there is room to expand further to the right in the outline, and if there are still texts with unexamined keywords in the node). If such a "right-node” exists, the count of how far to indent the next line on the screen or printer is increased by one at 106.
- the new current node is then handled as described, including the handling of its own right and down nodes, until the process runs to completion at 112.
- this node's parent node is seen to have been a right node, at 112b it is reinstated as current node and its processing continues at 108, which is just after the point at which handling of the node had been interrupted in order to handle its right node.
- the program checks whether a "down-node" exists at 108. If so, it is identified as the current node (without changing the indentation), a process similar to the one just described is undertaken at 108a,b,c, and the routine invokes itself again at 100. Thus, handling of the parent node is again suspended while the down-node (now the current node in the new invocation) is handled.
- work on the down-node (which includes work on any of its subordinate nodes) reaches 112 and 112a, the parent node is reinstated as current node at 112c, and the level of indentation used (at 102) in creating display lines is reduced by one at 110. Since, in the example we have been running through, the node which is now the current node was the original "root" node, at 112 the display process terminates at 115.
- processing of the root-node is interrupted first to process the right-node, and then to process the down-node.
- Each of those processes may in turn be interrupted to process right-nodes and down-nodes, each of which may in turn be interrupted, etc.
- each time that the processing of a given node terminates (when there is no further right-node and no further down-node to be handled) the program checks at 112 if there are unfinished nodes to process. If such nodes exist, at 112a control is returned to the parent node from which the routine was invoked, and processing picks up where it left off. In the case of the root-node, there is no parent-node, and the process terminates at 115, the whole table of contents having been displayed.
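The display routine described above amounts to a recursive walk in which a right-node is shown one level deeper and a down-node at the same level. A minimal sketch follows; the `Node` class, the `render` function, and the two-space indent are assumptions for illustration, and the language's call stack plays the role of the explicit "return to parent node" bookkeeping at 112.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: str
    right: Optional["Node"] = None
    down: Optional["Node"] = None


def render(node, indent=0, lines=None):
    """Return the outline lines for `node` and everything below it."""
    if lines is None:
        lines = []
    lines.append("  " * indent + node.label)
    if node.right is not None:
        # The right-node is a sub-heading: indent one level further.
        render(node.right, indent + 1, lines)
    if node.down is not None:
        # The down-node is a sibling heading: keep the same indentation.
        render(node.down, indent, lines)
    return lines
```

For a tree with root "fruit", right-node "oranges" (itself with right-node "Jaffa" and down-node "apples"), and down-node "wine", `render` yields "fruit", "  oranges", "    Jaffa", "  apples", "wine".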
- Illustrated in Figure 9 is an example of such a screen display.
- the first line is a heading indicating the search request which created this analysis.
- the remainder of the display represents, by showing the successive criterion keys as headings, the results of the analysis in the form of an organized "table of contents" of the section of the textbase under analysis.
- lines ending with an arrow represent the presence of a text which includes only the keywords shown in that line and in the headings above it.
- a text has been associated with the keywords "fruit”, “oranges” and "Jaffa”. Analysis right on this text has been completed. If there were more than one text with these keywords, a series of right arrows would be shown on the line, one for each text.
- Lines in the table of contents with a number shown in brackets, such as line I.A.I, indicate that there are that number of texts including the keywords shown in that line and in the headings above, as well as other keywords, and those texts are not shown individually in this analysis (i.e., analysis "right” has been terminated at this level).
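The two line markers described above (one arrow per fully analyzed text, a bracketed count for texts not shown individually) could be produced by a small formatter like the one below. The function name `outline_line`, the arrow character, and the exact spacing are assumptions, since Figure 9 is not reproduced here.

```python
def outline_line(keyword, exact_texts=0, unexpanded_texts=0):
    """Format one table-of-contents line for a heading keyword.

    exact_texts: texts having only the keywords shown so far (one arrow each).
    unexpanded_texts: texts with further keywords, not shown individually.
    """
    line = keyword
    if exact_texts:
        line += " " + "→" * exact_texts
    if unexpanded_texts:
        line += f" [{unexpanded_texts}]"
    return line
```

For instance, `outline_line("Jaffa", exact_texts=2)` gives a line ending in two arrows, while `outline_line("oranges", unexpanded_texts=3)` ends in `[3]`.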
- the user can either review the texts indicated by the analysis, or ask for a further "expansion" of a group of texts which have not been fully analyzed.
- the user moves the cursor up and down on the screen to point at the text or group of texts he is interested in; and then presses a key to request that the text be displayed by the word processor or that the group be expanded.
- If a text is to be displayed, its number is taken from the non-printing codes embedded in the table of contents. That number corresponds to an entry in the text pointer file 44, where the location of the text itself (within the text file 42) is indicated.
- the text is read from the text file and passed to the word processor for reading, editing, or printing.
- the program refers to the non-printing codes embedded in the table of contents to find the location in memory of the list of texts and other information associated with the group.
- the information is then passed to the analysis and display routines previously described (Figures 7 and 8).
- This new analysis is presented in a new screen display, to be used in the same way as the "parent" analysis; the user can continue to "expand” any group until he finds and loads the text he is searching for, or can at any time return to a previous "parent" table of contents to look at a different group of texts.
- a special header which appears at the top of the screen whenever he stops moving the cursor on the table of contents; this header indicates the list of keywords describing either the text or the group of texts which that line represents.
- screen highlighting might be used to indicate that the cursor is pointing at a specific text, or to indicate all lines of the display which are contained in the group referred to by the cursor.
- Figure 10 illustrates the basic text analysis procedure of Figure 7 with additional features being present for enabling the user to modify the basic text analysis.
- Keyword Manipulation During Analysis
- It is possible to "hide" specified keywords so that they are removed at 88a and do not appear in the analysis at all, to "ignore" keywords at 90a so that they are shown in the results of the analysis but are never used to split the set of texts, or to declare certain keywords as "equal" to each other and substitute therefor at 88b so that they are treated as identical for purposes of the analysis.
- Hiding keywords can be useful in cases where some group of keywords, which would otherwise influence the display, are irrelevant for a particular purpose at hand. If the user has asked for words to be "hidden”, then at the time that the program obtains the lists of keynumbers associated with each text at 88 by reading them from the text-keyword file 48, the lists are compared to the list of words to be hidden. Keynumbers found on the "hidden" list are simply skipped at 88a, not included in the keyword lists which are subsequently used for the analysis.
- Keywords on the "equal” list are arranged in groups of words which will be made “equal” to each other. If the user has asked for words to be made “equal” to each other, then at the time that the program obtains the list of keynumbers at 88 by reading them from the text-keyword file 48, the "equals list” is scanned with each keynumber read from the file.
- the chosen key is found to be on that list, then it is disallowed as a criterion key at 90b, and the most popular key not on the "ignore" list is chosen as criterion key in its stead. This allows such words to appear on the display without affecting the manner in which sub-groups are defined.
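The hide, equal, and ignore behaviours described above can be sketched as follows. This is a hedged illustration, not the patented implementation: the names `prepare_keywords` and `choose_criterion`, the use of plain strings instead of keynumbers, and the representation of the "equal" list as a mapping from each word to one representative are assumptions made for the example.

```python
from collections import Counter

def prepare_keywords(keys, hidden=frozenset(), equal=None):
    """Drop hidden keys (88a) and replace 'equal' keys by one representative (88b)."""
    equal = equal or {}
    out = []
    for key in keys:
        if key in hidden:
            continue                      # hidden keys never reach the analysis
        key = equal.get(key, key)         # assumed mapping: word -> representative
        if key not in out:
            out.append(key)
    return out

def choose_criterion(texts, ignore=frozenset()):
    """Pick the most popular key that is not on the ignore list (90, 90a, 90b)."""
    counts = Counter(key for keys in texts.values() for key in keys)
    for key, _count in counts.most_common():
        if key not in ignore:
            return key
    return None

texts = {1: ["cat", "pets"], 2: ["dog", "pets"], 3: ["cat", "pets"]}
print(choose_criterion(texts))                   # most popular key overall
print(choose_criterion(texts, ignore={"pets"}))  # next most popular key
```

Ignored keys stay in each text's keyword list, so they can still be displayed; they are only excluded when a criterion key is chosen.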
- the user may, in fact, supply at 90c not just one word, but a logical combination of keywords, thus creating a "local" Boolean search request, which is then analyzed just as was the original main search request, as described above and in Figure 6.
- the current group of texts is then split into two groups, those satisfying the "local" Boolean search request going to the "right" group, those which do not satisfy it going to the "down" group.
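A split on a "local" Boolean request might look like the sketch below. The nested-tuple form of the request (for example `("and", "tax", ("not", "law"))`) and the names `matches` and `split_group` are assumptions chosen to keep the example short; parsing a typed request into this form is not shown.

```python
def matches(expr, keywords):
    """Evaluate a nested ('and' | 'or' | 'not', ...) expression against a keyword set."""
    if isinstance(expr, str):
        return expr in keywords
    op, *args = expr
    if op == "and":
        return all(matches(a, keywords) for a in args)
    if op == "or":
        return any(matches(a, keywords) for a in args)
    if op == "not":
        return not matches(args[0], keywords)
    raise ValueError(f"unknown operator: {op}")

def split_group(group, expr):
    """Texts satisfying the request go 'right'; all others go 'down'."""
    right = [tid for tid, keys in group.items() if matches(expr, keys)]
    down = [tid for tid, keys in group.items() if not matches(expr, keys)]
    return right, down

group = {1: {"tax", "law"}, 2: {"tax"}, 3: {"law"}}
print(split_group(group, ("and", "tax", ("not", "law"))))
```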
- the next line of the outline which has the same level of indentation as the current line, if any, provides the criterion for analysis (and becomes the "current line") when the down-group created by the current analysis, if any, is in turn analyzed.
- the lines of the outline which had been respectively selected are then treated as the "current line", and further lines are identified for use in analyzing the resultant new right and down groups.
- input of a criterion key from a user-supplied outline at 90d then replaces the counting operation at 90 and is used to split the group into two groups at 92. Any group being analyzed for which no line of the outline has been designated to provide the analysis criterion is split in either the usual automatic or the usual interactive manner, as described above.
- Lines from the outline supplying criteria in the analysis of groups of text at 90d are then reproduced as part of the display lines generated at 102.
- the result of this procedure is that the user provides an outline of his subject matter, and the system fills that outline with references to whatever texts in his textbase are relevant to each part and sub-part of the outline.
- An additional method of controlling the analysis is to split the group according to the success or failure of a scan for the presence or absence of words (or pairs or groups of words with a specified degree of contiguity) within the texts themselves (rather than checking for words within the keyword list) at 90e.
- This scanning operation at 90e then replaces the counting operation at 90 and the results are used to split the group into two groups at 92.
- This provides a facility for the use of what are called "full text searching techniques" (information retrieval techniques not based on designated keywords) in the context of a retrieval system whose major functions are based on the use of keywords.
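As an illustration of splitting on a full-text scan, the sketch below tests whether two words occur within a given number of word positions of each other in the text body. The tokenizer, the distance measure, and the names `within_distance` and `split_by_scan` are assumptions; the "degree of contiguity" could be defined in other ways.

```python
import re

def words_of(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def within_distance(text, first, second, max_gap):
    """True if `first` and `second` occur at most `max_gap` word positions apart."""
    words = words_of(text)
    pos_a = [i for i, w in enumerate(words) if w == first]
    pos_b = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)

def split_by_scan(texts, first, second, max_gap):
    """Texts that pass the scan go 'right'; the rest go 'down' (90e, 92)."""
    right, down = [], []
    for tid, body in texts.items():
        (right if within_distance(body, first, second, max_gap) else down).append(tid)
    return right, down

texts = {
    1: "The patent covers text retrieval.",
    2: "Retrieval is hard. Much later we mention some text.",
}
print(split_by_scan(texts, "text", "retrieval", 1))
```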
- Mathematical Calculations Used as Criterion During Analysis. An additional method of controlling the analysis is to split the group of texts into two groups in a manner dependent on the results of a mathematical calculation performed on a number or numbers found either within the text or among the text's keywords at 90f.
- the numbers to be used are identified by having a particular position on the text's keyword list, or by having a particular position with respect to some designated word found on the keyword list, or by having a particular positional relationship to some designated word found in the text itself.
- This numerical calculation at 90f, performed either on the data found in the keyword list at 88 or on the data found by scanning the text at 90e, then replaces the counting operation at 90, and the results are used to split the group into two groups at 92.
- a group of texts might be split into a right-group and a down-group according to their success in fulfilling the criterion "cost is greater than 100", where the number to be inspected is either whatever keyword follows the keyword "cost" on the text's keyword list, or whatever word follows the word "cost" in the text itself.
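The "cost is greater than 100" example can be sketched as below, reading the number from the keyword that follows a designated marker keyword. The names `number_after` and `split_by_number` are hypothetical, and sending texts with no usable number to the down-group is an assumption of this sketch.

```python
def number_after(items, marker):
    """Return the number that follows `marker` in a list of words, or None."""
    for i, item in enumerate(items[:-1]):
        if item == marker:
            try:
                return float(items[i + 1])
            except ValueError:
                return None               # the following word is not a number
    return None

def split_by_number(group, marker, predicate):
    """Split on a numeric test (90f); texts without a number go 'down' (assumed)."""
    right, down = [], []
    for tid, keywords in group.items():
        value = number_after(keywords, marker)
        (right if value is not None and predicate(value) else down).append(tid)
    return right, down

group = {1: ["invoice", "cost", "250"], 2: ["invoice", "cost", "80"], 3: ["memo"]}
print(split_by_number(group, "cost", lambda v: v > 100))
```

The same `number_after` helper works on the words of the text itself if the body is first split into a word list.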
- One use of such an analysis is to provide for prioritization of a group of texts according to the degree of similarity between the set of that text's keywords, treated as a phrase or sentence, and the user's search request, treated as a phrase or sentence, according to the method of measuring similarity described below and in Figures 15 and 16.
- the same method can be implemented in a comparison between the search request and the texts themselves, or portions of the texts, or non-keyword information (e.g., titles, abstracts) associated with the texts.
- One embodiment of the present invention enables users to allocate keywords automatically to texts created by the word processor or texts imported from outside sources.
- the implementation uses the normal text handling routines to load the texts one by one at 204; the keyword list is read, add words are added to it at 206, and delete words are eliminated from it if found at 208.
- the text is then scanned on a word by word basis, the words of the text being compared to the "scan words" at 210. When a match is found, the scan word is added to the keyword list at 210.
- a preliminary check can be made to find whether some designated additional scan word(s) are found within a designated proximity, and only if so is the scan word(s) added to the keyword list, and b) when a match with a scan word is found, some other designated word can be added to the keylist.
- When the scan is completed, if any changes have been made to the keylist at 212, the files are updated at 214 using normal text-saving procedures.
- a check is made whether there are any more texts to be handled. If there are no more texts, the routine terminates at 218; otherwise the next text is handled starting at 204.
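The per-text part of this allocation loop can be sketched as follows. It is a simplified illustration: the function name `allocate_keywords` is hypothetical, the optional proximity check and word substitution are omitted, and the sketch iterates over the scan words rather than over the words of the text, so the order of newly added keywords may differ from a word-by-word scan.

```python
import re

def allocate_keywords(text, keywords, add_words=(), delete_words=(), scan_words=()):
    """Return (new_keyword_list, changed) for one text."""
    result = list(keywords)
    for word in add_words:                                  # step 206
        if word not in result:
            result.append(word)
    result = [k for k in result if k not in delete_words]   # step 208
    text_words = set(re.findall(r"[a-z0-9]+", text.lower()))
    for word in scan_words:                                 # step 210
        if word in text_words and word not in result:
            result.append(word)
    return result, result != list(keywords)                 # 'changed' drives 212/214

new_keys, changed = allocate_keywords(
    "The contract covers patent licensing.",
    ["legal", "draft"],
    add_words=["1987"],
    delete_words=["draft"],
    scan_words=["patent", "trademark"],
)
print(new_keys, changed)
```

A caller would loop over all texts, saving a text's keyword list only when `changed` is true.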
- one embodiment of the invention enables the user to cause the search request he enters at 300 to be modified automatically in several ways.
- the program inspects the search request to determine whether the user has indicated (by means of an appropriate symbol) that "equivalent" words or combinations of words may have been previously defined to the system. (This technique is useful both to permit a single word to represent a complex and oft-repeated search request, and to provide the means for providing automatic "equivalence" between the habitual keyword vocabularies of different users of a common textbase system.)
- If a word in the search request is preceded by such a symbol (in this implementation a dollar sign was used), then the word (including the preceding dollar sign) is searched for in the keyword file 46. If found at 302, the text-keyword file 48 is scanned to locate the associated text. The text, if one is found, is not displayed through the normal display process; rather, the entire text is taken to be a redefinition of the word which had been preceded by the dollar sign, and is substituted for it in the user's search request at 303. The search request is then again made available to the user for further editing at 300 or for him to repeat his command that the search request be processed.
- the program inspects the search request to determine whether any of the search words begin or end with an asterisk. If so, the similarity checking routines (described below and in Figures 13 and 14) are invoked to find all the keywords in the keyword file 46 which are similar to the given word in the search request. If a list of similar words is found at 305, the words are separated by the word "or", the list is enclosed in parentheses, and the whole is substituted for the original word at 303 in the user's search request.
- the modified request is again presented to the user for further editing, or for his command to proceed with processing the request at 300.
- processing proceeds to the locating of texts matching the search request description.
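The two automatic modifications of the search request can be sketched as below. This is an assumption-laden illustration: `expand_request` is a hypothetical name, the stored redefinitions are modelled as a dictionary `macros` rather than as texts located through the keyword files, and the similarity routine is passed in as a callable `find_similar`.

```python
def expand_request(request, macros, find_similar):
    """Expand '$name' redefinitions and '*'-marked words in a space-separated request."""
    out = []
    for token in request.split():
        if token.startswith("$") and token in macros:
            out.append(macros[token])                     # substitution at 303
        elif token.startswith("*") or token.endswith("*"):
            similar = find_similar(token.strip("*"))      # similarity check, 305
            out.append("(" + " or ".join(similar) + ")" if similar else token)
        else:
            out.append(token)
    return " ".join(out)

macros = {"$ip": "(patent or trademark or copyright)"}
lookup = {"licen": ["license", "licence"]}
print(expand_request("$ip and licen*", macros, lambda w: lookup.get(w, [])))
```

Words with no stored redefinition or no similar keywords are left unchanged, so the user sees them again when the request is presented for further editing.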
- FIG. 13A-B. Illustrated in Figures 13A-B is a process in accordance with the principles of the present invention, which enables words in a target word list to be identified which are similar to the key words. Indeed, as previously discussed, this aspect of the invention will have application in several other uses, such as spelling checkers and the like.
- the invention is part of an information retrieval system, where it is used to find key words related to words from a search request provided by the user (whether similar words, or misspellings of the same word).
- the source word, i.e., the word for which similar words are to be found, is input to the process at 342. If it has a suffix, a second copy of the word is made without the suffix at 344, 346 and this new version of the word is kept for later use.
- the program now fetches a word from the target dictionary at 348, and the original word (including suffix) is compared to the target word at 350 by the process described below and in Figure 14. If that comparison yields a score of zero or less than zero at 352, the program then checks at 372 if there is another word in the target dictionary to look at, and if there is such a word, fetches it at 348 and continues the comparison process.
- the program now compares the target word with the source word at 354. The comparison is repeated here because the process is essentially asymmetric - the first comparison at 350 checked whether the letter groupings of the source word are to be found in the target word; the second comparison checks whether the letter groupings of the target word are to be found in the source word.
- the scores resulting from these two comparisons are now added together and the total score examined. If the total is very low (below 200 out of a possible 2000 at this point), the comparison is abandoned and the program continues to examine the next target word at 348-352. If the total is very high (above 1900 out of 2000), the next stage in the comparison process is bypassed as being unnecessary and the program continues directly to calculating the average score at 366. Alternatively, at 358, 360, if the total score falls somewhere between the above cut-off values, the target word is examined to see if it has a suffix; if so, the suffix is removed.
- the suffix-less copy of the source word is now compared with the suffix-less copy of the target word at 364 and the score resulting from this comparison added to the total score.
- An average is now calculated at 366 for all the comparisons which have been carried out, and if that average score is above a set threshold (400 out of a possible score of 1,000) at 368, the word is added to the list of similar words found at 370.
- the program now checks whether there are any more target words to which the source word should be compared; once the whole target dictionary has been scanned in this way, the list of similar words is sorted into descending numerical order at 374 and the list of words (cut off at some convenient threshold) is returned to the user at 376 for further editing and/or use at 300.
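The control flow of this dictionary scan, with its cut-off values, can be sketched as follows. The word-to-word comparison is passed in as a callable `compare` returning 0 to 1000; the suffix list in `strip_suffix` is purely hypothetical, and the sketch always runs the suffix-less comparison in the middle band, which simplifies the suffix checks at 358, 360.

```python
def strip_suffix(word, suffixes=("ing", "ed", "es", "s")):
    """Hypothetical suffix stripper; the actual suffix list is not given here."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def find_similar(source, dictionary, compare, low=200, high=1900, keep=400):
    """Return dictionary words similar to `source`, best match first."""
    found = []
    for target in dictionary:                        # fetch at 348
        scores = [compare(source, target)]           # 350
        if scores[0] <= 0:                           # 352: no similarity at all
            continue
        scores.append(compare(target, source))       # 354: reverse comparison
        total = sum(scores)
        if total < low:                              # very low: abandon this word
            continue
        if total <= high:                            # middle band: third comparison, 364
            scores.append(compare(strip_suffix(source), strip_suffix(target)))
        average = sum(scores) / len(scores)          # 366
        if average > keep:                           # 368
            found.append((average, target))          # 370
    found.sort(reverse=True)                         # 374: descending score
    return [word for _score, word in found]
```

Any scoring function with the same 0 to 1000 range can be supplied as `compare`; the thresholds are exposed as parameters because only the values 200, 1900, and 400 are stated in the text.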
- the first letter is taken from the source word, and from the target word at 378,380. At 382, these letters are compared to each other. If the letters do not match, then at 392 the program keeps trying to find a match by taking one more letter at a time from the target word until there are no more target letters. The next source letter is then taken, and compared to each target letter in turn, and so on until there are no more source letters to compare.
- the comparison score is then calculated at 390, based on the number of consecutive letters which matched in the two words. This score is obtained from a table which converts the number of matching letters to the appropriate score value. The start of this table is shown below:
- a total weighted score is calculated at 396. This total is obtained by adding together all the subscores generated during the comparison, and dividing the sum by the total score possible based on the length of the source word. This highest possible total is simply the value found in the scoring table for the length of the source word itself, as this is the value that would have been found by comparing the source word with itself. In this way, the score is adjusted for the length of the word so that the same score will be obtained for words of comparable similarity, no matter their lengths. This final score is multiplied by 1000 in order to convert it to an integer value between 0 and 1000.
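A letter-run comparison with length normalization might be sketched as below. Two points are assumptions rather than the patent's method: the scoring table is not reproduced in this text, so `score_table` uses triangular numbers as a stand-in, and the sketch takes the longest run of matching letters starting at each source position instead of following the exact letter-by-letter stepping of Figure 14.

```python
def score_table(run_length):
    """Hypothetical stand-in for the scoring table: run length -> sub-score."""
    return run_length * (run_length + 1) // 2

def compare_words(source, target):
    """Score 0..1000: how well the letter groupings of `source` appear in `target`."""
    if not source:
        return 0
    total = 0
    i = 0
    while i < len(source):
        best = 0
        for j in range(len(target)):                 # try every target position
            run = 0
            while (i + run < len(source) and j + run < len(target)
                   and source[i + run] == target[j + run]):
                run += 1
            best = max(best, run)
        total += score_table(best)                   # sub-score for this run
        i += best if best else 1                     # skip past the matched run
    # Normalize by the score the source word would get against itself.
    return total * 1000 // score_table(len(source))

print(compare_words("cat", "cats"), compare_words("cats", "cat"))
```

The two calls give different results, which illustrates why the word comparison is run in both directions: every letter grouping of "cat" is found in "cats", but not the reverse.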
- the source sentence is compared a third time to the target sentence, this time where neither sentence includes "noise words" 442.
- the total scores of all these comparisons are now averaged at 443, and if the average score lies above a predetermined threshold at 444, the target sentence is added to the list of similar sentences found at 445. This procedure is repeated until no more sentences exist in the list of target sentences being examined at 446; at this stage the list of similar sentences found is sorted at 447.
- the sorted list is either returned to the user, or is used by the program to control the selection and/or display of texts. In the implementation described above, the list is used at 90g, where a group of texts is divided into two sub-groups depending on whether each text's comparison score falls above or below a given threshold.
- the "sentences" referred to in the paragraphs above may be, but are not necessarily, grammatical natural language sentences.
- the procedure is also applied to "sentences" which are actually the keyword list provided by the user when he describes the text on saving it in the textbase.
- the request might be a collection of words in a predetermined order having no sentence structure.
- the program will then search for this collection of words appearing in the specified order within an area of the text.
- the area of the text might be limited to a predetermined sub-area of the text such as the title, abstract, paragraph, etc. or within a certain number of words. This feature enables the program to distinguish between areas of text having the same words but an entirely different meaning.
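The ordered-collection search can be sketched as follows, using a window measured in words as the "area" of the text. The name `ordered_within`, the tokenizer, and the choice to measure the window from the first matched word are assumptions of this example; restricting the search to a title or abstract would simply mean passing that portion of the text.

```python
import re

def ordered_within(text, words, window):
    """True if `words` occur in the given order inside a span of `window` words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    for start, token in enumerate(tokens):
        if token != words[0]:
            continue
        pos, ok = start, True
        for word in words[1:]:
            try:
                # Look for the next word after the previous match, inside the window.
                pos = tokens.index(word, pos + 1, start + window)
            except ValueError:
                ok = False
                break
        if ok:
            return True
    return False

print(ordered_within("The dog bit the man", ["dog", "man"], 4))
print(ordered_within("The man bit the dog", ["dog", "man"], 4))
```

The two sample texts contain the same words, but only the first has them in the requested order, which is the distinction described above.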
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Circuits Of Receivers In General (AREA)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DK434388A DK434388A (en) | 1986-12-04 | 1988-08-03 | PROCEDURE AND INFORMATION REQUIREMENTS |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US93816386A | 1986-12-04 | 1986-12-04 | |
US938,163 | 1986-12-04 |
Publications (2)
Publication Number | Publication Date |
---|---|
WO1988004454A2 true WO1988004454A2 (en) | 1988-06-16 |
WO1988004454A3 WO1988004454A3 (en) | 1988-11-17 |
Family
ID=25470999
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US1987/003143 WO1988004454A2 (en) | 1986-12-04 | 1987-11-27 | Information retrieval system and method |
Country Status (6)
Country | Link |
---|---|
EP (1) | EP0334888A1 (en) |
AU (1) | AU607963B2 (en) |
CA (1) | CA1276728C (en) |
DK (1) | DK434388A (en) |
IL (1) | IL84706A (en) |
WO (1) | WO1988004454A2 (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0130050A2 (en) * | 1983-06-21 | 1985-01-02 | Kabushiki Kaisha Toshiba | Data management apparatus |
-
1987
- 1987-11-27 EP EP88900194A patent/EP0334888A1/en not_active Withdrawn
- 1987-11-27 WO PCT/US1987/003143 patent/WO1988004454A2/en not_active Application Discontinuation
- 1987-11-27 AU AU10451/88A patent/AU607963B2/en not_active Ceased
- 1987-12-03 IL IL84706A patent/IL84706A/en not_active IP Right Cessation
- 1987-12-04 CA CA000553603A patent/CA1276728C/en not_active Expired - Lifetime
-
1988
- 1988-08-03 DK DK434388A patent/DK434388A/en not_active Application Discontinuation
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0743606A2 (en) * | 1995-05-17 | 1996-11-20 | Fuji Xerox Co., Ltd. | Data unit group handling apparatus |
EP0743606A3 (en) * | 1995-05-17 | 1997-12-10 | Fuji Xerox Co., Ltd. | Data unit group handling apparatus |
EP0750266A1 (en) * | 1995-06-19 | 1996-12-27 | Sharp Kabushiki Kaisha | Document classification unit and document retrieval unit |
WO1999012108A1 (en) * | 1997-09-04 | 1999-03-11 | British Telecommunications Public Limited Company | Methods and/or systems for selecting data sets |
AU742831B2 (en) * | 1997-09-04 | 2002-01-10 | British Telecommunications Public Limited Company | Methods and/or systems for selecting data sets |
US10007932B2 (en) * | 2015-07-01 | 2018-06-26 | Vizirecruiter Llc | System and method for creation of visual job advertisements |
US10311489B2 (en) | 2015-07-01 | 2019-06-04 | Vizirecruiter, Llc | System and method for creation of visual job advertisements |
US10628860B2 (en) | 2015-07-01 | 2020-04-21 | Vizirecruiter Llc | System and method for creation of visual job advertisements |
Also Published As
Publication number | Publication date |
---|---|
IL84706A0 (en) | 1988-05-31 |
IL84706A (en) | 1992-11-15 |
CA1276728C (en) | 1990-11-20 |
DK434388D0 (en) | 1988-08-03 |
DK434388A (en) | 1988-10-03 |
AU1045188A (en) | 1988-06-30 |
WO1988004454A3 (en) | 1988-11-17 |
EP0334888A1 (en) | 1989-10-04 |
AU607963B2 (en) | 1991-03-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A2 Designated state(s): AU DK FI JP NO |
|
AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): AT BE CH DE FR GB IT LU NL SE |
|
AK | Designated states |
Kind code of ref document: A3 Designated state(s): AU DK FI JP NO |
|
AL | Designated countries for regional patents |
Kind code of ref document: A3 Designated state(s): AT BE CH DE FR GB IT LU NL SE |
|
WWE | Wipo information: entry into national phase |
Ref document number: 1988900194 Country of ref document: EP |
|
WWP | Wipo information: published in national office |
Ref document number: 1988900194 Country of ref document: EP |
|
WWW | Wipo information: withdrawn in national office |
Ref document number: 1988900194 Country of ref document: EP |