WO2004006133A1 - Text-matching code, system and method - Google Patents

Info

Publication number
WO2004006133A1
Authority
WO
WIPO (PCT)
Prior art keywords
texts
terms
text
database
term
Prior art date
Application number
PCT/US2002/021198
Other languages
English (en)
Inventor
Peter J. Dehlinger
Shao Chin
Original Assignee
Iotapi.Com, Inc.
Application filed by Iotapi.Com, Inc.
Priority to PCT/US2002/021198 (WO2004006133A1)
Priority to AU2002320280A (AU2002320280A1)
Priority to US10/261,971 (US7181451B2)
Priority to US10/262,192 (US20040006547A1)
Publication of WO2004006133A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries

Definitions

  • This invention relates to the field of text matching, and in particular, to a method, machine-readable code, and system for matching one text with each of a plurality of texts in a related field.
  • the invention includes, in one aspect, an automated method of comparing a target concept, invention, or event in a selected field with each of a plurality of natural-language texts in the same field.
  • each of a plurality of terms composed of non-generic words and, optionally, word groups characterizing the target concept, invention, or event is associated with a selectivity value related to the frequency of occurrence of that term in a database of digitally processed texts in the selected field, relative to the frequency of occurrence of the same term in a database of digitally processed texts in one or more unrelated fields.
  • the method determines for each of the plurality of natural-language texts in the same field, a match score related to the number of terms derived from that text that match those in the target concept, invention, or event, preferably weighted by selectivity values of the matching terms. Those natural-language texts in the same field having the highest match score or scores are then identified.
  • The method further includes constructing the plurality of terms by (a) identifying non-generic words in the input text, and (b) constructing from the identified words a plurality of word groups, each containing two or more non-generic words that are proximately arranged in the input text.
  • the method may assign to each of the terms in the target concept, invention, or event, a match value related to the corresponding selectivity value, and sum the match values of terms that match those of each of the plurality of digitally processed texts in the given field.
  • Information about the highest match-score texts may be displayed as a two-dimensional matrix, one dimension representing terms contained in the target concept, invention, or event, and the other dimension representing the highest matching encoded texts.
  • the information may be displayed as a list of highest matching encoded texts, and for each such text, a list of matching terms identified for that text. The method is useful, for example, for comparing a concept, invention, or event in a selected technical field, or in a selected legal field.
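  • The two display forms just described (a term-by-text matrix and a per-text list of matching terms) can be sketched as follows. This is only an illustration under stated assumptions: the function names `match_matrix` and `match_list`, the "X" cell marker, and the text identifiers "T1" and "T2" are hypothetical and do not come from the patent.

```python
def match_matrix(terms, matches):
    """Render a term-by-text grid.

    `terms` is the list of target terms (one row each); `matches` maps a
    text identifier to the set of target terms found in that text.
    """
    text_ids = list(matches)
    width = max(len(term) for term in terms)
    lines = [" " * width + " | " + " | ".join(text_ids)]
    for term in terms:
        cells = [("X" if term in matches[tid] else " ").center(len(tid))
                 for tid in text_ids]
        lines.append(term.ljust(width) + " | " + " | ".join(cells))
    return "\n".join(lines)


def match_list(matches):
    """Return, for each text, the sorted list of matching terms."""
    return [(tid, sorted(found)) for tid, found in matches.items()]


# Hypothetical data: two top-ranked texts and the terms each one matched.
example = {"T1": {"psoriasis", "pulse duration"}, "T2": {"psoriasis"}}
print(match_matrix(["psoriasis", "pulse duration"], example))
print(match_list(example))
```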
  • the invention includes an automated system for comparing a target concept, invention, or event in a selected field with each of a plurality of natural-language texts in the same field.
  • the system includes a database which provides, or from which can be determined, the selectivity value for each of a plurality of terms composed of non-generic words and, optionally, word groups representing proximately arranged non-generic words derived from a plurality of digitally-encoded natural-language texts (i) in the selected field and (ii) in one or more unrelated fields.
  • The selectivity value of any term is determined from the frequency of occurrence of that term derived from the plurality of digitally-encoded natural-language texts in the selected field, relative to the frequency of occurrence of the same term derived from the plurality of digitally-encoded natural-language texts in one or more unrelated fields.
  • An electronic computer in the system is operable to access the database, for retrieving or determining said selectivity value for any of the terms supplied to the database from the computer.
  • Computer-readable code is operable, when read by the electronic computer, to perform the steps of (i) accessing the database, to retrieve or determine selectivity values for each of a plurality of terms composed of non-generic words and, optionally, word groups characterizing the target concept, invention, or event, (ii) determining for each of the plurality of natural-language texts in the same field, a match score related to the number of terms derived from that text that match those in the target concept, invention, or event, preferably weighted by selectivity values of the matching terms, and (iii) identifying from among the plurality of natural-language texts in the same field, one or more texts which have the highest match score or scores.
  • the electronic computer may be a central computer, where the code is operable to connect the central computer to each of a plurality of peripheral computers on which a user can input information about the target concept, invention, or event, and can receive information about said one or more texts which have the highest match score or scores.
  • The code may further be operable to construct the plurality of terms by (a) identifying non-generic words in the input text, and (b) constructing from the identified words a plurality of word groups, each containing two or more non-generic words that are proximately arranged in the input text.
  • This code portion may be executed on one or more of the peripheral computers.
  • the code may be further operable to display on a peripheral computer, information about the texts having the highest match scores as a two-dimensional matrix, one dimension representing terms contained in the target concept, invention, or event, and the other dimension representing the highest matching encoded texts.
  • the code may be operable to display information about the texts having the highest match scores as a list of highest matching encoded texts, and for each such text, a list of matching terms identified for that text.
  • the information may be displayed in the form of text abstracts, with the matched terms highlighted.
  • The above computer-readable code and code portion form yet another aspect of the invention.
  • the code portion may be operable to identify terms in an input text by identifying those words and word groups in the input text having a selectivity value above a given threshold value.
  • The code portion may be operable to generate word groups from pairs of adjacent descriptive words in the input text.
  • The code may be operable to associate a selectivity value with each of the terms by accessing a look-up table containing, for each of the descriptive words and, optionally, word groups in the plurality of digitally processed texts in the selected field, the selectivity value of the word and, optionally, word group.
  • the code may be operable to determine a match score by assigning to each of the terms in the target concept, invention, or event, a match value related to the corresponding selectivity value, and summing the match values of terms that match those of each of the plurality of digitally processed texts in the given field.
  • the invention includes a computer-accessible database having a plurality of terms composed of non-generic words and, optionally, word groups representing proximately arranged non-generic words derived from a plurality of digitally-encoded natural-language texts in a given field.
  • the selectivity value associated with terms below a given threshold value may be a constant or null value.
  • Also associated with each term may be identifiers of the natural-language texts in the given field that contain that term.
  • Where the selected field contributing to the database is a selected technical field, the one or more unrelated fields contributing to the database may be unrelated technical fields, and the plurality of texts contributing to the database may be patent abstracts or claims or technical-literature abstracts.
  • Where the selected field contributing to the database is a selected legal field, the one or more unrelated fields contributing to the database may be unrelated legal fields, and the plurality of texts contributing to the database may be legal-reporter case notes or head notes.
  • Fig. 1 illustrates components of a system for searching texts in accordance with the invention;
  • Fig. 2 shows, in overview flow-diagram form, the processing of text libraries to form a target attribute dictionary;
  • Fig. 3 shows, in overview flow-diagram form, the steps in deconstructing a natural-language input text to generate search terms;
  • Fig. 4 shows, in overview flow-diagram form, the steps in a text matching or searching operation performed by the system of the invention;
  • Fig. 5 is a flow diagram of Module A in the machine-readable code of the invention, for converting target- or reference text libraries to corresponding processed-text libraries, and as indicated in Fig. 2;
  • Fig. 6 is a flow diagram of Module B in the machine-readable code of the invention, for determining the selectivity values of terms in the processed target- text libraries, also as indicated in Fig. 2;
  • Fig. 7 is a flow diagram of Module C for generating a target attribute library from the processed target-text library and selectivity value database, also as indicated in Fig. 2;
  • Fig. 8 is a flow diagram of Module D for matching an input text against a plurality of same-field texts, in accordance with the invention, and as indicated in Fig. 4;
  • Fig. 9 is a flow diagram of the algorithm in Module D for accumulating match values;
  • Fig. 10 is a flow diagram for constructing groups of top-ranking texts having a high covering value;
  • Figs. 11-14 are histograms of search terms, showing the distribution of the terms among a selected number of top-ranked texts (light bars), and the distribution of the same terms among an equal number of cited US patent references (dark bars) for two different text inputs in the surgical field (Figs. 11 and 12), and in the diagnostics field (Figs. 13 and 14).
  • Natural-language text refers to text expressed in a syntactic form that is subject to natural-language rules, e.g., normal English-language rules of sentence construction. Examples include descriptive sentences, groups of descriptive sentences making up paragraphs, such as summaries and abstracts, and single-sentence texts, such as patent claims.
  • Sentence is a structurally independent grammatical unit in a natural-language written text, typically beginning with a capital letter and ending with a period.
  • Target concept, invention, or event refers to an idea, invention, or event that is the subject matter to be searched in accordance with the invention.
  • A target concept, invention, or event may be expressed as a list of descriptive words and/or word groups, such as word pairs, as phrases, or as natural-language text, e.g., composed of one or more sentences.
  • Target input text refers to a target concept, invention, or event that is expressed in natural-language text, typically containing at least one, usually two or more complete sentences. Text summaries, abstracts and patent claims are all examples of target input texts.
  • Digitally-encoded text refers to a natural-language text that is stored and accessible in digitized form, e.g., abstracts or patent claims or other text stored in a database of abstracts, full text or the like.
  • Abstract refers to a summary form, typically composed of multiple sentences, of an idea, concept, invention, discovery or the like. Examples include abstracts from patents and published patent applications, journal article abstracts, meeting presentation abstracts, such as poster-presentation abstracts, and case notes from case-law reports.
  • Full text refers to the full text of an article, patent, or case law report.
  • Field refers to a field of text matching, as defined, for example, by a specified technical field, patent classification, group of classes or sub-classification, or a legal field or speciality, such as "torts" or "negligence" or "property rights".
  • An "unrelated field" is a field that is unrelated to or different from the field of the text matching, e.g., unrelated patent classes, or unrelated technical specialities, or unrelated legal fields.
  • Generic words refers to words in a natural-language text that are not descriptive of, or only non-specifically descriptive of, the subject matter of the text. Examples include prepositions, conjunctions, pronouns, as well as certain nouns, verbs, adverbs, and adjectives that occur frequently in texts from many different fields.
  • The inclusion of a word in a dictionary of generic words, e.g., in a look-up table of generic words, is somewhat arbitrary, and can vary with the type of text analysis being performed and the field of search, as will be appreciated below.
  • Generic words have a selectivity value (see below) less than 1 to 1.25.
  • Non-generic words are those words in a text remaining after generic words are removed. The following text, where generic words are enclosed by brackets, and non-generic words left unbracketed, will illustrate:
  • [A method and apparatus for] treating psoriasis [includes a] source [of] incoherent electromagnetic energy. [The] energy [is directed to a region of] tissue [to be] treated. [The] pulse duration [and the] number [of] pulses [may be] selected [to] control treatment parameters [such as the] heating [of] healthy tissue [and the] penetration depth [of the] energy [to] optimize [the] treatment. [Also, the] radiation [may be] filtered [to] control [the] radiation spectrum [and] penetration depth.
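  • The removal of generic words illustrated by the bracketed text can be sketched as a simple filter. This is a minimal sketch, not the patent's code: the name `non_generic_words` is hypothetical, and the `GENERIC_WORDS` set is an assumption that covers only the first sentence of the example, whereas a real look-up table of generic words would be much larger and field-dependent.

```python
import re

# Assumed generic-word list, just large enough for the first example sentence.
GENERIC_WORDS = {"a", "method", "and", "apparatus", "for", "includes", "of"}


def non_generic_words(sentence, generic=GENERIC_WORDS):
    """Return the words of `sentence`, in order, that are not generic."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [word for word in words if word not in generic]


sentence = ("A method and apparatus for treating psoriasis includes a "
            "source of incoherent electromagnetic energy.")
print(" ".join(non_generic_words(sentence)))
```

    The words that survive the filter, joined in order, form the string of non-generic words for that sentence.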
  • A "word string" is a sequence of non-generic words.
  • a word group is a group, typically a pair, of non-generic words that are proximately arranged in a natural-language text.
  • Typically, words in a word group are non-generic words in the same sentence. More typically they are nearest or next-nearest non-generic word neighbors in a string of non-generic words.
  • The above word string "treating psoriasis source incoherent electromagnetic energy" would generate the word pairs "treating psoriasis," "treating source," "psoriasis source," "psoriasis incoherent," "source incoherent," "source electromagnetic," and so forth, if all combinations of nearest neighbors and next-nearest neighbors are considered.
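  • The pairing of nearest and next-nearest neighbors can be sketched with a small sliding window over the word string. This is an illustrative sketch only: the function name `word_pairs` and the `max_gap` parameter are assumptions, with `max_gap=2` standing for "nearest and next-nearest neighbors".

```python
def word_pairs(word_string, max_gap=2):
    """Pair each word with the following words up to `max_gap` positions away."""
    words = word_string.split()
    pairs = []
    for i, first in enumerate(words):
        last = min(i + max_gap, len(words) - 1)
        for j in range(i + 1, last + 1):
            pairs.append(f"{first} {words[j]}")
    return pairs


for pair in word_pairs("treating psoriasis source incoherent electromagnetic energy"):
    print(pair)
```

    For the six-word example string this produces nine pairs, beginning with the six listed in the text.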
  • Non-generic words and word groups generated therefrom are also referred to herein as "terms".
  • Digitally processed text refers to a digitally-encoded, natural-language text that has been processed to generate non-generic words and word groups.
  • Database of digitally encoded texts refers to a large number, typically at least 100, and up to 1,000,000 or more, of such texts. The texts in the database have been preselected or are flagged or otherwise identified to relate to a specific field, e.g., the field of the desired search or an unrelated field.
  • "Dictionary" of terms refers to a collection of words and, optionally, word pairs, each associated with identifying information, such as the texts containing that word, or from which that word pair was derived, in the case of a "processed target-term" or "processed reference-term" dictionary, and additionally, selectivity value information for that word or word pair in the case of a "target attribute dictionary."
  • the words and word pairs in a dictionary may be arranged in some easily searchable form, such as alphabetically.
  • the "selectivity value" of a term is related to the frequency of occurrence of that term in a database of digitally processed texts in the selected field, relative to the frequency of occurrence of the same term in a database of digitally processed texts in one or more unrelated fields.
  • the selectivity value of a given term may be calculated, for example, as the ratio of the percentage texts in a given field that contain that term, to the percentage texts in an unrelated field that contain the same term.
  • A selectivity value so measured may be as low as 0.1 or less, or as high as 1,000 or greater.
  • a “descriptive term” or “descriptive search term” in a text is a term that has a selectivity value above a given threshold, e.g., 1.25 for a given non-generic word, and 1.5 for a given word pair.
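  • Selecting descriptive terms with the two example thresholds (1.25 for a single word, 1.5 for a word pair) can be sketched as below. The function name `descriptive_terms`, the convention that a term containing a space is a word pair, and the sample selectivity values are all assumptions made for illustration.

```python
WORD_THRESHOLD = 1.25  # example threshold for a single non-generic word
PAIR_THRESHOLD = 1.5   # example threshold for a word pair


def descriptive_terms(selectivity, word_threshold=WORD_THRESHOLD,
                      pair_threshold=PAIR_THRESHOLD):
    """Keep terms whose selectivity value is above the applicable threshold."""
    selected = []
    for term, value in selectivity.items():
        # Assumption: a word pair is stored as two words separated by a space.
        threshold = pair_threshold if " " in term else word_threshold
        if value > threshold:
            selected.append(term)
    return selected


# Hypothetical selectivity values, for illustration only.
sample = {"energy": 1.1, "psoriasis": 40.0,
          "treating psoriasis": 60.0, "source energy": 1.4}
print(descriptive_terms(sample))
```

    Note that "source energy" at 1.4 would pass the single-word threshold but not the word-pair threshold, so it is dropped.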
  • a "match value" of a term is a value corresponding to some mathematical function of the selectivity value of that term, such as a fractional exponential function.
  • The match value of a given term having a selectivity value of X might be X^(1/2) or X^(1/3).
  • A "verbized" word refers to a word that has a verb root.
  • the verbized word has been converted to one form of the verb root, e.g., a truncated, present tense, active voice form of the verb.
  • the verb root "light” could be the verbized form of light (the noun), light (the verb), lighted, lighting, lit, lights, has been lighted, etc.
  • Verb form refers to the form of a verb, including present and past tense, singular and plural, present and past participle, gerund, and infinitive forms of a verb.
  • Verb phrase refers to a combination of a verb with one or more auxiliary verbs including (i) to, for, (ii) shall, will, would, should, could, can, may, might, and must, (iii) have, has, and had, and (iv) is, are, was, and were.
  • Fig. 1 shows the basic elements of a text-matching system 20 in accordance with the present invention.
  • A central computer or processor 22 receives user input and user-processed information from a user computer 24.
  • the user computer has a user-input-device, such as a keyboard or disc reader 28 by which the user can enter input text or text words describing an idea, concept, or event to be searched, and a display or monitor 26, for displaying search information to the user.
  • a target attribute dictionary 30 in the system is accessible by the central computer in carrying out several of the operations of the system, as will be described.
  • the system includes a separate target attribute dictionary for each different field of search.
  • the user computer is one of several remote access stations, each of which is operably connected to the central computer, e.g., as part of an internet system in which users communicate with the central computer through an internet connection.
  • the system may be an intranet system in which one or multiple peripheral computers are linked internally to a central processor.
  • In other configurations, the user computer serves as the central computer.
  • Where the system includes separate user computer(s) communicating with a central computer, certain operations relating to text processing are or can be carried out on a user computer, and operations related to text searching and matching are or can be carried out on the central computer, through its interaction with one or more target attribute dictionaries.
  • This allows the user to input a target text, have the text deconstructed in a format suitable for text searching at the user terminal, and have the search itself conducted by the central computer.
  • The central computer is, in this scheme, never exposed to the actual target text. Once a text search is completed, the results are reported to the user at the user-computer display. Generating a target-attribute dictionary. Fig.
  • a target-text database is a database of digitally encoded texts, e.g., abstracts, summaries, and/or patent claims, along with pertinent identifying information, e.g., (i) pertinent patent information such as patent number, patent-office classification, inventor names, and patent filing and issues dates, (ii) pertinent journal-reference information, such as source, dates, and author, or (iii) pertinent law-reporter information, such as reporter name, dates, and appellate court.
  • Such databases are available commercially from a variety of sources, such as the US Patent and Trademark Office, the European Patent Office (EPO), Dialog Search Service, legal reporter services, and other database sources whose database offerings are readily identifiable from their internet sites.
  • the databases are typically supplied in tape or CD ROM form.
  • The text databases are U.S. Patent Bibliographic databases which contain, for all patents issued between 1976 and 2000, various patent identifier information and corresponding patent abstracts. This database is available in tape form from the USPTO.
  • The target-text database in the figure refers to a collection of digitally-encoded texts and associated identifiers in a given field of interest, e.g., a given technical or scientific or legal field.
  • the target-text database contains the text sources and information that will be searched and identified in the search method of the invention.
  • the reference-text database in the figure refers to a collection of digitally encoded texts and associated information that are outside of the field of interest, i.e., unrelated to the field of the search, and serve as a "control" for determining the field-specificity of search terms in the target input, as will be described.
  • "Reference-text library" may refer to texts and associated information in one or more different fields that are unrelated to the field of the search.
  • the target-text database may include all post-1976 U.S. patent abstracts and identifying information in the field of surgery
  • the reference-text database may include all post-1976 U.S. patent abstracts and identifying information in unrelated fields of machines, textiles, and electrical devices.
  • The target-text database may include all Medline (medical journal literature) abstracts and identifying information in the field of cancer chemotherapy, and the reference-text database, all Medline abstracts and identifying information in the fields of organ transplantation, plant vectors, and prosthetic devices.
  • the target-text database may include all head notes and case summaries and identifying information in federal legal reporters in the field of trademarks, and the reference-text database, all head notes and case summaries and identifying information in the same reporters in the fields of criminal law, property law, and water law.
  • The target-text database and reference-text database are processed by a Module A, as described below with reference to the flow diagram in Fig. 5.
  • Module AC in Module A operates to identify verb-root words, remove generic words, identify remaining words, and parse the text remaining after removal of generic words into word strings, typically 2-6 words long.
  • The module uses a moving window algorithm to generate proximately arranged word pairs in each of the word strings. Thus each text is deconstructed into a list of non-generic words and word groups, e.g., word pairs.
  • The result is a target-term dictionary 36 or reference-term dictionary 38 containing all of the words and word pairs (terms) derivable from the texts in the associated text database, and for each term, text identifiers which identify, e.g., by patent number or other bibliographic information, each of the texts in the associated database that contains that term.
  • the words and word pairs may be arranged in a conveniently searchable form, such as alphabetically within each class (words and word pairs) of terms.
  • an entry in a processed target-term or reference-term dictionary generated from a patent-abstract database would include a non-generic word or word-pair term, and a list of all patent or publication numbers that contain that term.
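  • A term dictionary of this kind is essentially an inverted index from each term to the identifiers of the texts containing it. The following is a minimal sketch under stated assumptions: the function name `build_term_dictionary` and the identifiers "US-1" and "US-2" are hypothetical, and the input is assumed to be texts already deconstructed into terms.

```python
from collections import defaultdict


def build_term_dictionary(processed_texts):
    """Map each term to the sorted identifiers of the texts that contain it.

    `processed_texts` maps a text identifier (e.g., a patent number) to the
    terms (words and word pairs) generated from that text.
    """
    index = defaultdict(set)
    for text_id, terms in processed_texts.items():
        for term in terms:
            index[term].add(text_id)
    # Alphabetical ordering gives one conveniently searchable arrangement.
    return {term: sorted(ids) for term, ids in sorted(index.items())}


# Hypothetical processed texts, for illustration only.
processed = {"US-2": ["energy", "treating psoriasis"], "US-1": ["energy"]}
print(build_term_dictionary(processed))
```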
  • the target-term and reference-term dictionaries just described, and identified at 36, 38, respectively in Fig. 2, are used to generate, for each term in the target-term dictionary, a target-term selectivity value that indicates the relative occurrence of that term in the processed target- and reference-term dictionaries.
  • the determination of a selectivity value for each term in the target-term dictionary is carried out by Module B, described below with reference to Fig. 6.
  • the selectivity value is determined simply as the normalized ratio of the number of texts in the target-term dictionary containing that term to the number of texts in the reference-term dictionary that contain the same term.
  • Suppose, for example, that the term "electromagnetic" is associated with 1,500 different text identifiers in the target-term dictionary (meaning that this term occurs at least once in 1,500 different abstracts in the target-text database), and with 500 different text identifiers in the reference-term dictionary.
  • The ratio of occurrence of the word in the two dictionaries is thus 3:1, or 3.
  • This number is then multiplied by the ratio of the total number of texts in the reference database to the total number of texts in the target database.
  • The selectivity ratio of 3 is multiplied by 100/50, or 2, to yield a normalized selectivity value for the term "electromagnetic" of 6.
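  • The arithmetic of this example can be written out as a short function. It is a sketch only: the function name `selectivity_value` is hypothetical, the totals 100 and 50 are used exactly as they appear in the normalization factor above (their units are not stated here), and returning `None` when the term is absent from the reference texts is an assumption, since that case is not addressed in this passage.

```python
def selectivity_value(target_hits, reference_hits, target_total, reference_total):
    """Normalized ratio of term occurrence in target texts versus reference texts.

    `target_hits` and `reference_hits` are the numbers of texts containing the
    term; `target_total` and `reference_total` are the database sizes.
    """
    if reference_hits == 0:
        return None  # assumption: undefined when the term never occurs in the reference texts
    raw_ratio = target_hits / reference_hits        # 1,500 / 500 = 3
    normalization = reference_total / target_total  # 100 / 50 = 2
    return raw_ratio * normalization


print(selectivity_value(1500, 500, target_total=50, reference_total=100))
```

    With equal numbers of texts drawn from each database the normalization factor is 1, and the value reduces to the raw ratio.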
  • The processed target-term and reference-term dictionaries are generated by considering only some arbitrary number, e.g., 50,000, of the texts in the respective target-text and reference-text databases, so that the normalization factor is always 1. It will be appreciated that by selecting a sufficiently large number of texts from each database, a statistically meaningful rate of occurrence of any term from the database texts is obtained.
  • The selectivity values just described, and stored at 40, are assigned to each of the associated terms in the target-term dictionary.
  • The selectivity value of "6" is assigned to the term "electromagnetic" in the target-term dictionary.
  • the target-attribute dictionary now contains, for each dictionary term, a list of text identifiers for all texts in the target-text database that contain that term and the selectivity value of the term calculated with respect to a reference text database of texts in an unrelated field. This target-attribute dictionary forms one aspect of the invention.
  • The invention provides a dictionary of terms (words and/or word groups, e.g., word pairs) contained in or generated from the texts in a database of natural-language texts in a given field.
  • Each entry in the dictionary is associated with a selectivity value which is related to the ratio of the number of texts in the given field that contain that term or from which that term can be generated (in the case of word groups) to the number of texts in a database of texts in one or more unrelated fields that contain the same term, normalized to the total number of texts in the two databases.
  • the selectivity value associated with terms below a given threshold value may be a constant or null value.
  • The dictionary may additionally include, for each term, text identifiers that identify all of the texts in the target text database that contain that term or from which the term can be derived.
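  • One possible in-memory shape for such a dictionary entry is sketched below. The class name `AttributeEntry`, the function `build_attribute_dictionary`, the use of `None` as the null value, and the default threshold of 1.25 (taken from the example word threshold given earlier) are assumptions made for illustration, not details confirmed by the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AttributeEntry:
    """One dictionary entry: a selectivity value plus the containing texts."""
    selectivity: Optional[float]  # None stands for the "null value" option
    text_ids: List[str] = field(default_factory=list)


def build_attribute_dictionary(term_dictionary, selectivity, threshold=1.25):
    """Combine a term -> text IDs map with a term -> selectivity value map."""
    attributes = {}
    for term, text_ids in term_dictionary.items():
        value = selectivity.get(term)
        if value is None or value < threshold:
            value = None  # below-threshold terms carry a null value
        attributes[term] = AttributeEntry(value, list(text_ids))
    return attributes


# Hypothetical inputs, for illustration only.
terms = {"electromagnetic": ["US-1", "US-2"], "energy": ["US-1"]}
values = {"electromagnetic": 6.0, "energy": 1.1}
print(build_attribute_dictionary(terms, values))
```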
  • The concept, invention, or event to be searched may be expressed as a group of words and, optionally, word pairs that are entered by the user at the user terminal. Such input, as will be seen, is already partly processed for searching in accordance with the invention.
  • the user inputs a natural-language text 42 that describes the concept, invention, or event as a summary, abstract or precis, typically containing multiple sentences, or as a patent claim, where the text may be in a single-sentence format.
  • An exemplary input would be a text corresponding to the abstract or independent claim in a patent or patent application or the abstract in a journal article, or a description of events or conditions, as may appear in case notes in a legal reporter.
  • The input text, which is preferably entered by the user on user computer 24, is then processed on the user computer (or alternatively, on the central computer) to generate non-generic words contained in the text and word groups constructed from proximately arranged non-generic words in the text, as indicated at 44.
  • The deconstruction of text into non-generic words and word groups, e.g., word pairs, is carried out by Module C in the applicable computer, including Module AC which is also used in processing database texts, as noted above. The operation of Module C is described more fully below with reference to Fig. 7. With continuing reference to Fig. 3, non-generic words and word pairs
  • Module C in the system, detailed below with respect to Fig. 7. As indicated in Fig. 3, Module C may be executed partly on a user computer (processing of input text) and partly on a central computer (identifying descriptive terms).
  • the descriptive terms are then displayed to the user at the user display.
  • the user may have several options at this point.
  • the user may accept the terms as pertinent and appropriate for the search to be conducted, without further editing; the user may add synonyms to one or more of the words (including words in the word pairs) to expand the range of the search; the user may add or delete certain terms; and/or specify a lower or higher selectivity-value threshold for the terms, and ask the central computer to generate a new list of descriptive terms, based on the new threshold values.
  • the invention thus provides, in another aspect, computer-readable code which is operable, when read by an electronic computer, to generate descriptive terms (words and/or multi-word groups) from a digitally encoded, natural-language input text that describes a concept, invention, or event in a selected field.
  • The code operates to (a) identify non-generic words in the input text, (b) construct from the identified non-generic words a plurality of word groups, each containing two or more words that are proximately arranged in the input text (where word groups are to be generated), (c) select terms from (b) as descriptive terms if, for each group considered, the group has a selectivity value above a selected threshold value, and (d) store or display the terms selected in (c).
  • the invention provides computer readable code, and a system and method for identifying descriptive words and/or word groups in a natural-language text.
  • the method includes identifying from the text, those terms, including non-generic words contained in the text and/or word groups generated from the non-generic words, that have a selectivity value greater than a given or specified threshold.
  • This section will provide an overview of the text-matching or text-searching operation in the invention.
  • the purpose of the operation is to identify those texts originally present in a target-text database that most closely match the target input text (or terms) in content, that is, the most pertinent references in terms of content and meaning.
  • The search is based on two strategies which have been found to be useful, in accordance with the invention, in extracting content from natural-language texts: (i) by using selectivity values to identify those terms, i.e., words and, optionally, word groups, having above-threshold selectivity values, and (ii) by considering all of the high-selectivity-value terms collectively, the search compares target and search texts in a content-rich manner.
  • the final match score will also reflect the relative "content" value of the different search terms, as measured by some function related to the selectivity value of that term.
  • This function is referred to as a match value.
  • For example, if a term has a selectivity value of 8, and the function is a cube root function (SV^(1/3)), the match value will be 2.
  • A cube root function would compress the match values of terms having selectivity values between 1 and 1,000 to a range of 1 to 10; a square root function would compress the same range to between 1 and about 32.
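  • The compression described above can be illustrated with a minimal sketch. The function name `match_value` and its `root` parameter are assumptions made for this illustration; the text only requires "some function related to the selectivity value", with cube root and square root given as examples.

```python
def match_value(selectivity_value: float, root: int = 3) -> float:
    """Compress a selectivity value by taking its n-th root (cube root by default).

    Hypothetical helper: non-positive selectivity values are mapped to 0.0,
    which is an assumption of this sketch, not something the text specifies.
    """
    if selectivity_value <= 0:
        return 0.0
    return selectivity_value ** (1.0 / root)


if __name__ == "__main__":
    # Cube root of the example selectivity value of 8, and of both range ends.
    for sv in (1, 8, 1000):
        print(sv, round(match_value(sv), 3), round(match_value(sv, root=2), 3))
```

A stronger root (cube rather than square) flattens the influence of very selective terms, so that no single term dominates the total match score.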
  • Fig. 4 shows the overall flow of the components and operations in the text-matching method.
  • the initial input in the method is the descriptive search terms generated from the input text as above.
  • For each term (word and, optionally, word group), the code in the central computer (Module D, described below with reference to Figs. 8 and 9) operates to look up that term in the target-attribute dictionary. If the selectivity value of the term is at or above a given threshold, the code operates to record the text identifiers of all of the texts that contain the term (word) or from which that term (word group) is generated. The text identifiers are placed in an accumulating or update list of text IDs, each associated with one or more terms, and therefore with one or more match values for those terms.
  • the steps are repeated until each term has been so processed.
  • The text IDs and match value associated with that term are added to the update list of IDs, either as new IDs or as additional match values added to existing IDs, as described below with reference to Fig. 9.
  • the update list includes each text ID having at least one of the search terms, and the total match score for that text.
  • The program then applies a standard ranking algorithm to rank the text entries in the updated list in buffer 50, yielding some selected number, e.g., 25, 50, or 100, of the top-ranked matching texts, stored at 52.
  • the system may also find, from among the texts with the top match scores, e.g., top 100 texts, a group of texts, e.g., group of 2 or 3 texts, having the highest collective number of hits.
  • This group will be said to be a spanning or covering group, in that the texts in the group maximally span or cover the terms from the target input.
  • Information relating to the top-ranked texts, and/or covering groups may be displayed to the user, e.g., in the form discussed below with respect to Fig. 14. This display is indicated at 26 in Fig. 4.
  • The information displayed at 26 may include information about text scores, and/or matching terms, and the text itself, but not include specific identifying information, such as patent numbers or bibliographic citations.
  • the user selects, as at 54, those texts which are of greatest interest, based, for example, on the match score, the matching terms, and/or the displayed text of a given reference.
  • This input is fed to the central computer, which then retrieves the identifying information for the texts selected by the user, as at 56, and supplies this to the user at display 26.
  • This allows for a variety of user-fee arrangements, e.g., one in which the user pays a relatively low rate to examine the quality of a search, but a higher rate to retrieve specific texts of interest.
  • the first is used to deconstruct each text in the target-text or the reference-text database into a list of words and word groups, e.g., word pairs, that are contained in or derivable from that text.
  • the second is the deconstruction of a target text into meaningful search terms, e.g., words and word pairs.
  • Both text-processing operations involve a Module AC which processes a text into terms, that is, non-generic words and, optionally, word groups formed from proximately arranged non-generic words. Module AC is described in this section with reference to Fig. 5, and illustrated with respect to a model patent claim.
  • The first step in the text-processing module of the program is to "read" the text for punctuation and other syntactic clues that can be used to parse the text into smaller units, e.g., single sentences, phrases, and/or subphrases, and more generally, word strings.
  • These steps are represented by parsing function 60 in Module AC.
  • The design of and steps for the parsing function will be appreciated from the following description of its operation.
  • the parsing function will first look for sentence periods.
  • A sentence period should be followed by at least one space, followed by a word that begins with a capital letter, indicating the beginning of the next sentence, or should end the text, if it closes the final sentence in the text.
  • Periods used in abbreviations can be distinguished either from an internal dictionary of common abbreviations and/or by a lack of a capital letter in the word following the abbreviation.
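  • The period rules above can be sketched as follows. This is a minimal illustration, not the parsing function of Module AC itself: the function name `split_sentences` and the small `ABBREVIATIONS` set are hypothetical stand-ins for the "internal dictionary of common abbreviations" mentioned in the text.

```python
import re

# Hypothetical, deliberately small abbreviation dictionary (an assumption of this sketch).
ABBREVIATIONS = {"e.g.", "i.e.", "fig.", "figs.", "no.", "etc."}


def split_sentences(text: str) -> list[str]:
    """Split text at periods that are followed by whitespace and a capitalized
    word, or that end the text, skipping periods that close a known abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"\.(\s+|$)", text):
        end = m.start() + 1                      # index just past the period
        words = text[start:end].split()
        last_word = words[-1].lower() if words else ""
        rest = text[m.end():]                    # text after the period and spaces
        if last_word in ABBREVIATIONS:
            continue                             # abbreviation found in the dictionary
        if rest == "" or rest[:1].isupper():     # end of text, or capitalized next word
            sentences.append(text[start:end].strip())
            start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)                   # keep any unterminated remainder
    return sentences


if __name__ == "__main__":
    sample = ("The device is shown in Fig. 3 of the drawings. "
              "It stores signals, e.g. electrograms. The end.")
    for sentence in split_sentences(sample):
        print(sentence)
```

A period followed by a lower-case word is never treated as a sentence boundary here, which covers abbreviations missing from the dictionary at the cost of missing sentences that start with a digit or a quotation mark.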
  • the preamble of the claim can be separated from the claim elements by a transition word "comprising" or “consisting” or variants thereof.
  • Individual elements may be distinguished by semi-colons and/or new paragraph markers, and/or element numbers or letters, e.g., 1, 2, 3, or i, ii, iii, or a, b, c.
  • the parsing algorithm need not be too strict, or particularly complicated, since the purpose is simply to parse a long string of words (the original text) into a series of shorter ones that encompass logical word groups.
  • The parsing algorithm may also use word clues. For example, by parsing at prepositions other than "of," or at transition words, useful word strings can be generated.
  • The program carries out word classification functions, indicated at 62, which act to classify the words in the text into one of three basic groups: (i) a group of verbs and verb-root words, (ii) generic words, and (iii) remaining groups, i.e., words other than those in groups (i) or (ii), which tend to be nouns and adjectives.
  • The verb word group includes all words that act as verbs or have a verb root, including words that may be (and often are) nouns or adjectives, such as "light," "monitor," "discussion," "interference," and so on.
  • The purpose of verbizing all verb-root words is twofold. First, the resulting search term containing the verb root will be more robust, since it will encompass a variety of ways that the verb-root word may be used in natural-language text, i.e., either as a verb, adverb, noun, or adjective. Second, to the extent meaningful synonyms can be attached to verbs, the program will be able to automatically generate word synonyms.
  • Once verb and verb-root words have been identified, it may be useful, although not necessary, to identify actual verbs.
  • The presence of compound verbs containing "to be" or "to have" forms indicates that the associated verb word is a true verb.
  • the purpose of identifying true verbs is to provide another "marker" for sentence parsing.
  • The group of generic words includes all articles, prepositions, conjunctions, and pronouns, as well as many noun or verb words that are so generic within the texts of a database as to have little or no meaning in terms of describing a particular invention, idea, or event.
  • the words “device,” “method,” “apparatus,” “member,” “system,” “means,” “identify,” “correspond,” or “produce” would be considered generic.
  • such words could also be retained at this stage, and eliminated at a later stage, on the basis of a low selectivity factor value.
  • the program includes a dictionary of generic words that are removed from the text at this initial stage in text processing.
  • Words remaining after tagging all verb words and generic words are object or noun words that typically function as nouns or adjectives in the text.
  • verb-root words indicated by italics
  • true verbs by bold italics
  • generic words by normal font
  • remainder "noun” words by bold type.
  • a device for monitoring heart rhythms comprising: means for storing digitized electrogram segments including signals indicative of depolarizations of a chamber or chamber of a patient's heart; means for transforming the digitized signals into signal wavelet coefficients; means for identifying higher amplitude ones of the signal wavelet coefficients; and means for generating a match metric corresponding to the higher amplitude ones of the signal wavelet coefficients and a corresponding set of template wavelet coefficients derived from signals indicative of a heart depolarization of known type, and identifying the heart rhythms in response to the match metric.
  • The text may be further parsed at all prepositions other than "of."
  • When this is done, and generic words are removed, the program generates the following strings of non-generic verb and noun words: monitoring heart rhythms // storing digitized electrogram segments // signals depolarizations chamber patient's heart // transforming digitized signals // signal wavelet coefficients // amplitude signal wavelet coefficients // match metric // amplitude signal wavelet coefficients // template wavelet coefficients // signals heart depolarization // heart rhythms // match metric.
  • The operation for generating word strings of non-generic words is indicated at 64 in Fig.
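  • A rough sketch of word-string generation, under stated assumptions, is given here. The names `word_strings`, `BREAK_WORDS`, and `GENERIC_WORDS` are hypothetical, and the two word lists are tiny stand-ins for the program's dictionaries; verb-root handling is omitted entirely. The sketch breaks the text at punctuation and at prepositions other than "of" or transition words, then drops generic words.

```python
import re

# Hypothetical word lists for illustration only; a real system would use full dictionaries.
BREAK_WORDS = {"for", "into", "to", "in", "from", "including", "comprising"}
GENERIC_WORDS = {"a", "an", "the", "of", "or", "and", "device", "means",
                 "ones", "higher", "indicative"}


def word_strings(text: str) -> list[list[str]]:
    """Parse text into strings (lists) of non-generic words."""
    strings = []
    for chunk in re.split(r"[;:,.]", text.lower()):       # punctuation clues
        current = []
        for word in re.findall(r"[a-z][a-z'-]*", chunk):
            if word in BREAK_WORDS:                       # word clues: start a new string
                if current:
                    strings.append(current)
                current = []
            elif word not in GENERIC_WORDS:               # drop generic words
                current.append(word)
        if current:
            strings.append(current)
    return strings


if __name__ == "__main__":
    claim = ("A device for monitoring heart rhythms, comprising: means for storing "
             "digitized electrogram segments; means for transforming the digitized "
             "signals into signal wavelet coefficients.")
    for string in word_strings(claim):
        print(" ".join(string))
```

Treating "of" as a generic word rather than a break point keeps phrases such as "depolarizations of a chamber" inside a single word string, which is what allows word pairs to be formed across it.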
  • The desired word groups, e.g., word pairs, are now generated from the word strings. This may be done, for example, by constructing every permutation of two words contained in each string.
  • One suitable approach that limits the total number of pairs generated is a moving window algorithm, applied separately to each word string, and indicated at 66 in the figure.
  • The overall rules governing the algorithm, for a moving "three-word" window, are as follows:
  • If the string contains more than three words, treat the first three words as a three-word string to generate three two-word pairs; then move the window to the right by one word, and treat the three words now in the window (words 2-4 in the string) as the next three-word string, generating two additional word pairs (the word pair formed by the second and third words in the preceding group will be the same as the first two words in the present group);
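  • The moving-window rule can be sketched as below. The function name `window_pairs` is hypothetical, and the handling of strings with fewer than four words (one word gives no pair, two words give one pair, three words give three pairs) is an assumption of this sketch, since only the rule for longer strings is stated above.

```python
def window_pairs(words: list[str], window: int = 3) -> list[tuple[str, str]]:
    """Generate ordered word pairs from words that fall within the same moving window.

    Each word is paired with the (window - 1) words preceding it, so the first
    full window contributes three pairs and every later window position
    contributes two new ones, without repeating the overlapping pair.
    """
    pairs = []
    for j in range(1, len(words)):
        for i in range(max(0, j - window + 1), j):
            pairs.append((words[i], words[j]))
    return pairs


if __name__ == "__main__":
    string = ["storing", "digitized", "electrogram", "segments"]
    for pair in window_pairs(string):
        print(pair)
```

For a string of n words (n of at least 3) this yields 2n - 3 pairs, compared with n(n - 1)/2 when every two-word combination is constructed, which is how the window limits the total number of pairs.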
  • Each database contains a large number of natural-language texts, such as abstracts, summaries, full text, claims, or head notes, along with reference identifying information for each text, e.g., patent number, literature reference, or legal reporter volume.
  • The two databases are US patent bibliographic databases containing information about all US patents, including a patent abstract, issued between 1976 and 2000 in a particular field, in this case, several classes covering surgical devices and procedures. That is, each entry in the database includes bibliographic information for one patent and the associated abstract.
  • The program represented by Module A initially retrieves a first entry, as at 68, where the entry number is indicated at t.
  • the abstract from this entry is then processed by Module AC according to the steps detailed above to generate for that abstract, a list of non-generic words and word pairs constructed from proximately arranged non-generic words in the parsed word strings of the text.
  • Each term, i.e., word and word pair is then added to an updated dictionary 70, typically in alphabetical order, and with the words and word-groups either mixed together or treated as separate sections of the dictionary.
  • The program operates to add the text ID, i.e., patent number, of the text containing that word or from which the word pair is derived. This operation is indicated at 72. If the words being entered are from the first text, the program merely enters each new term and adds the associated text ID. For all subsequent texts, t>1, the program carries out the following operations.
  • update dictionary 74 becomes dictionary 32 (or 34), and includes (i) all of the non-generic words and word groups contained in and derived from the texts in the corresponding database, and (ii) for each term, all of the text Ids containing that term.
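  • The dictionary built by Module A can be pictured as a mapping from each term to the set of text IDs containing it. The following is a minimal sketch under that assumption; the function name `build_term_dictionary`, the representation of word pairs as space-joined strings, and the placeholder IDs "T1" and "T2" are all hypothetical.

```python
from collections import defaultdict


def build_term_dictionary(texts: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each term (word or word pair) to the set of text IDs it was found in.

    `texts` maps a text ID to the terms already extracted from that text.
    """
    dictionary = defaultdict(set)
    for text_id, terms in texts.items():       # one pass per text entry t
        for term in terms:
            dictionary[term].add(text_id)      # new term, or new ID for a known term
    return dict(sorted(dictionary.items()))    # alphabetical order, as in the text


if __name__ == "__main__":
    extracted = {
        "T1": ["heart", "rhythms", "heart rhythms"],
        "T2": ["heart", "wavelet"],
    }
    for term, ids in build_term_dictionary(extracted).items():
        print(term, sorted(ids))
```

Storing IDs in a set means a term that occurs several times in one text is still counted once for that text, which matches the later use of "number of different texts" when occurrences are computed.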
  • Module B shown in Fig. 6 shows how the two dictionaries whose construction was just described are used to generate selectivity values for each of the terms in the target-term dictionary.
  • "w" indicates a term, either a word or word pair in the corresponding dictionary.
  • the program calls up this term in both dictionaries, as at 76.
  • The program determines at 78 the occurrence Ot of that term in the target-term dictionary as the total number of Ids (different texts) associated with that term, divided by the total number of texts processed from the target database to produce the dictionary. For example, if the word "heart" has 500 associated text Ids, and the total number of texts used in generating the target-term dictionary is 50,000, the occurrence of the term is 500/50,000, or 1/100.
  • The program similarly calculates at 80 the occurrence Or of the same term in the reference-term dictionary.
  • If the word "heart" appears in 200 texts out of a total of 100,000 used to construct the reference-term dictionary, its occurrence Or is 200/100,000 or 1/500.
  • The selectivity value of the term is then calculated at 82 as Ot/Or, or, in the example given, 1/100 divided by 1/500, or 5.
  • The Ot and Or values may be equivalently calculated using total text numbers, then normalizing the ratio by the total number of target and reference texts used in constructing the respective dictionaries.
  • the selectivity value thus calculated is then added to that term in the target-term dictionary, as at 88.
  • The selectivity value may be assigned an arbitrary or zero value in either of the two following cases: 1. the occurrence of the term in either the target- or reference-term dictionary is below a selected threshold, e.g., 1, 2 or 3;
  • 2. the calculated selectivity value is below a given threshold, e.g., 0.5 or 0.75.
  • the selectivity value may be assigned a large number such as 100. If Ot is zero or very small, the term may be assigned a zero or very small value. In addition, a term with a low selectivity value, e.g., below 0.5, may be dropped from the dictionary, since it may have no real search value.
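  • The selectivity calculation of Module B can be sketched as follows. The function name and parameter names are hypothetical. Two points are assumptions of this sketch rather than statements of the text: a term that never occurs in the reference dictionary is given the large value (100), and a term whose target count or calculated value falls below its threshold is given zero.

```python
def selectivity_value(n_target: int, total_target: int,
                      n_reference: int, total_reference: int,
                      min_count: int = 1, min_sv: float = 0.5,
                      large_sv: float = 100.0) -> float:
    """Ratio of a term's occurrence in target texts to its occurrence in reference texts."""
    if n_target < min_count:
        return 0.0                              # too rare in the target field (assumption)
    if n_reference == 0:
        return large_sv                         # absent from reference texts (assumption)
    occurrence_t = n_target / total_target          # Ot
    occurrence_r = n_reference / total_reference    # Or
    sv = occurrence_t / occurrence_r                # Ot/Or
    return sv if sv >= min_sv else 0.0


if __name__ == "__main__":
    # The "heart" example: 500 of 50,000 target texts, 200 of 100,000 reference texts.
    print(round(selectivity_value(500, 50_000, 200, 100_000), 6))
```

Because the two databases differ in size, dividing each count by its own total before taking the ratio is what makes values comparable across terms.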
  • Entries in the dictionary might have the following form, where a term is followed by its selectivity value, and a list of all associated text IDs. heart (622.6906);
  • Each text identifier may include additional identifying information, such as the page and lines numbers in the text containing that term, or passages from the text containing that term, so that the program can output, in addition to text identification for highest matching texts, pertinent portions of that text containing the descriptive search term at issue.
  • the automated search method of the invention involves first, extracting meaningful search terms or descriptors from an input text (when the input is a natural-language text), and second, using the extracted search terms to identify natural-language texts describing ideas, concepts, or events that are pertinent to the input text. This section considers the two operations separately.
  • Fig. 7 is a flow diagram of the operations carried out by Module C of the program for extracting meaningful search terms or descriptors.
  • The input text is a natural-language text, e.g., an abstract, summary, patent claim, head note, or short expository text describing an invention, idea, or event.
  • This text is processed by Module AC as described with reference to Fig. 5 above, to produce a list of non-generic words and word pairs formed from proximately arranged words. These terms are referred to here as "unfiltered terms" and are stored at 94, e.g., in a buffer.
  • An unfiltered term will be deemed a meaningful search term if its selectivity value is above a given, and preferably preselected, threshold, e.g., a value of at least 1.25 for individual word terms, and a value of at least 1.5 for word-pair terms. This is done, in accordance with the flow diagram shown in Fig. 7, by calling up each successive term from 94, finding its selectivity value from dictionary 30, and saving the term as a meaningful search term if its selectivity value is above a given threshold, as indicated by the logic at 94.
  • When all of the terms have been so processed, the program generates a list 96 of filtered search terms (descriptive terms). The program may also calculate corresponding match values for each of the terms in 96. As indicated above, the match value is determined from some function of the selectivity value, e.g., the cube root or square root of the selectivity value, and serves as a weighting factor for the text-match scoring, as will be seen in the next section.
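  • The filtering step of Module C might look like the sketch below. The function name `descriptive_terms`, the convention that a word pair is a space-joined string, and the sample selectivity values are all hypothetical; the thresholds of 1.25 and 1.5 and the cube-root match value are the examples given in the text. Terms missing from the dictionary are treated as having a selectivity value of zero, which is an assumption.

```python
def descriptive_terms(unfiltered_terms: list[str],
                      selectivity: dict[str, float],
                      word_threshold: float = 1.25,
                      pair_threshold: float = 1.5) -> dict[str, float]:
    """Keep terms whose selectivity value meets the threshold; return term -> match value."""
    selected = {}
    for term in unfiltered_terms:
        sv = selectivity.get(term, 0.0)                  # unknown terms are dropped
        threshold = pair_threshold if " " in term else word_threshold
        if sv >= threshold:
            selected[term] = sv ** (1.0 / 3.0)           # cube-root match value
    return selected


if __name__ == "__main__":
    # Hypothetical selectivity values, for illustration only.
    sv_table = {"heart": 8.0, "signal": 1.0, "heart rhythms": 27.0,
                "digitized signals": 1.3}
    terms = ["heart", "signal", "heart rhythms", "digitized signals", "unknown"]
    for term, mv in descriptive_terms(terms, sv_table).items():
        print(term, round(mv, 3))
```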
  • the next step in program operation is to employ the descriptive search terms generated as above to identify texts from the target-text database that contain terms that most closely match the descriptive search terms.
  • The text-matching search operation may be carried out in various ways. In one method, the program actually processes the individual texts in the target database (or other searchable database), to deconstruct the texts into word and word-pair terms, as above, and then carries out a term-by-term matching operation with each text, looking for and recording a match between descriptive terms and terms in each text. The total number of term matches for a text, and the total match score, is then recorded for each text. After processing all of the texts in the database in this way, the program identifies those texts having the highest match scores. This search procedure is relatively slow, since each text must be individually processed before a match score can be calculated.
  • a preferred and much faster search method uses the target-attribute dictionary for the search operation, as will now be discussed with respect to Figs. 8 and 9.
  • The program then accesses the target-attribute dictionary to record the text Ids and the corresponding match value for that term, as indicated at 98.
  • This operation, it will be appreciated, is effective to record all text Ids in the target database containing the term being considered.
  • The next step is to assign and accumulate match values for all of the Ids being identified, on an ongoing basis, that is, with each new group of text Ids associated with each new descriptive term.
  • This operation is indicated at 100 in Fig. 8 and given in more detail in Fig. 9.
  • the updated list of all Ids, indicated at 102 contains a list of all text Ids which have been identified at any particular point during the search. Initially the list includes all of those text Ids associated with the first descriptive term.
  • the program operates to compare each new text Id with each existing text Id in the updated list, as indicated at 106.
  • If the text Id is already in the updated list, the match value for the new term is assigned to that text Id, which now includes at least two match values. This operation, and the decision logic, is indicated at 110, 112 in the figures. If the text Id being examined is not already in the updated list, it is added to the list, as at 108, along with the corresponding match score. This process is repeated until all of the text Ids for a new term have been so processed, either by being added to the updated list or being added to an already existing text Id. The program is now ready to proceed to the next descriptive term, according to the program flow in Fig. 8.
  • the updated Id list includes, for each text Id found, one or more match values which indicate the number of matched terms associated with that text ID, and the match value for each of the terms.
  • the total match score for each text Id is then readily determined, e.g., by adding the match values associated with each text Id.
  • the scores for the text Ids are then ranked, as at 118 and a selected number of highest-matching scores, e.g., top 50 or 100 scores, are output at 120, e.g., at the user computer display.
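  • The dictionary-based search described above can be sketched as an accumulate-then-rank loop. The function name `rank_texts`, the data layout, and the placeholder text IDs are assumptions of this sketch; summing match values is the scoring example given in the text.

```python
from collections import defaultdict


def rank_texts(search_terms: dict[str, float],
               term_to_ids: dict[str, set[str]],
               top_n: int = 50) -> list[tuple[float, str, list[str]]]:
    """Return (total match score, text ID, matched terms) for the top-scoring texts.

    `search_terms` maps each descriptive term to its match value;
    `term_to_ids` is the dictionary mapping each term to the text IDs containing it.
    """
    updated = defaultdict(dict)                      # text ID -> {term: match value}
    for term, value in search_terms.items():
        for text_id in term_to_ids.get(term, ()):
            updated[text_id][term] = value           # new ID, or extra value on an existing ID
    scored = [(sum(values.values()), text_id, sorted(values))
              for text_id, values in updated.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))   # highest score first, ties by ID
    return scored[:top_n]


if __name__ == "__main__":
    terms = {"heart": 2.0, "wavelet": 3.0, "heart rhythms": 1.5}
    index = {"heart": {"T1", "T2"}, "wavelet": {"T2", "T3"}, "heart rhythms": {"T2"}}
    for score, text_id, matched in rank_texts(terms, index, top_n=2):
        print(text_id, score, matched)
```

Only texts that share at least one term with the input are ever touched, which is why this approach is faster than deconstructing and scoring every text in the database.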
  • Fig. 8 also shows at 122 a logic decision for another feature of the invention: the ability to group top-ranked hits into groups of "covering" or "spanning" references containing an optimal number of terms, as will now be described with reference to Fig. 10.
  • the program operation here starts with some number X of top-ranked texts for a specific target input, as at 124.
  • the program constructs an N-dimensional vector, where N is the number of target descriptive terms, as indicated at 126.
  • Each of the N vector coefficients in the N-dimensional vector is either a 1, indicating a given term is present in the text, or a 0, indicating the term is absent from that text.
  • For each group, e.g., triplet, the program adds the vectors in the group, using a logical OR operation, as at 132. This operation creates a new vector whose coefficients are 1 if the corresponding term is present in any of the Y vectors being added, and is 0 only if the corresponding term is absent from all Y vectors.
  • The 1-value coefficients may now be replaced by the match values of the corresponding terms, as at 134.
  • The "covering" score for each permutation group is now calculated as the sum of the vector coefficients, as at 136.
  • The scores are then sorted, and the highest-ranking group of texts is identified by the highest vector-sum value, as at 136 and 138.
  • The combined vectors with the highest scores give the best combinations of terms matching the target terms.
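  • The covering-group search can be sketched as below. The function name `best_covering_group` and the sample data are hypothetical. The sketch enumerates unordered combinations rather than permutations, a substitution made here because the logical OR does not depend on the order of the texts in a group.

```python
from itertools import combinations


def best_covering_group(text_terms: dict[str, set[str]],
                        match_values: dict[str, float],
                        group_size: int = 2) -> tuple[float, tuple[str, ...]]:
    """Find the group of texts whose OR-combined term vector has the highest weighted sum."""
    terms = sorted(match_values)                          # fixes the N vector positions
    vectors = {tid: [1 if t in present else 0 for t in terms]
               for tid, present in text_terms.items()}    # one 0/1 vector per text
    best = None
    for group in combinations(sorted(vectors), group_size):
        # Logical OR across the group's vectors, position by position.
        combined = [int(any(vectors[tid][i] for tid in group))
                    for i in range(len(terms))]
        # Replace each 1 by the term's match value and sum the coefficients.
        score = sum(bit * match_values[t] for bit, t in zip(combined, terms))
        if best is None or score > best[0]:
            best = (score, group)
    return best


if __name__ == "__main__":
    top_texts = {"T1": {"a", "b"}, "T2": {"a", "b", "c"}, "T3": {"d"}}
    values = {"a": 1.0, "b": 1.0, "c": 2.0, "d": 3.0}
    print(best_covering_group(top_texts, values))
```

Note that the best pair need not consist of the two individually highest-scoring texts, since a lower-ranked text may contribute terms the others lack.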
  • The following examples illustrate four searches carried out in accordance with the invention. Each example gives the target text (abstract and claim that served as the input text for the search), the words and word pairs generated from the input text, the corresponding selectivity values, and the patent numbers of the top-ranked abstracts found in the search.
  • The target input is from an issued US patent having predominantly issued US patents as cited references.
  • the occurrences of descriptive terms in the cited references are compared with those in the same number of top-ranked patents found in the search.
  • the histograms for Examples 1-4 are given in Examples 11-14.
  • Example 1 Method and apparatus for detection and treatment of cardiac arrhythmias
  • a device for monitoring heart rhythms is provided with an amplifier for receiving electrogram signals, a memory for storing digitized electrogram segments including signals indicative of depolarizations of a chamber or chamber of a patient's heart and a microprocessor and associated software for transforming analyzing the digitized signals.
  • the digitized signals are analyzed by first transforming the signals into signal wavelet coefficients using a wavelet transform. The higher amplitude ones of the signal wavelet coefficients are identified and the higher amplitude ones of the signal wavelet coefficients are compared with a corresponding set of template wavelet coefficients derived from signals indicative of a heart depolarization of known type.
  • the digitized signals may be transformed using a Haar wavelet transform to obtain the signal wavelet coefficients, and the transformed signals may be filtered by deleting lower amplitude ones of the signal wavelet coefficients.
  • the transformed signals may be compared by ordering the signal and template wavelet coefficients by absolute amplitude and comparing the orders of the signal and template wavelet coefficients. Alternatively, the transformed signals may be compared by calculating distances between the signal and wavelet coefficients.
  • the Haar transform may be a simplified transform which also emphasizes the signal contribution of the wider wavelet coefficients.
  • a device for monitoring heart rhythms comprising: means for storing digitized electrogram segments including signals indicative of depolarizations of a chamber or chamber of a patient's heart; means for transforming the digitized signals into signal wavelet coefficients; means for identifying higher amplitude ones of the signal wavelet coefficients; and means for generating a match metric corresponding to the higher amplitude ones of the signal wavelet coefficients and a corresponding set of template wavelet coefficients derived from signals indicative of a heart depolarization of known type, and identifying the heart rhythms in response to the match metric.
  • A method for removing motion artifacts from devices for sensing bodily parameters, and apparatus and system for effecting same, includes analyzing segments of measured data representing bodily parameters and possibly noise from motion artifacts.
  • Each segment of measured data may correspond to a single light signal transmitted and detected after transmission or reflection through bodily tissue.
  • Each data segment is frequency analyzed to determine dominant frequency components.
  • the frequency component which represents at least one bodily parameter of interest is selected for further processing.
  • the segment of data is subdivided into subsegments, each subsegment representing one heartbeat. The subsegments are used to calculate a modified average pulse as a candidate output pulse.
  • the candidate output pulse is analyzed to determine whether it is a valid bodily parameter and, if yes, it is output for use in calculating the at least one bodily parameter of interest without any substantial noise degradation.
  • the above method may be applied to red and infrared pulse oximetry signals prior to calculating pulsatile blood oxygen concentration. Apparatus and systems disclosed incorporate methods disclosed according to the invention.
  • Claim 1. A method of removing motion-induced noise artifacts from a single electrical signal representative of a pulse oximetry light signal, comprising: receiving a segment of raw data spanning a plurality of heartbeats from said single electrical signal; analyzing said segment of raw data for candidate frequencies, one of which said candidate frequencies may be representative of a valid plethysmographic pulse; analyzing each of said candidate frequencies to determine a best frequency including narrow bandpass filtering said segment of raw data at each of said candidate frequencies; outputting an average pulse signal computed from said segment of raw data and said best frequency; and repeating the above steps with a new segment of raw data. (Main Claim)
  • Target words/pairs with SF: remove (5711, 12291, 1.5); motion (1181, 2651, 1.5); artifacts (263, 57, 13.8421); sense (4497, 8708, 1.54927); bodily (10148, 8611, 3.53548); parameters (997, 1659, 1.80289); segment (970, 2202, 1.5); measure (4930, 8592, 1.72137); noise (477, 1280, 1.5); light (2161, 5201, 1.5); signal (5733, 17424, 1.5); transmit (3147, 9513, 1.5); detect (3975, 9067, 1.31521); reflect (1146, 3388, 1.5); tissue (4351, 131, 99.6412); frequency (2126, 4687, 1.5); dominant (21, 46, 1.5); subdivided (27, 155, 1.5); subsegment (2, 1, 6); hear (2322, 307, 22.6906); calculate (1016, 2312, 1.5); modify (588, 2379, 1.5); average (559, 1100
  • An alkylation process that employs a refractive index analyzer to monitor, control, and/or determine acid catalyst strength before, during, or after the alkylation reaction.
  • the invention relates to the alkylation of an olefinic feedstock with a sulfuric acid catalyst.
  • the acid typically enters the alkylation reactor train at between from about 92 to about 98 weight percent strength.
  • the concentration of acid is controlled and maintained by monitoring the refractive index of the acid in the product mixture comprising alkylate, mineral acid, water, and red oil.
  • At least one online analyzer using a refractometer prism sensor producing real-time measurements of the refractive index of the solution may be compared to the results of manual laboratory tests on the acid strength of the catalyst using manual sample analyses or titration methods. Periodically, after calibration of the system, samples may be taken to verify the precision of the online analyzer, if desired.
  • at least one sensor is connected to at least one transmitter and is capable of providing information related to the concentration of alkylation catalyst in the mixture such that the concentration level of acid in the mixture may be monitored and maintained.
  • A method for determining the concentration of acid in a solution containing unknown quantities of said acid within an alkylation reactor comprising forming a solution containing said acid in said alkylation reactor and measuring the concentration of said acid, the improvement comprising measuring the concentration of said acid within the said alkylation reactor with a refractive index sensor having: (a) a refracting prism with a measuring surface in contact with the solution; and (b) an image detector capable of producing a digital signal related to the refractive index of the solution by determining a bright-dark boundary between reflected and refracted light from the measuring surface and correlating the refractive index of the solution to the concentration of said acid in said solution.
  • A method for single well addressability in a sample processor with row and column feeds. A sample processor or chip has a matrix of reservoirs or wells arranged in columns and rows. Pressure or electrical pumping is utilized to fill the wells with materials.
  • single well addressability is achieved by deprotecting a single column (row) and coupling each transverse row (column) independently. After the coupling step, the next column (row) is deprotected and then coupling via rows (columns) is performed. Each well can have a unique coupling event.
  • the chemical events could include, for example, oxidation, reduction or cell lysis.
  • M row(0,3,0) row— wells(6,1 ,18) channel — wells(0,2,0)
  • N matrix(0,4,0) columns— matrix(3,38,0.236842)
  • M wells(0,1,0) agents—couple(28,61 ,1.37705) agents—row(0,1,0)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to an automated method, a system, and machine-readable code for matching an input text in a selected field with each of a plurality of natural-language texts in the same field. In the method, each of a plurality of terms characterizing the input text (32) is associated with a selectivity value (40) related to the frequency of occurrence of that term in a database (38) of digitally processed texts in the selected field (30), relative to the frequency of occurrence of the same term in a database of digitally processed texts in one or more unrelated fields. For each of the plurality of natural-language texts in the same field, a match score is determined that is related to the number of terms derived from that text that match those of the input text, weighted by the selectivity values of the matching terms. The matches with the highest scores are then identified.
PCT/US2002/021198 2002-07-03 2002-07-03 Text-matching code, system and method WO2004006133A1 (fr)

Priority Applications (4)

Application Number Priority Date Filing Date Title
PCT/US2002/021198 WO2004006133A1 (fr) 2002-07-03 2002-07-03 Code, systeme et procede d'adaptation a un texte
AU2002320280A AU2002320280A1 (en) 2002-07-03 2002-07-03 Text-machine code, system and method
US10/261,971 US7181451B2 (en) 2002-07-03 2002-09-30 Processing input text to generate the selectivity value of a word or word group in a library of texts in a field is related to the frequency of occurrence of that word or word group in library
US10/262,192 US20040006547A1 (en) 2002-07-03 2002-09-30 Text-processing database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2002/021198 WO2004006133A1 (fr) 2002-07-03 2002-07-03 Code, systeme et procede d'adaptation a un texte

Related Child Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2002/021200 Continuation WO2004006134A1 (fr) 2002-07-03 2002-07-03 Code, systeme et procede de traitement de texte

Publications (1)

Publication Number Publication Date
WO2004006133A1 (fr) 2004-01-15

Family

ID=30113573

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2002/021198 WO2004006133A1 (fr) 2002-07-03 2002-07-03 Code, systeme et procede d'adaptation a un texte

Country Status (2)

Country Link
AU (1) AU2002320280A1 (fr)
WO (1) WO2004006133A1 (fr)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5873056A (en) * 1993-10-12 1999-02-16 The Syracuse University Natural language processing system for semantic vector representation which accounts for lexical ambiguity
US6006221A (en) * 1995-08-16 1999-12-21 Syracuse University Multilingual document retrieval system and method using semantic vector matching
US6374210B1 (en) * 1998-11-30 2002-04-16 U.S. Philips Corporation Automatic segmentation of a text

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006012892A2 (fr) 2004-08-06 2006-02-09 Ulrich Rohs Mecanisme de transmission a anneau de friction comprenant deux corps de roulement places a distance l'un de l'autre autour d'une fente
EP2682645A2 (fr) 2004-08-06 2014-01-08 Rohs, Ulrich Mécanisme de transmission à anneau de friction doté de deux corps de laminage séparés l'un de l'autre par une fente
EP2687754A2 (fr) 2004-08-06 2014-01-22 Rohs, Ulrich Mécanisme de transmission à anneau de friction doté de deux corps de laminage séparés l'un de l'autre par une fente
EP2687753A2 (fr) 2004-08-06 2014-01-22 Ulrich Rohs Mécanisme de transmission à anneau de friction doté de deux corps de laminage séparés l'un de l'autre par une fente
US8677149B2 (en) 2010-12-14 2014-03-18 C3S Pte. Ltd. Method and system for protecting intellectual property in software
CN104063370B (zh) * 2014-07-01 2017-09-22 北京博雅立方科技有限公司 一种基于关键词的智能分组方法及装置
CN112650791A (zh) * 2020-12-29 2021-04-13 招联消费金融有限公司 字段处理方法、装置、计算机设备和存储介质
CN112650791B (zh) * 2020-12-29 2023-12-26 招联消费金融有限公司 字段处理方法、装置、计算机设备和存储介质
CN113535963A (zh) * 2021-09-13 2021-10-22 深圳前海环融联易信息科技服务有限公司 一种长文本事件抽取方法、装置、计算机设备及存储介质
CN117828030A (zh) * 2024-03-01 2024-04-05 微网优联科技(成都)有限公司 基于大数据的用户分析方法及电子设备
CN117828030B (zh) * 2024-03-01 2024-05-07 微网优联科技(成都)有限公司 基于大数据的用户分析方法及电子设备

Also Published As

Publication number Publication date
AU2002320280A1 (en) 2004-01-23

Similar Documents

Publication Publication Date Title
US7181451B2 (en) Processing input text to generate the selectivity value of a word or word group in a library of texts in a field is related to the frequency of occurrence of that word or word group in library
US20040006459A1 (en) Text-searching system and method
US7016895B2 (en) Text-classification system and method
EP0751470B1 (fr) Méthode automatique de génération de probabilités de charactéristiques de texte pour l'extraction automatique de résumés
CA2577376C (fr) Systeme et procede de recherche de questions de droit
US7162465B2 (en) System for analyzing occurrences of logical concepts in text documents
Hancock‐Beaulieu Evaluating the impact of an online library catalogue on subject searching behaviour at the catalogue and at the shelves
Huettner et al. Fuzzy typing for document management
US20060259475A1 (en) Database system and method for retrieving records from a record library
Milios et al. Automatic term extraction and document similarity in special text corpora
US20040064304A1 (en) Text representation and method
US20040059565A1 (en) Text-representation code, system, and method
US20040006547A1 (en) Text-processing database
Jacquemin et al. Retrieving terms and their variants in a lexicalized unification-based framework
US20040054520A1 (en) Text-searching code, system and method
Medelyan et al. Thesaurus-based index term extraction for agricultural documents
US20040044547A1 (en) Database for retrieving medical studies
WO2004006133A1 (fr) Code, systeme et procede d'adaptation a un texte
US20040186833A1 (en) Requirements-based knowledge discovery for technology management
WO2004006134A1 (fr) Code, systeme et procede de traitement de texte
CN116936135A (zh) 基于nlp技术的医疗大健康数据采集分析方法
JP2006221478A (ja) 文書検索装置及びマクロアプローチによるポートフォリオ分析装置
JP4301496B2 (ja) データベース検索装置、データベース検索方法およびプログラム
Grefenstette SEXTANT: Extracting semantics from raw text implementation details
Pomares-Quimbaya et al. Leveraging pubmed to create a specialty-based sense inventory for spanish acronym resolution

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LU MC NL PT SE SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 EP: the EPO has been informed by WIPO that EP was designated in this application
32PN EP: public notification in the EP bulletin as address of the addressee cannot be established
32PN EP: public notification in the EP bulletin as address of the addressee cannot be established

Free format text: COMMUNICATION PURSUANT TO RULE 69 EPC (EPO FORM 1205A OF 060705)

NENP Non-entry into the national phase

Ref country code: JP

WWW WIPO information: withdrawn in national office

Country of ref document: JP

122 EP: PCT application non-entry in European phase