EP1078322B1 - System for creating a dictionary - Google Patents
- Publication number
- EP1078322B1 (application EP99922966A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- dictionary
- entries
- corpus
- word
- lemma
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Lifetime
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/268—Morphological analysis
Definitions
- the present invention relates to computerized language systems.
- the present invention relates to dictionaries used in computerized language systems.
- Computerized language systems include a wide array of computer implemented functions that manipulate language to improve communication between a computer and a user. Examples include text-to-speech and speech-to-text converters, as well as natural language systems. In each of these systems, the computer must be able to determine the syntax of a sentence. In speech systems the syntax allows the computer to identify the proper tonal inflection for the speech. In natural language systems, the syntax allows the computer to identify the key words in a sentence.
- each dictionary entry indicates the word's part of speech and its stem, also known as its lemma.
- a dictionary entry for "wash” would indicate that the word is a noun and a verb, while the entry for "elate” would indicate that the word is only a verb.
- N. Ide et al., "Multext East Language specific resources", Copernicus project COP 106, Deliverable D1.2, May 1996, describes the resources for text segmentation and corresponding lexicons for language-specific resources.
- the lexicon is created semiautomatically. First, a frequency list of the word forms is made. Then, non-words are deleted from the list and the list is run through a morphology analyzer. Next, the output is transformed to conform to the MTE specifications.
- a computer readable medium has computer executable components that include a morphological analyzer capable of using a corpus of a large number of words to automatically form a dictionary containing words associated with a lemma and a part of speech.
- the computer executable components also include a dictionary analyzer capable of automatically improving the dictionary.
- FIG. 1 and the related discussion are intended to provide a brief, general description of a suitable computing environment in which the invention may be implemented.
- the invention will be described, at least in part, in the general context of computer-executable instructions, such as program modules, being executed by a personal computer.
- program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
- the invention may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like.
- the invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
- program modules may be located in both local and remote memory storage devices.
- an exemplary system for implementing the invention includes a general purpose computing device in the form of a conventional personal computer 20, including a processing unit (CPU) 21, a system memory 22, and a system bus 23 that couples various system components including the system memory 22 to the processing unit 21.
- the system bus 23 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures.
- the system memory 22 includes read only memory (ROM) 24 and random access memory (RAM) 25.
- a basic input/output system (BIOS) 26, containing the basic routine that helps to transfer information between elements within the personal computer 20, such as during start-up, is stored in ROM 24.
- the personal computer 20 further includes a hard disk drive 27 for reading from and writing to a hard disk (not shown), a magnetic disk drive 28 for reading from or writing to removable magnetic disk 29, and an optical disk drive 30 for reading from or writing to a removable optical disk 31 such as a CD ROM or other optical media.
- the hard disk drive 27, magnetic disk drive 28, and optical disk drive 30 are connected to the system bus 23 by a hard disk drive interface 32, magnetic disk drive interface 33, and an optical drive interface 34, respectively.
- the drives and the associated computer-readable media provide nonvolatile storage of computer readable instructions, data structures, program modules and other data for the personal computer 20.
- Although the exemplary environment described herein employs the hard disk, the removable magnetic disk 29 and the removable optical disk 31, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read only memory (ROM), and the like, may also be used in the exemplary operating environment.
- a number of program modules may be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, including an operating system 35, one or more application programs 36, other program modules 37, and program data 38.
- a user may enter commands and information into the personal computer 20 through input devices such as a keyboard 40, pointing device 42 and a microphone 43.
- Other input devices may include a joystick, game pad, satellite dish, scanner, or the like.
- These and other input devices are often connected to the processing unit 21 through a serial port interface 46 that is coupled to the system bus 23, but may be connected by other interfaces, such as a sound card, a parallel port, a game port or a universal serial bus (USB).
- a monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48.
- personal computers may typically include other peripheral output devices, such as a speaker 45 and printers (not shown).
- the personal computer 20 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 49.
- the remote computer 49 may be another personal computer, a hand-held device, a server, a router, a network PC, a peer device or other network node, and typically includes many or all of the elements described above relative to the personal computer 20, although only a memory storage device 50 has been illustrated in FIG. 1 .
- the logical connections depicted in FIG. 1 include a local area network (LAN) 51 and a wide area network (WAN) 52.
- When used in a LAN networking environment, the personal computer 20 is connected to the local area network 51 through a network interface or adapter 53. When used in a WAN networking environment, the personal computer 20 typically includes a modem 54 or other means for establishing communications over the wide area network 52, such as the Internet.
- the modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46.
- program modules depicted relative to the personal computer 20, or portions thereof may be stored in the remote memory storage devices. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used. For example, a wireless communication link may be established between one or more portions of the network.
- FIG. 2 is a block diagram of system 100 of the present invention.
- a corpus 102 consisting of a large number of words is provided to a morphological analyzer 104.
- corpus 102 consists of words written as sentences.
- corpus 102 can include news articles, fictional stories, or instruction booklets.
- corpus 102 consists of at least 1 million words.
- Morphological analyzer 104 produces a dictionary of analyses from corpus 102 by applying morphological rules to the words in corpus 102.
- the analyses for each word are triples having three parts: the word, the word's lemma and the word's part of speech.
- the rules that morphological analyzer 104 uses to produce the analyses from corpus 102 are developed by a person skilled in the particular language being analyzed. An example rule in English is that words that end in "ed" are commonly verbs and their lemma is formed by either removing the "d” or the "ed”.
- dictionary analyzer 106 improves the dictionary by adding a set of default entries and by deleting entries that are unlikely to be valid words in the language.
- the process used by dictionary analyzer 106 is discussed further below.
- the results of the improvements provided by dictionary analyzer 106 form final dictionary 108, which can be used in computer language systems. For example, final dictionary 108 only includes one entry for each lemma/part-of-speech pair. The different forms of the lemma that appear in the corpus are generally not stored in final dictionary 108.
- FIG. 3 is a flow diagram of the method of the present invention for automatically producing a dictionary.
- the morphological analyzer 104 produces a set of analyses using corpus 102 as input.
- these analyses take the form of triples consisting of a word, a lemma and a part of speech. Examples of such triples are shown in dictionary portion 150 of FIG. 4 .
- the triples listed in dictionary portion 150 of FIG. 4 are limited to variations of the word "arrest" that appear in corpus 102. Those skilled in the art will recognize that with at least one million words in corpus 102, there are several thousand unique words. As such, morphological analyzer 104 will produce several thousand analyses or triples in its initial dictionary. Since it is impossible to show a complete dictionary, FIG. 4 limits itself to variations of the word "arrest”.
- the results from morphological analyzer 104 that are shown in dictionary portion 150 are illustrative of the errors that morphological analyzer 104 produces in attempting to build a dictionary.
- the word "arrest” was analyzed by morphological analyzer 104 as being a form of the lemma "arr” and was identified as an adjective.
- Morphological analyzer 104 guessed that "arrest” was an adjective based on the "est” suffix, which typically is associated with the superlative form of an adjective (as in, for example, "quick”/"quickest”). However, it is clear that arrest is not an adjective and that its lemma is not "arr”.
- Entries 160 and 162 of dictionary portion 150 illustrate that morphological analyzer 104 provides multiple lemma/word combinations if several analyses are possible, given the morphological rules used. Specifically, for the word "arrested” found both in entries 160 and 162, morphological analyzer 104 used a separate morphological rule for each entry. For entry 160, morphological analyzer 104 used a rule that states that a word ending in "ed” has a lemma that is constructed by dropping the "d” from the word (as in the pair "please”/"pleased”).
- For entry 162, morphological analyzer 104 used a rule that states that a word ending in "ed" has a lemma that is constructed by dropping the "ed" from the word (as in the pair "walk"/"walked"). Since morphological analyzer 104 cannot tell which rule gives the right lemma in this case, it provides both lemmas. Entries 164 and 166 show similar dual rules for the word "arresting".
- Entries 168 and 170 of dictionary portion 150 show that morphological analyzer 104 can assign a single word to two different parts of speech.
- a word ending in "s" can either be the plural of a noun or can be the third person singular of a verb.
- morphological analyzer 104 produces two entries for any word ending in "s”.
- In entries 168 and 170, morphological analyzer 104 has produced two entries for the word "arrests". Both entries have the same lemma, "arrest", but entry 168 identifies the word "arrests" as being a verb and entry 170 identifies the word as being a noun.
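The rule-driven analysis described above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `Triple`, `RULES` and `analyze` are hypothetical, and the two "ing" rules are an assumption, since the text only says that "arresting" receives dual analyses similar to those for "arrested".

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One analysis: the surface word, its guessed lemma, and a part of speech."""
    word: str
    lemma: str
    pos: str


# Each rule is (suffix, replacement, part of speech). Hypothetical rule table.
RULES = [
    ("ed", "e", "verb"),        # drop the "d":  "pleased" -> "please"
    ("ed", "", "verb"),         # drop the "ed": "walked"  -> "walk"
    ("ing", "", "verb"),        # assumed rule, not spelled out in the text
    ("ing", "e", "verb"),       # assumed rule, not spelled out in the text
    ("est", "", "adjective"),   # superlative: "quickest" -> "quick"
    ("s", "", "verb"),          # third person singular
    ("s", "", "noun"),          # plural
]


def analyze(word: str) -> List[Triple]:
    """Apply every matching suffix rule; ambiguous words yield several triples."""
    analyses = []
    for suffix, replacement, pos in RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            lemma = word[: -len(suffix)] + replacement
            analyses.append(Triple(word, lemma, pos))
    return analyses


if __name__ == "__main__":
    for w in ["arrest", "arrested", "arresting", "arrests"]:
        for triple in analyze(w):
            print(triple)
```

Because the rules are applied blindly, "arrest" is analyzed as an adjective with lemma "arr", which reproduces the kind of error the later cleanup steps are meant to remove.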
- After morphological analyzer 104 has produced its dictionary of triples, the process continues at step 112, where default analyses, explained below, are added to the dictionary. Default analyses can be added either by morphological analyzer 104 or by dictionary analyzer 106.
- FIG. 5 depicts expanded dictionary portion 180, which is dictionary portion 150 expanded by the inclusion of the default triples formed in step 112.
- Each word found in corpus 102 has an associated set of default triples.
- each set of default triples consists of four separate triples that each use their respective word as both the WORD and the LEMMA in the triple. Although their WORDs and LEMMAs are the same, each triple in a set of triples has a different part of speech.
- the word "arrest" in entry 182 has a set of default triples 184 consisting of triples 186, 188, 190 and 192.
- In each of the triples 186, 188, 190 and 192, "arrest" appears as the WORD in the triple and "arrest" appears as the LEMMA in the triple.
- each of the triples in the set of default triples 184 has a unique part of speech.
- In triple 186, "arrest" is identified as an adjective; in triple 188, "arrest" is identified as an adverb; in triple 190, "arrest" is identified as a noun; and in triple 192, "arrest" is identified as a verb.
- sets of default triples 194, 196 and 198 provide default triples for the words “arrested”, “arresting” and "arrests", respectively.
- the default triples of expanded dictionary portion 180 are added to assist in identifying the correct lemma for a word. As will be discussed below, this is based on the observation that the lemma of a given word will also be present in the corpus. Default triples are an implementation of that hypothesis: at this stage, every word is treated as its own lemma. This will be useful in cases such as entry 182, where morphological analyzer 104 has analyzed the form "arrest" as an adjective with the lemma "arr”. As will be shown, the fact that there will be no default triple associated with the form "arr" will be used to reject that analysis. Note, of course, that the creation of the default triples adds many invalid entries to expanded dictionary portion 180 at this stage.
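The default-triple step (step 112) might look like the following sketch. It assumes triples are plain `(word, lemma, part_of_speech)` tuples; the names `DEFAULT_POS` and `default_triples` are hypothetical, and the four parts of speech are those listed for triples 186 to 192.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (word, lemma, part of speech)

# The four parts of speech used for the default triples in the example.
DEFAULT_POS = ("adjective", "adverb", "noun", "verb")


def default_triples(corpus_words: Iterable[str]) -> List[Triple]:
    """For every distinct corpus word, emit four triples in which the word
    is treated as its own lemma, one per part of speech."""
    return [
        (word, word, pos)
        for word in sorted(set(corpus_words))
        for pos in DEFAULT_POS
    ]


if __name__ == "__main__":
    for triple in default_triples(["arrest", "arrests"]):
        print(triple)
```

Most of these default triples are invalid on purpose; they exist so that later steps can check whether a proposed lemma is itself attested as a word.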
- the process of FIG. 3 performs a two-tier sort at box 114.
- the entries are sorted in alphabetical order by their lemmas.
- the entries for identical lemmas are sorted on their parts of speech.
- FIG. 6 shows a dictionary portion 200 which is formed by performing the two-tier sort of step 114 of FIG. 3 on expanded dictionary portion 180 of FIG. 5 .
- Group 202 is an exemplary group of entries that all share the lemma "arrest”.
- the entries are sorted based on their part of speech to form sub-groups. For example, each of the entries in sub-group 210 has "arrest" as its lemma and "verb" as its part of speech.
- entries in sub-groups 204, 206 and 208 are limited to nouns, adverbs and adjectives, respectively. This is because in English these are the parts of speech that inflect; in other languages, different parts of speech might be used.
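The two-tier sort of step 114 can be expressed as a single sort on a compound key. The sketch below assumes `(word, lemma, part_of_speech)` tuples; the function names are hypothetical, and ordering the parts of speech alphabetically within a lemma is an assumption, since the text does not say how sub-groups are ordered.

```python
from itertools import groupby
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (word, lemma, part of speech)


def two_tier_sort(entries: List[Triple]) -> List[Triple]:
    """Sort by lemma first, then by part of speech within each lemma."""
    return sorted(entries, key=lambda e: (e[1], e[2]))


def sub_groups(sorted_entries: List[Triple]) -> Dict[Tuple[str, str], List[Triple]]:
    """Collect the sorted entries into lemma/part-of-speech sub-groups."""
    return {
        key: list(group)
        for key, group in groupby(sorted_entries, key=lambda e: (e[1], e[2]))
    }


if __name__ == "__main__":
    entries = [
        ("arrests", "arrest", "verb"),
        ("arrest", "arr", "adjective"),
        ("arrest", "arrest", "verb"),
        ("arrests", "arrest", "noun"),
    ]
    for key, group in sub_groups(two_tier_sort(entries)).items():
        print(key, group)
```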
- dictionary analyzer 106 can begin to eliminate entries that are not likely to be real words in the language.
- the first step for eliminating such entries is step 116 where entries that have a unique lemma/part of speech combination are eliminated unless their respective lemma is different from their respective word.
- the effects of step 116 are exemplified in dictionary portion 220 of FIG. 7 , which shows the effects of step 116 on dictionary portion 200 of FIG. 6 .
- In dictionary portion 220 of FIG. 7, entries that have been eliminated by step 116 have a line drawn through them.
- entry 222 has been eliminated by step 116 because entry 222 has the only occurrence of "arrest” as a lemma for an adjective and the lemma of entry 222, "arrest", is identical to the word of entry 222.
- Entry 224 of dictionary portion 220 has not been stricken at step 116 because entry 224 is not the only entry in the dictionary that uses "arrest” as a lemma for a noun. Specifically, entry 226 also uses "arrest" as a lemma for a noun.
- Entry 228 of dictionary portion 220 has not been eliminated by step 116 even though it is the only entry in the dictionary that uses "arr” as a lemma for an adjective.
- the reason entry 228 has not been eliminated is that the lemma for entry 228, "arr”, is different from the word for entry 228, "arrest”.
- Step 116 removes entries based on the assumption that all valid entries for the dictionary will have lemmas that are inflected to produce other words in the dictionary. For example, the lemma of entry 224 is "arrest" which is inflected to form the word “arrests" in entry 226.
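As a rough sketch of step 116, the filter below removes an entry only when its lemma/part-of-speech pair occurs once in the dictionary and its lemma equals its word. The tuple layout and the name `drop_unsupported_defaults` are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (word, lemma, part of speech)


def drop_unsupported_defaults(entries: List[Triple]) -> List[Triple]:
    """Remove entries whose lemma/part-of-speech pair is unique in the
    dictionary, unless the lemma differs from the word."""
    pair_counts = Counter((lemma, pos) for _, lemma, pos in entries)
    return [
        (word, lemma, pos)
        for word, lemma, pos in entries
        if not (pair_counts[(lemma, pos)] == 1 and lemma == word)
    ]


if __name__ == "__main__":
    entries = [
        ("arrest", "arr", "adjective"),     # kept: lemma differs from word
        ("arrest", "arrest", "adjective"),  # removed: unique pair, lemma == word
        ("arrest", "arrest", "noun"),       # kept: pair shared with "arrests"
        ("arrests", "arrest", "noun"),
    ]
    for triple in drop_unsupported_defaults(entries):
        print(triple)
```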
- dictionary analyzer 106 advances to step 118 where it eliminates entries that have a lemma that does not appear in corpus 102.
- Step 118 is best shown using dictionary portion 230 of FIG. 8 .
- In dictionary portion 230 of FIG. 8, the lined entries that appeared in dictionary portion 220 of FIG. 7 have been removed.
- entries that are eliminated by step 118 of FIG. 3 have lines drawn through them in dictionary portion 230.
- In dictionary portion 230, three entries 232, 234 and 236 are eliminated by step 118.
- For entry 232, the associated lemma, "arr", does not appear in corpus 102. This is confirmed by the fact that "arr" does not appear as a word in any other entry in the dictionary. Since each word in corpus 102 appears as a word in the dictionary, if a lemma is not found as a word in the dictionary, it does not appear in corpus 102.
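Step 118 relies on the fact that every corpus word appears as a WORD in the dictionary, so the set of WORD fields can stand in for the corpus vocabulary. A minimal sketch, assuming `(word, lemma, part_of_speech)` tuples and a hypothetical function name:

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (word, lemma, part of speech)


def drop_unattested_lemmas(entries: List[Triple]) -> List[Triple]:
    """Keep only entries whose lemma also occurs as a word in the dictionary,
    which is equivalent to the lemma occurring in the corpus."""
    attested = {word for word, _, _ in entries}
    return [entry for entry in entries if entry[1] in attested]


if __name__ == "__main__":
    entries = [
        ("arrest", "arr", "adjective"),   # removed: "arr" is never a word
        ("arrest", "arrest", "noun"),
        ("arrested", "arreste", "verb"),  # removed: "arreste" is never a word
        ("arrested", "arrest", "verb"),
    ]
    for triple in drop_unattested_lemmas(entries):
        print(triple)
```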
- dictionary analyzer 106 proceeds to step 120 where it identifies entries with identical word/lemma combinations, and for each set of entries that share a word/lemma combination, dictionary analyzer 106 applies language-specific heuristics to determine whether all are valid words in the language.
- FIG. 9 shows the state of the dictionary after dictionary analyzer 106 has applied such heuristics, assuming that the phrase "the arrest” was found in the corpus.
- the lemma "arrest” is associated with both a verb and a noun.
- dictionary analyzer 106 proceeds to step 122 where it identifies words in corpus 102 that are not present in the dictionary. The dictionary analyzer then produces analyses of these words using morphological analyzer 104. Step 122 is needed because words found in the corpus can be deleted from the dictionary in steps 116, 118 and 120.
- dictionary portion 260 of FIG. 10 is provided.
- Dictionary portion 260 is the same as dictionary portion 230 of FIG. 8 except that, for the purposes of this explanation, in dictionary portion 260 it is assumed that the word “arrest” is not present in the corpus 102 even though the words “arrests", “arrested” and “arresting” are present in corpus 102.
- In that case, step 118 of FIG. 3 eliminates all entries that have "arrest" as a lemma. As such, entries 262, 264, 266 and 268 would be eliminated from the dictionary along with entries 270 and 272, which have a lemma of "arreste".
- supplemental dictionary portion 280 shows the triples for the words “arrests”, “arrested” and “arresting” that appear in the corpus 102 but not in the dictionary.
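Step 122 can be sketched as a set difference followed by a second pass through the analyzer. In the sketch below, `find_missing_words`, `build_supplement` and `toy_analyze` are hypothetical names; `toy_analyze` is only a stand-in for morphological analyzer 104 and implements just the two "ed" rules mentioned in the text.

```python
from typing import Callable, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (word, lemma, part of speech)


def find_missing_words(corpus_words: Iterable[str], dictionary: List[Triple]) -> List[str]:
    """Return corpus words that no longer appear as a WORD in the dictionary."""
    present = {word for word, _, _ in dictionary}
    return sorted(set(corpus_words) - present)


def build_supplement(
    corpus_words: Iterable[str],
    dictionary: List[Triple],
    analyze: Callable[[str], List[Triple]],
) -> List[Triple]:
    """Re-analyze every missing corpus word to form a supplemental dictionary."""
    supplement: List[Triple] = []
    for word in find_missing_words(corpus_words, dictionary):
        supplement.extend(analyze(word))
    return supplement


def toy_analyze(word: str) -> List[Triple]:
    """Stand-in analyzer: only the two "ed" rules, for demonstration."""
    if word.endswith("ed") and len(word) > 2:
        return [(word, word[:-2], "verb"), (word, word[:-1], "verb")]
    return []


if __name__ == "__main__":
    dictionary = [("the", "the", "noun")]
    print(build_supplement(["the", "arrested"], dictionary, toy_analyze))
```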
- dictionary analyzer 106 selects one entry from each group of entries that share the same word/part of speech combination. The selection is performed by preferring those entries with lemmas that appear the most in the dictionary.
- Supplemental dictionary portion 290 of FIG. 12 shows the effects of step 124 on supplemental dictionary portion 280.
- entries eliminated by step 124 are shown with lines through them.
- dictionary analyzer 106 looks for entries that have the same word/part-of-speech combination. For example entries 292 and 294 both identify the word “arrested” as being a verb. However, entry 292 predicts that the lemma for "arrested” is “arrest” and entry 294 predicts that the lemma is "arreste”.
- dictionary analyzer 106 counts the number of times each lemma appears in supplemental dictionary portion 280. It then selects the entry that has the most frequently appearing lemma.
- dictionary analyzer 106 prefers entry 292 and eliminates entry 294. Similarly, dictionary analyzer 106 prefers entry 296 over entry 298, which both identify the word "arresting" as a verb.
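The selection in step 124 can be sketched as a count of lemma occurrences followed by a per-group maximum. The function name is hypothetical, and so are two details of the sample data: the text does not state the lemma of entry 298, so "arreste" is assumed here, and ties are broken by keeping the first entry encountered, which the text does not specify.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (word, lemma, part of speech)


def select_by_lemma_frequency(supplement: List[Triple]) -> List[Triple]:
    """For each word/part-of-speech group, keep the single entry whose lemma
    occurs most often across the supplemental dictionary."""
    lemma_counts = Counter(lemma for _, lemma, _ in supplement)
    groups: Dict[Tuple[str, str], List[Triple]] = defaultdict(list)
    for entry in supplement:
        groups[(entry[0], entry[2])].append(entry)
    # max() returns the first maximal entry, so ties keep the earliest entry.
    return [
        max(group, key=lambda e: lemma_counts[e[1]])
        for group in groups.values()
    ]


if __name__ == "__main__":
    supplement = [
        ("arrested", "arrest", "verb"),
        ("arrested", "arreste", "verb"),
        ("arresting", "arrest", "verb"),
        ("arresting", "arreste", "verb"),  # assumed lemma for the rejected entry
        ("arrests", "arrest", "noun"),
        ("arrests", "arrest", "verb"),
    ]
    for triple in select_by_lemma_frequency(supplement):
        print(triple)
```

In this sample, "arrest" occurs four times as a lemma and "arreste" twice, so the "arrest" analyses are kept for both "arrested" and "arresting".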
- After step 124, dictionary analyzer 106 proceeds to step 126, where it applies the same set of language heuristics discussed in step 120 to determine whether all the entries are valid words in the language.
- FIG. 13 shows the effects of step 126 with supplemental dictionary portion 300, which is produced from supplemental dictionary portion 290. In supplemental dictionary portion 300, those entries with lines through them in supplemental dictionary portion 290 have been removed.
- entries 302 and 304 each have “arrest” as a word and have “arrest” as a lemma. However, entry 302 treats “arrest” as a noun and entry 304 treats “arrest” as a verb. Since "arrest” forms both valid nouns and verbs in English, both entries remain in the dictionary after step 126.
- dictionary analyzer 106 adds the supplemental dictionary to the dictionary formed at the end of step 120 to form a complete dictionary.
- This complete dictionary may be reduced by eliminating the "WORD" from each entry to produce entries that only have a lemma and a part of speech. Entries with the same lemma/part of speech pair are then reduced to a single entry.
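The final reduction amounts to projecting each triple onto its lemma and part of speech and removing duplicates. A minimal sketch, with a hypothetical function name and the same assumed tuple layout; sorting the result is a presentation choice, not something the text requires.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (word, lemma, part of speech)


def reduce_dictionary(entries: List[Triple]) -> List[Tuple[str, str]]:
    """Drop the WORD field and collapse identical lemma/part-of-speech pairs."""
    return sorted({(lemma, pos) for _, lemma, pos in entries})


if __name__ == "__main__":
    complete = [
        ("arrest", "arrest", "noun"),
        ("arrests", "arrest", "noun"),
        ("arrests", "arrest", "verb"),
        ("arrested", "arrest", "verb"),
    ]
    print(reduce_dictionary(complete))
```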
Abstract
Description
- The present invention relates to computerized language systems. In particular, the present invention relates to dictionaries used in computerized language systems.
- Computerized language systems include a wide array of computer implemented functions that manipulate language to improve communication between a computer and a user. Examples include text-to-speech and speech-to-text converters, as well as natural language systems. In each of these systems, the computer must be able to determine the syntax of a sentence. In speech systems the syntax allows the computer to identify the proper tonal inflection for the speech. In natural language systems, the syntax allows the computer to identify the key words in a sentence.
- To determine syntax in a sentence, computerized language systems rely on dictionaries that list valid words for a particular language. Preferably, each dictionary entry indicates the word's part of speech and its stem, also known as its lemma. For example, a dictionary entry for "wash" would indicate that the word is a noun and a verb, while the entry for "elate" would indicate that the word is only a verb.
- In the art, such dictionaries are built by hand. This requires a great deal of time, which greatly increases the cost of producing computerized language systems for the various languages of the world.
- N. Ide et al., "Multext East Language specific resources", Copernicus project COP 106, Deliverable D1.2, May 1996, describes the resources for text segmentation and corresponding lexicons for language-specific resources. The lexicon is created semiautomatically. First, a frequency list of the word forms is made. Then, non-words are deleted from the list and the list is run through a morphology analyzer. Next, the output is transformed to conform to the MTE specifications.
- It is therefore the object of the present invention to provide an improved method for creating a dictionary of words for a language that can be implemented automatically by a computerized language system, as well as a corresponding computer-readable medium.
- This object is solved by the subject matter of the independent claims.
- Preferred embodiments are defined by the dependent claims.
- A computer readable medium has computer executable components that include a morphological analyzer capable of using a corpus of a large number of words to automatically form a dictionary containing words associated with a lemma and a part of speech. The computer executable components also include a dictionary analyzer capable of automatically improving the dictionary.
- FIG. 1 is a block diagram of an operating environment for the present invention.
- FIG. 2 is a block diagram of the components of the present invention.
- FIG. 3 is a flow diagram of the process of the present invention.
- FIG. 4 is a portion of a dictionary produced by the morphological analyzer of FIG. 2.
- FIG. 5 is the portion of a dictionary of FIG. 4 expanded by inserting default entries for each word in the corpus.
- FIG. 6 is a sorted version of the dictionary portion of FIG. 5.
- FIG. 7 is the dictionary portion of FIG. 6 showing entries eliminated by step 116 of FIG. 3.
- FIG. 8 is the dictionary portion of FIG. 7 after step 118 of FIG. 3.
- FIG. 9 is the dictionary portion of FIG. 8 after step 120 of FIG. 3.
- FIG. 10 provides a second dictionary portion for a corpus that lacks the word "arrest".
- FIG. 11 is a portion of a dictionary supplement based on words found in the corpus that are not found in the dictionary at step 122 of FIG. 3.
- FIG. 12 is the dictionary supplement of FIG. 11 after step 124 of FIG. 3.
- FIG. 13 is the dictionary supplement of FIG. 12 after step 126 of FIG. 3.
FIG, 1 and the related discussion are intended to provide a brief, general description of a suitable computing environment in which the invention may be implemented. Although not required, the invention will be described, at least in part, in the general context of computer-executable instructions, such as program modules, being executed by a personal computer. Generally, program modules include routine programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the invention may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices. - With reference to
FIG. 1 , an exemplary system for implementing the invention includes a general purpose computing device in the form of a conventionalpersonal computer 20, including a processing unit (CPU) 21, asystem memory 22, and asystem bus 23 that couples various system components including thesystem memory 22 to theprocessing unit 21. Thesystem bus 23 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. Thesystem memory 22 includes read only memory (ROM) 24 and random access memory (RAM) 25. A basic input/output (BIOS) 26, containing the basic routine that helps to transfer information between elements within thepersonal computer 20, such as during start-up, is stored inROM 24. Thepersonal computer 20 further includes ahard disk drive 27 for reading from and writing to a hard disk (not shown), amagnetic disk drive 28 for reading from or writing to removablemagnetic disk 29, and anoptical disk drive 30 for reading from or writing to a removableoptical disk 31 such as a CD ROM or other optical media. Thehard disk drive 27,magnetic disk drive 28, andoptical disk drive 30 are connected to thesystem bus 23 by a harddisk drive interface 32, magneticdisk drive interface 33, and anoptical drive interface 34, respectively. The drives and the associated computer-readable media provide nonvolatile storage of computer readable instructions, data structures, program modules and other data for thepersonal computer 20. - Although the exemplary environment described herein employs the hard disk, the removable
magnetic disk 29 and the removableoptical disk 31, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read only memory (ROM), and the like, may also be used in the exemplary operating environment. - A number of program modules may be stored on the hard disk,
magnetic disk 29,optical disk 31,ROM 24 orRAM 25, including anoperating system 35, one ormore application programs 36,other program modules 37, andprogram data 38. A user may enter commands and information into thepersonal computer 20 through input devices such as akeyboard 40, pointingdevice 42 and amicrophone 43. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to theprocessing unit 21 through aserial port interface 46 that is coupled to thesystem bus 23, but may be connected by other interfaces, such as a sound card, a parallel port, a game port or a universal serial bus (USB). Amonitor 47 or other type of display device is also connected to thesystem bus 23 via an interface, such as avideo adapter 48. In addition to themonitor 47, personal computers may typically include other peripheral output devices, such as aspeaker 45 and printers (not shown). - The
personal computer 20 may operate in a networked environment using logic connections to one or more remote computers, such as aremote computer 49. Theremote computer 49 may be another personal computer, a hand-held device, a server, a router, a network PC, a peer device or other network node, and typically includes many or all of the elements described above relative to thepersonal computer 20, although only amemory storage device 50 has been illustrated inFIG. 1 . The logic connections depicted inFIG. 1 include a local area network (LAN) 51 and a wide area network (WAN) 52. Such networking environments are commonplace in offices, enterprisewide computer network Intranets and the Internet. - When used in a LAN networking environment, the
personal computer 20 is connected to the local area network 51 through a network interface or adapter 53. When used in a WAN networking environment, the personal computer 20 typically includes a modem 54 or other means for establishing communications over the wide area network 52, such as the Internet. The modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46. In a network environment, program modules depicted relative to the personal computer 20, or portions thereof, may be stored in the remote memory storage devices. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used. For example, a wireless communication link may be established between one or more portions of the network.
FIG. 2 is a block diagram of system 100 of the present invention. A corpus 102 consisting of a large number of words is provided to a morphological analyzer 104. Preferably, corpus 102 consists of words written as sentences. For instance, corpus 102 can include news articles, fictional stories, or instruction booklets. Preferably, corpus 102 consists of at least 1 million words.
Morphological analyzer 104 produces a dictionary of analyses from corpus 102 by applying morphological rules to the words in corpus 102. In preferred embodiments, the analyses for each word are triples having three parts: the word, the word's lemma and the word's part of speech. The rules that morphological analyzer 104 uses to produce the analyses from corpus 102 are developed by a person skilled in the particular language being analyzed. An example rule in English is that words that end in "ed" are commonly verbs and their lemma is formed by either removing the "d" or the "ed".
- The dictionary produced by
morphological analyzer 104 is passed to dictionary analyzer 106, which improves the dictionary. Dictionary analyzer 106 improves the dictionary by adding a set of default entries and by deleting entries that are unlikely to be valid words in the language. The process used by dictionary analyzer 106 is discussed further below. The results of the improvements provided by dictionary analyzer 106 form final dictionary 108, which can be used in computer language systems. For example, final dictionary 108 only includes one entry for each lemma/part-of-speech pair. The different forms of the lemma that appear in the corpus are generally not stored in final dictionary 108.
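As a rough illustration of the analyses described above, the following Python sketch represents each analysis as a (word, lemma, part-of-speech) triple and applies only the "ed" rule given as an example. The names `Analysis` and `analyze_ed` are hypothetical and do not come from the patent; an actual morphological analyzer would apply a much larger, language-specific rule set.

```python
from typing import List, NamedTuple


class Analysis(NamedTuple):
    """One analysis triple: surface word, proposed lemma, part of speech."""
    word: str
    lemma: str
    pos: str


def analyze_ed(word: str) -> List[Analysis]:
    """Apply the "ed" rule: propose both candidate lemmas, each as a verb."""
    if not word.endswith("ed") or len(word) <= 2:
        return []
    return [
        Analysis(word, word[:-1], "verb"),  # drop "d":  pleased -> please
        Analysis(word, word[:-2], "verb"),  # drop "ed": walked  -> walk
    ]


for triple in analyze_ed("arrested"):
    print(triple)
```

Both candidate lemmas are kept at this stage; deciding between them is left to the later pruning steps.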
FIG. 3 is a flow diagram of the method of the present invention for automatically producing a dictionary. In step 110 of the process, the morphological analyzer 104 produces a set of analyses using corpus 102 as input. In preferred embodiments, these analyses take the form of triples consisting of a word, a lemma and a part of speech. Examples of such triples are shown in dictionary portion 150 of FIG. 4.
- The triples listed in
dictionary portion 150 of FIG. 4 are limited to variations of the word "arrest" that appear in corpus 102. Those skilled in the art will recognize that with at least one million words in corpus 102, there are several thousand unique words. As such, morphological analyzer 104 will produce several thousand analyses or triples in its initial dictionary. Since it is impossible to show a complete dictionary, FIG. 4 limits itself to variations of the word "arrest".
- In
FIG. 4, the three portions of the triples are aligned in three respective columns. Column 152, headed by the identifier "WORD", includes the words of corpus 102. Each word's associated lemma is found in column 154, which is headed by the term "LEMMA". The part of speech assigned to the word by the morphological analyzer is listed in column 156 under the heading "PART-OF-SPEECH".
- The results from
morphological analyzer 104 that are shown in dictionary portion 150 are illustrative of the errors that morphological analyzer 104 produces in attempting to build a dictionary. For example, in entry 158, the word "arrest" was analyzed by morphological analyzer 104 as being a form of the lemma "arr" and was identified as an adjective. Morphological analyzer 104 guessed that "arrest" was an adjective based on the "est" suffix, which typically is associated with the superlative form of an adjective (as in, for example, "quick"/"quickest"). However, it is clear that "arrest" is not an adjective and that its lemma is not "arr".
Entries 160 and 162 of dictionary portion 150 illustrate that morphological analyzer 104 provides multiple lemma/word combinations if several analyses are possible, given the morphological rules used. Specifically, for the word "arrested" found in both entries 160 and 162, morphological analyzer 104 used a separate morphological rule for each entry. For entry 160, morphological analyzer 104 used a rule that states that a word ending in "ed" has a lemma that is constructed by dropping the "d" from the word (as in the pair "please"/"pleased"). For entry 162, morphological analyzer 104 used a rule that states that a word ending in "ed" has a lemma that is constructed by dropping the "ed" from the word (as in the pair "walk"/"walked"). Since morphological analyzer 104 cannot tell which rule gives the right lemma in this case, it provides both lemmas. Entries -
Entries 168 and 170 of dictionary portion 150 show that morphological analyzer 104 can assign a single word to two different parts of speech. In English morphological rules, a word ending in "s" can either be the plural of a noun or can be the third person singular of a verb. To cover both situations, morphological analyzer 104 produces two entries for any word ending in "s". In the particular case of entries 168 and 170, morphological analyzer 104 has produced two entries for the word "arrests". Both entries have the same lemma "arrest", but entry 168 identifies the word "arrests" as being a verb and entry 170 identifies the word as being a noun.
- Referring to
FIG. 3, once morphological analyzer 104 has produced its dictionary of triples, the process continues at step 112 where default analyses, explained below, are added to the dictionary. Default analyses can either be added by morphological analyzer 104 or by dictionary analyzer 106.
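The over-generation discussed for FIG. 4 can be reproduced with two toy suffix rules. This is a minimal sketch, assuming only the "est" and "s" rules mentioned in the text; the function name `analyze_suffixes` and the tuple layout are illustrative choices, not part of the patent.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (word, lemma, part of speech)


def analyze_suffixes(word: str) -> List[Triple]:
    """Propose every analysis licensed by the "est" and "s" suffix rules."""
    triples: List[Triple] = []
    if word.endswith("est") and len(word) > 3:
        # Superlative rule: quickest -> quick (adjective).
        triples.append((word, word[:-3], "adjective"))
    if word.endswith("s") and len(word) > 1:
        # A final "s" may mark a plural noun or a third-person singular
        # verb, so both analyses are proposed.
        triples.append((word, word[:-1], "noun"))
        triples.append((word, word[:-1], "verb"))
    return triples


print(analyze_suffixes("arrest"))   # includes the spurious lemma "arr"
print(analyze_suffixes("arrests"))  # one word, two parts of speech
```

The rules are applied blindly, which is why "arrest" receives the incorrect adjective analysis that the later steps are designed to remove.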
FIG. 5 depicts expanded dictionary portion 180, which is dictionary portion 150 expanded by the inclusion of the default triples formed in step 112. Each word found in corpus 102 has an associated set of default triples. For English, each set of default triples consists of four separate triples that each use their respective word as both the WORD and the LEMMA in the triple. Although their WORDs and LEMMAs are the same, each triple in a set of triples has a different part of speech. For example, the word "arrest" in entry 182 has a set of default triples 184 consisting of triples 186, 188, 190 and 192. Each of these triples in default triples 184 has a unique part of speech. Thus, in triple 186, "arrest" is identified as an adjective; in triple 188, "arrest" is identified as an adverb; in triple 190, "arrest" is identified as a noun; and in triple 192, "arrest" is identified as a verb. Similarly, sets of default triples
- The default triples of expanded
dictionary portion 180 are added to assist in identifying the correct lemma for a word. As will be discussed below, this is based on the observation that the lemma of a given word will also be present in the corpus. Default triples are an implementation of that hypothesis: at this stage, every word is treated as its own lemma. This will be useful in cases such as entry 182, where morphological analyzer 104 has analyzed the form "arrest" as an adjective with the lemma "arr". As will be shown, the fact that there will be no default triple associated with the form "arr" will be used to reject that analysis. Note, of course, that the creation of the default triples adds many invalid entries to expanded dictionary portion 180 at this stage.
- To make it easier to remove the invalid entries from the expanded dictionary, the process of
FIG. 3 performs a two-tier sort at box 114. In the first tier of the sort, the entries are sorted in alphabetical order by their lemmas. In the second tier of the sort, the entries for identical lemmas are sorted on their parts of speech.
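The default triples and the two-tier sort described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the four English parts of speech are those named in the text, while the helper names `default_triples` and `two_tier_sort` and the tiny sample corpus are hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (word, lemma, part of speech)

# The four parts of speech used for English default triples in the text.
PARTS_OF_SPEECH = ("adjective", "adverb", "noun", "verb")


def default_triples(word: str) -> List[Triple]:
    """Treat the word as its own lemma, once per part of speech."""
    return [(word, word, pos) for pos in PARTS_OF_SPEECH]


def two_tier_sort(entries: List[Triple]) -> List[Triple]:
    """Sort by lemma first, then by part of speech within each lemma."""
    return sorted(entries, key=lambda entry: (entry[1], entry[2]))


# Sample analyzer output and corpus vocabulary (illustrative only).
analyses = [
    ("arrest", "arr", "adjective"),
    ("arrests", "arrest", "noun"),
    ("arrests", "arrest", "verb"),
]
corpus_words = ["arrest", "arrests"]

expanded = list(analyses)
for word in corpus_words:
    expanded.extend(default_triples(word))

for entry in two_tier_sort(expanded):
    print(entry)
```

Sorting on the (lemma, part of speech) pair places all entries that share a lemma next to each other, which makes the group-based pruning in the following steps a matter of scanning adjacent entries.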
FIG. 6 shows a dictionary portion 200 which is formed by performing the two-tier sort of step 114 of FIG. 3 on expanded dictionary portion 180 of FIG. 5. For clarity, spaces have been left between groups of entries that share common lemmas. Group 202 is an exemplary group of entries that all share the lemma "arrest". Within group 202, the entries are sorted based on their part of speech to form sub-groups. For example, each of the entries in sub-group 210 has "arrest" as its lemma and "verb" as its part of speech. Similarly, entries in sub-groups
- Once the entries in the dictionary have been sorted in
step 114, dictionary analyzer 106 can begin to eliminate entries that are not likely to be real words in the language. The first step for eliminating such entries is step 116 where entries that have a unique lemma/part of speech combination are eliminated unless their respective lemma is different from their respective word. The effects of step 116 are exemplified in dictionary portion 220 of FIG. 7, which shows the effects of step 116 on dictionary portion 200 of FIG. 6. In dictionary portion 220 of FIG. 7, entries that have been eliminated by step 116 have a line drawn through them.
- In
dictionary portion 220, entry 222 has been eliminated by step 116 because entry 222 has the only occurrence of "arrest" as a lemma for an adjective and the lemma of entry 222, "arrest", is identical to the word of entry 222. Entry 224 of dictionary portion 220 has not been stricken at step 116 because entry 224 is not the only entry in the dictionary that uses "arrest" as a lemma for a noun. Specifically, entry 226 also uses "arrest" as a lemma for a noun.
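The elimination rule of step 116 can be expressed compactly. The sketch below is an assumption-laden illustration rather than the patented implementation: it counts lemma/part-of-speech pairs with `collections.Counter` instead of scanning the sorted list, and the function name `drop_unsupported_defaults` is hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (word, lemma, part of speech)


def drop_unsupported_defaults(entries: List[Triple]) -> List[Triple]:
    """Step 116: remove an entry when its lemma/part-of-speech pair is unique
    in the dictionary and its lemma is identical to its word."""
    pair_counts = Counter((lemma, pos) for _, lemma, pos in entries)
    return [
        (word, lemma, pos)
        for word, lemma, pos in entries
        if pair_counts[(lemma, pos)] > 1 or lemma != word
    ]


entries = [
    ("arrest", "arr", "adjective"),     # unique pair, but lemma != word: kept
    ("arrest", "arrest", "adjective"),  # unique pair and lemma == word: dropped
    ("arrest", "arrest", "noun"),       # pair shared with "arrests": kept
    ("arrests", "arrest", "noun"),
]

for entry in drop_unsupported_defaults(entries):
    print(entry)
```

Note that the spurious "arr" analysis survives this step, exactly as the text describes for entry 228; it is only removed by the corpus check that follows.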
Entry 228 of dictionary portion 220 has not been eliminated by step 116 even though it is the only entry in the dictionary that uses "arr" as a lemma for an adjective. The reason entry 228 has not been eliminated is that the lemma for entry 228, "arr", is different from the word for entry 228, "arrest".
- Step 116 removes entries based on the assumption that all valid entries for the dictionary will have lemmas that are inflected to produce other words in the dictionary. For example, the lemma of
entry 224 is "arrest" which is inflected to form the word "arrests" in entry 226.
- After
step 116 of FIG. 3, dictionary analyzer 106 advances to step 118 where it eliminates entries that have a lemma that does not appear in corpus 102. Step 118 is best shown using dictionary portion 230 of FIG. 8. In dictionary portion 230 of FIG. 8, the lined entries that appeared in dictionary portion 220 of FIG. 7 have been removed. In addition, entries that are eliminated by step 118 of FIG. 3 have lines drawn through them in dictionary portion 230.
- In
dictionary portion 230, three entries, 232, 234 and 236, are eliminated by step 118. For entry 232, its associated lemma, "arr", does not appear in corpus 102. This is confirmed by the fact that "arr" does not appear as a word in any other entry in the dictionary. Since each word in corpus 102 appears as a word in the dictionary, if a lemma is not found as a word in the dictionary, it does not appear in corpus 102.
- Similarly, the lemma "arreste" in
entries - After
step 118 of FIG. 3, dictionary analyzer 106 proceeds to step 120 where it identifies entries with identical word/lemma combinations, and for each set of entries that share a word/lemma combination, dictionary analyzer 106 applies language-specific heuristics to determine whether all are valid words in the language.
- An example of a language-specific heuristic for English is the following: if a word has been analyzed as a noun as well as a verb, look for patterns such as "the + lemma", "a + lemma", "many + word", etc. in the corpus. For example, if the pattern "the arrest" is indeed found in the text, the analysis of the word "arrest" as a noun is recognized as valid.
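One way to picture such a heuristic is a simple pattern search over the corpus text. The sketch below covers only the "the + lemma" and "a + lemma" patterns and uses a regular expression as a stand-in; the function name `has_noun_evidence` and the matching details (case-insensitive, whole-word) are assumptions, not details given in the patent.

```python
import re


def has_noun_evidence(corpus_text: str, lemma: str) -> bool:
    """Return True if "the <lemma>" or "a <lemma>" occurs in the corpus text."""
    pattern = r"\b(?:the|a)\s+" + re.escape(lemma) + r"\b"
    return re.search(pattern, corpus_text, flags=re.IGNORECASE) is not None


print(has_noun_evidence("Officers made the arrest at noon.", "arrest"))
print(has_noun_evidence("They arrest people every day.", "arrest"))
```

A fuller version would add the "many + word" pattern for plural forms and comparable tests for the other parts of speech.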
FIG. 9 shows the state of the dictionary after dictionary analyzer 106 has applied such heuristics, assuming that the phrase "the arrest" was found in the corpus. In FIG. 9, the lemma "arrest" is associated with both a verb and a noun.
- After
step 120, dictionary analyzer 106 proceeds to step 122 where it identifies words in corpus 102 that are not present in the dictionary. The dictionary analyzer then produces analyses of these words using morphological analyzer 104. Step 122 is needed because words found in the corpus can be deleted from the dictionary in steps
- To understand the need for
step 122, dictionary portion 260 of FIG. 10 is provided. Dictionary portion 260 is the same as dictionary portion 230 of FIG. 8 except that, for the purposes of this explanation, in dictionary portion 260 it is assumed that the word "arrest" is not present in the corpus 102 even though the words "arrests", "arrested" and "arresting" are present in corpus 102. With "arrest" not present in the corpus, step 118 of FIG. 3 eliminates all entries that have "arrest" as a lemma. As such, entries entries
- An example of the analyses produced in
step 122 based on the assumption that "arrest" does not appear in the corpus is shown in supplemental dictionary portion 280 of FIG. 11. Specifically, supplemental dictionary portion 280 shows the triples for the words "arrests", "arrested" and "arresting" that appear in the corpus 102 but not in the dictionary.
- Once the analyses have been produced in
step 122, dictionary analyzer 106 proceeds to step 124, where it selects one entry from each group of entries that share the same word/part of speech combination. The selection is performed by preferring those entries with lemmas that appear the most in the dictionary.
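The interplay of steps 118, 122 and 124 can be walked through on the "arrest" scenario, in which the bare form "arrest" never occurs in the corpus. The following is a minimal sketch under stated assumptions: all function names are hypothetical, the re-analysis in step 122 is simulated by reusing the original analyses, lemma frequency is counted within the second-pass entries, and ties keep the first entry seen because the text does not specify tie-breaking.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (word, lemma, part of speech)


def remove_missing_lemmas(entries: List[Triple], corpus_words: Set[str]) -> List[Triple]:
    """Step 118: drop entries whose lemma never occurs as a corpus word."""
    return [entry for entry in entries if entry[1] in corpus_words]


def missing_words(entries: List[Triple], corpus_words: Set[str]) -> List[str]:
    """Step 122 (first half): corpus words with no remaining dictionary entry."""
    present = {word for word, _, _ in entries}
    return sorted(corpus_words - present)


def select_by_lemma_frequency(second_pass: List[Triple]) -> List[Triple]:
    """Step 124: keep one entry per (word, part of speech), preferring the
    lemma that occurs most often among the second-pass entries."""
    lemma_counts = Counter(lemma for _, lemma, _ in second_pass)
    best: Dict[Tuple[str, str], Triple] = {}
    for entry in second_pass:
        word, lemma, pos = entry
        key = (word, pos)
        # Ties keep the first entry seen (an assumption of this sketch).
        if key not in best or lemma_counts[lemma] > lemma_counts[best[key][1]]:
            best[key] = entry
    return list(best.values())


# Assumed scenario: "arrest" itself never occurs in the corpus.
corpus_words = {"arrests", "arrested", "arresting"}
analyses = [
    ("arrests", "arrest", "noun"),
    ("arrests", "arrest", "verb"),
    ("arrested", "arreste", "verb"),
    ("arrested", "arrest", "verb"),
    ("arresting", "arreste", "verb"),
    ("arresting", "arrest", "verb"),
]

remaining = remove_missing_lemmas(analyses, corpus_words)
lost = missing_words(remaining, corpus_words)
# Stand-in for re-running the morphological analyzer on the lost words.
second_pass = [entry for entry in analyses if entry[0] in lost]
selected = select_by_lemma_frequency(second_pass)

for entry in selected:
    print(entry)
```

In this scenario step 118 empties the dictionary, step 122 recovers all three words, and the frequency count favors "arrest" over "arreste" because "arrest" is proposed by more of the second-pass entries.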
Supplemental dictionary portion 290 of FIG. 12 shows the effects of step 124 on supplemental dictionary portion 280. In supplemental dictionary portion 290, entries eliminated by step 124 are shown with lines through them.
- In
step 124, dictionary analyzer 106 looks for entries that have the same word/part-of-speech combination. For example, entries 292 and 294 share a word/part-of-speech combination, but entry 292 predicts that the lemma for "arrested" is "arrest" and entry 294 predicts that the lemma is "arreste".
- To choose between entries with the same word/part of speech combination,
dictionary analyzer 106 counts the number of times each lemma appears in supplemental dictionary portion 280. It then selects the entry that has the most frequently appearing lemma.
- Continuing the example above, in
supplemental dictionary portion 280, the lemma "arrest" of entry 292 appears more often than the lemma "arreste" of entry 294. Therefore, dictionary analyzer 106 prefers entry 292 and eliminates entry 294. Similarly, dictionary analyzer 106 prefers entry 296 over entry 298, which both identify the word "arresting" as a verb.
- After
step 124, dictionary analyzer 106 proceeds to step 126 where it applies the same set of language heuristics discussed in step 120 to determine whether all the entries are valid words in the language. FIG. 13 shows the effects of step 126 with supplemental dictionary portion 300, which is produced from supplemental dictionary portion 290. In supplemental dictionary portion 300, those entries with lines through them in supplemental dictionary portion 290 have been removed.
- In
supplemental dictionary portion 300, entries 302 and 304 differ in part of speech: entry 302 treats "arrest" as a noun and entry 304 treats "arrest" as a verb. Since "arrest" forms both valid nouns and verbs in English, both entries remain in the dictionary after step 126.
- Once
dictionary analyzer 106 has finished step 126, it adds the supplemental dictionary to the dictionary formed at the end of step 120 to form a complete dictionary. This complete dictionary may be reduced by eliminating the "WORD" from each entry to produce entries that only have a lemma and a part of speech. Entries with the same lemma/part of speech pair are then reduced to a single entry.
- Although the invention described above has been described with reference to English, those skilled in the art will recognize that the invention can be used with many other languages. Although the morphological analyzer and the language heuristics will change for each language, the basic invention remains the same.
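The final reduction described above, in which the WORD is dropped and duplicate lemma/part-of-speech pairs are merged, amounts to a set projection. A minimal sketch follows; the function name `reduce_to_lemma_pos` is hypothetical and the alphabetical ordering of the result is an illustrative choice, not something the text requires.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (word, lemma, part of speech)


def reduce_to_lemma_pos(entries: List[Triple]) -> List[Tuple[str, str]]:
    """Drop the WORD field and keep one entry per lemma/part-of-speech pair."""
    return sorted({(lemma, pos) for _, lemma, pos in entries})


complete = [
    ("arrest", "arrest", "noun"),
    ("arrests", "arrest", "noun"),
    ("arrests", "arrest", "verb"),
    ("arrested", "arrest", "verb"),
]

print(reduce_to_lemma_pos(complete))
```

Four word-level entries collapse to two dictionary entries here, one for the noun and one for the verb.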
- Although the present invention has been described with reference to preferred embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the scope of the invention.
Claims (3)
- A computer-implemented method for creating a dictionary of words for a language, each entry in the dictionary indicating a part of speech for the word and a lemma for the word, the method comprising:
  analyzing a corpus (102) of a large number of words with a morphological analyzer (104) that utilizes morphological rules to assign a part of speech (156) and a lemma (154) to the words (152) of the corpus to generate a dictionary entry storing the dictionary entry in the dictionary, wherein the corpus has been provided to the morphological analyzer;
  adding default entries (184, 194, 196, 198) to the dictionary;
  removing entries that are not likely to represent real words from the dictionary based on lemmas in the entries,
  characterized in that
  adding default entries to the dictionary comprises generating multiple default entries for each word in the corpus by using the word itself as a lemma with multiple parts of speech, one part of speech per default entry;
  removing entries comprises the steps of:
  deleting those entries (222) having lemmas that only appear once in the dictionary as lemmas and that match their respective word in their respective entry;
  deleting those entries (232, 234, 236) having lemmas that do not appear in the corpus; and
  said computer-implemented method further comprises:
  comparing the corpus to the dictionary to identify words that appear in the corpus but not in the dictionary and using the morphological analyzer to generate second pass entries (280) for words that appear in the corpus but not in the dictionary.
- The computer-implemented method of claim 1 further comprising eliminating all but one entry (292, 296) from multiple said second pass entries (292, 296, 294, 298) that have the same word and part of speech by choosing the entry (292, 296) having a lemma that appears as a lemma in the most entries in the dictionary.
- A computer readable medium having computer executable components comprising:
  a morphological analyzer (104) capable of using a corpus (102) of a large number of words to form a dictionary having dictionary entries containing words (152) associated with a lemma (154) and a part of speech (156); and
  a dictionary analyzer (106) capable of
  automatically improving the dictionary by adding default dictionary entries (184, 194, 196, 198) by generating multiple default entries for each word in the corpus by using the word itself as a lemma with multiple parts of speech, one part of speech per default entry;
  removing dictionary entries that are not likely to represent real words from the dictionary based on lemmas in the dictionary entries;
  deleting those entries (222) having lemmas that only appear once in the dictionary as lemmas and that match their respective word in their respective entry;
  deleting those entries (232, 234, 236) having lemmas that do not appear in the corpus; and
  comparing the corpus to the dictionary to identify words that appear in the corpus but not in the dictionary and using the morphological analyzer to generate second pass entries (262-272) for words that appear in the corpus but not in the dictionary.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/076,163 US6192333B1 (en) | 1998-05-12 | 1998-05-12 | System for creating a dictionary |
US76163 | 1998-05-12 | ||
PCT/US1999/010402 WO1999059082A1 (en) | 1998-05-12 | 1999-05-12 | System for creating a dictionary |
Publications (2)
Publication Number | Publication Date |
---|---|
EP1078322A1 (en) | 2001-02-28 |
EP1078322B1 (en) | 2009-11-25 |
Family
ID=22130333
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP99922966A Expired - Lifetime EP1078322B1 (en) | 1998-05-12 | 1999-05-12 | System for creating a dictionary |
Country Status (6)
Country | Link |
---|---|
US (1) | US6192333B1 (en) |
EP (1) | EP1078322B1 (en) |
AT (1) | ATE450007T1 (en) |
CA (1) | CA2331815C (en) |
DE (1) | DE69941694D1 (en) |
WO (1) | WO1999059082A1 (en) |
Date | Office | Application | Publication | Status |
---|---|---|---|---|
1998-05-12 | US | US09/076,163 | US6192333B1 | Expired - Lifetime (not active) |
1999-05-12 | AT | AT99922966T | ATE450007T1 | IP Right Cessation (not active) |
1999-05-12 | WO | PCT/US1999/010402 | WO1999059082A1 | Application Filing (active) |
1999-05-12 | CA | CA2331815A | CA2331815C | Expired - Fee Related (not active) |
1999-05-12 | DE | DE69941694T | DE69941694D1 | Expired - Lifetime (not active) |
1999-05-12 | EP | EP99922966A | EP1078322B1 | Expired - Lifetime (not active) |
Also Published As
Publication number | Publication date |
---|---|
US6192333B1 (en) | 2001-02-20 |
EP1078322A1 (en) | 2001-02-28 |
WO1999059082A1 (en) | 1999-11-18 |
DE69941694D1 (en) | 2010-01-07 |
CA2331815A1 (en) | 1999-11-18 |
ATE450007T1 (en) | 2009-12-15 |
CA2331815C (en) | 2010-11-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
17P | Request for examination filed |
Effective date: 20001205 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE |
|
17Q | First examination report despatched |
Effective date: 20040910 |
|
GRAP | Despatch of communication of intention to grant a patent |
Free format text: ORIGINAL CODE: EPIDOSNIGR1 |
|
GRAS | Grant fee paid |
Free format text: ORIGINAL CODE: EPIDOSNIGR3 |
|
GRAA | (expected) grant |
Free format text: ORIGINAL CODE: 0009210 |
|
AK | Designated contracting states |
Kind code of ref document: B1 Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE |
|
REG | Reference to a national code |
Ref country code: GB Ref legal event code: FG4D |
|
REG | Reference to a national code |
Ref country code: CH Ref legal event code: EP |
|
REG | Reference to a national code |
Ref country code: IE Ref legal event code: FG4D |
|
REF | Corresponds to: |
Ref document number: 69941694 Country of ref document: DE Date of ref document: 20100107 Kind code of ref document: P |
|
REG | Reference to a national code |
Ref country code: NL Ref legal event code: VDEP Effective date: 20091125 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: SE Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20091125 Ref country code: PT Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20100325 Ref country code: FI Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20091125 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: CY Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20091125 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: BE Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20091125 Ref country code: AT Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20091125 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: NL Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20091125 Ref country code: ES Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20100308 Ref country code: DK Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20091125 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: MC Payment date: 20100428 Year of fee payment: 12 Ref country code: LU Payment date: 20100518 Year of fee payment: 12 Ref country code: IE Payment date: 20100514 Year of fee payment: 12 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: NL Payment date: 20100501 Year of fee payment: 12 |
|
PLBE | No opposition filed within time limit |
Free format text: ORIGINAL CODE: 0009261 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: GR Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20100226 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: CH Payment date: 20100514 Year of fee payment: 12 |
|
26N | No opposition filed |
Effective date: 20100826 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: MC Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20110531 |
|
REG | Reference to a national code |
Ref country code: CH Ref legal event code: PL |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: CH Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20110531 Ref country code: LI Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20110531 |
|
REG | Reference to a national code |
Ref country code: IE Ref legal event code: MM4A |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: IE Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20110512 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: LU Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20110512 |
|
REG | Reference to a national code |
Ref country code: DE Ref legal event code: R082 Ref document number: 69941694 Country of ref document: DE Representative's name: GRUENECKER, KINKELDEY, STOCKMAIR & SCHWANHAEUS, DE |
|
REG | Reference to a national code |
Ref country code: GB Ref legal event code: 732E Free format text: REGISTERED BETWEEN 20150115 AND 20150121 |
|
REG | Reference to a national code |
Ref country code: DE Ref legal event code: R082 Ref document number: 69941694 Country of ref document: DE Representative's name: GRUENECKER PATENT- UND RECHTSANWAELTE PARTG MB, DE Effective date: 20150126 Ref country code: DE Ref legal event code: R081 Ref document number: 69941694 Country of ref document: DE Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, REDMOND, US Free format text: FORMER OWNER: MICROSOFT CORP., REDMOND, WASH., US Effective date: 20150126 |
|
REG | Reference to a national code |
Ref country code: FR Ref legal event code: TP Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, US Effective date: 20150724 |
|
REG | Reference to a national code |
Ref country code: FR Ref legal event code: PLFP Year of fee payment: 18 |
|
REG | Reference to a national code |
Ref country code: FR Ref legal event code: PLFP Year of fee payment: 19 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: GB Payment date: 20170510 Year of fee payment: 19 Ref country code: FR Payment date: 20170413 Year of fee payment: 19 Ref country code: DE Payment date: 20170509 Year of fee payment: 19 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: IT Payment date: 20170522 Year of fee payment: 19 |
|
REG | Reference to a national code |
Ref country code: DE Ref legal event code: R119 Ref document number: 69941694 Country of ref document: DE |
|
GBPC | Gb: european patent ceased through non-payment of renewal fee |
Effective date: 20180512 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: FR Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20180531 Ref country code: IT Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20180512 Ref country code: DE Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20181201 Ref country code: GB Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20180512 |