US5337232A - Morpheme analysis device - Google Patents

Morpheme analysis device

Info

Publication number
US5337232A
Authority
US
United States
Prior art keywords
morphemes
morpheme
ary
dictionary
memorized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
US07/853,601
Other languages
English (en)
Inventor
Shinsuke Sakai
Takao Miyabe
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to JP1051112A (JPH02297195A)
Priority to GB9004566A (GB2229558A)
Application filed by NEC Corp
Priority to US07/853,601 (US5337232A)
Assigned to NEC CORPORATION: assignment of assignors' interest. Assignors: MIYABE, TAKAO; SAKAI, SHINSUKE
Application granted
Publication of US5337232A
Anticipated expiration
Expired - Fee Related

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/268Morphological analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • This invention relates to a device for breaking a Japanese sentence into a succession of morphemes.
  • The device, previously called a "Device for Analyzing Japanese Sentences into Morphemes with Attention Directed to Morpheme Groups", is herein called by the short-form name "Morpheme Analysis Device."
  • a sentence consists of morphemes. Each morpheme may be either a dictionary word or an allomorph, depending on the circumstances. A sentence or portion of a sentence may properly be called a syntagm, since a syntagm is defined as a phrase, a clause, or a sentence.
  • morphemes are very useful for a language such as Japanese.
  • English language morphemes are easily detected since morphemes correspond to words, and spaces are placed around words. This is not true, in contrast, for a language such as Japanese, in which sentences are written without spacing, and thus, there is no pause between successive morphemes.
  • the morpheme analysis device is useful in a machine translation system which deals with the Japanese language as a source language.
  • the morpheme analysis device is useful also in a speech sound synthesis system for producing speech sound in compliance with a text written in the Japanese language.
  • Japanese syntagms will be written herein in an English equivalent as often as possible.
  • Japanese use Chinese characters and phonetic characters called Kanji and Kana, respectively
  • Chinese characters will not be used herein. If, however, it becomes necessary to phonetically represent a Japanese syntagm, the Japanese syntagm will be written in accordance with International Standard ISO 3602.
  • morpheme groups in which the morphemes appear in a customary order.
  • the symbol "PT" is used in lieu of one Chinese character.
  • the character PT means, a part or parts, and is used in expressing, inter alia, (1) a time instant, (2) a measure of an angle, and (3) a ratio.
  • One Japanese syntagm containing the morpheme PT, is "Zyuzi nizippun,” which means “twenty minutes past ten” in English.
  • the syntagm is composed of four morphemes: "ten", "o'clock”, “twenty”, and "minutes” in English.
  • the syntagm is usually written in Japan by a set of two Arabic numerals (or, more correctly, Malawi numerals) 10, another Chinese character represented herein by O'C, another set of Arabic numerals 20, and the character PT.
  • When used to express a time instant in this manner, the character PT customarily appears after the other character O'C in a morpheme group which consists of two morphemes written by the characters O'C and PT. These characters may be called a time instant group.
  • a second Japanese syntagm also containing the morpheme PT is "zyudo nizippun," which means “ten degrees [and] twenty minutes” in English.
  • the syntagm is composed of four morphemes: "ten”, “degrees", “twenty”, and "minutes” in English.
  • the syntagm is ordinarily written by a set of Arabic numerals 10, still another Chinese character represented herein by DG, a set of Arabic numerals 20, and the character PT.
  • When used to express a measure of an angle, the character PT is found after the other character DG in a group which consists of two morphemes written by the characters DG and PT. These morphemes may be called an angle group.
  • a third example of a syntagm containing the morpheme PT is a ratio.
  • a ratio may be expressed in the Japanese language either according to a traditional expression, or by using "percent” which is pronounced and written "pasento" in kana letters.
  • the character PT means a hundredth or hundredths and is located after yet another Chinese character which means a tenth or tenths.
  • the tenths character will herein be represented by TTH.
  • When used to express a ratio according to the traditional expression, the character PT occurs after the other character, TTH, in a group which consists of two morphemes written by the characters TTH and PT. These morphemes may be called a ratio group.
  • Post office addresses are another example of morpheme groups with customary orders. It is possible to express any address in Japan by using some of about fifteen morphemes of a group; however, addresses within postal wards are simplified and thus shorter. For example, the full address of the NEC Corporation is "Tokyo-to Minato-ku Siba Gotyome Sitiban [or, Nanaban] Zyugogo" in the Japanese language. Depending on localities in twenty-three wards in Tokyo, a proper noun is substituted together with a suffix "mati" or "tyo" for the word "Siba" used in the above address.
  • an address used within a Tokyo ward uses five or six morphemes out of a group of the seven morphemes “to”, “ku”, “mati”, “tyo”, “tyome”, “ban”, and “go” in that order.
  • This morpheme group may be called a Japan address group or simply an address group.
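The customary order of the address group can be illustrated with a short check. The following is a minimal sketch, not the patented method: the function name `in_customary_order` and the list `ADDRESS_ORDER` are hypothetical, and the suffix list for the NEC address is read off the romanized address quoted above.

```python
# Customary order of the seven address-group morphemes, as listed in the text.
ADDRESS_ORDER = ["to", "ku", "mati", "tyo", "tyome", "ban", "go"]


def in_customary_order(suffixes, order=ADDRESS_ORDER):
    """Return True if every suffix is in the group and they appear in customary order."""
    rank = {s: i for i, s in enumerate(order)}
    if any(s not in rank for s in suffixes):
        return False
    ranks = [rank[s] for s in suffixes]
    # Strictly increasing ranks mean the suffixes follow the customary order.
    return all(a < b for a, b in zip(ranks, ranks[1:]))


# Suffixes of "Tokyo-to Minato-ku Siba Gotyome Sitiban Zyugogo".
nec_address = ["to", "ku", "tyome", "ban", "go"]
print(in_customary_order(nec_address))
print(in_customary_order(["ku", "to"]))
```

Only a subset of the seven morphemes needs to be present; the check cares about relative order, not completeness.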
  • Such a group of morphemes is useful, if applicable, in breaking a syntagm into analyzed morphemes and resolving ambiguities.
  • the character PT has several meanings, depending on its context as determined by other characters or morpheme group within which it appears.
  • the morpheme PT expresses a time instant if found after the morpheme O'C in the time instant group, a measure of an angle if located after the morpheme DG in the angle group, or a ratio if it appears after the morpheme TTH in the ratio group.
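The rule stated here can be pictured as a lookup keyed by the morpheme that precedes PT within its group. This is only an illustrative sketch: the table `PT_SENSE_BY_GROUP_LEADER` and the function `sense_of_pt` are hypothetical names, and the placeholder spellings O'C, DG, TTH, and PT are those used in this description.

```python
# Hypothetical table: the earlier group morpheme decides the sense of PT.
PT_SENSE_BY_GROUP_LEADER = {
    "O'C": "time instant",  # time instant group
    "DG": "angle",          # angle group
    "TTH": "ratio",         # ratio group
}


def sense_of_pt(morphemes):
    """Return the sense of the first PT, judged by the nearest earlier group morpheme."""
    sense = None
    for m in morphemes:
        if m in PT_SENSE_BY_GROUP_LEADER:
            sense = PT_SENSE_BY_GROUP_LEADER[m]
        elif m == "PT":
            return sense
    return None


print(sense_of_pt(["10", "O'C", "20", "PT"]))
print(sense_of_pt(["10", "DG", "20", "PT"]))
```

When no group morpheme precedes PT, the sketch returns `None`, leaving the ambiguity unresolved.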
  • morphemes within a Japan address group can be resolved if it is noted that they occur within a Japan address group.
  • a device in accordance with the present invention for breaking a syntagm into analyzed morphemes, using morpheme groups includes a dictionary for storing a plurality of dictionary morphemes. It also includes a separating unit, accessing the dictionary morphemes in the dictionary, and supplied with the syntagm, wherein said syntagm is separated into at least one morpheme, and at least one candidate morpheme is selected from the dictionary morphemes for the morpheme. It further includes a memory, accessed by the separating unit, the memory storing the candidate morphemes as memorized morphemes, wherein at least one of the memorized morphemes is stored for the morphemes.
  • It also includes an output unit, accessing the memorized morphemes in the memory and connected to the separating unit, wherein the memorized morphemes are selected by determining a morpheme group to which the morpheme belongs, the analyzed morphemes being selected from the morpheme group and produced as the analyzed morphemes.
  • a method in accordance with the present invention for breaking a syntagm containing a plurality of morphemes belonging to a morpheme group into analyzed morphemes includes the steps of inputting a syntagm.
  • the syntagm is separated into a plurality of morphemes.
  • the method also includes checking whether the first morpheme has at least one candidate morpheme selected which has a first field identical to a first field of at least one candidate morpheme for the subsequent morpheme, and if so, marking the candidate morpheme as a marked morpheme.
  • the checking step is repeated for the plurality of morphemes, using the subsequent morpheme as the first morpheme.
  • the marked morpheme is selected as an analyzed morpheme.
  • the analyzed morphemes are output.
  • FIG. 1 is a block diagram of a morpheme analysis device according to one embodiment of the instant invention
  • FIG. 2 is an overall flow chart for use in describing operation of the morpheme analysis device illustrated in FIG. 1;
  • FIG. 3 is a chart of a Japanese sentence divided into morphemes
  • FIG. 4 illustrates the fields in a dictionary morpheme
  • FIG. 5 is a table showing correspondence between AXIS fields and Japanese words in a morpheme group
  • FIG. 6 is an example of dictionary morphemes in a morpheme group
  • FIG. 7 illustrates morpheme groups in the dictionary
  • FIG. 8 is a version of FIG. 7, expanded to illustrate dictionary morphemes
  • FIG. 9 illustrates M(g) candidate morphemes selected from the morpheme group having N(g) dictionary morphemes
  • FIG. 10 is a flow chart of a separating unit illustrated in FIG. 1;
  • FIG. 11 is a partial flow chart of an output unit illustrated in FIG. 1;
  • FIG. 12 is an alternative flow chart for the output unit illustrated in FIG. 1;
  • FIG. 13 is a detailed flow chart for one step of FIG. 12;
  • FIG. 14 is a detailed flow chart for another step of FIG. 12.
  • FIG. 15 is a schematic representation of a memory, which is used in the morpheme analysis device illustrated in FIG. 1.
  • a morpheme analysis device for breaking a Japanese syntagm into a succession of analyzed morphemes includes a dictionary 11, a separating unit 12, a memory 16, and an output unit 17.
  • An input connection 13 supplies the separating unit 12 with the syntagm in the form of a sequence of characters recognizable by an electronic digital computer. It will be presumed that the syntagm is given in a text written, together with punctuation marks, in Chinese characters and kana letters. Chinese characters and kana letters will herein be called characters, with no further distinction. However, the text could be written in other writing systems, such as Cyrillic or Roman letters.
  • the input connection 13 includes an optical character recognition device (not shown) which supplies the syntagm to the separating unit 12, character by character, as signals which an electronic digital computer can recognize.
  • the separating unit 12 is connected to the dictionary 11 by a first connection 20. Prior to operation of the separating unit, the dictionary is loaded with dictionary morphemes. The separating unit 12 is also connected to the memory 16 by a second connection 14. The separating unit 12 supplies the memory 16 with potential dictionary morphemes, called candidate morphemes, along the second connection 14. The memory 16 then stores the candidate morphemes in connection with the morphemes. These stored candidate morphemes are called memorized morphemes.
  • the output unit 17 is connected to the memory 16 by a third connection 15.
  • the separating unit 12 supplies an end signal along a fourth connection 19 to the output unit 17 when the separating unit 12 finishes selection of the candidate morphemes.
  • the output unit 17 then resolves ambiguities and selects the appropriate memorized morphemes from the memory 16, which the output unit 17 outputs along an output connection 18.
  • the separating unit 12 is supplied with the syntagm by the input connection 13.
  • the separating unit 12 refers to the dictionary 11 and separates the syntagm into a plurality of morphemes by using a known algorithm such as discussed above.
  • If a syntagm is in Roman letters, such as English, the algorithm is of course very simple, since morphemes are separated by spaces.
  • a spoken syntagm may use as the algorithm a speech pattern recognition system.
  • One speech pattern recognition system is disclosed in U.S. Pat. No. 4,059,725 issued to Sakoe (incorporated herein by reference), and is often implemented by a microprocessor.
  • FIG. 3 illustrates a Japanese sentence that has been divided into a plurality of morphemes by using the known algorithm.
  • the sentence which has been divided is "AMEGAFURU.” (punctuation included) which means "It rains.”
  • the sentence is divided into the five morphemes 31-35.
  • a spelling row 36 shows each morpheme.
  • An interval row 37 shows the order in which each morpheme customarily occurs.
  • a possible part of speech row 38 shows the possible part of speech for each of the morphemes. Note that the morphemes include the punctuation 35. There can be multiple dictionary definitions for any morpheme.
  • AME 31 can be defined as “candy” and “rain”, and therefore would have two dictionary definitions.
  • the dictionary 11 contains a dictionary morpheme for each definition of a morpheme, and thus would have two dictionary morphemes for "AME.”
  • the separating unit 12 selects candidate morphemes from the dictionary morphemes in dictionary 11. For each morpheme, one or more candidate morphemes may be selected.
  • candidate morphemes are stored as memorized morphemes in the following way.
  • the separating unit 12 selects a plurality of candidate morphemes from dictionary 11
  • the separating unit 12 causes the memory 16 to store the plurality of candidate morphemes as a plurality of memorized morphemes.
  • the output unit 17 is supplied with the end signal along the fourth connection 19 from the separating unit 12, when the separating unit 12 finishes selection of the candidate morphemes for the syntagm.
  • the output unit 17 resolves ambiguities among the memorized morphemes, and supplies the output connection 18 with the analyzed morphemes. Operation of the output unit 17 and separating unit 12 will be described later in detail.
  • the dictionary 11 has a plurality of fields for each of a plurality of dictionary morphemes, for example, three fields. Each field provides at least one signal representative of a field entry.
  • FIG. 4 illustrates the fields in a dictionary morpheme 61.
  • One of the fields is called a morpheme field 41, which is for storing the morpheme.
  • Another of the fields, called an AKO field 43, is used to memorize an AKO field entry representative of "a kind of" semantic class into which the morpheme in question is classified in accordance with its meanings. Examples are AKO(time instant) and AKO(angle) for the morpheme, PT.
  • Still another field will be called an AXIS field 45, which further has first and second AXIS fields.
  • The element in the first AXIS field, a descriptor 47, designates a semantic class in which words occurring together in a morpheme group are used.
  • The element in the second AXIS field, an identifier 49, defines the relative position in the morpheme group in which the morpheme may occur.
  • Each first and second AXIS field 47, 49 corresponds to a morpheme field.
  • the fields of the dictionary morpheme may be implemented as a record with four fields. For every definition of a morpheme, there is a dictionary morpheme.
  • If a morpheme is ambiguous because it occurs within a plurality of morpheme groups, the morpheme will have more than one first and second AXIS field in the dictionary 11.
  • Other structures may advantageously be used in implementing dictionary morphemes for ambiguous morphemes, for example, pointers can be used.
  • FIG. 5 illustrates the AXIS field values 47, 49 of a morpheme group for the dictionary morphemes for the semantic class "time instant." This includes spellings 51 "NEN" (meaning year), "GATSU" (month), "NICHI" (date), "JI" (o'clock), "FUN" (minute), and "BYOU" (second). An explanation 57 of the spelling 51 is included in FIG. 5 for clarity.
  • the element in the first AXIS field 47 is the descriptor TIME, which is the same for all entries for this semantic class.
  • the element in the second AXIS field 49 shows the order that is customarily used in a time instant expression such as 12GATSU25NICHI12JI15FUN (12:15, December 25).
  • Dictionary morphemes for a morpheme group are preferably contiguous. Thus, those words with the same element in the first AXIS field 47 appear together. They should be in the customary order, arranged in descending order according to the element in the second AXIS field 49.
  • the AXIS field entry 45 is herein written as AXIS(t, n), where the variables t and n represent the first field, a descriptor, and the second field, an identifier which may be an integer.
  • FIG. 6 is an illustration of the morpheme group 63 in FIG. 5, showing the morpheme field 41, the AKO field 43, the descriptor 47 and the identifier 49 of the AXIS field 45, for each dictionary morpheme 61 of the morpheme group 63.
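One way to picture the dictionary morpheme record is as a small data class. This is a sketch under stated assumptions: the class name `DictionaryMorpheme` is hypothetical, and the identifier values 60 through 10 are assumed for illustration. Only AXIS(TIME, 30) for "o'clock" and AXIS(TIME, 20) for "minute" are taken from the worked example later in the description.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DictionaryMorpheme:
    morpheme: str                            # morpheme field 41
    ako: Optional[str] = None                # AKO field 43
    axis: Optional[Tuple[str, int]] = None   # AXIS(t, n): descriptor 47, identifier 49


# The "time instant" group of FIG. 5, contiguous and in descending identifier order.
# Identifier values are assumed; only their relative order matters here.
TIME_GROUP = [
    DictionaryMorpheme("NEN", "time instant", ("TIME", 60)),
    DictionaryMorpheme("GATSU", "time instant", ("TIME", 50)),
    DictionaryMorpheme("NICHI", "time instant", ("TIME", 40)),
    DictionaryMorpheme("JI", "time instant", ("TIME", 30)),
    DictionaryMorpheme("FUN", "time instant", ("TIME", 20)),
    DictionaryMorpheme("BYOU", "time instant", ("TIME", 10)),
]

identifiers = [m.axis[1] for m in TIME_GROUP]
print(identifiers == sorted(identifiers, reverse=True))
```

A morpheme outside any group would simply leave `axis` as `None`, which corresponds to an empty AXIS field.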
  • Each semantic class corresponds to a morpheme group.
  • the number G will herein be used to represent the number of all semantic classes that appear as the first element of AXIS fields. Thus, G represents the number of all morpheme groups.
  • the number g refers herein to an index to any morpheme group in the dictionary 11.
  • the number N(g) as used herein refers to the number of different identifiers of second AXIS fields 49 for the morpheme group indexed by g. Thus, in the example in FIG. 6, if g is TIME, N(g) will be 6.
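The quantities G and N(g) can be computed directly from the AXIS entries. The sketch below assumes a flat list of (descriptor, identifier) pairs; the function name `group_sizes` and all identifier values except AXIS(TIME, 30), AXIS(TIME, 20), and AXIS(ANGLE, 20) are hypothetical.

```python
from collections import defaultdict


def group_sizes(axis_entries):
    """Map each descriptor g to N(g), the number of distinct identifiers in that group."""
    ids = defaultdict(set)
    for descriptor, identifier in axis_entries:
        ids[descriptor].add(identifier)
    return {descriptor: len(found) for descriptor, found in ids.items()}


# Assumed AXIS entries for two morpheme groups.
entries = [
    ("TIME", 60), ("TIME", 50), ("TIME", 40),
    ("TIME", 30), ("TIME", 20), ("TIME", 10),
    ("ANGLE", 30), ("ANGLE", 20),
]
sizes = group_sizes(entries)
G = len(sizes)  # number of morpheme groups
print(G, sizes["TIME"])
```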
  • first through G-th groups 63 of dictionary morphemes are used when breaking a syntagm into the analyzed morphemes.
  • Each morpheme group 63 will herein be called a g-th group, where g is a variable between one and G, inclusive.
  • FIG. 8 is a version of FIG. 7, expanded to show dictionary morphemes 61 in morpheme groups 63.
  • the g-th group comprises a plurality of first through N-th dictionary morphemes 61 in the customary order from the first morpheme to the N-th morpheme, where N represents a first integer which is not less than two.
  • the first through the N-th morphemes of the g-th group are herein referred to alternatively as first g-ary through N(g)-th g-ary morphemes, where N(g) represents a first g-ary integer which is not less than two.
  • the customary order of the g-th group is called a g-th order.
  • Only some of the morphemes of the g-th group may actually be used, as is the case with the address group.
  • the morphemes which are used are thus subsets of the morpheme group, and are herein referred to as first g-ary through M(g)-th g-ary candidate morphemes of the first g-ary through the N(g)-th g-ary dictionary morphemes, where M(g) represents a second g-ary integer which is not greater than N(g).
  • If a dictionary morpheme belongs to no morpheme group, the AXIS field of the dictionary morpheme is empty.
  • If the dictionary morpheme under consideration belongs to the g-th group of morphemes, the AXIS field stores the g-th descriptor and the g-th identifier.
  • first through the G-th groups are represented by first through G-th descriptors including a g-th descriptor.
  • First through G-th customary orders are indicated by first through G-th identifiers including a g-th identifier.
  • a mark is used for indicating whether a morpheme belongs to a morpheme group.
  • a plurality of first g-ary through N(g)-th g-ary marks are contiguous to the g-th identifier of the first g-ary through the N(g)-th g-ary morphemes.
  • the marks can be the identifiers, and the identifiers are integers, either in an ascending or descending order. The integers may or may not be consecutive.
  • Prior to operation of the device, the dictionary 11 is loaded with the g-th descriptor, the g-th identifier, the AKO field, and the morpheme field, for each of the first g-ary through the N(g)-th g-ary morphemes of the dictionary morphemes.
  • the separating unit 12 selects candidate morphemes 65 from dictionary morphemes 61.
  • the syntagm includes first g-ary through M(g)-th g-ary entries of the first g-ary through the N(g)-th g-ary morphemes of the g-th group 63
  • the separating unit 12 selects a g-ary morpheme as a first candidate morpheme 65 from one of the morpheme groups that may be called a first g-ary group.
  • the first g-ary candidate morpheme comprises one of the first g-ary through the N(g)-th g-ary dictionary morphemes.
  • the separating unit 12 selects at least one M(g)-th g-ary candidate morpheme.
  • the M(g)-th g-ary candidate morpheme 65 or morphemes are all or a subset of the first g-ary through the N(g)-th g-ary dictionary morphemes 61.
  • the first g-ary through the M(g)-th g-ary candidate morphemes 65 are in the g-th customary order among the morpheme group.
  • FIG. 10 illustrates the operation of the separating unit 12.
  • the input connection supplied the characters in sequence. Assume that the characters have been supplied.
  • the character sequence will be analyzed by the separating unit 12.
  • One implementation of the separating unit 12 advantageously treats the character sequence as a string with a pointer.
  • P is the pointer to a character in the input character sequence.
  • the pointer P is set to the first character of the input character sequence.
  • the separating unit 12 searches the dictionary 11 and retrieves all possible dictionary morphemes starting with the character pointed to by P as candidate morphemes. These candidate morphemes are supplied to the memory 16 and stored as memorized morphemes in the memory 16.
  • the separating unit 12 finds the candidate morphemes with the longest morpheme field and increments P by the length of the longest morpheme field such that P points one character beyond the last character of the character sequence corresponding to the dictionary morpheme just chosen.
  • the separating unit 12 ascertains whether the input syntagm is completely analyzed. If P is pointing to the end of a sentence, the separating unit terminates. Otherwise, it repeats beginning with step B2, with P now pointing to the next character in the input character sequence.
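The pointer-driven loop of the separating unit can be sketched as a longest-match scan. This is a simplified illustration, not the flow chart of FIG. 10: the function `separate`, the tuple layout (spelling, gloss), and the glosses other than "candy" and "rain" are assumptions. The sketch keeps only the longest-match candidates at each position, and it treats a character with no dictionary entry as a one-character unknown morpheme, a case the description does not address.

```python
def separate(text, dictionary):
    """Split text by longest dictionary match; return one candidate list per morpheme."""
    memory = []  # memorized morphemes, grouped per morpheme position
    p = 0        # pointer P into the input character sequence
    while p < len(text):
        # All dictionary morphemes whose spelling starts at the character pointed to by P.
        candidates = [entry for entry in dictionary if text.startswith(entry[0], p)]
        if candidates:
            longest = max(len(entry[0]) for entry in candidates)
            chosen = [entry for entry in candidates if len(entry[0]) == longest]
        else:
            longest = 1  # assumption: skip one unknown character
            chosen = [(text[p], "unknown")]
        memory.append(chosen)
        p += longest  # P now points one character beyond the chosen morpheme
    return memory


# Hypothetical dictionary for the sentence of FIG. 3.
dictionary = [
    ("A", "interjection"), ("AME", "candy"), ("AME", "rain"),
    ("GA", "particle"), ("FU", "verb stem"), ("RU", "verb ending"),
    (".", "punctuation"),
]
memory = separate("AMEGAFURU.", dictionary)
print([cands[0][0] for cands in memory])
print(len(memory[0]))
```

Both definitions of "AME" survive as candidates for the first morpheme, which is the ambiguity the output unit is later expected to resolve.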
  • the memory 16 then has stored the first g-ary through the M(g)-th g-ary memorized morphemes in connection with the first g-ary through the N(g)-th g-ary dictionary morphemes.
  • the first g-ary through the M(g)-th g-ary memorized morphemes include the fields of the first g-ary through the M(g)-th g-ary dictionary morphemes of the first g-ary through the N(g)-th g-ary dictionary morphemes.
  • the memory 16 need only store a descriptor and an identifier accompanying each of the first through the M-th morphemes of the dictionary morphemes. Frequently, only one group of morphemes is sufficient.
  • the dictionary 11 has the first g-ary through the N(g)-th g-ary dictionary morphemes.
  • the memory 16 has the first g-ary through M(g)-th g-ary memorized morphemes.
  • the memorized morpheme which was in a morpheme group has the mark.
  • the mark can be the identifier.
  • a memorized morpheme with a mark is a marked morpheme.
  • the g-ary marked morphemes are accompanied by the g-th descriptor and the g-th identifier.
  • the output unit 17 selects memorized morphemes from the memory 16, removes any ambiguities and outputs analyzed morphemes. To select the analyzed morphemes from the memorized morphemes, the output unit 17 checks whether or not each of the memorized morphemes is marked, i.e., accompanied by the g-th descriptor and the g-th identifier. Beginning with a first memorized morpheme, if the memorized morpheme is found to be accompanied by the g-th descriptor and the g-th identifier, that first one would very likely be a first analyzed morpheme.
  • a first analyzed morpheme does not necessarily mean that this one stands first in the analyzed morphemes.
  • a next memorized morpheme would be a second analyzed morpheme.
  • the g-ary marked morphemes are selected from the memorized morphemes in memory 16 and are used as the analyzed morphemes when the g-th identifier indicates that the g-ary marked morphemes are in the g-th customary order.
  • When the morpheme analysis device is implemented by an electronic digital computer, which may be a microprocessor, it is readily possible to carry out such check and selection by the output unit 17.
  • the output unit 17 selects the g-ary marked morphemes for use as the analyzed morphemes.
  • the output unit 17 can be implemented as a part of the microprocessor, and carries out first g-ary through M(g)-th g-ary steps including an m(g)-th g-ary step, where m(g) is variable between two and M(g), both inclusive.
  • At a first step C1, the output unit 17 checks whether or not the first memorized morpheme is accompanied by the g-th descriptor and an (m(g)-1)-th g-ary identifier of the first g-ary through the N(g)-th g-ary marks.
  • the first memorized morphemes need not necessarily be first among the memorized morphemes in the memory 16.
  • the output unit 17 checks at the m(g)-th g-ary step whether or not a second memorized morpheme is accompanied by the g-th descriptor and an m(g)-th g-ary identifier of the first g-ary through the N(g)-th g-ary marks.
  • At a third step C3, the output unit 17 uses the first and the second memorized morphemes as two of the g-ary selected morphemes. Steps C1 and C2 are readily carried out by the electronic digital computer in parallel rather than in series.
  • At a fourth step C4, the output unit 17 checks whether or not the two selected morphemes are in the g-th customary order. If they are in the g-th customary order or if the proper descriptor has not been found in steps C1 or C2, at step C5 the output unit 17 produces the memorized morphemes of the two selected morphemes as two analyzed morphemes.
  • the output unit uses the first memorized morpheme as one of the analyzed morphemes and repeats the first step C1 until one of the memorized morphemes having a g-th descriptor is found, and the (m(g)-1)-th g-ary memorized morpheme of the first g-ary through N(g)-th g-ary is used recursively as the first memorized morpheme.
  • the second step C2 is likewise carried out. If the result is negative at the fourth step C4, the output unit 17 repeats the first step C1 until all memorized morphemes are checked.
  • the output unit may be implemented in a microprocessor as illustrated in FIGS. 12-14.
  • FIG. 12 is an upper level flow chart.
  • At step D1, corresponding to FIG. 13, possible memorized morphemes within the morpheme group are flagged.
  • At step D2, corresponding to FIG. 14, one memorized morpheme for the morpheme is chosen and output.
  • the input character sequence of N morphemes can be implemented as a string having N substrings, s(1), ..., s(N).
  • J(n) should be large enough so that d(n,j) does not overflow.
  • An array of flags f(n,j) is used, with flags which are set to 1 when there is a co-occurrence of a memorized morpheme with the same descriptor. All flags are initially reset to 0.
  • the output unit 17 searches the memory 16 and flags possible memorized morphemes in the following manner.
  • At step E1, the output unit 17 initializes a counter n.
  • At step E2, the output unit 17 initializes a counter j.
  • At step E3, it is determined whether the memorized morpheme d(n,j) has a non-empty AXIS field. If the AXIS field is empty, the entry is skipped. Otherwise, at step E4, the memorized morpheme is set to the AXIS field AXIS(T, O), where T is the first element and O is the second element.
  • At step E6, counter j is incremented.
  • At step E7, if j is less than J(n), the search continues at step E3. If j equals J(n), counter n is incremented at step E8.
  • At step E9, if n is less than N, the search continues at step E2. In this way, the entire memory 16 is searched.
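The flagging pass can be sketched as two nested loops over the memorized morphemes. This is an interpretation rather than a transcription of FIG. 13: it assumes that `d[n][j]` holds only the AXIS entry of the j-th memorized morpheme of the n-th substring (or `None` when the AXIS field is empty), that indices start at 0 rather than 1, and that "co-occurrence" means another substring has a memorized morpheme with the same descriptor. The function name `flag_cooccurrences` is hypothetical.

```python
def flag_cooccurrences(d):
    """Return flags f[n][j], set to 1 when another substring shares the descriptor of d[n][j]."""
    f = [[0] * len(row) for row in d]  # all flags initially reset to 0
    for n, row in enumerate(d):
        for j, axis in enumerate(row):
            if axis is None:
                continue  # empty AXIS field: the entry is skipped
            descriptor = axis[0]
            for k, other in enumerate(d):
                if k != n and any(a is not None and a[0] == descriptor for a in other):
                    f[n][j] = 1
                    break
    return f


# AXIS entries for the six morphemes of the worked example ("10", O'C, "20", PT, desu, full stop).
d = [
    [None],
    [("TIME", 30)],
    [None],
    [("TIME", 20), ("ANGLE", 20)],
    [None],
    [None],
]
print(flag_cooccurrences(d))
```

In this example the TIME reading of PT is flagged because O'C shares the descriptor TIME, while the ANGLE reading is left unflagged.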
  • one of the flagged memorized morphemes is chosen as an analyzed morpheme in the following way.
  • the counter n is initialized to 1.
  • the counter n is incremented.
  • the search continues at step F2. In this way, all of the memorized morphemes are scanned. The analyzed morphemes are then output.
  • FIG. 15 schematically represents the contents of the memory 16 when the morpheme analysis device is used.
  • a Japanese word "desu" means an English word set "It is" and is placed at the end of a sentence, followed by a full stop. If the Japanese word "desu" is omitted, the syntagm consists of the above-described morphemes or characters "10", "O'C", "20", and "PT" and is so depicted. Including the Japanese word "desu", the syntagm means in English "It is twenty minutes past ten."
  • At step A1, the syntagm is separated into morphemes 21-26.
  • a plurality of memorized "morphemes” are stored at the first step A3 of FIG. 2 in connection with first through sixth morphemes 21, 22, 23, 24, 25, and 26. It should be noted that such morphemes 21 through 26 are depicted without regard to their actual lengths.
  • At step A2, for the first morpheme 21, only one dictionary morpheme, "10", is selected as a first candidate morpheme by the separating unit 12. Thus, only one candidate morpheme is stored by the separating unit 12 in the memory 16 as a memorized morpheme.
  • a dictionary morpheme "O'C” is selected as a candidate morpheme and is stored as a second memorized morpheme.
  • the second memorized morpheme here is accompanied by two field entries, AXIS(TIME, 30) and AKO(time instant), depicted in a first rectangle 71.
  • first and second dictionary morphemes are selected as candidate morphemes and stored as a fourth primary and a fourth secondary memorized morphemes.
  • the fourth primary memorized morpheme, "PT", is accompanied by two field entries 72, AXIS(TIME, 20) and AKO(time instant).
  • the fourth secondary memorized morpheme, also "PT", is accompanied by two different field entries 73, AXIS(ANGLE, 20) and AKO(angle).
  • the syntagm includes a time instant group having first and second morphemes "O'C” 22 and "PT" 24.
  • Candidate morphemes are selected from the dictionary morphemes, are stored as memorized morphemes, and are accompanied by a descriptor TIME and two entries in the second AXIS field, "30" and "20", representative of the customary order, arranged in descending order.
  • the second, the fourth primary, and the fourth secondary memorized morphemes are the marked morphemes because they belong to morpheme groups.
  • the output unit 17 checks at the first step C1 of FIG. 11 whether or not each memorized morpheme is accompanied by the descriptor t and the identifier n. Checking the memorized morphemes successively in this manner, the output unit 17 finds at the first step C1 the descriptor TIME accompanying the second memorized morpheme 22, which will be represented in general by w, and also finds the mark "30" for the second morpheme 22. (Recall that the mark may be the identifier.) At the second step C2, the output unit 17 locates the descriptor TIME accompanying the fourth primary memorized morpheme, which will be represented by w', and finds the mark "20" for the fourth morpheme 24. At the second step C2 for the fourth secondary memorized morpheme, the descriptor TIME is not found.
  • the output unit 17 finds that the second and the fourth primary memorized morphemes, w and w', are accompanied by the descriptor TIME in common, and by identifiers which correctly indicate the customary order in the group. The output unit 17 then determines that the second and the fourth primary memorized morphemes, w and w', should be used as the analyzed morphemes, and that the fourth secondary memorized morpheme should be discarded from the result of analysis of the syntagm. The output unit 17 thereby selects at the fifth step C5 the first, the second, the third, the fourth primary, the fifth, and the sixth memorized morphemes as the analyzed morphemes. Thus, the ambiguity in the morpheme "PT" between the time instant and the measure of an angle is removed.
  • the syntagm may be given by speech sound when the morpheme analysis device is used in a machine translation system put into operation by a substantially continuously spoken syntagm.
  • the input connection 13 should be connected to a speech recognition device.
  • a spoken language used with the machine translation system need not be the Japanese language. This is because some particular morphemes are used in a customary order in other languages as well, for example, a Chinese address group in written or spoken Chinese.
  • analyzed morphemes may be produced in parallel.
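
The selection described above for the "O'C" and "PT" example can be illustrated with a minimal Python sketch. It is not the flow chart of FIG. 11: the class `Memorized`, the function `select_analyzed`, and the placeholder surfaces "m1", "m3", "m5" and "m6" are hypothetical names introduced only for illustration, and the customary order is assumed to mean strictly descending marks within a shared descriptor.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Memorized:
    """One memorized morpheme candidate (illustrative structure, not from the patent)."""
    position: int                           # 1-based position in the syntagm
    surface: str                            # morpheme text
    axis: Optional[Tuple[str, int]] = None  # (descriptor, mark), e.g. ("TIME", 30)
    ako: Optional[str] = None               # AKO field, e.g. "time instant"


def select_analyzed(memorized: List[Memorized]) -> List[Memorized]:
    """Pick one candidate per position, using AXIS entries to resolve ambiguity."""
    by_position = {}
    for m in memorized:
        by_position.setdefault(m.position, []).append(m)

    analyzed = []
    w = None  # most recently accepted marked morpheme
    for position in sorted(by_position):
        options = by_position[position]
        pick = options[0]  # fallback: first candidate at this position
        if len(options) > 1 and w is not None:
            for w_prime in options:
                same_group = w_prime.axis is not None and w_prime.axis[0] == w.axis[0]
                # Assumption: customary order means strictly descending marks.
                if same_group and w_prime.axis[1] < w.axis[1]:
                    pick = w_prime
                    break
        analyzed.append(pick)
        if pick.axis is not None:
            w = pick
    return analyzed


# Positions 1, 3, 5 and 6 use placeholder surfaces; the text does not give them.
syntagm = [
    Memorized(1, "m1"),
    Memorized(2, "O'C", ("TIME", 30), "time instant"),
    Memorized(3, "m3"),
    Memorized(4, "PT", ("ANGLE", 20), "angle"),        # fourth secondary
    Memorized(4, "PT", ("TIME", 20), "time instant"),  # fourth primary
    Memorized(5, "m5"),
    Memorized(6, "m6"),
]
analyzed = select_analyzed(syntagm)
print([(m.position, m.surface, m.ako) for m in analyzed])
```

The sketch resolves an ambiguous position only against a marked morpheme that precedes it, which suffices for this example; a fuller treatment would also have to handle an ambiguous morpheme that comes first in its group.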
US07/853,601 1989-03-02 1992-03-18 Morpheme analysis device Expired - Fee Related US5337232A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP1051112A JPH02297195A (ja) 1989-03-02 1989-03-02 形態素解析方式
GB9004566A GB2229558A (en) 1989-03-02 1990-03-01 Device for analyzing Japanese sentences into morphemes with attention directed to morpheme groups
US07/853,601 US5337232A (en) 1989-03-02 1992-03-18 Morpheme analysis device

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
JP1051112A JPH02297195A (ja) 1989-03-02 1989-03-02 形態素解析方式
US48704490A 1990-03-02 1990-03-02
US07/853,601 US5337232A (en) 1989-03-02 1992-03-18 Morpheme analysis device

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US48704490A Continuation-In-Part 1989-03-02 1990-03-02

Publications (1)

Publication Number Publication Date
US5337232A true US5337232A (en) 1994-08-09

Family

ID=26391640

Family Applications (1)

Application Number Title Priority Date Filing Date
US07/853,601 Expired - Fee Related US5337232A (en) 1989-03-02 1992-03-18 Morpheme analysis device

Country Status (3)

Country Link
US (1) US5337232A (ja)
JP (1) JPH02297195A (ja)
GB (1) GB2229558A (ja)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5721938A (en) * 1995-06-07 1998-02-24 Stuckey; Barbara K. Method and device for parsing and analyzing natural language sentences and text
US6243669B1 (en) 1999-01-29 2001-06-05 Sony Corporation Method and apparatus for providing syntactic analysis and data structure for translation knowledge in example-based language translation
US6266642B1 (en) 1999-01-29 2001-07-24 Sony Corporation Method and portable apparatus for performing spoken language translation
US6278968B1 (en) 1999-01-29 2001-08-21 Sony Corporation Method and apparatus for adaptive speech recognition hypothesis construction and selection in a spoken language translation system
US6282507B1 (en) 1999-01-29 2001-08-28 Sony Corporation Method and apparatus for interactive source language expression recognition and alternative hypothesis presentation and selection
US6356865B1 (en) * 1999-01-29 2002-03-12 Sony Corporation Method and apparatus for performing spoken language translation
US6374224B1 (en) 1999-03-10 2002-04-16 Sony Corporation Method and apparatus for style control in natural language generation
US6442524B1 (en) 1999-01-29 2002-08-27 Sony Corporation Analyzing inflectional morphology in a spoken language translation system
US20030191625A1 (en) * 1999-11-05 2003-10-09 Gorin Allen Louis Method and system for creating a named entity language model
US20050141391A1 (en) * 2000-07-13 2005-06-30 Tetsuo Ueyama Optical pickup
US6963871B1 (en) 1998-03-25 2005-11-08 Language Analysis Systems, Inc. System and method for adaptive multi-cultural searching and matching of personal names
US7085720B1 (en) * 1999-11-05 2006-08-01 At & T Corp. Method for task classification using morphemes
US20070005586A1 (en) * 2004-03-30 2007-01-04 Shaefer Leonard A Jr Parsing culturally diverse names
US7286984B1 (en) 1999-11-05 2007-10-23 At&T Corp. Method and system for automatically detecting morphemes in a task classification system using lattices
US20110010165A1 (en) * 2009-07-13 2011-01-13 Samsung Electronics Co., Ltd. Apparatus and method for optimizing a concatenate recognition unit
US20120023398A1 (en) * 2010-07-23 2012-01-26 Masaaki Hoshino Image processing device, information processing method, and information processing program
US20120089400A1 (en) * 2010-10-06 2012-04-12 Caroline Gilles Henton Systems and methods for using homophone lexicons in english text-to-speech
US8392188B1 (en) 1999-11-05 2013-03-05 At&T Intellectual Property Ii, L.P. Method and system for building a phonotactic model for domain independent speech recognition
US8812300B2 (en) 1998-03-25 2014-08-19 International Business Machines Corporation Identifying related names
US8855998B2 (en) 1998-03-25 2014-10-07 International Business Machines Corporation Parsing culturally diverse names
US20140309986A1 (en) * 2013-04-11 2014-10-16 Microsoft Corporation Word breaker from cross-lingual phrase table
CN109213992A (zh) * 2017-07-06 2019-01-15 富士通株式会社 词素分析装置和词素分析方法

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE19624987A1 (de) * 1996-06-22 1998-01-02 Peter Dr Toma Automatisches Sprachumsetzungsverfahren
DE19624988A1 (de) * 1996-06-22 1998-01-02 Peter Dr Toma Verfahren zur automatischen Erkennung eines gesprochenen Textes

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4059725A (en) * 1975-03-12 1977-11-22 Nippon Electric Company, Ltd. Automatic continuous speech recognition system employing dynamic programming
US4771385A (en) * 1984-11-21 1988-09-13 Nec Corporation Word recognition processing time reduction system using word length and hash technique involving head letters
US4931936A (en) * 1987-10-26 1990-06-05 Sharp Kabushiki Kaisha Language translation system with means to distinguish between phrases and sentence and number discrminating means

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS60144868A (ja) * 1984-01-06 1985-07-31 Nec Corp 文脈解析装置
JPH01114976A (ja) * 1987-10-28 1989-05-08 Sharp Corp 文書処理装置の辞書構造

Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5721938A (en) * 1995-06-07 1998-02-24 Stuckey; Barbara K. Method and device for parsing and analyzing natural language sentences and text
US8855998B2 (en) 1998-03-25 2014-10-07 International Business Machines Corporation Parsing culturally diverse names
US20080312909A1 (en) * 1998-03-25 2008-12-18 International Business Machines Corporation System for adaptive multi-cultural searching and matching of personal names
US8041560B2 (en) 1998-03-25 2011-10-18 International Business Machines Corporation System for adaptive multi-cultural searching and matching of personal names
US20050273468A1 (en) * 1998-03-25 2005-12-08 Language Analysis Systems, Inc., A Delaware Corporation System and method for adaptive multi-cultural searching and matching of personal names
US6963871B1 (en) 1998-03-25 2005-11-08 Language Analysis Systems, Inc. System and method for adaptive multi-cultural searching and matching of personal names
US8812300B2 (en) 1998-03-25 2014-08-19 International Business Machines Corporation Identifying related names
US6278968B1 (en) 1999-01-29 2001-08-21 Sony Corporation Method and apparatus for adaptive speech recognition hypothesis construction and selection in a spoken language translation system
US20020198713A1 (en) * 1999-01-29 2002-12-26 Franz Alexander M. Method and apparatus for perfoming spoken language translation
US6442524B1 (en) 1999-01-29 2002-08-27 Sony Corporation Analyzing inflectional morphology in a spoken language translation system
US6356865B1 (en) * 1999-01-29 2002-03-12 Sony Corporation Method and apparatus for performing spoken language translation
US6282507B1 (en) 1999-01-29 2001-08-28 Sony Corporation Method and apparatus for interactive source language expression recognition and alternative hypothesis presentation and selection
US6266642B1 (en) 1999-01-29 2001-07-24 Sony Corporation Method and portable apparatus for performing spoken language translation
US6243669B1 (en) 1999-01-29 2001-06-05 Sony Corporation Method and apparatus for providing syntactic analysis and data structure for translation knowledge in example-based language translation
US6374224B1 (en) 1999-03-10 2002-04-16 Sony Corporation Method and apparatus for style control in natural language generation
US20080215328A1 (en) * 1999-11-05 2008-09-04 At&T Corp. Method and system for automatically detecting morphemes in a task classification system using lattices
US20030191625A1 (en) * 1999-11-05 2003-10-09 Gorin Allen Louis Method and system for creating a named entity language model
US20080177544A1 (en) * 1999-11-05 2008-07-24 At&T Corp. Method and system for automatic detecting morphemes in a task classification system using lattices
US7286984B1 (en) 1999-11-05 2007-10-23 At&T Corp. Method and system for automatically detecting morphemes in a task classification system using lattices
US7440897B1 (en) 1999-11-05 2008-10-21 At&T Corp. Method and system for automatically detecting morphemes in a task classification system using lattices
US9514126B2 (en) 1999-11-05 2016-12-06 At&T Intellectual Property Ii, L.P. Method and system for automatically detecting morphemes in a task classification system using lattices
US7620548B2 (en) 1999-11-05 2009-11-17 At&T Intellectual Property Ii, L.P. Method and system for automatic detecting morphemes in a task classification system using lattices
US8909529B2 (en) 1999-11-05 2014-12-09 At&T Intellectual Property Ii, L.P. Method and system for automatically detecting morphemes in a task classification system using lattices
US8010361B2 (en) 1999-11-05 2011-08-30 At&T Intellectual Property Ii, L.P. Method and system for automatically detecting morphemes in a task classification system using lattices
US7085720B1 (en) * 1999-11-05 2006-08-01 At & T Corp. Method for task classification using morphemes
US20080046243A1 (en) * 1999-11-05 2008-02-21 At&T Corp. Method and system for automatic detecting morphemes in a task classification system using lattices
US8612212B2 (en) 1999-11-05 2013-12-17 At&T Intellectual Property Ii, L.P. Method and system for automatically detecting morphemes in a task classification system using lattices
US8200491B2 (en) 1999-11-05 2012-06-12 At&T Intellectual Property Ii, L.P. Method and system for automatically detecting morphemes in a task classification system using lattices
US8392188B1 (en) 1999-11-05 2013-03-05 At&T Intellectual Property Ii, L.P. Method and system for building a phonotactic model for domain independent speech recognition
US20050141391A1 (en) * 2000-07-13 2005-06-30 Tetsuo Ueyama Optical pickup
US20070005586A1 (en) * 2004-03-30 2007-01-04 Shaefer Leonard A Jr Parsing culturally diverse names
US20110010165A1 (en) * 2009-07-13 2011-01-13 Samsung Electronics Co., Ltd. Apparatus and method for optimizing a concatenate recognition unit
US20120023398A1 (en) * 2010-07-23 2012-01-26 Masaaki Hoshino Image processing device, information processing method, and information processing program
US9569420B2 (en) * 2010-07-23 2017-02-14 Sony Corporation Image processing device, information processing method, and information processing program
US20120089400A1 (en) * 2010-10-06 2012-04-12 Caroline Gilles Henton Systems and methods for using homophone lexicons in english text-to-speech
US20140309986A1 (en) * 2013-04-11 2014-10-16 Microsoft Corporation Word breaker from cross-lingual phrase table
US9330087B2 (en) * 2013-04-11 2016-05-03 Microsoft Technology Licensing, Llc Word breaker from cross-lingual phrase table
CN109213992A (zh) * 2017-07-06 2019-01-15 富士通株式会社 词素分析装置和词素分析方法

Also Published As

Publication number Publication date
GB2229558A (en) 1990-09-26
GB9004566D0 (en) 1990-04-25
JPH02297195A (ja) 1990-12-07

Legal Events

Date Code Title Description
AS Assignment

Owner name: NEC CORPORATION, JAPAN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST.;ASSIGNORS:SAKAI, SHINSUKE;MIYABE, TAKAO;REEL/FRAME:006153/0758

Effective date: 19920519

FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

FPAY Fee payment

Year of fee payment: 4

FPAY Fee payment

Year of fee payment: 8

REMI Maintenance fee reminder mailed
LAPS Lapse for failure to pay maintenance fees
STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20060809