JP2002503849A

JP2002503849A - Word segmentation method in Kanji sentences

Info

Publication number: JP2002503849A
Application number: JP2000531795A
Authority: JP
Inventors: ウー，アンディ; リチャードソン，スティーヴン・ディー; チャン，チーシン
Original assignee: Microsoft Corp
Current assignee: Microsoft Corp
Priority date: 1998-02-13
Filing date: 1999-01-13
Publication date: 2002-02-05
Anticipated expiration: 2019-01-13
Also published as: WO1999041680A3; EP1055182A2; WO1999041680A2; JP5100770B2; JP2010157260A; JP4573432B2; CN1114165C; CN1290371A

Abstract

(57) [Abstract] The present invention provides a facility for selecting, from a sequence of natural language characters, combinations of characters that may be words. For each of a plurality of characters, the facility uses (a) an indication of the characters that appear in the second position of words beginning with that character, and (b) an indication of the positions at which the character appears within words. For each of a plurality of contiguous combinations of characters appearing in the sequence, the facility determines whether the character in the second position of the combination appears in words beginning with the character in the first position of the combination. If so, the facility determines whether each character of the combination is indicated to appear in words at the position it occupies in the combination. If so, the facility determines that the combination of characters may be a word. In some embodiments, the facility compares the combination of characters with a list of valid words to determine whether the combination is a word.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001] (Technical Field) The present invention relates generally to the field of natural language processing, and more particularly to the field of word segmentation.

[0002] (Background of the Invention) Word segmentation is the process of identifying the individual words that make up a linguistic expression such as a sentence. Word segmentation is useful for checking spelling and grammar, synthesizing speech from text, and performing natural language parsing and understanding, all of which benefit from the identification of individual words.

[0003] Segmenting English text is fairly straightforward, because spaces and punctuation marks delimit the individual words of a sentence. Consider the English sentence in Table 1 below.

[0004]

[Table 1]

[0005] By identifying each contiguous run of spaces and/or punctuation marks as the end of the word that precedes it, the English sentence in Table 1 can be segmented straightforwardly, as shown in Table 2 below.

[0006]

[Table 2]

[0007] In Chinese text, word boundaries are implicit rather than explicit. Consider the sentence in Table 3 below, which means "The committee discussed this issue yesterday afternoon in Buenos Aires."

[0008]

[Table 3]

[0009] Although the sentence contains no punctuation or spaces, a reader of Chinese would recognize the sentence in Table 3 as being made up of the words marked off by underlining in Table 4 below.

[0010]

[Table 4]

[0011] The above example shows that Chinese word segmentation cannot be performed in the same manner as English word segmentation. An accurate and efficient approach to automatically segmenting Chinese text would therefore have significant utility.

[0012] (Summary of the Invention) According to the present invention, a word segmentation software facility ("the facility") performs word segmentation of sentences in an unsegmented language such as Chinese by (1) evaluating the possible combinations of characters in an input sentence and discarding those that are unlikely to represent words in the input sentence, (2) looking up the remaining combinations of characters in a dictionary to determine whether they can constitute words, and (3) submitting the combinations of characters determined to be words to a natural language parser as alternative lexical records representing the input sentence. The parser generates a parse tree representing the syntactic structure of the input sentence. The parse tree contains only those lexical records that represent combinations of characters proven to be words in the input sentence. When submitting lexical records to the parser, the facility weights them so that the parser considers longer combinations of characters before shorter ones, because longer combinations generally reflect the correct segmentation of a sentence more often than shorter ones.

[0013] To make it easy to discard combinations of characters that are unlikely to represent words in the input sentence, the facility adds to the dictionary, for each character occurring in the dictionary, (1) indications of all the different combinations of word length and character position in which the character appears in words, and (2) indications of all the characters that can follow the character when it begins a word. The facility further adds (3) an indication, for each multiple-character word, of whether sub-words within the multiple-character word are possible and should be considered. When processing a sentence, the facility discards (1) combinations of characters in which any character is used in a word length/position combination that does not occur in the dictionary, and (2) combinations of characters in which the second character is not listed as a possible second character for the first character. The facility further discards (3) combinations of characters that occur within a word for which sub-words are not to be considered.

[0014] In this way, the facility both minimizes the number of character combinations looked up in the dictionary and uses the syntactic context of the sentence to differentiate between alternative segmentation results that are each composed of valid words.

[0015] (Detailed Description of the Invention) The present invention performs word segmentation in Chinese text. In a preferred embodiment, a word segmentation software facility ("the facility") performs word segmentation of sentences in an unsegmented language such as Chinese by (1) evaluating the possible combinations of characters in an input sentence and discarding those that are unlikely to represent words in the input sentence, (2) looking up the remaining combinations of characters in a dictionary to determine whether they can constitute words, and (3) submitting the combinations of characters determined to be words to a natural language parser as alternative lexical records representing the input sentence. The parser generates a parse tree representing the syntactic structure of the input sentence. The parse tree contains only those lexical records that represent combinations of characters proven to be words in the input sentence. When submitting lexical records to the parser, the facility weights them so that the parser considers longer combinations of characters before shorter ones, because longer combinations generally reflect the correct segmentation of a sentence more often than shorter ones.

[0016] To make it easy to discard combinations of characters that are unlikely to represent words in the input sentence, the facility adds to the dictionary, for each character occurring in the dictionary, (1) indications of all the different combinations of word length and character position in which the character appears in words, and (2) indications of all the characters that can follow the character when it is at the beginning of a word. The facility further adds (3) an indication, for each multiple-character word, of whether sub-words within the multiple-character word are possible and should be considered. When processing a sentence, the facility discards (1) combinations of characters in which any character is used in a word length/position combination that does not occur in the dictionary, and (2) combinations of characters in which the second character is not listed as a possible second character for the first character. The facility further discards (3) combinations of characters that occur within a word for which sub-words are not to be considered.
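The first two discard tests can be sketched in Python as follows. This is a minimal illustration, not the facility's actual implementation: the function name `may_be_word`, the dictionary-of-sets representation, and the two-word toy data are all assumptions introduced here. Positions are 1-based, matching the (position, length) pairs described in the text.

```python
from typing import Dict, Set, Tuple

# Hypothetical in-memory forms of the two per-character indications.
CharPosTable = Dict[str, Set[Tuple[int, int]]]  # char -> {(position, length), ...}
NextCharsTable = Dict[str, Set[str]]            # char -> {possible second chars}


def may_be_word(candidate: str, char_pos: CharPosTable,
                next_chars: NextCharsTable) -> bool:
    """Return False when the candidate can be discarded without a dictionary lookup."""
    length = len(candidate)
    if length < 2:
        return False  # this sketch only screens multiple-character candidates
    # Test (2): the second character must be a listed follower of the first.
    if candidate[1] not in next_chars.get(candidate[0], set()):
        return False
    # Test (1): every character must be recorded at this position for this length.
    return all(
        (position, length) in char_pos.get(ch, set())
        for position, ch in enumerate(candidate, start=1)
    )


# Toy tables for a dictionary holding only the words "昨天" and "天下" (assumed data).
toy_char_pos = {"昨": {(1, 2)}, "天": {(2, 2), (1, 2)}, "下": {(2, 2)}}
toy_next_chars = {"昨": {"天"}, "天": {"下"}}
```

A candidate that survives both tests still has to be looked up in the dictionary; the tests only avoid lookups that cannot succeed.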

[0017] In this way, the facility both minimizes the number of character combinations looked up in the dictionary and uses the syntactic context of the sentence to differentiate between alternative segmentation results that are each composed of valid words.

[0018] FIG. 1 is a high-level block diagram of the general-purpose computer system upon which the facility preferably executes. The computer system 100 contains a central processing unit (CPU) 110, input/output devices 120, and a computer memory (memory) 130. Among the input/output devices are a storage device 121, such as a hard disk drive; a computer-readable media drive 122, which can be used to install software products, including the facility, that are provided on a computer-readable medium such as a CD-ROM; and a network connection 123, through which the computer system 100 can communicate with other connected computer systems (not shown). The memory 130 preferably contains a word segmentation facility 131 for identifying the individual words occurring in Chinese text; a syntactic parser 133 for generating, from lexical records representing the words occurring in natural language text, a parse tree representing the syntactic structure of a sentence of that text; and a lexical knowledge base 132, which the parser uses to construct lexical records for the parse tree and the facility uses to identify words occurring in natural language text. While the facility is preferably implemented on a computer system configured as described above, those skilled in the art will recognize that it may also be implemented on computer systems having different configurations.

[0019] FIG. 2 is an overview flow diagram showing the two phases in which the facility preferably operates. In step 201, as part of an initialization phase, the facility augments the lexical knowledge base to include information that the facility uses to perform word segmentation. Step 201 is discussed in greater detail below in conjunction with FIG. 3. Briefly, in step 201, the facility adds entries to the lexical knowledge base for the characters occurring in any word in the lexical knowledge base. The entry added for each character includes a CharPos attribute that indicates the different positions at which the character appears in words. The entry for each character further includes a NextChars attribute that indicates the set of characters that occur in the second position of words beginning with the current character. Finally, the facility adds an IgnoreParts attribute to each word occurring in the lexical knowledge base. This attribute indicates whether the sequence of characters making up the word should also be considered to make up smaller words that together constitute the current word.

[0020] After step 201, the facility continues in step 202, ending the initialization phase and beginning the word segmentation phase. In the word segmentation phase, the facility uses the information added to the lexical knowledge base to perform word segmentation of sentences of Chinese text. In step 202, the facility receives a sentence of Chinese text for word segmentation. In step 203, the facility segments the received sentence into its constituent words. Step 203 is discussed in greater detail below in conjunction with FIG. 5. Briefly, the facility looks up in the lexical knowledge base a small fraction of all the possible contiguous combinations of characters in the sentence. The facility then submits to the syntactic parser the looked-up combinations of characters that the lexical knowledge base indicates are words. In determining the syntactic structure of the sentence, the parser identifies the combinations of characters that the author of the sentence intended to constitute words. After step 203, the facility continues in step 202 to receive the next sentence for word segmentation.

[0021] FIG. 3 is a flow diagram showing the steps preferably performed by the facility in the initialization phase to augment the lexical knowledge base to include information used to perform word segmentation. These steps (a) add entries to the lexical knowledge base for the characters occurring in words in the lexical knowledge base, (b) add CharPos and NextChars attributes to the character entries in the lexical knowledge base, and (c) add the IgnoreParts attribute to the entries for words in the lexical knowledge base.

[0022] The facility loops through steps 301 to 312 for each word entry in the lexical knowledge base. In step 302, the facility loops through each character position in the word. That is, for a word containing three characters, the facility performs the loop for the first, second, and third characters of the word. In step 303, if the character at the current character position has an entry in the lexical knowledge base, the facility continues in step 305; otherwise, the facility continues in step 304. In step 304, the facility adds an entry for the current character to the lexical knowledge base. After step 304, the facility continues in step 305. In step 305, the facility adds an ordered pair to the CharPos attribute stored in the character's entry in the lexical knowledge base, indicating that the character may occur in the position in which it occurs in the current word. The ordered pair added has the form (position, length), where position is the position that the character occupies in the word and length is the number of characters in the word. For example,

[0023] for the character "委" in the word [image not reproduced], the facility adds the ordered pair (1, 3) to the list of ordered pairs stored in the CharPos attribute of the lexical knowledge base entry for the character "委". Preferably, the facility does not add the ordered pair as described in step 305 if it is already contained in the CharPos attribute of that character. In step 306, if characters remain in the current word to be processed, the facility continues in step 302 to process the next character; otherwise, the facility continues in step 307. In step 307, if the word is a single-character word, the facility continues in step 309; otherwise, the facility continues in step 308. In step 308, the facility adds the character in the second position of the current word to the list of characters in the NextChars attribute of the lexical knowledge base record for the character in the first position of the current word. For example,

[0024] in the word [image not reproduced], the facility adds the character

[0025] [image not reproduced] to the list of characters stored for the NextChars attribute of the character "委". After step 308, the facility continues in step 309. In step 309, if the current word can contain other, smaller words, the facility continues in step 311; otherwise, the facility continues in step 310. Step 309 is discussed in greater detail below in conjunction with FIG. 4. Briefly, the facility uses a number of heuristics to determine whether the sequence of characters making up the current word could, in some context, instead make up two or more smaller words.

[0026] In step 310, the facility sets the IgnoreParts attribute for the word in the word's lexical knowledge base entry. Setting the IgnoreParts attribute means that, when the facility encounters this word in a sentence of input text, it performs no further steps to determine whether the word contains smaller words. After step 310, the facility continues in step 312. In step 311, because the current word can contain other words, the facility clears the IgnoreParts attribute for the word, so that when the facility encounters the word in a sentence of input text, it goes on to investigate whether the word contains smaller words. After step 311, the facility continues in step 312. In step 312, if words remain in the lexical knowledge base to be processed, the facility continues in step 301 to process the next word; otherwise, these steps conclude.
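The augmentation loop of FIG. 3 might be sketched in Python as below. The function name `augment_lexicon`, the use of plain dictionaries in place of lexical knowledge base entries, and the `can_contain_words` callback (standing in for the step 309 heuristics) are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Set, Tuple


def augment_lexicon(
    words: Iterable[str],
    can_contain_words: Callable[[str], bool],
) -> Tuple[Dict[str, Set[Tuple[int, int]]], Dict[str, Set[str]], Dict[str, bool]]:
    """Build CharPos, NextChars and IgnoreParts tables from a word list."""
    char_pos: Dict[str, Set[Tuple[int, int]]] = defaultdict(set)
    next_chars: Dict[str, Set[str]] = defaultdict(set)
    ignore_parts: Dict[str, bool] = {}
    for word in words:                                  # steps 301-312
        for position, ch in enumerate(word, start=1):  # steps 302-306
            # Step 305: record (position, length); a set drops duplicates.
            char_pos[ch].add((position, len(word)))
        if len(word) > 1:                               # step 307
            next_chars[word[0]].add(word[1])            # step 308
        # Steps 309-311: IgnoreParts is set unless smaller words are possible.
        ignore_parts[word] = not can_contain_words(word)
    return dict(char_pos), dict(next_chars), ignore_parts
```

Using sets for CharPos and NextChars gives the "do not add a pair that is already present" behavior for free.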

[0027] When the facility performs the steps shown in FIG. 3 to augment the lexical knowledge base by assigning CharPos and NextChars attributes to each character, it assigns these attributes to the characters (Character) occurring in the sample sentence shown in Table 3 as shown in Table 5 below.

[0028] Table 5: Character lexical knowledge base entries

[Table 5]

[0029] From Table 5 it can be seen, for example, from the CharPos attribute of the character "昨" that this character can appear as the first character of words that are two, three, or four characters long. It can further be seen from the NextChars attribute of the character "昨" that, in words beginning with this character, the second character can be "儿", "天", or "晩".

[0030] FIG. 4 is a flow diagram showing the steps preferably performed to determine whether a particular word can contain other, smaller words. As an analogy to English, if spaces and punctuation marks were removed from English text, the character sequence "beat" could be interpreted either as the word "beat" or as the two words "be" and "at". In step 401, if the word contains four or more characters, the facility continues in step 402 to return the result that the word cannot contain other words; otherwise, the facility continues in step 403. In step 403, if all of the characters in the word can constitute single-character words, the facility continues in step 405; otherwise, the facility continues in step 404 to return the result that the word cannot contain other words. In step 405, if the word contains a word frequently used as a derivational affix, i.e., a prefix or suffix, the facility continues in step 406 to return the result that the word cannot contain other words; otherwise, the facility continues in step 407. In step 407, if an adjacent pair of characters in the word is often divided when the pair appears adjacently in text of the language, the facility continues in step 409 to return the result that the word can contain other words; otherwise, the facility continues in step 408 to return the result that the word cannot contain other words.
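The decision sequence of FIG. 4 can be read as a chain of early returns. The sketch below assumes that the three linguistic judgments (single-character word, derivational affix, frequently divided pair) are supplied as callbacks; the text does not say how they are computed, so the callback names and signatures are hypothetical.

```python
from typing import Callable


def can_contain_words(
    word: str,
    is_single_char_word: Callable[[str], bool],
    has_derivational_affix: Callable[[str], bool],
    pair_often_split: Callable[[str, str], bool],
) -> bool:
    """Return True when the word may be made up of smaller words."""
    if len(word) >= 4:                                    # steps 401-402
        return False
    if not all(is_single_char_word(ch) for ch in word):   # steps 403-404
        return False
    if has_derivational_affix(word):                      # steps 405-406
        return False
    # Steps 407-409: some adjacent pair must be one that is often divided.
    return any(pair_often_split(a, b) for a, b in zip(word, word[1:]))
```

A word for which this function returns False would have its IgnoreParts attribute set; a word for which it returns True would have the attribute cleared.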

[0031] The results of determining whether particular words can contain other, smaller words are shown in Table 6 below.

[0032] Table 6: Word lexical knowledge base entries

[Table 6]

[0033] For example, it can be seen from Table 6 that the facility has determined that the word (Word) "昨天" cannot contain other words, while the word "天下" can contain other words.

[0034] FIG. 5 is a flow diagram of the steps preferably performed by the facility to segment a sentence into its constituent words. These steps generate a word list identifying the different words of the language that occur in the sentence, then submit the word list to the parser to identify the subset of words in the word list that the author intended to make up the sentence.

[0035] In step 501, the facility adds to the word list the multiple-character words occurring in the sentence. Step 501 is discussed in greater detail below in conjunction with FIG. 6. In step 502, the facility adds to the word list the single-character words occurring in the sentence. Step 502 is discussed in greater detail below in conjunction with FIG. 9. In step 503, the facility generates lexical records, which are used by the parser, for the words added to the word list in steps 501 and 502. In step 504, the facility assigns probabilities to the lexical records. The probability of a lexical record reflects the likelihood that the lexical record will be part of a correct parse tree for the sentence, and the parser uses it to order the application of the lexical records in the parsing process. During the parsing process, the parser applies the lexical records in decreasing order of probability. Step 504 is discussed in greater detail below in conjunction with FIG. 10. In step 505, the facility uses the syntactic parser to parse the lexical records and generate a parse tree reflecting the syntactic structure of the sentence. This parse tree has as its leaves a subset of the lexical records generated in step 503. In step 506, the facility identifies as the words of the sentence the words represented by the lexical records that are leaves of the parse tree. After step 506, these steps conclude.
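The ordering role of the probabilities can be illustrated with a short Python sketch. The actual probability assignment of step 504 is not described in this part of the text, so `assign_length_weights` below is only a placeholder that follows the earlier statement that longer character combinations should be considered before shorter ones. The `LexicalRecord` class and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LexicalRecord:
    word: str
    start: int                 # 0-based offset of the word in the sentence
    probability: float = 0.0


def assign_length_weights(records: List[LexicalRecord]) -> None:
    """Placeholder for step 504 (assumption): longer words get higher probability."""
    longest = max((len(r.word) for r in records), default=1)
    for record in records:
        record.probability = len(record.word) / longest


def application_order(records: List[LexicalRecord]) -> List[LexicalRecord]:
    """Order in which a parser would apply the records: decreasing probability."""
    return sorted(records, key=lambda r: r.probability, reverse=True)
```

Because `sorted` is stable, records with equal probability keep their original relative order.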

[0036] FIG. 6 is a flow diagram showing the steps preferably performed by the facility to add multiple-character words to the word list. These steps use a current position within the sentence to analyze the sentence and identify multiple-character words. They further use the CharPos, NextChars, and IgnoreParts attributes added to the lexical knowledge base by the facility as shown in FIG. 4. According to a first preferred embodiment, the facility retrieves these attributes from the lexical knowledge base as needed while performing the steps shown in FIG. 6. In a second preferred embodiment, the values of the NextChars and/or CharPos attributes of the characters in the sentence are all preloaded before the steps shown in FIG. 6 are performed. In conjunction with the second preferred embodiment, a three-dimensional array containing the value of the CharPos attribute for each character occurring in the sentence is preferably stored in memory. For the character at a given position in the sentence, this array indicates whether the character can be at a given position in a word of a given length. Caching the values of these attributes allows them to be accessed efficiently when performing the steps shown in FIG. 6.
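One way to picture the three-dimensional array of the second embodiment is as a nested list of booleans indexed by sentence offset, position in word, and word length. The sketch below is an assumption about layout, not the patent's data structure: the name `build_charpos_table` is hypothetical, sentence offsets are 0-based, positions and lengths are 1-based (index 0 is unused), and the bound of seven reflects the longest word candidate considered in the steps of FIG. 6.

```python
from typing import Dict, List, Set, Tuple

MAX_WORD_LEN = 7  # longest word candidate considered in FIG. 6


def build_charpos_table(
    sentence: str,
    char_pos: Dict[str, Set[Tuple[int, int]]],
) -> List[List[List[bool]]]:
    """table[i][p][n] is True when sentence[i] may be at position p of an n-character word."""
    table = [
        [[False] * (MAX_WORD_LEN + 1) for _ in range(MAX_WORD_LEN + 1)]
        for _ in sentence
    ]
    for i, ch in enumerate(sentence):
        for position, length in char_pos.get(ch, ()):
            if length <= MAX_WORD_LEN:
                table[i][position][length] = True
    return table
```

With such a table, checking a candidate of length n that starts at offset s reduces to testing `table[s + k][k + 1][n]` for each k from 0 to n - 1, without touching the lexical knowledge base.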

[0037] In step 601, the facility sets the position at the first character of the sentence. In steps 602 to 614, the facility continues to repeat steps 603 to 613 until the position has advanced to the end of the sentence.

[0038] In steps 603 to 609, the facility loops through each word candidate beginning at the current position. The facility starts with the word candidate that begins at the current position and is seven characters long and, in each iteration, removes one character from the end of the word candidate until the word candidate is two characters long. If fewer than seven characters remain in the sentence beginning at the current position, the facility preferably omits the iterations for word candidates for which insufficient characters remain in the sentence. In step 604, the facility tests conditions on the current word candidate relating to the NextChars and CharPos attributes of the characters making up the word candidate. Step 604 is discussed in greater detail below in conjunction with FIG. 7. If both the NextChars and CharPos conditions are satisfied for the word candidate, the facility continues in step 605; otherwise, the facility continues in step 609. In step 605, the facility looks up the word candidate in the lexical knowledge base to determine whether the word candidate is a word. In step 606, if the word candidate is a word, the facility continues in step 607; otherwise, the facility continues in step 609. In step 607, the facility adds the word candidate to the list of words occurring in the sentence. In step 608, if the word candidate can contain other words, i.e., if the IgnoreParts attribute for the word is clear, the facility continues in step 609; otherwise, the facility continues in step 611. In step 609, if word candidates remain to be processed, the facility continues in step 603 to process the next word candidate; otherwise, the facility continues in step 610. In step 610, the facility advances the current position one character toward the end of the sentence. After step 610, the facility continues in step 614.

[0039] In step 611, if the last character of the word candidate overlaps with another word candidate that may also be a word, the facility proceeds to step 613; otherwise, it proceeds to step 612. Step 611 is discussed in greater detail below in connection with FIG. 8. In step 612, the facility advances the position to the character that follows the last character of the word candidate in the sentence. After step 612, the facility proceeds to step 614. In step 613, the facility advances the position to the last character of the current word candidate. After step 613, the facility proceeds to step 614. In step 614, if the position is not at the end of the sentence, the facility proceeds to step 602 to consider a new group of word candidates; otherwise, these steps end.
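The loop of steps 601 through 614 can be sketched as follows. This is an illustrative reading of the description, not the patented implementation: the function names, the representation of the vocabulary knowledge base as a dictionary from word to IgnoreParts flag, and the toy Latin-letter data are all assumptions. The two helper functions stand in for the checks of steps 604 and 611.

```python
MAX_LEN, MIN_LEN = 7, 2  # candidate lengths considered, longest first


def passes_checks(cand, next_char, char_pos):
    """Step 604: NextChar test on the first two characters, then CharPos test."""
    if cand[1] not in next_char.get(cand[0], set()):
        return False
    return all((p, len(cand)) in char_pos.get(ch, set())
               for p, ch in enumerate(cand, start=1))


def overlaps(sentence, start, length, next_char, lexicon):
    """Step 611: may the candidate's last character begin another word?"""
    end = start + length
    if end >= len(sentence):
        return False  # assumption: nothing follows the candidate
    if sentence[end] not in next_char.get(sentence[end - 1], set()):
        return False
    return sentence[start:end - 1] in lexicon


def find_multichar_words(sentence, lexicon, next_char, char_pos):
    """lexicon is a hypothetical dict: word -> True if IgnoreParts is set."""
    found = []
    pos = 0                                             # step 601
    while pos < len(sentence):                          # steps 602 and 614
        advance_to = pos + 1                            # default: step 610
        longest = min(MAX_LEN, len(sentence) - pos)
        for length in range(longest, MIN_LEN - 1, -1):  # steps 603-609
            cand = sentence[pos:pos + length]
            if not passes_checks(cand, next_char, char_pos):
                continue
            if cand not in lexicon:                     # steps 605-606
                continue
            found.append((pos, cand))                   # step 607
            if lexicon[cand]:                           # step 608: flag is set
                if overlaps(sentence, pos, length, next_char, lexicon):
                    advance_to = pos + length - 1       # step 613
                else:
                    advance_to = pos + length           # step 612
                break
        pos = advance_to
    return found


# Toy data (hypothetical): "ab" and "de" have IgnoreParts set, "bc" does not.
lexicon = {"a": False, "ab": True, "bc": False, "de": True}
next_char = {"a": {"b"}, "b": {"c"}, "d": {"e"}}
char_pos = {"a": {(1, 2)}, "b": {(1, 2), (2, 2)}, "c": {(2, 2)},
            "d": {(1, 2)}, "e": {(2, 2)}}
found = find_multichar_words("abcde", lexicon, next_char, char_pos)
```

In this toy run the word "ab" overlaps with "bc", so the position advances only to "b" (step 613), whereas after "de" the position jumps past the whole word (step 612).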

[0040] FIG. 7 is a flow diagram showing the steps preferably performed by the facility to test the NextChar and CharPos conditions for a word candidate. In step 701, if the second character of the word candidate is in the NextChar list of the first character of the word candidate, the facility proceeds to step 703. Otherwise, the facility proceeds to step 702 and returns a result that the conditions are not both satisfied. In steps 703 through 706, the facility loops through each character position in the word candidate. In step 704, if the ordered pair made up of the current position and the length of the word candidate is among the ordered pairs in the CharPos list for the character at the current character position, the facility proceeds to step 706; otherwise, the facility proceeds to step 705 and returns a result that the conditions are not both satisfied. In step 706, if character positions remain to be processed in the word candidate, the facility proceeds to step 703 to process the next character position in the word candidate. Otherwise, the facility proceeds to step 707 and returns a result that the word candidate satisfies both conditions.
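A minimal sketch of the FIG. 7 test follows, treating a failed NextChar test in step 701 as a failure of the combined check. The function name, the dictionaries `next_char` and `char_pos`, and the toy data are hypothetical; CharPos entries are assumed to be (position, length) pairs with 1-based positions, matching pairs such as (1,7) in the description.

```python
def passes_nextchar_and_charpos(candidate, next_char, char_pos):
    """Return True only if the candidate satisfies both conditions."""
    # Step 701: the second character must be in the first character's NextChar list.
    if candidate[1] not in next_char.get(candidate[0], set()):
        return False                      # step 702
    length = len(candidate)
    # Steps 703-706: every character must be allowed at its position
    # in a word of this length.
    for position, ch in enumerate(candidate, start=1):
        if (position, length) not in char_pos.get(ch, set()):
            return False                  # step 705
    return True                           # step 707


# Toy data (hypothetical, not taken from the patent's tables).
next_char = {"a": {"b"}, "b": {"c"}}
char_pos = {"a": {(1, 2)}, "b": {(1, 2), (2, 2)}, "c": {(2, 2)}}

ok_ab = passes_nextchar_and_charpos("ab", next_char, char_pos)
ok_abc = passes_nextchar_and_charpos("abc", next_char, char_pos)
ok_ac = passes_nextchar_and_charpos("ac", next_char, char_pos)
```

Both tests are cheap set lookups, which is why they can be applied before the more expensive lookup of the whole candidate in the knowledge base.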

[0041] FIG. 8 is a flow diagram showing the steps preferably performed by the facility to determine whether the last character of the current word candidate overlaps with another word candidate that may be a word. In step 801, if the character that follows the word candidate is in the character list of the NextChar attribute for the last character of the word candidate, the facility proceeds to step 803. Otherwise, the facility proceeds to step 802 and returns a result that there is no overlap. In step 803, the facility looks up the word candidate without its last character in the vocabulary knowledge base to determine whether the word candidate without its last character is a word. In step 804, if the word candidate without its last character is a word, the facility proceeds to step 806 and returns a result that there is an overlap. Otherwise, the facility proceeds to step 805 and returns a result that there is no overlap.
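The overlap test of FIG. 8 might be sketched as below. The function name, the use of a plain set as the vocabulary knowledge base, and the toy data are assumptions made for illustration; the case where no character follows the candidate is not covered by the description and is treated here as "no overlap".

```python
def last_char_overlaps(sentence, start, length, next_char, lexicon):
    """Return True if the candidate's last character may begin another word."""
    end = start + length                  # index of the character after the candidate
    if end >= len(sentence):
        return False                      # assumption: end of sentence, no overlap
    last = sentence[end - 1]
    # Step 801: can the following character come second after the last character?
    if sentence[end] not in next_char.get(last, set()):
        return False                      # step 802
    # Steps 803-804: is the candidate without its last character itself a word?
    return sentence[start:end - 1] in lexicon   # step 806 if True, step 805 if False


# Toy data (hypothetical).
lexicon = {"ab", "abc", "cd"}
next_char = {"a": {"b"}, "c": {"d"}}

overlap_abc = last_char_overlaps("abcd", 0, 3, next_char, lexicon)
overlap_ab = last_char_overlaps("abcd", 0, 2, next_char, lexicon)
overlap_cd = last_char_overlaps("abcd", 2, 2, next_char, lexicon)
```

Here "abc" overlaps because "ab" is a word and "c" may be followed by "d", so the sentence could also be read as "ab" plus "cd".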

[0042] The execution of the steps shown in FIG. 6 on the foregoing example is shown in Table 7 below.

[0043] Table 7: Character combinations considered

[Table 7]

[0044] Table 7 shows, for each of the 53 combinations of characters from the sample sentence that the facility considered, the result of the CharPos test, the result of the NextChars test, whether the facility looked up the combination in the vocabulary knowledge base ("look up?"), and whether the vocabulary knowledge base indicated that the combination of characters is a word ("is a word?").

[0045] It can be seen that combinations 1 through 4 fail the CharPos test ("fail on 昨"), because the CharPos attribute of the character "昨" does not contain the ordered pairs (1,7), (1,6), (1,5), or (1,4). Combinations 5 and 6, on the other hand, pass both the CharPos test and the NextChars test. The facility therefore looks up combinations 5 and 6 in the vocabulary knowledge base and determines that combination 5 is not a word but that combination 6 is a word. After processing combination 6, in determining how far to advance the current position, the facility determines that the IgnoreParts attribute is set but that the word "昨天" overlaps with a word candidate beginning with the character "天". The facility therefore advances to the character "天" at the end of combination 6, in accordance with step 613. Of combinations 7 through 12, only combination 12 passes both the CharPos test and the NextChars test. Combination 12 is therefore looked up and determined to be a word. After processing combination 12, in determining how far to advance the current position, the facility determines that the IgnoreParts attribute of the word formed by combination 12 is clear, and therefore advances the current position one character, to the character "下", rather than to the character following combination 12.

[0046] It can further be seen that combinations 18, 24, 37, and 43 are words whose IgnoreParts attribute is set and whose last characters do not overlap with any word candidate that may be a word. After processing each of them, the facility therefore advances the current position to the character following the combination, in accordance with step 612, and thereby avoids unnecessarily processing up to 41 additional combinations for each of these four combinations.

[0047] It can further be seen that the IgnoreParts attribute of the words formed by combinations 23 and 50 is clear. After processing these combinations, the facility therefore advances the current position by only one character, in accordance with step 610.

[0048] It can further be seen that the facility did not determine the two-character combinations 30, 36, 47, and 52 to be words. After processing these combinations, the facility therefore advances the current position by only one character, in accordance with step 610. In all, the facility looked up only 14 of the 112 possible combinations in the sample sentence. Of the 14 combinations that the facility looked up, nine are actual words.

[0049] As shown in Table 8, after the processing described in connection with Table 7, the word list contains the words formed by combinations 6, 12, 18, 23, 24, 37, 43, 50, and 53 (the table also gives each word's part of speech: noun, verb, or pronoun).

[0050] Table 8: Word list of multi-character words

[Table 8]

[0051] FIG. 9 is a flow diagram showing the steps preferably performed by the facility to add single-character words to the word list. In steps 901 through 906, the facility loops through each character in the sentence, from the first character to the last. In step 902, the facility determines, based on the character's entry in the vocabulary knowledge base, whether the character constitutes a single-character word. If it does, the facility proceeds to step 903; otherwise, it proceeds to step 906 without adding the character to the word list. In step 903, if the character is contained in a word that may not contain other words, that is, a word already on the word list whose IgnoreParts attribute is set, the facility proceeds to step 904; otherwise, it proceeds to step 905 to add the character to the word list. In step 904, if the character is contained in another word on the word list that overlaps with another word on the word list, the facility proceeds to step 906 without adding the character to the word list; otherwise, it proceeds to step 905. In step 905, the facility adds the single-character word constituted by the current character to the word list. In step 906, if characters remain to be processed in the sentence, the facility proceeds to step 901 to process the next character in the sentence; otherwise, these steps end.

[0052] Table 9 below shows the single-character words 54 through 61 that the facility added to the word list in performing the steps shown in FIG. 9 (the table also gives each word's part of speech: noun, morpheme, noun (localizer), verb, preposition, adverb, function word, pronoun, or noun (classifier)).

[0053] Table 9: Word list of single- and multi-character words

[Table 9]

[0054] After adding the multi-character and single-character words to the word list and generating vocabulary records for these words, the facility assigns probabilities to the vocabulary records. The parser uses these probabilities to order the application of the vocabulary records in the parsing process. FIGS. 10 and 11, discussed below, show two alternative approaches that the facility uses to assign probabilities to the vocabulary records.

[0055] FIG. 10 is a flow diagram showing the steps preferably performed by the facility to assign probabilities to the vocabulary records generated from the words in the word list in accordance with the first approach. The facility preferably ultimately sets the probability of each vocabulary record either to a high probability value, which causes the parser to consider the vocabulary record early in the parsing process, or to a low probability value, which causes the parser to consider the vocabulary record later in the parsing process. In steps 1001 through 1005, the facility loops through each word in the word list. In step 1002, if the current word is contained in a larger word in the word list, the facility proceeds to step 1004; otherwise, it proceeds to step 1003. In step 1003, the facility sets the probability of the vocabulary record representing the word to the high probability value. After step 1003, the facility proceeds to step 1005. In step 1004, the facility sets the probability of the vocabulary record representing the word to the low probability value. After step 1004, the facility proceeds to step 1005. In step 1005, if words remain to be processed in the word list, the facility proceeds to step 1001 to process the next word in the word list; otherwise, these steps end.
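The first approach can be sketched in a few lines. The sketch assumes that each word-list entry records where the word starts in the sentence, and that "contained in a larger word" means the word's span lies inside the span of a longer word on the list; both are assumptions, as are the function name and the string labels "high" and "low".

```python
HIGH, LOW = "high", "low"  # placeholder labels for the two probability values


def assign_probabilities_by_containment(word_spans):
    """word_spans is a hypothetical list of (start_index, word) pairs.

    Returns a dict mapping each (start_index, word) to HIGH or LOW.
    """
    result = {}
    for start, word in word_spans:                      # steps 1001-1005
        end = start + len(word)
        contained = any(                                # step 1002
            s <= start and end <= s + len(w) and len(w) > len(word)
            for s, w in word_spans
        )
        result[(start, word)] = LOW if contained else HIGH   # steps 1004 / 1003
    return result


# Toy word list (hypothetical) for the sentence "abc".
spans = [(0, "ab"), (0, "a"), (1, "bc"), (2, "c")]
probs = assign_probabilities_by_containment(spans)
```

Because a word is marked low only when a longer word covers it, every character remains covered by at least one high-probability word.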

[0056] Table 10 below shows the probability values assigned to each word in the word list in accordance with the steps shown in FIG. 10. By examining the probabilities, it can be seen that the facility has assigned the high probability value to at least one word containing each character, so that at least one vocabulary record containing each character is considered early in the parsing process.

[0057] Table 10: Word list with probabilities

[Table 10]

[0058] FIG. 11 is a flow diagram showing the steps preferably performed by the facility to assign probabilities to the vocabulary records generated from the words in the word list in accordance with the second approach. In step 1101, the facility uses the word list to identify all possible segmentations of the sentence that are composed entirely of words in the word list. In step 1102, the facility selects, from among the possible segmentations identified in step 1101, the one or more segmentations that contain the fewest words. If more than one possible segmentation has the minimum number of words, the facility selects each such segmentation.

[0059] Table 11 below shows the possible segmentations having the fewest words (nine) generated from the word list shown in Table 9.

[0060]

[Table 11]

[0061] In step 1103, the facility sets the probability of the vocabulary records of the words in the selected segmentation(s) to the high probability value. In step 1104, the facility sets the probability of the vocabulary records of the words not in the selected segmentation(s) to the low probability value. After step 1104, these steps end.
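The second approach (steps 1101 through 1104) can be illustrated as follows. Rather than enumerating every possible segmentation and then filtering, this sketch uses dynamic programming over sentence suffixes and keeps only the fewest-word segmentations at each position; that substitution yields the same selected segmentations but is a choice made here, not something the description specifies. The function names, the plain set of words, and the labels "high" and "low" are likewise assumptions.

```python
def fewest_word_segmentations(sentence, words):
    """Return every segmentation of sentence into listed words that uses
    the minimum number of words (steps 1101-1102)."""
    n = len(sentence)
    best = {n: [[]]}  # best[i]: fewest-word segmentations of sentence[i:]
    for i in range(n - 1, -1, -1):
        options = []
        for w in sorted(words):
            if sentence.startswith(w, i) and best.get(i + len(w)):
                options.extend([w] + rest for rest in best[i + len(w)])
        if options:
            fewest = min(len(o) for o in options)
            best[i] = [o for o in options if len(o) == fewest]
    return best.get(0, [])


def assign_probabilities_by_segmentation(sentence, words):
    """Words in a selected segmentation get "high" (step 1103), others "low" (step 1104)."""
    selected = fewest_word_segmentations(sentence, words)
    in_selected = {w for seg in selected for w in seg}
    return {w: ("high" if w in in_selected else "low") for w in words}


# Toy word list (hypothetical) for the sentence "abcd".
words = {"a", "b", "c", "d", "ab", "bc", "cd"}
segs = fewest_word_segmentations("abcd", words)
probs = assign_probabilities_by_segmentation("abcd", words)
```

In the toy case the only two-word segmentation is "ab" plus "cd", so those two words receive the high value and the remaining five receive the low value.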

[0062] Table 12 below shows the probability values assigned to each word in the word list in accordance with the steps shown in FIG. 11. By examining the probabilities, it can be seen that the facility has assigned the high probability value to at least one word containing each character, so that at least one vocabulary record containing each character is considered early in the parsing process.

[0063] Table 12: Word list with probabilities

[Table 12]

[0064] FIG. 12 is a parse tree diagram showing a parse tree, generated by the parser, that represents the syntactic structure of the sample sentence. It can be seen that the parse tree is a hierarchical structure having a single sentence record 1231 as its head and a number of vocabulary records 1201 through 1211 as its leaves. The parse tree also has intermediate syntactic records 1221 through 1227, which combine vocabulary records each representing a word into larger syntactic structures representing one or more words. For example, prepositional phrase record 1223 combines vocabulary record 1204, representing a preposition, with vocabulary record 1206, representing a noun. In accordance with step 506 of FIG. 5, the facility identifies the words represented by vocabulary records 1201 through 1211 in the parse tree as the words into which the sample sentence is to be segmented. The facility may also retain this parse tree in order to perform further natural language processing on the sentence.
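Reading the segmentation off the parse tree amounts to collecting its leaves from left to right. The sketch below assumes a simple record type; the class name `Record`, its fields, the labels, and the toy tree are hypothetical and do not reproduce the records of FIG. 12.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    label: str                      # e.g. a part of speech or phrase type
    word: str = ""                  # non-empty only for leaf (vocabulary) records
    children: list = field(default_factory=list)


def leaf_words(record):
    """Return the words of the leaf records under record, left to right."""
    if not record.children:
        return [record.word]
    words = []
    for child in record.children:
        words.extend(leaf_words(child))
    return words


# Toy tree (hypothetical): a sentence record over two phrases.
tree = Record("SENT", children=[
    Record("NP", children=[Record("NOUN", "ab")]),
    Record("VP", children=[Record("VERB", "cd"), Record("NOUN", "e")]),
])
segmentation = leaf_words(tree)
```

Only vocabulary records that the parser actually used appear as leaves, so competing word candidates that did not fit the syntactic structure drop out of the result.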

[0065] While the invention has been shown and described with reference to preferred embodiments, those skilled in the art will recognize that various changes or modifications in form and detail may be made without departing from the scope of the invention. For example, aspects of the facility described above may be applied to perform word segmentation in languages other than Chinese. Further, any suitable subset or superset of the techniques described herein may be applied to perform word segmentation.

[Brief description of the drawings]

FIG. 1 is a high-level block diagram of the general-purpose computer system on which the facility preferably executes.

FIG. 2 is an overview flow diagram showing the two phases in which the facility preferably operates.

FIG. 3 is a flow diagram showing the steps preferably performed by the facility in the initialization phase to augment the vocabulary knowledge base to include information used in performing word segmentation.

FIG. 4 is a flow diagram showing the steps preferably performed to determine whether a particular word may contain other, smaller words.

FIG. 5 is a flow diagram of the steps preferably performed by the facility to segment a sentence into its constituent words.

FIG. 6 is a flow diagram showing the steps preferably performed by the facility to add multi-character words to the word list.

FIG. 7 is a flow diagram showing the steps preferably performed by the facility to test the NextChar and CharPos conditions for a word candidate.

FIG. 8 is a flow diagram showing the steps preferably performed by the facility to determine whether the last character of the current word candidate overlaps with another word candidate that may be a word.

FIG. 9 is a flow diagram showing the steps preferably performed by the facility to add single-character words to the word list.

FIG. 10 is a flow diagram showing the steps preferably performed by the facility to assign probabilities to the vocabulary records generated from the words in the word list in accordance with the first approach.

FIG. 11 is a flow diagram showing the steps preferably performed by the facility to assign probabilities to the vocabulary records generated from the words in the word list in accordance with the second approach.

FIG. 12 is a parse tree diagram showing a parse tree, generated by the parser, that represents the syntactic structure of the sample sentence.

[Procedure amendment]

[Submission date] March 13, 2001

[Procedure amendment 1]

[Document to be amended] Drawings

[Item to be amended] All figures

[Method of amendment] Change

[Content of amendment]

[FIG. 1 through FIG. 12]

Continuation of front page
(72) Inventor: Richardson, Steven D., 18028 Northeast 132nd Street, Redmond, Washington 98052, United States of America
(72) Inventor: Chang, Chishin, 9307 176th Place Northeast, No. 3, Redmond, Washington 98052, United States of America
F-term (reference): 5B091 AA15 CA02 CA05 CC02 CC15

Claims

[Claims]

1. A method in a computer system for identifying the words that make up a body of natural language text, the body of natural language text consisting of an ordered string of characters that begins with a first character, ends with a last character, and includes a selected intermediate character between the first and last characters, the method comprising: identifying in the character string a first word that includes the first character and the selected intermediate character; identifying in the character string a second word that includes the last character but not the selected intermediate character, such that concatenating the first and second words forms the character string; identifying in the character string a third word that includes the first character but not the selected intermediate character; identifying in the character string a fourth word that includes the selected intermediate character and the last character, such that concatenating the third and fourth words forms the character string; submitting the first, second, third, and fourth words to a syntactic parser to generate a parse tree representing a syntactic structure, the parse tree containing either the first and second words or the third and fourth words; when the parse tree contains the first and second words, indicating that the first and second words make up the body of natural language text; and when the parse tree contains the third and fourth words, indicating that the third and fourth words make up the body of natural language text.

2. The method of claim 1, wherein the submitting step includes submitting to the syntactic parser a supersequence of characters that includes the character string and constitutes a sentence, and generating a parse tree representing the syntactic structure of the sentence.

3. A computer-readable medium whose contents cause a computer system to select, from a natural language character string, combinations of characters that may be words, using, for each of a plurality of characters, indications of the characters that appear in the second position of words beginning with the character and indications of the positions at which the character appears in words, by performing, for each of a plurality of consecutive combinations of characters in the string, the steps of: determining whether the character appearing in the second position of the combination is indicated to appear in a word beginning with the character appearing in the first position of the combination; if the character appearing in the second position of the combination is determined to appear in a word beginning with the character appearing in the first position of the combination, determining whether each character of the combination is indicated to appear in words at the position at which it appears in the combination; and if each character of the combination is determined to be indicated to appear in words at the position at which it appears in the combination, determining that the combination of characters may be a word.

4. The computer-readable medium of claim 1, further comprising the step of comparing the character combination with a word list to determine whether the character combination is a word.

5. A computer memory containing a word segmentation data structure for use in identifying individual words appearing in natural language text, the data structure comprising: for each of a plurality of characters, an identification of the characters that appear in the second position of words beginning with the character and, for words containing the character, an identification of the length of the word and the character position within the word occupied by the character; and an indication of whether the character string making up a word may also make up a series of shorter words.