JP2006277674A

JP2006277674A - Sentence division computer program

Info

Publication number: JP2006277674A
Application number: JP2005100017A
Authority: JP
Inventors: Lepage Yves; イヴ・ルパージュ
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2005-03-30
Filing date: 2005-03-30
Publication date: 2006-10-12

Abstract

PROBLEM TO BE SOLVED: To provide a computer program that causes a computer to divide a wide variety of sentences accurately and quickly without requiring preprocessing or data editing. SOLUTION: A sentence division program 86 is used together with a database 88 containing three or more pairs, each consisting of a sentence and a divided character string obtained by dividing that sentence. The sentence division program 86 includes a division program section 90 that, when given a sentence to be processed, selects any two sentences from the database 88 and generates an analogical sentence from those two sentences and the sentence to be processed; when the analogical sentence exists in the database 88, it reads the divided character strings paired in the database 88 with the analogical sentence and with the two sentences, and generates an analogical divided character string from the three divided character strings read. The sentence division program 86 also includes an input/output program section 92 that gives an input sentence 82 to the division program section 90 and outputs, as an output sentence 84, the analogical divided character string that the division program section 90 generates from the input sentence 82. COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to natural language processing, and more particularly to a technique for dividing a character string into predetermined units such as words.

In order to process natural language text with a computer, the text must first be divided into units such as words or morphemes, and each element must be identified. For example, if the boundary between words is represented by the symbol "｜", dividing the Japanese sentence 「来週出発したいと思っています。」 ("I would like to leave next week.") into words yields 「来週｜出発｜したい｜と｜思って｜います｜。」. In particular, sentences in languages such as Japanese, Chinese, and Thai are written as continuous strings without spaces between words, unlike sentences in languages such as English, so dividing a sentence into units such as words is an essential step.
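The boundary notation above can be illustrated with a minimal Python sketch. The function names `join_words` and `strip_boundaries` are hypothetical helpers introduced only for illustration; the word list is the example sentence from the text.

```python
BOUNDARY = "｜"  # full-width bar used in the text as the word-boundary symbol

def join_words(words):
    """Build a divided character string from a list of words."""
    return BOUNDARY.join(words)

def strip_boundaries(divided):
    """Recover the undivided sentence by deleting every boundary symbol."""
    return divided.replace(BOUNDARY, "")

words = ["来週", "出発", "したい", "と", "思って", "います", "。"]
divided = join_words(words)
print(divided)                    # divided character string
print(strip_boundaries(divided))  # original, undivided sentence
```

Deleting the boundary symbols from a divided character string always gives back the original sentence, which is the property that makes a (sentence, divided character string) pair usable as an example.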

Methods for dividing a sentence into words fall broadly into two types: one uses a dictionary, and the other is statistical.

In the dictionary-based approach, a dictionary configured as a finite-state automaton is generally prepared first. The characters in the sentence are then matched against the characters in the dictionary in order from the beginning of the sentence, and the boundary between the last character of the preceding word and the first character of the next word is determined. This process is repeated toward the end of the sentence to determine the boundaries between words.
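As a rough illustration of dictionary-based division, the following Python sketch scans the sentence from the beginning and greedily takes the longest dictionary word at each position. It is a simplification under stated assumptions: the dictionary is a plain set rather than a finite-state automaton, greedy longest match is only one possible matching policy, and `longest_match_segment` and `toy_dictionary` are hypothetical names.

```python
def longest_match_segment(sentence, dictionary, boundary="|"):
    """Greedy left-to-right division using the longest dictionary word."""
    max_len = max(map(len, dictionary), default=1)
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return boundary.join(words)

# Toy dictionary (an assumption for this example only).
toy_dictionary = {"来週", "出発", "したい", "と", "思って", "います", "。"}
result = longest_match_segment("来週出発したいと思っています。", toy_dictionary, "｜")
print(result)
```

The quality of the result depends entirely on the dictionary, which is the drawback the text goes on to discuss.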

In the statistical approach, a so-called learning stage is carried out first, in which training data prepared in advance is processed to obtain the statistical information needed to infer a probability model. Various probability models exist, but a model that depends on N-gram frequencies is normally used. The probability model is smoothed and estimated using, for example, the Good-Turing estimation method. This learning yields information in the form of state transitions. The state transitions provide knowledge about the character being processed or about to be processed, which makes it possible to decide whether the character being processed is the last character of a word, that is, whether a space should be placed at that position. The learning stage runs until the training data has been properly processed and knowledge has been obtained from the probability model. The knowledge obtained by learning is then used to determine the boundaries between words.
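The idea of learning boundary decisions from frequencies can be sketched in Python as follows. This is a deliberately small stand-in, not the model described above: it counts character bigrams only, applies no smoothing (so no Good-Turing estimation), and decides each boundary by a simple majority. The names `train` and `segment` and the training strings are assumptions for illustration.

```python
from collections import Counter

def train(divided_sentences, boundary="|"):
    """Count, for each adjacent character pair, how often a boundary separates it."""
    split_counts, joined_counts = Counter(), Counter()
    for divided in divided_sentences:
        chars = []      # characters with boundary symbols removed
        breaks = set()  # indices i such that a boundary follows chars[i]
        for ch in divided:
            if ch == boundary:
                breaks.add(len(chars) - 1)
            else:
                chars.append(ch)
        for i in range(len(chars) - 1):
            pair = (chars[i], chars[i + 1])
            (split_counts if i in breaks else joined_counts)[pair] += 1
    return split_counts, joined_counts

def segment(sentence, model, boundary="|"):
    """Insert a boundary wherever the pair was split more often than joined."""
    split_counts, joined_counts = model
    out = []
    for i, ch in enumerate(sentence):
        out.append(ch)
        if i + 1 < len(sentence):
            pair = (ch, sentence[i + 1])
            if split_counts[pair] > joined_counts[pair]:
                out.append(boundary)
    return "".join(out)

model = train(["the|cat", "the|cow", "a|cat"])
print(segment("acow", model))
```

Because each decision looks only at two adjacent characters, the sketch also shows the fixed-context limitation of N-gram models.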

Japanese Patent No. 3172511

In the dictionary-based approach, the processing is nondeterministic; that is, the same sentence may be divided in different ways. In addition, this approach requires a large and suitable dictionary to be compiled in advance on the basis of, for example, grammatical knowledge of the target language.

Techniques based on the statistical approach require the prepared data to be preprocessed; that is, training must be performed on a huge amount of training data before any processing. Moreover, when an N-gram model is used as the probability model, the context considered in division is fixed to a predetermined length, which makes the approach unsuitable for processing long contexts.

Therefore, an object of the present invention is to provide a computer program that causes a computer to divide a wide variety of sentences accurately and quickly without requiring an enormous amount of preprocessing or data editing.

The sentence division computer program according to a first aspect of the present invention is a computer program used together with a computer-readable database containing three or more pairs, each consisting of a sentence and a divided character string obtained by dividing that sentence into predetermined units. The sentence division computer program includes: a sentence division program part that, when given a sentence to be processed, uses the database to generate a divided character string for that sentence; and an input/output program part that gives an externally supplied input sentence to the sentence division program part and outputs the divided character string that the sentence division program part produces for the input sentence as the divided character string for the input sentence.
The sentence division program part includes: a first analogy program part that, when given the sentence to be processed, selects any two sentences from the database, generates a predetermined first analogical equation from those two sentences and the sentence to be processed, and solves the first analogical equation to generate an analogy sentence; a determination program part that determines whether the analogy sentence exists in the database; a read program part that, in response to the determination program part determining that the analogy sentence exists in the database, reads from the database the divided character string paired with the analogy sentence and the two divided character strings paired with the two sentences; and a second analogy program part that uses the three divided character strings read by the read program part to generate a second analogical equation having a predetermined relationship with the first analogical equation, solves the second analogical equation to generate an analogical divided character string, and outputs it as the return value of the sentence division program part.

Preferably, the first analogy program part includes: a program part that, when given the sentence to be processed, selects all ordered pairs (A, B) of sentences in the database; and an analogical-equation solving program part that, for each selected ordered pair, generates the first analogical equation "A : B :: x : D" with the sentence D to be processed, attempts to solve it to obtain a solution x = C, and outputs every solution C obtained.

More preferably, the sentences A, B, and C in the database are paired in the database with divided character strings A', B', and C', respectively, and the second analogy program part includes a program part that uses the three divided character strings A', B', and C' read by the read program part to generate a second analogical equation "A' : B' :: C' : y", in which they are arranged in the same order as in the first analogical equation, solves the second analogical equation to generate an analogical divided character string y = D', and outputs it as the return value of the sentence division program part.

The sentence division program part may further include: a recursive call program part that, in response to the determination program part determining that the analogy sentence does not exist in the database, recursively calls the sentence division program part with the analogy sentence as the sentence to be processed; and a program part that adds to the database the pair consisting of the analogy sentence and the divided character string output by the sentence division program part thus called.

A recording medium according to a second aspect of the present invention is a computer-readable recording medium on which any of the sentence division computer programs according to the first aspect of the present invention is recorded.

[Overview]
Hereinafter, an embodiment of the present invention will be described with reference to the drawings. In this specification, Japanese is used as an example of a language written without breaks between words, but as will be clear from the following description, the present invention is applicable regardless of language, script, grammatical structure, and the like. In describing this embodiment, boundaries between words in Japanese are represented by the symbol "｜" in the drawings and the specification. Boundaries between words in English are represented by spaces, as usual.

The sentence division system of this embodiment divides text into words by symbol-string analogy, using a large number of examples, each of which is a pair consisting of a sentence composed of a character string and a symbol string in which a boundary symbol has been inserted at each boundary between words in the sentence (hereinafter, this symbol string is called a "divided character string"). For symbol-string analogy, see Patent Document 1.

Patent Document 1 discloses a technique for generating another symbol string by analogy from three symbol strings given in a predetermined order. In this technique, the analogical relation that holds between two of the three given symbol strings is identified first. Then another symbol string is generated that stands in the same relation to the remaining one of the three symbol strings. In other words, an analogical relation like the one that holds between the two symbol strings holds between the remaining symbol string and the generated symbol string. Here, the generation of a symbol string D from symbol strings A, B, and C by the analogical relation is written "A : B :: C : D".

In this embodiment, taking the character string to be divided as a sentence D, sentences A, B, and C for which A : B :: C : D holds are searched for first. Once sentences A, B, and C have been found, the examples are used to obtain the divided character strings A', B', and C' corresponding to sentences A, B, and C. The divided character strings A', B', and C' are themselves symbol strings and can be used in symbol-string analogy. A symbol string D' for which A' : B' :: C' : D' holds is therefore generated from A', B', and C' by symbol-string analogy.

Because A', B', and C' are divided character strings, the symbol string D' is likewise a divided character string. And because A', B', and C' correspond to sentences A, B, and C, respectively, D' likewise corresponds to sentence D. The symbol string D' is therefore the divided character string corresponding to sentence D.
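The two analogies in this overview can be sketched in Python under a strong simplifying assumption. The solver below is not the technique of Patent Document 1: it only handles the case where A and B share a prefix and a suffix and differ in one middle substring, which it substitutes in the third string. The names `solve_fourth` and `solve_third` and the English toy strings (written without spaces, with "|" as the boundary symbol) are hypothetical.

```python
def solve_fourth(a, b, c):
    """Solve a : b :: c : x, assuming a and b differ in one middle substring.

    Returns None when that substring does not occur exactly once in c.
    """
    p = 0  # length of the common prefix of a and b
    while p < min(len(a), len(b)) and a[p] == b[p]:
        p += 1
    s = 0  # length of the common suffix that does not overlap the prefix
    while s < min(len(a), len(b)) - p and a[len(a) - 1 - s] == b[len(b) - 1 - s]:
        s += 1
    a_mid, b_mid = a[p:len(a) - s], b[p:len(b) - s]
    if a_mid == b_mid:  # a equals b, so x equals c
        return c
    if a_mid == "" or c.count(a_mid) != 1:
        return None
    return c.replace(a_mid, b_mid, 1)

def solve_third(a, b, d):
    """Solve a : b :: x : d by solving the equivalent b : a :: d : x."""
    return solve_fourth(b, a, d)

# Toy examples (assumptions, not taken from the patent figures).
A, A_div = "ilikecats", "i|like|cats"
B, B_div = "ilikedogs", "i|like|dogs"
C_div = "we|like|cats"
D = "welikedogs"

C = solve_third(A, B, D)                  # first analogy:  A : B :: x : D
D_div = solve_fourth(A_div, B_div, C_div)  # second analogy: A' : B' :: C' : y
print(C, D_div)
```

In this toy case, removing the boundary symbols from the generated D' gives back D, which mirrors the argument that D' is the divided character string corresponding to D.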

The sentence division system of this embodiment further supplements its examples through recursive processing applied to the sentences generated in the course of searching for sentences A, B, and C.

[Configuration]
The sentence division system of this embodiment is realized by computer hardware, a program executed by the computer hardware, and data stored in the computer hardware. FIG. 1 shows the external appearance of this computer system 30, and FIG. 2 shows the internal configuration of the computer system 30.

Referring to FIG. 1, the computer system 30 includes a computer 40 having a CD-ROM (compact disc read-only memory) drive 50 and an FD (flexible disk) drive 52, a monitor 42, a keyboard 46, and a mouse 48.

Referring to FIG. 2, in addition to the FD drive 52 and the CD-ROM drive 50, the computer 40 includes a hard disk 54, a CPU (central processing unit) 56, a bus 66 connected to the CPU 56, the hard disk 54, the FD drive 52, and the CD-ROM drive 50, a read-only memory (ROM) 58 connected to the bus 66 and storing a boot-up program and the like, and a random access memory (RAM) 60 connected to the bus 66 and storing program instructions, a system program, work data, and the like. The computer system 30 further includes a printer 44.

Although not shown here, the computer 40 may further include a network adapter board that provides a connection to a local area network (LAN).

The computer program that causes the computer system 30 to operate as the sentence division system of this embodiment is stored on a CD-ROM 62 or an FD 64 inserted into the CD-ROM drive 50 or the FD drive 52, and is then transferred to the hard disk 54. Alternatively, the program may be transmitted to the computer 40 over a network (not shown) and stored on the hard disk 54. The program is loaded into the RAM 60 when it is executed. The program may also be loaded into the RAM 60 directly from the CD-ROM 62, from the FD 64, or via the network.

This program includes a plurality of instructions that cause the computer 40 to operate as the sentence division system of this embodiment. Some of the basic functions needed for this operation are provided by the operating system (OS) running on the computer 40, by third-party programs, or by modules of various toolkits installed on the computer 40. The program therefore need not include every function required to realize the sentence division system of this embodiment. It need only include those instructions that carry out the operation of the sentence division system described above by calling the appropriate functions or "tools" in a controlled manner so as to obtain the desired result. The operation of the computer system 30 is well known and is not described further here.

FIG. 3 shows the functional configuration of the sentence division system of this embodiment in block diagram form. Referring to FIG. 3, the sentence division system 80 of this embodiment includes: a word division device 86 that receives an input sentence 82 composed of a character string, generates a divided character string in which symbols representing word boundaries have been inserted into the input sentence 82, and outputs it as an output sentence 84; and a database 88 that stores a large number of pairs, each consisting of a sentence and the divided character string corresponding to it, as the data used in the processing of the word division device 86.

FIG. 4 shows the structure of the data stored in the database 88. Referring to FIG. 4, the database 88 includes a first file 100 containing a large number of sentences, each composed of a character string, and a second file 102 consisting of a large number of divided character strings, each corresponding to a sentence in the first file. The first file 100 shown in FIG. 4 is data in list form and contains a large number of entries. Each entry in the first file 100 contains the identification number of the entry and an undivided sentence. The second file 102 is data in the same list form as the first file 100 and also contains a large number of entries, each containing the identification number of the entry and a divided character string. Entries with the same identification number in the first file 100 and the second file 102 hold a sentence and the divided character string corresponding to that sentence, respectively.
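The two-file layout can be modeled in Python as two mappings keyed by identification number. This is only a sketch: the entries are invented English toy data rather than the contents of FIG. 4, and `load_pairs` and `check_consistency` are hypothetical helper names.

```python
# First file: identification number -> undivided sentence (toy data).
first_file = {1: "ilikecats", 2: "ilikedogs", 3: "welikecats"}
# Second file: identification number -> divided character string (toy data).
second_file = {1: "i|like|cats", 2: "i|like|dogs", 3: "we|like|cats"}

def load_pairs(sentences, divided):
    """Join the two files on their shared identification numbers."""
    return {sentences[i]: divided[i] for i in sentences if i in divided}

def check_consistency(pairs, boundary="|"):
    """Each divided string must equal its sentence once boundaries are removed."""
    return all(d.replace(boundary, "") == s for s, d in pairs.items())

pairs = load_pairs(first_file, second_file)
print(pairs["welikecats"])
print(check_consistency(pairs))
```

Keying both files by the same identification number is what lets a sentence found in the first file be mapped to its divided character string in the second.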

Referring again to FIG. 3, the word division device 86 includes: a division program 90 that, when called with a sentence as its argument, uses the argument and the examples in the database 88 to generate the divided character string corresponding to the argument and return it as the return value, and also executes processing that supplements the database 88 with examples; a memory 94 that temporarily holds each divided character string generated by the division program 90 paired with the argument of the corresponding call; and an input/output unit 92 that, in response to the input of an input sentence 82, calls the division program 90 with the input sentence 82 as the argument and outputs the divided character string returned by the division program 90 as the output sentence 84.

FIG. 5 shows a flowchart of the division program 90. Referring to FIG. 5, the division program 90 is called with a sentence as its argument D. Once started, the division program 90 repeats steps 122, 124, and 126, which are enclosed by steps 120A and 120B, until every distinct ordered pair (A, B) of sentences in the first file, with order taken into account, has been processed.
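The loop over ordered pairs can be illustrated with Python's `itertools.permutations`, which yields every ordered pair of distinct elements. The sentence labels below are placeholders, not entries from the first file.

```python
from itertools import permutations

# Placeholder sentences standing in for entries of the first file.
sentences = ["s1", "s2", "s3"]

# Every ordered pair (A, B) with A and B distinct; (A, B) and (B, A) both appear.
ordered_pairs = list(permutations(sentences, 2))
print(ordered_pairs)
print(len(ordered_pairs))  # n * (n - 1) pairs for n sentences
```

Since the number of ordered pairs grows as n(n - 1), the loop visits on the order of n squared pairs for a first file of n sentences.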

In step 122, symbol-string analogy is performed by the technique described in Patent Document 1 on the three sentences consisting of the two sentences A and B of the pair being processed and the argument D. That is, an attempt is made to generate a sentence x = C for which the analogical relation A : B :: x : D holds (hereinafter, this sentence is called the "first analogy sentence"). In step 124, it is determined whether a first analogy sentence C was generated in step 122. If so, control proceeds to step 126; otherwise, control proceeds to step 120B.

In step 126, processing that uses the first analogy sentence C generated in step 122 is executed: the first analogy sentence C is searched for in the database 88 (see FIG. 4), and a second analogy sentence D', which is the divided character string corresponding to the argument D, is generated using the first analogy sentence, the sentence pair (A, B), and the second file 102 of the database 88. When step 126 finishes, control proceeds to step 120B. The processing in step 126 is described later with reference to FIG. 6.

When the series of steps enclosed by steps 120A and 120B has been completed for all sentence pairs, control proceeds to step 128, where the return value is set. If the series of steps has generated a divided character string D' corresponding to the argument D and the pair of D and D' has been stored in the memory 94 (see FIG. 3), a value indicating normal termination (for example, "0") and the address in the memory 94 at which the pair of the sentence and the divided character string is stored are set as the return value. If no divided character string D' corresponding to the argument D has been generated, a predetermined value indicating an error (for example, "1") is set as the return value. The division program 90 then ends.

FIG. 6 shows a flowchart of the processing executed in step 126 (see FIG. 5). Referring to FIG. 6, when step 126 starts, steps 142 to 158, which are enclosed by steps 140A and 140B, are executed for every first analogy sentence C generated in step 122 of FIG. 5.

In step 142, the first file 100 (see FIG. 4) is searched for a sentence that matches the first analogy sentence C. In step 144, the result of the search in step 142 is examined: if a matching sentence was found in the first file 100, control proceeds to step 146; otherwise, control proceeds to step 154.

In step 146, the three sentences consisting of the sentence pair (A, B) being processed and the first analogy sentence C being processed are matched against the first file 100 (see FIG. 4) and the second file 102 (see FIG. 4), and the three divided character strings A', B', and C' corresponding to these three sentences are read from the second file 102. In step 148, the same kind of symbol-string analogy as in step 122 (see FIG. 5) is performed on the three divided character strings A', B', and C' read in step 146. That is, an attempt is made to generate a second analogy sentence y = D' for which the analogical equation A' : B' :: C' : y holds. The second analogy sentence D' generated here is the divided character string corresponding to the argument D. In step 150, it is determined whether a second analogy sentence D' was generated in step 148. If so, control proceeds to step 152; otherwise, control proceeds to step 140B. In step 152, the pair of the argument D and the second analogy sentence D' is stored in the memory 94 (see FIG. 3), and control proceeds to step 140B.

When control passes from step 144 to step 154, the division program 90 is called recursively in step 154 with the first analogy sentence as its argument. This recursive call either generates a divided character string C' corresponding to the first analogy sentence C and returns a value indicating normal termination (0) together with the address in the memory 94 of the pair of the first analogy sentence C and the divided character string C', or fails to generate a divided character string C' and returns a value indicating an error (1). Once the return value has been returned, control proceeds to step 156, where the return value from step 154 is examined. If the return value indicates an error (1), control proceeds to step 140B; otherwise, control proceeds to step 158. In step 158, all pairs of a sentence and a divided character string stored in the memory 94 at that point are added to the database 88 (see FIG. 4). That is, the first analogy sentence is stored in the first file 100 (see FIG. 4), and the divided character string for the first analogy sentence is stored in the second file 102. When the addition is complete, control proceeds to step 140B.

When the above series of steps has been completed for every first analogy sentence, the processing of step 126 shown in FIG. 5 is complete, and control proceeds to step 120B of FIG. 5 as described above.
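The control flow of FIGS. 5 and 6 can be sketched in Python as follows. This is a simplified illustration, not the implementation described above, and it makes several assumptions that should be read as such: the analogy solver is a single-substitution stand-in for the technique of Patent Document 1; the database is one dictionary instead of two files; the function returns the divided string or `None` rather than a status code and a memory address; it returns the first result found instead of visiting every pair; a `depth` limit is added so that the recursion terminates; and a newly added first analogy sentence is used immediately. All names and the toy data are hypothetical.

```python
from itertools import permutations

def solve_fourth(a, b, c):
    """Solve a : b :: c : x, assuming a and b differ in one middle substring."""
    p = 0
    while p < min(len(a), len(b)) and a[p] == b[p]:
        p += 1
    s = 0
    while s < min(len(a), len(b)) - p and a[len(a) - 1 - s] == b[len(b) - 1 - s]:
        s += 1
    a_mid, b_mid = a[p:len(a) - s], b[p:len(b) - s]
    if a_mid == b_mid:
        return c
    if a_mid == "" or c.count(a_mid) != 1:
        return None
    return c.replace(a_mid, b_mid, 1)

def segment(d, db, depth=1, boundary="|"):
    """Return a divided string for sentence d, or None; may add pairs to db."""
    for a, b in permutations(list(db), 2):      # steps 120A-120B (snapshot of db)
        c = solve_fourth(b, a, d)               # step 122: a : b :: x : d
        if c is None:                           # step 124
            continue
        if c not in db:                         # steps 142-144
            if depth == 0:                      # assumed guard against endless recursion
                continue
            c_div = segment(c, db, depth - 1, boundary)  # step 154
            if c_div is None:                   # step 156
                continue
            db[c] = c_div                       # step 158: supplement the examples
        d_div = solve_fourth(db[a], db[b], db[c])        # steps 146-148
        # Assumed sanity check: the result must reduce to d without boundaries.
        if d_div is not None and d_div.replace(boundary, "") == d:
            return d_div
    return None

# Toy database (an assumption for this example only).
db = {
    "ilikecats": "i|like|cats",
    "ilikedogs": "i|like|dogs",
    "welikecats": "we|like|cats",
    "ilovecats": "i|love|cats",
}
direct = segment("welikedogs", db)     # first analogy sentence is already in db
recursive = segment("welovedogs", db)  # first analogy sentence needs a recursive call
print(direct, recursive, sorted(db))
```

In the second call the first analogy sentence is not in the toy database, so it is divided by a recursive call and added to the database before the second analogy is solved, which is the supplementing behavior described in steps 154 to 158.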

[Operation]
The sentence division system 80 of this embodiment operates as follows.

- Generation and retrieval of the first analogy sentence -
First, the operation by which the sentence division system 80 generates a first analogy sentence and searches for it in the database 88 is described. FIG. 7 schematically shows this operation. Referring to FIG. 7, suppose that the input sentence 82 「もっと淡いいろはありますか。」 ("Do you have a lighter color?") is given to the word division device 86 (see FIG. 3). The input/output unit 92 (see FIG. 3) calls the division program 90 with this input sentence 82 as the argument D 200.

When the division program 90 is called, in step 120A shown in FIG. 5, any two sentences A and B are selected from the first file 100 (see FIG. 4) of the database 88 (see FIGS. 3 and 4) to form a sentence pair 202 consisting of sentences A and B. Here, suppose that from the first file 100 shown in FIG. 4, the sentence with identification number 2, 「もっと濃いいろはありますか。」 ("Do you have a darker color?"), is selected as sentence A, and the sentence with identification number 4, 「もっと淡い色はありますか。」 ("Do you have a lighter color?"), is selected as sentence B.

Once the sentence pair 202 has been selected, in the following step 122 a set 204 of three sentences is formed from the argument D 200 and the sentences A and B of the sentence pair 202. Then, by analogy on symbol strings, an attempt is made to generate from these three sentences a first analogical sentence C 206 for which A : B :: C : D holds. For example, in the set 204 of three sentences shown in FIG. 7, sentence B is sentence A with the character 「濃」 ("dark") replaced by the character 「淡」 ("light") and the character string 「いろ」 replaced by the character 「色」. The remaining parts of sentence B are identical to the corresponding parts of sentence A. The argument D contains the character 「淡」 and the character string 「いろ」, and its remaining parts are identical to the corresponding parts of sentences A and B. In the method described in Patent Document 1, a new sentence is generated from the part common to sentences A, B, and D, from the character 「濃」, which stands in a substitution relation with the character 「淡」 in sentence D, and from the character 「色」, which stands in a substitution relation with the character string 「いろ」 in sentence D. The generated sentence is 「もっと濃い色はありますか。」 ("Do you have a darker color?", with 色 in kanji). In this embodiment, this sentence becomes the first analogical sentence C 206.
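
The idea of solving A : B :: x : D can be illustrated with a minimal sketch. The solver below is a simplified stand-in assumed for illustration only: it handles just the case where two sentences differ in one contiguous span, and it is not the algorithm of Patent Document 1. The function names are hypothetical. It relies on the equivalence A : B :: x : D ⇔ B : A :: D : x, and the sentences are assigned to A and B so that a single substitution suffices, which is a choice made for this sketch and not the numbering of FIG. 4.

```python
from typing import Optional


def solve_single_substitution(a: str, b: str, c: str) -> Optional[str]:
    """Solve a : b :: c : x when a and b differ in one contiguous span.

    Simplified, hypothetical solver; NOT the method of Patent Document 1.
    """
    shortest = min(len(a), len(b))
    # Length of the longest common prefix of a and b.
    p = 0
    while p < shortest and a[p] == b[p]:
        p += 1
    # Length of the longest common suffix that does not overlap the prefix.
    s = 0
    while s < shortest - p and a[-1 - s] == b[-1 - s]:
        s += 1
    old, new = a[p:len(a) - s], b[p:len(b) - s]
    # The span to replace must be non-empty and occur exactly once in c.
    if not old or c.count(old) != 1:
        return None
    return c.replace(old, new)


def solve_first_analogy(a: str, b: str, d: str) -> Optional[str]:
    """Solve a : b :: x : d for x via the equivalent form b : a :: d : x."""
    return solve_single_substitution(b, a, d)


# Usage: A and B differ only in 濃/淡, and D contains 淡.
A = "もっと濃い色はありますか。"
B = "もっと淡い色はありますか。"
D = "もっと淡いいろはありますか。"
C = solve_first_analogy(A, B, D)  # "もっと濃いいろはありますか。"
```

When the two sentences differ in more than one span, this sketch returns `None`; a full solver for analogies on symbol strings would be needed to cover such cases.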

Note that, in the method described in Patent Document 1, the analogical relation may fail to hold for the set 204 of three sentences, in which case no first analogical sentence C can be generated. In that case, sentences are selected again from the first file 100 (see FIG. 4) to form a new sentence pair 202. The method described in Patent Document 1 may also generate more than one first analogical sentence. In that case, the processing described below is performed for each of the generated first analogical sentences.

When the first analogical sentence C 206 has been generated, step 126 starts. In step 126, the first file 100 of the database 88 (see FIG. 4) is searched for a sentence identical to the first analogical sentence C 206. In the database 88 shown in FIG. 4, the sentence with identification number 3 in the first file 100 matches the first analogical sentence C 206.

If a sentence matching the first analogical sentence C 206 exists in the first file 100, then the sentences A and B of the sentence pair 202 (see FIG. 7) and the first analogical sentence C 206 all exist in the first file 100 of the database 88 (see FIG. 4). The set 220 of three sentences consisting of sentence A, sentence B, and the first analogical sentence C 206 is therefore taken as the search result.

If no sentence matching the first analogical sentence exists in the first file 100 (see FIG. 4), then no divided character string corresponding to the first analogical sentence C 206 exists in the database 88 either. In this case, the division program 90 is called recursively with the first analogical sentence C 206 as its argument, and examples are added to the database 88. This operation is described later.

-Generation of the second analogical sentence-
FIG. 8 schematically shows the operation of generating a second analogical sentence from the retrieved set 220 of three sentences and the examples. Referring to FIG. 8, the analogical relation A : B :: C : D holds, as described above, between the argument D 200 and sentence A, sentence B, and the first analogical sentence C 206 (see FIG. 7) of the retrieved set 220. In other words, the argument D 200 can be derived by analogy from the set 220 of three sentences. Furthermore, sentence A, sentence B, and the first analogical sentence C all exist in the first file 100 (see FIG. 4), so the divided character strings corresponding to these sentences can be obtained from the second file 102 (see FIG. 4). In step 146 (see FIG. 6), the divided character strings A′, B′, and C′ corresponding to sentences A, B, and C of the set 220 are read from the second file 102 to form a set 224 of three divided character strings. In this example, the entries with identification numbers 2, 4, and 3 are read from the second file 102 of the database 88 shown in FIG. 4 to form the set 224 of divided character strings.

Once the set 224 of divided character strings has been formed, in step 148 a second analogical sentence D′ 226 is generated from the set 224 by analogy on symbol strings. For example, the set 224 of divided character strings shown in FIG. 8 yields the divided character string 「もっと｜淡い｜いろ｜は｜あります｜か｜。」. The divided character strings A′, B′, and C′ correspond to sentences A, B, and C. The second analogical sentence D′ 226 is therefore the divided character string corresponding to the argument D 200. If the second analogical sentence D′ 226 is generated successfully, the pair 228 of the argument D 200 and the second analogical sentence D′ 226 is stored in the memory 94 (see FIG. 3). If generation of the second analogical sentence D′ 226 fails, nothing is stored in the memory 94, and the series of processing for the first analogical sentence C 206 ends.
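
The two analogies can be chained as in the following sketch, which covers only the case where C is already in the database (no recursion). Several things here are assumptions made for illustration: the database is modeled as a Python dict from sentence to divided character string, `substitute` is a simplified single-substitution solver rather than the method of Patent Document 1, the final check that removing 「｜」 from D′ gives back D is an extra sanity check not stated in the description, and the example entries are chosen for the sketch rather than copied from FIG. 4.

```python
from itertools import permutations
from typing import Dict, Optional

BAR = "｜"  # boundary symbol used in the divided character strings


def substitute(a: str, b: str, c: str) -> Optional[str]:
    """Solve a : b :: c : x for the single-substitution case only (stand-in)."""
    shortest = min(len(a), len(b))
    p = 0
    while p < shortest and a[p] == b[p]:                 # common prefix
        p += 1
    s = 0
    while s < shortest - p and a[-1 - s] == b[-1 - s]:   # common suffix
        s += 1
    old, new = a[p:len(a) - s], b[p:len(b) - s]
    if not old or c.count(old) != 1:
        return None
    return c.replace(old, new)


def divide_once(d: str, db: Dict[str, str]) -> Optional[str]:
    """Return a divided character string for d from examples in db, or None."""
    for a, b in permutations(db, 2):              # ordered pairs (A, B)
        c = substitute(b, a, d)                   # first analogy  A : B :: x : D
        if c is None or c not in db:              # C must exist in the database
            continue
        d_div = substitute(db[a], db[b], db[c])   # second analogy A' : B' :: C' : x
        if d_div is not None and d_div.replace(BAR, "") == d:
            return d_div
    return None


db = {
    "もっと濃い色はありますか。": "もっと｜濃い｜色｜は｜あります｜か｜。",
    "もっと淡い色はありますか。": "もっと｜淡い｜色｜は｜あります｜か｜。",
    "もっと濃いいろはありますか。": "もっと｜濃い｜いろ｜は｜あります｜か｜。",
}
result = divide_once("もっと淡いいろはありますか。", db)
# result == "もっと｜淡い｜いろ｜は｜あります｜か｜。"
```

Returning the first solution found is a simplification; the description stores every pair that is generated successfully.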

When the above processing has been completed for all first analogical sentences generated in step 122 of FIG. 5, sentences are selected again from the first file 100 (see FIG. 4) in step 120A of FIG. 5 to form a new sentence pair, and processing of the new pair begins.

When the above processing has been completed for all sentence pairs, the return value of the division program 90 is set. That is, if a pair 228 of a sentence and a divided character string is stored in the memory 94 (see FIG. 3), the return value of the division program 90 (see FIG. 3) is set to the value (0) indicating normal termination together with the address at which the pair 228 is stored in the memory 94. If no such pair is stored in the memory 94, the return value is set to the value (1) indicating an error.

Referring to FIG. 3, when the input/output unit 92 receives as the return value the value (0) indicating normal termination and the address at which the pair 228 of a sentence and a divided character string is stored in the memory 94, it uses the return value to read the pair 228 from the memory 94 and outputs the divided character string 226 of that pair as the output sentence 84.

-Supplementing the examples-
The following describes the operation when no sentence matching the first analogical sentence C 206 exists in the first file 100 (see FIG. 4). In that case, the pair of the first analogical sentence C 206 and its divided character string is not stored in the database 88.

If a divided character string corresponding to such a first analogical sentence C 206 can be obtained, the database 88 can be enriched by adding the first analogical sentence C 206 and its divided character string to the first file 100 and the second file 102 of the database 88, respectively. Therefore, in step 154 (see FIG. 6), the division program 90 is called recursively with the first analogical sentence C 206 as the new argument, in an attempt to generate the corresponding divided character string. If the recursive call generates a second analogical sentence for the new argument, that is, a divided character string for the new argument, that divided character string is obtained through the return value. If none is generated, the value (1) indicating an error is returned. The sentence corresponding to the generated divided character string is stored in the first file 100 of the database 88 (see FIG. 4), and the generated divided character string is stored in the second file 102.
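
The control flow of steps 154 to 158 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the database is modeled as a dict (its keys playing the role of the first file 100 and its values the second file 102), the recursive call is passed in as a callable named `divide`, and the address in the memory 94 is replaced by a plain list of pairs. All names are hypothetical; only the return codes 0 and 1 come from the description.

```python
from typing import Callable, Dict, List, Tuple

OK, ERROR = 0, 1                      # return codes named in the description
Pair = Tuple[str, str]                # (sentence, divided character string)
DivideFn = Callable[[str], Tuple[int, List[Pair]]]


def supplement(c: str, db: Dict[str, str], divide: DivideFn) -> bool:
    """Try to divide C through a recursive call and add the result to db."""
    status, pairs = divide(c)          # step 154: call with C as the argument
    if status == ERROR:                # step 156: examine the return value
        return False                   # nothing is added to the database
    for sentence, divided in pairs:    # step 158: add every pair that was stored
        db[sentence] = divided
    return True
```

In a complete program `divide` would be the division routine itself. The description does not state how the recursion is bounded, so a practical version would likely also need a depth limit or a record of sentences already attempted, which is an assumption beyond the source.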

[Experiments]
-Japanese word segmentation experiment-
Using the method of this embodiment, an experiment was conducted in which Japanese sentences were divided into words. A corpus of 15373 sentences was prepared for the experiment. FIG. 9 shows examples of the prepared sentences. Referring to FIG. 9, the first file 250 lists the undivided sentences. The second file 252 lists the correct divided character string for each sentence in the first file 250. In the second file 252, the symbol 「｜」 is inserted between the words of each sentence. In the 15373 prepared sentences, the number of words per sentence had a mean of 7.60 and a standard deviation of 3.34.

In this experiment, each sentence listed in the first file 250 was used as the input sentence 82 (see FIG. 3). When an output sentence 84 was obtained for an input, it was compared against the second file 252.

When the 15373 prepared sentences were divided by the method of this embodiment, an output sentence 84 was obtained for 15370 of them. Comparison of these output sentences 84 with the correct answers showed that all of them had been divided correctly.

-English word segmentation experiment-
An experiment was also conducted in which English sentences were divided into words using the method of this embodiment. A corpus of 52848 sentences was prepared for the experiment. FIG. 10 shows examples of the prepared sentences. Referring to FIG. 10, because English sentences are written with spaces between words, the prepared sentences themselves serve as the correct divisions. The second file 262 lists the prepared sentences. The first file 260 contains the character strings used as the input sentences 82 in this experiment; it lists each sentence of the second file 262 with the space characters removed. In the 52848 prepared sentences, the number of words per sentence had a mean of 5.69, a standard deviation of 2.12, and a maximum of 23.

In this experiment, the character strings listed in the first file 260 were used as input sentences. In other words, the division performed in this experiment amounts to reinserting the space characters into the input sentence. When an output sentence 84 was obtained, it was compared against the second file 262.
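
The setup of this experiment can be sketched as follows: each reference sentence has its spaces removed to form the input, and any output is compared with the reference. The function names and the `divide` callable are hypothetical placeholders; `divide` stands for whatever division routine is being evaluated and is assumed to return `None` when no output sentence is produced.

```python
from typing import Callable, Iterable, Optional, Tuple


def make_input(reference: str) -> str:
    """Build an undivided input string by removing the space characters."""
    return reference.replace(" ", "")


def evaluate(
    references: Iterable[str],
    divide: Callable[[str], Optional[str]],
) -> Tuple[int, int, int]:
    """Return (total inputs, outputs produced, outputs that differ from the reference)."""
    total = produced = wrong = 0
    for ref in references:
        total += 1
        out = divide(make_input(ref))
        if out is None:          # no output sentence was obtained for this input
            continue
        produced += 1
        if out != ref:           # compare with the correct division
            wrong += 1
    return total, produced, wrong
```

For instance, `evaluate(gold_sentences, my_divider)` would yield the three counts reported for an experiment of this kind, where `gold_sentences` and `my_divider` are names supplied by the caller.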

When the 52848 prepared sentences were divided into words by the method of this embodiment, an output sentence 84 was obtained for 50897 of them. Comparison with the correct answers showed that only 4 of the 50897 output sentences 84 were incorrect.
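
The reported counts can be turned into rates with a short calculation. The percentages are derived here from the counts given in the text and do not themselves appear in the source; the variable names are illustrative.

```python
def rate(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`."""
    return 100.0 * part / whole


# Japanese experiment: 15373 inputs, 15370 outputs, all outputs correct.
ja_coverage = rate(15370, 15373)

# English experiment: 52848 inputs, 50897 outputs, 4 outputs incorrect.
en_coverage = rate(50897, 52848)
en_correct_among_outputs = rate(50897 - 4, 50897)

print(f"Japanese coverage: {ja_coverage:.2f}%")
print(f"English coverage: {en_coverage:.2f}%")
print(f"English correct among outputs: {en_correct_among_outputs:.2f}%")
```

Under these counts, an output was obtained for roughly 99.98% of the Japanese inputs and 96.31% of the English inputs, and about 99.99% of the English outputs matched the reference.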

As described above, the sentence division method of this embodiment can divide sentences regardless of language, script, grammar rules, and the like, merely by preparing examples, each consisting of a pair of an undivided character string and its divided character string. Moreover, because division proceeds while examples are automatically added to the database, it can be performed without preparing a large amount of data in advance or carrying out a training stage.

Furthermore, unlike statistical methods using an N-gram model, the size of the context considered is not fixed, so an input sentence can be processed in a long context spanning the whole sentence.

In the above embodiment, division was performed using only examples such as those shown in FIG. 4, but the present invention is not limited to such an embodiment. For example, a dictionary-based division method may be used in combination.

The embodiment disclosed herein is merely illustrative, and the present invention is not limited to the embodiment described above. The scope of the present invention is indicated by each claim, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 shows the external appearance of a computer system for implementing a sentence division system according to one embodiment of the present invention.
FIG. 2 is a block diagram of the computer system shown in FIG. 1.
FIG. 3 is a block diagram showing the functional configuration of the sentence division system 80 according to one embodiment of the present invention.
FIG. 4 shows the structure of the database 88.
FIG. 5 is a flowchart of the division program 90 shown in FIG. 3.
FIG. 6 is a flowchart of the processing executed in step 126 of FIG. 5: searching for the first analogical sentence, generating the second analogical sentence, and supplementing the examples.
FIG. 7 schematically shows the operation of generating a first analogical sentence and searching for it in the database 88.
FIG. 8 schematically shows the operation of generating the output sentence 84 from the three retrieved character strings.
FIG. 9 shows an overview of the Japanese sentences used in the experiment conducted on one embodiment of the present invention.
FIG. 10 shows an overview of the English sentences used in the experiment conducted on one embodiment of the present invention.

Explanation of symbols

80 sentence division system
82 input sentence
84 output sentence
86 word segmentation device
88 database
90 division program
92 input/output unit
100 first file
102 second file

Claims

A sentence division computer program for use with a computer-readable database containing three or more pairs, each consisting of a sentence and a divided character string obtained by dividing that sentence into predetermined units, the program including:
a sentence division program part that, when given a sentence to be processed, generates a divided character string for that sentence using the database; and
an input/output program part that gives an externally supplied input sentence to the sentence division program part and outputs the divided character string that the sentence division program part produces for the input sentence as the divided character string for the input sentence,
wherein the sentence division program part includes:
a first analogy program part that, when given the sentence to be processed, selects two arbitrary sentences from the database, generates a predetermined first analogical equation from the two sentences and the sentence to be processed, and solves the first analogical equation to generate an analogical sentence;
a determination program part that determines whether the analogical sentence exists in the database;
a read program part that, in response to the determination program part determining that the analogical sentence exists in the database, reads from the database the divided character string paired with the analogical sentence and the two divided character strings paired with the two sentences, respectively; and
a second analogy program part that uses the three divided character strings read by the read program part to generate a second analogical equation having a predetermined relation to the first analogical equation, solves the second analogical equation to generate an analogical divided character string, and outputs it as the return value of the sentence division program part.

The sentence division computer program according to claim 1, wherein the first analogy program part includes:
a program part that, given the sentence to be processed, selects all ordered pairs (A, B) of sentences in the database; and
an analogy solving program part that, for each selected ordered pair, generates a first analogical equation "A : B :: x : D" with the sentence D to be processed, attempts to solve it to obtain a solution x = C, and outputs all solutions C so generated.

The sentence division computer program according to claim 2, wherein the sentence division program part further includes:
a recursive call program part that, in response to the determination program part determining that the analogical sentence does not exist in the database, recursively calls the sentence division program part with the analogical sentence as the sentence to be processed; and
a program part that adds to the database, as a pair, the analogical sentence and the divided character string output by the sentence division program part called by the recursive call program part.