JP2004318300A

JP2004318300A - Clause boundary detection device, machine translation device, and computer program

Info

Publication number: JP2004318300A
Application number: JP2003108676A
Authority: JP
Inventors: Takehiko Maruyama; 岳彦丸山; Hidenori Kashioka; 秀紀柏岡; Tadashi Kumano; 正熊野; Hideki Tanaka; 英輝田中
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2003-04-14
Filing date: 2003-04-14
Publication date: 2004-11-11
Anticipated expiration: 2023-04-14
Also published as: JP3924260B2

Abstract

【課題】日本語の発話の中から、局所的な情報のみから節境界を検出する事ができる節境界検出装置を提供する。
【解決手段】節境界検出装置３０は、文章に対して形態素解析を行なう事により得られた形態素列から、元の文章の節境界を検出するための節境界検出装置であって、形態素列の中において、所定の形態素の並びのパタンを検出するための検出処理部１４８と、パタンが検出された事に応答して、形態素列の中で、検出されたパタン中の形態素の並びを、その形態素の並びの末尾に節境界を表す節ラベルを付加したものと置換する処理を行なって形態素列を出力するための置換処理部１５４とを含む。
【選択図】図７
PROBLEM TO BE SOLVED: To provide a clause boundary detection device capable of detecting clause boundaries in Japanese utterances from local information alone.
SOLUTION: A clause boundary detection device 30 detects the clause boundaries of an original text from a morpheme sequence obtained by performing morphological analysis on the text. It includes a detection processing unit 148 for detecting a predetermined pattern of consecutive morphemes in the morpheme sequence, and a replacement processing unit 154 which, in response to detection of the pattern, replaces the run of morphemes in the detected pattern with the same run followed by a clause label indicating a clause boundary, and outputs the morpheme sequence.
[Selected drawing] FIG. 7

Description

【０００１】
【発明の属する技術分野】
この発明は、自然言語処理を適切に行なえる様にするための前処理を行なう装置に関し、特に、翻訳等の処理が適切に行なえる様に節単位に入力テキストを分離する前処理を行なうための節境界検出装置、そうした節境界検出装置を採用した機械翻訳装置、およびそれらのためのコンピュータプログラムに関する。
【０００２】
【従来の技術】
近年、独話（複数の発話のまとまりであって、講演、ニュース等、発話者が１人のもの）を対象とした自然音声コーパスの構築が進んでいる。講演、ニュースまたは学会発表等、１人の話者が話しつづける独話は、対話（複数の発話のまとまりであって、２人の発話者が発話を交換するもの）よりも１文の長さが長くなったり、文の構造が複雑化したりするという特徴を持つ事が知られている。
【０００３】
図１に、典型的な独話の例であるテレビジョンのニュース（日本語）、及び典型的な対話の例である旅行会話（対訳形式の日本語部分）での１文あたりの形態素数及び文節数を示す。図１から分かる様に、１文あたりの形態素数及び文節数のいずれにおいても、独話の方が対話よりもはるかに多い。
【０００４】
さらに、自発的な発話になるほど、明示的な文末表現の現れにくくなる傾向があり、文の境界を認識する事が困難になる。
【０００５】
独話又は対話等の音声認識を行なったり、翻訳を行なったりする自然言語処理技術では、従来、「文」を基本的な処理単位とする場合が大半である。
【０００６】
しかし、１文が長く、文末が確定しにくいという性格を持つ独話を自然言語処理の対象とする場合、文が長くなる事によって構文解析のあいまい性が爆発するという問題がある。また、文末がはっきりしないために、自然言語処理の対象となるものがはっきりせず、どこまで入力を待てば処理を始めることができるのかが分からないという問題が発生する。
【０００７】
こうした問題は、独話を機械翻訳する場合にも現れる。独話を機械翻訳する場合、発話に追従して翻訳を出力する同時通訳としての運用が望ましい。しかし上記した様に独話では１文が長くなるため、解析が失敗したり、その結果として翻訳が失敗したりするという問題がある。仮に翻訳が成功したとしても、同時通訳としての追従性に欠けるという問題がある。また、文末が確定しにくい場合、どの時点でどの部分を対象に翻訳を開始するかを定めることは難しい。
【０００８】
【非特許文献１】
益岡隆志・田窪行則著、「基礎日本語文法‐改訂版‐」、くろしお出版、１９９２
【発明が解決しようとする課題】
従って、特に独話について、発話の中で各種の処理を漸進的に進めておくために、文とは異なる単位を発話中で随時検出できる様にする事が望ましい。可能であれば、その処理単位は文よりも短いほうが望ましい。
【０００９】
文よりも短い処理単位として、述語を中心としたまとまりである「節」を用いる事が望ましいと考えられる。節は、統語的・意味的にまとまった単位であり、翻訳または文の要約等の処理を節単位で行なうと有効であると考えられる。そこで、節境界を自動的に検出する手法が必要となる。
【００１０】
節境界を検出する手法としてまず考えられるのは、構文解析器を用いて文を解析した結果から、節境界に相当する位置を特定する方法である。しかし、構文解析器は一般に入力として「文」を要求するものである。そのため、文末が入力されて構文解析が済むまでは、節境界の検出を始める事は難しい。この制約は、同時通訳の様に入力を漸進的に処理していく必要がある場合、望ましくない。漸進的な処理を行なうためには、発話の入力中であっても、局所的な情報のみから節境界の位置を検出できる事が望ましい。また、節境界により分離される節がどの様なものであるかを知る事ができれば、自然言語処理技術にとって有用なだけでなく、言語学的な分析にも応用できるため、より好ましい。
【００１１】
従って、本発明の目的は、日本語の発話の中から、局所的な情報のみから節境界を検出する事ができる節境界検出装置を提供する事である。
【００１２】
この発明の他の目的は、日本語の発話の中から、局所的な情報のみから節境界を随時検出する事ができる節境界検出装置を提供する事である。
【００１３】
この発明のさらに他の目的は、日本語の発話の局所的な情報のみから節境界を検出し、当該節境界により分離される節がどの様な種類の節かを判定する事ができる節境界検出装置を提供する事である。
【００１４】
この発明のさらに他の目的は、日本語の発話の中から節を随時検出し、節ごとに自動的に翻訳を行なう事ができる機械翻訳装置を提供する事である。
【００１５】
【課題を解決するための手段】
本発明の第１の局面に係る節境界検出装置は、文章に対して形態素解析を行なう事により得られた形態素列から、元の文章の節境界を検出するための節境界検出装置であって、形態素列の中において、所定の形態素の並びのパタンを検出するための検出手段と、パタンが検出された事に応答して、形態素列の中で、検出されたパタン中の形態素の並びと所定の関係にある位置を節の境界に指定する予め定める処理を行なって形態素列を出力するための境界指定手段とを含む。
【００１６】
好ましくは、境界指定手段は、パタンが検出された事に応答して、ある位置に節の境界を示す境界マーカを挿入して形態素列を出力するための手段を含む。
【００１７】
さらに好ましくは、検出手段は、形態素列の中において、複数個のパタンのうちの任意の一つを検出するための手段を含む。
【００１８】
境界指定手段は、任意の一つを検出するための手段により上記パタンのうちの任意の一つが検出された事に応答して、検出されたパタン中の形態素の並びと所定の関係にある位置に、検出されたパタンに対応して予め定められた節境界ラベルを挿入するためのラベル挿入手段を含んでもよい。
【００１９】
節境界ラベル又は節境界マーカが挿入される位置は、検出されたパタン中の末尾の形態素の直後でもよい。
【００２０】
好ましくは、検出手段は、形態素列を順次読込んでＦＩＦＯ（Ｆｉｒｓｔ−ＩｎＦｉｒｓｔ−Ｏｕｔ）方式で記憶して出力するための一時記憶手段と、一時記憶手段に記憶された形態素の配列の中に、所定の形態素の並びのパタンがある事を検出するための手段とを含み、境界指定手段は、所定の形態素の並びのパタンがある事が検出された事に応答して、一時記憶手段の所定の形態素の並びのパタンまでを出力する様に一時記憶手段を制御するための手段と、一時記憶手段から出力される所定の形態素の並びのパタンの末尾に、節境界を示すマーカを挿入するための手段とを含んでもよい。
【００２１】
さらに好ましくは、検出手段は、形態素列を順次読込んでＦＩＦＯ方式で記憶して出力するための一時記憶手段と、一時記憶手段に記憶された形態素の配列の中に、複数個の所定の形態素の並びのパタンのうちの任意の一つがある事を検出するための手段とを含み、境界指定手段は、任意の一つのパタンが検出された事に応答して、一時記憶手段中の、検出されたパタンまでをＦＩＦＯ方式で出力する様に一時記憶手段を制御するための手段と、任意の一つのパタンが検出された事に応答して、一時記憶手段から出力されるパタンの末尾に、検出されたパタンに対応した節境界ラベルを挿入するための手段とを含む。
【００２２】
本発明の第２の局面に係るコンピュータプログラムは、コンピュータにより実行されると、当該コンピュータを、上記したいずれかの節境界検出装置として動作させるものである。
【００２３】
本発明の第３の局面に係る機械翻訳装置は、入力される日本語の文章に対して形態素解析処理を行ない、得られる形態素列を出力するための形態素解析手段と、上記したいずれかの節境界検出装置であって、形態素解析手段の出力を入力として受ける様に接続された節境界検出装置と、節境界検出装置から出力される形態素列を、当該形態素列中の節境界によって節に分離するための節分離手段と、節分離手段により分離された形態素列を入力とし、節分離手段から節を受けとった事に応答して、受けた節を翻訳するための機械翻訳手段とを含む。
【００２４】
好ましくは、節境界検出装置は出力する形態素列の節境界に節境界マーカを挿入する機能を持ち、節分離手段は、節境界検出装置からの出力をＦＩＦＯ方式で一時記憶するための記憶手段と、節境界検出装置から節境界マーカが出力された事に応答して、記憶手段に記憶された形態素列を機械翻訳手段に与え、機械翻訳を開始させるための手段とを含む。
【００２５】
本発明の第４の局面に係るコンピュータプログラムは、コンピュータにより実行されると、当該コンピュータを上記した機械翻訳装置として動作させるものである。
【００２６】
【発明の実施の形態】
［第１の実施の形態］
‐節境界検出の原理‐
構文解析を行なわずに節境界を検出するために、本実施の形態では、入力テキストを形態素解析し、形態素の局所的な連接関係のみを手掛かりとして節境界を検出する。そのために、形態素の局所的な連接関係をパタンに分類し、特定のパタンが検出された場合に節境界を特定するルールを作成し、このルールに従って自動的に節境界を特定する。この節境界検出ルールは、節境界の位置を発見するための形態素列パタンと、節境界の種類をあらわす節境界ラベルの組とからなっている。構文解析は必要としない。
【００２７】
‐構成‐
図２に、本実施の形態の節境界検出装置を採用した翻訳装置の機能ブロック図を示す。この実施の形態のシステムは、テキスト処理が可能な既存の言語処理系（具体的にはＰｅｒｌ処理系）を用い、節境界検出ルールをＰｅｒｌの正規表現置換を用いたスクリプトの形式で実装している。
【００２８】
図２を参照して、この翻訳装置３０は、ユーザからの起動コマンド３２に応答して、日本語の入力テキスト３４を英語に機械翻訳し、結果を翻訳出力３６として出力するためのものである。翻訳装置３０は、入力テキスト３４を形態素解析し、形態素列を出力するための形態素解析部５４を含む。形態素解析部５４としては、既存の形態素解析用プログラムを用いる事ができる。図３に、形態素解析用プログラムとしてよく知られているものの出力形式と出力例とを示す。図３の詳細については後述する。
【００２９】
再び図２を参照して、翻訳装置３０はさらに、上記した節境界検出ルールをＰｅｒｌの正規表現命令列からなるスクリプトの形で実装したプログラム５２と、形態素解析部５４の出力する形態素列に対してプログラム５２を適用する事により、節境界ごとに節ラベルが挿入された処理後のテキストを出力するための言語処理系５６と、言語処理系５６の出力をＦＩＦＯ方式で一時的に蓄積するバッファ５８と、言語処理系５６から節ラベルが出力されるごとに、バッファ５８に蓄積されたテキストを読出して出力する事により、テキストを節に分離するためのテキスト分離部６０と、テキスト分離部６０から与えられるテキストを英語に翻訳して翻訳出力３６として出力するための機械翻訳部６２と、ユーザからの起動コマンド３２に応答して入力テキスト３４およびプログラム５２を読込み、形態素解析部５４および言語処理系５６等を起動するためのオペレーティングシステム（ＯＳ）５０とを含む。ここで節の境界を示す情報を「節ラベル」と呼んでいるのは、節の種類を示す情報を含んでいるためである。この節ラベルは、そこに節境界が存在している事を表すものでもあり、節境界を示すマーカとしての役割も果たしている。
【００３０】
ＯＳ５０、言語処理系５６を備え、プログラム５２を実行する事からも分かる様に、翻訳装置３０は実質的にはコンピュータから構成されている。入力テキスト３４および翻訳出力３６はそれぞれ、標準入力および標準出力を示すが、本実施の形態では入力テキスト３４は所定のファイルから与えられ、翻訳出力３６も所定のファイルとして出力されるものとする。
【００３１】
図３を参照して、形態素解析部５４の出力形式８０について説明する。形態素解析部５４が出力する形態素は、出力形式８０に示される様に、形態素の出現形と、その品詞と、その活用形と、出現したときの活用型とからなる。「活用形」とは、動詞、助動詞、形容詞等の活用のしかたの分類を示すものである。例は「五段活用」「下二段活用」の如きものである。活用型とは、出現した形態素が、各活用形の中でどの様な活用をされているかを示すものである。例は「未然形」「連用形」「連体形」等の如きものである。
【００３２】
図３に、入力例８２として「私は学校へ行きました」という一文を示す。これを形態素解析部５４により形態素解析した結果を解析結果８４に示す。解析結果８４から明らかな様に、形態素解析部５４は入力テキスト３４を形態素解析し、出力形式８０に従った形式の形態素列を出力する機能を持つ。
【００３３】
図４に、節境界検出ルールの例を実装したＰｅｒｌによるコマンドの例を示す。図４を参照して、Ｐｅｒｌによる置換コマンドの一般形１００は、置換を示すコマンド「ｓ」と、置換の際に検索すべき検索文字列と、検索された文字列を置換すべき文字列と、置換の際のオプション機能を指定するオプション文字列とを「／（スラッシュ）」により区切った形式となっている。検索文字列及び置換文字列には、それぞれ「正規形」と呼ばれる形式を使用する事ができる。Ｐｅｒｌに限らず、この様な正規形が使用できる言語処理系は数多く存在する。正規形について必要な場合には以下で説明を加えてあるが、一般的な説明については各言語処理系の解説書等を参照されたい。
【００３４】
図４の基本形１０２により、本実施の形態でＰｅｒｌにより実装された節境界検出ルールの一般形を示す。基本形１０２は、一般形１００において、検索文字列を形態素列パタン１１０とし、置換文字列を「＄１￥／節ラベル￥／」という置換文字列表現１１２とし、オプションとして「ｇ」を指定したものである。
【００３５】
形態素列パタン１１０は丸カッコで囲まれている。これは置換文字列表現１１２の中の「＄１」に対応する。置換文字列表現１１２中の「＄１」は、この部分を検索文字列中で丸カッコで囲まれた文字列の中で１番目のものと置換する事を示す。検索文字列の中で丸カッコで囲まれた文字列は形態素列パタン１１０しかないので、＄１は形態素列パタン１１０により置換される。
【００３６】
置換文字列表現１１２の中の「￥」はエスケープ文字であり、この直後の一文字をコマンドの一部ではなく単なる文字として扱う事を示す。この例では、置換文字列が「／」を含んでいるのに対し、この「／」がコマンド中でも使用されているので、置換文字列中のスラッシュを単なる文字列として扱うために「￥」を使用している。節ラベルについては後述する。
【００３７】
オプション「ｇ」は、グローバルサーチを示す。すなわち、検索文字列で入力文字列を検索した結果、最初のマッチが見出されたところで検索を終了するのではなく、マッチがいくつ見出されても入力文字列の全体について検索および置換を行なう事を示す。
【００３８】
すなわち、基本形１０２によれば、形態素列パタン１１０にマッチする形態素列があれば、その形態素列の最後に「／節ラベル／」で示される文字列が挿入される事になる。
【００３９】
図４には、具体的な節境界検出ルールの第１の例１０４も示されている。この例１０４は、入力される形態素列中に「けれども」という出現形で「助詞‐接続助詞」という品詞を持つものがあれば、その部分を全て「けれども／並列節ケレドモ／」という文字列で置換せよ、というものである。
【００４０】
同じく図４には、具体的な節境界検出ルールの第２の例１０６も示されている。この例１０６は、入力される形態素列中に「連用タ接続」または「連用形」という活用型の形態素があり、その直後に「たら」という出現形で「助動詞」という品詞を持ち、「特殊・タ」という活用形で「仮定形」という活用型を持つパタンがあれば、それらを全て、その末尾に「／条件節タラ／」という文字列を付加したもので置換せよ、というものである。検索文字列中の「｜」は、「または」という意味を表す。
【００４１】
本実施の形態では、この様な節境界検出ルールとして３６１個のルールを用いている。全てのルールは、１個から３個の連接する形態素から構成されるパタンを持つ。入力には読点が含まれていない事を想定し、パタンに読点は含めていない。
【００４２】
図５に、本実施の形態で検出される節の種類の一部を示す。本実施の形態では、非特許文献１に記載されている従属節の形態（補足節、副詞節、連体節、および並列節）を増補及び改編して作成したもので、合計１４４種類の節を用いる。これらの中には、統語的に大きな切れ目になると考えられる主題「は」、談話標識、および感動詞を検出するパタンも含まれている。本明細書では、これらも含めて「節境界」と考える事とする。
【００４３】
本実施の形態で用いられている節ラベルは、実際には図５に示したものをさらに細かく分類したものである。例えば、「タメ節」の下位には「タメニ節」「タメニハ節」という節境界が設定してある。これら下位の節境界を合計すると１４４種類となるという事である。
【００４４】
図６に、プログラム５２の実際の形式を示す。図６を参照して、プログラム５２は、Ｐｅｒｌの書式に従ってＰｅｒｌの処理系へのパスを示す行（１行目）を含む。２行目は、入力されるテキストが存在する限り次の中かっこ「｛」および「｝」に囲まれた部分の処理を繰返し実行する事を示す命令である。この中かっこの中が、上記した節境界検出ルールの本体である。入力があると、ここに記載された全てのグローバル置換命令を実行し、置換後のテキストを末尾の「ｐｒｉｎｔ」命令によって標準出力に出力し、次の入力に対する処理に移る。
【００４５】
図７に、図２に示す言語処理系５６およびプログラム５２によって実現される節境界検出処理の実態をフローチャート形式で示す。言語処理系５６自体は図７に示すものと異なり汎用的な機能を備えたものであるが、ここではプログラム５２を言語処理系５６で実行する場合についてのみ、その動きを示す。後述する様に、プログラム５２および言語処理系５６により実現される処理を専用のプログラムで実装する場合には、その制御構造は例えばこの図７に示されたものの様になる。
【００４６】
図７を参照してこの処理は、関連するファイル（入力ファイルおよび出力ファイル等）をオープンするステップ１４０と、入力テキストファイルの１行目（改行コードまでの文字列）を読込むステップ１４２と、ステップ１４２の結果、入力ファイルの末尾（ＥＯＦ：ＥｎｄＯｆＦｉｌｅ）に達したか否かを判定するステップ１４４とを含む。判定結果がＹＥＳであれば制御はステップ１６２に進み、さもなければ制御はステップ１４６に進む。なお、複数の入力ファイルに対して連続してこの処理を実行する事もできるが、ここでは説明を簡明にするために一つのファイルに対して処理を行なうものとする。
【００４７】
ステップ１４６では、初期処理を行なう。初期処理では、入力されたテキストに対し、節境界の検出処理を行なう上で妨げとなる様な要素を入力テキストから除去する処理を行なう。
【００４８】
続いてステップ１４８では、１番目の置換コマンドのグローバル検索を行なう。ステップ１５０では、プログラム５２内の全ての置換コマンドを実行したか否かを判定する。全て実行が終わっていれば制御はステップ１５８に進む。さもなければ制御はステップ１５２に進む。
【００４９】
ステップ１５２では、検索の結果、置換コマンドの検索文字列の正規表現にマッチした部分があったか否かを判定する。マッチがあれば制御はステップ１５４に進む。さもなければ制御はステップ１５０に戻る。
【００５０】
ステップ１５４では、マッチがあった部分を全て置換文字列で置換する処理が行なわれる。全て置換が終わったらステップ１５６で処理を次の置換コマンドに進めて制御をステップ１５０に戻す。
【００５１】
ステップ１５０で全ての置換コマンドの実行が完了したと判定された場合、制御はステップ１５８に進む。ステップ１５８では、置換処理が完了した１行分のテキストを標準出力に書き出す処理が実行される。続いて入力テキストファイルの次の１行を読込む。制御はこの後ステップ１４４に戻る。
【００５２】
一方、ステップ１４４で入力ファイルのＥＯＦに到達したと判定された場合、ステップ１６２で関連のファイルを全てクローズして処理を終了する。
【００５３】
‐動作‐
この機械翻訳装置は以下の様に動作する。図２を参照して、ユーザが起動コマンド３２を入力したものとする。起動コマンド３２は、入力テキスト３４とプログラム５２とを特定する情報を含む。
【００５４】
ＯＳ５０はこのコマンドに応答して形態素解析部５４を起動し、入力テキスト３４を開いて形態素解析部５４で形態素解析を行なわせる。一方ＯＳ５０は、起動コマンド３２により特定されるプログラム５２を記憶装置から読出す。前述の通り、プログラム５２の１行目にはこのプログラム５２を実行するための言語処理系へのパスが記載されている。ＯＳ５０はこのパスに従って言語処理系５６を起動する。
【００５５】
形態素解析部５４から出力される形態素列は言語処理系５６に与えられる。言語処理系５６は、この形態素列に対してプログラム５２に含まれる節境界検出ルールを適用し、テキスト中の節境界に節ラベルを挿入する処理を行ない、結果をバッファ５８に出力する。
【００５６】
テキスト分離部６０は、言語処理系５６から節ラベルが出力されるごとに、バッファ５８に格納されたテキストを読出し、機械翻訳部６２に与える。
【００５７】
機械翻訳部６２は、与えられる節について機械翻訳を行ない、結果を翻訳出力３６として出力する。
【００５８】
‐処理例‐
図８を参照して、テキスト１９０に対して節境界検出処理を行なった。その結果を処理後のテキスト１９２として示す。処理後のテキスト１９２は、節境界に対応する形態素列パタンが検出された場所に挿入された節ラベルを含んでいる。たとえば「自主避難が呼びかけられている○×町の▽▽地区では」という部分は「自主避難が呼びかけられている」という節と「○×町の▽▽地区では」以下の節とに分離されている。そして、「自主避難が呼びかけられている」という節には「連体節」という節ラベルが付されている。この節ラベルはスラッシュによって本文と区切られて挿入されている。
【００５９】
‐性能評価のための実験‐
本実施の形態に係るプログラム５２および言語処理系５６により実装した節境界検出装置の性能を評価するために、性質の異なる複数のコーパスに対してルールを適用し、その結果を分析した。用意したコーパスの概略の規模を図９に示す。
【００６０】
図９に示される様に、コーパスは全部で５つ用意した。そのうち３つは独話コーパスであり、２つは対話コーパスである。
【００６１】
第１の独話コーパスは放送でのいわゆる解説番組を書き起こしたものである。第２の独話コーパスはテレビジョン放送でのニュースの原稿コーパスである。第３の独話コーパスは経済系の複数の新聞記事データベースである。一方、第１の対話コーパスは、出願人において準備したバイリンガルの旅行会話を題材とする模擬会話コーパスである。第２の対話コーパスは、海外旅行で用いられる典型的な表現を収集したコーパスである。
【００６２】
図９を参照して、１文の長さは第２の独話コーパスが突出して長く、第１および第３の独話コーパスがこれに次ぐ事が分かる。これに比して対話コーパス中の文はいずれも極端に短い事が分かる。
【００６３】
これらコーパスに上記した節境界検出処理を行なった。検出された節の数、１文に含まれる平均節数、各節に含まれる平均形態素数と平均文節数とを図１０に示す。図１０から、節境界検出処理によって検出された一つの節の長さ（形態素数および文節数）は、独話、対話を問わずコーパス間でほとんど差がない事が分かる。
【００６４】
‐評価‐
さらに節境界検出装置の性能を評価するため、各コーパスから５００文を選択し、人手で節境界の検出と判定とを行ない、正解データを作成した。上記した節境界検出装置による節境界検出処理の結果と正解データとを照合し、適合率と再現率とを求めた。その結果を図１１に表形式で示す。
【００６５】
図１１を参照して、全てのコーパスにおいて、適合率と再現率ともに非常に高く、非常によい精度で節境界が検出されている事が分かる。この様によい精度で節境界を検出し、節ごとに翻訳処理を行なう事で、機械翻訳の精度も高くなり、結果として良好な翻訳を得る事が可能になる。しかも上記した処理では、形態素列が所定の節境界パタンにマッチすれば節境界が検出できる。文末の入力が行なわれなくても漸進的に節の検出を行なう事ができる。そのため、同時翻訳等に適している。
【００６６】
‐節境界検出ルールの実際例‐
以下に、実験で実際に使用した節境界検出ルール（Ｐｅｒｌの置換コマンド形式）を示す。ここでは、ルールに相当する置換コマンドのみを示し、スクリプトの制御に属する部分は省略してある。また、実際のスクリプトにおいては１行で記載されるべきところを複数行に分けて記載した部分がある。
【００６７】
＜ルールの開始＞
【００６８】
【表１】

【００６９】
【表２】

【００７０】
【表３】

【００７１】
【表４】

【００７２】
【表５】

【００７３】
【表６】

【００７４】
【表７】

【００７５】
【表８】

【００７６】
【表９】

【００７７】
【表１０】

【００７８】
【表１１】

【００７９】
【表１２】

【００８０】
【表１３】

【００８１】
【表１４】

【００８２】
【表１５】

【００８３】
【表１６】

【００８４】
【表１７】

【００８５】
【表１８】

【００８６】
【表１９】

【００８７】
【表２０】

【００８８】
【表２１】

【００８９】
【表２２】

【００９０】
【表２３】

【００９１】
【表２４】

【００９２】
【表２５】

【００９３】
【表２６】

【００９４】
【表２７】

【００９５】
【表２８】

【００９６】
【表２９】

【００９７】
【表３０】

＜ルールの終了＞
【００９８】
なお、本実施の形態では、言語処理系５６から節境界ラベルが出力されるごとに、テキスト分離部６０がバッファ５８から形態素列を読出して機械翻訳部６２に与え、それによって機械翻訳部６２による機械翻訳がスタートする。しかし本発明はその様な実施の形態に限定されるわけではない。たとえば言語処理系５６の出力を全て一旦バッファ５８に記憶し、その後にバッファ５８の内容を節境界ラベルにより節ごとに分離して機械翻訳部６２に与える様にしてもよい。
【００９９】
また、本実施の形態では、節境界を示す形態素列のパタンが検出されると、その末尾に節境界ラベルを挿入している。しかし本発明はその様な実施の形態には限定されず、そのパタンと所定の関係にある位置に節境界ラベルを挿入する様にしてもよい。例えば、形態素列のパタン中の末尾以外の部分に節境界ラベルを挿入すべき場合もあるかもしれない。パタンの末尾以外の場所、たとえばその一つ前に節境界ラベルを挿入する様にしてもよい。この場合、節に分離するときには節境界ラベルの次の形態素までを一つの節とすればよい。また、一箇所でなく２箇所以上に節境界ラベルを挿入する様にしてもよい。たとえば節境界に対応するパタンの先頭と末尾とに節境界の開始ラベルと終了ラベルとをそれぞれ挿入する様にしてもよい。
【０１００】
さらに、上記した実施の形態では、入力テキスト３４の各行について最初にまとめて読込み、節境界検出処理を行なっている。しかし本発明はその様な実施の形態に限定されるわけではない。例えば、形態素を順次一時記憶装置にＦＩＦＯ方式で記憶し、記憶された形態素列の中に所定のパタンを満足するものがあれば、そこで節境界を検出する様にしてもよい。この場合には、一時記憶装置に記憶された形態素列を当該パタンまで順次出力し、その末尾に当該パタンに対応する節境界ラベルを挿入する様にすればよい。
【０１０１】
［第２の実施の形態］
上記した第１の実施の形態の翻訳装置３０は、節境界を検出するために、予め所定のプログラム言語（Ｐｅｒｌ）によりプログラムされたプログラム５２と、そのプログラム言語の処理系である言語処理系５６とを用いている。しかし本発明はその様な実施の形態に限定される訳ではない。汎用の言語処理系を用いる代わりに、専用のプログラムを用いる事もできる。その場合、節境界ルールについては適宜追加、変更または削除が可能となる様に、ルールのみをデータベース化しておく事が考えられる。
【０１０２】
図１２に、この実施の形態に係る節境界検出装置を採用した、コーパスの統計処理装置の機能的ブロック図を示す。この装置は、処理対象のコーパスに対し、前述した節境界検出処理を行ない、その結果として得られた各節の節ラベルの種類を統計処理し、それによってコーパスの性格を調べる事を可能とするものである。
【０１０３】
図１２を参照して、このコーパスの統計処理装置２００は、コーパス２０２を入力として、コーパス２０２に含まれる各文を節ラベル付の節に分離し、その結果を統計処理する機能を持つ。コーパスの統計処理装置２００は、コーパス２０２を入力とし、その各文を形態素解析して形態素列を出力するための形態素解析部２１０と、形態素解析部２１０の出力する形態素列に対して節境界検出処理を行ない、節境界にその直前までの節の種類を表す節ラベルを挿入してテキストとして出力する処理を行なうための節境界検出部２１２と、節境界検出部２１２から出力される節境界検出後のテキスト２１４内の節ラベルに対して統計的処理を行ない、統計出力２０４を出力するための統計処理部２１６とを含む。
【０１０４】
節境界検出部２１２は、節境界検出ルールをデータベース化したルールデータベース（ルールＤＢ）２３２と、形態素解析部２１０から出力される形態素列に対し、ルールＤＢ２３２に格納されている節境界検出ルールを適用し、実施の形態１の置換命令と同様の処理を行なって、節境界に節ラベルを挿入したテキスト列として出力するための置換処理部２３０とを含む。
【０１０５】
置換処理部２３０としては、実施の形態１のＰｅｒｌ処理系と同様、正規形を処理できる様な性能を持つものが好ましい。その場合、ルールＤＢ２３２に格納されるルールの検索文字列に相当する部分を正規表現で表現する事ができるので、ルールＤＢ２３２の容量を小さくし、かつ処理対象をもれなく適切に処理する事が可能となる。
【０１０６】
置換処理部２３０もコンピュータとソフトウェアとで実現できる。その場合の置換処理部２３０を実現するソフトウェアの構成は、図７に示したフローチャートと同様となる。
【０１０７】
形態素解析部２１０としては、実施の形態１で用いたプログラム５２と同じものを用いる事ができる。また、統計処理部２１６で行なう統計処理は、目的に応じて適切なものを準備すればよい。たとえば、前述した節ごとの平均形態素数、平均文節数、節の種類の分布等を、テキスト２１４に含まれる節ラベルに基づいて計算により求める事ができる。
【０１０８】
今回開示された実施の形態は単に例示であって、本発明が上記した実施の形態のみに制限されるわけではない。本発明の範囲は、発明の詳細な説明の記載を参酌した上で、特許請求の範囲の各請求項によって示され、そこに記載された文言と均等の意味および範囲内での全ての変更を含む。
【図面の簡単な説明】
【図１】独話と対話との相違を示すための図である。
【図２】本発明の第１の実施の形態に係る翻訳装置の機能的ブロック図である。
【図３】節境界検出ルールの一般形及び例を説明するための図である。
【図４】節境界検出ルールを実装したＰｅｒｌのコマンド形式を説明するための図である。
【図５】第１の実施の形態の節境界検出ルールで検出可能な節の種類を説明するための図である。
【図６】第１の実施の形態の装置で節境界検出ルールを実装したＰｅｒｌスクリプトの構成を示す図である。
【図７】第１の実施の形態の装置のプログラム５２および言語処理系５６により実現される、節境界検出処理の制御構造を説明するためのフローチャートである。
【図８】節境界検出処理の結果例を示す図である。
【図９】第１の実施の形態による節境界検出処理の性能評価に用いたコーパスの概略規模を表形式で示す図である。
【図１０】第１の実施の形態による節境界検出処理の結果を、コーパス別に表形式で示す図である。
【図１１】第１の実施の形態による節境界検出処理の性能評価の結果を表形式で示す図である。
【図１２】本発明の第２の実施の形態に係るコーパスの統計処理装置の機能的ブロック図である。
【符号の説明】
３０翻訳装置、５０オペレーティングシステム（ＯＳ）、５２プログラム、５４、２１０形態素解析部、５６言語処理系、６０テキスト分離部、６２機械翻訳部、２００コーパスの統計処理装置、２１２節境界検出部、２１６統計処理部、２３０置換処理部、２３２ルールデータベース（ルールＤＢ）
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a device that performs preprocessing so that natural language processing can be carried out appropriately. In particular, it relates to a clause boundary detection device that performs preprocessing to divide input text into clause units so that processing such as translation can be carried out appropriately, to a machine translation device employing such a clause boundary detection device, and to computer programs for them.
[0002]
[Prior art]
In recent years, natural speech corpora of monologue (a group of utterances with a single speaker, such as a lecture or a news broadcast) have increasingly been built. Monologue, in which one speaker keeps talking, as in lectures, news, or conference presentations, is known to have longer sentences and more complex sentence structure than dialogue (a group of utterances in which two speakers exchange utterances).
[0003]
FIG. 1 shows the number of morphemes and the number of bunsetsu (phrase units) per sentence for television news (Japanese), a typical example of monologue, and for travel conversation (the Japanese part of a bilingual corpus), a typical example of dialogue. As FIG. 1 shows, both the number of morphemes and the number of bunsetsu per sentence are far larger in monologue than in dialogue.
[0004]
Furthermore, the more spontaneous the utterance, the less likely explicit sentence-final expressions are to appear, which makes sentence boundaries difficult to recognize.
[0005]
Natural language processing technologies that perform speech recognition or translation on monologue, dialogue, and the like have conventionally used the "sentence" as the basic processing unit in most cases.
[0006]
However, when natural language processing is applied to monologue, in which sentences are long and sentence ends are hard to determine, the ambiguity of syntactic analysis explodes as sentences grow longer. In addition, because sentence ends are unclear, the unit to be processed is unclear, and it is not known how long to wait for input before processing can begin.
[0007]
These problems also arise when monologue is machine translated. When monologue is machine translated, it is desirable to operate as a simultaneous interpreter that outputs translations while keeping pace with the speech. However, as noted above, sentences in monologue are long, so analysis may fail and translation may fail as a result. Even if translation succeeds, it cannot keep pace with the speech as simultaneous interpretation requires. Moreover, when sentence ends are hard to determine, it is difficult to decide at what point, and on which portion, translation should start.
[0008]
[Non-patent document 1]
Takashi Masuoka and Yukinori Takubo, "Basic Japanese Grammar-Revised Edition", Kuroshio Publishing, 1992
[Problems to be solved by the invention]
Therefore, particularly for monologue, it is desirable to be able to detect a unit other than the sentence at any time during an utterance, so that various kinds of processing can proceed incrementally while the utterance is in progress. If possible, that processing unit should be shorter than a sentence.
[0009]
As a processing unit shorter than the sentence, it is considered desirable to use the "clause", a unit centered on a predicate. A clause is a syntactically and semantically coherent unit, and performing processing such as translation or sentence summarization clause by clause is considered effective. A method of automatically detecting clause boundaries is therefore required.
[0010]
The first method that comes to mind for detecting clause boundaries is to analyze a sentence with a parser and identify the positions corresponding to clause boundaries from the result. However, parsers generally require a "sentence" as input. It is therefore difficult to start detecting clause boundaries until the end of the sentence has been input and parsing has finished. This restriction is undesirable when the input must be processed incrementally, as in simultaneous interpretation. For incremental processing, it is desirable that the positions of clause boundaries can be detected from local information alone, even while the utterance is still being input. It is even more preferable if the kind of clause delimited by each clause boundary can also be known, because this is useful not only for natural language processing but also for linguistic analysis.
[0011]
Accordingly, an object of the present invention is to provide a clause boundary detection device capable of detecting clause boundaries in Japanese utterances from local information alone.
[0012]
Another object of the present invention is to provide a clause boundary detection device capable of detecting clause boundaries in Japanese utterances at any time, from local information alone.
[0013]
Still another object of the present invention is to provide a clause boundary detection device capable of detecting a clause boundary from only local information in a Japanese utterance and determining what kind of clause is delimited by that clause boundary.
[0014]
Still another object of the present invention is to provide a machine translation apparatus capable of detecting a clause from Japanese utterances as needed and automatically performing translation for each clause.
[0015]
[Means for Solving the Problems]
A clause boundary detection device according to a first aspect of the present invention detects the clause boundaries of an original text from a morpheme sequence obtained by performing morphological analysis on the text. The device includes detection means for detecting a predetermined pattern of consecutive morphemes in the morpheme sequence, and boundary designation means for, in response to detection of the pattern, performing predetermined processing that designates as a clause boundary a position having a predetermined relationship to the run of morphemes in the detected pattern, and outputting the morpheme sequence.
[0016]
Preferably, the boundary designation means includes means for, in response to detection of the pattern, inserting a boundary marker indicating a clause boundary at a certain position and outputting the morpheme sequence.
[0017]
More preferably, the detection means includes means for detecting any one of a plurality of patterns in the morpheme sequence.
[0018]
The boundary designation means may include label insertion means for, in response to detection of any one of the patterns by the means for detecting any one of them, inserting a clause boundary label predetermined for the detected pattern at a position having a predetermined relationship to the run of morphemes in the detected pattern.
[0019]
The clause boundary label or clause boundary marker may be inserted immediately after the last morpheme in the detected pattern.
[0020]
Preferably, the detection means includes temporary storage means for sequentially reading in the morpheme sequence and storing and outputting it in FIFO (First-In First-Out) fashion, and means for detecting that the predetermined pattern of consecutive morphemes is present in the array of morphemes stored in the temporary storage means. The boundary designation means may include means for controlling the temporary storage means, in response to detection of the predetermined pattern, so that its contents are output up to and including the pattern, and means for inserting a marker indicating a clause boundary at the end of the pattern output from the temporary storage means.
[0021]
More preferably, the detection means includes temporary storage means for sequentially reading in the morpheme sequence and storing and outputting it in FIFO fashion, and means for detecting that any one of a plurality of predetermined patterns of consecutive morphemes is present in the array of morphemes stored in the temporary storage means. The boundary designation means includes means for controlling the temporary storage means, in response to detection of any one of the patterns, so that its contents are output in FIFO fashion up to and including the detected pattern, and means for inserting, in response to detection of any one of the patterns, the clause boundary label corresponding to the detected pattern at the end of the pattern output from the temporary storage means.
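The FIFO arrangement described in the two paragraphs above can be sketched as follows. This is an illustration only, written in Python rather than as any implementation given in the specification; the `Morpheme` tuple, the predicate-list representation of a pattern, and the function name `detect_clauses` are all assumptions introduced for the example.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List, NamedTuple, Tuple


class Morpheme(NamedTuple):
    surface: str      # form as it appears in the text
    pos: str          # part of speech
    conj_class: str   # conjugation class, "" if not applicable
    conj_form: str    # conjugated form, "" if not applicable


# Hypothetical rule shape: one predicate per morpheme in the pattern, plus a label.
Pattern = List[Callable[[Morpheme], bool]]
Rule = Tuple[Pattern, str]


def detect_clauses(morphemes: Iterable[Morpheme], rules: List[Rule]) -> Iterator[str]:
    """Yield surface forms in input order, emitting "/label/" right after
    the last morpheme of any pattern found at the tail of the buffer."""
    fifo: deque = deque()                     # temporary storage, first in first out
    for m in morphemes:
        fifo.append(m)
        for pattern, label in rules:
            tail = list(fifo)[-len(pattern):]
            if len(tail) == len(pattern) and all(p(x) for p, x in zip(pattern, tail)):
                while fifo:                   # output everything up to the pattern end
                    yield fifo.popleft().surface
                yield f"/{label}/"            # label corresponding to the matched pattern
                break
    while fifo:                               # end of input: flush what is left
        yield fifo.popleft().surface
```

Because the function is a generator, a label is produced as soon as the last morpheme of a pattern has been read, without waiting for the end of the sentence. Flushing the remaining morphemes at the end of input is a choice made for this sketch and is not stated in the text.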
[0022]
A computer program according to a second aspect of the present invention, when executed by a computer, causes the computer to operate as any one of the clause boundary detection devices described above.
[0023]
A machine translation device according to a third aspect of the present invention includes: morphological analysis means for performing morphological analysis on input Japanese text and outputting the resulting morpheme sequence; any one of the clause boundary detection devices described above, connected so as to receive the output of the morphological analysis means as input; clause separation means for separating the morpheme sequence output from the clause boundary detection device into clauses at the clause boundaries in that sequence; and machine translation means that takes the morpheme sequences separated by the clause separation means as input and, in response to receiving a clause from the clause separation means, translates the received clause.
[0024]
Preferably, the clause boundary detection device has a function of inserting clause boundary markers at the clause boundaries of the morpheme sequence it outputs, and the clause separation means includes storage means for temporarily storing the output of the clause boundary detection device in FIFO fashion, and means for, in response to a clause boundary marker being output from the clause boundary detection device, supplying the morpheme sequence stored in the storage means to the machine translation means and causing machine translation to start.
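The clause separation step described above, in which buffered output is handed to the translator each time a marker arrives, might look like the following sketch. It is written in Python for illustration; the token representation, the "/label/" marker shape, and the `translate` callback are assumptions made for the example, and no real translation engine is involved.

```python
import re
from typing import Callable, Iterable, List

# Assumed marker shape: a token consisting of a label between two slashes.
MARKER = re.compile(r"^/.+/$")


def feed_translator(tokens: Iterable[str],
                    translate: Callable[[List[str], str], None]) -> None:
    """Buffer tokens in arrival order and pass each completed clause to
    `translate` as soon as its clause boundary marker is seen."""
    buffer: List[str] = []
    for tok in tokens:
        if MARKER.match(tok):
            translate(buffer, tok.strip("/"))   # start translation for this clause
            buffer = []                         # begin collecting the next clause
        else:
            buffer.append(tok)
    if buffer:
        # Trailing material with no marker; handling it this way is an assumption.
        translate(buffer, "")
```

A caller would supply its own `translate` function; in a test it can simply record the clauses it receives.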
[0025]
A computer program according to a fourth aspect of the present invention, when executed by a computer, causes the computer to operate as the above-described machine translation device.
[0026]
BEST MODE FOR CARRYING OUT THE INVENTION
[First Embodiment]
-Principle of clause boundary detection-
To detect clause boundaries without syntactic analysis, the present embodiment performs morphological analysis on the input text and detects clause boundaries using only the local adjacency relations between morphemes as clues. To this end, the local adjacency relations between morphemes are classified into patterns, rules are written that identify a clause boundary when a particular pattern is detected, and clause boundaries are identified automatically according to these rules. Each clause boundary detection rule consists of a morpheme sequence pattern for finding the position of a clause boundary, paired with a clause boundary label indicating the type of the clause boundary. No syntactic analysis is required.
[0027]
-Configuration-
FIG. 2 shows a functional block diagram of a translation device employing the clause boundary detection device of the present embodiment. The system of this embodiment uses an existing language processing system capable of text processing (specifically, a Perl processing system) and implements the clause boundary detection rules as a script that uses Perl regular expression substitution.
[0028]
Referring to FIG. 2, translation device 30 responds to a start command 32 from the user by machine translating Japanese input text 34 into English and outputting the result as translation output 36. Translation device 30 includes a morphological analysis unit 54 for performing morphological analysis on input text 34 and outputting a morpheme sequence. An existing morphological analysis program can be used as morphological analysis unit 54. FIG. 3 shows the output format and an output example of a well-known morphological analysis program. FIG. 3 is described in detail later.
[0029]
Referring again to FIG. 2, translation device 30 further includes: a program 52 that implements the clause boundary detection rules described above as a script consisting of a sequence of Perl regular expression commands; a language processing system 56 that applies program 52 to the morpheme sequence output by morphological analysis unit 54 and outputs processed text in which a clause label has been inserted at each clause boundary; a buffer 58 that temporarily accumulates the output of language processing system 56 in FIFO fashion; a text separation unit 60 that separates the text into clauses by reading out and outputting the text accumulated in buffer 58 each time a clause label is output from language processing system 56; a machine translation unit 62 that translates the text supplied from text separation unit 60 into English and outputs it as translation output 36; and an operating system (OS) 50 that, in response to start command 32 from the user, reads input text 34 and program 52 and starts morphological analysis unit 54, language processing system 56, and so on. The information indicating a clause boundary is called a "clause label" here because it includes information indicating the type of the clause. The clause label also indicates that a clause boundary exists at that position, so it serves as a clause boundary marker as well.
[0030]
As can be seen from the fact that it includes OS 50 and language processing system 56 and executes program 52, translation device 30 is in substance a computer. Input text 34 and translation output 36 denote standard input and standard output, respectively, but in the present embodiment input text 34 is supplied from a predetermined file and translation output 36 is likewise written to a predetermined file.
[0031]
Output format 80 of morphological analysis unit 54 will be described with reference to FIG. 3. As shown in output format 80, each morpheme output by morphological analysis unit 54 consists of the surface form of the morpheme, its part of speech, its conjugation class (活用形 in the terminology of this specification), and the conjugated form (活用型) in which it appeared. The "conjugation class" indicates the category of conjugation that a verb, auxiliary verb, adjective, or the like follows; examples are "五段活用" (godan, five-grade conjugation) and "下二段活用" (shimo-nidan, lower bigrade conjugation). The "conjugated form" indicates which form within its conjugation class the morpheme takes where it appears; examples are "未然形" (irrealis form), "連用形" (continuative form), and "連体形" (attributive form).
[0032]
FIG. 3 shows the sentence "私は学校へ行きました" ("I went to school") as input example 82. The result of analyzing it with morphological analysis unit 54 is shown as analysis result 84. As analysis result 84 makes clear, morphological analysis unit 54 has the function of performing morphological analysis on input text 34 and outputting a morpheme sequence in the format given by output format 80.
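The four-field morpheme record described above can be pictured as a small data structure. The Python sketch below is illustrative only: FIG. 3 is not reproduced in this text, so the `Morpheme` class, the bracketed one-token layout used by `serialize` and `parse`, and the hand-written field values are assumptions, not the actual output format of the analyzer used in the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    surface: str      # surface form as it appears in the text
    pos: str          # part of speech
    conj_class: str   # conjugation class (e.g. a godan class), "" if none
    conj_form: str    # conjugated form (e.g. continuative), "" if none


def serialize(m: Morpheme) -> str:
    """Hypothetical one-token layout: surface[pos,class,form]."""
    return f"{m.surface}[{m.pos},{m.conj_class},{m.conj_form}]"


def parse(token: str) -> Morpheme:
    """Inverse of serialize for the hypothetical layout above."""
    surface, rest = token[:-1].split("[", 1)
    pos, conj_class, conj_form = rest.split(",")
    return Morpheme(surface, pos, conj_class, conj_form)


# Hand-written illustrative values for the tail of the input example,
# not actual analyzer output.
SAMPLE = [
    Morpheme("行き", "動詞-自立", "五段・カ行促音便", "連用形"),
    Morpheme("まし", "助動詞", "特殊・マス", "連用形"),
    Morpheme("た", "助動詞", "特殊・タ", "基本形"),
]
```

Joining the serialized tokens with spaces gives a single line of text per input line, which is the kind of input a line-oriented substitution script can work on.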
[0033]
FIG. 4 shows examples of Perl commands that implement clause boundary detection rules. Referring to FIG. 4, general form 100 of a Perl substitution command consists of the command "s" indicating substitution, the search string to look for, the replacement string that is to replace the matched string, and an option string specifying optional behavior for the substitution, separated by "/" (slash). Regular expressions can be used in the search string and the replacement string. Many language processing systems besides Perl support such regular expressions. Regular expressions are explained below where necessary; for a general explanation, refer to the manuals for the individual language processing systems.
[0034]
Basic form 102 in FIG. 4 shows the general form of the clause boundary detection rules implemented in Perl in the present embodiment. Basic form 102 is general form 100 with morpheme sequence pattern 110 as the search string, replacement string expression 112, "$1\/clause label\/", as the replacement string, and "g" specified as the option.
[0035]
The morpheme sequence pattern 110 is enclosed in parentheses, which correspond to "$1" in the replacement string expression 112. "$1" in the replacement string expression 112 indicates that this part is replaced with the string matched by the first parenthesized group in the search string. Since the only parenthesized group in the search string is the morpheme sequence pattern 110, $1 is replaced by the string that matched the morpheme sequence pattern 110.
[0036]
"\" in the replacement string expression 112 is an escape character. It indicates that the immediately following character is treated as an ordinary character, not as part of the command. In this example the replacement string contains "/", but "/" is also used as the delimiter of the command, so "\" is used to make each slash in the replacement string an ordinary character. Clause labels are described later.
[0037]
Option "g" specifies a global search. That is, the search does not stop when the first match is found. Search and replacement are performed over the entire input string, however many matches are found.
[0038]
In other words, according to the basic form 102, wherever a morpheme sequence matches the morpheme sequence pattern 110, the string "/clause label/" is inserted at the end of that morpheme sequence.
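The behavior of the basic form 102 can be sketched outside Perl as well. The following is a minimal Python equivalent, given only as an illustration: `make_rule` is a hypothetical helper name, and the pattern "B" and label "X" are toy values. Python's `re.sub` replaces every non-overlapping match by default, which corresponds to the "g" option, and `m.group(1)` plays the role of "$1".

```python
import re


def make_rule(morpheme_pattern, clause_label):
    """Build a rule that appends "/clause_label/" after every match."""
    # The added parentheses make the whole pattern capture group 1.
    compiled = re.compile(f"({morpheme_pattern})")

    def apply(text):
        # Replace each match with itself followed by the slash-delimited label.
        return compiled.sub(lambda m: f"{m.group(1)}/{clause_label}/", text)

    return apply


rule = make_rule("B", "X")  # toy pattern and label
print(rule("A B C B"))
```

No escape character is needed before the slashes here because Python does not use "/" as a delimiter of the substitution.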
[0039]
FIG. 4 also shows a first example 104 of a specific clause boundary detection rule. In this example 104, wherever the input morpheme sequence contains a morpheme whose appearance form is "keredomo" and whose part of speech is "particle - conjunctive particle", the string "/parallel clause KEREDOMO/" is appended to the end of that morpheme.
[0040]
FIG. 4 also shows a second example 106 of a specific clause boundary detection rule. In this example 106, wherever the input morpheme sequence contains a morpheme whose conjugated form is "continuative ta-connection" or "continuative form", immediately followed by a morpheme whose appearance form is "tara", whose part of speech is "auxiliary verb", whose conjugation type is "ta", and whose conjugated form is "hypothetical form", the string "/conditional clause TARA/" is appended to the end of the matched sequence. "|" in the search string represents "or".
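The two example rules can be approximated as follows. This is a sketch under stated assumptions: the actual Perl patterns of FIG. 4 are not reproduced in this text, so the line format ("surface,pos,conj_type,conj_form" with morphemes separated by spaces), the English field values, and the sample sentence are all hypothetical.

```python
import re

# Assumed line format: "surface,pos,conj_type,conj_form", morphemes split by spaces.
RULES = [
    # In the spirit of example 104: conjunctive particle "keredomo".
    (r"keredomo,particle-conjunctive,[^ ]*", "parallel clause KEREDOMO"),
    # In the spirit of example 106: a continuative(-ta) form followed by "tara".
    (r"[^ ]*,(?:continuative-ta|continuative) tara,auxiliary,ta,hypothetical",
     "conditional clause TARA"),
]


def apply_rules(line, rules=RULES):
    """Apply every rule globally, appending "/label/" after each match."""
    for pattern, label in rules:
        line = re.sub(f"({pattern})",
                      lambda m, label=label: f"{m.group(1)}/{label}/", line)
    return line


sample = ("fut,verb,godan,continuative-ta tara,auxiliary,ta,hypothetical "
          "yame,verb,ichidan,continuative masu,auxiliary,masu,basic "
          "keredomo,particle-conjunctive,*,*")
result = apply_rules(sample)
print(result)
```

The second pattern spans two morphemes, which is consistent with rules built from short runs of consecutive morphemes.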
[0041]
In the present embodiment, 361 such clause boundary detection rules are used. Every rule has a pattern consisting of one to three consecutive morphemes. On the assumption that readings are not included in the input, the patterns do not include readings either.
[0042]
FIG. 5 shows some of the types of clauses detected in the present embodiment. In the present embodiment, the classification of subordinate clauses (supplementary clauses, adverbial clauses, adnominal clauses, and parallel clauses) described in Non-Patent Document 1 is used after being augmented and reorganized. The rules also include patterns for detecting the topic particle "ha", which is considered to be a syntactically large break, discourse markers, and inflectional verbs. In the present specification, all of these are treated as "clause boundaries".
[0043]
The clause labels used in the present embodiment are actually obtained by further subdividing the labels shown in FIG. For example, clause boundaries such as "Tameni clause" and "Tameniha clause" are provided under "Tame clause". In total there are 144 types of these lower-level clause boundaries.
[0044]
FIG. 6 shows the actual format of the program 52. Referring to FIG. 6, the first line of program 52 gives the path to the Perl processing system, following Perl conventions. The second line is a command indicating that the part enclosed by the following braces "{" and "}" is executed repeatedly as long as input text remains. Inside the braces is the body of the clause boundary detection rules described above. For each input, all the global substitution commands described there are executed, the resulting text is written to the standard output by the "print" command at the end, and processing proceeds to the next input.
[0045]
FIG. 7 is a flowchart showing the clause boundary detection processing as actually realized by the language processing system 56 and the program 52 shown in FIG. Although the language processing system 56 itself has general-purpose functions beyond those shown in FIG. 7, only its operation when executing the program 52 is shown here. As described later, when the processing realized by the program 52 and the language processing system 56 is implemented by a dedicated program, its control structure is, for example, as shown in FIG. 7.
[0046]
Referring to FIG. 7, this processing includes a step 140 of opening the related files (input file, output file, and so on), a step 142 of reading the first line (the character string up to the line feed code) of the input text file, and a step 144 of determining whether the end of the input file (EOF: End Of File) has been reached. If the result of the determination is YES, control proceeds to step 162. Otherwise, control proceeds to step 146. This processing can be performed on a plurality of input files in succession, but for simplicity the description here assumes a single file.
[0047]
In step 146, initial processing is performed. In the initial processing, elements that hinder the detection of clause boundaries are removed from the input text.
[0048]
Subsequently, at step 148, a global search for the first replacement command is performed. In step 150, it is determined whether all the replacement commands in the program 52 have been executed. If the execution has been completed, the control proceeds to step 158. Otherwise, control proceeds to step 152.
[0049]
In step 152, it is determined whether or not there is a part that matches the regular expression of the search character string of the replacement command as a result of the search. If there is a match, control proceeds to step 154. Otherwise, control returns to step 150.
[0050]
In step 154, a process of replacing all the matched portions with the replacement character string is performed. When all the replacements have been completed, the process proceeds to the next replacement command in step 156, and the control returns to step 150.
[0051]
If it is determined in step 150 that the execution of all the replacement commands has been completed, the control proceeds to step 158. In step 158, a process of writing one line of the completed text to the standard output is executed. Subsequently, the next one line of the input text file is read. Control then returns to step 144.
[0052]
On the other hand, if it is determined in step 144 that the EOF of the input file has been reached, all related files are closed in step 162, and the process ends.
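The control structure of steps 142 to 158 can be sketched as a dedicated routine. The following Python function is an illustration under assumptions, not the program 52 itself: the function name, the toy rule, and the use of in-memory files are hypothetical, and the initial processing of step 146 is reduced to stripping the line feed because the text does not specify which elements are removed. Opening and closing of files (steps 140 and 162) are left to the caller.

```python
import io
import re


def detect_clause_boundaries(infile, outfile, rules):
    """Label clause boundaries line by line, following the flow of FIG. 7."""
    line = infile.readline()                      # step 142: read the first line
    while line != "":                             # step 144: "" means EOF
        text = line.rstrip("\n")                  # step 146: placeholder initial processing
        for pattern, label in rules:              # steps 148, 150, 156: each command in turn
            regex = re.compile(f"({pattern})")
            if regex.search(text):                # step 152: is there any match?
                text = regex.sub(                 # step 154: replace all matches
                    lambda m, label=label: f"{m.group(1)}/{label}/", text)
        outfile.write(text + "\n")                # step 158: write one finished line
        line = infile.readline()                  # read the next line, back to step 144


source = io.StringIO("a B c\nd e\nB B\n")         # toy input, one sentence per line
sink = io.StringIO()
detect_clause_boundaries(source, sink, [("B", "X")])
print(sink.getvalue(), end="")
```

The explicit check in step 152 is redundant in Python, since `sub` leaves a string without matches unchanged, but it is kept here to mirror the flowchart.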
[0053]
-Operation-
This machine translator operates as follows. Referring to FIG. 2, it is assumed that the user has input start command 32. The start command 32 includes information for specifying the input text 34 and the program 52.
[0054]
The OS 50 activates the morphological analysis unit 54 in response to this command, opens the input text 34, and causes the morphological analysis unit 54 to perform morphological analysis. On the other hand, the OS 50 reads the program 52 specified by the start command 32 from the storage device. As described above, the first line of the program 52 describes the path to the language processing system for executing the program 52. The OS 50 activates the language processing system 56 according to this path.
[0055]
The morpheme sequence output from the morphological analysis unit 54 is given to the language processing system 56. The language processing system 56 applies a clause boundary detection rule included in the program 52 to the morpheme string, performs a process of inserting a clause label at a clause boundary in the text, and outputs the result to the buffer 58.
[0056]
Every time a clause label is output from the language processing system 56, the text separation unit 60 reads out the text stored in the buffer 58 and supplies it to the machine translation unit 62.
[0057]
The machine translation unit 62 performs machine translation for the given clause, and outputs the result as a translation output 36.
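The hand-off from labeled text to clause-by-clause translation can be sketched as follows. This Python example is a rough illustration: `split_clauses` and `translate_clause` are hypothetical names, the translation function is only a stand-in for the machine translation unit 62, and the text is assumed to contain no slashes other than those delimiting clause labels.

```python
import re

LABEL = re.compile(r"/([^/]+)/")  # a clause label delimited by slashes


def split_clauses(labeled_text):
    """Yield (clause_text, label) pairs; label is None for an unlabeled tail."""
    start = 0
    for m in LABEL.finditer(labeled_text):
        yield labeled_text[start:m.start()].strip(), m.group(1)
        start = m.end()
    tail = labeled_text[start:].strip()
    if tail:
        yield tail, None


def translate_clause(clause):
    """Hypothetical stand-in for the machine translation unit."""
    return f"<translated: {clause}>"


labeled = "a b/parallel clause KEREDOMO/ c d/conditional clause TARA/ e"
clauses = list(split_clauses(labeled))
for clause, label in clauses:
    print(label, translate_clause(clause))
```

Because `split_clauses` is a generator, each clause becomes available as soon as its label has been seen, which matches the idea of starting translation whenever a clause label is output.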
[0058]
-Processing example-
Referring to FIG. 8, clause boundary detection processing is performed on text 190. The result is shown as processed text 192. The processed text 192 includes clause labels inserted at the positions where morpheme sequence patterns corresponding to clause boundaries were detected. For example, the passage "Voluntary evacuation is called for in the ▽▽ district of ○× town" is divided into the clause "voluntary evacuation is called for" and the remainder, "in the ▽▽ district of ○× town". The clause "voluntary evacuation is called for" is given the clause label "Adjunct clause". The clause label is inserted set off from the text by slashes.
[0059]
-Experiments for performance evaluation-
In order to evaluate the performance of the clause boundary detection device implemented by the program 52 and the language processing system 56 according to the present embodiment, the rules were applied to a plurality of corpora with different properties, and the results were analyzed. FIG. 9 shows the approximate size of the prepared corpora.
[0060]
As shown in FIG. 9, five corpora were prepared in all. Three of them are monologue corpora and two are dialogue corpora.
[0061]
The first monologue corpus is a transcript of so-called commentary programs in broadcasting. The second monologue corpus is a corpus of manuscripts of television news. The third monologue corpus is a database of newspaper articles on the economy. The first dialogue corpus is a simulated conversation corpus prepared by the applicant on the basis of bilingual travel conversations. The second dialogue corpus collects typical expressions used in overseas travel.
[0062]
Referring to FIG. 9, it can be seen that sentences are markedly longest in the second monologue corpus, followed by the first and third monologue corpora. In contrast, the sentences in the dialogue corpora are extremely short.
[0063]
The clause boundary detection processing described above was performed on these corpora. FIG. 10 shows the number of detected clauses, the average number of clauses included in one sentence, the average number of morphemes included in each clause, and the average number of phrases included in each clause. From FIG. 10, it can be seen that the length of one clause (the number of morphemes and the number of phrases) detected by the clause boundary detection processing hardly differs between the corpora, regardless of whether the corpus is a monologue or a dialogue.
[0064]
-Evaluation-
Furthermore, in order to evaluate the performance of the clause boundary detection device, 500 sentences were selected from each corpus, and clause boundaries were identified and judged manually to create correct answer data. The result of the clause boundary detection processing by the clause boundary detection device described above was collated with the correct answer data, and the precision and the recall were obtained. The results are shown in table form in FIG.
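Precision and recall as used in such a collation can be computed as in the following sketch. It uses the standard definitions (precision is the share of detected boundaries that are correct, recall is the share of correct boundaries that were detected). Representing a boundary as a (sentence number, morpheme position) pair is an assumption made for this example, and the numbers are toy values, not the results of the experiment.

```python
def precision_recall(detected, correct):
    """Return (precision, recall) for two collections of boundary positions."""
    detected, correct = set(detected), set(correct)
    hits = len(detected & correct)                      # boundaries found and correct
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall


# Toy data: (sentence number, morpheme position) of each clause boundary.
detected = {(1, 5), (1, 9), (2, 3)}
correct = {(1, 5), (2, 3), (2, 7), (3, 4)}
p, r = precision_recall(detected, correct)
print(f"precision={p:.3f} recall={r:.3f}")
```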
[0065]
Referring to FIG. 11, it can be seen that in all the corpora both the precision and the recall are very high, and that clause boundaries are detected very accurately. Detecting clause boundaries with high accuracy and performing translation clause by clause also increases the accuracy of machine translation, and as a result good translations can be obtained. Moreover, in the processing described above, a clause boundary can be detected as soon as the morpheme sequence matches a predetermined clause boundary pattern, so clauses can be detected incrementally even before the end of the sentence has been input. The processing is therefore suitable for simultaneous translation and similar applications.
[0066]
-Practical examples of clause boundary detection rules-
The following are the clause boundary detection rules (in Perl substitution command format) actually used in the experiment. Only the substitution commands corresponding to the rules are shown, and the part that controls the script is omitted. In addition, some parts that should be written on one line in the actual script are shown here across several lines.
[0067]
<Start of rule>
[0068]
[Table 1]

[0069]
[Table 2]

[0070]
[Table 3]

[0071]
[Table 4]

[0072]
[Table 5]

[0073]
[Table 6]

[0074]
[Table 7]

[0075]
[Table 8]

[0076]
[Table 9]

[0077]
[Table 10]

[0078]
[Table 11]

[0079]
[Table 12]

[0080]
[Table 13]

[0081]
[Table 14]

[0082]
[Table 15]

[0083]
[Table 16]

[0084]
[Table 17]

[0085]
[Table 18]

[0086]
[Table 19]

[0087]
[Table 20]

[0088]
[Table 21]

[0089]
[Table 22]

[0090]
[Table 23]

[0091]
[Table 24]

[0092]
[Table 25]

[0093]
[Table 26]

[0094]
[Table 27]

[0095]
[Table 28]

[0096]
[Table 29]

[0097]
[Table 30]

<End of rule>
[0098]
In the present embodiment, every time a clause boundary label is output from the language processing system 56, the text separation unit 60 reads the morpheme sequence from the buffer 58 and gives it to the machine translation unit 62, and machine translation starts. However, the present invention is not limited to such an embodiment. For example, the entire output of the language processing system 56 may first be stored in the buffer 58, and the contents of the buffer 58 may then be separated into clauses at the clause boundary labels and given to the machine translation unit 62.
[0099]
Further, in this embodiment, when a pattern of a morpheme sequence indicating a clause boundary is detected, a clause boundary label is inserted at its end. However, the present invention is not limited to such an embodiment, and a clause boundary label may be inserted at any position having a predetermined relationship with the pattern. For example, there may be cases where a clause boundary label should be inserted somewhere other than the end of the morpheme sequence pattern, such as just before its last morpheme. In this case, when separating into clauses, the span up to the morpheme following the clause boundary label may be regarded as one clause. Further, clause boundary labels may be inserted at two or more places instead of one. For example, a start label and an end label of the clause boundary may be inserted at the beginning and the end, respectively, of the pattern corresponding to the clause boundary.
[0100]
Further, in the above-described embodiment, each line of the input text 34 is first read as a whole, and the clause boundary detection processing is then performed. However, the present invention is not limited to such an embodiment. For example, morphemes may be stored one after another in a temporary storage device in a FIFO manner, and a clause boundary may be detected whenever part of the stored morpheme sequence satisfies a predetermined pattern. In this case, the morpheme sequence stored in the temporary storage device may be output in order up to the pattern, and a clause boundary label corresponding to the pattern may be inserted at its end.
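The FIFO variant can be sketched as follows. This Python generator is one possible reading under stated assumptions, not the embodiment itself: the function name is hypothetical, morphemes are plain strings, a rule matches when it equals the most recent one to three morphemes joined by spaces, and the buffer is emptied each time a clause is output.

```python
import re
from collections import deque


def stream_clauses(morphemes, rules, max_len=3):
    """Yield (clause_morphemes, label) as soon as the newest morphemes match a rule."""
    fifo = deque()
    for morpheme in morphemes:
        fifo.append(morpheme)                       # store in arrival order
        # Try the longest tail first; patterns span one to max_len morphemes.
        for n in range(min(max_len, len(fifo)), 0, -1):
            tail = " ".join(list(fifo)[-n:])
            label = next(
                (lab for pat, lab in rules if re.fullmatch(pat, tail)), None)
            if label is not None:
                yield list(fifo), label             # output everything up to the pattern
                fifo.clear()
                break
    if fifo:
        yield list(fifo), None                      # leftover morphemes, no label


toy_rules = [("K", "label K"), ("T1 T2", "label T")]
out = list(stream_clauses(["a", "b", "K", "c", "T1", "T2", "d"], toy_rules))
print(out)
```

A clause is emitted the moment its closing pattern is complete, without waiting for the end of the sentence, which is the property that makes this variant attractive for incremental processing.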
[0101]
[Second embodiment]
In order to detect clause boundaries, the translation device 30 according to the first embodiment described above uses a program 52 written in advance in a predetermined programming language (Perl) and a language processing system 56 that is the processing system of that programming language. However, the present invention is not limited to such an embodiment. Instead of a general-purpose language processing system, a dedicated program can be used. In that case, only the rules may be stored in a database so that clause boundary rules can be added, changed, or deleted as appropriate.
[0102]
FIG. 12 shows a functional block diagram of a corpus statistical processing device employing the clause boundary detection device according to this embodiment. This device performs the clause boundary detection processing described above on the corpus to be processed and statistically processes the types of clause labels obtained as a result, thereby making it possible to examine the character of the corpus.
[0103]
Referring to FIG. 12, the corpus statistical processing device 200 has a function of receiving a corpus 202, separating each sentence included in the corpus 202 into clauses with clause labels, and statistically processing the results. The corpus statistical processing device 200 includes: a morphological analysis unit 210 that receives the corpus 202 as input, performs morphological analysis of each sentence, and outputs a morpheme sequence; a clause boundary detection unit 212 that performs clause boundary detection processing on the morpheme sequence output by the morphological analysis unit 210, inserts a clause label indicating the type of the clause immediately before each clause boundary, and outputs the result as text; and a statistical processing unit 216 that performs statistical processing on the clause labels in the text 214 output from the clause boundary detection unit 212 after clause boundary detection, and outputs a statistical output 204.
[0104]
The clause boundary detection unit 212 includes a rule database (rule DB) 232 in which the clause boundary detection rules are stored as a database, and a replacement processing unit 230 that applies the clause boundary detection rules stored in the rule DB 232 to the morpheme sequence output from the morphological analysis unit 210, performs the same processing as the replacement commands of the first embodiment, and outputs a text string with clause labels inserted at clause boundaries.
[0105]
The replacement processing unit 230 is preferably capable of processing regular expressions, like the Perl processing system of the first embodiment. In that case, the part of each rule stored in the rule DB 232 that corresponds to the search string can be written as a regular expression, which makes it possible to reduce the size of the rule DB 232 while still handling all processing targets appropriately.
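One way to keep the rules in a database and apply them with a regular expression engine is sketched below. The table layout, the function names, and the toy rules are assumptions made for this example. An in-memory SQLite database is used only so that the sketch is self-contained.

```python
import re
import sqlite3


def open_rule_db():
    """Create an empty in-memory rule database (assumed schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE rules ("
        "id INTEGER PRIMARY KEY, pattern TEXT NOT NULL, label TEXT NOT NULL)")
    return db


def add_rule(db, pattern, label):
    db.execute("INSERT INTO rules (pattern, label) VALUES (?, ?)", (pattern, label))


def delete_rule(db, label):
    db.execute("DELETE FROM rules WHERE label = ?", (label,))


def apply_rules(db, text):
    """Apply every stored rule in insertion order, appending "/label/" to matches."""
    rows = db.execute("SELECT pattern, label FROM rules ORDER BY id").fetchall()
    for pattern, label in rows:
        text = re.sub(f"({pattern})",
                      lambda m, label=label: f"{m.group(1)}/{label}/", text)
    return text


db = open_rule_db()
add_rule(db, "B", "X")   # toy rules
add_rule(db, "C", "Y")
before = apply_rules(db, "A B C")
delete_rule(db, "X")     # rules can be removed without touching the program
after = apply_rules(db, "A B C")
print(before, "|", after)
```

Separating the rules from the replacement logic in this way means that a rule can be added, changed, or deleted by editing data alone.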
[0106]
The replacement processing unit 230 can also be realized by a computer and software. The configuration of the software that implements the replacement processing unit 230 in that case is the same as the flowchart shown in FIG.
[0107]
As the morphological analysis unit 210, the same one as that used in the first embodiment can be used. The statistical processing performed by the statistical processing unit 216 may be prepared as appropriate for the purpose. For example, the average number of morphemes per clause, the average number of phrases per clause, the distribution of clause types, and the like can be calculated from the clause labels included in the text 214.
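Statistics of this kind can be derived from labeled text as in the following sketch. It assumes, for illustration only, that each line is one sentence, that morphemes are separated by spaces, and that clause labels are delimited by slashes. The function name and the labels "L1" and "L2" are hypothetical, and an unlabeled remainder at the end of a line is counted as a clause of its own.

```python
import re
from collections import Counter

LABEL = re.compile(r"/([^/]+)/")  # capturing form, used to collect labels
LABEL_SPLIT = re.compile(r"/[^/]+/")  # non-capturing form, used to cut clauses


def clause_statistics(labeled_lines):
    """Return the label distribution and simple per-sentence, per-clause averages."""
    label_counts = Counter()
    sentences = clauses = morphemes = 0
    for line in labeled_lines:
        sentences += 1
        label_counts.update(LABEL.findall(line))
        # Text between labels forms the clauses; empty pieces are dropped.
        segments = [s.split() for s in LABEL_SPLIT.split(line) if s.strip()]
        clauses += len(segments)
        morphemes += sum(len(seg) for seg in segments)
    return {
        "label_counts": label_counts,
        "clauses_per_sentence": clauses / sentences if sentences else 0.0,
        "morphemes_per_clause": morphemes / clauses if clauses else 0.0,
    }


stats = clause_statistics(["a b/L1/ c d e/L2/", "f/L1/ g h"])
print(stats)
```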
[0108]
The embodiments disclosed herein are merely examples, and the present invention is not limited to the embodiments described above. The scope of the present invention is indicated by each of the claims, taking into account the detailed description of the invention, and includes all modifications within the meaning and range equivalent to the wording of the claims.
[Brief description of the drawings]
FIG. 1 is a diagram showing the difference between a monologue and a dialogue.
FIG. 2 is a functional block diagram of the translation device according to the first embodiment of the present invention.
FIG. 3 is a diagram for explaining a general form and an example of a clause boundary detection rule.
FIG. 4 is a diagram for explaining a command format of Perl in which a clause boundary detection rule is implemented.
FIG. 5 is a diagram for explaining types of clauses detectable by a clause boundary detection rule according to the first embodiment.
FIG. 6 is a diagram illustrating a configuration of a Perl script in which clause boundary detection rules are implemented in the device according to the first embodiment.
FIG. 7 is a flowchart for explaining a control structure of a clause boundary detection process realized by the program 52 and the language processing system 56 of the apparatus according to the first embodiment.
FIG. 8 is a diagram illustrating an example of a result of a clause boundary detection process.
FIG. 9 is a diagram showing, in table form, the approximate size of the corpora used for performance evaluation of the clause boundary detection processing according to the first embodiment.
FIG. 10 is a diagram showing, in table form, the result of the clause boundary detection processing according to the first embodiment for each corpus.
FIG. 11 is a diagram showing, in table form, results of performance evaluation of the clause boundary detection processing according to the first embodiment.
FIG. 12 is a functional block diagram of a corpus statistical processing device according to a second embodiment of the present invention.
[Explanation of symbols]
Reference Signs List 30 translation device, 50 operating system (OS), 52 program, 54, 210 morphological analysis unit, 56 language processing system, 60 text separation unit, 62 machine translation unit, 200 corpus statistical processing device, 212 clause boundary detection unit, 216 statistical processing unit, 230 replacement processing unit, 232 rule database (rule DB)

Claims

A clause boundary detection device for detecting clause boundaries of an original sentence from a morpheme sequence obtained by performing morphological analysis on the sentence, comprising:
detecting means for detecting a pattern of a predetermined arrangement of morphemes in the morpheme sequence; and
boundary designating means for, in response to the detection of the pattern, performing predetermined processing that designates as a clause boundary a position in the morpheme sequence having a predetermined relationship with the arrangement of morphemes in the detected pattern, and outputting the morpheme sequence.

The clause boundary detection device according to claim 1, wherein the boundary designating means includes means for inserting, in response to the detection of the pattern, a boundary marker indicating a clause boundary at said position and outputting the morpheme sequence.

The clause boundary detection device according to claim 1, wherein the detecting means includes means for detecting any one of a plurality of said patterns in the morpheme sequence.

The clause boundary detection device according to claim 3, wherein the boundary designating means includes label inserting means for inserting, in response to the detection of any one of the patterns by the means for detecting any one, a predetermined clause boundary label corresponding to the detected pattern at a position having a predetermined relationship with the arrangement of morphemes in the detected pattern.

The clause boundary detection device according to any one of claims 1 to 4, wherein said position is immediately after the last morpheme in the detected pattern.

The clause boundary detection device according to claim 1, wherein the detecting means includes:
temporary storage means for sequentially reading the morpheme sequence and storing and outputting it in a FIFO manner; and
means for detecting that the pattern of the predetermined arrangement of morphemes is present in the morphemes stored in the temporary storage means,
and wherein the boundary designating means includes: means for controlling the temporary storage means, in response to the detection of the presence of the pattern, so as to output in a FIFO manner the contents of the temporary storage means up to the pattern; and
means for inserting a marker indicating a clause boundary at the end of the pattern output from the temporary storage means.

The clause boundary detection device according to claim 1, wherein the detecting means includes:
temporary storage means for sequentially reading the morpheme sequence and storing and outputting it in a FIFO manner; and
means for detecting that any one of a plurality of patterns of the predetermined arrangement of morphemes is present in the morphemes stored in the temporary storage means,
and wherein the boundary designating means includes: means for controlling the temporary storage means, in response to the detection of said any one pattern, so as to output in a FIFO manner the contents of the temporary storage means up to the detected pattern; and
means for inserting, in response to the detection of said any one pattern, a clause boundary label corresponding to the detected pattern at the end of the pattern output from the temporary storage means.

A computer program that, when executed by a computer, causes the computer to operate as the clause boundary detection device according to claim 1.

A machine translation device comprising: morphological analysis means for performing morphological analysis processing on an input Japanese sentence and outputting the obtained morpheme sequence;
the clause boundary detection device according to any one of claims 1 to 7, connected to receive the output of the morphological analysis means as input;
clause separating means for separating the morpheme sequence output from the clause boundary detection device into clauses at the clause boundaries in the morpheme sequence; and
machine translation means for receiving as input the morpheme sequences separated by the clause separating means and translating each clause in response to receiving it from the clause separating means.

The machine translation device according to claim 9, wherein the clause boundary detection device is the clause boundary detection device according to claim 6 and has a function of inserting a clause boundary marker at each clause boundary of the morpheme sequence to be output,
and wherein the clause separating means includes:
storage means for temporarily storing the output from the clause boundary detection device in a FIFO manner; and
means for giving the morpheme sequence stored in the storage means to the machine translation means, in response to the output of a clause boundary marker from the clause boundary detection device, and causing machine translation to start.

A computer program that, when executed by a computer, causes the computer to operate as the machine translation device according to claim 9.