JPH052605A - Machine translation system

Info

Publication number: JPH052605A
Application number: JP3298514A
Authority: JP
Inventors: Shigeya Senda; 滋也千田; Junichi Ito; 淳一伊藤; Takashi Katooka; 隆加登岡; Masumi Narita; 真澄成田; Yoshikazu Shiraishi; 美和白石; Yoshihisa Oguro; 慶久大黒; Norikazu Ito; 則和伊藤; Sakiko Honma; 咲子本間
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1990-10-29
Filing date: 1991-10-16
Publication date: 1993-01-08

Abstract

PURPOSE: To divide, translate, and recombine sentences that are difficult to translate because of their length or their morphological form. CONSTITUTION: A translation unit 3 divides the input data forming the original text into sentences, based on information in a translation dictionary unit 5, and further divides each sentence into translation units. The divided translation units are translated separately, and the translated units are combined when the translated sentence is generated. Multiple sentence recognition means are provided for the original text, and the one suited to the purpose and use is selected. Furthermore, the list sentence recognition means among them obtains a translation by attending to the sentence-end candidate information of the list introduction sentence and to list-sentence-end connecting words.

Description

Detailed Description of the Invention

【０００１】[0001]

[Technical Field] The present invention relates to a machine translation system, and more particularly to a machine translation system that performs morphological analysis with the aim of improving translation accuracy and the efficiency of editing work. It is applicable, for example, to English-Japanese machine translation devices and to natural language processing technology.

【０００２】[0002]

[Prior Art] In conventional machine translation systems, sentence length strongly affects both processing speed and translation accuracy: processing time grows exponentially with sentence length, and translation accuracy drops sharply once a sentence exceeds a certain length. For this reason, users of machine translation systems often split sentences before translating them. In practice, however, the user handles this by designating translation units one at a time during pre-editing, wherever translation has failed or seems likely to fail. Moreover, the fragments produced this way do not have the form of sentences, so the intended translation cannot be obtained unless the user deliberately edits them into sentences during pre-editing or revises the output substantially during post-editing.

【０００３】[0003]

Conventional machine translation systems have treated translation only as conversion from text to text, not from document to document. Document format information, such as layout, column positions, and the number of spaces between words, was therefore ignored; documents and text were reshaped to suit the system, even destroying the appearance of the original. Sentence ends were generally recognized either from sentence-end symbols such as periods or by having the user mark them, and the information carried by the document format was never used. As a result, sentence recognition often failed for title sentences and list sentences, which are normally written without a period or other sentence-end symbol, and the analysis success rate fell markedly. Likewise, headwords and symbols, which are also often written without sentence-end symbols, were not separated out and were passed to syntactic analysis as ordinary elements, which burdened the parser and frequently caused analysis to fail. Furthermore, preventing or correcting such recognition failures depended on the user, so pre-editing and post-editing took time and reduced the efficiency of the translation process as a whole.

【０００４】[0004]

Conventional sentence recognition was also designed for ordinary prose and was ill suited to formats such as manuals, in which title sentences and list sentences appear frequently. Sentence recognition produced a single fixed result, which the user had to correct after translation. If sentence recognition failed, nothing more could be done: unlike translation results, no alternative candidates were offered for the user to correct it with. Nor was there any provision for handling titles and lists by switching the sentence recognition method. And although systems had multiple spell-check functions, that advantage was not fully exploited, and the sentence recognition function, which is equally a task of the morphological analysis unit, existed only in a single form.

【０００５】[0005]

Documents that require machine translation are of many kinds. Among the uses of English-Japanese machine translation for industrial documents, manuals are a representative genre. Manuals are voluminous and written in relatively plain English, so they are said to be well suited to machine translation. However, manuals are written in a different form from the English of magazines and papers. They contain many items that only resemble sentences, or that do not look like sentences at all: titles, lists, headings, and items that fit none of these categories. Unlike ordinary sentences, these are not necessarily self-contained units of meaning ending in a sentence-end symbol such as a period. Deciding where each so-called sentence (one unit of translation processing) begins and ends is a difficult problem.

【０００６】[0006]

Documents can also be input in various ways: keyboard entry, disk media, communication lines, OCR (a character recognition device), and so on. OCR, which mechanically reads characters printed on paper and converts them into electronic data, is valued as especially convenient. For example, alphanumeric text read by OCR is translated by an English-Japanese machine translation device. It is very convenient if spell checking and sentence recognition are performed after OCR reading and just before translation, because OCR is not infallible and some reading errors always occur, so a person must look over the text to some extent. It is most convenient if spell checking and sentence recognition finish as soon as the page has been read. Because immediacy is required and speed matters in this situation, sentence recognition does not consult dictionary information for words and decides solely from the morphological features of the character string. This method sometimes fails on abbreviations that end in a period and on similar cases. Resolving such cases requires a dictionary lookup, so a sentence recognition method that uses dictionary lookup should also be provided.
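The trade-off described above, between fast surface-only sentence recognition and recognition backed by a dictionary lookup, can be illustrated with a minimal Python sketch. It is not the method of the patent: the function names and the small `ABBREVIATIONS` set (standing in for a real dictionary lookup) are assumptions made for illustration.

```python
# Hypothetical abbreviation list standing in for a dictionary lookup.
ABBREVIATIONS = {"Fig.", "No.", "e.g.", "i.e.", "etc."}

SENTENCE_END = ".!?"


def _split(text, is_abbreviation):
    """Group whitespace-separated tokens into sentences."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in SENTENCE_END and not is_abbreviation(token):
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a sentence-end symbol
        sentences.append(" ".join(current))
    return sentences


def split_fast(text):
    """Surface-only recognition: any token ending in . ! ? closes a sentence."""
    return _split(text, lambda token: False)


def split_with_dictionary(text, abbreviations=ABBREVIATIONS):
    """Same scan, but tokens found in the abbreviation list do not close a sentence."""
    return _split(text, lambda token: token in abbreviations)
```

On the input `"See Fig. 2 for details. Press the key."` the fast variant breaks after `Fig.`, while the dictionary-backed variant keeps the first sentence whole. The lookup costs extra work per token, which is why both variants might be kept and chosen according to the situation.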

【０００７】[0007]

[Purpose] The present invention has been made in view of the circumstances described above. The system needs to perform, to some degree automatically, the sentence division and recombination that users currently carry out by hand. The objects of the invention are as follows: to provide a machine translation system that handles long sentences and sentences that are morphologically difficult to translate (for example, list sentences) either automatically or on simple designation by the user; to provide multiple sentence recognition means for the original text and, when the original text lacks sentence-end symbols, to recognize sentences from the character attributes and layout information of the text, where the layout information is the vertical and horizontal arrangement of the constituent elements of the text (words, symbols, and the like), so that blank sentences, title sentences, and list sentences can be recognized from the vertical arrangement and symbols and headwords from the horizontal arrangement; to make some degree of recognition possible even for unusual sentences; and to provide a machine translation system that performs sentence recognition suited to varying conditions such as the text, the situation, and the use.

【０００８】[0008]

[Constitution] To achieve the above objects, the present invention is characterized by the following.

(1) In a machine translation system that analyzes an input original sentence in a source language and outputs a translated sentence in a target language, the system has dividing means for dividing the input data forming the original text into sentences and further dividing each sentence into translation units, and processing means for translating each of the divided translation units separately and obtaining the overall translation result by combining the translated units when the translated sentence is generated.
(2) Furthermore, the system has analysis means that assigns the relationships between division units at division time and analyzes on the basis of this division-time information.
(3) Furthermore, the system has control means that uses the division-time information to control translation order, inflection, and the like so that appropriate translations are obtained.
(4) Furthermore, the dividing means is used to process list sentences.
(5) Furthermore, alternative translations are held for each translation unit, and the user can select and use them through editing operations on that unit.
(6) Furthermore, the user specifies whether translation unit division is performed.
(7) Furthermore, the user specifies the units into which the text is divided at translation time.
(8) Alternatively, in a machine translation system comprising an input device for inputting an original text in a first language to a processing device, a display device for displaying the information input by the input device, a processing device for processing information, and an output device for outputting a second language, multiple sentence recognition means are provided for the original text.
(9) Furthermore, in (8), when the original text has no sentence-end symbol, the sentence recognition means recognizes sentences from the character attributes of the original text and from the vertical and horizontal layout information of its constituent elements (words, symbols, and the like).
(10) Furthermore, in (9), the sentence recognition means comprises blank sentence recognition means, title sentence recognition means, list sentence recognition means, symbol part recognition means, and headword recognition means.
(11) Furthermore, in (10), the blank sentence recognition means recognizes a blank sentence when line breaks appear consecutively.
(12) Furthermore, in (10), when the text immediately before the target text has been recognized as a sentence, the target text begins at the start of a line, and the next line begins with something other than a lowercase letter, the title sentence recognition means recognizes the target text as a title sentence if its first word is not a preposition, article, or conjunction and each word begins with an uppercase letter, a symbol, or a digit.
(13) Furthermore, in (10), when the text immediately before the target text has been recognized as a sentence, the target text begins at the start of a line, and the next line begins with something other than a lowercase letter, the title sentence recognition means recognizes the target text as a title sentence if the number of words in the line is less than a specified value and the next line is indented more deeply.
(14) Furthermore, in (10), the list sentence recognition means recognizes list sentences when original text whose leading parts have the same type (a number word, a single alphabetic letter, or a symbol character such as an asterisk) and the same position appears consecutively.
(15) Furthermore, in (10), when text has been recognized as a list sentence, the symbol part recognition means recognizes its leading part as a symbol part.
(16) Furthermore, in (10), when a number word or single alphabetic letter at the start of the original text is enclosed in brackets, the symbol part recognition means recognizes the bracketed portion as a symbol part.
(17) Furthermore, in (10), when a closing bracket, period, colon, or semicolon is attached to a number word or single alphabetic letter at the start of the original text, the symbol part recognition means recognizes the portion so marked as a symbol part.
(18) Furthermore, in (10), when the original text begins with a symbol character (an asterisk or the like), the symbol part recognition means recognizes that symbol character as a symbol part.
(19) Furthermore, in (10), when the text immediately before the target text has been recognized as a sentence or as a headword, and the target text contains within a line a run of spaces at least as long as a specified value, the headword recognition means recognizes the portion from the start of the line up to the spaces as a headword.
(20) Furthermore, in (10), when the text immediately before the target text has been recognized as a sentence or as a headword, the target text contains within a line a run of multiple spaces no longer than the specified value, and the character immediately after the run is not a lowercase letter, the headword recognition means recognizes the portion from the start of the line up to the spaces as a headword if any of the following holds: there is a sentence-end symbol at the end of the line; the whole line has already been recognized as ending a sentence; the positions directly below the span from the start of the line to the last run of spaces are all spaces; or there is only one word before the run of spaces.
(21) Furthermore, in (10), symbol parts recognized by the symbol part recognition means are excluded from syntactic analysis.
(22) Alternatively, in a machine translation system comprising the input device, display device, processing device, and output device described in (8), a provisional sentence recognition result (a sentence being one unit of translation processing) for the first-language text is displayed.
(23) Alternatively, in a machine translation system so configured, part of the sentence recognition result for the first-language text is highlighted.
(24) Alternatively, in a machine translation system so configured, multiple sentence recognition results for the first-language text are prepared, and one is selected by referring to them.
(25) Alternatively, in a machine translation system comprising the input device, display device, processing device, and output device described in (8) together with multiple sentence recognition means for recognizing sentences in the original text, the list sentence recognition means among them changes the syntactic role given to the sentence-end candidate information in a list introduction sentence according to the part of speech of the word directly connected to that information, and assigns a translation to the sentence-end candidate information according to that part of speech.
(26) Furthermore, in (25), a list-sentence-end connecting word that tends to occur at the end of list sentences recognized by the list sentence recognition means is identified, and that connecting word is attached to the beginning of the next sentence that is not a symbol part.
(27) Alternatively, in a machine translation system comprising an input device in which the original text in the first language is read by a character recognition device, a display device for displaying the information input by the input device, a processing device for processing information, and an output device for outputting the second language, the system comprises: line range specifying means that determines whether the document is written horizontally or vertically and measures the line spacing; calculating means that obtains the inter-word and inter-character distances and determines the position and length of each word; indent calculating means that obtains indent distances from the inter-word distances and determines the types of indent from the distribution of indent distances; document format estimating means that recognizes linguistic features, such as whether a line is a title line or a list sentence, from the indent types and the like determined by the indent calculating means; and character recognition means that recognizes words or characters on the basis of the document format estimating means.
(28) Alternatively, in a machine translation system comprising an input unit for inputting an original text in a first language, a translation dictionary unit holding second-language translations, semantic information, and the like for words of the first language, a morphological analysis unit that morphologically analyzes the input text, a syntactic analysis unit that syntactically analyzes the input text, a transfer and generation unit that converts to and generates the second language, and an output unit that outputs the resulting translation, when the sentence immediately before a sentence whose analysis failed is a title sentence that also serves as the title of that sentence, the failed sentence and the preceding title sentence are concatenated and translated as a single sentence.
(29) Furthermore, in (28), when the sentence immediately before the sentence whose analysis failed is a title sentence that also serves as its title, translation is performed after syntactic information, such as a part of speech or the role of subject or predicate, is assigned to the first word of the failed sentence and to the preceding title portion.
(30) Furthermore, if the translation processing of (29) ends in analysis failure, the translation processing of (28) is performed.
(31) Furthermore, the translated words corresponding to the title are deleted from the translation obtained by the processing of (28) or (29) before the sentence is generated.
(32) Furthermore, the translated words corresponding to the title in the translation obtained by the processing of (28) or (29) are replaced with a pronoun before the sentence is generated.

The invention is described below on the basis of embodiments.
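The layout-based rules in items (13) and (19) can be sketched roughly as follows in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the `max_words` and `min_gap` thresholds, and the `previous_closed` flag (standing in for "the preceding text has already been recognized as a sentence") are all hypothetical, and the headword check omits that precondition for brevity.

```python
import re


def indent_of(line):
    """Number of leading spaces in a line."""
    return len(line) - len(line.lstrip(" "))


def is_title_by_layout(line, next_line, previous_closed=True, max_words=6):
    """Item (13) style check: few words, next line not lowercase and indented deeper."""
    stripped_next = next_line.lstrip(" ")
    if not previous_closed or not line.strip() or not stripped_next:
        return False
    if stripped_next[0].islower():
        return False
    few_words = len(line.split()) < max_words
    deeper = indent_of(next_line) > indent_of(line)
    return few_words and deeper


def split_headword(line, min_gap=3):
    """Item (19) style check: a run of at least min_gap spaces inside the line
    separates a headword from the text that follows it.
    Returns (headword, rest), or (None, line) when no such gap exists."""
    match = re.search(r"(?<=\S) {%d,}(?=\S)" % min_gap, line)
    if match is None:
        return None, line
    return line[:match.start()].strip(), line[match.end():]
```

For example, `split_headword("PRINT     Prints the current file.")` separates `PRINT` from its description, and a short line followed by a more deeply indented line beginning with a capital letter passes the title check. Both checks use only spacing and letter case, which is the point of using layout information when no sentence-end symbol is present.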

【０００９】[0009]

FIG. 1 is a block diagram for explaining an embodiment of the machine translation system according to the present invention. In the figure, 1 is an input unit, 2 an original text storage unit, 3 a translation unit, 4 an editing control unit, 5 a translation dictionary unit, 6 a translation storage unit, 7 a display control unit, 8 a display unit, 9 a printing unit, and 10 a dictionary editing unit.

【００１０】[0010]

This embodiment describes an English-Japanese machine translation device that takes English sentences as input and produces Japanese translations. First, the original text is entered through input unit 1, which consists of a keyboard or the like, and is sent to original text storage unit 2. Under the control of editing control unit 4, translation unit 3 translates the original text held in original text storage unit 2 one processing unit at a time, using the information in translation dictionary unit 5. The resulting translations are stored sequentially in translation storage unit 6. Editing control unit 4 drives display control unit 7 so that the original text stored in original text storage unit 2 and the translation stored in translation storage unit 6 are shown on display unit 8 in correspondence with each other. The operator performs post-editing while viewing this display. Post-editing is executed by editing control unit 4 according to control information entered through input unit 1. Dictionary editing unit 10 corrects, changes, and deletes the contents of translation dictionary unit 5 under the control of editing control unit 4, and these changes are reflected at translation time. After post-editing, the finished translation and the original text are output by printing unit 9.

【００１１】[0011]

FIG. 2 is a block diagram of the translation unit. In the figure, 11 is a morphological analysis unit, 12 a syntactic analysis unit, 13 a syntactic transfer unit, 14 a morphological generation unit, 15 a word dictionary, 16 grammar rules, and 17 generation rules. The figure shows the four major processes of the translation unit. First, morphological analysis unit 11 uses word dictionary 15 to perform ordinary morpheme segmentation and related processing. Syntactic analysis unit 12 takes the information on the individual words and builds a tree structure according to grammar rules 16. Syntactic transfer unit 13 converts the tree structure of the input language into a tree structure of the output language. Morphological generation unit 14 renders the resulting tree into the output language node by node, using generation rules 17.
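The four stages described above (morphological analysis, syntactic analysis, syntactic transfer, and generation) can be pictured with a toy Python pipeline. Everything in it is an illustrative assumption: the three-word `WORD_DICT`, the single grammar rule, and the fixed particle choice are placeholders, far simpler than the word dictionary 15, grammar rules 16, and generation rules 17 of the embodiment.

```python
from dataclasses import dataclass, field

# Toy word dictionary: surface form -> (part of speech, target word).
WORD_DICT = {
    "i": ("PRON", "私"),
    "read": ("VERB", "読む"),
    "books": ("NOUN", "本"),
}


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    word: str = ""


def morphological_analysis(sentence):
    """Split into words and attach part of speech and target word."""
    return [(w, *WORD_DICT[w.lower()]) for w in sentence.rstrip(".").split()]


def syntactic_analysis(morphemes):
    """Build a source-language tree with one toy rule: S -> PRON (VERB NOUN)."""
    if [m[1] for m in morphemes] != ["PRON", "VERB", "NOUN"]:
        raise ValueError("toy grammar only covers PRON VERB NOUN")
    subj, verb, obj = (Node(pos, word=target) for _, pos, target in morphemes)
    return Node("S", [subj, Node("VP", [verb, obj])])


def syntactic_transfer(tree):
    """Reorder English subject-verb-object into Japanese subject-object-verb."""
    subj, vp = tree.children
    verb, obj = vp.children
    return Node("S", [subj, obj, verb])


def generate(tree):
    """Emit target words node by node, adding particles by position."""
    subj, obj, verb = tree.children
    return f"{subj.word}は{obj.word}を{verb.word}"


def translate(sentence):
    tree = syntactic_analysis(morphological_analysis(sentence))
    return generate(syntactic_transfer(tree))
```

Calling `translate("I read books.")` walks one sentence through all four stages. Keeping transfer as a separate tree-to-tree step means the reordering logic stays independent of both the parser and the generator.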

【００１２】[0012]

FIG. 3 is a flowchart for explaining the operation of the morphological analysis unit in FIG. 2. The steps are described in order below.

step 1: A character string in the input language is read.
step 2: Lexical analysis and sentence recognition are performed using English word dictionary 19.
step 3: The list part condition check determines, using the list part conditions, whether the sentence contains a list part. If a list part is recognized, the position of its symbol part and the position of the sentence following the symbol part are passed to the pattern matching unit.
step 4: The pattern matching unit then uses pattern recognition rules 18 to determine whether each recognized sentence unit can be divided into processing units.
step 5: Each part unit recognized in step 3 is then treated as an analysis unit, using English word dictionary 19, and syntactic analysis and the subsequent processing are performed on it.

【００１３】[0013]

A list sentence is a general term for a sentence introduced by a symbol. Symbols such as ・, ＊, ○ (a hollow circle), (1), (2), ... and a), b), ... are used. The items of a list may each be a sentence in their own right, or they may together form a single sentence. The first case is relatively easy to handle: it is enough to recognize that a special symbol precedes the sentence and to analyze the remainder as a sentence. The difficulty lies with sentences that contain a list embedded within them (hereinafter, a list part). In this system such cases are resolved by dividing the sentence, analyzing each part, and combining the results afterward.

【００１４】リスト部の解析は次のような条件で行う。（１）その行が改行されており、インデントされ、先頭
が以下の場合リスト文と仮定する数字語アルファベット１文字アスタリスクなどの記号（２）その行が改行されており、インデントされ、先頭
が以下の場合リスト文と確定する。括弧がついた記号数字ｅｘ）（１），（ａ），… １），ａ），… ピリオド（コロン、セミコロン）がついた数字記号ｅ
ｘ）１．ａ．（３）上記の場合において改行インデントなしで文中に
存在する場合はリスト部として仮定する。The list part is analyzed under the following conditions.
(1) If the line follows a line break, is indented, and begins with one of the following, it is tentatively assumed to be a list sentence: a numeric word; a single alphabetic character; a symbol such as an asterisk.
(2) If the line follows a line break, is indented, and begins with one of the following, it is confirmed as a list sentence: a symbol or number with parentheses, e.g. (1), (a), ... 1), a), ...; a number or symbol followed by a period (or a colon or semicolon), e.g. 1. a.
(3) If one of the above occurs inside a sentence, without a line break and indentation, it is tentatively assumed to be a list part.

【００１５】リスト部における先頭の記号等を記号部と
呼ぶ。仮定と確定の違いは、確定の場合は無条件でリス
ト部とし、仮定の場合は後続する同様なリスト部があ
り、それが同種の記号部を持ち、出現カラム位置が同じ
でなければならない。ただし、（３）の場合、出現カラ
ム位置ではなく記号部の種類だけで判別する。以上のよ
うなリスト部が満たさなければならない条件をリスト部
条件と呼ぶ。The leading symbol or the like of a list part is called the symbol part. The difference between tentative and confirmed is this: a confirmed candidate is treated as a list part unconditionally, whereas a tentative candidate must be followed by a similar list part that has the same kind of symbol part and appears at the same column position. In case (3), however, the decision is based only on the kind of symbol part, not on the column position. The conditions above that a list part must satisfy are called the list part conditions.
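
The list part conditions can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the system described here: the regular expressions, the function names `classify_marker`, `kind`, and `confirms`, and the exact set of bullet characters are all hypothetical choices made for the example.

```python
import re
from typing import Optional, Tuple

# Condition (2): markers that confirm a list sentence outright,
# e.g. "(1)", "(a)", "1)", "a)", "1.", "a:". (Assumed patterns.)
CONFIRMED = re.compile(r"^(\(?(?:\d+|[A-Za-z])\)|(?:\d+|[A-Za-z])[.:;])(?=\s|$)")
# Condition (1): markers that are only tentative,
# e.g. a bare number, a single letter, or a symbol such as an asterisk.
TENTATIVE = re.compile(r"^(\d+|[A-Za-z]|[*・○])(?=\s|$)")

Candidate = Tuple[str, Optional[str], int]


def classify_marker(line: str) -> Candidate:
    """Return (status, symbol part, column) for one line of source text."""
    stripped = line.lstrip(" ")
    column = len(line) - len(stripped)
    if column == 0:
        return ("none", None, 0)  # conditions (1) and (2) require indentation
    for status, pattern in (("confirmed", CONFIRMED), ("tentative", TENTATIVE)):
        match = pattern.match(stripped)
        if match:
            return (status, match.group(1), column)
    return ("none", None, column)


def kind(marker: str) -> str:
    """Coarse kind of a symbol part: digit, letter, or symbol."""
    core = marker.strip("().:;")
    if core.isdigit():
        return "digit"
    return "letter" if core.isalpha() else "symbol"


def confirms(first: Candidate, following: Candidate) -> bool:
    """A tentative candidate holds only if a later candidate has the
    same kind of symbol part at the same column position."""
    if first[1] is None or following[1] is None:
        return False
    return kind(first[1]) == kind(following[1]) and first[2] == following[2]
```

A tentative candidate such as an indented `*` line is kept only when `confirms` succeeds against a following line; a confirmed candidate needs no such check.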

【００１６】図４は、パターンマッチング規則を示した
図である。これにより、文中にある分割処理すべき対象
を特定し、分割した処理単位間の関係を決定し、分割単
位にそれらの順序・階層関係、ロール、ゴール等の情報
を付加する。ルールの記述形式は次のとおりであ
る。［］は分割される単位を示し、その内部にマッチン
グ条件が書かれる。［］内の‘〜’は任意の文字列との
マッチングを示す。［］の‘＜数値＞：’はマッチング
する最小文字数である。この文字数以下の場合マッチン
グに成功しても無視される。これは余り小さい範囲を分
割するより従来通り同時に処理するほうがよいと思われ
る場合に指定する。パターン記述の後ろに書かれる＄＜
数字＞を含んだ記述は、マッチングに成功した場合の動
作を示す。＄の後ろの数字は前の［］を現われた順に番
号付けしたものを示す。＄＜数字＞＝カテゴリ、は分割
された部分の解析の際のゴールを示し、＄＜数字＞＜−
＄＜数字＞は分割部分の従属関係を示す。〔ＫＰ〜〕＊
の‘＊’は繰返しを表わし、‘ＫＰ’はリスト文のリス
ト部の記号部を表わす記号である。FIG. 4 shows the pattern matching rules. These rules identify the targets in a sentence that should be divided, determine the relationships among the divided processing units, and attach to each divided unit information such as its order and hierarchy, role, and goal. The rule notation is as follows. [ ] denotes a unit to be divided, and the matching condition is written inside it. '~' inside [ ] matches an arbitrary character string. '<number>:' inside [ ] is the minimum number of characters to match; if the matched string does not reach this length, the match is ignored even though it succeeded. This is specified when it seems better to process the text together, as before, than to split off too small a span. A description containing $<number> written after the pattern gives the action taken when matching succeeds. The number after $ refers to the preceding [ ] units, numbered in order of appearance. '$<number> = category' gives the goal used when analyzing the divided part, and '$<number> <- $<number>' gives the dependency relation between divided parts. In [KP ~]*, '*' denotes repetition and 'KP' is a symbol that stands for the symbol part of a list part in a list sentence.

【００１７】例えば、規則１は１０文字以上の単位であ
る文字列の後ろにregarding toという文字列を含む単位
が複数個来たときそれぞれを翻訳単位とし、regarding
to以前の塊を文として解析し、regarding toを前置詞句
をゴールとして解析する事を示す。さらに、その前置詞
句は最初の文に含まれるように生成処理されるべきこと
を示している。規則２は引用符で埋めこまれた文を分割
し処理単位として処理する。規則３，４は箇条書きで良
く現われるto不定詞、if節以下が箇条書きされているよ
うなパターンを処理するために書かれたものである。For example, rule 1 states that when a character string forming a unit of 10 or more characters is followed by multiple units containing the string "regarding to", each is treated as a translation unit: the chunk before "regarding to" is analyzed as a sentence, and each "regarding to" unit is analyzed with a prepositional phrase as its goal. The rule further states that the prepositional phrase should be generated so that it is included in the first sentence. Rule 2 splits off a sentence embedded in quotation marks and handles it as a processing unit. Rules 3 and 4 were written to handle patterns that often appear in itemized lists, in which the material following a to-infinitive or an if-clause is itemized.

【００１８】規則３は次のことを意味している。最初の〔〜〕は任意の文字列である。〔ｔｏ：〕は〜ｔｏ：〜という文のパターンが来ること
を示す。〔ＫＰ〜〕＊はリスト部条件処理でリスト文と
認定された文に含まれる記号部をあらわす記号ＫＰとマ
ッチし、その後ろに任意の文字列が続くものが複数回来
ることを示している。このルールのアクションとしては
最初の〔〜〕つまり＄１はＳＥ（文）として解析され、
＄２＋＄３つまり“ｔｏ：ＫＰ〜”の形がprepfつまり
前置詞句として解析される。ただし、‘：’‘ＫＰ’は
解析の課程では存在しないかのように扱われる。この＄
２＋＄３をprepfをゴールとして解析する処理は複数あ
る場合はそれぞれについて行われる。さらに、＄１＜−
＄２＋＄３は結果として形態素生成部においては従属す
るノードとして扱われることを示す。また、このルール
によって＄１の解析は、＄１＋＜prepf＞の形、つま
り、最初のマッチングにダミーの形態素単位＜prepf＞
を付加して解析されることを示している。Rule 3 means the following. The first [~] is an arbitrary character string. [to:] indicates that a sentence pattern of the form "~ to: ~" occurs. [KP ~]* indicates that a unit occurs multiple times in which the symbol KP, standing for a symbol part contained in a sentence recognized as a list sentence by the list part condition processing, is followed by an arbitrary character string. As the action of this rule, the first [~], that is $1, is analyzed as SE (sentence), and $2+$3, that is the form "to: KP ~", is analyzed as prepf, a prepositional phrase. However, ':' and 'KP' are treated during analysis as if they were absent. When there are multiple $2+$3 units, the analysis with prepf as the goal is performed for each of them. Furthermore, $1 <- $2+$3 indicates that the result is treated as a dependent node in the morpheme generation unit. The rule also indicates that $1 is analyzed in the form $1 + <prepf>, that is, with a dummy morpheme unit <prepf> appended to the first match.

【００１９】例えば次にような文が入力されるとき You must supply the RECFM values if : 1. the data set resides on an unlabeled tape volume or 2. the data set is nor included in a DSCB. 規則４によって次のように分割される。 You must supply the RECFM values ＜if-clause＞. (ゴール：文） 1. if the data set resides on an unlabeled tape volume or＜KF＞. （ゴール：if節） 2. if the data set is nor included in a DSCB. （ゴール：if節）＜＞はダミーの形態素単位で分割された文の処理単位に
対応する。つまり、解析ではこれらが訳語のない記号が
あるかのように解析される。また文の後ろに書かれてい
るようなゴール等の情報が処理単位と共に後ろのフェー
ズに渡される。つまり構文解析では全く別の文として従
来どうり処理すればよい。For example, suppose the following sentence is input:
You must supply the RECFM values if :
1. the data set resides on an unlabeled tape volume or
2. the data set is not included in a DSCB.
Rule 4 divides it as follows:
You must supply the RECFM values <if-clause>. (goal: sentence)
1. if the data set resides on an unlabeled tape volume or <KF>. (goal: if-clause)
2. if the data set is not included in a DSCB. (goal: if-clause)
Each < > is a dummy morpheme unit that corresponds to a processing unit of the divided sentence. In analysis, these are handled as if they were symbols with no translation. Information such as the goal written after each sentence is passed to the later phases together with the processing unit. Syntactic analysis can therefore process each unit as before, as if it were an entirely separate sentence.
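
The division in this example can be sketched in Python as follows. The sketch hard-codes one pattern in the spirit of rule 4 rather than interpreting the rule notation of FIG. 4; the names `split_if_list`, `HEAD_IF`, and `SYMBOL_PART` are hypothetical, only numbered symbol parts such as "1." are handled, and the `<KF>` dummy unit is left out for brevity.

```python
import re
from typing import List, Tuple

# "HEAD if : ITEMS" (assumed surface form of the rule 4 pattern)
HEAD_IF = re.compile(r"^(?P<head>.+?)\s+if\s*:\s*(?P<items>.+)$")
# A numbered symbol part such as "1." at the start or after whitespace
SYMBOL_PART = re.compile(r"(?:^|\s)(\d+\.)\s+")


def split_if_list(sentence: str) -> List[Tuple[str, str]]:
    """Divide 'HEAD if : 1. A 2. B' into (processing unit, goal) pairs."""
    match = HEAD_IF.match(sentence)
    if match is None:
        return [(sentence, "sentence")]
    pieces = SYMBOL_PART.split(match.group("items"))
    # pieces = [text before first marker, marker, text, marker, text, ...]
    markers, texts = pieces[1::2], pieces[2::2]
    if not markers:
        return [(sentence, "sentence")]
    # The head keeps a dummy morpheme unit where the if-clauses belonged.
    units = [(match.group("head") + " <if-clause>.", "sentence")]
    for marker, text in zip(markers, texts):
        units.append((f"{marker} if {text.strip()}", "if-clause"))
    return units
```

Each returned pair carries the goal alongside the text, mirroring how the goal information travels with the processing unit to the later phases.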

【００２０】生成ではこれらの解析単位を文の単位とし
てまとめ上げる。まず従来通りの生成処理を行なう。但
し、ダミーの形態素単位はそのままである。あなたは＜
if-clause＞RECFMの値を与えなければならない。 1. もし、データセットがラベル付けされていないテー
プボリューム上にあるとき又は＜KF＞ 2. もし、データセットが DSCB を含んでいないときこの３つの文が解析単位の生成結果として得られる。こ
れを合成処理して全体の翻訳結果を得る。あなたは 1. もし、データセットがラベル付けされていないテー
プボリューム上にあるとき又は 2. もし、データセットが DSCB を含んでいないときREC
FMの値を与えなければならない。In generation, these analysis units are assembled back into a single sentence. First, conventional generation is performed, with the dummy morpheme units left as they are. This yields three Japanese sentences as the generation results of the analysis units, which correspond in English to:
You must give the value of RECFM <if-clause>.
1. If the data set is on an unlabeled tape volume, or <KF>
2. If the data set does not contain a DSCB
These are then synthesized to obtain the overall translation, which corresponds to:
You must give the value of RECFM 1. if the data set is on an unlabeled tape volume, or 2. if the data set does not contain a DSCB.

【００２１】図５は、パターンマッチング処理を説明す
るためのフローチャートである。以下、各ステップに従
って順に説明する。step１；まず、分割処理された処理単位かどうか判断す
る。step２；分割処理された処理単位であれば通常の生成処
理を行う。step３；１文全部の解がそろったかどうか判断する。step４；解がそろっていれば、解合成処理を行う。step５；解がそろっていなければ、解を保持する。step６；前記step１において分割処理された処理単位で
なければ、通常の生成処理を行い終了する。FIG. 5 is a flowchart explaining the pattern matching process. Each step is described below in order.
step1: First, it is determined whether the unit is a processing unit produced by division.
step2: If it is a divided processing unit, normal generation processing is performed.
step3: It is determined whether the solutions for the whole sentence are complete.
step4: If the solutions are complete, solution synthesis is performed.
step5: If the solutions are not complete, the solution is held.
step6: If the unit was found in step 1 not to be a divided processing unit, normal generation processing is performed and the process ends.
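
Steps 3 to 5 amount to holding partial solutions until every unit of a sentence has been generated and then merging them. A minimal Python sketch follows; the class `SolutionBuffer`, the function `synthesize`, the unit numbering from 0, and the slot name `<if-clause>` are assumptions for illustration, not details given in the text.

```python
from typing import Dict, List, Optional


class SolutionBuffer:
    """Holds generated units until all units of a sentence are present."""

    def __init__(self) -> None:
        self._held: Dict[int, Dict[int, str]] = {}

    def add(self, sentence_id: int, index: int, total: int,
            generated: str) -> Optional[List[str]]:
        held = self._held.setdefault(sentence_id, {})
        held[index] = generated            # step 5: hold the solution
        if len(held) < total:              # step 3: not yet complete
            return None
        del self._held[sentence_id]
        # Complete: return units in order, ready for synthesis (step 4).
        return [held[i] for i in range(total)]


def synthesize(units: List[str], slot: str = "<if-clause>") -> str:
    """Replace the dummy morpheme unit in the main unit with the others."""
    main, *subordinate = units
    return main.replace(slot, " ".join(subordinate), 1)
```

Units may arrive in any order; `add` returns the ordered list only when the last one arrives.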

【００２２】図６は、編集制御部における動作を説明す
るためのフローチャートである。以下、各ステップに従
って順次説明する。step１；まず、翻訳部における翻訳が終了したかどうか
判断する。終了していれば訳文を表示する。step２；次に翻訳が終了していなければ、辞書編集指示
があるかどうか判断する。辞書編集指示があれば辞書編
集処理を行う。step３；次に辞書編集指示がなければ、翻訳指示がある
かどうか判断する。翻訳指示があれば翻訳処理を行う。step４；次に翻訳指示がなければ、原文の入力指示があ
るかどうか判断する。原文の入力指示があればその原文
を原文記憶部に格納し、原文表示変更を行う。step５；次に入力指示がなければ、編集指示があるかど
うか判断する。編集指示があれば編集処理を行う。step６；次に編集指示がなければ、別訳指示があるかど
うか判断する。別訳指示があれば別訳表示を行い、原文
表示変更を行う。step７；次に別訳指示がなければ、その他の処理を実行
してstep１に戻る。FIG. 6 is a flowchart explaining the operation of the edit control unit. Each step is described below in order.
step1: First, it is determined whether translation in the translation unit has finished. If it has, the translation is displayed.
step2: If translation has not finished, it is determined whether there is a dictionary editing instruction. If there is, dictionary editing is performed.
step3: If there is no dictionary editing instruction, it is determined whether there is a translation instruction. If there is, translation is performed.
step4: If there is no translation instruction, it is determined whether there is an instruction to input source text. If there is, the source text is stored in the source text storage unit and the source text display is updated.
step5: If there is no input instruction, it is determined whether there is an editing instruction. If there is, editing is performed.
step6: If there is no editing instruction, it is determined whether there is an alternative-translation instruction. If there is, the alternative translation is displayed and the source text display is updated.
step7: If there is no alternative-translation instruction, other processing is executed and the process returns to step1.

【００２３】このように、別訳表示では従来行なわれて
いた１文の複数の解候補の表示・選択処理だけではなく
解析単位別に解析候補を表示する。このことにより、ユ
ーザに単に大量の別訳を表示するよりも解かりやすい部
分ごとの解候補を示すことができる。これを図７に示
す。また、今までの例では形態素処理で解析単位を分割
したがその処理をユーザが指定したり、分割処理を禁止
することをユーザに許すことは簡単に拡張できる。ま
た、今までの例では英日で説明したが日英や他の言語対
について本発明の方式を行なうことも有用であることも
明らかである。Thus, the alternative-translation display not only shows and lets the user select among multiple solution candidates for a whole sentence, as was done conventionally, but also shows analysis candidates for each analysis unit. This lets the user see solution candidates part by part, which is easier to understand than simply being shown a large number of alternative translations. This is shown in FIG. 7. In the examples so far, the analysis units were divided during morphological processing, but the method can easily be extended so that the user specifies the division or prohibits it. Also, although the examples so far have used English-to-Japanese translation, it is clear that the method of the present invention is also useful for Japanese-to-English and other language pairs.

【００２４】図８（ａ）,（ｂ）は、本発明による機械
翻訳方式の文末認定装置の一実施例を説明するための構
成図で、図中、４１は原文テキスト、４２は文末認定装
置、４２ａは文末記号がない場合の文末認定装置、４２
ｂは文末記号がある場合の文末認定装置、４３は文認定
後テキスト、４４ａは空文認定手段、４４ｂはタイトル
文認定手段、４４ｃはリスト文認定手段、４４ｄは記号
部認定手段、４４ｅは見出し語認定手段である。図８
（ａ）においては、原文のテキストに文末記号がある場
合と、文末記号がない場合の文末認定手段を有してい
る。さらに、文末記号がない場合には、図８（ｂ）に示
すように複数の文末認定手段を有している。FIGS. 8(a) and 8(b) are block diagrams explaining an embodiment of the sentence-end recognition device of the machine translation system according to the present invention. In the figures, 41 is the source text, 42 is the sentence-end recognition device, 42a is the sentence-end recognition device for the case with no sentence-end symbol, 42b is the sentence-end recognition device for the case with a sentence-end symbol, 43 is the text after sentence recognition, 44a is the empty-sentence recognition means, 44b is the title-sentence recognition means, 44c is the list-sentence recognition means, 44d is the symbol-part recognition means, and 44e is the headword recognition means. In FIG. 8(a), the device has sentence-end recognition means for the case where the source text has sentence-end symbols and for the case where it has none. For the case with no sentence-end symbol, it has multiple sentence-end recognition means, as shown in FIG. 8(b).

【００２５】図９は、機械翻訳方式を説明するための全
体構成図で、図中、６１は基本辞書、６２は文法ルー
ル、６３は分野別辞書、６４は翻訳部、６４ａは前処理
部、６４ｂは形態素解析部、６４ｃは構文解析部、６４
ｄは構文生成部、６４ｅは後処理部、６５は原文テキス
ト、６６は翻訳後テキストである。形態素解析部６４ｂ
では原文テキスト６５の辞書引きを行なう。辞書引きに
際しては基本辞書や分野別辞書を利用する。構文解析部
６４ｃでは、個々の語の情報を後で文法ルール６２に従
って解析し、解析結果から木構造を作成する。構文生成
部６４ｄでは得られた木構造をノードごとに訳出する。FIG. 9 is an overall configuration diagram explaining the machine translation system. In the figure, 61 is the basic dictionary, 62 is the grammar rules, 63 is the field-specific dictionaries, 64 is the translation unit, 64a is the preprocessing unit, 64b is the morphological analysis unit, 64c is the syntax analysis unit, 64d is the syntax generation unit, 64e is the post-processing unit, 65 is the source text, and 66 is the translated text. The morphological analysis unit 64b looks up the words of the source text 65 in the dictionaries, using the basic dictionary and the field-specific dictionaries. The syntax analysis unit 64c then analyzes the information on the individual words according to the grammar rules 62 and builds a tree structure from the result. The syntax generation unit 64d translates the resulting tree structure node by node.

【００２６】次に構成（１１）の方式の実施例について
説明する。例えば下のような文書が入力部より入力され
た場合に、改行コードの連続を捜し出す。もし改行コー
ドが連続してあれば空文であると認識し、連続する改行
コードの直前までを文として認定する。以下の例では、
‘§’を改行コードとする。この場合なら、第一行目の“The Machine Translation
System”が文として認定される。また、例えばユーザー
が文末を指定する際にも、空文がくれば必ず文末認定さ
れるとわかっていれば、あらかじめ文末を指定すること
も修正することも簡単に実現される。Next, an embodiment of the method of configuration (11) is described. When a document such as the one below is input from the input unit, the text is searched for consecutive line feed codes. If line feed codes occur consecutively, an empty sentence is recognized, and the text up to just before the consecutive line feed codes is recognized as a sentence. In the example below, '§' denotes a line feed code. In this case, "The Machine Translation System" on the first line is recognized as a sentence. Also, when the user specifies sentence ends, knowing that an empty sentence always produces a sentence end makes it easy to specify or correct sentence ends in advance.

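Empty-sentence recognition of this kind can be sketched in Python with a regular expression split. This is an illustration only: the function name `split_at_empty_sentences` is hypothetical, and treating a line that contains only spaces or tabs as empty is an assumption not stated in the text.

```python
import re
from typing import List


def split_at_empty_sentences(text: str) -> List[str]:
    """Split text wherever two or more line feeds occur in a row."""
    # A line feed followed by one or more (possibly blank) empty lines.
    blocks = re.split(r"\n(?:[ \t]*\n)+", text)
    return [block.strip("\n") for block in blocks if block.strip()]
```

A single line feed inside a block is kept, so a sentence that wraps over several lines stays together.
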
【００２７】次に構成（１２）及び構成（１３）の方式
の実施例について説明する。文認定が直前で行なわれて
いてかつ行頭から始まり、次の行が小文字以外で始まっ
ている時に、以下の検査を行なう。図１０にそのフロー
チャートを示す。（１）各単語の先頭が大文字、記号、数字であるかどう
かを調べる。ただし、単語先頭文字が小文字でもそれが
前置詞、冠詞、接続詞類の単語であるなら、処理を継続
する。図１０のフローチャートでは、ＩＮＩＴＩＮＡＬ
ＣＡＰのチェックに相当する。（２）行を構成する単語数が規定値より少なく、次の行
のインデントがより深くなっているかどうかを調べる。
いずれかに該当するならば、その行をタイトル文である
として認識して文認定を行なう。例えば、の文書が入力されたなら、その第一行が上の(1)にあて
はまり、タイトル文と認定され、第二行と分離される。
また、例えば上の（２）の条件中の単語数の規定値を６
とした場合、以下の例の第一行がタイトル文と認定さ
れ、第二行と分離される。この場合なら、第一行目が文認定される。Next, an embodiment of the methods of configurations (12) and (13) is described. When a sentence has been recognized immediately before, the text starts at the beginning of a line, and the next line begins with something other than a lowercase letter, the following checks are performed. FIG. 10 shows the flowchart.
(1) Check whether each word begins with an uppercase letter, a symbol, or a digit. If a word begins with a lowercase letter but is a preposition, article, or conjunction, processing continues. This corresponds to the INITIAL CAP check in the flowchart of FIG. 10.
(2) Check whether the number of words in the line is below a specified value and the next line is indented more deeply.
If either check succeeds, the line is recognized as a title sentence and sentence recognition is performed. For example, when such a document is input, its first line satisfies (1) above, is recognized as a title sentence, and is separated from the second line. Likewise, if the specified word count in condition (2) is set to 6, the first line of the following example is recognized as a title sentence and separated from the second line. In this case, the first line is recognized as a sentence.

【００２８】次に構成(１４)および構成(１５)の方式の
実施例について説明する。ここでは例として、構成（１
４）に構成（１５）の手段を含むアルゴリズムフローに
従って説明していくが、二つの手段を切り離して作成す
ることも可能である。図１１（ａ）,（ｂ）にそのフロ
ーチャートを示す。まず行単位に入力し、入力できなく
なれば終る。listflag ではリスト文認定が行なわれた
際にＯＮになっている。終了時（Ｂで始まる箇所）にこ
のフラグがＯＮになっている時は、最後の行がリスト文
であることを示している。次に記号文認定ルーチンによ
り、記号文であるかどうかの判別がされる。ここでいう
記号文とは、数字語、アルファベット１文字、アスタリ
スク等の記号文字類のいずれかで文が始まるものの総称
としての便宜的な呼び方である。また、記号文認定ルー
チンとは、入力が記号文であるかを調べるものであり、
文末認定は行なわない。記号文でない場合には、入力に
戻り、記号文の場合はリスト文の認定を行なうため、以
下の処理を行なう。Next, an embodiment of the methods of configurations (14) and (15) is described. The description here follows an algorithm flow in which the means of configuration (15) is included in configuration (14), but the two means can also be built separately. FIGS. 11(a) and 11(b) show the flowchart. Input is read line by line, and processing ends when no more input is available. listflag is ON when list sentence recognition has taken place. If this flag is ON at termination (the part beginning at B), the last line is a list sentence. Next, the symbol sentence recognition routine determines whether the line is a symbol sentence. "Symbol sentence" is used here as a convenient general term for a sentence that begins with a numeric word, a single alphabetic character, or a symbol character such as an asterisk. The symbol sentence recognition routine only checks whether the input is a symbol sentence; it does not perform sentence-end recognition. If the line is not a symbol sentence, processing returns to input. If it is a symbol sentence, the following processing is performed to recognize list sentences.

【００２９】（１）入力例は、記号文であると認定され
る。（２）記号文認定ルーチンにより、記号文である場合、
記号部のカラム位置と種類が得られる。それぞれ例えば
colml,kindl といった記憶域にメモリーしておく。（３）次に新たな行を入力する。以下、現在読み込んで
いる行を「現在行」、直前の行を「直前行」と呼ぶこと
とする。（４）現在行が空行の場合、空文による文末認定（構成
１１）が行なわれる（行なわれている）わけであるか
ら、もし直前までリスト文認定処理が続いていたのな
ら、記号部分離（構成１５）だけ行なう。（５）現在行の colml の位置にある文字が、 kindl の
示す種類と同じであるなら、記号文が連続していること
であるためリスト文であると判断される。そのとき、直
前行に対してリスト文として文認定（構成１４）と、記
号部分離を行なう。（６）また、リスト文認定が始まったことを示すため
に、 listflagをＯＮする。この場合、またさらに次の
行を入力し、同じ処理で認定されていく。(1) The input example is recognized as a symbol sentence.
(2) When the symbol sentence recognition routine finds a symbol sentence, it yields the column position and the kind of the symbol part. These are kept in storage areas named, for example, colml and kindl.
(3) Next, a new line is input. Hereinafter, the line currently being read is called the "current line" and the line immediately before it the "previous line".
(4) If the current line is an empty line, sentence-end recognition by empty sentence (configuration 11) is performed, or has already been performed, so if list sentence recognition was in progress up to the previous line, only symbol part separation (configuration 15) is performed.
(5) If the character at position colml of the current line is of the same kind as indicated by kindl, symbol sentences occur in succession, and the text is therefore judged to be a list sentence. In that case, the previous line is recognized as a list sentence (configuration 14) and its symbol part is separated.
(6) In addition, listflag is turned ON to indicate that list sentence recognition has started. The next line is then input and recognized by the same processing.

【００３０】さてここで、colml にある文字がスペース
類の場合は、次の三つの場合が考えられる。（１）リスト文が複数行に渡って書かれた場合。「＊ HP VISTA, HP SINE, and HP Model Data Manager installation …１ and update procedure. …２」２の行を読み込んだ時、行頭から colml までスペース
であり、記号文でないので次の行の入力へいく。（２）リスト文がさらに下のリスト文を抱えている場
合。「＊ Event History …１＊ System Exception reports, including …２ ― System Error Summary …３ ― Subsystem Exception reports for processors,channels, …４ DASD and tape device …５＊ Threshold Summary for some tape device …６」３の行を読み込んだ時、行頭から colml までスペース
であり、記号文であるので、２の行のリスト文認定処理
を行ない、記号位置と文字種を変更してから次の行の入
力へいく。When the character at colml is a space, the following three cases are possible.
(1) The list sentence is written over multiple lines.
＊ HP VISTA, HP SINE, and HP Model Data Manager installation …1
   and update procedure. …2
When line 2 is read, the span from the beginning of the line to colml is blank and the line is not a symbol sentence, so processing moves on to the input of the next line.
(2) The list sentence contains a further list below it.
＊ Event History …1
＊ System Exception reports, including …2
   ― System Error Summary …3
   ― Subsystem Exception reports for processors,channels, …4
     DASD and tape device …5
＊ Threshold Summary for some tape device …6
When line 3 is read, the span from the beginning of the line to colml is blank and the line is a symbol sentence, so list sentence recognition is performed for line 2, the stored symbol position and character kind are updated, and processing moves on to the input of the next line.

【００３１】（３）たまたまスペースが来た場合。「 ― System Error Summary …１ ― Detail Edits of selected records …２ All of these reports types are discussed in more detail in v …３」３の行を読み込んだ時、行頭から colml までスペース
でなく、１，２とリスト文処理の途中であるため、２の
行のリスト文認定処理を行ない、３の行がリスト文であ
るかどうかのチェックに戻る。(3) A space happens to occur.
   ― System Error Summary …1
   ― Detail Edits of selected records …2
All of these reports types are discussed in more detail in v …3
When line 3 is read, the span from the beginning of the line to colml is not blank, and list sentence processing is under way for lines 1 and 2, so list sentence recognition is performed for line 2 and processing returns to the check of whether line 3 is a list sentence.

【００３２】以上の流れによって、例えば上例の（２）
のような文書が入力された時、その結果は、以下のよう
になる。（１）「＊」（２）「Event History」（３）「＊」（４）「System Exception reports, including」（５）「―」（６）「System Error Summary」（７）「―」（８）「Subsystem Exception reports for processors,channels,」 DASD and tape device」（９）「＊」（１０）「Threshold Summary for some tape device」With the above flow, when a document like example (2) above is input, the result is as follows.
(1) "＊"
(2) "Event History"
(3) "＊"
(4) "System Exception reports, including"
(5) "―"
(6) "System Error Summary"
(7) "―"
(8) "Subsystem Exception reports for processors,channels, DASD and tape device"
(9) "＊"
(10) "Threshold Summary for some tape device"
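
The separation of symbol parts and the joining of continuation lines can be sketched in Python as follows. This is a deliberately simplified illustration: it treats every line that begins with a bullet symbol as a list sentence and skips the tentative check against the stored column position and kind (colml, kindl). The names `SYMBOL_LINE` and `split_list_sentences` and the set of bullet characters are assumptions.

```python
import re
from typing import List

# Optional indentation, one bullet symbol, whitespace, then the body.
SYMBOL_LINE = re.compile(r"^(?P<indent>\s*)(?P<mark>[*＊―-])\s+(?P<body>.*)$")


def split_list_sentences(lines: List[str]) -> List[str]:
    """Return symbol parts and list sentences as separate units."""
    units: List[str] = []
    body: List[str] = []   # text of the list sentence being collected
    in_list = False        # plays the role of listflag

    def flush() -> None:
        if body:
            units.append(" ".join(body))
            body.clear()

    for line in lines:
        match = SYMBOL_LINE.match(line)
        if match:                                  # a new symbol sentence
            flush()                                # close the previous one
            units.append(match.group("mark"))      # symbol part separated
            body.append(match.group("body").strip())
            in_list = True
        elif in_list and line[:1].isspace() and line.strip():
            body.append(line.strip())              # indented continuation
        else:
            flush()
            in_list = False
            if line.strip():
                units.append(line.strip())
    flush()
    return units
```

Applied to example (2) with the line labels removed and ASCII bullets, the sketch yields the same ten units as the result listed above.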

【００３３】次に構成（１６）の方式の実施例について
説明する。原文のテキストは行単位に処理される。その
先頭が、括弧類で囲まれた数字語あるいはアルファベッ
ト一文字である時、それを記号部として認定する。記号
部の認定とは、記号部の直前と直後で文末認定を行なう
ことを意味する。「The customer is responsible for （１） Adequate site and system planning and preparation. （２） Receipt, unpacking, and placement of the 5363 System Unit.」上例では、第二行目の「(１)」の部分が記号部であると
認定され、第一行目の「for」までが文末認定される。
さらには「(１)」の直後で文末認定される。また、第三
行目も同様に行なわれるため、結果として、以下のよう
に文が認定される。（１）「The customer is responsible for」（２）「(１)」（３）「Adequate site and system planning and preparation.」（４）「(２)」（５）「Receipt, unpacking, and placement of the 5363 System Unit.」 Next, an embodiment of the method of configuration (16) is described. The source text is processed line by line. When a line begins with a numeric word or a single alphabetic character enclosed in brackets, that marker is recognized as a symbol part. Recognizing a symbol part means setting a sentence end immediately before and immediately after it.
The customer is responsible for
(1) Adequate site and system planning and preparation.
(2) Receipt, unpacking, and placement of the 5363 System Unit.
In this example, "(1)" on the second line is recognized as a symbol part, so a sentence end is set after "for" on the first line and again immediately after "(1)". The third line is handled in the same way, so the sentences are recognized as follows.
(1) "The customer is responsible for"
(2) "(1)"
(3) "Adequate site and system planning and preparation."
(4) "(2)"
(5) "Receipt, unpacking, and placement of the 5363 System Unit."

【００３４】次に構成（１７）の方式の実施例について
説明する。原文のテキストは行単位に処理される。その
先頭が、数字語あるいはアルファベット一文字であり、
それに閉じ括弧類、ピリオド、コロンまたはセミコロン
のいずれかが付加されている時、それを記号部として認
定する。記号部の認定とは、記号部の直前と直後で文末
認定を行なうことを意味する。「The customer is responsible for １） Adequate site and system planning and preparation. ２） Receipt, unpacking, and placement of the 5363 System Unit.」上例では、第二行目の「１)」の部分が記号部であると
認定され、第一行目の「for」までが文末認定される。
さらには「１)」の直後で文末認定される。また、第三
行目も同様に行なわれるため、結果として、以下のよう
に文が認定される。（１）「The customer is responsible for」（２）「１)」（３）「Adequate site and system planning and preparation.」（４）「２)」（５）「Receipt, unpacking, and placement of the 5363 System Unit.」 Next, an embodiment of the method of configuration (17) is described. The source text is processed line by line. When a line begins with a numeric word or a single alphabetic character followed by a closing bracket, period, colon, or semicolon, that marker is recognized as a symbol part. Recognizing a symbol part means setting a sentence end immediately before and immediately after it.
The customer is responsible for
1) Adequate site and system planning and preparation.
2) Receipt, unpacking, and placement of the 5363 System Unit.
In this example, "1)" on the second line is recognized as a symbol part, so a sentence end is set after "for" on the first line and again immediately after "1)". The third line is handled in the same way, so the sentences are recognized as follows.
(1) "The customer is responsible for"
(2) "1)"
(3) "Adequate site and system planning and preparation."
(4) "2)"
(5) "Receipt, unpacking, and placement of the 5363 System Unit."

【００３５】次に構成（１８）の方式の実施例について
説明する。原文のテキストは行単位に処理される。その
先頭が、記号文字（アスタリスク等）で始まる時、それ
を記号部として認定する。記号部の認定とは、記号部の
直前と直後で文末認定を行なうことを意味する。「The customer is responsible for ＊ Adequate site and system planning and preparation. ＊ Receipt, unpacking, and placement of the 5363 System Unit.」上例では、第二行目の「＊」の部分が記号部であると認
定され、第一行目の「for」までが文末認定される。さ
らには「＊」の直後で文末認定される。また、第三行目
も同様に行なわれるため、結果として、以下のように文
が認定される。（１）「The customer is responsible for」（２）「＊」（３）「Adequate site and system planning and preparation.」（４）「＊」（５）「Receipt, unpacking, and placement of the 5363 System Unit.」 Next, an embodiment of the method of configuration (18) is described. The source text is processed line by line. When a line begins with a symbol character (such as an asterisk), that character is recognized as a symbol part. Recognizing a symbol part means setting a sentence end immediately before and immediately after it.
The customer is responsible for
＊ Adequate site and system planning and preparation.
＊ Receipt, unpacking, and placement of the 5363 System Unit.
In this example, "＊" on the second line is recognized as a symbol part, so a sentence end is set after "for" on the first line and again immediately after "＊". The third line is handled in the same way, so the sentences are recognized as follows.
(1) "The customer is responsible for"
(2) "＊"
(3) "Adequate site and system planning and preparation."
(4) "＊"
(5) "Receipt, unpacking, and placement of the 5363 System Unit."
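
The three kinds of symbol part in configurations (16) to (18) can be sketched with a single regular expression in Python. The pattern, the name `split_symbol_parts`, and the accepted bracket characters are assumptions; the sketch also treats every line as ending a sentence, which is a simplification of the line-by-line processing described here.

```python
import re
from typing import List

SYMBOL_PART = re.compile(
    r"^\s*("
    r"[(\[{<](?:\d+|[A-Za-z])[)\]}>]"   # (16): bracketed number or letter
    r"|(?:\d+|[A-Za-z])[)\]}>.:;]"      # (17): number or letter plus closer
    r"|[*＊]"                            # (18): symbol character
    r")\s+(.*)$"
)


def split_symbol_parts(lines: List[str]) -> List[str]:
    """Set a sentence end before and after each leading symbol part."""
    units: List[str] = []
    for line in lines:
        match = SYMBOL_PART.match(line)
        if match:
            units.extend([match.group(1), match.group(2).strip()])
        else:
            units.append(line.strip())
    return units
```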

【００３６】次に構成（１９）の方式の実施例について
説明する。直前で文認定が行なわれている行で、行中に
規定値以上の複数スペースを持つ場合に行頭からその部
分までを見出し語として分離する。例えばその規定値を
５とする。スペースが５つ以上連続して現れたら、そこ
までの語と、複数スペース以降とは構文的につながりが
ないとして、複数スペースの直前までを文認定する。ま
た、直前で文認定が行なわれているとは、文書の入力始
めも含む。「CAUTION When you select [ABORT], all data is lost, the 〜 acquired is set to 0, and you must select [START] 〜 collection. 〜」上例では、第一行目の「CAUTION」と「When」の間に規
定値以上の５つのスペースがあるため、「When」の部分
が見出し語であると認定され、その直後で文末認定が行
なわれる。図１２（ａ）,（ｂ）にそのフローチャート
を示す。ただし、ここでは例として構成（２０）のフロ
ーも含む。Next, an embodiment of the method of configuration (19) is described. In a line for which sentence recognition has been performed immediately before, if the line contains a run of spaces at or above a specified value, the text from the beginning of the line up to that run is separated as a headword. Suppose the specified value is 5. If five or more consecutive spaces appear, the words before them and the text after them are taken to have no syntactic connection, and the text up to just before the spaces is recognized as a sentence. "Sentence recognition has been performed immediately before" includes the start of document input.
CAUTION     When you select [ABORT], all data is lost, the 〜
            acquired is set to 0, and you must select [START] 〜
            collection. 〜
In this example, there are five spaces, which meets the specified value, between "CAUTION" and "When" on the first line, so the "When" portion is recognized as a headword and a sentence end is set immediately after it. FIGS. 12(a) and 12(b) show the flowchart, which, as an example, also includes the flow of configuration (20).
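
The space-run test of configuration (19) can be sketched in Python as follows. The function name `split_headword` and its return shape are hypothetical; the default threshold of 5 is the example value used above. The sketch returns the text before the run and the text after it.

```python
import re
from typing import Optional, Tuple


def split_headword(line: str, min_spaces: int = 5) -> Optional[Tuple[str, str]]:
    """Split a line at the first run of min_spaces or more spaces that
    sits between two non-space characters; return None if there is none."""
    match = re.search(r"(?<=\S) {%d,}(?=\S)" % min_spaces, line)
    if match is None:
        return None
    return line[:match.start()], line[match.end():]
```

Leading indentation is never taken for a headword gap, because the run must be preceded by a non-space character.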

【００３７】次に構成（２０）の方式の実施例について
説明する。直前で文認定が行なわれている行で、行中に
規定値以上の複数スペースを持つ場合も以下の条件のい
ずれかを満たす場合に、複数スペースの直前までを文認
定する。（１）その行末に文末記号がある場合。（２）その行全体が既に文末認定されている場合。（３）行頭から最後の複数スペースの真下までがスペー
スである場合。（４）複数スペースまでの単語数が１である場合。また、直前で文認定が行なわれているとは、文書の入力
始めも含む。「CAUTION When you select [ABORT], all data is lost, the 〜 acquired is set to 0, and you must select [START] 〜 collection. 〜」上例では、第一行目の「CAUTION」と「When」の間に規
定値以下の２つのスペースしかないが、行頭から最後の
複数スペースの真下までがスペースであるので、「Whe
n」の部分が見出し語であると認定され、その直後で文
末認定が行なわれる。図１２にそのフローチャートを示
す。ただし、ここでは例として構成（１９）のフローも
含む。Next, an embodiment of the method of configuration (20) is described. In a line for which sentence recognition has been performed immediately before, when the line contains a run of spaces at or above the specified value, the text up to just before the run is also recognized as a sentence if any of the following conditions is satisfied.
(1) The line ends with a sentence-end symbol.
(2) The whole line has already been recognized as ending a sentence.
(3) Everything from the beginning of the line to directly below the last run of multiple spaces is blank.
(4) The number of words before the run of multiple spaces is 1.
"Sentence recognition has been performed immediately before" includes the start of document input.
CAUTION  When you select [ABORT], all data is lost, the 〜
         acquired is set to 0, and you must select [START] 〜
         collection. 〜
In this example, there are only two spaces, below the specified value, between "CAUTION" and "When" on the first line, but everything from the beginning of the line to directly below the last run of multiple spaces is blank, so the "When" portion is recognized as a headword and a sentence end is set immediately after it. FIG. 12 shows the flowchart, which, as an example, also includes the flow of configuration (19).
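
The four conditions can be sketched in Python as below. Several points are assumptions rather than statements from the text: condition (3) is read as referring to the following line, any run of two or more spaces is considered, the last such run is the one tested, and the sentence-end symbols are taken to be ".", "?", and "!". The name `headword_boundary` is hypothetical.

```python
import re
from typing import Optional


def headword_boundary(line: str, next_line: str = "",
                      already_recognized: bool = False) -> Optional[int]:
    """Return the index where the headword ends, or None if no split."""
    runs = list(re.finditer(r"(?<=\S) {2,}(?=\S)", line))
    if not runs:
        return None
    last = runs[-1]
    # Condition (1): the line ends with a sentence-end symbol.
    ends_sentence = line.rstrip().endswith((".", "?", "!"))
    # Condition (3): the next line is blank up to just below the run.
    hanging = bool(next_line.strip()) and next_line[:last.end()].strip() == ""
    # Condition (4): exactly one word precedes the run.
    single_word = len(line[:last.start()].split()) == 1
    # Condition (2) is supplied by the caller as already_recognized.
    if ends_sentence or already_recognized or hanging or single_word:
        return last.start()
    return None
```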

【００３８】Next, an embodiment of the method of configuration (21) is described. When a symbol part has been separated, it is excluded from syntactic analysis. Example: "＊ Adequate site and system planning and preparation. ＊ Receipt, unpacking, and placement of the 5363 System Unit." In this example the asterisks are recognized as symbol parts, but parsing them on their own is wasteful because they do not form a syntactic structure. To reduce the load on syntactic analysis, they are therefore not parsed and are simply carried over directly into the Japanese output.

【００３９】FIG. 13 is a block diagram for explaining another embodiment of the machine translation system according to the present invention. In the figure, 21 is a cathode ray tube (CRT), 22 is a keyboard, 23 is an optical character reader (OCR), 24 is an input document, 25 is a spell check unit, 26 is a pre-editing unit, 27 is a translation main unit, 28 is a post-editing unit, 29 is a dictionary, 30 is a set of grammar rules, 31 is an output document, and 32 is a printer. Input text obtained by file input, keyboard input, or OCR input is stored as the input document 24. The spell check unit 25 then checks spelling with reference to the dictionary 29, and preprocessing can be performed with the pre-editing unit 26. The translation main unit 27 translates using the dictionary 29 and the grammar rules 30, and the resulting output text can be edited by the post-editing unit 28 using the translation information. The edited output text is stored in the output document 31. Both the input text and the output text can be printed on the printer 32.

【００４０】FIG. 14 shows the four processes of the translation main unit in FIG. 8. In the figure, 33 is a dictionary, 34 is an input unit, 35 is a morphological analysis unit, 36 is a syntactic analysis unit, 37 is a conversion unit, 38 is a generation unit, 39 is an output unit, and 40 is a set of grammar rules. The four main processes of the translation main unit are as follows: the morphological analysis unit 35 looks up the input text in the dictionary; the syntactic analysis unit 36 takes the information on each word, parses according to the grammar rules 40, and builds a tree structure from the result; the conversion unit 37 transforms the tree structure of the input language into a tree structure of the output language; and the generation unit 38 renders the resulting tree structure into the output language node by node.

【００４１】The present invention belongs to the preprocessing (pre-editing) unit or the morphological analysis unit. The input text is assumed here to be English. A "sentence" in sentence recognition means one unit submitted to translation processing, which differs slightly from a sentence in the everyday sense. A title, a list item, or a heading may not have the form of a sentence, but if it is a unit that should be translated on its own, it is one unit of translation processing and is called a sentence here. When input text is submitted for translation, sentence recognition and dictionary lookup run first. Sentence recognition may use only the surface-form features of the text, or may also perform dictionary lookup to take the morphological features of words into account. Either way, it embodies a sentence recognition algorithm. Depending on this algorithm, the titles, lists, and headings mentioned above, which fall outside the ordinary notion of a sentence, may or may not be recognized as sentences (units of translation processing).

【００４２】Consider, for example, the following simple sentence recognition algorithm, which recognizes each of the patterns below as a sentence (one unit of translation processing).
1. The text ends with a sentence-end symbol (. ? ! ;).
2. The text is followed by two line breaks.
3. The text is within a specified number of words (for example, 8), is followed by a line break, and the next character (ignoring indentation) is not a lowercase letter.
4. The next line begins with an ordinal marker or a symbol used for lists.
5. The text is followed by two or more spaces or by a tab.
6. Exclusion rule: a pattern consisting only of digits and a period (one kind of ordinal marker) is not treated as a sentence end.
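
As an illustration only, the six rules above might be sketched in Python as follows. The function name `matching_rules`, the regular expressions used to detect list markers, and the way the candidate text and the text that follows it are passed in are all assumptions made for this sketch; the description itself notes that the real rules are more complicated.

```python
import re

SENTENCE_END = ".?!;"                                    # rule 1 symbols
LIST_MARKER = re.compile(r"^\s*(\d+[.)]|[*\-•])(\s|$)")  # assumed list markers
ORDINAL_ONLY = re.compile(r"^\s*\d+\.$")                 # rule 6: digits + period
MAX_TITLE_WORDS = 8                                      # "for example, 8"


def matching_rules(candidate: str, following: str) -> list[int]:
    """Return the rule numbers (1-5) under which `candidate` is a sentence,
    given the raw text `following` it. Rule 6 only vetoes rule 1."""
    rules = []
    stripped = candidate.rstrip()
    if stripped and stripped[-1] in SENTENCE_END:
        # Rule 6 (exclusion): a bare ordinal such as "1." is not a sentence end.
        if not ORDINAL_ONLY.match(stripped):
            rules.append(1)
    if following.startswith("\n\n"):
        rules.append(2)
    if following.startswith("\n"):
        next_line = following[1:].split("\n", 1)[0]
        first = next_line.lstrip()[:1]
        if len(stripped.split()) <= MAX_TITLE_WORDS and first and not first.islower():
            rules.append(3)
        if LIST_MARKER.match(next_line):
            rules.append(4)
    if following.startswith("  ") or following.startswith("\t"):
        rules.append(5)
    return rules
```

For instance, `matching_rules("CAUTION", "  When you select")` returns `[5]` (heading rule), while `matching_rules("1.", " Install the unit")` returns `[]` because the exclusion rule suppresses the period. Returning every matching rule, rather than a single yes/no answer, makes it possible to show results from different rule subsets as alternative candidates.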

【００４３】These rules can be grouped as follows.
Ordinary sentence rule: 1
Title rules: 2, 3
List rules: 4, 6
Heading rule: 5
This is only an example; the actual rules are more complicated. Rule 1 in particular is considerably simplified. For example, an abbreviation such as "Mr." ends with a period, but that period does not normally mark the end of a sentence; handling this requires dictionary lookup. Conventional sentence recognition applied only the ordinary sentence rule (1) or, at most, the title rules (2, 3). The list rules (4, 6) and the heading rule (5) therefore represent a new approach to sentence recognition. Depending on which of these rules are applied, several different sentence recognition results can be obtained. The result obtained with one subset of the rules can be displayed as provisional sentence candidates or highlighted; the user can correct it; and the result obtained with other rules can be consulted as an alternative candidate and substituted. In short, sentence recognition is performed on the input text by a sentence recognition means that embodies a sentence recognition algorithm (belonging to the morphological analysis unit or the preprocessing unit), and the result is displayed on the display means. This is shown in FIG. 17.

【００４４】FIG. 15 is a flowchart for explaining sentence recognition. The steps are described in order. First, text is input (step 1) and sentence recognition is performed (step 2). The resulting sentence candidates are displayed (step 3). Alternative candidates can also be displayed (step 4), and some candidates can be highlighted (step 6). The user corrects or selects among what is displayed in steps 3, 4, and 6 (step 5). When the user approves (step 7), sentence recognition ends. FIG. 16 is a block diagram of the sentence recognition unit. In the figure, 51 is a preprocessing/morphological analysis unit, 52 is a sentence recognition means, 53 is a candidate selection means, 54 is a display/editing means, and 55 is the target document.

【００４５】FIG. 18 is another block diagram of the sentence recognition unit. In the figure, 52a is a first sentence recognition means and 52b is a second sentence recognition means; parts that function as in FIG. 16 carry the same reference numerals. Several sentence recognition means can be provided, for example (1) one whose algorithm uses only surface-form features and (2) one that additionally consults dictionary information. Each has its advantage in terms of speed or accuracy: (1) is faster but somewhat less accurate, and (2) is more accurate but somewhat slower. In short, several sentence recognition means that embody a sentence recognition algorithm (belonging to the morphological analysis unit or the preprocessing unit) are provided, one or more of them are used to perform sentence recognition on the input text, and the result is displayed on the display means.

【００４６】Next, the relationship between a list sentence and the sentence written before it to introduce it (the list introduction sentence) is described, in particular translation when the list introduction sentence carries sentence-end candidate information such as ":" (colon), ";" (semicolon), or ")" (closing parenthesis). The colon case is explained here using two examples: the case in which "When:" comes at the end of the sentence and the list introduction sentence is difficult to translate on its own, and the case in which "following:" comes at the end of the sentence and the list introduction sentence can be translated on its own.

【００４７】(1) The list introduction sentence (parent sentence) is syntactically linked to each of the list sentences.

【００４８】[0048]

【表１】 [Table 1]

【００４９】This is a concise, reader-friendly way of writing the following: Direct links are beneficial only when it is not possible 〜. Direct links are beneficial only when two computers transfer 〜. Direct links are beneficial only when two computers are 〜. Besides the example in Table 1, the word before the colon may be, for instance, a be-verb as in "are:", a preposition as in "for:", a verb as in "find:", or a pronoun as in "he:".

【００５０】(2) The list introduction sentence is not syntactically linked to the list sentences, but it indicates (semantically) that the list sentences that follow explain, supplement, or enumerate the content of the list introduction sentence (parent sentence).

【００５１】[0051]

【表２】 [Table 2]

【００５２】Case (2) is taken to apply only when the word "following:" appears, and the colon, as sentence-end candidate information, is given a special role. As the algorithm for choosing the translation, the colon is assigned a prediction code (syntactic role) and the sentence is translated with the ordinary grammar. For the list introduction sentence (parent sentence), either a period is appended to the end of the sentence, or the colon is given multiple prediction codes, namely the prediction code above plus the prediction code of a period (the flowcharts of FIGS. 19 to 22, described later, take this form). The prediction code given to the colon varies with the part of speech of the word immediately before the colon. When that word is part of an idiom, the part of speech of the whole idiom is used, not that of its component.

【００５３】[0053]

【表３】 [Table 3]

【００５４】*1 This prediction code is defined so that it does not form a noun sequence. For the reason, see the "make:" example under *2.
*2 When the word immediately before the colon is a verb or an auxiliary verb and has more than one part of speech, period information is output in addition to the prediction code, and the choice is left to syntactic analysis.
Example "will:": "will" can be an auxiliary verb or a noun. In this case the colon is given the prediction code v together with the period information.
Example "make:": "make" can be a verb or a noun. In this case the colon is given the prediction code N together with the period information. However, if "make" is taken as a noun, it is undesirable for the colon to be taken as a noun as well, so the prediction code N is defined in advance as one that does not form a noun sequence. In the case of "as follows", the colon is given N.
*3 Among nouns, only "following" is treated separately. Although it has more than one part of speech, the colon is treated only as a period.
*4 For an article, a single colon does not correspond to a single prediction, so multiple predictions are given.

【００５５】In addition, as shown in Table 4, translation information such as a target word (or target sentence) is assigned to the colon.

【００５６】[0056]

【表４】 [Table 4]

【００５７】FIGS. 19 and 20 are flowcharts showing the processing performed after a list sentence introduced by a list introduction sentence ending in a colon has been recognized. The steps are described in order.
step 1: Check the surface form of the word immediately before the colon.
step 2: Check whether it is "following".
step 3: If it is "following" in step 2, give the colon a period prediction.
step 4: If it is not "following" in step 2, check whether it is "as follows".
step 5: If it is "as follows" in step 4, give the colon a "noun + period" prediction.
step 6: If it is not "as follows" in step 4, check whether it is "to".
step 7: If it is "to" in step 6, give the colon both a "noun + period" prediction and a "verb + period" prediction.
step 8: If it is not "to" in step 6, check the part of speech of the word immediately before the colon.
step 9: Check whether it is a conjunction.
step 10: If it is a conjunction in step 9, give the colon a "sentence" prediction.
step 11: If it is not a conjunction in step 9, check whether it is a preposition.
step 12: If it is a preposition in step 11, give the colon a "noun + period" prediction.

【００５８】step 13: If it is not a preposition in step 11, check whether it is a verb.
step 14: If it is a verb in step 13, give the colon a "noun + period" prediction. This noun prediction, however, is a noun that does not form a noun sequence.
step 15: If it is not a verb in step 13, check whether it is an auxiliary verb.
step 16: If it is an auxiliary verb in step 15, give the colon a "verb + period" prediction.
step 17: If it is not an auxiliary verb in step 15, check whether it is a noun.
step 18: If it is a noun in step 17, give the colon a "period" prediction.
step 19: If it is not a noun in step 17, check whether it is an article.
step 20: If it is an article in step 19, give the colon a "noun + period" prediction. If it is not an article, processing ends.
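
Steps 1 to 20 amount to a decision cascade: surface-form checks first, then part of speech. The Python sketch below is one possible rendering under stated assumptions; the function name `colon_predictions`, the string labels for prediction codes, and the part-of-speech tag names are all hypothetical and do not appear in the description.

```python
def colon_predictions(surface: str, pos: str) -> list[str]:
    """Return the predictions given to a colon.

    surface: lower-cased word (or idiom) immediately before the colon.
    pos:     part of speech of that word or idiom (hypothetical tag names).
    """
    # Steps 1-7: checks on the surface form take priority.
    if surface == "following":
        return ["period"]
    if surface == "as follows":
        return ["noun+period"]
    if surface == "to":
        return ["noun+period", "verb+period"]
    # Steps 8-20: otherwise decide by part of speech.
    by_pos = {
        "conjunction": ["sentence"],
        "preposition": ["noun+period"],
        # Step 14: the noun here must not form a noun sequence.
        "verb": ["noun(no-sequence)+period"],
        "auxiliary": ["verb+period"],
        "noun": ["period"],
        "article": ["noun+period"],
    }
    # Any other part of speech: the flow ends without a prediction.
    return by_pos.get(pos, [])
```

For example, `colon_predictions("when", "conjunction")` yields `["sentence"]`, so a list sentence following "When:" would be parsed as a clause, while `colon_predictions("following", "noun")` yields only `["period"]`.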

【００５９】FIGS. 21 and 22 are flowcharts showing the translation processing of the list introduction sentence. The prediction code given to the colon is also given to the list sentence as its syntactic role, which improves the translation accuracy of the list sentence. Immediately before the end of the flowchart, the prediction code given to the colon is stored. When the next sentence is a list sentence that is a child of the previous sentence, the syntactic role of this child list sentence is set to the stored prediction code and syntactic analysis is performed. The flowcharts of FIGS. 21 and 22 are the same as those of FIGS. 19 and 20 from step 1 to step 20, but differ in the following points. A step 23 is provided that stores, in a storage unit A, the prediction code given to the colon, excluding the period. Immediately after the start of the flowchart, it is checked whether the sentence is a list sentence (step 21); if it is, the list introduction sentence is translated with the content of storage unit A as the goal (step 22).

【００６０】Next, translation processing is described for the case in which a connective word (a conjunction, a sentence adverb, or the like) such as "and", "then", or "or" comes at the end of a list sentence. That is, the following embodiment explains a method that focuses on words likely to occur at the end of a list sentence and transforms the list sentences so that the subsequent translation processing works well. FIG. 23 shows the flowchart. Here it is assumed that list sentences such as those in Table 5 are given as input.

【００６１】[0061]

【表５】 [Table 5]

【００６２】The sentence-end recognition means of the morphological analysis unit recognizes sentences in this input as shown in Table 6.

【００６３】[0063]

【表６】 [Table 6]

【００６４】In the figure, the number at the left end is the sentence number assigned by the sentence recognition means. Sentence numbers (2), (4), and (6) are separated as symbol parts, and each is processed as one sentence. For each sentence number, a flag indicating whether the sentence is a symbol part is set and stored as shown in Table 7.

【００６５】[0065]

【表７】 [Table 7]

【００６６】First, it is checked whether a sentence to be read exists on the stack of the source text storage unit (step 1). If none exists, processing ends. If one exists, the sentence at the top of the stack is read (step 2), and the flag information table of Table 7 is consulted to check whether it is a symbol-part sentence (step 3). If it is, the sentence is removed from the stack (step 4) and processing proceeds to step A. If it is not, it is checked whether the sentence ends with a period (step 5). If it does, the sentence is removed from the stack (step 4) and processing proceeds to step A. If the sentence does not end with a period, it is checked whether a specific word is present immediately before the sentence end (step 6). The specific words are conjunctions and sentence adverbs such as "and", "then", and "or" that tend to occur at the end of list sentences. This word group can be prepared in advance by the system or set by the user before translation.

【００６７】If the word at the end of the sentence does not belong to the specified word group, the sentence is removed from the stack (step 4) and processing proceeds to step A. If it does, the word is saved in temporary storage and deleted from the source sentence (step 7), the sentence is removed from the stack, and the next sentence is read (step 8). In the case of Table 6, neither sentence (1) nor sentence (2) has a specified word at its end. Sentence (3) ends with "then", so "then" is deleted from sentence (3). The next sentence is then read; if it is a symbol-part sentence, it is removed from the stack and the following sentence is read. If it is not a symbol-part sentence, the saved word "then" is added to the beginning of that sentence with only its first letter capitalized, the original first letter of the sentence is converted to lowercase, and processing proceeds to step B (step 10). As a result, sentences (3) and (5) become as shown in Table 8. Sentence (5) is then passed to step B.
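
The transformation described above might be sketched in Python as follows. This is a simplified reading, not the flowchart of FIG. 23 itself: the function name `move_trailing_connectives`, the representation of each sentence as a `(text, is_symbol)` pair, and the decision to restore a connective when no later sentence exists are assumptions made for this sketch. The sample sentences are invented, since Tables 5 to 9 are not reproduced here.

```python
CONNECTIVES = {"and", "then", "or"}  # word group; may also be user-defined


def move_trailing_connectives(sentences):
    """sentences: list of (text, is_symbol) pairs in document order.
    Returns the transformed texts in the same order."""
    out = []
    carried = None  # (word, index of the sentence it was removed from)
    for text, is_symbol in sentences:
        if is_symbol:
            out.append(text)  # symbol parts pass through unchanged
            continue
        if carried is not None:
            # Prefix the saved word with a capital initial and lower-case
            # the sentence's original first letter.
            text = carried[0].capitalize() + " " + text[:1].lower() + text[1:]
            carried = None
        words = text.split()
        if len(words) > 1 and not text.endswith(".") and words[-1].lower() in CONNECTIVES:
            carried = (words[-1], len(out))
            text = " ".join(words[:-1])
        out.append(text)
    if carried is not None:
        # Assumption: with no following sentence, put the word back.
        out[carried[1]] += " " + carried[0]
    return out
```

With the invented input `[("*", True), ("Turn the power switch off and", False), ("*", True), ("Remove the cover", False)]`, the result is `["*", "Turn the power switch off", "*", "And remove the cover"]`. Because a rewritten sentence is checked again for a trailing connective, chains of list items are handled in a single pass.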

【００６８】[0068]

【表８】 [Table 8]

【００６９】The same processing is repeated in this way, and the original input list sentences are transformed as shown in Table 9 and passed to the subsequent translation processing.

【００７０】[0070]

【表９】 [Table 9]

【００７１】Tables 10 and 11 show the conventional translation result and the translation result of the present invention.

【００７２】[0072]

【表１０】 [Table 10]

【００７３】[0073]

【表１１】 [Table 11]

【００７４】Next, language processing of title sentences and list sentences is described for documents that exist as image data, for example when read with a scanner. Specifically, indent determination using image information is described below. FIG. 24 is the flowchart. First, identification of the line ranges is described. An image of a document formatted as in Table 12 is read (step 1).

【００７５】[0075]

【表１２】 [Table 12]

【００７６】If the image contains bands with no pixels and these bands are arranged in parallel at equal intervals, they are regions without characters (that is, interline gaps). If the blank bands run horizontally, the document is written horizontally; if they run vertically, it is written vertically (horizontal in this example). Once the interline gaps are identified, the regions containing characters are also known, so the position of each line is measured in pixels. The extent of each line can be identified easily by searching for regions with no pixels in the vertical direction for horizontal writing, or in the horizontal direction for vertical writing (step 2). After the line positions are identified, the line spacing is determined (step 3). All interline distances are first measured over the whole document. Because lines should be placed at equal intervals, the length at which the measured distribution is most concentrated is taken as the line spacing. Where two lines are separated by far more than the line spacing, blank lines are assumed to have been inserted between them. The number of blank lines can also be estimated easily by comparison with the line width (step 4). This is shown in Table 13.
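
For a horizontally written page, steps 2 to 4 can be illustrated with a row-wise scan of a binary image. The sketch below is a minimal Python illustration, not the disclosed implementation: the image is assumed to be a list of rows of 0/1 pixels, the function names are hypothetical, and the blank-line formula (one blank line adds one line height plus one line spacing to the gap) is an assumption about how the "comparison with the line width" might be done.

```python
from collections import Counter


def find_text_lines(image):
    """image: list of rows, each a list of 0/1 pixels (1 = ink).
    Returns [(top_row, bottom_row), ...] for each band containing ink."""
    lines, start = [], None
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            lines.append((start, y - 1))
            start = None
    if start is not None:
        lines.append((start, len(image) - 1))
    return lines


def line_spacing_and_blanks(lines):
    """Return (line spacing, estimated blank lines in each gap)."""
    gaps = [nxt[0] - cur[1] - 1 for cur, nxt in zip(lines, lines[1:])]
    if not gaps:
        return 0, []
    # The most frequent gap is taken as the normal line spacing.
    spacing = Counter(gaps).most_common(1)[0][0]
    height = Counter(b - t + 1 for t, b in lines).most_common(1)[0][0]
    # Assumed model: gap = spacing + k * (height + spacing) for k blank lines.
    blanks = [round((g - spacing) / (height + spacing)) for g in gaps]
    return spacing, blanks
```

Taking the most frequent gap rather than the mean keeps occasional large gaps (blank lines, section breaks) from distorting the estimated spacing, which matches the "most concentrated" criterion in the text.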

【００７７】[0077]

【表１３】 [Table 13]

【００７８】Next, calculation of the inter-word distance and the inter-character distance (step 5) is described. After the image is divided into lines, each line is searched for ranges with no pixels in the vertical direction. In English the distance between words is longer than the distance between characters, so the lengths of the blank ranges within a line should fall into two groups. The longer blank length is therefore taken as the inter-word distance and the shorter as the inter-character distance. Because of reading errors, image noise, or faint print, the distribution of blank lengths may not separate cleanly into two groups; in that case a tolerance is set around each length at which the distribution is most concentrated, and blank lengths within the tolerance are judged to be inter-word or inter-character gaps. The tolerance can be set freely. This tallying is carried out for every line. For example, suppose that the image positions shown in Table 14 are obtained.

【００７９】[0079]

【表１４】 [Table 14]

【００８０】From the tally in Table 15, the inter-character distance is determined to be 2 pixels and the inter-word distance 5 pixels. Once the inter-character and inter-word distances are determined, the position and length of each word are known.
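
The tallying of blank lengths can be illustrated with a small histogram. The Python sketch below assumes that the two most frequent gap lengths are the inter-character and inter-word distances, which is a simplification of "the lengths at which the distribution is most concentrated"; the function names and the default tolerance of 1 pixel are hypothetical.

```python
from collections import Counter


def estimate_gaps(gap_lengths):
    """Return (inter-character distance, inter-word distance) in pixels,
    taken as the two most frequent blank lengths."""
    common = Counter(gap_lengths).most_common(2)
    if len(common) < 2:
        raise ValueError("need at least two distinct gap lengths")
    a, b = common[0][0], common[1][0]
    return min(a, b), max(a, b)


def classify_gap(length, char_gap, word_gap, tolerance=1):
    """Label one blank length using a tolerance around each estimate."""
    if abs(length - char_gap) <= tolerance:
        return "char"
    if abs(length - word_gap) <= tolerance:
        return "word"
    return "other"
```

With made-up measurements `[2, 2, 2, 5, 2, 2, 5, 2, 3, 5, 2]`, `estimate_gaps` returns `(2, 5)`, and a slightly noisy gap of 3 pixels is still classified as `"char"`.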

【００８１】[0081]

【表１５】 [Table 15]

【００８２】Next, calculation of the indent (step 6) is described. A line that begins at a position far longer than the inter-word distance can be judged to be indented. As with the inter-character and inter-word distances, indent distances are collected over the whole document and their length distribution is obtained, which makes it possible to distinguish different indent depths (kinds of indent).

【００８３】Next, estimation of the document format (step 7) is described. Once the indents, blank lines, and line lengths are known, the format of the document can be estimated. For example, a line is likely to be a title line if:
・the line is immediately preceded by the start of the document or by a blank line;
・the line length is at most 2/3 of the maximum width of the document;
・the line is not indented.
Linguistically, a title line has the following tendencies:
・it is likely to begin with a number or a single letter of the alphabet;
・each word is likely to be capitalized;
・it is likely to be a noun phrase.
A line that is an item of a list (called a list sentence for convenience) has formal features such as the following:
・a line with the same indent is adjacent immediately before or after it;
・the line length is at most 2/3 of the maximum width of the document.
Linguistically, a list sentence has the following tendencies:
・it is likely to begin with a number or a symbol;
・it is likely to be a noun phrase.
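
The formal features listed above can be turned into a simple classifier. The following Python sketch is illustrative only: the `LineInfo` record, its field names, the label strings, and the order in which the title and list tests are applied are assumptions, and the description presents these features as likelihoods rather than hard rules.

```python
from dataclasses import dataclass


@dataclass
class LineInfo:
    length: int                   # line length (columns or pixels)
    indent: int                   # 0 means not indented
    prev_is_blank_or_start: bool  # preceded by a blank line or document start
    neighbour_indents: tuple      # indents of the lines before/after (None if absent)


def classify_line(line: LineInfo, max_width: int) -> str:
    # "At most 2/3 of the maximum width", kept in integer arithmetic.
    short = line.length * 3 <= max_width * 2
    if short and line.indent == 0 and line.prev_is_blank_or_start:
        return "title"
    if short and line.indent in line.neighbour_indents:
        return "list"
    return "body"
```

For example, `classify_line(LineInfo(20, 0, True, (None, 0)), 60)` gives `"title"`, and `classify_line(LineInfo(30, 4, False, (4, 4)), 60)` gives `"list"`. A fuller version would combine these formal cues with the linguistic tendencies (leading number or symbol, capitalization, noun phrase) before committing to a label.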

【００８４】Next, word recognition (step 8) is described. After the formal features of each line are estimated, words or characters are recognized by matching the image within the line range against standard recognition patterns, using a character recognition device (OCR) or the like. Recognition accuracy can be improved by adding the linguistic features of the line to the matching score. For example, suppose that the recognition range shown in Table 16 has been obtained from the earlier example.

【００８５】[0085]

【表１６】 [Table 16]

【００８６】[0086] This line satisfies the following conditions, which are the formal features of a title line:
・the line is immediately preceded by the document start point or a blank line;
・the line length is 2/3 or less of the maximum width of the document;
・the line has no indent.
This line is therefore likely to be a title line, and so is also likely to have the following linguistic features:
・it is likely to begin with a number or a single letter of the alphabet;
・each word is likely to be capitalized;
・it is likely to be a noun phrase.
Using these features, one candidate can be selected when uppercase and lowercase forms resemble each other, as in This/this or Is/IS/is/iS. Next, application to machine translation is described. In machine translation as well, title sentences, list sentences, and the like cannot be translated correctly unless they receive language processing different from that for ordinary sentences. Therefore, if the OCR result carries not only the characters but also the formal features of each line, an improvement in the translation rate can be expected.
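Adding a linguistic feature to the matching score can be illustrated as follows. This is a sketch under stated assumptions: the candidate scores, the bonus weight of 0.2, and the function name `pick_candidate` are all hypothetical, and a real recognizer would supply its own scores.

```python
def pick_candidate(candidates, is_title_line, bonus=0.2):
    """Choose among case-confusable OCR candidates.

    candidates: list of (text, matching_score) pairs from pattern matching.
    On a title line, a capitalized form receives a bonus (assumed weight),
    reflecting that each word of a title is likely to be capitalized.
    """
    def total(item):
        text, score = item
        capitalized = text[:1].isupper() and text[1:] == text[1:].lower()
        if is_title_line and capitalized:
            score += bonus
        return score

    return max(candidates, key=total)[0]


# Image matching alone slightly prefers "is" (hypothetical scores).
cands = [("is", 0.50), ("Is", 0.48), ("IS", 0.47), ("iS", 0.45)]
print(pick_candidate(cands, is_title_line=True))
print(pick_candidate(cands, is_title_line=False))
```

On a title line the bonus tips the decision toward "Is"; on an ordinary line the raw matching score decides.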

【００８７】[0087] The above has described indent determination using image information. Next, document format estimation using indent depth is described. First, the distribution of indent depths is explained. Indents are detected in a document that uses indents of different depths, as shown in FIG. 25.

【００８８】[0088] First, over the entire document, the distance from the left edge of the line to the start position of each line is tabulated. The unit of distance may be the number of blank characters (the number of columns) or an actual length ([mm], [inch], etc.). In the tabulated result, the distances at which the distribution is concentrated are the indent depths, and there are as many kinds of indent as there are concentration points. For example, suppose tabulation yields the distribution shown in Table 17. The distribution is concentrated at 0, 10, 16, 22, and 28. It can therefore be estimated that, apart from unindented lines, this document uses four kinds of indent:
・no indent: 0 [columns]
・indent depth 1: 10 [columns]
・indent depth 2: 16 [columns]
・indent depth 3: 22 [columns]
・indent depth 4: 28 [columns]
Because sentence numbers are tabulated as well, how far each line is indented can easily be read from Table 17.
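The tabulation of start positions can be sketched as a histogram over leading-space counts. This is a minimal illustration, not the patent's procedure in detail: the threshold `min_lines` that decides when a distance counts as "concentrated" is an assumption, and the sample lines are invented because Table 17 is not reproduced here.

```python
from collections import Counter


def estimate_indent_depths(lines, min_lines=2):
    """Return the start columns at which line starts are concentrated.

    Distance is measured in columns (leading spaces). A column counts as
    an indent depth when at least min_lines lines start there (assumed rule).
    """
    starts = Counter(
        len(line) - len(line.lstrip(" "))
        for line in lines
        if line.strip()  # ignore blank lines
    )
    return sorted(col for col, count in starts.items() if count >= min_lines)


# Hypothetical document: one stray line at column 3 is not a concentration point.
sample = [" " * col + "text" for col in (0, 0, 10, 10, 16, 16, 3)]
depths = estimate_indent_depths(sample)
levels = {col: level for level, col in enumerate(depths)}
print(depths, levels)
```

The `levels` mapping turns each detected column into a depth number (0 for no indent), which is the form later stages would consume.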

【００８９】[0089]

【表１７】 [Table 17]

【００９０】[0090] The result is shown in FIG. 26. In the figure,
・b: indent depth 1: 10 [columns]
・a: indent depth 2: 16 [columns]
・c: indent depth 3: 22 [columns]
・d: indent depth 4: 28 [columns]
Next, document format analysis based on indent depth is described. Indent depth expresses the nested structure of a document. For example, the indentation in FIG. 27(a) expresses the document structure in FIG. 27(b). That is, the following principles hold:
・if the indent depth of a line is the same as that of the immediately preceding line, the line is at the same level as the preceding line;
・if the indent depth of a line is deeper than that of the immediately preceding line, the line is subordinate to the preceding line;
・if the indent depth of a line is shallower than that of the immediately preceding line, the line is unrelated to the preceding line.
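The three principles can be turned into a parent assignment with a stack. This is a sketch under one added assumption: when a line is shallower than its predecessor, it is attached to the nearest earlier line with a smaller indent, which the text does not specify (it only says the line is unrelated to the preceding line). The function name `parent_indices` is hypothetical.

```python
def parent_indices(indents):
    """For each line, return the index of its parent line, or None at top level.

    indents: indent depth (e.g. start column) of each line, in document order.
    """
    parents = []
    stack = []  # indices of open ancestors; their indents strictly increase
    for i, depth in enumerate(indents):
        # Same or shallower depth closes the preceding line(s).
        while stack and indents[stack[-1]] >= depth:
            stack.pop()
        # A deeper line is subordinate to the line left on top of the stack.
        parents.append(stack[-1] if stack else None)
        stack.append(i)
    return parents


print(parent_indices([0, 10, 10, 16, 10, 0]))
# → [None, 0, 0, 2, 0, None]
```

Lines 1 and 2 share a parent (same level), line 3 is subordinate to line 2, and line 4 returns to the level of lines 1 and 2.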

【００９１】[0091] Next, identification of the content of each range is described. After the ranges of the document structure have been estimated from the indentation, the content of each range is estimated. Possible contents include
・title part ・body part ・list part ・headline part
and the like. The morphological features of a title part include:
・it is likely to consist of few words;
・it is likely to begin with a number or a letter of the alphabet;
・its indent depth differs from that of the lines immediately before and after it;
・its indent depth is shallow.
The morphological features of a list part include:
・it is likely to consist of few words;
・it is likely to begin with a number, a letter of the alphabet, or a symbol;
・it has the same indent depth as the line immediately before it, the line immediately after it, or both;
・its indent depth is deep.
The body part, the headline part, and the like likewise each have morphological features. By using the indent depth together with the morphological features of the lines in this way, the content of the document format can be estimated.

【００９２】[0092] Next, the use of document format information in machine translation is described. In machine translation as well, title sentences, list sentences, and the like cannot be translated correctly unless they receive language processing different from that for ordinary sentences. Therefore, if document format information is attached to each translation unit when the document format is analyzed, suitable translation processing can be performed and an improvement in the translation rate can be expected. The present invention has been described for the case where the entire document has already been converted to electronic form. However, because the document format of a line is determined from the morphological content of the line and from the relationship between the line and its adjacent lines, the method described here is easily extended to the case where the document is input one line at a time. To do so, a line is not judged when its input ends; instead, the immediately preceding line is judged at the point when the next line has been input.
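The line-by-line extension, in which the judgment of a line is deferred until the next line arrives, can be sketched as a generator. The classifier passed in is a stand-in: `toy_rule` below is a hypothetical rule used only to make the example runnable, not the patent's decision logic.

```python
def classify_stream(lines, classify):
    """Yield (line, label) pairs, judging each line once its successor is known.

    classify(before, line, after) receives the neighbouring lines;
    before/after are None at the document boundaries.
    """
    before, current, have_current = None, None, False
    for nxt in lines:
        if have_current:
            # The next line has arrived, so the previous line can be judged.
            yield current, classify(before, current, nxt)
            before = current
        current, have_current = nxt, True
    if have_current:
        # End of input: judge the last line with no following line.
        yield current, classify(before, current, None)


def toy_rule(before, line, after):
    """Hypothetical rule: a line between breaks is a title."""
    if not line:
        return "blank"
    if not before and after == "":
        return "title"
    return "body"


result = list(classify_stream(["Overview", "", "Some text.", "More text."], toy_rule))
print(result)
```

Because the generator holds only the current and preceding lines, it works on input that arrives one line at a time.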

【００９３】[0093] The above has described document format estimation using indent depth. Next, machine translation that focuses on the table of contents is described. First, the table of contents is explained. When a document is machine-translated, the table of contents should be translated first. The table of contents here means a set of lines composed of
・chapter numbers ・chapter titles ・page numbers
and the like. Therefore, even if an electronic document has no table of contents, the method described in the present invention can be applied by extracting the chapter title portions from the body text and creating a new table of contents.

【００９４】[0094]

【表１８】 [Table 18]

【００９５】[0095] Next, the use of table-of-contents information is described.
Translation word selection
A chapter title expresses the content of a chapter concisely and is expected to contain many words that occur frequently in the body text (important words) and technical terms. Therefore, by translating the table of contents, which is the set of chapter titles, before the body text and fixing the translation result, the translation load for the body text can be reduced. For example, if the table of contents in Table 18 is machine-translated first and the translation is fixed, translation pairs such as
“Parallel Distributed Processing” ＝ 「並列分散処理」
“Distributed Representations” ＝ 「分散表現」
“PDP” ＝ 「PDP（並列分散処理）」
“Cognitive Science” ＝ 「認知科学」
can be stored. These words are important words and serve as keywords. When they appear in the body text, the stored translation pairs can be used to give the correct translations. Translating technical terms and important words correctly is very effective in making the translated text easier to understand. The portion translated first is not limited to the table of contents; the same effect is obtained with an index and the like.
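Looking up stored translation pairs in a body sentence can be sketched as below. This is an illustration only: matching is done case-insensitively on whole words with longer terms tried first, which is an assumption about how the stored pairs would be applied, and the function name `find_fixed_terms` is hypothetical.

```python
import re

# Translation pairs fixed by translating the table of contents first.
GLOSSARY = {
    "Parallel Distributed Processing": "並列分散処理",
    "Distributed Representations": "分散表現",
    "PDP": "PDP（並列分散処理）",
    "Cognitive Science": "認知科学",
}


def find_fixed_terms(sentence, glossary):
    """Return (source_term, fixed_translation) pairs found in the sentence."""
    if not glossary:
        return []
    by_lower = {src.lower(): (src, tgt) for src, tgt in glossary.items()}
    # Longer terms first so a long term wins over a shorter one at the same spot.
    terms = sorted(by_lower, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [by_lower[m.group(0).lower()] for m in pattern.finditer(sentence)]


hits = find_fixed_terms(
    "Parallel distributed processing (PDP) models use distributed representations.",
    GLOSSARY,
)
print(hits)
```

A translation engine could then lock these spans to their fixed translations before analyzing the rest of the sentence.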

【００９６】[0096] Separation of chapter title portions
A translation unit (character string) in the table of contents should appear in the body text as a chapter title. Unlike ordinary English sentences, chapter titles often have no sentence-end symbol (“.”, “?”, “!”, etc.). In machine translation it is therefore difficult to separate a chapter title from the body text automatically, and translation often failed because a character string in which the chapter title and the body text were joined was recognized as a single translation unit. To separate the chapter titles from the body text, the following procedure is carried out.
(1) The table of contents is specified in the document to be translated.
(2) Character strings corresponding to
・chapter number ・chapter title ・page number
are extracted from the character string specified as the table of contents and stored. Extraction may be done manually by the user, or, because a table of contents is expected to follow a regular pattern in many cases, it may be done automatically according to an extraction pattern such as [line start][set of digits][set of alphanumeric characters][set of digits][line end].
(3) The table of contents is translated and the translation result is fixed. The fixed result is stored in association with the original translation unit.
(4) The body text is translated. If a portion to be translated matches a chapter number and chapter title stored in (2) (and the page number, if it is known), that portion is judged to be a chapter title and is separated from the body text.
(5) The portion separated in (4) is not translated; the translation stored in (3) is used as its translation.
To create a table of contents by extracting the chapter title portions from the body text, a function may also be provided that automatically searches the body text for a specified character string pattern, for example [line start][set of digits][set of alphanumeric characters][line end].
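The extraction pattern [line start][set of digits][set of alphanumeric characters][set of digits][line end] maps naturally onto a regular expression. The sketch below is one possible reading: it allows spaces and dot leaders before the page number, which is an assumption, and the sample lines are invented rather than taken from Table 18.

```python
import re

# [line start][digits][title text][digits][line end], with optional dot leaders.
TOC_LINE = re.compile(
    r"^\s*(?P<number>\d+)\s+(?P<title>.+?)[ .]+(?P<page>\d+)\s*$"
)


def parse_toc_line(line):
    """Return (chapter_number, chapter_title, page) or None if the line does not fit."""
    m = TOC_LINE.match(line)
    if m is None:
        return None
    return m.group("number"), m.group("title"), int(m.group("page"))


print(parse_toc_line("1 Parallel Distributed Processing 3"))
print(parse_toc_line("2 Distributed Representations ..... 77"))
print(parse_toc_line("Contents"))
```

Lines that do not fit the pattern return `None`, so they can be left for manual extraction by the user.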

【００９７】[0097] FIG. 28 is a flowchart for explaining machine translation that focuses on the table of contents. The steps are described in order below.
step 1: First, it is determined whether the table of contents is to be translated.
step 2: If it is, the table of contents is specified.
step 3: Table-of-contents information is extracted and stored.
step 4: The table of contents is translated.
step 5: The translation pairs of the table of contents are stored.
step 6: The body text is read in one translation unit at a time.
step 7: It is checked whether the unit contains a chapter title portion.
step 8: If it does, the chapter title portion is separated.
step 9: The separated portion is given the translation from the table of contents.
step 10: The remaining portion is translated.
step 11: It is checked whether the body text has ended. If it has not, the process returns to step 6; if it has, the process ends.
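Steps 6 through 11 can be sketched as a loop over translation units. This is a simplified illustration: a chapter title is detected only as a prefix of the unit, the `translate` callable is a stand-in for the actual translation engine, and the function name `translate_body` is hypothetical.

```python
def translate_body(units, toc_translations, translate):
    """Translate body units, reusing fixed translations for chapter titles.

    units: translation units read from the body text, in order.
    toc_translations: chapter title string -> translation fixed in steps 4-5.
    translate: stand-in for the translation engine (assumption).
    """
    output = []
    # Try longer titles first so a long title is not cut by a shorter one.
    headings = sorted(toc_translations, key=len, reverse=True)
    for unit in units:                                        # step 6
        heading = next((h for h in headings if unit.startswith(h)), None)  # step 7
        if heading is None:
            output.append(translate(unit))
            continue
        rest = unit[len(heading):].strip()                    # step 8
        output.append(toc_translations[heading])              # step 9
        if rest:
            output.append(translate(rest))                    # step 10
    return output                                             # step 11: input exhausted


body = [
    "Distributed Representations Each concept is a pattern of activity.",
    "They learn.",
]
out = translate_body(body, {"Distributed Representations": "分散表現"}, lambda s: f"<{s}>")
print(out)
```

The first unit is split into the fixed title translation and the remaining sentence, which is the joined-string failure case the procedure is meant to avoid.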

【００９８】[0098] According to the description so far, sentences in the input original text are recognized during morphological analysis using information such as the character attributes and constituent elements of the original text. The sentence recognition means for title sentences recognizes the target text (title) as a title sentence when a sentence has been recognized immediately before it, the target text starts at the beginning of a line, its first word is a symbol, a number, or a word that begins with a capital letter and is not a preposition, article, or conjunction, and the next line begins with something other than a lowercase letter. As a result of this sentence recognition, even if the sentence immediately following the title sentence is actually connected to the title to form a single sentence, the title sentence and the following sentence are each subjected to translation processing (analysis, transfer, generation, and so on) as separate sentences whenever the above conditions are met.
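The title-sentence conditions can be written as a predicate. This is a rough sketch: the word list standing in for "prepositions, articles, and conjunctions" is a small illustrative sample, not a full dictionary lookup, and the function name `is_title_sentence` is hypothetical.

```python
# Illustrative sample only; a real system would consult its dictionary.
FUNCTION_WORDS = {
    "a", "an", "the", "and", "but", "or",
    "in", "on", "of", "to", "for", "with", "at", "by", "from",
}


def is_title_sentence(line, next_line, boundary_before):
    """Apply the title-sentence conditions to one line of text.

    boundary_before: True if a sentence was recognized immediately before.
    """
    words = line.split()
    # The text must follow a sentence boundary and start at the line head.
    if not boundary_before or not words or line[0].isspace():
        return False
    first = words[0]
    if first[0].isalpha():
        # Capitalized word that is not a preposition, article, or conjunction.
        first_ok = first[0].isupper() and first.lower() not in FUNCTION_WORDS
    else:
        first_ok = True  # symbol or number
    # The next line must begin with something other than a lowercase letter.
    next_start = next_line.lstrip()[:1]
    return first_ok and not next_start.islower()


print(is_title_sentence("Installation", "Insert the disk.", True))
print(is_title_sentence("Installation", "of the driver is simple.", True))
```

The second call is rejected because the next line starts in lowercase, suggesting the first line is the beginning of a running sentence.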

【００９９】[0099] In a method that proceeds with translation while cutting the input original text into sentences one at a time, a means of recognizing the boundaries between sentences is required. Recently in particular, methods have also been adopted in which a document captured as an image is character-recognized by OCR and the resulting text data is passed directly to translation processing; in such cases it is important to have a means of recognizing sentence boundaries. However, with the conventional sentence-end recognition method, when a title and the sentence immediately following it together form one sentence (as is often seen in manuals and the like), sentence recognition during morphological analysis causes the title portion and the following sentence to be analyzed and translated as separate sentences. As a result, even when the title portion also serves as the first word of the following sentence, the sentence following the title becomes a non-sentence (a sentence with no subject) and analysis fails. To solve this problem, the embodiments described below may be used.

【０１００】[0100] FIG. 29 shows an example in which the title portion of a manual or the like (for example, the “italic” portion) also serves as the beginning of the next sentence (which begins with “Represents”), and FIG. 30 is a flowchart. When translation processing is executed, the “italic” portion and the sentence beginning with “Represents” are recognized as separate sentences in morphological analysis (step 1). In the next process, syntactic analysis (step 2), the “italic” portion is analyzed as a title sentence, but the sentence beginning with “Represents ...” has no subject and analysis fails. When analysis fails, it is determined whether the immediately preceding sentence is a title sentence (step 4); if it is, the title sentence and the failed sentence are concatenated (step 5) and translation processing is executed again. Concatenation yields the sentence “italic represents characters you type from the keyboard”, and executing translation processing again yields the translation 「イタリック体はキーボードから入力する文字を表します。」 (“Italics represent characters you type from the keyboard.”).
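The retry in FIG. 30 can be sketched as follows. Both `parse_ok` and `translate` are stand-ins for the real analyzer and translation engine, and lowercasing the first letter of the failed sentence when joining is an assumption made so that the joined string matches the example in the text.

```python
def translate_with_title_join(sentences, parse_ok, translate):
    """Translate sentences, joining a failed sentence to a preceding title.

    sentences: list of (text, is_title) pairs from sentence recognition.
    parse_ok: stand-in predicate, True if syntactic analysis succeeds.
    translate: stand-in for the translation engine.
    """
    output = []
    previous = None
    for text, is_title in sentences:
        if not parse_ok(text) and previous is not None and previous[1]:
            # Step 4 found a title just before; step 5 concatenates.
            joined = previous[0] + " " + text[:1].lower() + text[1:]
            output.append(translate(joined))
        else:
            output.append(translate(text))
        previous = (text, is_title)
    return output


sents = [
    ("italic", True),
    ("Represents characters you type from the keyboard.", False),
]
out = translate_with_title_join(
    sents,
    parse_ok=lambda s: not s.startswith("Represents"),  # toy: subjectless sentence fails
    translate=lambda s: f"[{s}]",
)
print(out)
```

The title itself is still translated on its own; only the failed sentence is re-translated with the title supplied as its head.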

【０１０１】[0101] FIG. 31 is a flowchart for explaining another embodiment. As in the example shown in FIG. 29, the “italic” portion and the sentence beginning with “Represents ...” are recognized as separate sentences in morphological analysis (step 2). Likewise, in the next process, syntactic analysis (step 3), analysis of the sentence beginning with “Represents ...” fails. The failed sentence is concatenated with the title portion (step 7) to form the sentence “italic represents characters you type from the keyboard”. Syntactic information in a predetermined pattern is then given to the first word of that sentence (the title portion) and to the word that follows it (the first word of the failed sentence), for example subject noun for the title portion and predicate verb for the first word of the failed sentence (step 8), after which translation processing is performed. If analysis with the given syntactic information fails, the given syntactic information is deleted (step 11) and translation processing is performed again.
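The hint-then-fallback behaviour of FIG. 31 can be sketched as below. The `parse` callable and the hint labels `"subject-noun"` and `"predicate-verb"` are placeholders for whatever representation the analyzer actually uses; they are assumptions for illustration.

```python
def parse_joined(title, sentence, parse):
    """Parse title + sentence, first with syntactic hints, then without.

    parse(text, hints) is a stand-in analyzer returning None on failure.
    hints maps word positions in the joined text to assumed role labels.
    """
    joined = title + " " + sentence[:1].lower() + sentence[1:]   # step 7
    hints = {
        0: "subject-noun",                      # title portion
        len(title.split()): "predicate-verb",   # first word of failed sentence
    }                                           # step 8
    result = parse(joined, hints)
    if result is None:
        result = parse(joined, {})              # step 11: discard hints, retry
    return result


def accepting(text, hints):
    """Toy analyzer that always succeeds and echoes its inputs."""
    return text, dict(hints)


def rejects_hints(text, hints):
    """Toy analyzer that fails whenever hints are supplied."""
    return None if hints else (text, {})


sentence = "Represents characters you type from the keyboard."
with_hints = parse_joined("italic", sentence, accepting)
fallback = parse_joined("italic", sentence, rejects_hints)
print(with_hints)
print(fallback)
```

The fallback path corresponds to the case where the predetermined pattern does not fit the actual sentence.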

【０１０２】[0102] When the example sentence in FIG. 29 is translated by the method described above, the sentence beginning with “Represents ...” is analyzed after being concatenated with the title “italic”, so the translation result shown in FIG. 32(a) is obtained. The translation of the title portion (・italic), 「イタリック体」, and the translation of the head of the immediately following sentence (n in FIG. 32(a)) are the same 「イタリック体」, so the same word appears twice in succession. Therefore, as shown in FIG. 32(b), the translated word at the head of the sentence and the particle accompanying it are omitted. Alternatively, when the translation of the title sentence and the translation of the head of the immediately following sentence are the same word in succession, the translation at the head of the following sentence is replaced with a demonstrative pronoun (p in FIG. 32(c)).
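The two post-editing options, omission and pronoun replacement, can be sketched as simple string operations on the generated Japanese. The particle list and the pronoun 「これ」 are assumptions for illustration; FIG. 32 is not reproduced here, so the actual pronoun used there is not known.

```python
PARTICLES = ("は", "が")  # assumed set of particles that may follow the subject


def drop_repeated_head(title_translation, sentence_translation, pronoun=None):
    """Remove or replace a sentence head that repeats the title translation.

    With pronoun=None the head word and its particle are omitted;
    otherwise the head word is replaced with the given pronoun.
    """
    if not sentence_translation.startswith(title_translation):
        return sentence_translation
    rest = sentence_translation[len(title_translation):]
    particle = next((p for p in PARTICLES if rest.startswith(p)), None)
    if particle is None:
        return sentence_translation  # not a clean subject match; leave as is
    if pronoun is None:
        return rest[len(particle):]
    return pronoun + rest


title = "イタリック体"
sentence = "イタリック体はキーボードから入力する文字を表します。"
print(drop_repeated_head(title, sentence))
print(drop_repeated_head(title, sentence, pronoun="これ"))
```

Requiring a particle after the matched head avoids cutting a longer word that merely begins with the title translation.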

【０１０３】[0103]

【効果】As is apparent from the above description, the present invention has the following effects.
(1) In claim 1, the original text is divided into a plurality of parts and each part is translated, which achieves faster and more accurate translation and allows a clearly structured system to be built.
(2) In claim 2, this information can be used during translation processing, which enables highly accurate translation.
(3) In claim 3, using this information at generation time yields correct translated words without being affected by the division processing.
(4) In claim 4, the dividing means is applied to the processing of list sentences, which makes translation of list sentences easier.
(5) In configuration 5, holding alternative translations for each translation unit provides more, and more detailed, alternative-translation information than alternative translations of a single sentence, and translation accuracy improves when the alternative translations are taken into account.
(6) In configuration 6, the user can cancel the division processing when it is not wanted.
(7) In configuration 7, the user can manipulate the division processing units, so the division processing the user wants can be performed.
(8) In claims 5, 6, and 7, a plurality of sentence-end recognition means are provided, so sentence ends are recognized flexibly and a sentence recognition means suited to the purpose and use can be selected and used.
(9) In configuration 11, an empty sentence is recognized as a sentence end, which eliminates sentence-end recognition failures when there is no sentence-end symbol. Failures of sentence-end recognition can also be corrected easily by hand.
(10) In configurations 12 and 13, titles can be recognized, so title sentences, which are notably written without a sentence-end symbol, can be recognized.
(11) In configuration 14, list sentences can be recognized, so list sentences, which are notably written without a sentence-end symbol, can be recognized.
(12) In configurations 15, 16, 17, and 18, a symbol recognition means is provided, so headwords, which are notably written without a sentence-end symbol, can be recognized.
(13) In configurations 19, 20, and 21, headwords are separated, so headwords, which are notably written without a sentence-end symbol, can be recognized.
(14) In configuration 21, the symbol part is removed from the elements subject to syntactic analysis, which reduces the load of syntactic analysis and prevents erroneous analysis.
(15) Effect corresponding to configuration 22: the sentence recognition result for the first language can be checked once.
(16) Effect corresponding to configuration 23: the sentence recognition result for the first language can be checked once, and in addition specific portions are highlighted, which makes checking easier.
(17) Effect corresponding to configuration 24: the sentence recognition result for the first language can be checked once, and in addition the user can select and correct it.
(18) In claim 8, the syntactic role that gives the sentence-end candidate information is changed according to the part of speech of the list introduction sentence, so a translated word for the sentence-end candidate information can be given according to that part of speech.
(19) In claim 9, list sentence-end connecting words that tend to occur at the end of a list sentence are identified, and the connecting word is added to the head of the next sentence that is not a symbol part, so the translation of the list sentence becomes natural.
(20) In claim 10, indent determination is performed using image information, which facilitates language processing of title sentences, list sentences, and the like.
(21) In claim 11, it is determined whether the sentence immediately before a sentence that failed analysis is a title; if it is, it is concatenated with the sentence immediately after it and the two are analyzed as one sentence, so a translation result for a correct sentence can be obtained.
(22) In claim 12, syntactic information in a fixed pattern, such as part of speech or subject and predicate, is given to the title and to the first word of the sentence immediately after it, so analysis and translation processing are performed more accurately and translation accuracy can be increased.
(23) In configuration 30, when analysis using the given syntactic information fails, the given syntactic information may not apply, so the information is ignored and the processing of claim 1 is performed again, whereby a translation result for a correct sentence can be obtained.
(24) In configuration 31, the translated word of the portion that also serves as the title is deleted from the translated sentence, which avoids a sentence-head translation identical to the title translation appearing in succession and makes the translated text easier to read.
(25) In configuration 32, the translated word to be deleted that is mentioned in (31) above is replaced with a pronoun, which makes the translated text easier to understand.

[Brief description of drawings]

【図１】FIG. 1 is a configuration diagram for explaining an embodiment of a machine translation system according to the present invention.

【図２】FIG. 2 is a configuration diagram of the translation unit.

【図３】FIG. 3 is a flowchart for explaining the operation of the morphological analysis unit.

【図４】FIG. 4 is a diagram showing pattern matching rules.

【図５】FIG. 5 is a flowchart for explaining pattern matching processing.

【図６】FIG. 6 is a flowchart for explaining the operation of the edit control unit.

【図７】FIG. 7 is a diagram showing an alternative-translation display screen.

【図８】FIG. 8 is a configuration diagram for explaining an embodiment of sentence-end recognition in the machine translation system according to the present invention.

【図９】FIG. 9 is an overall configuration diagram for explaining the machine translation system.

【図１０】FIG. 10 is a flowchart for explaining configurations (5) and (6).

【図１１】FIG. 11 is a flowchart for explaining configurations (7) and (8).

【図１２】FIG. 12 is a flowchart for explaining the headword recognition means.

【図１３】FIG. 13 is a configuration diagram for explaining another embodiment of the machine translation system according to the present invention.

【図１４】FIG. 14 is a configuration diagram for explaining the processing of the translation main body unit in FIG. 6.

【図１５】FIG. 15 is a flowchart for explaining sentence recognition.

【図１６】FIG. 16 is a configuration diagram of the sentence recognition unit.

【図１７】FIG. 17 is a diagram showing the results of applying 'rules 1, 2, 3' and 'rules 1 to 6' to an example text.

【図１８】FIG. 18 is another configuration diagram of the sentence recognition unit.

【図１９】FIG. 19 is a flowchart showing processing (1) after a list sentence written with a list introduction part accompanied by a colon has been recognized.

【図２０】FIG. 20 is a flowchart showing processing (1) after a list sentence written with a list introduction part accompanied by a colon has been recognized.

【図２１】FIG. 21 is a flowchart showing translation processing of a list introduction sentence.

【図２２】FIG. 22 is a flowchart showing translation processing of a list introduction sentence.

【図２３】FIG. 23 is a flowchart for translation processing that focuses on list sentence-end connecting words.

【図２４】FIG. 24 is a flowchart for performing indent determination using image information.

【図２５】FIG. 25 is a diagram showing an example of a document that uses several kinds of indent.

【図２６】FIG. 26 is a diagram showing the result of indent detection.

【図２７】FIG. 27 is a diagram showing an indent example and the document structure it expresses.

【図２８】FIG. 28 is a flowchart for explaining machine translation that focuses on the table of contents.

【図２９】FIG. 29 is a diagram showing an example in which the title portion of a manual or the like also serves as the beginning of the next sentence.

【図３０】FIG. 30 is a diagram showing a flowchart for concatenating a title sentence and the sentence immediately after it to form one sentence.

【図３１】FIG. 31 is a diagram showing another flowchart for concatenating a title sentence and the sentence immediately after it to form one sentence.

【図３２】FIG. 32 is a diagram showing translation results.

[Explanation of symbols]

1 ... input unit, 2 ... original sentence storage unit, 3 ... translation unit, 4 ... edit control unit, 5 ... translation dictionary unit, 6 ... translated sentence storage unit, 7 ... display control unit, 8 ... display unit, 9 ... printing unit, 10 ... dictionary editing unit.

Continuation of front page

(72) Inventor: Masumi Narita, c/o Ricoh Co., Ltd., 1-3-6 Nakamagome, Ota-ku, Tokyo
(72) Inventor: Miwa Shiraishi, c/o Ricoh Co., Ltd., 1-3-6 Nakamagome, Ota-ku, Tokyo
(72) Inventor: Yoshihisa Oguro, c/o Ricoh Co., Ltd., 1-3-6 Nakamagome, Ota-ku, Tokyo
(72) Inventor: Norikazu Ito, c/o Ricoh Co., Ltd., 1-3-6 Nakamagome, Ota-ku, Tokyo
(72) Inventor: Sakiko Honma, c/o Ricoh Co., Ltd., 1-3-6 Nakamagome, Ota-ku, Tokyo

Claims

[Claims]

1. A machine translation system that analyzes an input original text in a source language and, as a result, outputs a translated sentence in a target language, the system comprising a dividing unit that divides the input data forming the original text into sentence units and further divides each sentence into translation units, wherein each part of the translation unit divided by the dividing unit is translated separately, and the translated parts are synthesized at the time of translation generation to obtain an overall translation result.

2. The machine translation system according to claim 1, further comprising an analysis unit that assigns relationships between the divided units at the time of the division processing and performs analysis based on this division-time information.

3. The machine translation system according to claim 1, further comprising control means for controlling the translation order and its use, by means of the division-time information, so that an appropriate translated word is obtained.

4. The machine translation system according to claim 1, wherein the dividing unit is used to process a list sentence.

5. A machine translation system comprising an input device for inputting an original sentence in a first language to a processing device, a display device for displaying information input by the input device, a processing device for processing the information, and an output device for outputting in a second language, wherein a plurality of sentence recognizing means for the text of the original sentence are provided.

6. The machine translation system according to claim 5, wherein, when the original text has no sentence terminator, the sentence recognizing means performs sentence recognition based on character attributes of the original text and on vertical and horizontal arrangement information of the constituent elements of the original text.

7. The machine translation system according to claim 6, wherein the sentence recognizing means comprises empty sentence recognizing means, title sentence recognizing means, list sentence recognizing means, symbol part recognizing means, and headword recognizing means.

8. A machine translation system comprising an input device for inputting an original sentence in a first language to a processing device, a display device for displaying information input by the input device, a processing device for processing the information, an output device for outputting in a second language, and a plurality of sentence recognizing means for recognizing sentences of the original text, wherein the list sentence recognizing means among the sentence recognizing means takes the serial word of the sentence end candidate information in a list introduction sentence, changes the syntactic role given to the sentence end candidate information according to the part of speech, and gives a translated word for the sentence end candidate information according to the part of speech.

9. The machine translation system according to claim 8, wherein a list sentence end connecting word that is likely to occur at the end of a list sentence recognized by the list sentence recognizing means is specified, and the list sentence end connecting word is attached to the beginning of the next sentence that is not a symbol part.

10. A machine translation system comprising an input device for inputting an original sentence in a first language, a display device for displaying information input by the input device, a processing device for processing the information, and an output device for outputting in a second language, the system further comprising: line range specifying means for determining whether a document is a horizontal document or a vertical document and measuring the line-to-line distance; calculating means for determining the position and length of each word from the word-to-word distance and the character-to-character distance; indent calculating means for obtaining an indent distance from the length of the word-to-word distance and determining the indent type from the distribution of the indent distances; document format estimating means for recognizing linguistic features such as title lines or list sentences based on the indent type determined by the indent calculating means; and character recognition means for recognizing words or characters based on the document format estimating means.

11. A machine translation system comprising an input unit for inputting an original sentence in a first language, a translation dictionary unit holding translated words in a second language and semantic information corresponding to words in the first language, a morphological analysis unit for morphologically analyzing the original sentence input by the input unit, a syntax analysis unit for parsing the original sentence input by the input unit, a conversion/generation unit for converting into and generating the second language, and an output unit for outputting the translated sentence converted and generated by the conversion/generation unit, wherein, when the sentence immediately preceding a sentence that has failed analysis is a title sentence that also serves as the beginning of that sentence, the analysis-failed sentence and the immediately preceding title sentence are concatenated and translated as one sentence.

12. The machine translation system according to claim 11, wherein, when the sentence immediately preceding the sentence that failed analysis is a title sentence that also serves as the beginning of that sentence, syntax information such as the part of speech, subject, and predicate is given to the first word of the analysis-failed sentence and to the immediately preceding title portion, and the translation processing is then performed.
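
For illustration only, the fallback described in claims 11 and 12 (retrying a sentence that failed analysis after joining it to the title sentence immediately before it) might be sketched in Python as below. This is a minimal sketch under stated assumptions, not the patented implementation: the `Sentence` tuple, `translate_with_title_fallback`, and the stand-in `toy_translate` (which simply "fails" on text starting with a lowercase letter) are all hypothetical names introduced here, and the example text is invented.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical representation: (text, was_recognized_as_title).
Sentence = Tuple[str, bool]
# Hypothetical translator interface: returns None when analysis fails.
Translator = Callable[[str], Optional[str]]


def translate_with_title_fallback(
    sentences: List[Sentence], translate: Translator
) -> List[Optional[str]]:
    """Translate each sentence; on failure, retry joined to a preceding title."""
    results: List[Optional[str]] = []
    for i, (text, _is_title) in enumerate(sentences):
        out = translate(text)
        prev_is_title = i > 0 and sentences[i - 1][1]
        if out is None and prev_is_title:
            # Analysis failed and the previous sentence is a title:
            # treat title + sentence as one sentence and translate again.
            joined = f"{sentences[i - 1][0]} {text}"
            out = translate(joined)
        results.append(out)
    return results


def toy_translate(text: str) -> Optional[str]:
    """Stand-in for a real translator (assumption): text that starts with a
    lowercase letter is treated as lacking a subject, so analysis 'fails'."""
    if text[:1].islower():
        return None
    return f"<{text}>"


doc = [("The PRINT key", True), ("prints the current screen.", False)]
print(translate_with_title_fallback(doc, toy_translate))
# ['<The PRINT key>', '<The PRINT key prints the current screen.>']
```

In this sketch the title is still translated on its own as well; whether the title's separate translation is kept or replaced by the joined result is a design choice the claims do not settle.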