JPH02103662A

JPH02103662A - Sentence dividing system

Info

Publication number: JPH02103662A
Application number: JP63256635A
Authority: JP
Inventors: Norikazu Ito; 則和伊藤
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1988-10-12
Filing date: 1988-10-12
Publication date: 1990-04-16

Abstract

PURPOSE: To improve the speed, efficiency, and accuracy of sentence structure analysis without losing the information held by the original text, by mechanically dividing the text to be processed. CONSTITUTION: A translating device having a dictionary consulting system is provided with a CRT 1, keyboard 2, OCR 3, input document 4, spell checking section 5, pre-editing section 6, translating main body section 7, post-editing section 8, dictionary 9, grammatical rules 10, output document 11, and printer 12. When a mark and words held in a punctuation mark table, a coordinate conjunction table, and a personal pronoun table, respectively, appear consecutively in an input text, it is estimated that a break exists between the punctuation mark and the coordinate conjunction. The punctuation mark is rewritten to a sentence-ending mark, and the first letter of the coordinate conjunction is capitalized on the assumption that the conjunction is the leading word of the divided sentence. Since a given text is divided in accordance with the morphological features of the sentence in this way, accurate, quick, and efficient analysis can be performed.

Description

[Detailed Description of the Invention]

Technical Field

The present invention relates to a sentence division system, and more specifically to the morphological analysis unit and the syntactic analysis unit of a language understanding device or a machine translation device.

Prior Art

In recent years, natural language analysis, as used in machine translation systems and the like, has made remarkable progress. Even so, its accuracy and speed are not yet satisfactory. One reason is that the sentences to be analyzed are too long.

In general, the unit of syntactic analysis is one sentence, and it is desirable for the range analyzed to be short. The shorter the range, the fewer times analysis rules are applied and the fewer rule combinations arise, so the analysis is easier and the ambiguity generated during analysis is kept smaller.

However, real texts do not consist only of short sentences; on the contrary, most consist of long ones. If one sentence is suitably divided into several sentences, each becomes short and the analysis range is limited. The number of rule applications and combinations drops markedly, wasted rule applications and combinations decrease, analysis efficiency and speed improve, and ambiguity is resolved, so analysis accuracy also improves.

As a sentence division system of this kind, Japanese Unexamined Patent Application Publication No. 62-163176 proposes a sentence editing device that extracts predetermined character string patterns such as "and", "or", and "however" from the original text and divides the original text once the operator has indicated the usage of each. Japanese Unexamined Patent Application Publication No. 63-10267 discloses a machine translation device that divides the original text at locations designated by the operator's instructions, and that joins multiple sentences into a single sentence.

However, these conventional devices all depend on instructions from an operator and do not process the text automatically, so they have drawbacks in the accuracy and speed of the instructions. As described above, dividing a long sentence for analysis requires the operator's instructions, and at present analysis by machine alone, without human intervention, is still considerably less accurate.

Object

The present invention was made in view of the circumstances described above. Its object is to enable automatic division by dividing a given text on the basis of its morphological features, so that more accurate, faster, and more efficient analysis can be performed.

Constitution

To achieve the above object, the present invention is characterized in that the morphological analysis unit of a natural language analysis system such as machine translation comprises, for dividing input language text, a delimiter table holding delimiters such as commas and dashes, a coordinating conjunction table holding coordinating conjunctions, and a personal pronoun table holding the nominative forms of personal pronouns. When one entry from each table appears consecutively in the input text, the system estimates that there is a break in content between the delimiter and the coordinating conjunction, rewrites the delimiter to a sentence-ending mark, and capitalizes the first letter of the coordinating conjunction, which is taken to be the first word of the divided sentence. The invention is described below on the basis of an embodiment.
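The three tables and the consecutive-appearance test can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the table contents are guesses (FIGS. 4 to 6 are not reproduced in the text, which attests only the comma, the dash, "and", and "I"), and the names `DELIMITER_TABLE`, `CONJUNCTION_TABLE`, `PRONOUN_TABLE`, and `is_sentence_break` are hypothetical.

```python
# Assumed table contents; the real entries are those of FIGS. 4 to 6.
DELIMITER_TABLE = {",", "\u2014"}                  # comma and dash
CONJUNCTION_TABLE = {"and", "or", "but"}           # coordinating conjunctions
PRONOUN_TABLE = {"i", "he", "she", "we", "they"}   # nominative-only forms, lowercased


def is_sentence_break(tokens, i):
    """Return True if tokens[i], tokens[i + 1], tokens[i + 2] are, in order,
    a delimiter, a coordinating conjunction, and a nominative personal pronoun."""
    if i + 2 >= len(tokens):
        return False
    return (
        tokens[i] in DELIMITER_TABLE
        and tokens[i + 1].lower() in CONJUNCTION_TABLE
        and tokens[i + 2].lower() in PRONOUN_TABLE
    )
```

Keeping each table as a set makes every lookup a constant-time membership test, so the check needs no grammatical analysis at all, which matches the purely morphological approach described above.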

FIG. 1 is a block diagram showing one embodiment of a translation device equipped with the dictionary lookup system according to the present invention. In the figure, 1 is a CRT, 2 is a keyboard, 3 is an OCR, 4 is an input document, 5 is a spell check unit, 6 is a pre-editing unit, 7 is a translation main unit, 8 is a post-editing unit, 9 is a dictionary, 10 is grammar rules, 11 is an output document, and 12 is a printer. An input sentence obtained by file input, keyboard input, or OCR input is preprocessed by spell checking and pre-editing. The output sentence obtained by the translation unit is edited in post-editing using the translation information, and the input and output sentences are printed on the printer.

FIG. 2 shows the flow of the translation main unit. The translation main unit (translation unit) 7 consists broadly of four processes: morphological analysis, syntactic analysis, transfer, and generation. First, the morphological analysis unit looks up the input text in the dictionary. Using the information obtained for each word, the syntactic analysis unit parses the text according to the grammar rules and builds a tree structure from the result. The transfer unit transforms the tree structure of the input language into a tree structure of the output language. The generation unit translates the resulting tree structure node by node.

The present invention belongs to the morphological analysis unit described above.

Here the input text is assumed to be English. The morphological analysis unit applies the division process shown in FIG. 3 to the input text.

FIGS. 4 to 6 show examples of the tables used in the division process of FIG. 3: FIG. 4 is the delimiter table, FIG. 5 is the coordinating conjunction table, and FIG. 6 is the nominative personal pronoun table.

FIG. 3 shows the flow of the division process. The input text is examined one word at a time from beginning to end. If the word is not a delimiter, or the word after a delimiter is not a coordinating conjunction, or the word after that is not a nominative personal pronoun, the pointer is simply advanced by one word and the examination continues. If the word is a delimiter, the pointer is advanced by one word; if the next word is a coordinating conjunction, it is advanced by one more word; and if the next word is a nominative personal pronoun, that is, if a delimiter, a coordinating conjunction, and a nominative personal pronoun appear consecutively, it is judged that a break in the text exists. The text is then divided between the delimiter and the coordinating conjunction. At the same time the delimiter is changed to a sentence-ending mark, which is a period if the text is English. Because the coordinating conjunction becomes the first word of the next sentence, its first letter is capitalized.
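The flow of FIG. 3 might be sketched in Python as below. This is an illustrative reading of the prose description, not the flowchart itself: the tokenizer, the table contents, and the names `TOKEN_RE` and `divide_text` are all assumptions introduced for the example.

```python
import re

# Assumed table contents; the actual tables are those of FIGS. 4 to 6.
DELIMITERS = {",", "\u2014"}                            # comma and dash
CONJUNCTIONS = {"and", "or", "but"}
NOMINATIVE_PRONOUNS = {"i", "he", "she", "we", "they"}

# A word (optionally with apostrophe parts, e.g. "don't") or one punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def divide_text(text):
    """Divide text wherever a delimiter, a coordinating conjunction, and a
    nominative personal pronoun appear consecutively."""
    tokens = list(TOKEN_RE.finditer(text))
    breaks = []       # (delimiter token, conjunction token) for each break found
    pointer = 0
    while pointer < len(tokens):
        window = [t.group() for t in tokens[pointer:pointer + 3]]
        if (len(window) == 3
                and window[0] in DELIMITERS
                and window[1].lower() in CONJUNCTIONS
                and window[2].lower() in NOMINATIVE_PRONOUNS):
            breaks.append((tokens[pointer], tokens[pointer + 1]))
        pointer += 1  # advance one word and keep examining

    sentences, start = [], 0
    for delimiter, conjunction in breaks:
        # The delimiter is rewritten to a period, the English sentence-ending mark.
        sentences.append(text[start:delimiter.start()].rstrip() + ".")
        start = conjunction.start()
    sentences.append(text[start:])
    # Every sentence after a break begins with the conjunction: capitalize it.
    return sentences[:1] + [s[0].upper() + s[1:] for s in sentences[1:]]
```

Working on character offsets rather than rejoining tokens leaves all other spacing and punctuation of the input untouched, so only the delimiter and the first letter of the conjunction change.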

An example is described below.

Suppose the following text (A) is given.

(A) The arrival of exports from the Middle East will be a shock, and I don't see any organized way to absorb it.

The examination proceeds through the following steps (1) to (8).

(1) Starting from the beginning of the text, check word by word whether each word is a delimiter.
From "The" up to "shock" there is no delimiter, so this step is repeated one word at a time.

(2) If the word is a delimiter, look at the next word.
The "," after "shock" is a delimiter, so move on to (3).

(3) Check whether the next word is a coordinating conjunction.
Check "and".

(4) If it is a coordinating conjunction, look at the next word.
"and" is a coordinating conjunction, so move on to (5).

(5) Check whether the next word is a nominative personal pronoun.
Check the next word, "I".

(6) If it is a nominative personal pronoun, perform the division process.
"I" is a nominative personal pronoun, so move on to (7).

(7) Change the delimiter to a sentence-ending mark, in this case a period.
Change "," to ".". In other words, divide the sentence between "," and the next word, "and".

(8) Capitalize the first letter of the coordinating conjunction.
As a result of the division, "and" is now at the beginning of a sentence, so its first letter "a" is capitalized to give "And".

As a result of the above division process, sentence (A) is divided into sentence (B) and sentence (C) as follows.

(B) The arrival of exports from the Middle East will be a shock.

(C) And I don't see any organized way to absorb it.
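The division of (A) into (B) and (C) can be reproduced with a short Python sketch. Note that it swaps in a single regular-expression substitution in place of the word-by-word pointer scan of FIG. 3, purely for brevity. The alternatives listed in the pattern stand in for the three tables and are assumptions, as are the names `BREAK_RE` and `divide_with_regex`; the sketch also assumes the input contains no line breaks.

```python
import re

# Assumed table entries written as regex alternatives (not taken from FIGS. 4 to 6).
BREAK_RE = re.compile(
    r"\s*[,\u2014]\s*"                   # delimiter: comma or dash
    r"(and|or|but)"                      # coordinating conjunction
    r"(?=\s+(?:I|he|she|we|they)\b)"     # must be followed by a nominative pronoun
)


def divide_with_regex(text):
    """Rewrite the delimiter to a period and capitalize the conjunction."""
    marked = BREAK_RE.sub(lambda m: ".\n" + m.group(1).capitalize(), text)
    return marked.split("\n")


TEXT_A = ("The arrival of exports from the Middle East will be a shock, "
          "and I don't see any organized way to absorb it.")

for sentence in divide_with_regex(TEXT_A):
    print(sentence)
# The arrival of exports from the Middle East will be a shock.
# And I don't see any organized way to absorb it.
```

The lookahead keeps the pronoun out of the replaced span, so only the delimiter and the conjunction are rewritten, as in steps (7) and (8).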

Because this system uses only morphological features, the nominative personal pronoun table holds only words that are used exclusively in the nominative case. In English, for example, the second-person pronoun "you" is not included in the table, because its form alone does not show whether it is nominative or objective.
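One way to picture how such a table could be built is to list each pronoun's nominative and objective forms and keep only those whose spellings differ. The Python sketch below is hypothetical: the text does not say how the table of FIG. 6 was compiled, the names `PRONOUN_FORMS` and `nominative_only` are invented for the example, and excluding "it" is an inference from the same reasoning the text gives for "you".

```python
# (nominative, objective) forms of the English personal pronouns.
PRONOUN_FORMS = [
    ("I", "me"), ("you", "you"), ("he", "him"), ("she", "her"),
    ("it", "it"), ("we", "us"), ("they", "them"),
]


def nominative_only(forms):
    """Keep the nominative forms whose spelling differs from the objective form,
    so that the form alone identifies the word as a subject."""
    return {nominative for nominative, objective in forms if nominative != objective}


PRONOUN_TABLE = nominative_only(PRONOUN_FORMS)
```

Under this filter a sequence such as ", and you" never triggers a division, which trades some missed breaks for fewer wrong ones.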

Effect

As is clear from the above description, according to the present invention, mechanically dividing the text to be processed improves the speed, efficiency, and accuracy of syntactic analysis without losing the information held by the original text.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing one embodiment of a translation device equipped with the dictionary lookup system according to the present invention. FIG. 2 shows the flow of the translation main unit. FIG. 3 shows the flow of the division process. FIGS. 4 to 6 show examples of the tables.

1: CRT, 2: keyboard, 3: OCR, 4: input document, 5: spell check unit, 6: pre-editing unit, 7: translation main unit, 8: post-editing unit, 9: dictionary, 10: grammar rules, 11: output document, 12: printer.

Claims

[Claims]

1. A sentence division system characterized in that the morphological analysis unit of a natural language analysis system such as machine translation comprises, for dividing input language text, a delimiter table holding delimiters such as commas and dashes, a coordinating conjunction table holding coordinating conjunctions, and a personal pronoun table holding the nominative forms of personal pronouns; and in that, when one entry from each table appears consecutively in the input text, the system estimates that there is a break in content between the delimiter and the coordinating conjunction, rewrites the delimiter to a sentence-ending mark, and capitalizes the first letter of the coordinating conjunction, which is taken to be the first word of the divided sentence.