JPH0619962A

JPH0619962A - Text dividing device

Info

Publication number: JPH0619962A
Application number: JP4177950A
Authority: JP
Inventors: Hidezo Kugimiya; 秀造釘宮
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1992-07-06
Filing date: 1992-07-06
Publication date: 1994-01-28

Abstract

PURPOSE: To divide a text accurately and make language processing efficient by detecting division positions in the text from both delimiter characters and the text format, dividing the text at those positions, and outputting the divided text. CONSTITUTION: An original text to be translated is input (S1), and one-sentence cutout is executed (S2). The cutout extracts format information such as the layout and character types of the text, cuts out each sentence using both the delimiter characters contained in the text and this format information, and stores the extracted layout information in correspondence with each sentence of the text. The text is then translated sequentially (S3) with each cut-out sentence as the input unit; the translated sentences are formatted (S4) by applying the stored format information and are output (S5). Because the text is divided using format information in addition to ordinary delimiter characters, sentence boundaries that cannot be expressed by delimiter characters alone are detected correctly, and the text can be cut out sentence by sentence.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] This invention relates to a text dividing device used in language processing such as machine translation, text summarization, and keyword extraction. In particular, it relates to a text dividing device that, after an original text has been input in a batch, cuts the text out in predetermined division units, for example sentence by sentence, and outputs the units to subsequent processing.

[0002]

[Prior Art] In language processing such as machine translation, it is common to input the original text in a batch by means of an OCR (optical character reader) or the like and then perform predetermined processing. In this case, processing such as machine translation, text summarization, and keyword extraction is performed on one sentence of the text at a time. It is therefore necessary to divide the batch-input original text into individual sentences.

[0003] Conventionally, this one-sentence cutout has been performed by recognizing delimiter characters that mark sentence breaks, such as the period (.), colon (:), and semicolon (;), and dividing the text at those delimiter characters.
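The conventional approach described above can be sketched roughly as follows. This is a minimal illustration, not the prior-art device itself: the function name `split_on_delimiters` and the choice to collapse line breaks into spaces are assumptions made here to show why delimiter-only cutout merges a title with the sentence that follows it.

```python
import re

def split_on_delimiters(text: str) -> list[str]:
    """Cut text after each period, colon, or semicolon; layout is ignored.

    Assumption: line breaks carry no meaning, so they are collapsed to spaces.
    """
    flat = " ".join(text.split())
    # Each piece is a run of non-delimiter characters plus an optional delimiter.
    pieces = re.findall(r"[^.:;]+[.:;]?", flat)
    return [p.strip() for p in pieces if p.strip()]
```

With the input `"Introduction\n\nThis device splits text. It keeps layout.\n"`, the title has no delimiter, so it is returned fused with the first sentence as `"Introduction This device splits text."`.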

[0004]

[Problems to Be Solved by the Invention] Such a conventional text dividing device cannot divide the text at a position where no delimiter character exists. As a result, erroneous processing sometimes occurred: the title and the body of a text were output as a single unit without being divided, or many items listed over several lines were output together as one sentence. When such an erroneous one-sentence cutout occurs, the erroneous parts must be corrected before the subsequent processing can be performed correctly. Using a conventional text dividing device therefore lowers the efficiency of the language processing as a whole.

[0005] It is therefore an object of this invention to provide a text dividing device that can divide text more accurately than conventional text dividing devices and can thereby make the subsequent language processing more efficient.

[0006]

[Means for Solving the Problems] The text dividing device according to the present invention includes division position detection means for detecting division positions of a text from the delimiter characters contained in the text and from the format of the text, and output means for dividing the text at the division positions detected by the division position detection means and outputting it.

[0007]

[Operation] In the text dividing device according to the present invention, division positions are detected using not only delimiter characters but also the format of the text, and the text is divided at the division positions thus detected.

[0008]

[Embodiment] An embodiment of this invention is described in detail below with reference to the drawings. In this specification, the "format" of a text means all information representing the arrangement of characters, such as the layout indicating how sentences are arranged and the character types used for the characters that make up a sentence.

[0009] FIG. 1 is a schematic diagram showing a flowchart of the processing performed by a machine translation device that uses a text dividing device according to an embodiment of the present invention, together with part of the hardware. First, in step S1, the original text to be translated is input by an OCR or the like (not shown).

[0010] Next, in step S2, one-sentence cutout is performed using the text dividing device according to the present invention. This cutout extracts format information such as the layout and character types of the text and is performed sentence by sentence using not only the delimiter characters contained in the text but also this format information. At this time, the extracted layout information is also stored in association with each sentence of the text, as shown in FIG. 1.

[0011] In step S3, translation is performed sequentially with each sentence cut out in step S2 as the input unit.

[0012] Next, in step S4, the translated sentences obtained in step S3 are formatted by applying the format information, such as layout and character types, that was stored by the processing of step S2.

[0013] Then, in step S5, the sentences formatted in step S4 are output, and the processing ends.
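The flow of steps S2 to S5 can be illustrated with a small sketch. It rests on strong simplifying assumptions that are not in the patent: the input is already one unit per line, the only format information kept is the number of leading spaces, and `translate` is a hypothetical callable standing in for the translation step.

```python
from typing import Callable

def translate_document(lines: list[str], translate: Callable[[str], str]) -> str:
    """Cut out units with their layout, translate each, then re-apply the layout."""
    buffer: list[tuple[str, int]] = []
    for line in lines:                                # S2: cut out and store layout
        indent = len(line) - len(line.lstrip(" "))
        buffer.append((line.strip(), indent))
    out: list[str] = []
    for text, indent in buffer:
        translated = translate(text) if text else ""  # S3: translate unit by unit
        out.append(" " * indent + translated)         # S4: re-apply stored layout
    return "\n".join(out)                             # S5: output
```

Passing `str.upper` as a stand-in translator shows that blank lines and indentation survive the round trip while the text of each unit is replaced.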

[0014] FIG. 2 is a flowchart showing in more detail the one-sentence cutout performed in step S2 of FIG. 1. FIG. 3 is a schematic diagram showing an example of input text, and FIG. 4 is a schematic diagram of a buffer showing the correspondence between the layout/character type information obtained when the one-sentence cutout is performed on the text of FIG. 3 and the sentences that were cut out.

[0015] Referring to FIG. 2, the one-sentence cutout is performed as follows. First, in step S11, a line pointer, which points to the line of the text currently being processed, is set to the first line of the text.

[0016] In step S12, it is determined whether the line indicated by the pointer is a blank line. A blank line is a line that contains no characters. If the line being processed is a blank line, the processing proceeds to step S15; otherwise it proceeds to step S13.

[0017] In step S13, it is determined whether the line being processed begins with a combination of a number and a symbol. A line that begins with such a combination is likely to be a title, so the processing proceeds to step S14. If the line does not begin with a combination of a number and a symbol, the processing proceeds to step S18.
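The patent does not say which numbers and symbols count in step S13, so the following is only one possible reading. The pattern below (digits, optionally dotted as in "2.1", followed by ".", ")" or ":" and then whitespace or end of line) and the name `starts_with_number_and_symbol` are assumptions made for illustration.

```python
import re

# Assumed pattern for "number + symbol" at the head of a line, e.g. "1." or "2.1)".
# The lookahead rejects decimals such as "3.5 metres".
NUMBERED_HEAD = re.compile(r"\d+(?:\.\d+)*[.):](?=\s|$)")

def starts_with_number_and_symbol(line: str) -> bool:
    """Step S13: does the line begin with a number followed by a symbol?"""
    return NUMBERED_HEAD.match(line) is not None
```

Under this reading, "1. Introduction" and "2.1) Scope" are treated as likely titles, while "1992 was a busy year" and "3.5 metres of cable" are not.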

[0018] In step S14, the line currently being processed is stored in the buffer as one sentence (one unit). After step S14, the processing proceeds to step S15.

[0019] When the processing proceeds from step S13 to step S18, it is determined in step S18 whether the line being processed begins with a space. If the line begins with a space, the processing proceeds to step S19; otherwise it proceeds to step S21.

[0020] In step S19, it is determined whether the first character after the space is a symbol. If it is a symbol, the processing proceeds to step S20; otherwise it proceeds to step S21.

[0021] In step S20, it is determined whether the line following the current line, or the line preceding it, has the same form as the current line. Whether lines have the same form refers to whether a line begins with a space and has a symbol as its first character, or whether that condition does not hold. If the next line or the previous line has the same form as the current line, the processing proceeds to step S14; otherwise it proceeds to step S21. When the processing proceeds to step S14, the current line is stored in the buffer as one sentence, just as when the line begins with a combination of a number and a symbol, and the processing continues from step S15.
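Steps S18 to S20 amount to a check that the current line and at least one neighbour both look like indented list items. A minimal sketch follows; the helper names are hypothetical, and "symbol" is assumed here to mean any character that is neither alphanumeric nor whitespace, which the patent does not specify.

```python
def is_indented_symbol_line(line: str) -> bool:
    """S18 and S19: the line begins with a space and its first visible character is a symbol."""
    if not line.startswith(" "):
        return False
    body = line.lstrip(" ")
    # Assumption: a symbol is anything that is not a letter, digit, or whitespace.
    return bool(body) and not body[0].isalnum() and not body[0].isspace()

def is_list_item(lines: list[str], i: int) -> bool:
    """S20: the current line and its previous or next line have the same form."""
    if not is_indented_symbol_line(lines[i]):
        return False
    prev_same = i > 0 and is_indented_symbol_line(lines[i - 1])
    next_same = i + 1 < len(lines) and is_indented_symbol_line(lines[i + 1])
    return prev_same or next_same
```

Requiring a matching neighbour means a single indented line that happens to start with a symbol falls through to the ordinary delimiter-based cutout of step S21 instead of being treated as a list item.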

[0022] If the result of any of the three determinations in steps S18, S19, and S20 is NO, the processing proceeds to step S21. In step S21, the ordinary one-sentence cutout is applied to the text from the current line up to just before the next blank line. That is, the text is divided at the delimiter characters it contains, such as periods and colons, and each resulting piece is treated as one sentence. The processing then proceeds to step S22.

[0023] In step S22, the line pointer is advanced to the part of the text on which the one-sentence cutout has not yet been performed. After step S22, the processing proceeds to step S17.

[0024] In step S15, the format information of the text, such as layout and character types, is stored in the buffer. The format information here includes the number of spaces at the beginning of the sentence, if any, the typeface used (for example bold or italic), and whether there is a line break at the end of the sentence. The details are described later with reference to FIGS. 3 and 4.
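The format information stored in step S15 could be represented as a small record per line. The sketch below is an assumption about shape only: `FormatInfo` and `extract_format` are hypothetical names, and because plain text carries no typeface, that field is left as a placeholder that an OCR front end would have to fill in.

```python
from dataclasses import dataclass

@dataclass
class FormatInfo:
    leading_spaces: int        # number of spaces at the beginning of the sentence
    ends_with_newline: bool    # whether a line break follows the sentence
    typeface: str = "unknown"  # e.g. "bold" or "italic"; not recoverable from plain text

def extract_format(raw_line: str) -> FormatInfo:
    """Collect the layout facts named in step S15 from one raw input line."""
    content = raw_line.rstrip("\r\n")
    return FormatInfo(
        leading_spaces=len(content) - len(content.lstrip(" ")),
        ends_with_newline=raw_line.endswith("\n"),
    )
```

Keeping this record next to each cut-out sentence is what lets a later stage restore indentation and line breaks on the translated text.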

[0025] After step S15, the processing proceeds to step S16, where the line pointer is advanced by one. The line being processed thus moves forward by one. After step S16, the processing proceeds to step S17.

[0026] In step S17, it is determined whether a line to be processed exists at the position indicated by the line pointer newly set in step S16 or step S22. If such a line exists, the processing returns to step S12, and the processing from step S12 onward is repeated. If no such line exists, the processing ends.
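Taken together, steps S11 to S22 form a single loop over the lines of the text. The following is a sketch under stated assumptions rather than the patented implementation: the regular expressions, the definition of "symbol", the dictionary layout of the buffer entries, and the decision to record blank lines as empty entries are all choices made here for illustration.

```python
import re

NUMBERED_HEAD = re.compile(r"\d+(?:\.\d+)*[.):](?=\s|$)")  # assumed S13 pattern
SENTENCE = re.compile(r"[^.:;]+[.:;]?")                    # ordinary delimiter cutout

def _indented_symbol(line: str) -> bool:
    """S18 and S19: leading space, then a character that is not alphanumeric."""
    body = line.lstrip(" ")
    return (line.startswith(" ") and bool(body)
            and not body[0].isalnum() and not body[0].isspace())

def cut_sentences(lines: list[str]) -> list[dict]:
    """Return buffer entries {'text', 'indent'} following steps S11 to S22."""
    buffer: list[dict] = []
    i = 0                                             # S11: point at the first line
    while i < len(lines):                             # S17: is a line left to process?
        line = lines[i]
        if line.strip() == "":                        # S12: blank line
            buffer.append({"text": "", "indent": 0})  # S15: record it as its own entry
            i += 1                                    # S16
            continue
        neighbours = lines[max(i - 1, 0):i] + lines[i + 1:i + 2]
        is_title = NUMBERED_HEAD.match(line) is not None          # S13
        is_item = _indented_symbol(line) and any(map(_indented_symbol, neighbours))  # S18-S20
        if is_title or is_item:
            indent = len(line) - len(line.lstrip(" "))
            buffer.append({"text": line.strip(), "indent": indent})  # S14 and S15
            i += 1                                    # S16
            continue
        j = i                                         # S21: up to the next blank line
        while j < len(lines) and lines[j].strip() != "":
            j += 1
        block = " ".join(part.strip() for part in lines[i:j])
        for piece in SENTENCE.findall(block):
            if piece.strip():
                buffer.append({"text": piece.strip(), "indent": 0})
        i = j                                         # S22: skip the processed part
    return buffer
```

For a hypothetical input consisting of a numbered title, a blank line, two sentences of running text ending in a colon, another blank line, and two indented items starting with "-", the function returns seven entries: the title, a blank entry, the two sentences, a blank entry, and the two items, each item keeping its indent.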

[0027] Performing the one-sentence cutout shown in FIG. 2 gives the following result. FIG. 3 is an example of input text. In the text shown in FIG. 3, the title and the body are separated by a blank line. The body in turn contains a portion of running text and a number of example items introduced by that running text, and these two portions are separated by a blank line.

[0028] For the text shown in FIG. 3, unless the text is also divided at positions other than ordinary delimiter characters, the title and the running text are joined together, or several example sentences are joined to one another, and the one-sentence cutout is not performed correctly.

[0029] In contrast, when this text is divided using the text dividing device of the present invention, the result is as shown in FIG. 4. Referring to FIG. 4, the title (sentence number 1) and the running text (sentence number 3) are separated by the blank line of sentence number 2. Sentence number 3 and sentence number 4 are separated by an ordinary delimiter character (a period), and sentence number 4 and sentence number 6 are separated by an ordinary delimiter character (a colon) and the blank line of sentence number 5. The example items of sentence numbers 6 and 7 each end with a line break, so this layout information divides them into two sentences. The other example sentences are separated in the same way. Sentence number 1 and sentence number 3 can also be divided on the basis that they use different character types.

[0030] As described above, the text dividing device according to the present invention divides text using format information in addition to ordinary delimiter characters. Sentence breaks that cannot be expressed by delimiter characters alone can therefore be detected correctly, and the text can be cut out sentence by sentence. Because sentences that cannot be divided by delimiter characters alone are also divided correctly, there is little need to correct the result of the one-sentence cutout before the subsequent language processing, and processing efficiency is improved.

[0031]

[Effects of the Invention] As described above, the text dividing device according to the present invention can use the format information of a text to detect division positions that cannot be expressed by ordinary delimiter characters alone, and can divide the text at the division positions thus detected. The accuracy of text division is therefore higher than when the text is divided using delimiter characters alone, and there is less need to correct the result of text division before the subsequent processing.

[0032] As a result, a text dividing device can be provided that improves the accuracy of text division and also raises the efficiency of the subsequent language processing.

[Brief description of drawings]

FIG. 1 is a schematic diagram showing a flowchart of the processing performed by a machine translation device that uses a text dividing device according to an embodiment of the present invention, together with part of the device.

FIG. 2 is a flowchart of the one-sentence cutout.

FIG. 3 is a schematic diagram showing an example of input text.

FIG. 4 is a schematic diagram of a buffer showing the processing result when the text shown in FIG. 3 is divided by the text dividing device according to the present invention.

Claims

[Claims]

1. A text dividing device comprising: division position detection means for detecting a division position of a text from a delimiter character included in the text and from the format of the text; and output means for dividing the text at the division position detected by the division position detection means and outputting the divided text.