JPH07104865B2

JPH07104865B2 - Sentence cutting device

Info

Publication number: JPH07104865B2
Application number: JP5096694A
Authority: JP
Inventors: 章浩古川
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1993-03-31
Filing date: 1993-03-31
Publication date: 1995-11-13
Anticipated expiration: 2010-11-13
Also published as: JPH06290209A

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a sentence cutting device that cuts a document written in natural language into sentences, and in particular to a sentence cutting device that cuts text written inside tables and graphs.

[0002]

[Prior Art] When, for example, a document written in natural language is machine translated, the document must be divided into sentence units, that is, sentence cutting must be performed.

[0003] Conventionally, in such cases, sentence cutting has been performed on the basis of the full stop "。" in Japanese text and on the basis of the period "." in English text (see, for example, JP-A-64-61863).

[0004] If the document contains full stops or periods, sentence cutting can easily be performed with the conventional technique described above.

[0005] However, text in tables and graphs generally does not use full stops or periods. Consequently, when text spanning multiple lines appears in a table or graph, the conventional technique cannot determine whether it is a single continuous sentence that has been wrapped or several sentences listed one after another.

[0006] One conceivable approach is therefore to cut the text in tables and graphs using a technique that determines whether adjacent morphemes (words) can be connected on the basis of their parts of speech (for example, JP-A-61-16367). That is, whether the two can be connected is determined from the part of speech of the last morpheme on line i and the part of speech of the first morpheme on line (i+1). If they can be connected, line i and line (i+1) are judged to form a continuous sentence; if they cannot, line i and line (i+1) are judged to be different sentences.

[0007]

[Problem to Be Solved by the Invention] However, because the above technique judges whether lines form a continuous sentence solely from the parts of speech of adjacent morphemes, it is prone to judgment errors. For example, a noun followed by a verb generally does not connect easily, yet it sometimes does, as in 「私行く」 ("I go"). Judging continuity from the parts of speech of adjacent morphemes alone therefore tends to produce errors.

[0008] An object of the present invention is to provide a sentence cutting device that is less prone to judgment errors when cutting text in tables and graphs.

[0009]

[Means for Solving the Problem] To achieve the above object, the present invention provides: layout analysis means for extracting table portions and graph portions from the original text; text extraction means for extracting the text present in the table portions and graph portions extracted by the layout analysis means, in units each of which can be regarded as one sentence; analysis means for performing morphological analysis and syntactic analysis on each unit extracted by the text extraction means and obtaining the connection cost and the presence or absence of modification between the lines of the unit; determination means for determining whether the lines of the unit are continuous on the basis of the connection cost and the presence or absence of modification obtained by the analysis means; and sentence dividing/combining means for cutting the unit into sentences on the basis of the determination result of the determination means.

[0010]

[Operation] The layout analysis means extracts the table portions and graph portions from the original text, and the text extraction means extracts the text present in those portions in units each of which can be regarded as one sentence.

[0011] The analysis means performs morphological analysis and syntactic analysis on each unit extracted by the text extraction means, and obtains the connection cost and the presence or absence of modification between the lines of the unit.

[0012] The determination means determines whether the lines of the unit extracted by the text extraction means are continuous, on the basis of the connection cost and the presence or absence of modification obtained by the analysis means.

[0013] The sentence dividing/combining means cuts the unit extracted by the text extraction means into sentences according to the determination result of the determination means.

[0014]

[Embodiment] An embodiment of the present invention will now be described in detail with reference to the drawings.

[0015] FIG. 1 is a block diagram of an embodiment of the present invention, which comprises input means 10, storage means 20, layout analysis means 30, text extraction means 40, analysis means 50, determination means 60, sentence dividing/combining means 70, and output means 80.

[0016] The input means 10 inputs original text written in natural language and consists of a floppy disk device, a magnetic tape device, an optical character reader (OCR), a keyboard, or the like.

[0017] The storage means 20 stores the original text input by the input means 10 and consists of a storage device inside the computer, such as a memory device or a magnetic disk device.

[0018] The layout analysis means 30 divides the original text stored in the storage means 20 into table portions, graph portions, and text portions (character-string portions), and passes the table portions and graph portions to the text extraction means 40. If the original text stored in the storage means 20 is written in one of the standard document interchange formats that express two-dimensional layout information and attributes (for example, SGML or PostScript), a parser for that format can serve as the layout analysis means 30.

[0019] The text extraction means 40 extracts the text present in the table portions and graph portions passed from the layout analysis means 30, in units each of which can be regarded as one sentence, and passes each unit to the analysis means 50 and the sentence dividing/combining means 70.

[0020] The analysis means 50 has four functions: performing morphological analysis and syntactic analysis on the text passed from the text extraction means 40; obtaining the connection cost between lines from the analysis result; checking, from the analysis result, whether a morpheme that modifies the first morpheme of each line from the second line onward exists before that morpheme, and creating modification presence/absence information indicating the result; and outputting the connection cost and the modification presence/absence information to the determination means 60. The connection cost is information indicating ease of connection. In this embodiment it takes one of three levels, "low", "medium", or "high", and the lower the connection cost, the more easily the lines connect.

[0021] The determination means 60 determines whether a given line and the next line are continuous, on the basis of the connection cost and the modification presence/absence information from the analysis means 50.

[0022] The sentence dividing/combining means 70 outputs the text passed from the text extraction means 40 in divided or combined form, according to the determination result of the determination means 60.

[0023] The output means 80 is the means for input/output with the outside and consists of a floppy disk device, a magnetic tape device, a printer, a display device, or the like.

[0024] FIG. 2 is a flowchart showing an example of the processing of the text extraction means 40, FIG. 3 is a flowchart showing an example of the processing of the analysis means 50, and FIG. 4 shows an example of the determination criteria of the determination means 60. The operation of this embodiment is described below with reference to these figures.

[0025] The input means 10 inputs original text written in natural language and stores it in the storage means 20.

[0026] Once the original text is stored in the storage means 20, the layout analysis means 30 divides it into table portions, graph portions, and text portions, and passes the table portions and graph portions to the text extraction means 40. For example, if the original text stored in the storage means 20 is as shown in FIG. 5, it is divided into text portion 51, graph portion 52, and table portion 53, and graph portion 52 and table portion 53 are passed to the text extraction means 40.

[0027] When a table portion or a graph portion is passed from the layout analysis means 30, the text extraction means 40 starts the processing shown in the flowchart of FIG. 2.

[0028] When a table portion is passed (YES in step S1), the text extraction means 40 first takes the first column of the table as the processing target (step S2), regards the text in that column as one continuous sentence, and passes it to the analysis means 50 (step S3).

[0029] The text extraction means 40 then judges whether any unprocessed column remains in the table portion (step S4). If one remains, it makes the next column the processing target (step S5) and returns to step S3. If step S4 finds no unprocessed column, the text extraction means 40 ends its processing.
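The column loop of steps S2 to S5 can be sketched as follows. This is only an illustration under stated assumptions: the function name `extract_table_units` and the representation of a table portion as a list of columns, each a list of text lines, are hypothetical and do not come from the patent.

```python
from typing import Iterator, List


def extract_table_units(columns: List[List[str]]) -> Iterator[List[str]]:
    """Yield the lines of each column as one candidate unit (steps S2 to S5).

    `columns` is an assumed representation: one list of text lines per column.
    """
    index = 0                        # S2: start with the first column
    while index < len(columns):      # S4: is there an unprocessed column?
        yield list(columns[index])   # S3: the column's text is one unit
        index += 1                   # S5: move on to the next column


units = list(extract_table_units([["Sentence cutting", "device"], ["NEC"]]))
```

Each yielded unit is still only a candidate: whether its lines really form one sentence is decided by the later analysis and determination stages.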

[0030] When a graph portion is passed from the layout analysis means 30 (NO in step S1), the text extraction means 40 searches the graph portion from top to bottom (step S6).

[0031] When a character is detected (YES in step S7), the text extraction means 40 judges whether the detected character is an unprocessed character (step S8).

[0032] If the character is judged to be unprocessed (YES in step S8), the text extraction means 40 reads characters along the row, starting from the character detected in step S7, until a blank column appears (step S9).

[0033] The text extraction means 40 then moves to the next line and judges whether any character exists in the same columns as those from which the character string was read in step S9 (step S10).

[0034] If it judges that such a character exists, the processing returns to step S9, and characters on that next line are read along the row until a blank column appears.

[0035] Steps S9 and S10 are repeated until the result of step S10 becomes NO. When the result of step S10 becomes NO, the text extraction means 40 regards the not-yet-output part of the text read in step S9 as one continuous sentence, passes it to the analysis means 50 and the sentence dividing/combining means 70 (step S11), and returns to step S6. In other words, for the text of a graph portion, the text extraction means 40 regards text surrounded by blanks as one continuous sentence and outputs it to the analysis means 50 and the sentence dividing/combining means 70.

[0036] When the search has reached the end of the graph portion (YES in step S12), the text extraction means 40 ends its processing.
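The graph-portion search of steps S6 to S12 can be sketched as a scan over a character grid. The sketch rests on several assumptions that are not in the patent: the graph portion is given as a list of strings, a "blank column" is approximated as a run of at least `gap` spaces, and the names `extract_graph_units` and `read_run` are hypothetical.

```python
from typing import List


def extract_graph_units(grid: List[str], gap: int = 2) -> List[List[str]]:
    """Collect blank-delimited blocks of text, each as a list of lines."""
    width = max((len(row) for row in grid), default=0)
    rows = [row.ljust(width) for row in grid]
    done = [[False] * width for _ in rows]

    def read_run(r: int, start: int) -> int:
        # S9: read along row r until `gap` consecutive blanks appear.
        end, blanks, col = start, 0, start
        while col < width and blanks < gap:
            if rows[r][col] == " ":
                blanks += 1
            else:
                blanks, end = 0, col + 1
            col += 1
        return end  # exclusive end of the text that was read

    units = []
    for r in range(len(rows)):                       # S6: top to bottom
        for c in range(width):
            if rows[r][c] == " " or done[r][c]:      # S7, S8
                continue
            lo, hi = c, read_run(r, c)               # columns of first line
            row, start, lines = r, c, []
            while True:
                end = read_run(row, start)
                lines.append(rows[row][start:end])
                for k in range(start, end):
                    done[row][k] = True
                row += 1
                if row >= len(rows):
                    break
                # S10: does the next line have text in the same columns?
                hits = [k for k in range(lo, hi)
                        if rows[row][k] != " " and not done[row][k]]
                if not hits:
                    break
                start = hits[0]
            units.append(lines)                      # S11: output one unit
    return units


grid = ["Sales by      Number of", "region        units", "", "    2020"]
units = extract_graph_units(grid)
```

With this grid, the two axis labels are each picked up as a two-line unit and the isolated "2020" as a one-line unit. A real layout analyser would work on positioned text rather than a monospaced grid, so the `gap` threshold is purely illustrative.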

[0037] When text from a table or a graph is passed from the text extraction means 40, the analysis means 50 starts the processing shown in the flowchart of FIG. 3.

[0038] First, the analysis means 50 judges whether the text passed from the text extraction means 40 consists of multiple lines (step S21).

[0039] If it judges that the text consists of multiple lines, the analysis means 50 performs morphological analysis on the text and obtains the connection cost between morphemes on the basis of their parts of speech (step S22). The analysis means 50 then performs syntactic analysis and examines, among other things, the modification relations of the morphemes (step S23).

[0040] The analysis means 50 then obtains the connection cost between lines from the connection costs between morphemes obtained in step S22 (step S24). That is, the connection cost between the last morpheme of line i and the first morpheme of line (i+1) is taken as the connection cost between line i and line (i+1). However, if a single morpheme spans both line i and line (i+1), the connection cost between line i and line (i+1) is set to "low", the value indicating the easiest connection.
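Step S24 can be sketched as below. The `Morpheme` record, the `POS_COST` table and its entries, and the default of "medium" are all hypothetical; the patent only states that morpheme costs are derived from parts of speech and take the values "low", "medium", or "high". The sketch also assumes every line contains at least one morpheme.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Morpheme:
    surface: str
    pos: str          # part of speech
    first_line: int   # index of the line where the morpheme starts
    last_line: int    # index of the line where the morpheme ends


# Hypothetical part-of-speech cost table; real values would come from a grammar.
POS_COST = {("noun", "particle"): "low", ("noun", "verb"): "high"}


def morpheme_cost(left: Morpheme, right: Morpheme) -> str:
    return POS_COST.get((left.pos, right.pos), "medium")


def line_costs(morphemes: List[Morpheme], n_lines: int) -> List[str]:
    """Return the connection cost for each boundary between line i and i+1."""
    costs = []
    for i in range(n_lines - 1):
        # A morpheme that spans both lines forces the cost to "low".
        if any(m.first_line <= i and m.last_line >= i + 1 for m in morphemes):
            costs.append("low")
            continue
        left = [m for m in morphemes if m.last_line == i][-1]
        right = [m for m in morphemes if m.first_line == i + 1][0]
        costs.append(morpheme_cost(left, right))
    return costs


morphemes = [
    Morpheme("machine", "noun", 0, 0),
    Morpheme("translation", "noun", 0, 1),   # wrapped across lines 0 and 1
    Morpheme("system", "noun", 1, 1),
    Morpheme("runs", "verb", 2, 2),
]
costs = line_costs(morphemes, n_lines=3)
```

Here the first boundary is "low" because "translation" is split across the two lines, and the second boundary takes whatever the assumed table assigns to a noun followed by a verb.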

[0041] Having obtained the connection cost between lines, the analysis means 50, in step S25, uses the syntactic analysis result to generate, for the first morpheme of each line from the second line onward, modification presence/absence information indicating whether a morpheme that modifies it exists before it.

[0042] That is, if a morpheme that modifies the first morpheme of line i exists on line (i-1) or earlier, the modification presence/absence information between line (i-1) and line i is set to "present", indicating that modification exists; otherwise it is set to "absent", indicating that no modification exists. However, if a single morpheme spans both line (i-1) and line i, the modification presence/absence information between line (i-1) and line i is set to "present" regardless of whether modification actually exists.
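Step S25 can be sketched in the same spirit. The `Node` record and its `modifies` field (the index of the morpheme that this morpheme modifies, as a syntactic analyser might report it) are assumptions made for illustration, as is the requirement that every line contain at least one morpheme.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    surface: str
    first_line: int
    last_line: int
    modifies: Optional[int] = None   # index of the modified morpheme, if any


def modification_flags(nodes: List[Node], n_lines: int) -> List[str]:
    """Return "present"/"absent" for each boundary between line i-1 and i."""
    flags = []
    for i in range(1, n_lines):
        # A morpheme spanning lines i-1 and i always yields "present".
        if any(n.first_line <= i - 1 and n.last_line >= i for n in nodes):
            flags.append("present")
            continue
        # Index of the first morpheme on line i.
        head = next(k for k, n in enumerate(nodes) if n.first_line == i)
        # Every morpheme before it lies on line i-1 or earlier.
        modified = any(n.modifies == head for n in nodes[:head])
        flags.append("present" if modified else "absent")
    return flags


nodes = [
    Node("red", 0, 0, modifies=1),   # modifies "apple" on the next line
    Node("apple", 1, 1),
    Node("price", 2, 2),
]
flags = modification_flags(nodes, n_lines=3)
```

In this toy input the first boundary is marked "present" because "red" modifies the morpheme that opens line 1, while nothing earlier modifies "price", so the second boundary is "absent".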

[0043] Having obtained the connection cost and the modification presence/absence information between lines in steps S24 and S25, the analysis means 50 passes them to the determination means 60 (step S26).

[0044] When the connection cost and the modification presence/absence information between lines are sent from the analysis means 50, the determination means 60 determines whether each pair of adjacent lines is continuous according to the determination criteria shown in FIG. 4, and passes the determination result to the sentence dividing/combining means 70.

[0045] That is, when the connection cost between line i and line (i+1) passed from the analysis means 50 is "low", indicating that line i and line (i+1) connect easily, the determination means 60 judges that line i and line (i+1) are continuous regardless of whether the modification presence/absence information is "present" or "absent". When the connection cost between line i and line (i+1) is "medium" or "high", the determination means 60 judges that the lines are not continuous if the modification presence/absence information is "absent", and that they are continuous if it is "present".

[0046] When the determination result is passed from the determination means 60, the sentence dividing/combining means 70 divides or combines the text passed from the text extraction means 40 according to that result and outputs it to the output means 80. The text divided or combined by the sentence dividing/combining means 70 can also be used as input to another document processing program.
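The dividing and combining step can be sketched as follows, assuming the determination result arrives as one boolean per boundary between adjacent lines. The name `cut_sentences` and the `joiner` parameter (empty for Japanese, a space for space-delimited languages) are hypothetical additions.

```python
from typing import List


def cut_sentences(lines: List[str], continuous: List[bool],
                  joiner: str = "") -> List[str]:
    """Join lines across continuous boundaries and split at the others."""
    if not lines:
        return []
    if len(continuous) != len(lines) - 1:
        raise ValueError("need exactly one decision per line boundary")
    sentences = [lines[0]]
    for line, joined in zip(lines[1:], continuous):
        if joined:
            sentences[-1] += joiner + line   # same sentence, wrapped
        else:
            sentences.append(line)           # a new sentence starts here
    return sentences


result = cut_sentences(["Number of", "units sold", "Unit price"],
                       [True, False], joiner=" ")
```

In this example the first two lines are merged into one sentence and the third is kept separate, which is the form in which the text would be handed to the output means or to a downstream program.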

[0047]

[Effect of the Invention] As described above, the present invention cuts the text of table portions and graph portions into sentences on the basis of the connection cost and the presence or absence of modification between lines, and therefore has the effect of enabling highly reliable sentence cutting.

[Brief description of drawings]

FIG. 1 is a block diagram of an embodiment of the present invention.

FIG. 2 is a flowchart showing an example of the processing of the text extraction means 40.

FIG. 3 is a flowchart showing an example of the processing of the analysis means 50.

FIG. 4 is a diagram showing an example of the determination criteria of the determination means 60.

FIG. 5 is a diagram showing an example of original text.

[Explanation of symbols]

10: input means
20: storage means
30: layout analysis means
40: text extraction means
50: analysis means
60: determination means
70: sentence dividing/combining means
80: output means

Claims

[Claims]

1. A sentence cutting device comprising: layout analysis means for extracting table portions and graph portions from original text; text extraction means for extracting the text present in the table portions and graph portions extracted by the layout analysis means, in units each of which can be regarded as one sentence; analysis means for performing morphological analysis and syntactic analysis on each unit extracted by the text extraction means, and obtaining the connection cost and the presence or absence of modification between the lines of the unit; determination means for determining whether the lines of the unit are continuous on the basis of the connection cost and the presence or absence of modification obtained by the analysis means; and sentence dividing/combining means for cutting the unit into sentences on the basis of the determination result of the determination means.

2. The sentence cutting device according to claim 1, wherein the text extraction means, for the text of a table portion, treats the text present in each column as a unit that can be regarded as one sentence, and, for the text of a graph portion, treats each piece of text as a unit that can be regarded as one sentence.