JPH07104865B2 - Sentence cutting device - Google Patents

Sentence cutting device

Info

Publication number
JPH07104865B2
JPH07104865B2 JP5096694A JP9669493A JPH07104865B2 JP H07104865 B2 JPH07104865 B2 JP H07104865B2 JP 5096694 A JP5096694 A JP 5096694A JP 9669493 A JP9669493 A JP 9669493A JP H07104865 B2 JPH07104865 B2 JP H07104865B2
Authority
JP
Japan
Prior art keywords
text
sentence
unit
analysis
regarded
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
JP5096694A
Other languages
Japanese (ja)
Other versions
JPH06290209A (en)
Inventor
章浩 古川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corp
Priority to JP5096694A
Publication of JPH06290209A
Publication of JPH07104865B2
Anticipated expiration
Legal status: Expired - Fee Related

Landscapes

  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Description

【発明の詳細な説明】Detailed Description of the Invention

【0001】[0001]

【産業上の利用分野】本発明は自然言語で記述された文
書の文切りを行なう文切り装置に関し、特に、表やグラ
フ内に記述された文書の文切りを行なう文切り装置に関
する。
[Industrial Field of Application] The present invention relates to a sentence cutting device that divides a document written in natural language into sentences, and more particularly to a sentence cutting device that divides text written inside tables and graphs into sentences.

【0002】[0002]

【従来の技術】自然言語で記述された文書を機械翻訳す
る場合等、文書を文単位に分割すること、即ち文切りす
ることが必要になる。
[Prior Art] When a document written in natural language is machine translated, for example, the document must be divided into sentence units; that is, sentence cutting must be performed.

【0003】従来はこのような場合、日本語文に於いて
は句点「。」に基づいて文切りを行ない、英語文に於い
てはピリオド「.」に基づいて文切りを行なっていた
(例えば、特開昭64−61863号公報)。
Conventionally, in such cases, sentence cutting has been based on the full stop "。" in Japanese text and on the period "." in English text (see, for example, JP-A-64-61863).
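The conventional approach described above can be illustrated with a minimal Python sketch. The function name `split_on_terminators` and the sample strings are assumptions made for illustration only; they do not come from the patent or from the cited publication.

```python
def split_on_terminators(text, terminators="。."):
    """Cut text after every Japanese full stop or ASCII period (naive baseline)."""
    sentences, current = [], []
    for ch in text:
        current.append(ch)
        if ch in terminators:
            # A terminator closes the sentence collected so far.
            sentences.append("".join(current).strip())
            current = []
    tail = "".join(current).strip()
    if tail:
        # Text without a closing terminator is returned as one piece.
        sentences.append(tail)
    return sentences


print(split_on_terminators("これは文です。次の文です。"))
print(split_on_terminators("売上増加\n費用減少"))
```

The second call shows the limitation the following paragraphs discuss: two lines with no terminator come back as a single piece, so this baseline cannot tell whether they are one wrapped sentence or two separate ones.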

【0004】文書中に句点やピリオドがあれば、上記し
た従来の技術に基づいて容易に文切りを行なうことがで
きる。
If the document contains full stops or periods, sentence cutting is easy with the conventional technique described above.

【0005】しかし、表中の文書やグラフ中の文書は一
般に句点やピリオドを用いないため、表,グラフ中に複
数行の文が存在している場合、上述した従来技術ではそ
れが連続した1文が折り返されたものなのか、複数の文
が並記されたものなのかを判別することができない。
However, text inside tables and graphs generally does not use full stops or periods. When a table or graph contains text spanning several lines, the conventional technique described above cannot determine whether those lines are one continuous sentence that has been wrapped or several sentences written one after another.

【0006】そこで、隣接する形態素(単語)が接続可
であるか否かを隣接する形態素の品詞に基づいて判定す
るという技術(例えば、特開昭61−16367号公
報)を用いて表中の文書やグラフ中の文書を文切りする
ということが考えられる。即ち、第i行目の最後の形態
素の品詞と第(i+1)行目の最初の形態素の品詞とに
基づいて両者が接続可であるか否かを判定し、接続可で
あれば、第i行と第(i+1)行は連続した文と判定
し、接続不可であれば、第i行と第(i+1)行は異な
る文と判定するものである。
One conceivable approach is to cut the text in tables and graphs by using a technique that determines whether adjacent morphemes (words) can connect, based on the parts of speech of those morphemes (see, for example, JP-A-61-16367). That is, whether the last morpheme of the i-th line and the first morpheme of the (i+1)-th line can connect is determined from their parts of speech. If they can connect, the i-th line and the (i+1)-th line are judged to be one continuous sentence; if they cannot, the two lines are judged to be different sentences.
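A minimal Python sketch of this part-of-speech check follows. The connectability table `CONNECTABLE` and the part-of-speech labels are hypothetical; the text does not give the actual rules of the cited technique.

```python
# Hypothetical table of part-of-speech pairs assumed to connect.
# The real rules of the cited technique are not given in the text.
CONNECTABLE = {
    ("noun", "particle"),
    ("particle", "verb"),
    ("adjective", "noun"),
    ("verb", "auxiliary"),
}


def lines_continue(last_pos_of_line, first_pos_of_next_line):
    """Judge two lines continuous only if the boundary POS pair is connectable."""
    return (last_pos_of_line, first_pos_of_next_line) in CONNECTABLE


print(lines_continue("noun", "particle"))
print(lines_continue("noun", "verb"))
```

With this assumed table, a noun followed by a verb is always judged as a sentence break, which is the kind of rigid decision the next section identifies as error-prone.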

【0007】[0007]

【発明が解決しようとする課題】しかし、上記した技術
は隣接する形態素の品詞に基づいて連続した文か否かを
判定しているだけであるので、判定誤りが生じやすいと
いう問題があった。例えば、名詞+動詞の並びは一般的
には接続しにくいが、「私行く」のように接続する場合
もあるので、隣接する形態素の品詞に基づいて連続した
文か否かを判定するだけでは判定誤りが生じやすい。
[Problem to Be Solved by the Invention] However, the technique described above judges whether lines form a continuous sentence solely from the parts of speech of the adjacent morphemes, so judgment errors occur easily. For example, a noun followed by a verb generally does not connect readily, yet the two do connect in cases such as 「私行く」 ("I go"). Judging continuity from the parts of speech of adjacent morphemes alone is therefore prone to error.

【0008】本発明の目的は表,グラフ中の文書を文切
りする際、判定誤りの生じにくい文切り装置を提供する
ことにある。
An object of the present invention is to provide a sentence cutting device which is less likely to cause a judgment error when cutting a document in a table or a graph.

【0009】[0009]

【課題を解決するための手段】本発明は上記目的を達成
するため、原文テキスト中の表部分及びグラフ部分を抽
出するレイアウト解析手段と、該レイアウト解析手段が
抽出した表部分及びグラフ部分に存在するテキストを、
1文とみなせる単位毎に抽出するテキスト抽出手段と、
該テキスト抽出手段が抽出した1文とみなせる単位毎に
形態素解析及び構文解析を行ない、前記1文とみなせる
単位の各行間の接続コスト及び修飾の有無を求める解析
手段と、該解析手段が求めた各行間の接続コスト及び修
飾の有無に基づいて前記1文とみなせる単位の各行が連
続するか否かを判定する判定手段と、該判定手段の判定
結果に基づいて前記1文とみなせる単位を文切りする文
分割/結合手段とを設けたものである。
[Means for Solving the Problem] To achieve the above object, the present invention provides: layout analysis means for extracting the table portions and graph portions of an original text; text extraction means for extracting the text present in the table portions and graph portions extracted by the layout analysis means, one unit that can be regarded as a single sentence at a time; analysis means for performing morphological analysis and syntactic analysis on each such unit extracted by the text extraction means, and for obtaining the connection cost between the lines of the unit and the presence or absence of modification; determination means for determining, based on the connection cost between lines and the presence or absence of modification obtained by the analysis means, whether the lines of the unit are continuous; and sentence dividing/combining means for cutting the unit into sentences based on the determination result of the determination means.

【0010】[0010]

【作用】原文テキスト中の表部分及びグラフ部分がレイ
アウト解析手段によって抽出され、表部分及びグラフ部
分に存在するテキストが1文とみなせる単位毎にテキス
ト抽出手段によって抽出される。
The layout analysis means extracts the table portion and the graph portion in the original text, and the text existing in the table portion and the graph portion is extracted by the text extracting means for each unit that can be regarded as one sentence.

【0011】解析手段はテキスト抽出手段が抽出した1
文とみなせる単位毎に形態素解析及び構文解析を行な
い、1文とみなせる単位の各行間の接続コスト及び修飾
の有無を求める。
The analysis means performs morphological analysis and syntactic analysis on each unit, extracted by the text extraction means, that can be regarded as one sentence, and obtains the connection cost between the lines of that unit and the presence or absence of modification.

【0012】判定手段は解析手段が求めた各行間の接続
コスト及び修飾の有無に基づいてテキスト抽出手段が抽
出した単位の各行が連続するか否かを判定する。
The determination means determines whether or not each line of the unit extracted by the text extraction means is continuous, based on the connection cost between each line and the presence or absence of modification obtained by the analysis means.

【0013】文分割/結合手段は判定手段の判定結果に
従って抽出手段が抽出した1文とみなせる単位を文切り
する。
The sentence dividing/combining means cuts the unit that can be regarded as one sentence, extracted by the text extraction means, into sentences according to the determination result of the determination means.

【0014】[0014]

【実施例】次に本発明の実施例について図面を参照して
詳細に説明する。
Embodiments of the present invention will now be described in detail with reference to the drawings.

【0015】図1は本発明の実施例のブロック図であ
り、入力手段10と、記憶手段20と、レイアウト解析
手段30と、テキスト抽出手段40と、解析手段50
と、判定手段60と、文分割/結合手段70と、出力手
段80とから構成されている。
FIG. 1 is a block diagram of an embodiment of the present invention, which comprises input means 10, storage means 20, layout analysis means 30, text extraction means 40, analysis means 50, determination means 60, sentence dividing/combining means 70, and output means 80.

【0016】入力手段10は自然言語で記述された原文
テキストを入力するものであり、フロッピーディスク装
置,磁気テープ装置,光学読み取り装置(OCR),キ
ーボード等によって構成される。
The input means 10 is for inputting an original text described in natural language, and comprises a floppy disk device, a magnetic tape device, an optical reading device (OCR), a keyboard and the like.

【0017】記憶手段20は入力手段10が入力した原
文テキストを記憶するものであり、メモリ装置や磁気デ
ィスク装置等のコンピュータ内の記憶装置によって構成
される。
The storage means 20 stores the original text input by the input means 10, and is constituted by a storage device in a computer such as a memory device or a magnetic disk device.

【0018】レイアウト解析手段30は記憶手段20に
記憶された原文テキストを、表部分と、グラフ部分と、
テキスト部分(文字列部分)とに分割し、表部分及びグ
ラフ部分をテキスト抽出手段40に渡す。記憶手段20
に記憶されている原文テキストが2次元のレイアウト情
報と属性を表現する種々の文書交換標準形式(例えば、
SGML,Postscript等)により記述された
ものである場合は、これら文書交換標準形式に対する解
析プログラムをレイアウト解析手段30とすることがで
きる。
The layout analysis means 30 divides the original text stored in the storage means 20 into a table portion, a graph portion, and a text portion (character string portion), and passes the table portion and the graph portion to the text extraction means 40. When the original text stored in the storage means 20 is written in one of the standard document interchange formats that express two-dimensional layout information and attributes (for example, SGML or PostScript), a parser for that format can serve as the layout analysis means 30.

【0019】テキスト抽出手段40はレイアウト解析手
段30から渡された表部分,グラフ部分に存在するテキ
ストを、1文とみなせる単位毎に抽出し、解析手段50
及び文分割/結合手段70に渡す機能を有する。
The text extraction means 40 has a function of extracting the text present in the table portion and graph portion passed from the layout analysis means 30, one unit that can be regarded as a single sentence at a time, and passing each unit to the analysis means 50 and the sentence dividing/combining means 70.

【0020】解析手段50はテキスト抽出手段40から
渡されたテキストに対して形態素解析及び構文解析を行
なう機能と、解析結果に基づいて行間の接続コストを求
める機能と、解析結果に基づいて第2行目以降の先頭の
形態素を修飾する形態素がそれよりも前に存在するか否
かを調べて存在の有無を示す修飾有無情報を作成する機
能と、接続コスト及び修飾有無情報を判定手段60に出
力する機能とを有する。尚、接続コストは接続しやすさ
を示す情報であり、本実施例では「低」,「中」,
「高」の3段階で接続のしやすさを表すものとする。ま
た、接続コストが低い程、接続しやすいものとする。
The analysis means 50 has a function of performing morphological analysis and syntactic analysis on the text passed from the text extraction means 40; a function of obtaining the connection cost between lines based on the analysis result; a function of checking, based on the analysis result, whether a morpheme that modifies the first morpheme of each line from the second line onward exists earlier in the text, and of creating modification presence/absence information that records the result; and a function of outputting the connection cost and the modification presence/absence information to the determination means 60. The connection cost indicates how readily two lines connect. In this embodiment it takes one of three levels, "low", "medium", or "high", and the lower the cost, the more readily the lines connect.

【0021】判定手段60は解析手段50からの接続コ
スト及び修飾有無情報に基づいて或る行と次の行とが連
続するか否かを判定する機能を有する。
The judging means 60 has a function of judging whether or not a certain row and the next row are continuous based on the connection cost and the modification presence / absence information from the analyzing means 50.

【0022】文分割/結合手段70は判定手段60の判
定結果に従ってテキスト抽出手段40から渡されたテキ
ストを分割した形或いは結合した形で出力する機能を有
する。
The sentence dividing / combining means 70 has a function of outputting the text delivered from the text extracting means 40 in a divided form or a combined form according to the judgment result of the judging means 60.

【0023】出力手段80は外部との入出力を図る手段
であり、フロッピーディスク装置,磁気テープ装置,プ
リンタ,ディスプレイ装置等により構成される。
The output means 80 provides input and output with the outside, and comprises a floppy disk device, a magnetic tape device, a printer, a display device, and the like.

【0024】図2はテキスト抽出手段40の処理例を示
す流れ図、図3は解析手段50の処理例を示す流れ図、
図4は判定手段60の判定基準の一例を示した図であ
り、以下各図を参照して本実施例の動作を説明する。
FIG. 2 is a flowchart showing a processing example of the text extraction means 40, FIG. 3 is a flowchart showing a processing example of the analysis means 50, and FIG. 4 is a diagram showing an example of the determination criteria of the determination means 60. The operation of this embodiment is described below with reference to these figures.

【0025】入力手段10は自然言語で記述された原文
テキストを入力し、記憶手段20に格納する。
The input means 10 inputs the original text described in natural language and stores it in the storage means 20.

【0026】記憶手段20に原文テキストが格納される
と、レイアウト解析手段30は原文テキストを表部分
と、グラフ部分と、テキスト部分とに分割し、表部分と
グラフ部分とをテキスト抽出手段40に渡す。今、例え
ば、記憶手段20に格納された原文テキストが図5に示
すものであるとすると、原文テキストをテキスト部分5
1と、グラフ部分52と、表部分53とに分割し、グラ
フ部分52及び表部分53をテキスト抽出手段40に渡
すことになる。
When the original text is stored in the storage means 20, the layout analysis means 30 divides it into a table portion, a graph portion, and a text portion, and passes the table portion and the graph portion to the text extraction means 40. For example, if the original text stored in the storage means 20 is the one shown in FIG. 5, the layout analysis means 30 divides it into a text portion 51, a graph portion 52, and a table portion 53, and passes the graph portion 52 and the table portion 53 to the text extraction means 40.

【0027】テキスト抽出手段40はレイアウト解析手
段30から表部分或いはグラフ部分が渡されると、図2
の流れ図に示す処理を開始する。
When a table portion or a graph portion is passed from the layout analysis means 30, the text extraction means 40 starts the processing shown in the flowchart of FIG. 2.

【0028】表部分が渡された場合(ステップS1がY
ES)は、テキスト抽出手段40は、先ず、表の先頭の
カラムを処理対象とし (ステップS2)、先頭のカラム
内のテキストを連続する1文とみなして解析手段50に
渡す (ステップS3)。
When a table portion is passed (YES in step S1), the text extraction means 40 first takes the first column of the table as the processing target (step S2), regards the text in that column as one continuous sentence, and passes it to the analysis means 50 (step S3).

【0029】その後、テキスト抽出手段40は表部分に
未処理のカラムがあるか否かを判断し (ステップS
4)、未処理のカラムがあると判断した場合は処理対象
を次のカラムにした後 (ステップS5)、ステップS3
の処理に戻る。また、ステップS4で未処理のカラムが
ないと判断した場合は、テキスト抽出手段40はその処
理を終了する。
After that, the text extraction means 40 determines whether the table portion still contains an unprocessed column (step S4). If it does, the text extraction means 40 moves the processing target to the next column (step S5) and returns to step S3. If step S4 finds no unprocessed column, the text extraction means 40 ends the processing.
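The column loop of steps S2 to S5 can be sketched in Python as follows. The row-major `rows` representation (one list of cell fragments per physical line of the table) and the function name `extract_table_units` are assumptions made for illustration; the patent does not specify a data structure.

```python
def extract_table_units(rows):
    """Return, for each column, its lines as one candidate unit (steps S2 to S5).

    `rows` is assumed to be row-major: rows[r][c] is the text fragment that
    physical line r contributes to column c.
    """
    units = []
    n_columns = max((len(row) for row in rows), default=0)
    for c in range(n_columns):  # S2, S4, S5: visit the columns in order
        lines = [row[c].strip() for row in rows if c < len(row) and row[c].strip()]
        if lines:
            units.append(lines)  # S3: the column's text is treated as one unit
    return units


rows = [["売上が", "Costs fell"], ["増加した", "in the second half"]]
print(extract_table_units(rows))
```

Each returned unit keeps its line breaks, since the later stages need to decide line by line whether the text continues.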

【0030】レイアウト解析手段30からグラフ部分が
渡された場合 (ステップS1がNO)は、テキスト抽出
手段40は渡されたグラフ部分を上から下に向かってサ
ーチする (ステップS6)。
When the graph portion is passed from the layout analysis means 30 (NO in step S1), the text extracting means 40 searches the passed graph portion from top to bottom (step S6).

【0031】そして、文字を検出すると (ステップS7
がYES)、テキスト抽出手段40は検出した文字が未
処理の文字か否かを判断する (ステップS8)。
When a character is detected (YES in step S7), the text extraction means 40 determines whether the detected character is an unprocessed character (step S8).

【0032】未処理の文字であると判断した場合 (ステ
ップS8がYES)は、テキスト抽出手段40はステッ
プS7で検出した文字を先頭にして空白列が現れるまで
行方向に文字を読み込む (ステップS9)。
When the character is judged to be unprocessed (YES in step S8), the text extraction means 40 reads characters along the line, starting from the character detected in step S7, until a run of blanks appears (step S9).

【0033】その後、テキスト抽出手段40は次の行を
見にいき、ステップS9で文字列を読み込んだ列と同じ
列に文字が存在するか否かを判断する (ステップS1
0)。
Thereafter, the text extraction means 40 moves to the next line and determines whether a character exists in the same columns as those from which the character string was read in step S9 (step S10).

【0034】そして、存在すると判断した場合はステッ
プS9に戻り、上記した次の行に於いて空白列が現れる
まで文字を行方向に読み込む。
If such a character exists, the process returns to step S9, and characters are read along that next line until a run of blanks appears.

【0035】ステップS10の判断結果がNOとなるま
で、ステップS9,S10の処理が繰り返し行なわれ、
ステップS10の判断結果がYESとなると、テキスト
抽出手段40はステップS9で読み込んだテキストの
内、未出力のテキストを連続する1文とみなして解析手
段50及び文分割/結合手段70に渡した後 (ステップ
S11)、ステップS6の処理に戻る。即ち、テキスト
抽出手段40はグラフ部分のテキストについては空白で
囲まれているテキストを連続する1文とみなして解析手
段50及び文分割/結合手段70に出力することにな
る。
Steps S9 and S10 are repeated until the determination in step S10 becomes NO. When the determination in step S10 becomes YES, the text extraction means 40 regards the not-yet-output part of the text read in step S9 as one continuous sentence, passes it to the analysis means 50 and the sentence dividing/combining means 70 (step S11), and returns to step S6. In other words, for text in a graph portion, the text extraction means 40 regards each block of text surrounded by blanks as one continuous sentence and outputs it to the analysis means 50 and the sentence dividing/combining means 70.

【0036】また、テキスト抽出手段40はグラフ部分
の最後までサーチを行なった場合 (ステップS12がY
ES)は、その処理を終了する。
When the text extraction means 40 has scanned to the end of the graph portion (YES in step S12), it ends the processing.
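The graph-portion scan of steps S6 to S12 can be sketched in Python under several assumptions: the graph portion is modelled as a list of text lines on a character grid, a single space counts as a blank, and the S9/S10 loop is read as ending when step S10 finds no character in the same columns of the next line. The names `read_run` and `extract_graph_units` are hypothetical.

```python
def read_run(row_text, start):
    """Step S9: return the exclusive end of the run of non-blank characters."""
    end = start
    while end < len(row_text) and row_text[end] != " ":
        end += 1
    return end


def extract_graph_units(lines):
    """Collect blocks of text surrounded by blanks, each as a list of lines."""
    width = max((len(line) for line in lines), default=0)
    grid = [line.ljust(width) for line in lines]
    done = [[False] * width for _ in grid]
    units = []
    for r in range(len(grid)):  # S6: scan from top to bottom
        for c in range(width):
            if grid[r][c] == " " or done[r][c]:  # S7, S8
                continue
            unit, row, start = [], r, c
            while True:
                end = read_run(grid[row], start)  # S9
                unit.append(grid[row][start:end])
                for k in range(start, end):
                    done[row][k] = True
                nxt = row + 1  # S10: look at the same columns on the next line
                if nxt >= len(grid):
                    break
                hits = [k for k in range(start, end) if grid[nxt][k] != " " and not done[nxt][k]]
                if not hits:
                    break
                start = hits[0]
                # Extend left so a run starting slightly earlier is read whole.
                while start > 0 and grid[nxt][start - 1] != " " and not done[nxt][start - 1]:
                    start -= 1
                row = nxt
            units.append(unit)  # S11: hand the block over as one unit
    return units


sample = ["売上高の" + " " * 4 + "費用", "推移"]
print(extract_graph_units(sample))
```

In the sample, the two-line label on the left comes back as one unit and the label on the right as another, which matches the idea of treating each block surrounded by blanks as one candidate sentence.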

【0037】解析手段50はテキスト抽出手段40から
表内或いはグラフ内のテキストが渡されると、図3の流
れ図に示す処理を開始する。
When the text in the table or the graph is delivered from the text extracting means 40, the analyzing means 50 starts the processing shown in the flow chart of FIG.

【0038】先ず、解析手段50はテキスト抽出手段4
0から渡されたテキストが複数行か否かを判断する (ス
テップS21)。
First, the analysis means 50 determines whether the text passed from the text extraction means 40 spans multiple lines (step S21).

【0039】複数行であると判断した場合は、解析手段
50はテキストに対して形態素解析を行ない、形態素の
品詞に基づいて各形態素間の接続コストを求める (ステ
ップS22)。更に、解析手段50は構文解析を行な
い、各形態素の修飾関係等を調べる (ステップS2
3)。
When the text spans multiple lines, the analysis means 50 performs morphological analysis on the text and obtains the connection cost between morphemes based on their parts of speech (step S22). The analysis means 50 then performs syntactic analysis and examines the modification relations among the morphemes and the like (step S23).

【0040】その後、解析手段50はステップS22で
求めた各形態素間の接続コストに基づいて各行間の接続
コストを求める (ステップS24)。即ち、第i行の最
後の形態素と第(i+1)行の先頭の形態素との接続コ
ストを第i行と第(i+1)行との間の接続コストとす
る。但し、1つの形態素が第i行と第(i+1)行の2
行にわたる場合は、第i行と第(i+1)行との間の接
続コストを最も接続しやすいことを示すもの、即ち
「低」にする。
After that, the analysis means 50 obtains the connection cost between lines from the connection costs between morphemes obtained in step S22 (step S24). That is, the connection cost between the last morpheme of the i-th line and the first morpheme of the (i+1)-th line is taken as the connection cost between the i-th line and the (i+1)-th line. However, when a single morpheme spans the i-th and (i+1)-th lines, the connection cost between those lines is set to "low", the value indicating the most ready connection.
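Step S24 can be sketched in Python as follows. The `Morpheme` record with character offsets into the joined text, the cost table `PAIR_COST`, and the default of "medium" for unlisted pairs are all assumptions for illustration; the patent does not give concrete cost values. The sketch also assumes the morphemes tile the joined text without gaps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    surface: str
    pos: str
    start: int  # offset in the text formed by joining all lines
    end: int    # exclusive


# Hypothetical cost table for part-of-speech pairs; unlisted pairs are "medium".
PAIR_COST = {
    ("noun", "particle"): "low",
    ("particle", "verb"): "low",
    ("noun", "verb"): "high",
    ("auxiliary", "noun"): "high",
}


def line_costs(morphemes, line_lengths):
    """Return the connection cost for each boundary between adjacent lines."""
    costs, boundary = [], 0
    for length in line_lengths[:-1]:
        boundary += length
        if any(m.start < boundary < m.end for m in morphemes):
            costs.append("low")  # one morpheme spans both lines
            continue
        left = next(m for m in morphemes if m.end == boundary)
        right = next(m for m in morphemes if m.start == boundary)
        costs.append(PAIR_COST.get((left.pos, right.pos), "medium"))
    return costs


# Lines "売上が増" / "加した" / "費用", analysed as one joined string.
morphemes = [
    Morpheme("売上", "noun", 0, 2), Morpheme("が", "particle", 2, 3),
    Morpheme("増加", "noun", 3, 5), Morpheme("し", "verb", 5, 6),
    Morpheme("た", "auxiliary", 6, 7), Morpheme("費用", "noun", 7, 9),
]
print(line_costs(morphemes, [4, 3, 2]))
```

In the example, the first boundary cuts through the morpheme 増加, so its cost is forced to "low"; the second boundary falls between た and 費用 and takes its cost from the assumed table.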

【0041】各行間の接続コストを求めると、解析手段
50は構文解析結果に基づいて、ステップS25で第2
行目以降の各行の先頭に存在する形態素それぞれについ
て、それを修飾する形態素がそれよりも前にあるか否か
を示す修飾有無情報を生成する。
Having obtained the connection cost between lines, the analysis means 50 uses the syntactic analysis result in step S25 to generate, for the first morpheme of each line from the second line onward, modification presence/absence information indicating whether a morpheme that modifies it exists earlier in the text.

【0042】即ち、第i行の先頭の形態素を修飾する形
態素が第(i−1)行を含めてそれよりも前にある場合
は第(i−1)行と第i行との間の修飾有無情報を修飾
有りを示す「有」にし、ない場合は第(i−1)行と第
i行との間の修飾有無情報を修飾無しを示す「無」にす
る。但し、1つの形態素が第(i−1)行と第i行の2
行にわたる場合は、第(i−1)行と第i行との間の修
飾有無情報を、修飾の有無にかかわらず修飾有りを示す
「有」にする。
That is, when a morpheme that modifies the first morpheme of the i-th line exists in the (i-1)-th line or earlier, the modification presence/absence information between the (i-1)-th line and the i-th line is set to "present"; otherwise it is set to "absent". However, when a single morpheme spans the (i-1)-th and i-th lines, the information between those lines is set to "present" regardless of whether any modification exists.
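Step S25 can be sketched in Python as follows. The inputs are assumptions: morphemes are identified by their index in the unit, `modifies` is a set of (modifier index, modified index) pairs taken from some syntactic analysis, and `spanning` lists the line numbers whose boundary with the previous line is crossed by a single morpheme. "Present" and "absent" are represented as `True` and `False`.

```python
def modification_flags(first_index_of_line, modifies, spanning=frozenset()):
    """Return one flag per boundary between line k-1 and line k (k >= 1).

    first_index_of_line[k] is the index of the first morpheme of line k.
    A flag is True when an earlier morpheme modifies that first morpheme,
    or when one morpheme spans the boundary.
    """
    flags = []
    for k in range(1, len(first_index_of_line)):
        head = first_index_of_line[k]
        has_earlier_modifier = any(
            target == head and modifier < head for modifier, target in modifies
        )
        flags.append(k in spanning or has_earlier_modifier)
    return flags


# Lines "美しい" / "花" / "費用": morpheme 0 (美しい) modifies morpheme 1 (花).
print(modification_flags([0, 1, 2], {(0, 1)}))
```

Here the boundary before 花 is flagged because 美しい modifies it, while nothing earlier modifies 費用, so the second boundary is not flagged.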

【0043】ステップS24,S25で各行間の接続コ
スト及び修飾有無情報を求めると、解析手段50は各行
間の接続コスト及び修飾有無情報を判定手段60に渡す
(ステップS26)。
Having obtained the connection cost and the modification presence/absence information for each pair of adjacent lines in steps S24 and S25, the analysis means 50 passes them to the determination means 60 (step S26).

【0044】判定手段60は解析手段50から各行間の
接続コスト及び修飾有無情報が送られてくると、図4に
示す判定基準に従って各行間が連続するか否かを判定
し、判定結果を文分割/結合手段70に渡す。
When the connection cost and modification presence/absence information between lines arrive from the analysis means 50, the determination means 60 determines whether each pair of adjacent lines is continuous according to the criteria shown in FIG. 4, and passes the determination result to the sentence dividing/combining means 70.

【0045】即ち、解析手段50から渡された第i行と
第(i+1)行との間の接続コストが「低」であり、接
続コストが第i行と第(i+1)行とが接続しやすいこ
とを示している場合は、判定手段60は修飾有無情報の
「有」,「無」にかかわらず、第i行と第(i+1)行
とが連続すると判定する。また、解析手段50から渡さ
れた第i行と第(i+1)行との間の接続コストが
「中」,「高」である場合は、判定手段60は修飾有無
情報が「無」の場合は連続しないと判定し、「有」の場
合は連続すると判定する。
That is, when the connection cost between the i-th line and the (i+1)-th line passed from the analysis means 50 is "low", indicating that the two lines connect readily, the determination means 60 determines that the i-th and (i+1)-th lines are continuous regardless of whether the modification presence/absence information is "present" or "absent". When the connection cost is "medium" or "high", the determination means 60 determines that the lines are not continuous if the information is "absent" and that they are continuous if it is "present".
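The determination rule just described is small enough to write out directly. In this Python sketch the function name `lines_are_continuous` is hypothetical, the cost is one of the strings "low", "medium", or "high", and the modification information is a boolean (`True` for "present").

```python
def lines_are_continuous(cost, modified):
    """Apply the rule described for FIG. 4 to one pair of adjacent lines."""
    if cost not in ("low", "medium", "high"):
        raise ValueError(f"unknown connection cost: {cost!r}")
    if cost == "low":
        return True   # low cost: continuous regardless of modification
    return modified   # medium or high: continuous only when modification exists


for cost in ("low", "medium", "high"):
    for modified in (True, False):
        print(cost, modified, lines_are_continuous(cost, modified))
```

The loop prints all six combinations, which makes it easy to compare the sketch with the criteria in the text.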

【0046】文分割/結合手段70は判定手段60から
判定結果が渡されると、その判定結果に従ってテキスト
抽出手段40から渡されているテキストを分割または結
合し、出力手段80に出力する。また、文分割/結合手
段70で分割または結合したテキストを他の文書処理プ
ログラムの入力とすることもできる。
When the determination result is passed from the determination means 60, the sentence dividing/combining means 70 divides or combines the text passed from the text extraction means 40 according to that result and outputs it to the output means 80. The text divided or combined by the sentence dividing/combining means 70 can also be used as input to another document processing program.
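The dividing/combining step can be sketched in Python as follows. The function name `split_or_join` and the `joiner` parameter are assumptions; an empty joiner suits Japanese, while a space would suit English.

```python
def split_or_join(lines, continuous, joiner=""):
    """Merge each line into the previous sentence or start a new one.

    continuous[i] states whether line i and line i+1 were judged continuous.
    """
    sentences = [lines[0]] if lines else []
    for line, is_continuous in zip(lines[1:], continuous):
        if is_continuous:
            sentences[-1] += joiner + line  # combine with the running sentence
        else:
            sentences.append(line)          # cut here and start a new sentence
    return sentences


print(split_or_join(["売上が増", "加した", "費用"], [True, False]))
```

The resulting list of sentences is the kind of output that could be handed on to a translation program or any other downstream document processing step.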

【0047】[0047]

【発明の効果】以上説明したように、本発明は各行間の
接続コスト及び修飾の有無に基づいて表部分及びグラフ
部分のテキストの文切りを行なっているので、信頼性の
高い文切りを行なうことが可能になる効果がある。
[Effect of the Invention] As described above, the present invention cuts the text of table portions and graph portions into sentences based on the connection cost between lines and the presence or absence of modification, and therefore enables highly reliable sentence cutting.

【図面の簡単な説明】[Brief description of drawings]

【図1】本発明の実施例のブロック図である。FIG. 1 is a block diagram of an embodiment of the present invention.

【図2】テキスト抽出手段40の処理例を示す流れ図で
ある。
FIG. 2 is a flowchart showing a processing example of a text extracting means 40.

【図3】解析手段50の処理例を示す流れ図である。FIG. 3 is a flow chart showing a processing example of the analyzing means 50.

【図4】判定手段60の判定基準の一例を示す図であ
る。
FIG. 4 is a diagram showing an example of a determination criterion of a determination means 60.

【図5】原文テキストの一例を示す図である。FIG. 5 is a diagram showing an example of original text.

【符号の説明】[Explanation of symbols]

10…入力手段 20…記憶手段 30…レイアウト解析手段 40…テキスト抽出手段 50…解析手段 60…判定手段 70…文分割/結合手段 80…出力手段
10: input means; 20: storage means; 30: layout analysis means; 40: text extraction means; 50: analysis means; 60: determination means; 70: sentence dividing/combining means; 80: output means

Claims (2)

【特許請求の範囲】[Claims] 【請求項1】 原文テキスト中の表部分及びグラフ部分
を抽出するレイアウト解析手段と、 該レイアウト解析手段が抽出した表部分及びグラフ部分
に存在するテキストを、1文とみなせる単位毎に抽出す
るテキスト抽出手段と、 該テキスト抽出手段が抽出した1文とみなせる単位毎に
形態素解析及び構文解析を行ない、前記1文とみなせる
単位の各行間の接続コスト及び修飾の有無を求める解析
手段と、 該解析手段が求めた各行間の接続コスト及び修飾の有無
に基づいて前記1文とみなせる単位の各行が連続するか
否かを判定する判定手段と、 該判定手段の判定結果に基づいて前記1文とみなせる単
位を文切りする文分割/結合手段とを備えたことを特徴
とする文切り装置。
1. A sentence cutting device comprising: layout analysis means for extracting table portions and graph portions from an original text; text extraction means for extracting the text present in the table portions and graph portions extracted by the layout analysis means, one unit that can be regarded as a single sentence at a time; analysis means for performing morphological analysis and syntactic analysis on each unit that can be regarded as one sentence extracted by the text extraction means, and for obtaining the connection cost between the lines of the unit and the presence or absence of modification; determination means for determining, based on the connection cost between lines and the presence or absence of modification obtained by the analysis means, whether the lines of the unit are continuous; and sentence dividing/combining means for cutting the unit into sentences based on the determination result of the determination means.
【請求項2】 前記テキスト抽出手段は、表部分のテキ
ストについては各カラム内に存在するテキストをそれぞ
れ1文とみなせる単位とし、グラフ部分のテキストにつ
いては空白で囲まれているテキストをそれぞれ1文とみ
なせる単位とすることを特徴とする請求項1記載の文切
り装置。
2. The sentence cutting device according to claim 1, wherein the text extraction means treats, for text in a table portion, the text present in each column as a unit that can be regarded as one sentence, and, for text in a graph portion, each block of text surrounded by blanks as a unit that can be regarded as one sentence.
JP5096694A 1993-03-31 1993-03-31 Sentence cutting device Expired - Fee Related JPH07104865B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP5096694A JPH07104865B2 (en) 1993-03-31 1993-03-31 Sentence cutting device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP5096694A JPH07104865B2 (en) 1993-03-31 1993-03-31 Sentence cutting device

Publications (2)

Publication Number Publication Date
JPH06290209A JPH06290209A (en) 1994-10-18
JPH07104865B2 JPH07104865B2 (en) 1995-11-13

Family

ID=14171889

Family Applications (1)

Application Number Title Priority Date Filing Date
JP5096694A Expired - Fee Related JPH07104865B2 (en) 1993-03-31 1993-03-31 Sentence cutting device

Country Status (1)

Country Link
JP (1) JPH07104865B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009078183A1 (en) * 2007-12-19 2009-06-25 Nec Corporation Document segmentation system

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08212293A (en) * 1995-01-31 1996-08-20 Toshiba Corp Sgml tag giving processing system
JP2010067112A (en) * 2008-09-12 2010-03-25 Toshiba Corp Mechanical translation system and mechanical translation program
JP5647779B2 (en) * 2009-10-05 2015-01-07 新日鉄住金ソリューションズ株式会社 Information processing apparatus, information processing method, and program

Also Published As

Publication number Publication date
JPH06290209A (en) 1994-10-18

Similar Documents

Publication Publication Date Title
EP0952533B1 (en) Text summarization using part-of-speech
KR100912502B1 (en) Machine translation method for PDF file
US7756871B2 (en) Article extraction
US4962452A (en) Language translator which automatically recognizes, analyzes, translates and reinserts comments in a sentence
JPH07282063A (en) Machine translation device
JPH07325828A (en) Grammar checking system
JP2000194699A (en) Translation support device and method and computer readable recording medium
JP2008108209A (en) Technique for enhancing precision of machine translation
JP2765665B2 (en) Translation device for documents with typographical information
JPH077410B2 (en) Document layout method
US20010029443A1 (en) Machine translation system, machine translation method, and storage medium storing program for executing machine translation method
JPH07104865B2 (en) Sentence cutting device
JPH0412505B2 (en)
JP3876014B2 (en) Machine translation device
US5640581A (en) CD-ROM information editing apparatus
JPH0883280A (en) Document processor
JP3377942B2 (en) Electronic dictionary search device and computer-readable storage medium storing electronic dictionary search device control program
JPH052605A (en) Machine translation system
JP3131432B2 (en) Machine translation method and machine translation device
JP3244286B2 (en) Translation processing device
JPH10293811A (en) Document recognition device and method, and program storage medium
JP2608384B2 (en) Machine translation apparatus and method
JPS61272873A (en) System for correction and expression of text
KR19990001034A (en) Sentence Extraction Method Using Context Information and Local Document Type
JP2924955B2 (en) Translation method and translation device

Legal Events

Date Code Title Description
FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20071113

Year of fee payment: 12

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20081113

Year of fee payment: 13

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20091113

Year of fee payment: 14

LAPS Cancellation because of no payment of annual fees