JPH07219951A

JPH07219951A - Specific range extracting device and sentence extracting device

Info

Publication number: JPH07219951A
Application number: JP6008260A
Authority: JP
Inventors: Tadashi Nagano; 正永野; Hideko Kurita; 秀子栗田; Takao Fukushige; 貴雄福重; Masanori Takahashi; 雅則高橋
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1994-01-28
Filing date: 1994-01-28
Publication date: 1995-08-18
Anticipated expiration: 2020-02-02
Also published as: JP3616126B2

Abstract

PURPOSE: To determine with high accuracy a specific range consisting of quotations, parentheses, etc., by referring to the ranges already stored in a set range storing part and setting up specific ranges recursively. CONSTITUTION: A specific range setting part 2 repeatedly sets up, in the input text supplied from an input part 1, a range satisfying a specific condition based on the definitions stored in a specific range definition storing part 7, until no area satisfying the condition remains. A set range storing part 5 registers each range as it is set, and in succeeding processing the already-set areas are referred to when judging whether the condition is satisfied. When no area satisfying the condition remains, the setting part 2 outputs all areas set up so far to an output part 3. A sentence delimitation setting part 6 sets up sentence delimitations in the text based on the specific ranges extracted by the specific range extracting part and the information stored in a sentence end information storing part 8.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a special range extraction device that sets and extracts special ranges consisting of quotations, parentheses, dialogue, and the like in document processing and natural language processing, and to a sentence extraction device that identifies sentence boundaries in electronic text data containing a plurality of sentences.

[0002]

[Prior Art] Conventionally, special regions such as quotations, parentheses, and dialogue in natural language text data have been determined by, for example, scanning from the beginning of the text and toggling between inside-quotation and outside-quotation each time a quotation mark appears.

[0003] When text is divided into individual sentences for natural language processing, methods have been used such as unconditionally treating every place where a terminal symbol such as a period or question mark appears as a boundary. As a related technique, language compilers and the like determine the correspondence between parentheses and quotation marks by applying rules written in a form such as BNF notation to the text data.

[0004]

[Problems to Be Solved by the Invention] The method of toggling between inside and outside a quotation range at each quotation mark ignores the possibility that a quotation mark is itself enclosed in other quotation marks or parentheses, and therefore sometimes extracts a wrong range. Consequently, the many kinds of quotations and parentheses that appear in natural language could not be extracted with high accuracy. Moreover, when extracting quoted portions enclosed in quotation marks, once a mark is mistaken for an opening symbol when it is a closing symbol (or the reverse), that error often causes the judgment of other quoted portions to go wrong as well.
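
The failure mode described above can be reproduced in a few lines. The following is only an illustrative sketch of the conventional toggling approach, not code from the patent; the function name toggle_quote_ranges and the use of 0-based character offsets are assumptions made for the example.

```python
def toggle_quote_ranges(text, quote='"'):
    """Naive baseline: every quotation mark flips the inside/outside state."""
    ranges = []
    start = None                      # offset of the currently open mark, if any
    for pos, ch in enumerate(text):
        if ch != quote:
            continue
        if start is None:
            start = pos               # treated as an opening mark
        else:
            ranges.append((start, pos))   # treated as a closing mark
            start = None
    return ranges


text = '"Did you say, "What a pity"?", the man said.'
for s, e in toggle_quote_ranges(text):
    print(text[s:e + 1])
```

On this nested example the toggle pairs the first mark with the second and the third with the fourth, so it reports 「"Did you say, "」 and 「"?"」 and never sees the inner quotation as a unit.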

[0005] On the other hand, the method used in computer-language compilers and the like, which determines the correspondence of parentheses and similar symbols by grammar rules, restricts the form of its input and therefore could not be applied to arbitrary natural language text.

[0006] In sentence extraction, if quotations are not taken into account, a sentence that contains a quoted sentence as a part of it is split into several sentences. In particular, when the quoted region contains several sentences, the boundaries between those sentences look no different from ordinary sentence boundaries in their surrounding context, so they are recognized as boundaries, and the whole sentence including the quotation could not be recognized as a single sentence.

[0007] Even if one tries to avoid this by recognizing quotation and parenthesis ranges with the conventional methods, the quotation range is often wrong because of the problems in quotation range extraction described above, and as a result sentences could not be cut out with high accuracy.

[0008] FIG. 9 shows an example of sentence cutting applied to a passage by the conventional method. The rule used here is: "set a sentence boundary only when one of 「.」, 「?」, or 「!」 appears and the next character after the intervening blank is an uppercase letter or a quotation mark." The sentences cut out as a result are shown in FIG. 10.
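
The conventional rule quoted above can be approximated with a regular expression. This is a hedged sketch rather than the procedure behind FIG. 9: the pattern is one possible reading of the rule, and the name naive_split and the sample sentence are assumptions introduced for illustration.

```python
import re

# One reading of the conventional rule: a terminator, then blank(s),
# then an uppercase letter or a quotation mark.
_BOUNDARY = re.compile(r'(?<=[.?!])\s+(?=[A-Z"\'`])')


def naive_split(text):
    """Split text at every position matching the conventional rule."""
    return _BOUNDARY.split(text)


sample = 'He said, "It is very easy. Come on!" She left.'
for sentence in naive_split(sample):
    print(sentence)
```

Because the rule knows nothing about quotation ranges, the period inside the quotation is accepted as a boundary and each resulting piece carries only one of the two quotation marks.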

[0009] Of these, sentences 1, 2, 3, 5, 7, and 8 are cut out correctly, but sentences 4, 6, and 11 are cut out in a form unnatural for a sentence, with only one of a pair of quotation marks attached. Sentence 9 is cut off in the middle of a quotation. Unnaturally formed "sentences" such as sentences 4, 6, 9, and 11 are a serious obstacle when a computer tries to parse them for processing such as information extraction.

[0010]

[Means for Solving the Problems] To solve the above problems, the special range extraction device according to the present invention, when determining the ranges of quotations and parentheses, first sets as quotation/parenthesis ranges only those, of whatever kind, that have no enclosing quotation mark or parenthesis symbol between their marks. Thereafter it determines the outer quotations and parentheses so that they do not conflict with the quotation/parenthesis ranges already set. In this way quotation/parenthesis ranges can be determined with high accuracy. This is realized by storing each range whenever it is set and making it accessible to subsequent processing.

[0011] The sentence extraction device uses the quotation and parenthesis ranges extracted by the above means and prohibits placing a sentence boundary in the middle of such a range. It thereby extracts with high accuracy, without cutting it midway, even a sentence that contains another sentence enclosed in quotation marks, such as dialogue.

[0012]

[Operation] The special range setting unit repeatedly sets ranges in the input text that satisfy specific conditions until no region satisfying the conditions remains. The conditions are constructed so that they can state whether such a range has already been set at a given position in the text. Each time the special range setting unit sets a range, it registers the range in the set range storage unit, and in subsequent processing it refers to the already-set regions when judging whether the conditions are satisfied. When no region satisfying the conditions remains, the special range setting unit outputs all the regions set up to that point.

[0013] In the sentence extraction device, the output of the special range setting unit is sent to the sentence boundary setting unit. The sentence boundary setting unit excludes positions in the middle of the ranges set by the parenthesis delimiter range setting unit, takes only the sentence-final strings in the remaining ranges as boundary candidates, checks the other conditions, and sets the sentence boundaries.

[0014]

[Embodiment] FIG. 1 shows an embodiment of the present invention. In this embodiment the target language is English. In FIG. 1, 1 is an input unit that inputs text; 7 is a special range definition storage unit in which special ranges are defined recursively by conditions using specific symbols; 2 is a special range setting unit that estimates special ranges, such as quotations and parentheses in the input text, based on the definitions in 7; 3 is an output unit that outputs the extracted ranges of quotations, parentheses, and the like, and the sentences; 4 is an SQ exclusion list that stores patterns in which a single quote is used for a purpose other than enclosing a quotation, as in 「it's」; and 5 is a set range storage unit that stores the quotation and parenthesis ranges that have been set.

[0015] 8 is a sentence end information storage unit that stores the characters or character strings that can end a sentence, and 6 is a sentence boundary setting unit that sets sentence boundaries from the set quotation and parenthesis ranges and the sentence end information stored in the sentence end information storage unit 8. Operation using these is described later; in the embodiment described here, the output of the special range estimation unit is output as it is, and the device is used as a special range extraction device.

[0016] The input is ordinary English text. The special range setting unit 2 estimates and sets the ranges of parentheses and quotations in the input English text using the algorithm described later. A range set in this way is hereinafter called a special range. The final result is represented as a set of pairs of numbers, each giving a position measured from the beginning of the file, and is sent to the output unit 3 together with the input text. The output unit 3 displays the quoted portions cut out on the basis of this information.

[0017] For convenience in the following description, several sets of characters are defined; these are shown in Table 1. The conditions for judging special ranges, stored in the special range definition storage unit 7, are shown in Table 2.

[0018]

[Table 1]

[0019]

[Table 2]

[0020] In Table 2, a character that is a space, a tab, or a newline is said to be "blank", and any other character is said to be "not blank". FIG. 2 is a flowchart of the operation of the special range setting unit. It is an algorithm that sets only those special ranges that satisfy the conditions of Table 2.

[0021] The operation of the parenthesis/quotation range estimation unit is described with reference to FIG. 2. Basically, for the input electronic text file, the unit searches for ranges that match the six judgment conditions stored in the special range definition storage unit 7 as definitions 1 to 6 and registers any such range as a special range; this is repeated until no range satisfying the conditions remains.

[0022] The array variable stat[] indicates whether each position in the file is inside a special range: stat[i] = 1 if position i is inside a special range, and stat[i] = 0 otherwise. stat[] is the storage area corresponding to the set range storage unit 5 in FIG. 1. Initially, 0 is assigned to every element of stat[]. The two-dimensional array variable ans[][] is used to register special ranges. For the X-th special range set, its start position is registered in ans[X][1] and its end position in ans[X][2]. An unregistered entry is defined to be -1, so -1 is first assigned to every element of ans. In addition, 0 is assigned to ans_max, the pointer for ans[][].

[0023] The next subroutine, "initial setting", initializes the variables used inside the subroutines that set each kind of special range; the processing for 「""」 is described later as an example.

[0024] The variable change is a flag indicating whether any setting was made during one pass of the loop in FIG. 2. It is initialized to 0 (no setting made) at the start of processing. The order does not matter, but here processing starts with the portions enclosed in 「""」 and then proceeds in turn through the various quotation and parenthesis symbols.

[0025] The concrete operation inside these subroutines is described later; each is a subroutine that extracts ranges satisfying the corresponding condition of Table 2 at that point in time. A given call may or may not result in a range being set. Each subroutine sets the variable change to 1 only when it has set a range.

[0026] After setting has been attempted for every kind of special range, if change is 0, no further special range can be set, so processing ends. If change is 1, at least one setting was made, so change is reset to 0 and processing starts again. This is repeated until a pass of the loop ends with change still 0.

[0027] Because judgment conditions 1 to 6 are written recursively, a region that did not satisfy a condition before may come to satisfy it as settings are made.

[0028] As an example, consider condition 1 (definition 1) for input text containing the portion 「"Did you say, "What a pity"?", the man said.」. When the text is input, no special range has been set, so the portion 「"Did you say, "What a pity"?"」 does not satisfy item [4] of condition 1 and hence does not satisfy condition 1 as a whole. The portion 「"What a pity"」, however, does satisfy condition 1, so that range is set as a special range. Once it has been set, every 「"」 lying between the two outer 「"」 marks of 「"Did you say, "What a pity"?"」 is contained in the "other special range" 「"What a pity"」; item [4] of condition 1 is now satisfied, and 「"Did you say, "What a pity"?"」 is set as a special range.

[0029] Next, the operation of each subroutine in FIG. 2 is described. FIG. 3 is a flowchart of the "initial setting" for the processing "setting of special ranges enclosed in 「""」", and FIG. 4 is a flowchart of the body of the subroutine "setting of special ranges enclosed in 「""」". The operation is described using these as examples. The other subroutines and their initialization can be realized in almost the same way.

[0030] In FIG. 3, as initial setting, the position of every 「"」 appearing in the text, expressed as the number of characters from the beginning of the text, is assigned to the array dq[] in order from the beginning. The number of 「"」 marks that appeared is assigned to the variable dq_max.
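
The initialization of FIG. 3 amounts to a single scan of the text. A minimal sketch follows, assuming 0-based offsets (the flowchart itself is not reproduced here, and the function name collect_quote_positions is hypothetical); dq and dq_max correspond to the variables named in the description.

```python
def collect_quote_positions(text, mark='"'):
    """Return (dq, dq_max): the offsets of every mark, in order, and their count."""
    dq = [pos for pos, ch in enumerate(text) if ch == mark]
    return dq, len(dq)


text = '"Did you say, "What a pity"?", the man said.'
dq, dq_max = collect_quote_positions(text)
print(dq, dq_max)   # [0, 14, 26, 28] 4
```

The same helper would serve the other mark types by passing a different mark argument.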

[0031] Next, the operation of the subroutine "setting of special ranges enclosed in 「""」" is described with reference to FIG. 4. Steps 4-1, 4-2, 4-3, and 4-4 search for a pair of 「"」 marks between which there is no 「"」 mark lying at a position that is not inside a special range. If such a pair exists, their positions have been assigned to dq[i] and dq[j] by the time step 4-5 begins.
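
The pair search of steps 4-1 to 4-4 can be expressed compactly if the marks that already lie inside a special range are filtered out first. The sketch below is an assumption-laden reading, not the flowchart itself: it takes the two marks of a pair to be outside every set range as well, and the function name candidate_pairs is hypothetical.

```python
def candidate_pairs(dq, stat):
    """Pairs of marks with no mark outside a special range between them.

    dq   -- sorted offsets of every mark in the text
    stat -- stat[p] == 1 when offset p is already inside a special range
    """
    free = [p for p in dq if stat[p] == 0]   # marks not yet covered by a range
    return list(zip(free, free[1:]))         # neighbours among the free marks


text = '"Did you say, "What a pity"?", the man said.'
dq = [p for p, ch in enumerate(text) if ch == '"']
stat = [0] * len(text)
print(candidate_pairs(dq, stat))

for p in range(14, 27):                      # register "What a pity" (offsets 14..26)
    stat[p] = 1
print(candidate_pairs(dq, stat))
```

Before any range is set, only neighbouring marks are candidates; once the inner quotation is registered, the two outer marks become neighbours and form the only remaining candidate pair.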

[0032] Step 4-5 checks whether the pair of 「"」 marks satisfies definition 1 of Table 2. This is easily done by examining the characters before and after the marks in the input text.

[0033] Although this does not arise for 「"」, when setting a special range enclosed in 「'」 or 「`」, processing is needed at this point to judge whether the surrounding text matches a pattern registered in the SQ exclusion list 4 of FIG. 1.

[0034] The patterns in the SQ exclusion list are written as regular expressions or the like, and conventional string pattern matching techniques can be used to match them against the surrounding character string.

[0035] If the conditions are satisfied as a result of step 4-5, dq[i] and dq[j] are registered in ans as a special range, and the elements of stat[] from the dq[i]-th to the dq[j]-th are all set to 1 ("inside a special range"). Processing then continues by searching for other pairs of 「"」 marks, and the subroutine ends when every pair of 「"」 marks with no 「"」 mark outside a special range between them has been processed. Each subroutine is realized in this way.

[0036] Viewed as a whole, the processing of FIG. 2 at first registers as special ranges only those ranges with no quotation or parenthesis symbol of the same kind in between; where quotations or parentheses are nested, the outer ones are set as special ranges after the inner ones have been set. In this way, even when many kinds of quotations and parentheses appear together and are nested, ranges can always be set out to the outermost one.
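
The inside-out behaviour can be seen in a small end-to-end sketch for the 「""」 case. Table 2 is not reproduced in this text, so satisfies_assumed_rule below is a stand-in built on guessed whitespace conditions, not definition 1 itself; the 0-based offsets, the list used in place of the ans[][] array, and all function names are likewise assumptions.

```python
def is_blank(ch):
    """Blank in the sense used for Table 2: space, tab, or newline."""
    return ch in " \t\n"


def satisfies_assumed_rule(text, i, j):
    """Stand-in for definition 1; these checks are assumptions, not Table 2."""
    before_open = text[i - 1] if i > 0 else " "
    after_close = text[j + 1] if j + 1 < len(text) else " "
    return (
        j - i > 1
        and is_blank(before_open)        # opening mark follows a blank
        and not is_blank(text[i + 1])    # and is attached to the quoted text
        and not is_blank(text[j - 1])    # closing mark is attached to the text
        and not after_close.isalnum()    # and is not followed by a letter or digit
    )


def extract_special_ranges(text, mark='"'):
    stat = [0] * len(text)     # 1 where the offset is inside a special range
    ans = []                   # registered (start, end) offsets, inclusive
    dq = [p for p, ch in enumerate(text) if ch == mark]
    change = True
    while change:              # repeat until a full pass sets nothing
        change = False
        free = [p for p in dq if stat[p] == 0]
        for i, j in zip(free, free[1:]):
            if stat[i] or stat[j]:
                continue       # an endpoint was consumed earlier in this pass
            if satisfies_assumed_rule(text, i, j):
                ans.append((i, j))
                stat[i:j + 1] = [1] * (j - i + 1)
                change = True
    return ans, stat


text = '"Did you say, "What a pity"?", the man said.'
ranges, _ = extract_special_ranges(text)
for s, e in ranges:
    print(text[s:e + 1])
```

Under these assumed conditions the inner quotation is registered in the first pass and the outer one in the second, and a stray opening mark with no partner never forms a range.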

[0037] The set range data determined as above is sent to the output unit in the form of positions in the file. The output unit displays each set quoted portion on the basis of these numbers.

[0038] Next, the overall operation is described using the earlier example 「"Did you say, "What a pity"?", the man said.」.

[0039] When no special range has yet been set, the only special range that can be set is 「"What a pity"」. This is because 「"Did you say, "」 violates item [6] of definition 1 in Table 2, 「"?"」 violates item [5] of definition 1, and 「"Did you say, "What a pity"」, 「"What a pity"?"」, and 「"Did you say, "What a pity"?"」 violate item [4] of definition 1. (The last three not only violate the definition but, when the algorithm of FIG. 4 is used, are never selected as a dq[i], dq[j] pair in the first place.) Once the range 「"What a pity"」 has been set as a special range, however, 「"Did you say, "What a pity"?"」 no longer violates item [4] of definition 1 in Table 2.

[0040] Accordingly, 「"Did you say, "What a pity"?"」 is set as a special range. 「"Did you say, "What a pity"」 and 「"What a pity"?"」 now violate item [2] of definition 1 in Table 2 and so are not cut out. (The 「pity"」 part is already the end position of another special range, and the 「"What」 part is already the start position of another special range. These two not only violate the definition but, when the algorithm of FIG. 4 is used, are never selected as a dq[i], dq[j] pair in the first place.)

[0041] In this way, wrong quotation ranges such as 「"Did you say, "」, 「"?"」, and 「"What a pity"?"」 are not cut out, and the two ranges 「"What a pity"」 and 「"Did you say, "What a pity"?"」 are cut out as the writer intended.
Wrong citations such as ""? "And" What a pity "?" Are not cut out, but "" What a pity "" and "Did you say," What a pity "?"" Processing is realized so that the two ranges are cut out as intended by the writer.

[0042] Expressions combining two or more kinds of quotations and parentheses, such as 「"Did you say, 'What a pity'?", the 'Visitor' said.」, are also determined from the inside outward, so they can be set without any problem.

[0043] Next, operation as a sentence extraction device is described. In FIG. 1, 6 is the sentence boundary setting unit that sets sentence boundaries. The <compound sentence-final string> and <sentence-final string> shown in Table 1 are assumed to be stored in 7 as sentence-final expressions. In this embodiment, the special range information estimated by the special range setting unit is sent to the sentence boundary setting unit 6.

[0044] FIG. 5 is a flowchart explaining the operation of the sentence boundary setting unit. In FIG. 5, "position" is a number indicating how many characters from the beginning of the input file the current character is; its initial value is 1 (the beginning of the file). First, it is checked whether the position is the end of a "sentence-final string" of Table 1 (5-2 in the figure). If it is not, no boundary is set and processing moves to the next position. If it is, it is further judged whether the position is inside a special range; if it is not inside a special range, processing proceeds to step 5-5.

[0045] If the position is inside a special range, processing moves to the next position without setting a boundary. In English, however, when a quotation or parenthesis comes at the end of a sentence, the period is written inside the quotation range, as in 「…."」, so in that case processing exceptionally proceeds to step 5-5 (5-4 in the figure). Step 5-5 checks whether the first alphabetic character appearing after the position is uppercase or lowercase. If it is lowercase, the string is considered to be an abbreviation such as 「U.S.」 appearing within a sentence, so no boundary is set and processing moves to the next position. If it is uppercase, that point (the end of the sentence-final string) is set as a boundary between sentences, and processing moves to the next position.
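
The flow of FIG. 5 can be sketched as follows, with several stated assumptions because Table 1 and the exact test at step 5-4 are not reproduced in this text: sentence-final strings are taken to be a single 「.」, 「?」, or 「!」 followed by a blank or the end of the text, the exception at step 5-4 is taken to apply when the terminator sits immediately before the closing mark of an outermost special range, and offsets are 0-based. The function name split_sentences is hypothetical.

```python
def split_sentences(text, ranges, terminators=".?!"):
    """Cut text into sentences, never cutting in the middle of a special range.

    ranges -- inclusive (start, end) offsets of special ranges; nesting allowed
    """
    stat = [0] * len(text)
    for s, e in ranges:
        stat[s:e + 1] = [1] * (e - s + 1)
    # End offsets of ranges that are not nested inside another range.
    outer_ends = {e for s, e in ranges
                  if not any(s2 < s and e < e2 for s2, e2 in ranges)}
    cuts = []
    for pos, ch in enumerate(text):
        if ch not in terminators:                 # cf. step 5-2
            continue
        end = pos
        if stat[pos]:                             # inside a special range
            if pos + 1 not in outer_ends:         # cf. step 5-4 (assumed test)
                continue
            end = pos + 1                         # keep the closing mark
        if end + 1 < len(text) and not text[end + 1].isspace():
            continue                              # assumed: a blank must follow
        following = next((c for c in text[end + 1:] if c.isalpha()), None)
        if following is None or following.isupper():   # cf. step 5-5
            cuts.append(end + 1)
    sentences, start = [], 0
    for cut in cuts:
        sentences.append(text[start:cut].strip())
        start = cut
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences


text = 'It was strange. "Did you say, "What a pity"?", the man said. She couldn\'t answer.'
for sentence in split_sentences(text, [(30, 42), (16, 44)]):
    print(sentence)
```

The special ranges are passed in as plain offset pairs, so any extractor that produces them can be combined with this step.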

[0046] In this way, even a sentence containing dialogue or the like that consists of several sentences can be extracted whole, without being split in the middle of the dialogue.

[0047] Next, sentence cutting is described using a simple example. Suppose the text contains the portion 「It was strange. "Did you say, "What a pity"?", the man said. She couldn't answer.」, which includes the earlier example 「"Did you say, "What a pity"?", the man said.」. In ordinary sentence cutting, the 「?」 in 「pity"?"」 is a symbol marking the end of a sentence, so the sentence is split there and "sentences" such as 「"Did you say, "What a pity"?」 and 「", the man said.」 are extracted. With this embodiment, by contrast, the portion 「"Did you say, "What a pity"?"」 is set as a special range, so the 「?」 is not treated as a boundary even though it is a symbol marking the end of a sentence. (It is excluded at step 5-4 of FIG. 5 because it is not "the only one".) Accordingly, the three sentences 「It was strange.」, 「"Did you say, "What a pity"?", the man said.」, and 「She couldn't answer.」 are extracted correctly.

[0048] Likewise, when a quotation or dialogue consists of several sentences, as in 「He said, "It is very easy. Come on!"」, the quoted portion is a special range, so the whole sentence including the quotation can be extracted without being split midway.

[0049] Furthermore, when quotations and parentheses are set by the above method, no range is set if only one of the opening and closing symbols is present, for example because of a mistake by the writer. This has the additional effect of preventing erroneous or exceptional symbols from badly disturbing the extraction of sentences. For example, even if there is a mistakenly entered 「"」 mark, as long as no 「"」 mark corresponding to it appears, the sentences after it are cut out normally. Suppose the form 「This is a "pen. ...(long text consisting of many sentences).... He said, "I want to go there."」 appears, and the 「"」 mark before pen was entered by mistake. Conventionally, trying to prevent sentences from being split inside a quotation caused the whole portion 「"pen. ...(long text consisting of many sentences).... He said, "」 to be joined together.

[0050] With the present invention, by contrast, no 「"」 mark corresponding to the 「"」 before pen appears, so no special range starting there is set: 「"pen. ...(long text consisting of many sentences).... He said, "」 violates item [6] of definition 1 in Table 2. Only the special range 「"I want to go there."」 is set.

[0051] Consequently, the portion 「(long text consisting of many sentences)」 undergoes ordinary sentence cutting, and erroneous extraction is prevented.

[0052] Finally, the operation on a complete passage is shown. FIG. 9 shows an example in which delimiter positions were set for a series of sentences by the conventional method described above, and FIG. 10 shows the sentences cut out at those positions. In FIG. 10, only fragments of sentences containing quotations are cut out, as in sentences 4, 6, 9, and 11. FIG. 6 shows the result of special range setting on the same example text. In FIG. 6, the quotation ranges are delimited correctly by taking into account, among other things, the whitespace before and after the quotations. FIG. 7 shows sentence breaks set using that result. FIG. 7 differs from FIG. 9 in that no breaks are set inside the special ranges. FIG. 8 shows the sentences cut out using the delimiter positions of FIG. 6. In FIG. 8, every sentence is extracted without being split in the middle of a quotation.
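The difference between FIG. 7 and FIG. 9 (no breaks inside special ranges) can be sketched as follows. This is an illustrative simplification, not the flow of FIG. 5: the function name `split_sentences`, the ender set `.!?`, and the rule that a range closing right after an ender also ends a sentence are all assumptions made for this example.

```python
def split_sentences(text, special_ranges, enders=".!?"):
    """Cut text into sentences, never breaking inside a special range.

    special_ranges: list of (start, end) index pairs, end exclusive.
    """
    # Index of the closing symbol of each special range.
    closing = {end - 1 for _, end in special_ranges}

    def inside(i):
        return any(start <= i < end for start, end in special_ranges)

    sentences, begin = [], 0
    for i, ch in enumerate(text):
        # Only positions followed by whitespace or end of text can be breaks.
        at_gap = i + 1 == len(text) or text[i + 1].isspace()
        if not at_gap:
            continue
        plain_end = ch in enders and not inside(i)
        # Assumed rule: a range whose last inner character is an ender
        # closes the sentence together with the range.
        quoted_end = i in closing and i > 0 and text[i - 1] in enders
        if plain_end or quoted_end:
            sentences.append(text[begin:i + 1].strip())
            begin = i + 1
    tail = text[begin:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = 'He said, "Stop. I want to go there." She nodded.'
quote = (text.index('"'), text.rindex('"') + 1)
print(split_sentences(text, [quote]))  # quotation kept whole
print(split_sentences(text, []))       # conventional behavior: quotation split
```

Passing an empty range list reproduces the kind of fragment seen in FIG. 10, where a sentence is cut in the middle of a quotation.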

[0053]

[Effects of the Invention] As described above, the present invention provides means for extracting multiple quotations, dialogue, and parentheses with high accuracy even when they are mixed together. It also provides means for accurately dividing a document containing such expressions into individual sentences. Moreover, this processing runs at high speed because it does not require heavy processing such as dictionary lookups to determine word meanings. The extraction of quoted portions and sentences can be used for document editing and as preprocessing for practical natural language processing by computer.
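The repeated, storage-referencing setting of mixed and nested ranges can be sketched in Python as below. This is a hedged illustration only: the symbol table `PAIRS`, the function name `extract_special_ranges`, and the acceptance rule (a pair is accepted once every delimiter between its symbols already lies inside a stored range) are assumptions standing in for the stored definitions, which are not reproduced here.

```python
PAIRS = {"(": ")", "[": "]"}  # hypothetical start -> end symbols


def extract_special_ranges(text):
    """Set ranges pass by pass until a pass adds nothing new."""
    stored = []  # plays the role of the set range storage
    delimiters = set(PAIRS) | set(PAIRS.values())

    def covered(i):
        # True if position i lies inside a range that is already stored.
        return any(s <= i < e for s, e in stored)

    while True:
        added = False
        for start, ch in enumerate(text):
            if ch not in PAIRS or covered(start):
                continue
            for j in range(start + 1, len(text)):
                if covered(j):
                    continue  # symbols inside stored ranges are already resolved
                if text[j] == PAIRS[ch]:
                    stored.append((start, j + 1))
                    added = True
                    break
                if text[j] in delimiters:
                    break  # unresolved symbol in between; retry in a later pass
        if not added:
            return sorted(stored)


text = "a (b [c] d) e (f"
print([text[s:e] for s, e in extract_special_ranges(text)])
```

In this example the inner 「[c]」 is set in the first pass, the enclosing parenthesized range in the second, and the unmatched 「(」 near the end never produces a range. No dictionary lookup is involved, only symbol and position checks.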

[Brief description of drawings]

FIG. 1 is a conceptual diagram showing the configuration of a sentence extraction device according to an embodiment of the present invention.

FIG. 2 is a flowchart showing the operation of the special range setting unit of the present invention.

FIG. 3 is a flowchart showing the initialization routine of special range setting according to definition 1 in an embodiment of the present invention.

FIG. 4 is a flowchart showing the operation of special range setting processing according to definition 1 in an embodiment of the present invention.

FIG. 5 is a flowchart showing the operation of the sentence break setting unit according to an embodiment of the present invention.

FIG. 6 is a diagram showing a special range setting result according to an embodiment of the present invention.

FIG. 7 is a diagram showing a sentence extraction processing result according to an embodiment of the present invention.

FIG. 8 is a diagram showing a sentence cutting result according to an embodiment of the present invention.

FIG. 9 is a diagram showing input electronic text data.

FIG. 10 is a diagram showing the result of processing FIG. 9 by the conventional technique.

[Explanation of symbols]

1 Input unit
2 Special range setting unit
3 Output unit
4 SQ exclusion list
5 Set range storage unit
6 Sentence break setting unit
7 Special range definition storage unit
8 Sentence end information storage unit

Continuation of the front page: (72) Inventor Masanori Takahashi, c/o Matsushita Electric Industrial Co., Ltd., 1006 Oaza Kadoma, Kadoma City, Osaka Prefecture

Claims

1. A special range extraction device comprising: a special range definition storage unit that recursively defines a special range; a special range setting unit that sets a special range in electronic text data based on the definition stored in the special range definition storage unit; and a special range storage unit that stores the special range set by the special range setting unit, wherein the special range setting unit sets special ranges recursively, based on the definition stored in the special range definition storage unit and with reference to the special ranges stored in the special range storage unit.

2. A sentence extraction device comprising: the special range extraction device according to claim 1; a sentence end information storage unit that stores characters or character strings that can represent the end of a sentence; and a sentence break setting unit that sets sentence breaks in electronic text data based on the special ranges extracted by the special range extraction device and the information stored in the sentence end information storage unit.