JPH04188364A

JPH04188364A - Device for extracting intrinsic wording of Japanese sentence

Info

Publication number: JPH04188364A
Application number: JP2319150A
Authority: JP
Inventors: Masanobu Higashida; 正信東田; Masahiro Oku; 雅博奥; Fumiko Fukunaga; 福永　文子
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1990-11-22
Filing date: 1990-11-22
Publication date: 1992-07-06
Anticipated expiration: 2011-07-31
Also published as: JP2520195B2

Abstract

PURPOSE: To extract intrinsic wording (unique terms) that contains hiragana (the cursive Japanese syllabary) by extracting, as candidates, terms that have "continuative form of a verb + inflectional ending" as a constituent element, and then narrowing these candidates down in two stages. CONSTITUTION: Hiragana that can serve as the continuative-form ending of a verb are defined as a separate character type, distinct from other hiragana. The character string of the target document is expanded into a sequence of character type codes, and from that sequence the hiragana-mixed unique term candidate extraction unit 4 extracts, as candidates, the terms that have "continuative form of a verb + inflectional ending" as a constituent element. In the first stage, the hiragana-mixed unique term candidate narrowing unit A 6 narrows the candidates according to narrowing rules concerning character string length and combinations of character types. In the second stage, the hiragana-mixed unique term candidate narrowing unit B 8 refers to the general word dictionary 9 and deletes from the remaining candidates any word already registered there. Hiragana-mixed unique terms can thus be extracted automatically.

Description

[Detailed Description of the Invention]

[Field of Industrial Application]

The present invention relates to a Japanese unique term extraction device. From a Japanese document, the device automatically extracts proper nouns used in that document, such as product names, company names, and personal names. It also automatically extracts words that, even if they are combinations of general words, are new words or are considered to be used only in that document. Together with proper nouns, these words are called unique terms.

[Conventional technology]

Techniques for extracting unique terms include Japanese Patent Application No. Sho 62-238385 and Japanese Patent Application No. Sho 62-197843, filed by the same applicant and others. In this method, the characters used in Japanese are classified into about ten character types, such as "kanji numerals", "kanji", "hiragana", and "katakana".

Words enclosed between an opening delimiter 「 and a closing delimiter 」, and words sandwiched between hiragana and hiragana, are extracted as unique term candidates. The candidates are narrowed down by applying narrowing rules concerning combinations of character types, and then narrowed down again according to whether they are registered in a general word dictionary. The words that remain are extracted as unique terms.

[Problem to be solved by the invention]

With the conventional technology it was possible to extract unique terms composed of kanji, katakana, alphabetic characters, and the like. Its drawback was that unique terms containing hiragana, such as 「手書き文字」 ("handwritten characters") and 「組み込み関数」 ("built-in function"), could not be extracted.

The present invention aims to overcome the above problem by exploiting the structural features of hiragana-mixed unique terms, and to extract hiragana-mixed unique terms automatically.

[Means for solving the problem]

Among the unique terms contained in a Japanese document, some are composed of kanji, katakana, alphabetic characters, or combinations of these, and can be extracted from the target Japanese document by the methods of Japanese Patent Application No. Sho 62-197843 and Japanese Patent Application No. Sho 62-238385. The present invention operates on the character strings obtained by splitting the target document before and after the words already recognized as unique terms in this way.

The invention exploits the fact that many hiragana-mixed unique terms are formed by attaching another noun to the continuative-form ending of a verb. Hiragana that can be the continuative-form ending of a verb are distinguished from other hiragana and treated as a separate character type. The character string of the target document is expanded into a sequence of character type codes, and the unique terms that have "continuative form of a verb + inflectional ending" as a constituent element are extracted from it as unique term candidates. These candidates then undergo two stages of narrowing: a first stage based on narrowing rules concerning character string length and combinations of character types, and a second stage that refers to a general word dictionary and deletes from the first-stage candidates any word already registered in that dictionary.

[Operation]

Unique terms that have "continuative form of a verb + inflectional ending" as a constituent element are extracted as unique term candidates, and the two-stage narrowing described above is applied to them, so that hiragana-mixed unique terms are also extracted.

[Embodiment]

FIG. 1 is a basic configuration diagram showing an embodiment of the present invention.

In the figure, 1 denotes a Japanese document, 2 denotes the unique term list "1", and 11 denotes the unique term list "2". Reference numeral 12 denotes the main body of the Japanese unique term extraction device, which consists of a character string separation unit 3, a hiragana-mixed unique term candidate extraction unit 4, a character type code table 5, a hiragana-mixed unique term candidate narrowing unit A 6, a narrowing rule table 7, a hiragana-mixed unique term candidate narrowing unit B 8, a general word dictionary 9, and a unique term output unit 10.

The Japanese document 1 is the input document to the main body 12 of the Japanese unique term extraction device.

The unique term list "1" 2 is a list of unique terms that do not contain hiragana, extracted in advance from the Japanese document 1 by a conventional Japanese document unique term extraction device.

The character string separation unit 3 takes the Japanese document 1 as input and searches the unique term list "1" 2. If a unique term appears in an input sentence, the unit deletes it, separates the character strings before and after it, treats each as an independent separated character string, and builds a list of these strings.

The hiragana-mixed unique term candidate extraction unit 4 takes as input the separated character string list created by the character string separation unit 3 and converts it into code strings using the character type code table 5. It then extracts hiragana-mixed unique term candidates from these code strings and creates a hiragana-mixed unique term candidate list.

The character type code table 5 is a table that divides the characters used in Japanese documents into a plurality of character types and gives the character type code corresponding to each type.

The hiragana-mixed unique term candidate narrowing unit A 6 takes as input the hiragana-mixed unique term candidate list created by the hiragana-mixed unique term candidate extraction unit 4, applies the rules of the narrowing rule table 7, and converts the candidates that satisfy the selection conditions back into their original character strings.

The narrowing rule table 7 is a table describing rules for narrowing down candidates, created from information such as character string length and the order of character type codes.

The hiragana-mixed unique term candidate narrowing unit B 8 uses the general word dictionary 9 to narrow down the hiragana-mixed unique term candidates that were selected by the narrowing unit A 6 and converted back into their original character strings.

The general word dictionary 9 is a dictionary that describes the surface form, reading, part of speech, and so on of general Japanese words.

The unique term output unit 10 accumulates the results narrowed down by the hiragana-mixed unique term candidate narrowing unit B 8 and outputs the unique term list "2" to an output device.

The unique term list "2" 11 is the list of unique terms that is output.

FIG. 2 is a schematic flow diagram of an example of the operation of the Japanese unique term extraction device. The operation of the device in FIG. 1 is described below with reference to this figure.

Step S1-1: An arbitrary Japanese document is taken as input.

Step S1-2: n is set to 1 in order to process the first sentence contained in the input Japanese document (hereinafter, the input document).

Step S1-3: The processing through step S1-7 is repeated until the entire input document has been processed.

Step S1-4: The character string separation unit 3 searches the unique term list "1" in order, with the aim of finding the unique terms contained in the n-th sentence.

Step S1-5: If the n-th sentence contains a character string that matches a unique term found in step S1-4, the character string separation unit 3 deletes it from the sentence. It then separates the character strings before and after the deleted term and makes each an independent separated character string.

Step S1-6: n is set to n + 1 in order to process the next sentence.

Step S1-7: Processing branches according to whether all sentences in the input document have been processed. If they have not, processing returns to step S1-3; if they have, processing proceeds to step S1-8.

Step S1-8: A list is created from the separated character strings that were separated and accumulated by the above processing.

Step S2-1: The list of separated character strings created in step S1-8 is taken as input.

Step S2-2: i is set to 1 in order to process the first character string contained in the input separated character string list (hereinafter, the input separated character strings).

Step S2-3: The processing through step S2-7 is repeated until all input separated character strings have been processed.

Step S2-4: The hiragana-mixed unique term candidate extraction unit 4 converts each character of the i-th character string, following the character type code table 5, into one of several kinds of codes based on character type (for example, the seven kinds of codes shown in FIG. 7), and generates the code string for the i-th character string.

Step S2-5: For the i-th character string, the hiragana-mixed unique term candidate extraction unit 4 then extracts, as hiragana-mixed unique term candidates, all combinations of A, B, C, D, and E strings that are separated or enclosed by character type J or character type F and that contain character type code B.

Step S2-6: i is set to i + 1 in order to process the next character string.

Step S2-7: Processing branches according to whether all of the input character strings have been processed. If they have not, processing returns to step S2-3; if they have, processing proceeds to step S2-8.

Step S2-8: A list is created from the hiragana-mixed unique term candidates extracted by the above processing.

Step S3-1: The list of hiragana-mixed unique term candidates created in step S2-8 is taken as input.

Step S3-2: j is set to 1 in order to process the first unique term candidate character string contained in the input hiragana-mixed unique term candidate list.

Step S3-3: The processing through step S3-7 is repeated until all input unique term candidate character strings have been processed.

Step S3-4: The hiragana-mixed unique term candidate narrowing unit A performs selection on the j-th unique term candidate character string by applying the rules of the narrowing rule table 7, and decides whether it remains a unique term candidate.

Step S3-5: The narrowing unit A then branches according to whether the string was selected as a unique term candidate by the narrowing rules. If it was selected, processing proceeds to step S3-6. If it was not selected, processing proceeds to step S3-8.

Step S3-6: The hiragana-mixed unique term candidate narrowing unit B replaces the character type codes of the candidate character string selected by the narrowing unit A with the original characters.

Step S3-7: The narrowing unit B then refers to the general word dictionary 9 to check whether the unique term candidate restored to its original characters in step S3-6 is a general word, and narrows the candidates accordingly.

Step S3-8: j is set to j + 1 in order to process the next unique term candidate.

Step S3-9: Processing branches according to whether all unique term candidates in the input unique term candidate list have been processed. If they have not, processing returns to step S3-3; if they have, processing proceeds to step S3-10.

Step S3-10: The unique term candidates selected in step S3-8 are output to the unique term output unit 10 as the final result.

Step S3-11: The unique term list "2" is created from the unique terms in the final result.

Next, the operation is outlined using a concrete example of an input document. The input sentence example shown in FIG. 3 is taken as the Japanese document 1.

First, the character string separation unit 3 examines the unique term list "1" 2, which is the list of previously extracted unique terms that do not contain hiragana, as shown in FIG. 4. If a unique term from that list appears in the input sentence example, it is deleted as shown in FIG. 5. Then, as shown in FIG. 6, the character sequences before and after each deleted unique term are made into independent separated character strings, and a separated character string list is created.
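The separation step can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `split_on_known_terms`, the sample sentence, and the two known terms are hypothetical, since the contents of FIG. 3 and FIG. 4 are not reproduced in this text. The sketch deletes every known unique term from a sentence and keeps the non-empty pieces around it.

```python
import re


def split_on_known_terms(sentence, known_terms):
    """Delete known unique terms and return the text pieces around them."""
    if not known_terms:
        return [sentence] if sentence else []
    # Try longer terms first so a short term never pre-empts a longer one.
    ordered = sorted(known_terms, key=len, reverse=True)
    pattern = "|".join(re.escape(term) for term in ordered)
    # re.split drops the matched terms; empty pieces are discarded.
    return [piece for piece in re.split(pattern, sentence) if piece]


# Hypothetical sentence and term list, chosen only for illustration.
sentence = "一方、各ブロックの動きは、適当な一つのベクトル"
known_terms = ["ブロック", "ベクトル"]
print(split_on_known_terms(sentence, known_terms))
# ['一方、各', 'の動きは、適当な一つの']
```

Matching longer terms first is a design choice of this sketch, not something the source specifies; it avoids cutting a long term apart when a shorter term happens to be its substring.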

Next, the hiragana-mixed unique term candidate extraction unit 4 processes, in order, each character string in the separated character string list created by the character string separation unit 3. The first separated character string 「一方、各」 is converted, using the character type code table 5 shown in FIG. 7, which classifies Japanese characters into seven types, into the first coded character string 「AAFA」 in the coded separated character string list shown in FIG. 8. The second separated character string 「の動きは、適当なｍつの」 is converted by the same operation into 「JABJFAAJAJJ」. Proceeding in the same way yields the coded separated character string list shown in FIG. 8.
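A minimal Python sketch of the character type encoding is given below. Only part of the mapping can be read off the worked example: kanji map to A, the continuative-ending hiragana き maps to B, other hiragana map to J, and the comma 、 maps to F. Everything else is an assumption made for illustration, because FIG. 7 is not reproduced here: the set `RENYOU_ENDINGS`, the delimiter set, and the assignment of C, D, and E to katakana, alphabetic characters, and digits are all hypothetical. One character of the second example string is garbled in the source; since its code is A, the kanji 一 is assumed in the sample below.

```python
# Hypothetical set of hiragana that can end a verb's continuative form
# (i-row and e-row kana). The actual table in FIG. 7 may differ.
RENYOU_ENDINGS = set("いきしちにひみりぎじびえけせてねへめれげぜべ")
# Hypothetical delimiter set for code F.
DELIMITERS = set("、。，．「」『』（）・ 　")


def char_type(ch):
    """Return a one-letter character type code for a single character."""
    if ch in DELIMITERS:
        return "F"
    if "\u4e00" <= ch <= "\u9fff":
        return "A"  # kanji
    if "\u3041" <= ch <= "\u309f":
        # Continuative-ending hiragana get their own code B.
        return "B" if ch in RENYOU_ENDINGS else "J"
    if "\u30a0" <= ch <= "\u30ff":
        return "C"  # katakana (assumed assignment)
    if ch.isdigit():
        return "E"  # digits (assumed assignment)
    if ch.isalpha():
        return "D"  # alphabetic characters (assumed assignment)
    return "F"  # anything else is treated as a delimiter in this sketch


def encode(text):
    """Expand a character string into its character type code string."""
    return "".join(char_type(ch) for ch in text)


print(encode("一方、各"))                # AAFA
print(encode("の動きは、適当な一つの"))  # JABJFAAJAJJ
```

Under these assumptions the two printed code strings agree with the coded strings quoted in the text.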

Next, from each coded character string, every combination of character type codes A, B, C, D, and E that is delimited or enclosed by character type code F or character type code J and that contains B is taken as a hiragana-mixed unique term candidate.

For the first coded character string 「AAFA」, the coded strings 「AA」 and 「A」 are separated at character type code F, but neither contains character type code B, so neither becomes a candidate.

Next, for the second coded character string 「JABJFAAJAJJ」, 「AB」, 「AA」, and 「A」 are extracted in the same way. Of these, 「AB」, which contains character type code B, becomes a hiragana-mixed unique term candidate.

Repeating the same processing yields the list of coded hiragana-mixed unique term candidates shown in FIG. 9.
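The candidate extraction can be sketched in Python as below. The sketch is a simplification: it takes each maximal run of codes A to E between F or J codes as one segment and keeps the segments that contain B, whereas the text speaks more generally of "all combinations". The function name `extract_candidates` is hypothetical, and the sample text assumes the kanji 一 where the source character is garbled.

```python
import re


def extract_candidates(text, codes):
    """Return (original substring, code substring) pairs that contain code B."""
    if len(text) != len(codes):
        raise ValueError("text and codes must have the same length")
    candidates = []
    # Each maximal run of A-E codes is a segment delimited by F or J.
    for match in re.finditer(r"[A-E]+", codes):
        run = match.group()
        if "B" in run:
            candidates.append((text[match.start():match.end()], run))
    return candidates


print(extract_candidates("一方、各", "AAFA"))
# []
print(extract_candidates("の動きは、適当な一つの", "JABJFAAJAJJ"))
# [('動き', 'AB')]
```

Keeping the match positions lets the original characters be recovered later without a separate reverse lookup, which is convenient when candidates are restored to their original strings in the second narrowing stage.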

Next, the hiragana-mixed unique term candidate narrowing unit A 6 applies the narrowing rules to each element of the coded hiragana-mixed unique term candidate list created by the hiragana-mixed unique term candidate extraction unit 4, and narrows the candidates down.

First, the first rule in the narrowing rule table shown in FIG. 10 is applied to the first coded character string 「AB」, and it is dropped from the candidates. Next, the fourth rule is applied to the second coded character string 「AAB」, and it is also dropped from the candidates.

Proceeding in the same way yields the rule application results shown in FIG. 11. As a result, 「動きフレーム」 ("motion frame") and 「動き量」 ("motion amount") are extracted as hiragana-mixed unique term candidates.
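The first narrowing stage can be illustrated with the Python sketch below. The rules shown are hypothetical stand-ins, not the rules of FIG. 10, which are not reproduced in this text: the sketch only assumes that a rule may test the length of the code string and the order of its codes. The two stand-in rules were chosen so that 「AB」 and 「AAB」 are dropped while the code strings for 動き量 (ABA) and 動きフレーム (ABCCCC, assuming C denotes katakana) survive, as in the worked example.

```python
def passes_stage_one(codes, min_len=3):
    """Apply hypothetical narrowing rules to a character type code string."""
    # Stand-in rule: very short candidates are rejected.
    if len(codes) < min_len:
        return False
    # Stand-in rule: a candidate ending in B is a bare continuative form
    # with no following noun, so it is rejected.
    if codes.endswith("B"):
        return False
    return True


for codes in ["AB", "AAB", "ABA", "ABCCCC"]:
    print(codes, passes_stage_one(codes))
# AB False
# AAB False
# ABA True
# ABCCCC True
```

Expressing each rule as an independent test on the code string keeps the rule set easy to extend, which matches the table-driven design described for the narrowing rule table 7.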

Next, the hiragana-mixed unique term candidate narrowing unit B 8 looks up in the general word dictionary 9 each unique term candidate produced by the narrowing unit A. A candidate found in the dictionary is dropped, and the remaining candidates are extracted as hiragana-mixed unique terms. In this example, neither 「動きフレーム」 nor 「動き量」 is in the general word dictionary, so both are extracted as hiragana-mixed unique terms.
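The second narrowing stage amounts to a dictionary lookup, sketched below in Python. The dictionary contents and the third candidate 買い物 are hypothetical and serve only to show a candidate being dropped; the real general word dictionary 9 also stores readings and parts of speech, which this sketch ignores.

```python
def drop_general_words(candidates, general_dictionary):
    """Keep only the candidates that are not registered as general words."""
    return [term for term in candidates if term not in general_dictionary]


# Hypothetical dictionary entries (surface forms only).
general_dictionary = {"買い物", "受け皿"}
candidates = ["動きフレーム", "動き量", "買い物"]
print(drop_general_words(candidates, general_dictionary))
# ['動きフレーム', '動き量']
```

Holding the surface forms in a set gives constant-time average lookups, so this stage stays cheap even for a large dictionary.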

As described above, the hiragana-mixed unique terms 「動きフレーム」 and 「動き量」 can be extracted by a two-stage method that uses the unique term list "1". The first stage expands the text into a distinctive character type code sequence that singles out hiragana of the continuative-form inflectional-ending type and narrows the candidates by applying rules concerning the order of the code sequence. The second stage narrows the candidates by referring to the general word dictionary.

[Effect of the invention]

According to the present invention, hiragana-mixed unique terms can be extracted. A substantial gain in efficiency can therefore be expected in, for example, extracting document-specific terms from a document to be translated in order to create a bilingual glossary, and in creating glossaries and indexes for Japanese documents such as papers and other written works.

Consider, for example, the creation of a Japanese-English bilingual glossary for Japanese-to-English translation. Conventionally, before a large volume of documents was translated, the technical terms and special terms used in them were extracted by hand and given translations to create the glossary. With the present invention, the terms unique to a document to be translated can be extracted at high speed in translation preprocessing, without performing analysis such as morphological analysis or dependency analysis.

It also becomes possible to extract unique terms, such as unknown words, that would cause the analysis to fail or would fail to be extracted if morphological analysis were performed. By registering in a user dictionary a glossary that gives translations for the extracted unique terms, and using that dictionary, a marked improvement in translation quality can be expected.

Regarding automatic index creation, an index was conventionally made by hand: the author picked out the terms considered important from the manuscript and then sorted the extracted terms. With the present invention, the author can extract index candidate terms automatically by running the digitized Japanese document through this device.

The author can obtain the terms to be used in the index by deleting a few unnecessary candidates from those presented by the device. Because the candidate terms are in electronic form, they can easily be sorted in Japanese syllabary (a-i-u-e-o) order, and the index can be created quickly.

[Brief explanation of drawings]

FIG. 1 is a basic configuration diagram showing an embodiment of the present invention. FIG. 2 is a schematic flow diagram of the operation of the present invention. FIG. 3 is an example of a Japanese document that serves as input to the present invention. FIG. 4 is a list of previously extracted unique terms that do not contain hiragana. FIG. 5 is an example in which the previously extracted unique terms have been deleted from the input document. FIG. 6 is a list of separated character strings created by separating the character sequences before and after the deleted unique terms. FIG. 7 is an example of the contents of the character type code table, which divides Japanese characters into seven types. FIG. 8 is a list of the separated character strings converted into codes using the character type code table. FIG. 9 is a list in which only the strings of A, B, C, D, and E that are separated or enclosed by character type codes F and J and that contain B are taken as candidates. FIG. 10 is an example of the contents of the narrowing rule table applied to the coded character strings remaining as candidates. FIG. 11 shows the final candidate list and the results of applying the narrowing rules.

1: Japanese document; 2: unique term list "1"; 3: character string separation unit; 4: hiragana-mixed unique term candidate extraction unit; 5: character type code table; 6: hiragana-mixed unique term candidate narrowing unit A; 7: narrowing rule table; 8: hiragana-mixed unique term candidate narrowing unit B; 9: general word dictionary; 10: unique term output unit; 11: unique term list "2"; 12: Japanese unique term extraction device.

Patent applicant: Nippon Telegraph and Telephone Corporation

Claims

[Scope of Claims] A Japanese unique term extraction device configured to take as input a Japanese document written in Japanese and a unique term list "1" of unique terms contained in that document that do not include hiragana, the device comprising: a character string separation unit that, from the input Japanese document and the unique term list "1", separates and extracts the character strings sandwiched between unique terms; a character type code table that classifies the characters used in Japanese documents into a plurality of specific character types; a hiragana-mixed unique term candidate extraction unit that uses the character type code table to extract hiragana-mixed unique term candidates from each character string separated by the character string separation unit; a hiragana-mixed unique term candidate narrowing unit A that extracts, from the hiragana-mixed unique term candidates, only the candidates that satisfy the selection conditions for narrowing; a hiragana-mixed unique term candidate narrowing unit B that compares the candidate terms narrowed down by the narrowing unit A with a general word dictionary and deletes the words registered in the general word dictionary to narrow the candidates further; and a hiragana-mixed unique term output unit that accumulates the narrowed-down results and outputs a unique term list "2" to an output device.