JPH0528871B2


Info

Publication number: JPH0528871B2
Application number: JP62238385A
Authority: JP
Inventors: Masahiro Oku; Masanobu Higashida
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1987-09-21
Filing date: 1987-09-21
Publication date: 1993-04-27
Also published as: JPS6479863A

Description

[Detailed Description of the Invention]

[Field of Industrial Application]

This invention relates to a Japanese target sentence-specific term extraction device that automatically extracts, from a Japanese document, proper nouns used only in that document, such as product names, company names, and personal names, as well as words that, even if they are combinations of words, are considered to be new words or words used only in that document (target sentence-specific terms).

[Conventional technology]

In one conventional method for extracting target sentence-specific terms from a Japanese document (proper nouns used only in that document, such as product names, company names, and personal names, and words that, even if they are combinations of words, are considered to be new words or words used only in that document), the input Japanese text is segmented into words using an analysis dictionary prepared in advance, and the portions that could not be segmented successfully are extracted as target sentence-specific terms. In another method, which focuses on the points where the character type changes, the span up to the point where one character type changes to another is extracted as a candidate for a target sentence-specific term, and those candidates that are not registered in a Japanese analysis dictionary prepared in advance are extracted as target sentence-specific terms.

[Problem that the invention seeks to solve]

However, the former method had the following problems.

Because the input Japanese text is segmented into words using the analysis dictionary, extraction takes time.

Japanese word segmentation assumes that words are registered in the analysis dictionary, so it also attempts to analyze target sentence-specific terms, and when part of such a term is analyzed successfully, that part is excluded from the target sentence-specific term. For example, although "○×会社" ("○× Company") as a whole is a target sentence-specific term denoting a company name, "会社" ("company") is a general word that is analyzed successfully, so only "○×" is recognized as the target sentence-specific term. Target sentence-specific terms therefore cannot be extracted accurately.

When a target sentence-specific term is composed of general words, it is not recognized as a target sentence-specific term. For example, "日本電力電話株式会社" is a proper noun and a target sentence-specific term, but because it consists solely of general words, it is segmented successfully as "日本／電力／電話／株式／会社" (Japan / electric power / telephone / stock / company). As a result, "日本電力電話株式会社", which should be extracted as a target sentence-specific term, is not extracted.
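The two failure cases described above can be reproduced with a short sketch. It assumes a greedy longest-match segmenter, which the patent does not specify; the function name `conventional_extract` and the five-word dictionary are hypothetical and chosen only to mirror the examples in the text.

```python
def conventional_extract(text, dictionary):
    """Sketch of the dictionary-based conventional method (assumed greedy
    longest-match): spans that cannot be segmented are returned as terms."""
    max_len = max(map(len, dictionary))
    unknown, buf, i = [], "", 0
    while i < len(text):
        # Try the longest dictionary word starting at position i.
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in dictionary:
                if buf:                  # close any pending unsegmented span
                    unknown.append(buf)
                    buf = ""
                i += n
                break
        else:
            buf += text[i]               # no word matched: character is "unknown"
            i += 1
    if buf:
        unknown.append(buf)
    return unknown


# Hypothetical analysis dictionary containing only general words.
GENERAL_WORDS = {"日本", "電力", "電話", "株式", "会社"}

partial = conventional_extract("○×会社", GENERAL_WORDS)
missed = conventional_extract("日本電力電話株式会社", GENERAL_WORDS)
```

Under these assumptions `partial` holds only the "○×" fragment and `missed` is empty, which corresponds to the second and third problems respectively.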

The latter method is fast because it requires few dictionary lookups, but it had problems of its own: target sentence-specific terms that span multiple character types are difficult to extract, and target sentence-specific terms that contain an affix with okurigana, such as "〜向け" ("for ~"), cannot be extracted.

An object of this invention is to provide a Japanese target sentence-specific term extraction method that solves the above problems and automatically extracts target sentence-specific terms from Japanese documents quickly and accurately.

[Means for solving problems]

The principal feature of this invention is that it comprises: means for expanding the character strings in a Japanese document into a code string in which the characters are classified into nine character types; extraction means for extracting candidates for target sentence-specific terms solely from the points in that code string where the character type changes; processing means for extracting more accurate candidates for Japanese target sentence-specific terms by using a language information table, which stores conditions for character strings that, owing to the nature of Japanese, do not become Japanese target sentence-specific terms and conditions for character strings that tend to become Japanese target sentence-specific terms, to retain from among the candidates extracted by the extraction means only those that satisfy the conditions for character strings that tend to become Japanese target sentence-specific terms; and selection means for outputting, from among the candidates extracted by the processing means, only the words not contained in a Japanese analysis dictionary as Japanese target sentence-specific terms.

This invention differs from the conventional technology in the following respects. The extraction means is fast because, when extracting candidates for Japanese target sentence-specific terms, it performs no dictionary-based analysis such as word segmentation and instead extracts candidates by focusing on character-type change points. Because the extraction means extracts candidates by focusing on character-type change points, words consisting solely of general words and words that partly contain general words can also be extracted as candidates for Japanese target sentence-specific terms. The processing means narrows down the candidates using language information consisting of heuristic rules, such as "a run of five or more kanji tends to be a Japanese target sentence-specific term", and processing information, such as "if the candidate contains an affix, apply affix processing", so candidates can be extracted accurately. The selection means outputs as Japanese target sentence-specific terms every word that does not exactly match a word in the analysis dictionary, so Japanese target sentence-specific terms consisting solely of general words, or partly containing general words, can also be extracted.
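As an illustration of the heuristic rule quoted above ("a run of five or more kanji tends to be a Japanese target sentence-specific term"), the following is a minimal sketch. It assumes that "kanji" can be approximated by the CJK Unified Ideographs block plus "〇"; the function name `likely_specific_term` is hypothetical, and the patent does not say how the rule is encoded in the language information table.

```python
import re

# Assumption: kanji are approximated by U+4E00..U+9FFF plus the numeral 〇.
KANJI_RUN = re.compile(r"[\u4e00-\u9fff〇]{5,}")


def likely_specific_term(candidate: str) -> bool:
    """True if the candidate contains a run of five or more kanji."""
    return KANJI_RUN.search(candidate) is not None


long_run = likely_specific_term("基盤技術開発推進センター")   # eight kanji in a row
short_run = likely_specific_term("会社")                       # only two kanji
```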

[Example]

Fig. 1 is a basic configuration diagram of this invention implemented in hardware. The code string expansion unit 1 expands the input Japanese document into a code string in which each character is classified into one of nine character types (kanji code, kanji numeral code, hiragana code, katakana code, Arabic numeral code, alphabet code, punctuation code, delimiter code, and other code). The Japanese target sentence-specific term candidate extraction unit 2 extracts candidates for Japanese target sentence-specific terms from the code string obtained by the code string expansion unit 1, focusing on the points where the character type changes. The Japanese target sentence-specific term language processing unit 3 searches the language information table 8 for each candidate obtained by the candidate extraction unit 2, processes the candidate according to the retrieved information, and then extracts a candidate group consisting only of more accurate candidates. The Japanese target sentence-specific term selection unit 4 searches the Japanese analysis dictionary 9 using the surface form of each candidate in the candidate group obtained by the language processing unit 3 as the key, and selects only the words that are not registered in the Japanese analysis dictionary 9 as Japanese target sentence-specific terms. The Japanese target sentence-specific term registration unit 5 registers the terms selected by the selection unit 4 in the Japanese target sentence-specific term file 6, which is the file that holds the finally extracted Japanese target sentence-specific terms. The classification table 7 specifies how the character-type strings extracted by the candidate extraction unit 2 are to be classified. The language information table 8 describes the language information and processing methods that the language processing unit 3 uses when extracting more accurate candidates. The Japanese analysis dictionary 9 holds the surface forms, parts of speech, and other data of general Japanese words. The code string expansion unit 1, the candidate extraction unit 2, the language processing unit 3, the selection unit 4, and the registration unit 5 constitute the Japanese target sentence-specific term extraction device 10, which consists of an arithmetic unit and memory.

Fig. 2 is a schematic flowchart of the operation of the Japanese target sentence-specific term extraction device 10. The operation is described below following this flowchart. First, n is set to 1 (step S₁), and the n-th sentence of the Japanese document input to the device 10 is processed (step S₂). The code string expansion unit 1 converts each character of the n-th sentence into one of the nine character-type codes (kanji code, kanji numeral code, hiragana code, katakana code, Arabic numeral code, alphabet code, punctuation code, delimiter code, and other code) and generates the code string for that sentence (step S₃). At this time, each code is given information indicating which character of the Japanese document it was generated from. The code string generated by the code string expansion unit 1 is sent to the Japanese target sentence-specific term candidate extraction unit 2.
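The code string expansion of step S₃ can be sketched as follows. This is an illustration under stated assumptions, not the device's actual implementation: the Unicode ranges, the code names (`"KANJI"`, `"DELIM"`, and so on), the treatment of the long-vowel mark "ー" as katakana, the inclusion of ASCII variants of punctuation and brackets, and the function names `char_code` and `expand` are all choices made here. Each code is kept paired with its source character, which is one way of recording which character a code was generated from.

```python
KANJI_NUMERALS = set("〇一二三四五六七八九")
PUNCTUATION = set("、。，．？！,.?!")                 # ASCII variants are an assumption
DELIMITERS = set("「」（）〔〕｛｝()[]{}'\"’”")      # brackets and quotation marks


def char_code(ch: str) -> str:
    """Return one of nine character-type codes for a single character."""
    if ch in KANJI_NUMERALS:                 # checked before the general kanji range
        return "KANJI_NUM"
    if "\u4e00" <= ch <= "\u9fff":
        return "KANJI"
    if "\u3041" <= ch <= "\u3096":
        return "HIRAGANA"
    if "\u30a1" <= ch <= "\u30fa" or ch == "ー":
        return "KATAKANA"
    if ch in "0123456789０１２３４５６７８９":
        return "DIGIT"
    if ("A" <= ch <= "Z" or "a" <= ch <= "z"
            or "Ａ" <= ch <= "Ｚ" or "ａ" <= ch <= "ｚ"):
        return "ALPHA"
    if ch in PUNCTUATION:
        return "PUNCT"
    if ch in DELIMITERS:
        return "DELIM"
    return "OTHER"


def expand(sentence: str):
    """Expand a sentence into (character, code) pairs, one per character."""
    return [(ch, char_code(ch)) for ch in sentence]


pairs = expand("基盤はＡＢ、３％")
```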

The Japanese target sentence-specific term candidate extraction unit 2 extracts as candidates for Japanese target sentence-specific terms all of the original character strings that correspond to a code string beginning at a change point from a delimiter code, hiragana code, or punctuation code to any of the other six codes and ending at the next change point to a delimiter code, hiragana code, or punctuation code. It then classifies the candidates according to the conditions described in the classification table 7 (step S₄). Fig. 3 shows an example of the contents of the classification table 7. In this table, "kanji" includes "kanji numerals".
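The change-point extraction of step S₄ amounts to collecting maximal runs of characters whose code is not delimiter, hiragana, or punctuation. The sketch below assumes the input is a list of (character, code) pairs with hypothetical code names such as `"KANJI"` and `"HIRAGANA"`, treats the start of the sentence as a change point, and also closes a candidate at the end of the sentence, which is an assumption. Classification by the classification table 7 is omitted because its contents (Fig. 3) are not reproduced in the text.

```python
# Codes that terminate a candidate (hypothetical names for the three boundary types).
BOUNDARY_CODES = {"DELIM", "HIRAGANA", "PUNCT"}


def extract_candidates(pairs):
    """Return the strings between boundary codes as candidate terms."""
    candidates, current = [], []
    for ch, code in pairs:
        if code in BOUNDARY_CODES:
            if current:                      # change point into a boundary type
                candidates.append("".join(current))
                current = []
        else:
            current.append(ch)               # any of the other six types
    if current:                              # assumption: sentence end closes a candidate
        candidates.append("".join(current))
    return candidates


# Hand-coded pairs for the hypothetical input "西電は新機種を発売"
example = [
    ("西", "KANJI"), ("電", "KANJI"), ("は", "HIRAGANA"),
    ("新", "KANJI"), ("機", "KANJI"), ("種", "KANJI"), ("を", "HIRAGANA"),
    ("発", "KANJI"), ("売", "KANJI"),
]
found = extract_candidates(example)
```

Because only three of the nine types act as boundaries, a run that mixes, for example, kanji and katakana stays together as a single candidate.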

The classified candidates for Japanese target sentence-specific terms are sent to the Japanese target sentence-specific term language processing unit 3. The language processing unit 3 sets i = 1 (step S₅) and processes the i-th candidate (step S₆). That is, it first searches the language information table 8 to obtain information (step S₇), then processes the candidate according to the information obtained (step S₈), thereby extracting more accurate candidates for Japanese target sentence-specific terms (step S₉), and sends those candidates to the Japanese target sentence-specific term selection unit 4.

The Japanese target sentence-specific term selection unit 4 searches the Japanese analysis dictionary 9 using as the key the surface form of each candidate sent from the language processing unit 3. If the search shows that the candidate is registered in the Japanese analysis dictionary 9, the candidate is judged not to be a Japanese target sentence-specific term and is dropped (step S₁₀). Conversely, if the candidate is not registered in the Japanese analysis dictionary 9, it is judged to be a Japanese target sentence-specific term and is sent to the Japanese target sentence-specific term registration unit 5 (step S₁₀).

The Japanese target sentence-specific term registration unit 5 writes the Japanese target sentence-specific terms sent from the selection unit 4 into the Japanese target sentence-specific term file 6 and registers them (step S₁₁). If the candidate is not retained as a candidate in step S₉, or after step S₁₁, i is incremented by 1 (step S₁₂), and it is then checked whether any candidates remain. If candidates remain, processing returns to step S₆. If none remain, n is incremented by 1 (step S₁₄) and it is checked whether any sentences remain (step S₁₅). If sentences remain, processing returns to step S₂; otherwise processing ends.
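The control flow of Fig. 2 can be summarized as two nested loops, one over sentences and one over candidates. The sketch below is a hedged outline only: the function `run_extraction` and its parameters are hypothetical, the individual stages are passed in as callables rather than implemented, and the step numbers in the comments follow the description above.

```python
def run_extraction(sentences, expand, extract, keep_candidate, dictionary):
    """Outline of Fig. 2. The stage callables are supplied by the caller:
    expand(sentence) -> code string, extract(codes) -> [(text, class_id)],
    keep_candidate(text, class_id) -> bool."""
    term_file = []                                  # stands in for term file 6
    for sentence in sentences:                      # S1, S2, S14, S15
        codes = expand(sentence)                    # S3
        for text, class_id in extract(codes):       # S4, S5, S6, S12
            if not keep_candidate(text, class_id):  # S7 to S9: language processing
                continue
            if text in dictionary:                  # S10: registered, so dropped
                continue
            term_file.append(text)                  # S11: register the term
    return term_file


# Usage with trivial stand-in stages (all hypothetical).
terms = run_extraction(
    ["会社", "西電", "昨年"],
    expand=list,
    extract=lambda codes: [("".join(codes), 2)],
    keep_candidate=lambda text, class_id: text != "昨年",
    dictionary={"会社"},
)
```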

Next, the operation is outlined using an example. The example sentence shown in Fig. 4 is taken as the Japanese document input to the target sentence-specific term extraction device 10. First, the code string expansion unit 1 converts each character of the example sentence in Fig. 4 into the corresponding code (one of the nine codes) and generates a code string (Fig. 5). In the code string of Fig. 5, each of the nine codes is written with its own abbreviation, and the codes cover the following characters: the kanji code covers all kanji except kanji numerals; the kanji numeral code covers 〇, 一, 二, 三, 四, 五, 六, 七, 八, and 九; the hiragana code covers all hiragana; the katakana code covers all katakana; the Arabic numeral code covers 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9; the alphabet code covers the uppercase and lowercase letters A to Z and a to z; the punctuation code covers the Japanese comma "、", the Japanese full stop "。", the comma "，", the period "．", the question mark "？", and the exclamation mark "！"; the delimiter code covers brackets such as corner brackets 「」, parentheses （）, tortoise-shell brackets 〔〕, and braces ｛｝, together with the single quotation mark and the double quotation mark; and the other code covers characters and symbols that fall under none of the preceding codes.

At this time, each code is given information indicating which character it was generated from (for example, information indicating that the first code in Fig. 5 was generated from the character "基" in Fig. 4 and the next code from the character "盤"). This information can be attached by numbering the characters of the original string and attaching those numbers to the code string, by holding each original character and its code as a pair, or by other methods; the particular method does not matter here. The code string generated by the code string expansion unit 1 (Fig. 5) is sent to the Japanese target sentence-specific term candidate extraction unit 2.

Next, from the code string shown in Fig. 5, the Japanese target sentence-specific term candidate extraction unit 2 takes out the code strings that begin at a change point from a delimiter code, hiragana code, or punctuation code to any of the other six codes and end at the next change point to a delimiter code, hiragana code, or punctuation code (Fig. 6), and extracts all of the original character strings corresponding to those code strings as candidates for Japanese target sentence-specific terms. From the example sentence in Fig. 4, 21 candidates are extracted, as shown in Fig. 7. The beginning of a sentence is treated as a change point, so the span from there to the next change point to a delimiter code, hiragana code, or punctuation code is also extracted as a candidate. The candidate extraction unit 2 further classifies the candidates extracted from the change-point information according to the classification table 7. The result is shown in Fig. 8. The classification result is sent to the Japanese target sentence-specific term language processing unit 3.

The Japanese target sentence-specific term language processing unit 3 searches the language information table 8 and narrows down the candidates according to the information obtained. Fig. 9 shows an example of the contents of the language information table 8. The language processing unit 3 searches the language information table 8 using the candidate classifications shown in Fig. 8 as keys. Among the words belonging to class 2 in Fig. 8, "昨年" (last year) matches the class 2 entry of Fig. 9, "nouns used adverbially are dropped from the candidates", and is therefore dropped. Likewise, "三台" (three units) is dropped because it matches "candidates consisting of a kanji numeral followed by a counter are dropped from the candidates", and "発売" (release) is dropped because it matches "sahen nouns (verbal nouns that take する) are dropped from the candidates". Consequently, of the words in class 2, the three words "西電", "南芝", and "会社" remain as candidates. All words belonging to classes 3, 4, 6, 7, and 19 remain as candidates. The four words in class 21, "七〇％", "三・九九％", "三％", and "各一・六九％出資", all match "candidates consisting of a kanji numeral followed by a counter are dropped from the candidates" (the class 21 entry of Fig. 9) and are therefore dropped. As a result, a total of 14 character strings, namely "西電", "南芝", "会社", "新機種", "日際電気", "凸版電話", "日際岩井", "住際銀行", "国際電力電話", "松際通信工業", "日本毎朝新聞社", "基盤技術開発推進センター", "マズキル", and "キヤダン", are sent to the Japanese target sentence-specific term selection unit 4 as candidates for Japanese target sentence-specific terms.
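The narrowing step for classes 2 and 21 can be sketched as a table of drop rules keyed by class number. Everything concrete here is an assumption made for illustration: the word lists `ADVERBIAL_NOUNS` and `SAHEN_NOUNS`, the counter characters in the regular expression, and the names `DROP_RULES` and `narrow` are hypothetical, and the real contents of the language information table 8 (Fig. 9) are not reproduced in the text.

```python
import re

ADVERBIAL_NOUNS = {"昨年"}      # hypothetical lexicon of adverbially used nouns
SAHEN_NOUNS = {"発売"}          # hypothetical lexicon of sahen nouns
# Kanji numeral (optionally with more numerals or "・") followed by a counter.
NUMERAL_COUNTER = re.compile(r"[〇一二三四五六七八九][〇一二三四五六七八九・]*[台％]")


def is_adverbial(c):
    return c in ADVERBIAL_NOUNS


def is_sahen(c):
    return c in SAHEN_NOUNS


def has_numeral_counter(c):
    return NUMERAL_COUNTER.search(c) is not None


# Drop rules per class; classes without an entry keep all their candidates.
DROP_RULES = {
    2: [is_adverbial, has_numeral_counter, is_sahen],
    21: [has_numeral_counter],
}


def narrow(candidates_by_class):
    """Keep candidates that match none of the drop rules for their class."""
    kept = []
    for class_id, candidates in candidates_by_class.items():
        rules = DROP_RULES.get(class_id, [])
        kept.extend(c for c in candidates if not any(rule(c) for rule in rules))
    return kept


kept = narrow({
    2: ["昨年", "三台", "発売", "西電", "南芝", "会社"],
    21: ["七〇％", "三・九九％", "三％", "各一・六九％出資"],
})
```

With these assumed rules, the class 2 words that survive are "西電", "南芝", and "会社", and every class 21 word is dropped, matching the outcome described in the text.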

The Japanese target sentence-specific term selection unit 4 searches the Japanese analysis dictionary 9 by the surface form of each candidate and selects only the words not registered in the Japanese analysis dictionary 9 as Japanese target sentence-specific terms. For the example sentence, the Japanese analysis dictionary 9 is searched for each of the 14 candidates above, using its surface form as the key. Of these, only "会社" is registered in the Japanese analysis dictionary 9 as a general word, so "会社" is dropped from the candidates. The remaining 13 character strings, that is, the 14 strings above minus "会社", are sent to the Japanese target sentence-specific term registration unit 5 as Japanese target sentence-specific terms.
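The selection step is an exact-match lookup, which can be sketched with a set. The function name `select_terms` is hypothetical, and the one-word dictionary below stands in for the Japanese analysis dictionary 9 only to the extent the example requires; the 14 candidates are those listed in the text.

```python
def select_terms(candidates, analysis_dictionary):
    """Keep only candidates whose surface form is not in the dictionary."""
    return [c for c in candidates if c not in analysis_dictionary]


candidates = [
    "西電", "南芝", "会社", "新機種", "日際電気", "凸版電話", "日際岩井",
    "住際銀行", "国際電力電話", "松際通信工業", "日本毎朝新聞社",
    "基盤技術開発推進センター", "マズキル", "キヤダン",
]
# Stand-in dictionary: of the 14 candidates, only "会社" is registered.
terms = select_terms(candidates, {"会社"})
```

Because the lookup uses the whole candidate string as the key, a term such as "国際電力電話" is kept even if each of its parts would be found in a dictionary of general words.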

The Japanese target sentence-specific term registration unit 5 writes the 13 Japanese target sentence-specific terms it receives into the Japanese target sentence-specific term file 6 and registers them. Fig. 10 shows the Japanese target sentence-specific terms written to the file 6.

With the structure and operation described above, the following improvements over conventional methods are obtained. Candidates for Japanese target sentence-specific terms are extracted by focusing on character-type change points, without dictionary-based analysis such as word segmentation, so extraction is fast. Because candidates are extracted by focusing on character-type change points, words consisting solely of general words and words that partly contain general words can also be extracted as candidates. The candidates extracted from the change points are narrowed down using the information in the language information table (heuristic rules such as "a run of five or more kanji tends to be a Japanese target sentence-specific term" and processing information such as "if the candidate contains an affix, apply affix processing"), so candidates can be extracted accurately. Every word that does not exactly match a word in the Japanese analysis dictionary is extracted as a Japanese target sentence-specific term, so Japanese target sentence-specific terms consisting solely of general words, or partly containing general words, can also be extracted.

[Effect of the invention]

As explained above, the device of this invention extracts from a document, using information on character-type change points, candidates for proper nouns used only in that document, such as product names, company names, and personal names, and for words that, even if they are combinations of words, are considered to be new words or words used only in that document (collectively called Japanese target sentence-specific terms). It then narrows down the candidates using the language information they carry, searches the analysis dictionary to remove general words, and outputs what remains as Japanese target sentence-specific terms. It therefore has the advantage that the Japanese target sentence-specific terms present in a Japanese document can be extracted quickly and accurately.

[Brief explanation of drawings]

Fig. 1 is a basic configuration diagram of the device of this invention implemented in hardware. Fig. 2 is a schematic flowchart of the operation of the device. Fig. 3 shows an example of the contents of the classification table. Fig. 4 shows the example sentence used to explain the operation. Fig. 5 shows the code string for the example sentence (Fig. 4). Fig. 6 shows the code strings of the candidates for Japanese target sentence-specific terms extracted from the code string of Fig. 5. Fig. 7 shows the original character strings corresponding to the code strings of Fig. 6. Fig. 8 shows the classification of the candidates for Japanese target sentence-specific terms according to the classification table. Fig. 9 shows an example of the contents of the language information table 8. Fig. 10 shows the extracted Japanese target sentence-specific terms.

Claims

[Scope of Claims]

1. A Japanese target sentence-specific term extraction device comprising: a language information table that stores conditions for character strings that, owing to the nature of Japanese, do not become Japanese target sentence-specific terms and conditions for character strings that tend to become Japanese target sentence-specific terms; a Japanese analysis dictionary that stores the surface forms, parts of speech, and other data of Japanese words; code string expansion means for expanding the character strings in an input Japanese document into a code string classified into nine character types, namely kanji code, kanji numeral code, hiragana code, katakana code, Arabic numeral code, alphabet code, punctuation code, delimiter code, and other code; Japanese target sentence-specific term candidate extraction means for extracting candidates for proper nouns such as product names, company names, and personal names, and for words that, even if they are combinations of words, are considered to be new words or words used only in that document (these words being collectively called Japanese target sentence-specific terms); Japanese target sentence-specific term language processing means for extracting more accurate candidates for Japanese target sentence-specific terms by using the language information table to retain, from among the candidates extracted by the extraction means, only those that satisfy the conditions for character strings that tend to become Japanese target sentence-specific terms; and Japanese target sentence-specific term selection means for outputting, as Japanese target sentence-specific terms, only the words not contained in the Japanese analysis dictionary.