JPH03263260A - Japanese language analyzer

Info

Publication number: JPH03263260A
Application number: JP2063234A
Authority: JP
Inventors: Yuji Ito; 雄二伊藤
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1990-03-14
Filing date: 1990-03-14
Publication date: 1991-11-22

Abstract

PURPOSE: To infer the attribute of a proper noun and to determine the range of an unregistered word by performing unregistered-word processing centered on proper noun detection, using information from words that co-occur with the proper noun. CONSTITUTION: Character string data constituting a Japanese text is input through an input means 1 and stored in an input character string storage part 2. Words and their attribute data are stored in a word dictionary part 3, and words that connect to proper nouns, together with the attributes of those proper nouns, are stored in a co-occurrence word table 4. A morphological analysis control part 14 analyzes the character string data stored in the storage part 2 by referring to the dictionary part 3. When no word matching the character string under analysis exists, or when the connection table 6 does not permit connection to the preceding word, an unregistered word is judged to be present, and unregistered-word processing starts from that point in the character string. An unregistered word detection control part 11 then determines a provisional range of the unregistered word using an unregistered word provisional detecting part 8, and the attribute of the detected proper noun is inferred.

Description

DETAILED DESCRIPTION OF THE INVENTION [Industrial Field of Application] The present invention relates to a Japanese language analysis device that analyzes the part of speech and structure of each word constituting a Japanese sentence.

[Background Art] In recent years, demand has grown for natural language processing technology, typified by machine translation. Morphological analysis of mixed kanji-kana text is the first step performed when analyzing Japanese, and its accuracy affects the accuracy of the analysis as a whole. In morphological analysis, the handling of unregistered words is a major problem.

Commonly used methods for extracting unregistered words from a sentence include the following. In one, the sentence is analyzed from its beginning, and the run of characters of the same character type starting at the point where dictionary lookup fails is treated as an unregistered word. In another, the character at which lookup failed is skipped and lookup is retried from the next character; this is repeated until a word can be extracted, and the characters up to the one immediately before that word are then treated as the unregistered word.

[Problem to Be Solved by the Invention] With the method that treats a run of the same character type as an unregistered word, given a character string such as 「中曽根首相」 ("Prime Minister Nakasone"), if dictionary lookup fails at 「中」, the entire string becomes an unregistered word even though it contains the registered word 「首相」 ("prime minister"). For the same example, with the method that looks up the dictionary while shifting one character at a time, 「首相」 is found on the third attempt and the preceding 「中曽根」 is recognized as an unregistered word; however, depending on how the characters are arranged, this method may segment the text incorrectly.
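The two conventional methods can be sketched as follows. This is a minimal illustration only: the toy dictionary, the Unicode-block character classes, and the function names are assumptions made for this sketch and do not appear in the source.

```python
# Hypothetical toy dictionary; the source does not list dictionary contents.
DICTIONARY = {"首相", "が", "は"}


def char_type(ch: str) -> str:
    """Rough character class by Unicode block (an assumption for this sketch)."""
    if "\u4e00" <= ch <= "\u9fff":
        return "kanji"
    if "\u3040" <= ch <= "\u309f":
        return "hiragana"
    if "\u30a0" <= ch <= "\u30ff":
        return "katakana"
    return "other"


def same_type_run(text: str, start: int) -> str:
    """Method 1: from the failure point, take the run of same-type characters."""
    end = start + 1
    while end < len(text) and char_type(text[end]) == char_type(text[start]):
        end += 1
    return text[start:end]


def skip_until_match(text: str, start: int) -> str:
    """Method 2: skip one character at a time until a dictionary word begins."""
    pos = start + 1
    while pos < len(text):
        if any(text.startswith(word, pos) for word in DICTIONARY):
            break
        pos += 1
    return text[start:pos]
```

With lookup failing at index 0 of 「中曽根首相」, the first function swallows the whole kanji run, including the registered word 「首相」, while the second stops just before it.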

[Means for Solving the Problem] Most words detected as unregistered words are nouns, and among these, proper nouns (names of people, places, and organizations) appear to account for a large share. Moreover, these proper nouns seldom appear alone in a sentence; they usually appear together with co-occurring words (e.g., 首相 "prime minister", 社長 "company president", 市 "city", 銀行 "bank").

Focusing on this observation, the present invention solves the problem by providing a co-occurrence word table that stores words that connect to proper nouns together with the attributes of the proper nouns to which those words connect; an analysis means that analyzes character string data stored in a storage means by referring to a word dictionary; and a configuration that obtains, by referring to the co-occurrence word table, the attribute of a character string that the analysis means cannot analyze and that is connected to a word stored in the co-occurrence word table.

[Operation] By performing unregistered-word processing centered on proper noun detection with the above means, the likelihood of incorrectly analyzing unregistered words in a sentence is reduced, and the attribute (person name, place name, or organization name) of a detected proper noun can be inferred from the information carried by the word that co-occurs with it.

[Embodiment] A Japanese language analysis device according to an embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a functional block diagram of a Japanese language analysis device according to an embodiment of the present invention. In the figure, 1 is an input means for Japanese character strings. 2 is an input character string storage unit that stores the Japanese character strings input from the input means 1. 3 is a word dictionary unit that holds morpheme information consisting of each word's reading, notation, part-of-speech information, and the like. 4 is a co-occurrence word table which, as described later, holds the notation and reading of words likely to co-occur with proper nouns such as person names, place names, and organization names, together with the attribute of the co-occurring proper noun (person name, place name, or organization name, as above). 5 is a prefix table that registers words often used as prefixes of co-occurring words.

6 is a connection table holding information on whether a connection between words is permitted. 7 is a dictionary/table search unit that compares the character string under analysis in the input character string against the prefix table 5, the co-occurrence word table 4, and the word dictionary 3, in order from the head, and retrieves the word that matches the character string under analysis over the longest span. 8 is an unregistered word provisional range detection unit that, when dictionary search fails, detects a provisional unregistered character string range by a method described later. 9 is a proper noun detection unit that detects a proper noun within that unregistered character string by a method described later, and 10 is a proper noun attribute determination unit that infers the attribute of the proper noun from the information held by the co-occurring word detected by the proper noun detection unit 9. 11 is an unregistered word detection control unit that determines the range of the unregistered word using the unregistered word provisional range detection unit 8 and the proper noun attribute determination unit 10. 12 is an analysis result storage unit that stores the results of morphological analysis, 13 is an output display unit, and 14 is a morphological analysis control unit that controls Japanese morphological analysis.
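As a rough illustration of the two tables described above, the co-occurrence word table 4 and the prefix table 5 might be represented as below. The field names, the `Attribute` enum, and every entry other than 「首相」 (whose attribute the text gives as "person name") and 「元」 (which the worked example finds in the prefix table) are assumptions for this sketch; the actual contents of FIG. 4 and FIG. 5 are not reproduced in the text.

```python
from dataclasses import dataclass
from enum import Enum


class Attribute(Enum):
    """Proper noun attributes named in the description."""
    PERSON = "person name"
    PLACE = "place name"
    ORGANIZATION = "organization name"


@dataclass(frozen=True)
class CooccurrenceEntry:
    notation: str         # written form of the co-occurring word
    reading: str          # reading of the co-occurring word
    attribute: Attribute  # attribute of the proper noun it accompanies


# Only the attribute of 首相 is stated in the source; the other rows are
# hypothetical examples built from the words the description mentions.
COOCCURRENCE_TABLE = {
    entry.notation: entry
    for entry in (
        CooccurrenceEntry("首相", "しゅしょう", Attribute.PERSON),
        CooccurrenceEntry("社長", "しゃちょう", Attribute.PERSON),
        CooccurrenceEntry("市", "し", Attribute.PLACE),
        CooccurrenceEntry("銀行", "ぎんこう", Attribute.ORGANIZATION),
    )
}

# Words often used as prefixes of co-occurring words (元 = "former").
PREFIX_TABLE = {"元"}
```

Keying the table by notation keeps the lookup that the search unit 7 performs a simple membership test.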

The operation of the device configured as above is described below. First, the input Japanese character string is analyzed from its beginning. Starting at the head of the character string under analysis, the dictionary search section of the dictionary/table search unit 7 searches the word dictionary 3 and retrieves the longest matching word. Once a word has been extracted, the search is repeated from the character following that word to extract the next word. When the next word has been extracted, the connection table 6 is used to check whether the preceding word and the following word can be connected; if they can, the analysis of that portion is taken to be correct and the next word is segmented. If no word corresponding to the character string under analysis is found in the dictionary, or if the connection table indicates that connection to the preceding word is not possible, an unregistered word is judged to be present there. Unregistered-word processing then begins from the point where the unregistered word was judged to exist.
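The longest-match analysis with a connection check described above can be sketched as follows. The representation of the word dictionary as a mapping from word to part of speech, and of the connection table as a set of permitted part-of-speech pairs, is an assumption for illustration; the source only states that the connection table records whether words can be connected.

```python
from typing import Dict, List, Optional, Set, Tuple


def longest_match(text: str, pos: int, dictionary: Dict[str, str]) -> Optional[str]:
    """Return the longest dictionary word starting at pos, or None."""
    for end in range(len(text), pos, -1):
        if text[pos:end] in dictionary:
            return text[pos:end]
    return None


def analyze(
    text: str,
    dictionary: Dict[str, str],          # word -> part of speech (hypothetical)
    connectable: Set[Tuple[str, str]],   # permitted (left, right) POS pairs (hypothetical)
) -> Tuple[List[str], Optional[int]]:
    """Segment text from the head.

    Returns the words extracted so far and the position at which an
    unregistered word is judged to be present (None if analysis completes).
    """
    words: List[str] = []
    pos = 0
    while pos < len(text):
        word = longest_match(text, pos, dictionary)
        if word is None:
            return words, pos  # no dictionary word: unregistered word here
        if words and (dictionary[words[-1]], dictionary[word]) not in connectable:
            return words, pos  # connection not permitted: unregistered word here
        words.append(word)
        pos += len(word)
    return words, None
```

The returned position is where unregistered-word processing would take over.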

The following description refers to the flowchart of FIG. 2. The unregistered word detection control unit 11 determines a provisional unregistered word range using the unregistered word provisional range detection unit 8 according to the criteria of FIG. 3. For the provisional range set in this way, the control unit 11 then has the proper noun detection unit 9, in step s1, search from the tail of the provisional unregistered word range, via the dictionary/table search unit 7, for a word registered in the co-occurrence word table of FIG. 4. If a matching word is found in step s1, processing moves to step s1a, where the character immediately before that word is examined to see whether it is registered in the prefix table of FIG. 5. If it is, step s1b treats the portion before that character as the unregistered word.

If no prefix matches, step s1c treats the portion from the head of the range up to just before the word registered in the co-occurrence word table as the unregistered word.

In step s1d, the proper noun attribute determination unit 10 determines the attribute of the detected proper noun from the "attribute" information in the co-occurrence word table and outputs that information to the analysis result storage unit 12.

Step s2 determines whether a co-occurring word was found. If none was found, step s2a takes the word immediately before the provisional unregistered word range and checks it against the co-occurrence word table. If it matches, step s2a treats the entire provisional range as the unregistered word, and processing moves to step s1d. This handles the case where the co-occurring word comes before the proper noun. If no co-occurring word is found before the proper noun either, step s2b treats the entire provisional range as the unregistered word.
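The flow of steps s1 through s2b can be sketched as a single function. This is an illustrative reading of the flowchart description, not the patented implementation: the table contents (apart from 「首相」 and 「元」), the function name, and the choice to test progressively shorter tails of the provisional range are assumptions.

```python
from typing import Optional, Tuple

# Hypothetical table contents; only 首相 -> "person name" and the prefix 元
# are confirmed by the description.
COOCCURRENCE_TABLE = {"首相": "person name", "市": "place name"}
PREFIX_TABLE = {"元"}


def resolve_unregistered(
    provisional: str, preceding_word: Optional[str] = None
) -> Tuple[str, Optional[str]]:
    """Return (unregistered word, inferred attribute or None)."""
    # s1: look for a co-occurring word at the tail of the provisional range,
    # trying longer tails first and always leaving a non-empty head.
    for length in range(len(provisional) - 1, 0, -1):
        tail = provisional[-length:]
        if tail in COOCCURRENCE_TABLE:
            head = provisional[:-length]
            # s1a: is the character just before the co-occurring word a prefix?
            if len(head) > 1 and head[-1] in PREFIX_TABLE:
                unregistered = head[:-1]  # s1b: drop the prefix character
            else:
                unregistered = head       # s1c: everything before the word
            return unregistered, COOCCURRENCE_TABLE[tail]  # s1d
    # s2 / s2a: no co-occurring word inside the range; check the word before it.
    if preceding_word in COOCCURRENCE_TABLE:
        return provisional, COOCCURRENCE_TABLE[preceding_word]  # then s1d
    # s2b: no co-occurring word on either side; no attribute can be inferred.
    return provisional, None
```

For example, `resolve_unregistered("中曽根元首相")` yields 「中曽根」 with the attribute "person name", and `resolve_unregistered("中曽根", preceding_word="首相")` yields the same pair through the s2a branch.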

A concrete example is now used to explain the actual processing in detail.

Consider the following example sentence: 「・・・は中曽根元首相が１２月２５日に北京で・・・」 ("... former Prime Minister Nakasone ... in Beijing on December 25 ..."). Suppose dictionary lookup fails at 「中」. The unregistered word provisional range detection unit 8 then determines 「中曽根元首相」 to be the provisional range according to the criteria of FIG. 3. The dictionary/table search unit searches the co-occurrence word table of FIG. 4 from the tail of this string and finds 「首相」 (step s1). The character before 「首相」, 「元」 ("former"), is then looked up in the prefix table of FIG. 5, and a match is found. The unregistered word detection control unit 11 therefore determines that the unregistered word in this case is 「中曽根」 (step s1b). Furthermore, since the attribute of 「首相」 in the table is "person name", this result is output to the analysis result storage unit 12 as shown in FIG. 6 (step s1d).

Consider another example: 「・・・は元首相中曽根が・・・」 ("... former Prime Minister Nakasone ..."). Suppose, as before, that dictionary lookup fails at 「中」. The provisional range is 「中曽根」, and the search for a co-occurring word from its tail finds nothing (step s2). The word before the range, 「首相」, is therefore looked up in the co-occurrence word table, and a match is found. 「中曽根」 is thus treated as the unregistered word, the attribute "person name" of 「首相」 is assigned to it, and the result is output to the analysis result storage unit 12.

[Effects of the Invention] As described above, the present invention provides a co-occurrence word table that stores words that connect to proper nouns together with the attributes of the proper nouns to which those words connect. When a word not registered in the dictionary is encountered, the co-occurrence word table is searched using a word connected to it, and attribute information for the unregistered word is obtained. This makes it possible to determine the range of the unregistered word accurately and to infer the attribute of the detected proper noun.

[Brief explanation of drawings]

FIG. 1 is a block diagram of a Japanese language analysis device according to an embodiment of the present invention, FIG. 2 is a flowchart showing the control procedure in an embodiment of the present invention, FIG. 3 is a diagram showing the criteria for determining the provisional range of an unregistered character string, FIG. 4 is a diagram showing the co-occurrence word table, FIG. 5 is a diagram showing the prefix table, and FIG. 6 is a diagram showing an example of an output result in an embodiment of the present invention.

1 ... input means, 2 ... input character string storage unit, 3 ... word dictionary unit, 4 ... co-occurrence word table, 5 ... prefix table, 6 ... connection table, 7 ... dictionary/table search unit, 8 ... unregistered word provisional range detection unit, 9 ... proper noun detection unit, 10 ... proper noun attribute determination unit, 11 ... unregistered word detection control unit, 12 ... analysis result storage unit, 13 ... output display unit, 14 ... morphological analysis control unit.

Claims

[Scope of Claims] A Japanese language analysis device comprising: an input means for inputting character string data constituting a Japanese sentence; a storage means for storing the character string data input from the input means; a word dictionary storing words and attribute data of the words; a co-occurrence word table storing words that connect to proper nouns and the attributes of the proper nouns to which the words connect; a morphological analysis means for analyzing the character string data stored in the storage means with reference to the word dictionary; and a means for obtaining, by referring to the co-occurrence word table, the attributes of character strings that cannot be analyzed by the morphological analysis means and that are connected to words stored in the co-occurrence word table.