JP2004062262A

JP2004062262A - Method of registering unknown word automatically to dictionary

Info

Publication number: JP2004062262A
Application number: JP2002216005A
Authority: JP
Inventors: Junko Hirota (廣田 純子); Shinichi Koto (弘藤 慎一); Junichi Imai (今井 順一)
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2002-07-25
Filing date: 2002-07-25
Publication date: 2004-02-26

Abstract

PROBLEM TO BE SOLVED: To evaluate an extracted unknown word and automatically judge whether it should be registered in a dictionary.

SOLUTION: An unknown word registering device is provided with a classification function for setting the classification of a document to be analyzed, a search condition setting function for setting, for each category, an appropriate search site and the range of years of registered information to be searched, an information input function for storing the document as a character string, an unknown word extracting function for extracting unknown words from the character string, a search function for performing a search using the unknown word as a keyword, and a search result analysis function for analyzing the search results. The unknown word is registered in the dictionary when the analysis result satisfies a criterion.

COPYRIGHT: (C)2004,JPO

Description

[0001]
[Technical field of the invention]
The present invention relates to a method for automatically registering words in a dictionary, and more particularly to a method that judges whether an unknown word should be registered and registers it automatically.
[0002]
[Prior art]
Conventionally, an unknown word registration method such as that described in JP-A-11-85761 has been proposed. In that method, a Japanese character string input by a computer is morphologically analyzed with reference to a dictionary and segmented into phrases, and, based on the result, unknown words that do not exist in the dictionary are extracted from the Japanese character string. Next, at least one word adjoining the extracted unknown word is extracted, the part of speech is determined from the character composition of the unknown word, the reading of the unknown word is estimated, and data on the unknown word, including the determined part of speech and reading, is additionally registered in the dictionary.
[0003]
[Problems to be solved by the invention]
However, although the above unknown word registration method can extract words adjoining an unknown word, determine the part of speech from attribute data, and register information on the unknown word, such as the determined part of speech and reading, in the dictionary, it registers every unknown word without judging whether the word should be registered.
[0004]
An object of the present invention is to provide a method that evaluates extracted unknown words and automatically judges whether each should be registered in a dictionary.
[0005]
[Means for Solving the Problems]
To solve the above problem, the present invention uses Web sites, which cover a large number of items and carry recent information, as the basis for judging how widely an unknown word that is a candidate for dictionary registration has come into general use. As a precondition for using Web sites, documents are classified into categories before unknown words are extracted, so that field-specific terms can be handled. Furthermore, to exclude temporary buzzwords from dictionary registration, the number of search hits and the registration dates of the Web pages are included in the conditions for judging registration. A threshold for the number of hits, or a value obtained by statistical processing, is set in advance as the judgment condition; the search result is evaluated against it to decide whether the word may be registered, and words judged registrable are registered.
[0006]
[Embodiments of the invention]
Next, an embodiment of the present invention will be described with reference to the drawings. FIG. 1 shows a block diagram of the embodiment.
[0007]
The classification setting function 1 determines which field the document from which words are to be extracted belongs to, and assigns the document to a category.
[0008]
The search condition setting function 2 sets the search site on the network to be used, based on the category assigned by the classification setting function 1. It also sets, for each category, how many years of registered information are to be covered, and, for each category, the threshold used to judge whether a word may be registered in the dictionary. The information input function 3 inputs the document from which words are to be extracted as a character string.
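The per-category settings described above can be pictured as a small lookup table. The following is a minimal sketch only: the category names, site names, and numeric values are hypothetical placeholders, since the contents of FIG. 2 are not reproduced in this text, and `SearchCondition` and `condition_for` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchCondition:
    search_site: str    # site queried for documents in this category
    years_back: int     # how many years of registered pages to include
    hit_threshold: int  # hit count a word must exceed to be a registration candidate


# Hypothetical values for illustration; not taken from FIG. 2.
SEARCH_CONDITIONS = {
    "economy": SearchCondition("news.example.com", years_back=3, hit_threshold=500),
    "computing": SearchCondition("tech.example.com", years_back=1, hit_threshold=200),
}


def condition_for(category: str) -> SearchCondition:
    """Return the search condition configured for a document category."""
    try:
        return SEARCH_CONDITIONS[category]
    except KeyError:
        raise ValueError(f"no search condition configured for category {category!r}")
```

Keeping the site, period, and threshold together in one record per category reflects the text's point that all three are chosen per category.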
[0009]
The unknown word extraction function 4 extracts unknown words using a conventional method such as that described in the previously proposed JP-A-11-338863.
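The cited extraction method is not described in this text. Purely as an illustration of what "extracting words that are not in the dictionary" can mean for unsegmented text, the sketch below substitutes a simple forward longest-match segmentation and collects the character runs that match no dictionary entry. This is an assumed stand-in, not the method of JP-A-11-338863, and `extract_unknown_words` is a hypothetical name.

```python
from typing import List, Set


def extract_unknown_words(text: str, dictionary: Set[str]) -> List[str]:
    """Collect runs of characters that no dictionary entry covers.

    Uses forward longest-match segmentation: at each position the longest
    dictionary word starting there is consumed; characters that start no
    dictionary word are accumulated into an unknown-word candidate.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    unknown: List[str] = []
    buffer = ""
    i = 0
    while i < len(text):
        match = None
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        if match is not None:
            if buffer:  # a known word ends the current unknown run
                unknown.append(buffer)
                buffer = ""
            i += len(match)
        else:
            buffer += text[i]
            i += 1
    if buffer:
        unknown.append(buffer)
    return unknown
```

For example, with the dictionary `{"私", "は", "を", "食べる"}` and the input `"私はタピオカを食べる"`, the only uncovered run is `"タピオカ"`, which becomes the unknown word candidate.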
[0010]
The search function 5 performs a search at the search site set by the search condition setting function 2, using the unknown word as a keyword, and extracts only the results that fall within the set registration period.
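Restricting results to the set registration period can be sketched as follows. The sketch assumes each search hit carries a registration date and that the period is counted in whole calendar months back from a reference date; the function names are hypothetical, and no real search API is implied.

```python
from datetime import date
from typing import Iterable, List


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months from `earlier` to `later` (day of month ignored)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def monthly_hit_counts(registration_dates: Iterable[date],
                       today: date,
                       years_back: int) -> List[int]:
    """Count hits per month, keeping only hits inside the target period.

    counts[0] is the current month and counts[k] is k months earlier.
    Hits older than `years_back` years (or dated in the future) are dropped.
    """
    n_months = years_back * 12
    counts = [0] * n_months
    for registered_on in registration_dates:
        age = months_between(registered_on, today)
        if 0 <= age < n_months:
            counts[age] += 1
    return counts
```

The resulting list of monthly counts is the kind of input a month-by-month distribution analysis would work on.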
[0011]
The search result analysis function 6 analyzes the month-by-month distribution of the number of hits based on the results retrieved by the search function 5. The period over which search results are analyzed differs by category. When, for example, information is collected by update date over the past three years, pages updated three years ago are more likely than pages updated one or two months ago to have since been deleted, so their hit counts are clearly lower; a correction is therefore applied. Here, the correction rate multiplies the results from three years ago by 3, those from two years ago by 2, and those from one year ago by 1.
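The correction can be sketched as a per-month weight. The text gives only the three rates (3, 2, 1); how months map onto "three years ago", "two years ago", and "one year ago" is an assumption here: the most recent 12 months get rate 1, the next 12 get rate 2, and the 12 after that get rate 3. Function names are hypothetical.

```python
from typing import List, Sequence


def correction_rate(months_before_now: int) -> int:
    """Weight for a month, assuming 12-month buckets: 0-11 -> 1, 12-23 -> 2, 24-35 -> 3.

    The text only defines rates up to three years; larger values returned for
    older months are an extrapolation of the same rule.
    """
    if months_before_now < 0:
        raise ValueError("months_before_now must be non-negative")
    return months_before_now // 12 + 1


def corrected_monthly_hits(monthly_hits: Sequence[int]) -> List[int]:
    """Apply the correction rate; monthly_hits[0] is the most recent month."""
    return [hits * correction_rate(k) for k, hits in enumerate(monthly_hits)]
```

With a constant 10 hits per month over 36 months, the corrected totals for the three years are 120, 240, and 360, so older months contribute more to the corrected count than recent ones.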
[0012]
The dictionary registration determination unit 7 judges whether a word may be registered in the dictionary, based on the results analyzed by the search result analysis function 6. If the number of hits is below the threshold, the word is judged not to be in general use and is not a candidate for registration. If the number of hits is above the set threshold, the unknown word becomes a candidate for registration. Next, to exclude temporary buzzwords from registration, the search results are analyzed statistically; here, a word is not registered in the dictionary if the analysis shows a normal distribution.
[0013]
FIG. 2 is a table showing examples of search sites and target registration periods and thresholds for each category.
[0014]
FIG. 3 is a flowchart showing a processing mode in the embodiment of the present invention.
[0015]
In category classification input 8, the classification setting function 1 classifies the document from which unknown words are to be extracted, and the resulting category is input. Next, in search condition setting 9, the search condition setting function 2 sets an appropriate search site for each category, based on the category information that was input. It also sets, for each category, how many past years of search-result registration dates are to be covered. Next, in information input 10, the document from which unknown words are to be extracted is input using the information input function 3. Then, in unknown word extraction 11, the unknown word extraction function extracts unknown words using an existing unknown word extraction method.
[0016]
In search execution 12, a search is performed at the search site set in search condition setting 9, using the unknown word as a keyword.
[0017]
In comparison 13 of the search result with the threshold, the search result is compared with the per-category threshold set in search condition setting 9. As noted above, the number of hits clearly differs between pages from one or two months ago and pages from three years ago, so the search result must be corrected. In the present invention, the result is corrected using the correction rate. If the search result is smaller than the threshold, the word is not registered in the dictionary (15 in the flowchart of FIG. 3).
[0018]
If the search result is larger than the threshold, analysis 14 of the search result is performed. Here, to keep temporary buzzwords out of the dictionary, the month-by-month distribution of the number of hits is analyzed from the search results. Statistically, in a normal distribution about 68% of all data falls within the range of the mean ± one standard deviation.
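The 68% figure is the standard one-sigma share of a normal distribution, erf(1/√2) ≈ 0.6827. As a rough sketch of how the same quantity could be measured on monthly hit counts, the code below treats the month index as the variable and the hit counts as weights, then reports the share of hits within one standard deviation of the mean month. The text does not specify this computation; the function name and approach are assumptions.

```python
import math
from typing import Sequence

# Share of a normal distribution within mean +/- one standard deviation.
ONE_SIGMA_SHARE = math.erf(1 / math.sqrt(2))  # approximately 0.6827


def share_within_one_sigma(monthly_hits: Sequence[float]) -> float:
    """Fraction of all hits falling within mean +/- std of the hit-weighted month index."""
    total = sum(monthly_hits)
    if total <= 0:
        raise ValueError("monthly_hits must contain at least one hit")
    mean = sum(i * h for i, h in enumerate(monthly_hits)) / total
    variance = sum(h * (i - mean) ** 2 for i, h in enumerate(monthly_hits)) / total
    std = math.sqrt(variance)
    inside = sum(h for i, h in enumerate(monthly_hits) if abs(i - mean) <= std)
    return inside / total
```

Because months are discrete, the measured share only approximates the continuous value: a bell-shaped series such as `[1, 4, 6, 4, 1]` gives 0.875, while a flat 12-month series gives 0.5.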
[0019]
Next, in analysis result judgment 16, the result is judged with reference to the outcome of analysis 14 of the search result. In the present invention, when the distribution of the search results is a normal distribution, about 70% of the hits are concentrated in a specific period, so the unknown word is considered likely to be a temporary buzzword and is not registered in the dictionary (15 in the flowchart of FIG. 3). When the analysis result is not a normal distribution, the word is judged to be a candidate for registration and is registered in the dictionary (17 in the flowchart of FIG. 3).
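The two-stage decision (threshold comparison, then buzzword exclusion) can be sketched as below. The text does not say how a normal distribution is detected, so this sketch substitutes a simpler concentration check: a word is treated as a buzzword when some window of consecutive months holds about 70% or more of all hits. The window length of 6 months, the handling of a total exactly equal to the threshold, and all names are assumptions; the input is taken to be monthly hit counts that have already been corrected.

```python
from typing import Sequence


def max_window_share(monthly_hits: Sequence[float], window: int) -> float:
    """Largest fraction of all hits found in any run of `window` consecutive months."""
    total = sum(monthly_hits)
    if total <= 0:
        return 0.0
    starts = range(max(1, len(monthly_hits) - window + 1))
    best = max(sum(monthly_hits[s:s + window]) for s in starts)
    return best / total


def should_register(monthly_hits: Sequence[float],
                    threshold: float,
                    window_months: int = 6,
                    burst_share: float = 0.7) -> bool:
    """Return True if the unknown word should be registered in the dictionary."""
    # Comparison 13: too few hits means the word is not in general use (step 15).
    if sum(monthly_hits) <= threshold:
        return False
    # Analysis 14 and judgment 16: hits concentrated in a short period
    # suggest a temporary buzzword (step 15).
    if max_window_share(monthly_hits, window_months) >= burst_share:
        return False
    # Step 17: widely and steadily used, so register.
    return True
```

Under these assumptions, a word with a steady 50 hits per month over three years passes a threshold of 1000, while a word whose hits all fall within a five-month spike is rejected even though its total exceeds the threshold.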
[0020]
[Effects of the invention]
As described above, according to the present invention, using Web sites that cover a large number of items as the criterion makes it possible to judge more appropriately whether a word has come into general use. Further, by selecting an appropriate search site for each category and using existing sites such as newspaper sites, an appropriate criterion for the field can be applied easily.
[Brief description of the drawings]
FIG. 1 is a block diagram of an embodiment of the present invention.
FIG. 2 is a correspondence table of search sites, search result target periods, and thresholds for each category.
FIG. 3 is a diagram showing a processing mode in an embodiment of the present invention.
[Explanation of symbols]
1 ... Classification setting function, 2 ... Information input function, 3 ... Unknown word extraction function, 4 ... Search condition setting function, 5 ... Search function, 6 ... Search result analysis function, 7 ... Dictionary registration determination unit, 8 ... Category input, 9 ... Search condition setting, 10 ... Information input, 11 ... Unknown word extraction, 12 ... Search execution, 13 ... Comparison of search result with threshold, 14 ... Analysis of search result, 15 ... Not registered in dictionary, 16 ... Judgment of analysis result, 17 ... Registered in dictionary.

Claims

An unknown word registration method for an unknown word registration device equipped with a classification setting function that sets the classification of a document to be analyzed, a search condition setting function that sets, for each category, an appropriate search site and how many years of registered information are to be covered, an information input function that stores the document as a character string, and an unknown word extraction function that extracts unknown words from the character string, the method being characterized by having, for performing a search, a search function that performs the search using an unknown word as a keyword and a search result analysis function that analyzes the search results, and by registering the unknown word in a dictionary when it meets a criterion based on the analysis result.