JP2004062262A - Method of registering unknown word automatically to dictionary - Google Patents

Publication number
JP2004062262A
JP2004062262A JP2002216005A JP2002216005A JP2004062262A JP 2004062262 A JP2004062262 A JP 2004062262A JP 2002216005 A JP2002216005 A JP 2002216005A JP 2002216005 A JP2002216005 A JP 2002216005A JP 2004062262 A JP2004062262 A JP 2004062262A
Authority
JP
Japan
Prior art keywords
function
unknown word
search
dictionary
unknown
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP2002216005A
Other languages
Japanese (ja)
Inventor
廣田 純子
Junko Hirota
Shinichi Koto
弘藤 慎一
Junichi Imai
今井 順一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hitachi Ltd
Original Assignee
Hitachi Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2002-07-25
Filing date
2002-07-25
Publication date
2004-02-26
Application filed by Hitachi Ltd
Priority to JP2002216005A
Publication of JP2004062262A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

PROBLEM TO BE SOLVED: To evaluate an extracted unknown word and automatically judge whether it should be registered in a dictionary.

SOLUTION: An unknown word registering device is provided with a classification setting function that sets the category of a document to be analyzed, a search condition setting function that sets, for each category, an appropriate search site and the number of years of registered information to be searched, an information input function that stores the document as a character string, and an unknown word extraction function that extracts unknown words from the character string. The device further has a search function that searches using the unknown word as a keyword and a search result analysis function that analyzes the search result. The unknown word is registered in the dictionary when the analysis result satisfies a criterion.

COPYRIGHT: (C)2004,JPO

Description

【0001】
【発明の属する技術分野】
本発明は辞書に語句を自動登録する方法に関し、特に未知語の登録可否を判定して自動登録する方法に関する。
【0002】
【従来の技術】
従来は、特開平11−85761号公報に記載のような未知語登録方法が提案されていた。その手法とは、コンピュータにより入力された日本語文字列を辞書で参照しつつ形態素解析して文節に分かち書きし、該結果に基づいて前記辞書に存在しない未知語を、前記日本語文字列から抽出する。次に、抽出された未知語の連接語を少なくともひとつ抽出し、未知語に含まれる文字構成に基づき品詞を判定し、未知語の読みを推測した上で判定された品詞と読みを含めて未知語に関するデータを辞書に追加登録する方法であった。
【0003】
【発明が解決しようとする課題】
しかし、上記未知語登録方法では、未知語の連接語を抽出し、属性データに基づいて品詞を判定させ、判定された品詞や読み等の未知語に関する情報を辞書へ登録することは可能であるが、該未知語の辞書登録可否判断を行わずにすべて登録していた。
【0004】
本発明は、抽出された未知語を評価して辞書登録可否判断を自動的に行う方法を提供することを目的とする。
【0005】
【課題を解決するための手段】
本発明は、上記課題を解決するために、辞書登録対象となる未知語がどれくらい一般化した語であるのかを判断する材料として、対象数が多く情報の新しいWebサイトを利用する。Webサイトを利用する前条件として、分野特有の語句に対応させるため、未知語を抽出する前に予め文書をカテゴリ分けしておく。さらに、一時的な流行語を辞書登録対象から排除するために、検索結果のヒット件数やWebサイト登録日も辞書登録可否の判定条件に含める。判定条件として予めヒット件数のしきい値や、統計的処理を行った値を設定しておき、それに基づいて結果を判定し辞書登録可否を判定し、辞書登録可とされた語句を登録する。
【0006】
【発明の実施の形態】
次に、本発明の実施例について、図面を参照して説明する。図1は本発明の実施例のブロックダイヤグラムを示す。
【0007】
分類設定機能1は、語句の抽出を行いたい文書がどの分野に区分されるのかを判断し、カテゴリ分けを行う機能である。
【0008】
検索条件設定機能2は、分類設定機能1でカテゴリ分けされた情報をもとに、対象となるネットワーク上の検索サイトを設定する機能である。また、カテゴリ毎に何年分の登録情報までを対象とするのかを設定する機能である。また、カテゴリ毎に辞書登録の可否を判断する材料となるしきい値を設定する機能である。情報入力機能3は、語句の抽出を行いたい文書を文字列として入力する機能である。
【0009】
未知語抽出機能4は、すでに提案されている特開平11−338863号公報に記載されているような従来の方法を用いて未知語の抽出を行う機能である。
【0010】
検索機能5は、検索条件設定機能2で設定された検索サイトにて、未知語をキーワードとして検索を行い、設定された登録期間のみの結果を抽出する機能である。
【0011】
検索結果分析機能6は、検索機能5で検索した結果をもとに、1ヶ月ごとのヒット件数の分布を解析する機能である。カテゴリ毎に検索結果を分析対象とする期間が異なるのであるが、例えば過去3年分の更新日時から情報を収集する場合、更新日が1、2ヶ月前の場合と3年前の場合では、後者のほうが過去に登録されていた情報が削除されている可能性があり、ヒット件数が少ないことは明らかであるため補正処理を行う。ここでは、補正率として3年前の結果に3倍、2年前に2倍、1年前に1倍かけることとする。
【0012】
辞書登録の可否判定部7は、検索結果分析機能6で解析した結果をもとに辞書への登録可否を判定する機能である。ヒット件数がしきい値よりも低い場合は、一般的に普及されていない語句であると判断し、辞書登録対象とはしないこととする。設定したしきい値より高い場合には、未知語を辞書へ登録する対象とする。次に、一時的な流行語を辞書への登録対象から排除する場合、検索結果を分析し統計的に解析するのであるが、ここでは解析結果が正規分布となる場合には、辞書へ登録しないこととする。
【0013】
図2にカテゴリ毎に検索サイトおよびの対象登録期間としきい値の例を表に示す。
【0014】
図3に本発明の実施例における処理態様を表すフローチャートを示す。
【0015】
カテゴリ分類の入力8で分類設定機能1により、未知語を抽出する文書のカテゴリを分類し結果を入力する。つぎに検索条件の設定9では、検索カテゴリ分類で入力された情報をもとに検索条件設定機能2を用いて、カテゴリ毎の適当な検索サイトを設定する。また、カテゴリ毎に検索結果の登録日の過去何年分を対象とするのかを設定する。つぎに情報入力10で、未知語を抽出したい文書を情報入力機能3を用いて入力する。そして、未知語抽出11により、既存の未知語抽出方法を用いて未知語抽出機能により未知語を抽出する。
【0016】
検索実行12では、検索条件の設定9で設定された検索サイトにて未知語をキーワードとして検索を実行する。
【0017】
検索結果と閾値の比較13では、検索条件の設定9で設定されたカテゴリ毎の閾値と検索結果の比較を行う。ここでは検索結果は、上記でも記載したように、1、2ヶ月前と3年前では、検索にヒットする件数が異なるのは明確であるため、検索結果を補正する必要がある。本発明では、補正率を用いて結果の補正を行う。そして、検索結果が閾値よりも小さい場合は辞書へ登録しないこととする(図3のフローチャートの15)。
【0018】
検索結果が閾値よりも大きい場合は、検索結果の分析14を行う。ここでは、一時的な流行語を辞書へ登録することを排除するために、検索結果をもとに1ヶ月毎のヒット件数の分布がどのようになっているのかを解析する。統計学的に、正規分布においては、平均値±標準偏差の範囲内には全データの約68%が含まれる傾向がある。
【0019】
次に、検索結果の分析14での解析結果を参照して、分析結果の判定16で結果を判定する。本発明では、検索結果の分布が正規分布を示す場合には、ある特定の期間に約70%の件数が集中するため、該未知語が一時的な流行語である可能性が高いと考え辞書への登録は行わないこととする(図3のフローチャート15)。そして、分析結果が正規分布とならない場合には、辞書の登録対象と判断し、辞書へ登録することとする(図3のフローチャート17)。
【0020】
【発明の効果】
以上説明した如く、本発明によれば、対象の多いWebサイトを判断基準として利用することで、一般化しているか否かの判断をより適切に行うことができる。また、カテゴリ毎に適当な検索サイトを選択し、既存の新聞サイト等を利用することで簡易にその分野における適切な判断基準を利用することができる。
【図面の簡単な説明】
【図1】本発明の実施例のブロックダイヤグラムである。
【図2】カテゴリ毎の検索サイトと検索結果対象期間としきい値の対応表である。
【図3】本発明の実施例における処理態様を示す図である。
【符号の説明】
1…分類設定機能、2…情報入力機能、3…未知語抽出機能、4…検索条件設定機能、5…検索機能、6…検索結果分析機能、7…辞書登録の可否判定部、8…カテゴリの入力、9…検索条件の設定、10…情報入力、11…未知語抽出、12…検索実行、13…検索結果と閾値の比較、14…検索結果の分析、15…辞書へ非登録、16…分析結果の判定、17…辞書へ登録。
[0001]
[Technical field of the invention]
The present invention relates to a method for automatically registering words in a dictionary, and more particularly to a method that determines whether an unknown word should be registered and then registers it automatically.
[0002]
[Prior art]
Conventionally, an unknown word registration method such as the one described in JP-A-11-85761 has been proposed. In that method, a Japanese character string input by a computer is morphologically analyzed with reference to a dictionary and segmented into phrases, and unknown words that do not exist in the dictionary are extracted from the Japanese character string based on the result. Next, at least one word adjoining the extracted unknown word is extracted, the part of speech is determined from the character composition of the unknown word, the reading of the unknown word is estimated, and data on the unknown word, including the determined part of speech and the reading, is additionally registered in the dictionary.
[0003]
[Problems to be solved by the invention]
However, although the above unknown word registration method can extract the words adjoining an unknown word, determine the part of speech from attribute data, and register information about the unknown word, such as the determined part of speech and reading, in the dictionary, it registers every unknown word without judging whether the word should be registered.
[0004]
An object of the present invention is to provide a method that evaluates extracted unknown words and automatically judges whether each should be registered in a dictionary.
[0005]
[Means for Solving the Problems]
In order to solve the above problem, the present invention uses Web sites, which are numerous and carry recent information, as material for judging how far an unknown word that is a candidate for dictionary registration has come into general use. As a precondition for using Web sites, documents are sorted into categories before unknown words are extracted, so that field-specific terms can be handled. Furthermore, to exclude temporary buzzwords from registration, the number of search hits and the registration dates of the Web pages are included in the conditions for judging registration. A hit-count threshold or a statistically processed value is set in advance as the judgment condition, the result is judged against it to decide whether registration is allowed, and words judged registrable are registered.
[0006]
[Embodiments of the invention]
Next, embodiments of the present invention will be described with reference to the drawings. FIG. 1 shows a block diagram of an embodiment of the present invention.
[0007]
The classification setting function 1 determines which field the document from which words are to be extracted belongs to, and assigns the document to a category.
[0008]
The search condition setting function 2 sets the search sites on the network to be used, based on the category assigned by the classification setting function 1. It also sets, for each category, how many years of registered information are to be covered and the threshold used to judge whether dictionary registration is allowed. The information input function 3 inputs the document from which words are to be extracted as a character string.
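As an illustration only, the per-category settings described here (search site, years covered, threshold) could be held in a small lookup table. The Python sketch below is an assumption about one possible representation. The names `SearchConditions`, `CONDITIONS_BY_CATEGORY`, and `conditions_for`, the category labels, the example host names, and all numeric values are hypothetical and are not taken from FIG. 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchConditions:
    search_site: str    # site queried for documents in this category
    years_covered: int  # how many years of registered pages to include
    hit_threshold: int  # hit count needed before registration is considered


# Hypothetical example values; the real table is FIG. 2 of the patent.
CONDITIONS_BY_CATEGORY = {
    "economy": SearchConditions("news.example.com", 3, 500),
    "sports": SearchConditions("sports.example.com", 1, 200),
}


def conditions_for(category: str) -> SearchConditions:
    """Return the search conditions configured for a document category."""
    return CONDITIONS_BY_CATEGORY[category]
```

Looking up `conditions_for("economy")` would then yield the site, period, and threshold that the later search and comparison steps use for a document in that category.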
[0009]
The unknown word extraction function 4 extracts unknown words using a conventional method such as the one already proposed in JP-A-11-338863.
[0010]
The search function 5 searches the search site set by the search condition setting function 2 using the unknown word as a keyword, and extracts only the results that fall within the set registration period.
[0011]
The search result analysis function 6 analyzes the month-by-month distribution of the number of hits, based on the results obtained by the search function 5. The period over which search results are analyzed differs by category. For example, when information is collected by update date over the past three years, pages updated three years ago are more likely to have been deleted since then than pages updated one or two months ago, so their hit counts are clearly lower, and a correction is therefore applied. Here, as correction rates, results from three years ago are multiplied by 3, results from two years ago by 2, and results from one year ago by 1.
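The correction described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation. The text gives only the three factors (3, 2, 1), so mapping months 1 to 12 to the "one year ago" bucket, months 13 to 24 to "two years ago", and months 25 to 36 to "three years ago" is an assumption, and the function names are hypothetical.

```python
from typing import Dict


def correction_factor(months_ago: int) -> int:
    """Weight for pages registered `months_ago` months back (1 = last month)."""
    if months_ago < 1:
        raise ValueError("months_ago must be >= 1")
    # Assumed bucketing: months 1-12 -> x1, 13-24 -> x2, 25-36 -> x3.
    return (months_ago - 1) // 12 + 1


def corrected_monthly_hits(raw_hits: Dict[int, int]) -> Dict[int, int]:
    """Multiply each month's raw hit count by the factor for its age."""
    return {month: count * correction_factor(month)
            for month, count in raw_hits.items()}
```

For example, ten raw hits in each of months 1, 13, and 36 would be corrected to 10, 20, and 30 hits respectively.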
[0012]
The dictionary registration determination unit 7 judges whether a word may be registered in the dictionary, based on the result analyzed by the search result analysis function 6. If the number of hits is lower than the threshold, the word is judged not to be in general use and is not treated as a candidate for registration. If the number of hits is higher than the set threshold, the unknown word becomes a candidate for registration. Next, to exclude temporary buzzwords from registration, the search results are analyzed statistically. Here, a word is not registered if the analysis result is a normal distribution.
[0013]
FIG. 2 is a table showing an example of the search site, target registration period, and threshold for each category.
[0014]
FIG. 3 is a flowchart showing a processing mode in the embodiment of the present invention.
[0015]
In category classification input 8, the classification setting function 1 classifies the document from which unknown words are to be extracted, and the resulting category is input. Next, in search condition setting 9, the search condition setting function 2 sets an appropriate search site for each category based on the category information that was input. The number of past years of registration dates that the search results should cover is also set for each category. Next, in information input 10, the document from which unknown words are to be extracted is input using the information input function 3. Then, in unknown word extraction 11, the unknown word extraction function 4 extracts unknown words using an existing unknown word extraction method.
[0016]
In the search execution 12, a search is performed at the search site set in the search condition setting 9 using the unknown word as a keyword.
[0017]
In comparison 13 of the search result and the threshold, the search result is compared with the per-category threshold set in search condition setting 9. As noted above, the number of hits clearly differs between pages from one or two months ago and pages from three years ago, so the search result must be corrected. In the present invention, the result is corrected using the correction rates. If the search result is smaller than the threshold, the word is not registered in the dictionary (15 in the flowchart of FIG. 3).
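The branch taken at comparison 13 can be sketched as a small Python function. This is an illustrative sketch only. It assumes that the "search result" compared with the threshold is the total of the corrected monthly hit counts, and the name `next_step` is hypothetical. The text covers only the "smaller" and "larger" cases, so rejecting a total exactly equal to the threshold is also an assumption.

```python
from typing import Iterable


def next_step(corrected_hits: Iterable[float], threshold: float) -> int:
    """Return the flowchart step that follows comparison 13.

    15 = do not register in the dictionary,
    14 = go on to analyze the search result.
    """
    total = sum(corrected_hits)
    # Assumption: a total equal to the threshold is treated as a rejection.
    return 14 if total > threshold else 15
```

With a threshold of 50, corrected counts of 10, 20, and 30 would proceed to analysis 14, while counts of 10 and 20 would end at step 15.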
[0018]
If the search result is larger than the threshold, analysis 14 of the search result is performed. Here, to keep temporary buzzwords out of the dictionary, the month-by-month distribution of the number of hits is analyzed based on the search results. Statistically, in a normal distribution about 68% of all data falls within the range of the mean ± one standard deviation.
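The 68% figure is the standard one-sigma probability of a normal distribution. For reference, with X normally distributed with mean μ and standard deviation σ (symbols introduced here, not used in the patent text), it follows from the standard normal cumulative distribution function Φ:

```latex
P(\mu - \sigma \le X \le \mu + \sigma)
  = \Phi(1) - \Phi(-1)
  = \operatorname{erf}\!\left(\tfrac{1}{\sqrt{2}}\right)
  \approx 0.6827
```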
[0019]
Next, in analysis result judgment 16, the result is judged with reference to the outcome of analysis 14 of the search result. In the present invention, when the distribution of the search results is a normal distribution, about 70% of the hits are concentrated in a specific period, so the unknown word is considered likely to be a temporary buzzword and is not registered in the dictionary (15 in the flowchart of FIG. 3). If the analysis result is not a normal distribution, the word is judged to be a registration candidate and is registered in the dictionary (17 in the flowchart of FIG. 3).
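The text does not say how a distribution is judged to be normal. The Python sketch below is therefore only one hypothetical reading: it measures what share of the hits falls in months within one standard deviation of the hit-weighted mean month, and treats a share at or above a cutoff as buzzword-like concentration. This is a crude heuristic, not a statistical normality test. The function names and the cutoff of 0.65 are assumptions.

```python
import math
from typing import Dict


def one_sigma_share(monthly_hits: Dict[int, float]) -> float:
    """Fraction of hits in months within mean +/- one standard deviation.

    Keys are month indices (for example 1 = most recent month);
    values are the (corrected) hit counts for that month.
    """
    total = sum(monthly_hits.values())
    if total <= 0:
        raise ValueError("no hits to analyze")
    mean = sum(m * n for m, n in monthly_hits.items()) / total
    variance = sum(n * (m - mean) ** 2 for m, n in monthly_hits.items()) / total
    sd = math.sqrt(variance)
    inside = sum(n for m, n in monthly_hits.items() if abs(m - mean) <= sd)
    return inside / total


def looks_like_buzzword(monthly_hits: Dict[int, float],
                        cutoff: float = 0.65) -> bool:
    """True when hits are concentrated around one period (hypothetical cutoff)."""
    return one_sigma_share(monthly_hits) >= cutoff
```

Under this heuristic, a word whose hits are spread evenly over 36 months has a one-sigma share of about 0.56 and would be registered, while a word whose hits peak around a single month exceeds the cutoff and would be excluded.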
[0020]
[Effects of the invention]
As described above, according to the present invention, using Web sites, which cover a large number of pages, as the basis for judgment makes it possible to judge more appropriately whether a word has come into general use. In addition, by selecting an appropriate search site for each category and using existing newspaper sites and the like, appropriate judgment criteria for that field can be applied easily.
[Brief description of the drawings]
FIG. 1 is a block diagram of an embodiment of the present invention.
FIG. 2 is a correspondence table of search sites, search result target periods, and thresholds for each category.
FIG. 3 is a diagram showing a processing mode in an embodiment of the present invention.
[Explanation of symbols]
1 ... Classification setting function, 2 ... Search condition setting function, 3 ... Information input function, 4 ... Unknown word extraction function, 5 ... Search function, 6 ... Search result analysis function, 7 ... Dictionary registration determination unit, 8 ... Category classification input, 9 ... Search condition setting, 10 ... Information input, 11 ... Unknown word extraction, 12 ... Search execution, 13 ... Comparison of search result and threshold, 14 ... Analysis of search result, 15 ... Not registered in dictionary, 16 ... Judgment of analysis result, 17 ... Registered in dictionary.

Claims (1)

解析対象とする文書の分類を設定する分類設定機能と、カテゴリ毎に適当な検索サイトや何年分の登録情報までを対象とするのかを設定する検索条件設定機能と、文書を文字列として記憶する情報入力機能と、文字列から未知語を抽出する未知語抽出機能を備えた未知語登録装置において、検索する場合に、未知語をキーワードとして検索を行う検索機能と、検索結果の分析を行う検索結果分析機能と、分析結果をもとに基準に適応する場合に未知語を辞書へ登録することを特徴とする未知語登録方法。

An unknown word registration method for an unknown word registration device comprising a classification setting function that sets the classification of a document to be analyzed, a search condition setting function that sets, for each category, an appropriate search site and how many years of registered information are to be covered, an information input function that stores a document as a character string, and an unknown word extraction function that extracts unknown words from the character string, the method being characterized by a search function that searches using an unknown word as a keyword, a search result analysis function that analyzes the search result, and registration of the unknown word in the dictionary when the analysis result satisfies a criterion.

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2002216005A JP2004062262A (en) 2002-07-25 2002-07-25 Method of registering unknown word automatically to dictionary

Publications (1)

Publication Number Publication Date
JP2004062262A true JP2004062262A (en) 2004-02-26

Family

ID=31937878

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2002216005A Pending JP2004062262A (en) 2002-07-25 2002-07-25 Method of registering unknown word automatically to dictionary

Country Status (1)

Country Link
JP (1) JP2004062262A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007010836A1 (en) * 2005-07-15 2007-01-25 Hewlett-Packard Development Company, L.P. Community specific expression detecting device and method
CN101223521B (en) * 2005-07-15 2010-06-16 惠普开发有限公司 Community specific expression detecting device and method
WO2009016729A1 (en) * 2007-07-31 2009-02-05 Fujitsu Limited Voice recognition correlation rule learning system, voice recognition correlation rule learning program, and voice recognition correlation rule learning method
JP5141687B2 (en) * 2007-07-31 2013-02-13 富士通株式会社 Collation rule learning system for speech recognition, collation rule learning program for speech recognition, and collation rule learning method for speech recognition
KR101195341B1 (en) 2009-11-30 2012-10-29 엔이씨 (차이나) 씨오., 엘티디. Method and apparatus for determining category of an unknown word
CN111339250A (en) * 2020-02-20 2020-06-26 北京百度网讯科技有限公司 Mining method of new category label, electronic equipment and computer readable medium
CN111339250B (en) * 2020-02-20 2023-08-18 北京百度网讯科技有限公司 Mining method for new category labels, electronic equipment and computer readable medium
