JP4304146B2

JP4304146B2 - Dictionary registration device, dictionary registration method, and dictionary registration program

Info

Publication number: JP4304146B2
Application number: JP2004349049A
Authority: JP
Inventors: 尚義永江; 幸弘福永
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2004-12-01
Filing date: 2004-12-01
Publication date: 2009-07-29
Anticipated expiration: 2024-12-01
Also published as: JP2006155528A

Description

The present invention relates to a dictionary registration device, a dictionary registration method, and a dictionary registration program for registering, in a dictionary, words that are not yet registered in it.

In recent years, kana-kanji conversion systems have become the usual means of entering Japanese text on personal computers, mobile phones, and similar devices. More recently, speech recognition systems that allow text to be entered by voice have also come into use. When these systems convert hiragana input into mixed kanji-kana text, they choose the best notation string from among the combinations of words registered in the system dictionary. Consequently, if a word with the notation the user wants is not registered in the dictionary, the input is not converted correctly; instead, the system strings together the notations of words that are registered and produces an incorrect notation string.

A dictionary registration device addresses this problem by additionally registering unknown words, that is, words not registered in the system dictionary, in the dictionary. Two methods of registering unknown words have been developed: one in which the user enters the notation, reading, part of speech, and other information for each word on an input screen and registers the words one at a time, and one in which a document specified by the user is morphologically analyzed and the unknown words extracted from it are registered in the dictionary in a batch.

For the method that extracts and registers unknown words from a user-specified document, a function has been developed that, when a word not registered in the system dictionary is extracted, estimates each continuous run of a single character type within the word's substrings to be one unknown word (see, for example, Patent Document 1).

Patent Document 1: JP-A-2-163874

However, a method that estimates a continuous run of a single character type to be one unknown word cannot extract, as a single word, a word that contains several character types, such as a word that combines katakana with Arabic numerals. As a result, to register a word containing several character types correctly as one word, the user has to check and correct the extracted words.
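The limitation described above can be illustrated with a minimal sketch of run-based segmentation. The character classes, the function names `char_type` and `split_by_char_type`, and the sample string are assumptions made for illustration; the cited prior art does not specify them.

```python
import itertools
import unicodedata


def char_type(ch: str) -> str:
    """Return a coarse character class (illustrative categories only)."""
    c = unicodedata.normalize("NFKC", ch)  # fold full-width forms to ASCII
    if len(c) != 1:
        return "other"
    if "0" <= c <= "9":
        return "digit"
    if "A" <= c <= "Z":
        return "upper"
    if "a" <= c <= "z":
        return "lower"
    if "\u3040" <= c <= "\u309f":
        return "hiragana"
    if "\u30a0" <= c <= "\u30ff":
        return "katakana"
    if "\u4e00" <= c <= "\u9fff":
        return "kanji"
    return "other"


def split_by_char_type(text: str) -> list[str]:
    """Split text into maximal runs of a single character class."""
    return ["".join(run) for _, run in itertools.groupby(text, key=char_type)]


# A katakana + digit word is cut into two candidates instead of one.
print(split_by_char_type("ビタミン12"))
```

With this kind of segmentation the katakana part and the numeric part always end up as separate candidates, which is the case the invention aims to handle.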

The present invention has been made in view of the above. Its object is to provide a dictionary registration device, a dictionary registration method, and a dictionary registration program that can also register words containing several character types in the dictionary at the same time. To do so, it not only extracts unknown words by splitting on character type, but also generates an extended unknown word by joining to an extracted unknown word at least one of the words preceding and following it and, when a word similar to the generated extended unknown word is already registered in the dictionary, extracts that extended unknown word as an unknown word as well.

To solve the problems described above and achieve the object, the present invention comprises: dictionary storage means for storing a dictionary that holds words; morphological analysis means for performing morphological analysis on an input document and extracting from it an unknown word that is not registered in the dictionary; unknown word range expansion means for generating an extended unknown word by joining to the unknown word extracted by the morphological analysis means at least one of the words preceding and following it; partial match search means for searching the dictionary for an already registered word whose notation matches the portion of the extended unknown word by which the unknown word was expanded; notation similarity determination means for determining, based on the character attributes of the notation of the portion of the retrieved registered word that corresponds to the unknown word and the character attributes of the notation of the unknown word, the similarity between those two notations; and dictionary registration means for registering the extended unknown word in the dictionary when the notation similarity determination means determines that the two notations are similar.

The present invention also provides a dictionary registration method and a dictionary registration program capable of implementing the above device.

According to the present invention, when unknown words are extracted from a user-specified document and registered in the dictionary, the invention can extract as unknown words not only the notations of words estimated to be unknown by splitting on character type, but also words formed by joining at least one of the preceding and following words to such a word, provided that a similar word is already registered in the dictionary. Words containing several character types can therefore be registered in the dictionary at the same time, which improves the accuracy of unknown word extraction. It also reduces the user's workload, such as checking the extracted unknown words.

Preferred embodiments of the dictionary registration device, dictionary registration method, and dictionary registration program according to the present invention are described in detail below with reference to the accompanying drawings.

(First embodiment)
The dictionary registration device according to the first embodiment extracts unknown words from a user-specified document by morphological analysis and expands the range of each extracted unknown word. If the dictionary contains a word that matches the expanded portion of the extended unknown word and whose portion corresponding to the unknown word has the same character type as the unknown word, the device registers the extended unknown word in the dictionary together with the extracted unknown word.

FIG. 1 is a block diagram showing the configuration of the dictionary registration device 100 according to the first embodiment. As shown in the figure, the dictionary registration device 100 includes an input/output control unit 101, a morphological analysis unit 102, an unknown word range expansion unit 103, a partial match search unit 104, a notation similarity determination unit 105, a dictionary registration unit 106, and a user I/F 110. The dictionary registration device 100 according to this embodiment also holds a word string buffer 131 and an unknown word buffer 132 in a RAM (Random Access Memory) 130, and stores analysis rules 120 and a dictionary 121 on a hard disk drive (HDD). The HDD corresponds to the dictionary storage means of the present invention.

The input/output control unit 101 is a processing unit that controls the user I/F 110. It issues input/output instructions to the user I/F 110 and passes input/output data between the user I/F 110 and the other functional units.

The morphological analysis unit 102 performs morphological analysis on the input document with reference to the analysis rules 120 and the dictionary 121, splits the document into words, and stores the words in the word string buffer 131.

The unknown word range expansion unit 103 generates extended unknown words and, working together with the partial match search unit 104 and the notation similarity determination unit 105 described below, determines whether registering each extended unknown word in the dictionary is valid. If it is, the unit stores the extended unknown word in the unknown word buffer 132 for dictionary registration. Here, an extended unknown word is a word formed by expanding an unknown word extracted by the morphological analysis unit 102 with the words before it, after it, or both; that is, by joining at least one of the preceding and following words to the extracted unknown word.

The partial match search unit 104 searches the dictionary 121 for words that match, among the substrings of the extended unknown word generated by the unknown word range expansion unit 103, the portion by which the unknown word was expanded.

The notation similarity determination unit 105 determines whether, among the substrings of a word found by the partial match search unit 104, the portion corresponding to the unknown word is similar to the unknown word extracted by the morphological analysis unit 102.

The dictionary registration unit 106 registers the extended unknown word generated by the unknown word range expansion unit 103 in the dictionary 121 when the notation similarity determination unit 105 has determined that the portion of the word found by the partial match search unit 104 that corresponds to the unknown word is similar to the unknown word extracted by the morphological analysis unit 102.

The user I/F 110 consists of a display device such as a monitor and input devices such as a keyboard and a mouse. It displays the document designation screen and the unknown word confirmation screen and accepts input operations from those screens.

The analysis rules 120 describe the rules needed for morphological analysis, such as grammatical rules (for example, the degree of connection between parts of speech) and priority rules for word selection.

The dictionary 121 holds words and is of the kind used in ordinary kana-kanji conversion systems, speech recognition systems, and the like. FIG. 2 is an explanatory diagram showing an example of the structure of the dictionary 121. As shown in the figure, the dictionary 121 stores, for each entry, a dictionary number, the notation of the word, the reading of the word, and the part of speech of the word.

The RAM 130 is a random-access memory and serves as a storage unit that temporarily holds the word string buffer 131 and the unknown word buffer 132. The word string buffer 131 stores the word string extracted by the morphological analysis unit 102. The unknown word buffer 132 stores the unknown words extracted by the morphological analysis unit 102 and the extended unknown words generated by the unknown word range expansion unit 103.

FIG. 3 is an explanatory diagram showing an example of the structure of the word string buffer 131. As shown in the figure, the word string buffer 131 stores, for each word, a word number, the notation of the word, the reading of the word, and the part of speech of the word.

FIG. 4 is an explanatory diagram showing an example of the structure of the unknown word buffer 132. As shown in the figure, the unknown word buffer 132 stores the notation of each unknown word or extended unknown word together with an analysis result word number, which is the word number of the corresponding word in the word string buffer 131. For an extended unknown word, the word numbers of all the joined words are listed in the analysis result word number field.
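The structures described for the word string buffer 131 and the unknown word buffer 132 could be modeled roughly as follows. This is a sketch under stated assumptions: the class names, field names, the helper `make_extended`, and the sample readings and part-of-speech labels are all hypothetical and do not appear in the figures.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One row of the word string buffer (word number, notation, reading, POS)."""
    number: int
    notation: str
    reading: str
    pos: str


@dataclass
class UnknownEntry:
    """One row of the unknown word buffer."""
    notation: str
    word_numbers: list[int]  # analysis result word numbers, in joining order


def make_extended(words: list[Word], numbers: list[int]) -> UnknownEntry:
    """Join the words with the given numbers into one extended unknown word."""
    by_number = {w.number: w for w in words}
    notation = "".join(by_number[n].notation for n in numbers)
    return UnknownEntry(notation, list(numbers))


buffer = [
    Word(1, "ビタミン", "びたみん", "noun"),   # hypothetical sample rows
    Word(2, "12", "じゅうに", "unknown"),
]
print(make_extended(buffer, [1, 2]))
```

Keeping the word numbers rather than only the joined notation lets a later step look up the readings and parts of speech of the component words.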

Next, the unknown word registration process performed by the dictionary registration device 100 according to the first embodiment, configured as described above, is explained. FIG. 5 is a flowchart showing the overall flow of the unknown word registration process in the first embodiment.

First, the input/output control unit 101 displays the document designation screen on the user I/F 110 (step S501). FIGS. 6-1 and 6-2 are schematic diagrams showing the contents of the document designation screen. As shown in FIG. 6-1, the document designation screen displays a browse button 601, a delete button 602, a next button 603, and a cancel button 604. When the browse button 601 is pressed, a file browse screen 605 such as that shown in FIG. 6-2 is displayed. On the file browse screen 605, the user can specify a document file that contains words to be registered as new words. More than one document file can be specified. Selecting a specified document file and pressing the delete button 602 removes that file from the selection. When the next button 603 is pressed, the specified document files are accepted and the unknown word registration process starts. When the cancel button 604 is pressed, the unknown word registration process is aborted.

When a document file name has been specified on the document designation screen and the next button 603 is pressed, the morphological analysis unit 102 performs morphological analysis on the document in the specified file and stores the resulting words in the word string buffer 131 (step S502). Next, the unknown word range expansion unit 103 acquires a word from the word string buffer 131 (step S503), refers to the part-of-speech information of the acquired word, and determines whether the word is an unknown word or a word equivalent to an unknown word (hereinafter simply called an unknown word) (step S504). Here, a word equivalent to an unknown word is one that, even if it is in the dictionary, carries no meaning as a word, such as alphabetic letters or digits.

If the acquired word is not an unknown word (step S504: NO), processing moves to the check of whether all words in the word string buffer 131 have been processed (step S515). If the acquired word is an unknown word (step S504: YES), it is stored in the unknown word buffer 132 (step S505).

Next, the unknown word range expansion unit 103 generates an extended unknown word by joining the word that precedes the acquired unknown word to it, and passes the extended unknown word to the partial match search unit 104. The partial match search unit 104 searches the dictionary 121 for words that prefix-match the extended unknown word excluding the portion corresponding to the unknown word (step S506).
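The prefix-match search of step S506 might be sketched as below. The function name `prefix_match`, the list-of-strings dictionary, and the sample entries are hypothetical. Requiring the registered word to be longer than the prefix is also an assumption, made so that some portion remains to be compared with the unknown word later.

```python
def prefix_match(dictionary: list[str], preceding: str) -> list[str]:
    """Return registered notations that begin with the joined preceding word(s).

    Entries equal to the prefix itself are skipped (assumption): they would
    leave no portion corresponding to the unknown word.
    """
    return [
        notation
        for notation in dictionary
        if notation.startswith(preceding) and len(notation) > len(preceding)
    ]


# Extended unknown word "ビタミン12" = preceding word "ビタミン" + unknown word "12".
registered = ["ビタミン6", "ビタミン剤", "ミネラル", "ビタミン"]
print(prefix_match(registered, "ビタミン"))
```

A linear scan is used only for brevity; a real dictionary would more likely use a trie or sorted index for prefix lookups.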

The unknown word range expansion unit 103 determines whether a matching word exists in the dictionary 121 (step S507). If none exists (step S507: NO), processing moves to the check of whether the first word has been reached, that is, whether preceding words have been joined all the way back to the first word of the document (step S514).

If a matching word exists in the dictionary 121 (step S507: YES), the unknown word range expansion unit 103 joins the word following the unknown word to extend it further and stores the resulting extended unknown word in the unknown word buffer 132 (step S508). On the first pass immediately after a preceding word has been joined, no following word is joined; the subsequent partial match search and notation similarity determination are performed on the extended unknown word that has only the preceding word joined. The same processing is then applied to extended unknown words formed by joining the following words one at a time.

Next, the partial match search unit 104 searches the dictionary 121 for words that partially match the character string of the extended unknown word excluding the portion corresponding to the unknown word (step S509). The unknown word range expansion unit 103 then determines whether a matching word exists in the dictionary 121 (step S510). If none exists (step S510: NO), processing moves to the check of whether the last word has been reached, that is, whether words have been joined all the way to the last word of the document (step S513).

If a matching word exists in the dictionary 121 (step S510: YES), the notation similarity determination unit 105 determines whether the character type of the portion of the matching word that corresponds to the unknown word is the same as the character type of the unknown word acquired from the word string buffer 131 (step S511). If it is not the same (step S511: NO), the extended unknown word is deleted from the unknown word buffer 132 (step S512). This is because no word similar to the extended unknown word is registered in the dictionary 121, so additionally registering the extended unknown word in the dictionary 121 is judged to be inappropriate.
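The character type comparison of step S511 could look like the following sketch for the case where only a preceding word has been joined. The text does not say exactly how "the same character type" is tested; comparing the sequence of character-type runs, the character classes, and the names `char_type`, `type_runs`, and `same_char_type` are assumptions made here.

```python
import itertools
import unicodedata


def char_type(ch: str) -> str:
    """Return a coarse character class (illustrative categories only)."""
    c = unicodedata.normalize("NFKC", ch)  # fold full-width forms to ASCII
    if len(c) != 1:
        return "other"
    if "0" <= c <= "9":
        return "digit"
    if "A" <= c <= "Z":
        return "upper"
    if "a" <= c <= "z":
        return "lower"
    if "\u3040" <= c <= "\u309f":
        return "hiragana"
    if "\u30a0" <= c <= "\u30ff":
        return "katakana"
    if "\u4e00" <= c <= "\u9fff":
        return "kanji"
    return "other"


def type_runs(text: str) -> list[str]:
    """Collapse text to its sequence of character-type runs, e.g. '12' -> ['digit']."""
    return [t for t, _ in itertools.groupby(char_type(c) for c in text)]


def same_char_type(registered: str, expanded_part: str, unknown: str) -> bool:
    """Compare the part of `registered` after `expanded_part` with `unknown`."""
    counterpart = registered[len(expanded_part):]
    return bool(counterpart) and type_runs(counterpart) == type_runs(unknown)


print(same_char_type("ビタミン6", "ビタミン", "12"))   # digits vs digits
print(same_char_type("ビタミン剤", "ビタミン", "12"))  # kanji vs digits
```

Comparing runs rather than individual characters lets a one-digit counterpart match a two-digit unknown word; a stricter per-character comparison would be an equally plausible reading.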

After the extended unknown word has been deleted, or if the character types are the same (step S511: YES), the unknown word range expansion unit 103 determines whether the last word of the document has been processed (step S513). If not (step S513: NO), processing is repeated for the next following word (step S508). If the last word has been processed (step S513: YES), processing moves to the next step.

The unknown word range expansion unit 103 determines whether the first word of the document has been processed (step S514). If not (step S514: NO), processing is repeated for the next preceding word (step S506). If the first word has been processed (step S514: YES), processing moves to the next step.

The unknown word range expansion unit 103 determines whether all words in the word string buffer 131 have been processed (step S515). If not (step S515: NO), it acquires the next word from the word string buffer 131 and repeats the processing (step S503). If all words have been processed (step S515: YES), processing moves to the next step.
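The backward and forward expansion loops of steps S506 to S514 can be summarized in a simplified sketch. It is not the flowchart of FIG. 5 itself. It works on plain notation strings, treats "partial match" as "starts with the joined preceding words and ends with the joined following words" (an assumption), and returns only candidates that pass both checks instead of storing them in a buffer and deleting them afterwards. The names `extend_unknown` and `both_digits` and the sample data are hypothetical.

```python
from typing import Callable


def extend_unknown(
    words: list[str],
    i: int,
    dictionary: list[str],
    same_type: Callable[[str, str], bool],
) -> list[str]:
    """Return extended unknown words built around the unknown word words[i]."""
    unknown = words[i]
    accepted = []
    for start in range(i - 1, -1, -1):            # S506/S514: add preceding words
        front = "".join(words[start:i])
        if not any(e.startswith(front) for e in dictionary):  # S507
            continue
        for end in range(i, len(words)):          # S508/S513: add following words
            back = "".join(words[i + 1:end + 1])  # empty on the first pass
            for entry in dictionary:              # S509/S510: partial match
                if (len(entry) > len(front) + len(back)
                        and entry.startswith(front)
                        and entry.endswith(back)):
                    middle = entry[len(front):len(entry) - len(back)]
                    if same_type(middle, unknown):  # S511
                        accepted.append(front + unknown + back)
                        break
    return accepted


def both_digits(a: str, b: str) -> bool:
    """Toy stand-in for the notation similarity check."""
    return a.isdigit() and b.isdigit()


words = ["この", "ビタミン", "12", "は"]
print(extend_unknown(words, 2, ["ビタミン6"], both_digits))
```

Passing the similarity check in as a parameter keeps the loop the same for a character-type test or a numeric similarity score.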

The input/output control unit 101 displays on the user I/F 110 an unknown word confirmation screen on which the user confirms whether to register in the dictionary 121 the unknown words and extended unknown words extracted by the above processing (step S516).

FIG. 7 is a schematic diagram showing the contents of the unknown word confirmation screen. As shown in the figure, the unknown word confirmation screen lists the notation, reading, and part of speech of each unknown word or extended unknown word, with a check box 701 displayed to the left of each. At the bottom of the screen are a select all button 702, a clear all button 703, an edit button 704, a back button 705, a next button 706, and a cancel button 707.

By ticking a check box 701, the user specifies that the unknown word or extended unknown word displayed to its right is to be registered in the dictionary 121. Pressing the select all button 702 ticks every check box 701. Pressing the clear all button 703 clears every check box 701. Pressing the edit button 704 displays a screen (not shown) on which the user can correct the reading and part of speech.

Pressing the back button 705 returns to the document designation screen, where document files can be specified again. Pressing the next button 706 registers the specified unknown words or extended unknown words in the dictionary 121. Pressing the cancel button 707 aborts the unknown word registration process.

Once the unknown word confirmation screen is displayed, the input/output control unit 101 determines whether the next button 706 has been pressed (step S517). If it has not (step S517: NO), the unit waits for the next button 706 to be pressed. If it has (step S517: YES), the dictionary registration unit 106 registers the specified words in the dictionary 121 (step S518), and the unknown word registration process ends.

In this way, the dictionary registration device 100 according to the first embodiment can extract and register in the dictionary not only the notations of words estimated to be unknown by morphological analysis of a user-specified document, but also words formed by joining at least one of the preceding and following words to such a word, provided that they are similar to a word already registered in the dictionary 121.

(Second embodiment)
The dictionary registration device according to the second embodiment calculates the similarity between an extended unknown word and a similar registered word according to predetermined similarity determination rules and, if the calculated value is greater than a predetermined value, registers the extended unknown word in the dictionary together with the extracted unknown word.

FIG. 8 is a block diagram showing the configuration of the dictionary registration device 800 according to the second embodiment. As shown in the figure, the dictionary registration device 800 includes an input/output control unit 101, a morphological analysis unit 102, an unknown word range expansion unit 103, a partial match search unit 104, a notation similarity determination unit 105, a dictionary registration unit 106, and a user I/F 110. The dictionary registration device 800 according to this embodiment also holds a word string buffer 131 and an unknown word buffer 132 in the RAM 130, and stores analysis rules 120, a dictionary 121, and a similarity determination rule table 801 on the HDD. The HDD corresponds to the dictionary storage means and the similarity determination rule storage means of the present invention.

The second embodiment differs from the first in that the similarity determination rule table 801 has been added. The other components and functions are the same as those in FIG. 1, the block diagram of the dictionary registration device 100 according to the first embodiment, so they are given the same reference numerals and their description is omitted here.

The similarity determination rule table 801 holds a character similarity for each combination of the character type of the comparison source character and the character type of the comparison target character. FIG. 9 is an explanatory diagram showing an example of the structure of the similarity determination rule table 801. As shown in the figure, the table stores a comparison source character, a comparison target character, and a character similarity.

The similarity determination rule table 801 thus stores a character similarity for each pair of corresponding characters. The similarity between two whole character strings is therefore calculated as the average of the character similarities of their corresponding characters. As an example, the calculation of the similarity between the character string "ABC" and the character string "Def" according to the similarity determination rule table 801 is shown below.

The first character "A" of the string "ABC" and the first character "D" of the string "Def" are both uppercase letters, for which the character similarity defined in the similarity determination rule table 801 is 100. The second character "B" of "ABC" is an uppercase letter and the second character "e" of "Def" is a lowercase letter, so the character similarity is 90. Likewise, the last character "C" of "ABC" is an uppercase letter and the last character "f" of "Def" is a lowercase letter, so the character similarity is 90. The average of these character similarities, (100 + 90 + 90) / 3, which is approximately 93, is the similarity between the strings "ABC" and "Def".
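The worked example can be reproduced with a short sketch. Only the two rule values used in the example (uppercase to uppercase: 100, uppercase to lowercase: 90) come from the text; the dictionary layout `RULES`, the score of 0 for pairs missing from the table, and the function names are assumptions.

```python
import unicodedata

# (comparison source type, comparison target type) -> character similarity.
# Only these two values appear in the worked example.
RULES = {("upper", "upper"): 100, ("upper", "lower"): 90}


def char_type(ch: str) -> str:
    """Classify a letter as upper/lower; full-width forms are folded first."""
    c = unicodedata.normalize("NFKC", ch)
    if len(c) == 1 and "A" <= c <= "Z":
        return "upper"
    if len(c) == 1 and "a" <= c <= "z":
        return "lower"
    return "other"


def similarity(source: str, target: str) -> float:
    """Average the per-character similarities of two equal-length strings."""
    if len(source) != len(target) or not source:
        raise ValueError("this sketch handles non-empty strings of equal length")
    scores = [
        RULES.get((char_type(s), char_type(t)), 0)  # missing pair -> 0 (assumption)
        for s, t in zip(source, target)
    ]
    return sum(scores) / len(scores)


print(similarity("ABC", "Def"))  # (100 + 90 + 90) / 3
```

The unrounded average is 280 / 3, which the text reports as 93.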

When the character strings being compared differ in the number of characters, the device may be configured so that, if a character is of the same character type as the immediately preceding character, a predetermined proportion (for example, 80%) of the preceding character's character similarity is used as that character's similarity. Furthermore, when different similarities could be calculated depending on the positions of the characters compared, the device may be configured to use the maximum value as the similarity.
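One possible reading of the length-mismatch option is sketched below in Python. The text does not say what happens when an extra character is of a different type from its predecessor, or whether the proportion compounds over several extra characters; the sketch assumes a score of 0 in the first case and compounding in the second. The names `char_type` and `extend_scores` are hypothetical.

```python
def char_type(ch):
    """Coarse character type used only for this illustration."""
    if ch.isascii() and ch.isupper():
        return "upper"
    if ch.isascii() and ch.islower():
        return "lower"
    return "other"


def extend_scores(aligned_scores, longer, ratio_percent=80):
    """Extend per-character scores over the unpaired tail of the longer string.

    aligned_scores: similarities already computed for the paired characters.
    longer: the longer of the two strings being compared.
    """
    scores = list(aligned_scores)
    for i in range(len(scores), len(longer)):
        same_type = i > 0 and char_type(longer[i]) == char_type(longer[i - 1])
        if same_type and scores:
            # Predetermined proportion of the preceding character's similarity.
            scores.append(scores[-1] * ratio_percent / 100)
        else:
            # Assumption: the text does not cover this case, so score it as 0.
            scores.append(0)
    return scores


scores = extend_scores([100, 100, 100], "DMEX")
print(scores, sum(scores) / len(scores))
```

Under these assumptions, comparing a three-letter uppercase string with "DMEX" scores the extra "X" at 80, whereas an extra kanji after "DME" would score 0.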

Next, the unknown word registration process performed by the dictionary registration device 800 according to the second embodiment, configured as described above, is described. FIG. 10 is a flowchart showing the overall flow of the unknown word registration process in the second embodiment.

The document designation screen display process, the morphological analysis process, and the extended unknown word search process in steps S1001 to S1010 are the same as steps S501 to S510 in the dictionary registration device 100 according to the first embodiment, so their description is omitted.

When the dictionary 121 contains a word that partially matches the character string of the extended unknown word other than its unknown-word-equivalent portion (step S1010: YES), the notation similarity determination unit 105 calculates, according to the similarity determination rule table 801, the similarity between the unknown word and the unknown-word-equivalent portion of the partially matching word (step S1011). The notation similarity determination unit 105 then determines whether the calculated value is larger than a predetermined value (step S1012).

If the similarity is smaller than the predetermined value (step S1012: NO), the extended unknown word is deleted from the unknown word buffer 132 (step S1013). If the similarity is larger than the predetermined value (step S1012: YES), processing moves on to determining whether the last word of the document has been processed (step S1014).

The processing completion check, the unknown word confirmation screen display process, and the dictionary registration process in steps S1014 to S1019 are the same as steps S513 to S518 in the dictionary registration device 100 according to the first embodiment, so their description is omitted.

FIGS. 11-1 to 11-7 are explanatory diagrams showing an example in which the dictionary registration device 800 according to the second embodiment, following the unknown word registration process described above, detects an unknown word and extended unknown words in a document file designated by the user and registers them in the dictionary 121.

The example in FIGS. 11-1 to 11-7 shows the case where the user designates a document file containing the document shown in FIG. 11-1. First, the morphological analysis unit 102 performs morphological analysis on the document shown in FIG. 11-1, and the resulting word string is stored in the word string buffer 131 (step S1002). FIG. 3, described earlier, shows part of the contents of the word string buffer 131 at this point. As shown in FIG. 3, in this example the word "DME" with word number 51 is extracted as an unknown word.

Referring to the word string buffer 131, the unknown word range expansion unit 103 stores the word "DME" with word number 51 in the unknown word buffer 132 (step S1005). The unknown word buffer 132 is then in the state shown in FIG. 11-2.

Next, the immediately preceding word "東芝" (Toshiba) is joined to form the extended unknown word "東芝DME", and the dictionary 121 is searched for words that prefix-match "東芝", the part of this extended unknown word other than the unknown-word-equivalent portion (step S1006). For example, if the words shown in FIG. 2 are registered in the dictionary 121, the prefix-matching words "東芝", "東芝AVE", and "東芝AVE株式会社" (Toshiba AVE Co., Ltd.) are found. Because prefix-matching words exist, the extended unknown word "東芝DME" is stored in the unknown word buffer 132 (step S1008). The unknown word buffer 132 is then in the state shown in FIG. 11-3.

The similarity between the unknown word "DME" and the unknown-word-equivalent portion of each of these words is calculated as follows. "東芝" has no unknown-word-equivalent portion, so its similarity is 0. The unknown-word-equivalent portion "AVE" of "東芝AVE" matches "DME" in both number of characters and character types, so its similarity is 100. The unknown-word-equivalent portion "AVE株式会社" of "東芝AVE株式会社" is four characters longer; when the character similarity of those extra characters is taken as 0, the similarity is 43 (= 300 / 7, rounded). If the reference value for judging two notations to be similar is 75, a word satisfying the condition, "東芝AVE" (similarity 100), exists in the dictionary 121, so the extended unknown word "東芝DME" is not deleted and remains in the unknown word buffer 132 (step S1012).
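The three similarity values in this example, and the decision against the reference value 75, can be reproduced with a short Python sketch. It assumes that unpaired extra characters score 0 (as in the example), that uppercase/lowercase pairs score 90 (as in the earlier "ABC"/"Def" example), and, as a simplification not stated in the text, that any two characters of the same coarse type score 100. The function names are hypothetical.

```python
from itertools import zip_longest

THRESHOLD = 75  # reference value used in the worked example


def char_type(ch):
    """Coarse character type used only for this illustration."""
    if ch.isascii() and ch.isupper():
        return "upper"
    if ch.isascii() and ch.islower():
        return "lower"
    return "other"


def char_similarity(x, y):
    """Per-character similarity; None marks a character with no counterpart."""
    if x is None or y is None:
        return 0  # extra characters are scored 0, as in the example
    tx, ty = char_type(x), char_type(y)
    if tx == ty:
        return 100  # assumption: same coarse type always scores 100
    if {tx, ty} == {"upper", "lower"}:
        return 90
    return 0


def similarity(unknown, candidate_part):
    """Rounded average similarity; 0 when either side is empty."""
    if not unknown or not candidate_part:
        return 0
    pairs = list(zip_longest(unknown, candidate_part))
    return round(sum(char_similarity(x, y) for x, y in pairs) / len(pairs))


def keep_extended_word(unknown, candidate_parts):
    """Keep the extended unknown word if any candidate exceeds the threshold."""
    return any(similarity(unknown, part) > THRESHOLD for part in candidate_parts)


parts = ["", "AVE", "AVE株式会社"]  # unknown-word-equivalent portions of the three hits
print([similarity("DME", p) for p in parts], keep_extended_word("DME", parts))
```

A single candidate above the reference value is enough to keep the extended unknown word in the buffer, which is why the low score for "AVE株式会社" does not cause "東芝DME" to be deleted.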

Next, the extended unknown word "東芝DME株式", formed by joining the following word "株式", is stored in the unknown word buffer 132 (step S1008). The unknown word buffer 132 is then in the state shown in FIG. 11-4.

The dictionary 121 is searched for words that partially match "東芝" and "株式", the parts of this extended unknown word other than the unknown-word-equivalent portion. No such word exists, so the extended unknown word "東芝DME株式" is deleted from the unknown word buffer 132 (step S1013).

Similarly, the extended unknown word "東芝DME株式会社", formed by joining the next following word "会社", is stored in the unknown word buffer 132. The partial match search finds the similar word "東芝AVE株式会社", and the similarity of its unknown-word-equivalent portion "AVE" is 100, so this extended unknown word remains in the unknown word buffer 132. The unknown word buffer 132 is then in the state shown in FIG. 11-5.
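The partial match searches in this walkthrough can be illustrated with the Python sketch below. It treats a partial match as a registered word of the form prefix + middle + suffix, where the prefix and suffix are the words joined before and after the unknown word and the middle is the unknown-word-equivalent portion. This interpretation, the list-based dictionary, and the name `partial_matches` are assumptions for illustration; the patent does not specify the data structure or search algorithm.

```python
def partial_matches(dictionary, prefix, suffix=""):
    """Return (word, middle) pairs for words shaped prefix + middle + suffix.

    The middle part is the unknown-word-equivalent portion of the word.
    """
    results = []
    for word in dictionary:
        if len(word) < len(prefix) + len(suffix):
            continue  # too short to contain both context parts
        if word.startswith(prefix) and word.endswith(suffix):
            middle = word[len(prefix):len(word) - len(suffix)]
            results.append((word, middle))
    return results


# The three registered words named in the example.
DICTIONARY = ["東芝", "東芝AVE", "東芝AVE株式会社"]

print(partial_matches(DICTIONARY, "東芝"))              # context of "東芝DME"
print(partial_matches(DICTIONARY, "東芝", "株式"))      # context of "東芝DME株式"
print(partial_matches(DICTIONARY, "東芝", "株式会社"))  # context of "東芝DME株式会社"
```

With these three words, the search for the context of "東芝DME株式" returns nothing, while the search for the context of "東芝DME株式会社" returns "東芝AVE株式会社" with the middle part "AVE", matching the walkthrough. A production dictionary would more likely use a trie or an index rather than a linear scan.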

After this, the extended unknown word "東芝DME株式会社」", formed by joining the following word "」" (a closing bracket), is stored in the unknown word buffer 132 (FIG. 11-6), and the same processing is repeated up to the end of the sentence. Once the end of the sentence has been reached, the next preceding word is joined and the same processing is repeated back to the beginning of the sentence (FIG. 11-7).

As described above, the dictionary registration device 800 according to the second embodiment can determine the similarity between an extended unknown word and a registered word according to the similarity determination rule table 801, which stores a character similarity for each character type of the characters compared. As a result, even an extended unknown word whose character types are not identical to those of the registered word can be extracted as an unknown word, which improves the accuracy of unknown word extraction.

The dictionary registration devices according to the first and second embodiments display the unknown words and extended unknown words on a screen so that the user can confirm them. Alternatively, they may be configured to register the unknown words and extended unknown words automatically as they are, without displaying the confirmation screen or having the user select words.

In addition, the dictionary registration devices according to the first and second embodiments first join preceding words to generate extended unknown words and determine their similarity, and then join following words to generate extended unknown words. They may instead be configured to join following words first, or to join only preceding words or only following words.

The dictionary registration device according to the first or second embodiment includes a control device such as a CPU, storage devices such as a ROM (Read Only Memory) and a RAM, external storage devices such as an HDD and a CD drive, a display device such as a display, and input devices such as a keyboard and a mouse. It has the hardware configuration of an ordinary computer.

The dictionary registration program executed by the dictionary registration device according to the first or second embodiment is provided as a file in an installable or executable format, recorded on a computer-readable recording medium such as a CD-ROM, a flexible disk (FD), a CD-R, or a DVD (Digital Versatile Disk).

The dictionary registration program executed by the dictionary registration device according to the first or second embodiment may also be stored on a computer connected to a network such as the Internet and provided by being downloaded via the network. The program may likewise be provided or distributed via a network such as the Internet.

The dictionary registration program of the first or second embodiment may also be provided preinstalled in a ROM or the like.

The dictionary registration program executed by the dictionary registration device according to the first or second embodiment has a module configuration comprising the units described above (input/output control unit, morphological analysis unit, unknown word range expansion unit, partial match search unit, notation similarity determination unit, and dictionary registration unit). In terms of actual hardware, a CPU (processor) reads the dictionary registration program from the recording medium and executes it, whereby each of these units is loaded onto and generated on the main storage device.

As described above, the dictionary registration device, dictionary registration method, and dictionary registration program according to the present invention are suitable for document creation systems, kana-kanji conversion systems, and speech recognition systems that have a function for additionally registering words not registered in a dictionary.

Brief description of the drawings

A block diagram showing the configuration of the dictionary registration device according to the first embodiment.
An explanatory diagram showing an example of the dictionary.
An explanatory diagram showing an example of the word string buffer.
An explanatory diagram showing an example of the unknown word buffer.
A flowchart showing the unknown word registration process in the dictionary registration device according to the first embodiment.
A schematic diagram showing an example of the document designation screen.
A schematic diagram showing an example of the document designation screen.
A schematic diagram showing an example of the unknown word confirmation screen.
A block diagram showing the configuration of the dictionary registration device according to the second embodiment.
An explanatory diagram showing an example of the conversion notation rule table.
A flowchart showing the unknown word registration process in the dictionary registration device according to the second embodiment.
A schematic diagram showing an example of the unknown word registration process in the dictionary registration device according to the second embodiment.
A schematic diagram showing an example of the unknown word registration process in the dictionary registration device according to the second embodiment.
A schematic diagram showing an example of the unknown word registration process in the dictionary registration device according to the second embodiment.
A schematic diagram showing an example of the unknown word registration process in the dictionary registration device according to the second embodiment.
A schematic diagram showing an example of the unknown word registration process in the dictionary registration device according to the second embodiment.
A schematic diagram showing an example of the unknown word registration process in the dictionary registration device according to the second embodiment.
A schematic diagram showing an example of the unknown word registration process in the dictionary registration device according to the second embodiment.

Explanation of symbols

100, 800 Dictionary registration device
101 Input/output control unit
102 Morphological analysis unit
103 Unknown word range expansion unit
104 Partial match search unit
105 Notation similarity determination unit
106 Dictionary registration unit
110 User I/F
120 Analysis rules
121 Dictionary
130 RAM
131 Word string buffer
132 Unknown word buffer
601 Browse button
602 Delete button
603 Next button
604 Cancel button
605 File browse screen
701 Check box
702 Select all button
703 Deselect all button
704 Modify button
705 Back button
706 Next button
707 Cancel button
801 Similarity determination rule table

Claims

A dictionary registration device comprising:
dictionary storage means for storing a dictionary holding words;
morphological analysis means for performing morphological analysis on an input document and extracting, from the input document, an unknown word that is not registered in the dictionary;
unknown word range expansion means for generating an extended unknown word in which at least one of the words preceding and following the unknown word extracted by the morphological analysis means is joined to the unknown word;
partial match search means for searching the dictionary for a registered word, that is, a word registered in the dictionary, that matches the notation of the portion of the extended unknown word, generated by the unknown word range expansion means, by which the unknown word was expanded;
notation similarity determination means for determining, based on a character attribute of the notation of the portion corresponding to the unknown word in the registered word found by the partial match search means and a character attribute of the notation of the unknown word, the similarity between the notation of that portion and the notation of the unknown word; and
dictionary registration means for registering the extended unknown word in the dictionary when the notation similarity determination means determines that the notation of the portion of the registered word corresponding to the unknown word is similar to the notation of the unknown word.

The dictionary registration device according to claim 1, wherein the notation similarity determination means determines that the notation of the portion of the registered word corresponding to the unknown word is similar to the notation of the unknown word when the character type of the notation of that portion and the character type of the notation of the unknown word are the same.

The dictionary registration device according to claim 1, further comprising similarity determination rule storage means for storing a similarity determination rule table that holds a character similarity for each character type of a comparison-source character and a comparison-destination character,
wherein the notation similarity determination means determines that the notation of the portion of the registered word corresponding to the unknown word is similar to the notation of the unknown word when the similarity value, calculated based on the similarity determination rule table, between the character type of the notation of that portion and the character type of the notation of the unknown word is larger than a predetermined value.

A dictionary registration method comprising:
a morphological analysis step in which morphological analysis means performs morphological analysis on an input document and extracts, from the input document, an unknown word that is not registered in a dictionary stored in dictionary storage means for storing a dictionary holding words;
an unknown word range expansion step in which unknown word range expansion means generates an extended unknown word in which at least one of the words preceding and following the unknown word extracted in the morphological analysis step is joined to the unknown word;
a partial match search step in which partial match search means searches the dictionary for a registered word, that is, a word registered in the dictionary, that matches the notation of the portion of the extended unknown word, generated in the unknown word range expansion step, by which the unknown word was expanded;
a notation similarity determination step in which notation similarity determination means determines, based on a character attribute of the notation of the portion corresponding to the unknown word in the registered word found in the partial match search step and a character attribute of the notation of the unknown word, the similarity between the notation of that portion and the notation of the unknown word; and
a dictionary registration step in which dictionary registration means registers the extended unknown word in the dictionary when it is determined in the notation similarity determination step that the notation of the portion of the registered word corresponding to the unknown word is similar to the notation of the unknown word.

The dictionary registration method according to claim 4, wherein, in the notation similarity determination step, the notation similarity determination means determines that the notation of the portion of the registered word corresponding to the unknown word is similar to the notation of the unknown word when the character type of the notation of that portion and the character type of the notation of the unknown word are the same.

The dictionary registration method according to claim 4, wherein, in the notation similarity determination step, the notation similarity determination means determines that the notation of the portion of the registered word corresponding to the unknown word is similar to the notation of the unknown word when the similarity value between the character type of the notation of that portion and the character type of the notation of the unknown word, calculated based on a similarity determination rule table that is stored in storage means and holds a character similarity for each character type of a comparison-source character and a comparison-destination character, is larger than a predetermined value.

A dictionary registration program for causing a computer to execute:
a morphological analysis procedure in which morphological analysis means performs morphological analysis on an input document and extracts, from the input document, an unknown word that is not registered in a dictionary stored in dictionary storage means for storing a dictionary holding words;
an unknown word range expansion procedure in which unknown word range expansion means generates an extended unknown word in which at least one of the words preceding and following the unknown word extracted in the morphological analysis procedure is joined to the unknown word;
a partial match search procedure in which partial match search means searches the dictionary for a registered word, that is, a word registered in the dictionary, that matches the notation of the portion of the extended unknown word, generated in the unknown word range expansion procedure, by which the unknown word was expanded;
a notation similarity determination procedure in which notation similarity determination means determines, based on a character attribute of the notation of the portion corresponding to the unknown word in the registered word found in the partial match search procedure and a character attribute of the notation of the unknown word, the similarity between the notation of that portion and the notation of the unknown word; and
a dictionary registration procedure in which dictionary registration means registers the extended unknown word in the dictionary when it is determined in the notation similarity determination procedure that the notation of the portion of the registered word corresponding to the unknown word is similar to the notation of the unknown word.

The dictionary registration program according to claim 7, wherein, in the notation similarity determination procedure, the notation similarity determination means determines that the notation of the portion of the registered word corresponding to the unknown word is similar to the notation of the unknown word when the character type of the notation of that portion and the character type of the notation of the unknown word are the same.

The dictionary registration program according to claim 7, wherein, in the notation similarity determination procedure, the notation similarity determination means determines that the notation of the portion of the registered word corresponding to the unknown word is similar to the notation of the unknown word when the similarity value between the character type of the notation of that portion and the character type of the notation of the unknown word, calculated based on a similarity determination rule table that is stored in storage means and holds a character similarity for each character type of a comparison-source character and a comparison-destination character, is larger than a predetermined value.