JP4941495B2

JP4941495B2 - User dictionary creation system, method, and program

Info

Publication number: JP4941495B2
Application number: JP2009084096A
Authority: JP
Inventors: 敬子稲垣
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2009-03-31
Filing date: 2009-03-31
Publication date: 2012-05-30
Anticipated expiration: 2029-03-31
Also published as: JP2010237351A

Abstract

PROBLEM TO BE SOLVED: To provide a system, a method, and a program for user dictionary registration that improve the efficiency of registration while preventing erroneous recognition when a new word is registered in a user dictionary.

SOLUTION: The user dictionary preparing system includes a text inputting means 11 for inputting a character string, an unknown-word extracting means 22 for extracting an unknown word from the inputted character string, a similarity calculating means 32 for calculating the similarity between the extracted unknown word and a registered word already registered in the dictionary, an environmental information extracting means 33 for extracting environmental information, including information on the words before and after the unknown word, when the similarity exceeds a predetermined value, and a registering means 41 for registering the unknown word and the environmental information in the dictionary.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a user dictionary creation system, method, and program, and more particularly to a user dictionary creation system, a user dictionary creation method, and a program suited for use in a speech recognition system.

Speech recognition systems that use a speech recognition device to create a document from text read aloud are known. Such a system has a speech recognition device, a speech recognition dictionary prepared in advance on the device side, and a user dictionary creation function with which the user registers words that are not contained in the speech recognition dictionary.

Registering a word that the speech recognition dictionary does not recognize, with its reading and morpheme information correctly assigned, through the user dictionary creation function places a heavy burden on the user. In addition, if words that the speech recognition dictionary does not recognize are registered unconditionally, and a word with the same reading or a similar reading is already registered in the dictionary, the two words can no longer be distinguished, which induces misrecognition in which a highly similar word is output by mistake.

Patent Document 1 describes a user dictionary creation system that addresses the above problem. FIG. 7 is a block diagram of that system. The user dictionary creation system 700 comprises a speech input unit 701, a speech recognition unit 702, a similarity calculation unit 703, a dictionary registration unit 704, a speech recognition dictionary 705, and a user dictionary 706, and operates as follows. First, word speech is input in advance through the speech input unit 701, converted into a character string by the speech recognition unit 702, and the character string is registered in the speech recognition dictionary 705. When the user inputs a new word, the similarity calculation unit 703 compares the input speech with the speech recognition dictionary 705 and, if they are similar, rejects registration in the user dictionary 706.

The user dictionary creation system of Patent Document 1 determines the similarity between the words in the dictionary and the word to be newly registered, and uniformly refuses registration when the similarity is high. With this approach, however, the recognition rate for words that cannot be registered cannot be improved.

Patent Document 2 describes a user dictionary creation system that, when the dictionary is created, compares the speech pattern of the word to be registered with that of the words it must be distinguished from and, if they prove to be similar, notifies the user that misrecognition may occur. On receiving this notice, the user decides whether to register the word. Unlike Patent Document 1, this method leaves no word that cannot be registered, but misrecognition can still arise from a wrong decision by the user.

Patent Document 3 describes a speech recognition system having a dictionary registration device that determines the similarity between a word to be registered in the user dictionary and the words already registered and, when the similarity is at or above a predetermined level, lets each user decide whether to register the word in his or her own dictionary. In this system, the ID number of the user who approved the registration is stored together with the registered word, so the registration has no effect on other users. Each user can therefore register words at his or her own discretion, and the risk of causing misrecognition for other users is removed.

Patent Document 4 describes a word processor in which, when a word to be registered in the user dictionary has a homophone, the word is registered together with the words that accompany it before and after and their parts of speech.

Patent Document 1: JP-A-8-110790
Patent Document 2: JP-A-7-44188
Patent Document 3: JP-A-2000-259172
Patent Document 4: JP-A-2-297247

In the speech recognition system of Patent Document 3, a word registered by a user affects only that user, so it cannot cause misrecognition for other users. However, each user must register the same word separately, which lowers the efficiency of user dictionary creation.

In the word processor of Patent Document 4, the accompanying words before and after a word that has a homophone, and their parts of speech, are added to the entry. In a speech recognition system, however, misrecognition occurs not only between homophones but also between words whose pronunciations are merely similar.

In view of the above, an object of the present invention is to provide a user dictionary registration system that avoids misrecognition while registering new words in a user dictionary efficiently, together with a dictionary registration method and a program used in such a system.

To achieve the above object, the present invention provides a user dictionary creation system comprising: text input means for inputting a character string; unknown word extraction means for extracting, as an unknown word, a word in the input character string that is not registered in a dictionary; similarity calculation means for calculating the similarity between the extracted unknown word and a registered word already registered in the dictionary; environment information extraction means for extracting, when the similarity is equal to or greater than a predetermined value, environment information indicating the parts of speech of the words before and after the unknown word, based on morphological analysis of the text before and after the unknown word in the character string; and registration means for registering, when the similarity is equal to or greater than the predetermined value, the unknown word and the environment information extracted by the environment information extraction means in the dictionary.

The user dictionary creation system, method, and program of the present invention make it possible to register an unknown word that is highly similar to a known word efficiently, while suppressing the misrecognition that would otherwise tend to occur afterwards between the unknown word and the known word.

FIG. 1 is a block diagram of a speech recognition system including a user dictionary creation system according to a first embodiment of the present invention.
FIG. 2 is a block diagram of a user dictionary creation system according to a second embodiment of the present invention.
FIG. 3 is a block diagram of a user dictionary creation system according to a third embodiment of the present invention.
FIG. 4 is a block diagram of a user dictionary creation system according to a fourth embodiment of the present invention.
FIG. 5 is a chart showing examples in which errors tend to occur between similar words.
FIG. 6 is a chart showing a registration example adopted in the first embodiment.
FIG. 7 is a block diagram of the user dictionary creation system described in Patent Document 1.

In its minimum configuration, the user dictionary creation system of the present invention comprises: text input means for inputting a character string; unknown word extraction means for extracting, as an unknown word, a word in the input character string that is not registered in a dictionary; similarity calculation means for calculating the similarity between the extracted unknown word and a registered word already registered in the dictionary; environment information extraction means for extracting, when the similarity is equal to or greater than a predetermined value, environment information indicating the parts of speech of the words before and after the unknown word, based on morphological analysis of the text before and after the unknown word in the character string; and registration means for registering, when the similarity is equal to or greater than the predetermined value, the unknown word and the environment information extracted by the environment information extraction means in the dictionary.

In the user dictionary creation system of the present invention, when an unknown word is extracted whose similarity to a word already registered in the dictionary is at or above a predetermined level, environment information for the unknown word is extracted from the text before and after it in the input character string and registered in the dictionary together with the unknown word. This configuration prevents the misrecognition that would otherwise tend to occur when the dictionary is consulted later.

The user dictionary creation system of the present invention may further include unknown word reading assignment means for determining the reading of the unknown word. In this case, the decision made when selecting unknown words becomes easier, and the actual registration procedure is also simplified.

The user dictionary creation system of the present invention may further include registered word extraction means that extracts, from the input character string, a registered word already registered in the dictionary and, when no environment information for the extracted registered word is registered in the dictionary, extracts environment information indicating the parts of speech of the words before and after the registered word, based on the morphological analysis result of the text before and after it, and updates the registered word by adding the extracted environment information to it. In this case, the recognition rate for words already registered in an existing dictionary can be improved.

In the user dictionary creation system of the present invention, when the registered word judged to have at least the predetermined similarity to the unknown word already carries environment information, the environment information extraction means may extract the environment information for the unknown word from a span that is one word longer, before and/or after the unknown word, than the span from which the registered word's environment information was extracted. In this case, the environment information of the newly registered unknown word is recorded at a finer granularity, which reduces misrecognition.

Exemplary embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 shows a speech recognition system including a user dictionary creation system according to the first embodiment of the present invention. The user dictionary creation system has a text input unit 10; an unknown word extraction unit 20 that extracts unknown words from the input text using a speech recognition dictionary 50 and a user dictionary 60; an unknown word selection unit 30 that selects, from the extracted unknown words, the words to be registered in the user dictionary; and a user dictionary registration unit 40 that registers the unknown words. The speech recognition system includes this user dictionary creation system, the speech recognition dictionary 50, the user dictionary 60, speech recognition means 70, and document creation means 80.

The text input unit 10 includes text input means 11 which, given a digitized file such as past meeting minutes, transcripts of remarks, or reports, extracts and outputs only the text information.

The unknown word extraction unit 20 includes morphological analysis means 21 and unknown word extraction means 22. The morphological analysis means 21 performs morphological analysis on the text received from the text input unit 10, using the speech recognition dictionary 50 and the user dictionary 60, and assigns parts of speech. The unknown word extraction means 22 extracts, as unknown words, the words to which the morphological analysis means 21 could not assign a part of speech, stores each unknown word together with the sentence containing it, compiles them into a list, and supplies the list to the unknown word selection unit 30.
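The extraction step can be sketched as follows. This is only an illustration under stated assumptions: the `Token` type, the whitespace tokenizer standing in for morphological analysis means 21, and the dictionary layout (surface form mapped to part of speech) are all hypothetical and not taken from the patent; real Japanese text would need a proper segmenter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    surface: str
    pos: Optional[str]  # None when no part of speech could be assigned


def analyze(sentence, recognition_dict, user_dict):
    """Toy stand-in for morphological analysis means 21 (assumption: tokens
    are whitespace-separated and both dictionaries map surface -> POS)."""
    tokens = []
    for surface in sentence.split():
        pos = user_dict.get(surface) or recognition_dict.get(surface)
        tokens.append(Token(surface, pos))
    return tokens


def extract_unknown_words(sentences, recognition_dict, user_dict):
    """Stand-in for unknown word extraction means 22: list every token that
    received no part of speech, paired with the sentence containing it."""
    unknown = []
    for sentence in sentences:
        for token in analyze(sentence, recognition_dict, user_dict):
            if token.pos is None:
                unknown.append((token.surface, sentence))
    return unknown
```

Keeping the containing sentence next to each unknown word matters because the later environment information step re-analyzes that sentence.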

The unknown word selection unit 30 includes unknown word selection means 31, similarity calculation means 32, and environment information extraction means 33. In outline, these means operate as follows. The unknown word selection means 31 presents the list of unknown words created by the unknown word extraction means 22 and accepts the user's decision on whether to register each one. For each unknown word the user decides to register, the user assigns the reading and the part of speech.

The similarity calculation means 32 calculates the acoustic similarity between the reading of the unknown word and the readings of the words in the dictionary, using a table that defines the distances between phonemes. It also calculates the similarity between the part of speech of the unknown word and the part of speech of each acoustically similar word in the dictionary (known word), using a table that defines the distances between morphemes. Only for unknown words that have a highly similar known word in the dictionary, the environment information extraction means 33 re-runs morphological analysis on the sentence containing the unknown word, using the part of speech assigned through the unknown word selection means 31, and obtains the environmental conditions around the unknown word, such as the parts of speech of the words before and after it.
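One way to realize a reading similarity driven by a phoneme distance table is a weighted edit distance, sketched below. The patent does not specify the algorithm, so the edit-distance formulation, the normalization to the range 0 to 1, the unit cost for insertions and deletions, and the function names are all assumptions; table values are assumed to lie between 0 and 1.

```python
def phoneme_distance(a, b, table):
    """Substitution cost between two phonemes: 0 if identical, the table
    value if the pair is listed, otherwise the maximum cost 1.0."""
    if a == b:
        return 0.0
    return table.get((a, b), table.get((b, a), 1.0))


def reading_similarity(r1, r2, table):
    """Similarity in [0, 1] from a weighted edit distance over two phoneme
    sequences (assumed formulation, not the patent's)."""
    n, m = len(r1), len(r2)
    if n == 0 and m == 0:
        return 1.0
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)  # delete i phonemes
    for j in range(1, m + 1):
        d[0][j] = float(j)  # insert j phonemes
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,
                d[i][j - 1] + 1.0,
                d[i - 1][j - 1] + phoneme_distance(r1[i - 1], r2[j - 1], table),
            )
    return 1.0 - d[n][m] / max(n, m)
```

A caller would compare the result against the predetermined threshold to decide whether environment information is needed; the part-of-speech similarity mentioned in the text could be combined in the same way with a second table.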

The user dictionary registration unit 40 includes registration means 41 for unknown words. The registration means 41 registers in the user dictionary 60 the unknown words that the user selected in the unknown word selection unit 30. When an unknown word has a highly similar word in the dictionary, the user dictionary registration unit 40 registers, in addition to the part of speech and reading of the unknown word, the parts of speech before and after it obtained by the environment information extraction means 33, as environment information in the user dictionary 60.
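A minimal sketch of this conditional registration follows. The entry layout, the `"head"` and `"tail"` labels for sentence edges, the default threshold of 0.8, and the function names are hypothetical choices made for illustration; the patent only states that environment information is stored when the similarity reaches a predetermined value.

```python
def context_pos(tagged, index):
    """Parts of speech immediately before and after position `index` in a
    list of (surface, pos) pairs; sentence edges use the assumed labels
    'head' and 'tail'."""
    before = tagged[index - 1][1] if index > 0 else "head"
    after = tagged[index + 1][1] if index + 1 < len(tagged) else "tail"
    return (before, after)


def register_unknown_word(user_dict, surface, reading, pos, tagged, index,
                          best_similarity, threshold=0.8):
    """Register the word; attach environment information only when some
    dictionary word is at least `threshold` similar (assumed value)."""
    entry = user_dict.setdefault(
        surface, {"reading": reading, "pos": pos, "environments": set()}
    )
    if best_similarity >= threshold:
        entry["environments"].add(context_pos(tagged, index))
    return entry
```

For example, registering "Sun" at the start of a sentence followed by a particle, with a best similarity of 0.9, would store the environment `("head", "particle")`, while a word with no similar neighbor is stored without restrictions.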

The speech recognition means 70 receives and records speech in real time and recognizes phonetic characters, such as the Japanese syllabary (gojūon) and the alphabet, from the recorded speech. The document creation means 80 receives the phonetic characters from the speech recognition means 70 and creates a document from them based on the speech recognition dictionary 50 and the user dictionary 60.

When the document creation means 80 creates a document by referring to the user dictionary 60, a word in the user dictionary is selected only when its environment information also matches. This keeps the adverse effect of having registered highly similar words in the user dictionary to a minimum.

The specific operation of the user dictionary creation system according to the first embodiment is described in detail below. First, the user inputs, through the text input means 11 of the text input unit 10 in FIG. 1, a large amount of text that contains the words to be recognized or that deals with related topics. In the unknown word extraction unit 20, the morphological analysis means 21 performs morphological analysis on the input text using the speech recognition dictionary 50 and the user dictionary 60, and the unknown word extraction means 22 extracts the unknown words and creates an unknown word list.

For each entry in the resulting unknown word list, the user chooses whether to register it, using the unknown word selection means 31 of the unknown word selection unit 30. For each unknown word chosen for registration, the similarity calculation means 32 determines whether the dictionary contains a similar word. If a similar word is found, the environment information extraction means 33 obtains the part-of-speech information before and after the word to be registered. The user dictionary registration unit 40 then registers the information for the words selected in the unknown word selection unit 30 in the user dictionary 60.

FIG. 5 shows examples of the errors that tend to occur when unknown words are registered without environment restrictions. When the user registers the unknown word "Sun" (reading: サン, "san"), the "さん" in the input "佐藤さん。" ("Sato-san.") is misrecognized as "Sun", and the recognition result becomes "佐藤Sun。". Similarly, when the user registers the unknown word "ARIS" (reading: アリス, "arisu"), the "あります" in the input "そうであります" is misrecognized as "ARIS", and the result becomes "そうでARIS。".

FIG. 6 shows sample entries registered in the user dictionary in this embodiment when the dictionary already contains a highly similar word, so that the environment information of the word being registered is taken into account. When the unknown word "Sun" is registered, it is registered as a proper noun together with its reading and the pattern symbol-proper noun-symbol. That is, symbols (punctuation marks) are registered as the environment before and after the word "Sun". The pattern head-proper noun-particle is also registered: when the unknown word "Sun" appears at the head, the particle "が" that follows the proper noun is registered as environment information.

FIG. 6 further shows that when the unknown word "ARIS" is registered, the patterns symbol-proper noun-symbol, head-proper noun-particle, or head-proper noun-noun are registered in addition to its reading. Environment information is registered according to this rule when "ARIS" is registered; when "ARIS" is input afterwards, checking the environment before and after it prevents misrecognition against strings such as "あります".
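The effect of these entries at recognition time can be sketched as a lookup that accepts a user dictionary word only when the surrounding parts of speech match a registered pattern. The entry layout, the romanized reading key, and the `choose_surface` helper are hypothetical; only the two patterns for "Sun" come from the description of FIG. 6.

```python
# Entry modeled on the FIG. 6 description; the layout itself is an assumption.
USER_DICT = {
    "Sun": {
        "reading": "san",
        "pos": "proper noun",
        "environments": {("symbol", "symbol"), ("head", "particle")},
    },
}


def choose_surface(reading, prev_pos, next_pos, user_dict, fallback):
    """Return a user dictionary word for `reading` only if the neighboring
    parts of speech match one of its registered environments; entries with
    no environments are treated as unrestricted (assumption)."""
    for surface, entry in user_dict.items():
        if entry["reading"] != reading:
            continue
        envs = entry["environments"]
        if not envs or (prev_pos, next_pos) in envs:
            return surface
    return fallback


# "Sun" at the head followed by a particle is accepted.
print(choose_surface("san", "head", "particle", USER_DICT, "さん"))
# After a noun such as "佐藤" and before "。", the fallback "さん" is kept.
print(choose_surface("san", "noun", "symbol", USER_DICT, "さん"))
```

With this gate, the "佐藤さん。" case from FIG. 5 no longer yields "佐藤Sun。", because noun-then-symbol is not among the registered environments.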

In the first embodiment, when an unknown word is registered, it is first determined whether a word highly similar to it is already registered in the dictionary. When an unknown word judged to be highly similar to an already registered word is registered, not only its notation, reading, and part-of-speech information but also the part-of-speech environment in which it can appear is registered. This configuration distinguishes the unknown word from similar words so that the recognition result is not adversely affected.

Next, a second embodiment of the present invention is described. Referring to FIG. 2, the user dictionary creation system according to the second embodiment differs from that of the first embodiment in the configuration of the unknown word extraction unit. Specifically, the unknown word extraction unit 20A of the second embodiment has, in addition to the configuration of the unknown word extraction unit 20 of the first embodiment, unknown word reading assignment means 23 and an unknown word reading dictionary 90. The other configurations and operations are the same as in the first embodiment.

The unknown word reading assignment means 23 automatically assigns a reading to each unknown word extracted by the unknown word extraction means 22, using the unknown word reading dictionary 90. The unknown word reading dictionary 90 is a table that defines a reading for each character, and readings from this table are assigned in order from the first character of the unknown word. For example, when the character string "ＡＢＣ" is extracted as an unknown word, it is split into single characters and given the readings "Ａ (えー)", "Ｂ (びー)", and "Ｃ (しー)" found in the unknown word reading dictionary 90.
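The per-character reading assignment can be sketched in a few lines. The three table entries follow the example in the text (written here with ASCII letters rather than full-width ones); the function name and the `?` placeholder for characters missing from the table are assumptions, since the patent does not say how such characters are handled.

```python
# Per-character reading table; the three entries mirror the example above.
READING_TABLE = {"A": "えー", "B": "びー", "C": "しー"}


def assign_reading(word, table, placeholder="?"):
    """Concatenate per-character readings from the first character onward.
    Characters missing from the table become `placeholder` (assumption) so
    the user can see where a manual correction is needed."""
    return "".join(table.get(ch, placeholder) for ch in word)


print(assign_reading("ABC", READING_TABLE))
```

The result is only a provisional reading; as the text notes, the user still reviews it when choosing which unknown words to register.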

With this configuration, by the time the user selects an unknown word for registration with the unknown word selection means 31, some reading has already been assigned to it. The registration effort is therefore smaller than when every reading has to be entered from scratch.

Next, a third embodiment of the present invention is described in detail with reference to the drawings. Referring to FIG. 3, the user dictionary creation system according to the third embodiment differs from that of the first embodiment in the configuration of the unknown word selection unit. Specifically, the unknown word selection unit 30A of this embodiment has unknown word selection means 31, similarity calculation means 34, and environment information extraction means 35. The configuration and operation of the unknown word selection means 31 are the same as in the first embodiment.

In this embodiment, the similarity calculation means 34 calculates similarity in the same way as the similarity calculation means 32 of the first embodiment and, when a similar word is found in the dictionary, also determines whether that word resides in the speech recognition dictionary 50 or in the user dictionary 60. When the similarity calculation means 34 determines that the similar word is registered in the user dictionary 60, the environment information extraction means 35 gives the newly registered word environment information that extends one word further, before and/or after, than the environment information of the similar word registered in the user dictionary.
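The widening rule can be sketched as a context window whose width is one larger than that of the similar user dictionary entry. The text allows widening before and/or after; this sketch widens both sides, which is one possible choice. The function names and the (before, after) width tuple are hypothetical.

```python
def context_window(tagged, index, before, after):
    """POS tags of up to `before` words preceding and `after` words
    following position `index` in a list of (surface, pos) pairs."""
    left = tuple(pos for _, pos in tagged[max(0, index - before):index])
    right = tuple(pos for _, pos in tagged[index + 1:index + 1 + after])
    return left, right


def widened_environment(tagged, index, similar_entry_width):
    """Environment for a new word that is one word wider on each side than
    the (before, after) width used by the similar user dictionary entry
    (assumption: both sides are widened)."""
    before, after = similar_entry_width
    return context_window(tagged, index, before + 1, after + 1)
```

If the existing entry was registered with one word of context on each side, the new word is stored with two, so the two entries can be told apart even when their immediate neighbors have the same parts of speech.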

In the third embodiment, misrecognition can be reduced even when the word to be registered is highly similar to a word already registered in the user dictionary rather than in the speech recognition dictionary.

Next, a fourth embodiment of the present invention is described. Referring to FIG. 4, in the user dictionary creation system according to the fourth embodiment, the unknown word extraction unit 20B has registered word extraction means 24 in addition to the configuration of the unknown word extraction unit 20 of the first embodiment. The other configurations and operations of this embodiment are the same as in the first embodiment.

The registered word extraction means 24 adds environment information to words already registered in the user dictionary 60. When the character string analyzed by the morphological analysis means 21 contains a word already registered in the user dictionary, the registered word extraction means 24 searches the user dictionary 60, checks the environment information of that word, and, if the observed environment information is not yet registered, registers it in the user dictionary 60 via the user dictionary registration unit 40. Even when environment information is already registered, if it does not discriminate sufficiently, new environment information is registered in addition to, or in place of, the registered environment information.
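A minimal sketch of this update pass, assuming the same hypothetical entry layout as a set of (before, after) part-of-speech pairs and the assumed edge labels `"head"` and `"tail"`. It covers only the "add missing environment" case; replacing insufficient environment information, which the text also allows, is left out.

```python
def update_registered_environments(tagged, user_dict):
    """Scan an analyzed sentence (list of (surface, pos) pairs) and add any
    not-yet-registered environment to words already in the user dictionary.
    Returns the (surface, environment) pairs that were added."""
    added = []
    for i, (surface, _pos) in enumerate(tagged):
        entry = user_dict.get(surface)
        if entry is None:
            continue  # not a registered word
        before = tagged[i - 1][1] if i > 0 else "head"
        after = tagged[i + 1][1] if i + 1 < len(tagged) else "tail"
        env = (before, after)
        if env not in entry["environments"]:
            entry["environments"].add(env)
            added.append((surface, env))
    return added
```

Running the same sentence through the function a second time adds nothing, so the pass can be applied repeatedly as new text is fed in.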

In the fourth embodiment, the environment information of a word already registered in the user dictionary can be expanded or changed afterward on the basis of subsequently obtained information. A further improvement in the recognition rate can therefore be expected.

Although the present invention has been particularly shown and described with reference to exemplary embodiments, the present invention is not limited to those embodiments and their modifications. As will be apparent to those skilled in the art, various changes can be made without departing from the spirit and scope of the present invention as defined in the appended claims.

The present invention is suitable for speech recognition systems that require highly accurate recognition performance and in which a user dictionary allows necessary words to be registered at any time, according to the user's purposes, even after the system has been built, and for dictionary registration devices for such systems.

10: Text input unit
11: Text input means
20, 20A, 20B: Unknown word extraction unit
21: Morphological analysis means
22: Unknown word extraction means
23: Unknown word reading means
24: Registered word extraction means
30, 30A: Unknown word selection unit
31: Unknown word selection means
32, 34: Similarity calculation means
33, 35: Environment information extraction means
40: User dictionary registration unit
41: Registration means
50: Speech recognition dictionary
60: User dictionary
70: Speech recognition means
80: Sentence creation means
90: Unknown word reading dictionary

Claims

A user dictionary creation system comprising:
text input means for inputting a character string;
unknown word extraction means for extracting, as an unknown word, a word that is not registered in a dictionary from the input character string;
similarity calculation means for calculating a similarity between the extracted unknown word and a registered word already registered in the dictionary;
environment information extraction means for extracting, when the similarity is greater than a predetermined value, environment information indicating the parts of speech of the words before and after the unknown word, based on a morphological analysis result of the text before and after the unknown word in the character string; and
registration means for registering, in the dictionary, the unknown word and the environment information extracted by the environment information extraction means when the similarity is equal to or greater than the predetermined value.

The user dictionary creation system according to claim 1, further comprising unknown word reading means for determining reading of the unknown word.

The user dictionary creation system according to claim 1 or 2, further comprising registered word extraction means for, when a registered word already registered in the dictionary is extracted from the input character string and environment information of the extracted registered word is not registered in the dictionary, extracting environment information indicating the parts of speech of the words before and after the registered word based on a morphological analysis result of the text before and after the registered word, and updating the registered word by adding the extracted environment information to the registered word.

The user dictionary creation system according to any one of claims 1 to 3, wherein, when the registered word determined to have a similarity equal to or greater than the predetermined value with the unknown word already includes environment information, the environment information extraction means extracts the environment information from words extending one word further before and/or after the unknown word than the preceding and following words from which the included environment information was extracted.

A process of inputting a character string;
a process of extracting, as an unknown word, a word that is not registered in a dictionary from the input character string;
a process of calculating a similarity between the extracted unknown word and a registered word already registered in the dictionary;
a process of extracting, when the similarity is greater than a predetermined value, environment information indicating the parts of speech of the words before and after the unknown word, based on a morphological analysis result of the text before and after the unknown word in the character string; and
a process of registering the unknown word and the extracted environment information in the dictionary when the similarity is equal to or greater than the predetermined value.

A user dictionary creation program for causing a computer to execute:
a process of inputting a character string;
a process of extracting, as an unknown word, a word that is not registered in a dictionary from the input character string;
a process of calculating a similarity between the extracted unknown word and a registered word already registered in the dictionary;
a process of extracting, when the similarity is greater than a predetermined value, environment information indicating the parts of speech of the words before and after the unknown word, based on a morphological analysis result of the text before and after the unknown word in the character string; and
a process of registering the unknown word and the extracted environment information in the dictionary when the similarity is equal to or greater than the predetermined value.