JP4942860B2

JP4942860B2 - Recognition dictionary creation device, speech recognition device, and speech synthesis device

Info

Publication number: JP4942860B2
Application number: JP2011550720A
Authority: JP
Inventors: 裕三丸田
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2010-01-22
Filing date: 2010-01-22
Publication date: 2012-05-30
Anticipated expiration: 2030-01-22
Also published as: DE112010005168T5; WO2011089651A1; JPWO2011089651A1; US9177545B2; CN102687197B; CN102687197A; DE112010005168B4; US20120203553A1

Description

The present invention relates to a recognition dictionary creation device that registers vocabulary in a speech recognition dictionary from speech uttered by a user, and to a speech recognition device and a speech synthesis device that use it.

In some applications of speech recognition, speech uttered by the user is registered and used as a recognition target word. Hereinafter, this operation is referred to as user dictionary generation. Examples of generating a user dictionary by voice include registering by voice a radio station name associated with a radio frequency, or a person's name or place name associated with a telephone number.

Speech recognition installed in devices that may be used across multiple countries, such as car navigation systems and portable terminals, also requires a language switching function.
As related art, Patent Document 1, for example, discloses a method of switching the language of an electronic dictionary in which the language the user wants to use is determined by collating character data, obtained by speech recognition of the user's utterance, with words stored in the device.

In general, speech data is collected for each language, and the user's utterance is recognized using a speech recognition algorithm and a standard speech model built from the collected data. Consequently, when the language is switched, the speech recognition means itself and the standard speech model must also be switched.
A speech recognition device has also been proposed that uses generally known speech recognition techniques to generate, for each language, the phoneme label string that best represents the speech uttered by the user and stores it as a user dictionary, so that the user's utterance can still be recognized after the language used for speech recognition is switched.

However, creating a phoneme label string every time the language is changed requires the uttered speech to be held in memory for processing, so a large-capacity memory with room to store the speech is needed.
If the uttered speech cannot be stored in memory, phoneme label strings must instead be created in advance for every language that may be used. Because creating a phoneme label string takes considerable time even for a single language, the processing time for all such languages becomes enormous. A large-capacity memory capable of storing the phoneme label strings for all languages is also required.

The present invention has been made to solve the problems described above. Its object is to obtain a recognition dictionary creation device that needs no large-capacity memory for storing uttered speech, does not require phoneme label strings to be created in advance for all languages, and shortens the time needed to create the phoneme label string for each language, together with a speech recognition device and a speech synthesis device that use it.

Patent Document 1: JP 2001-282788 A

The recognition dictionary creation device according to the present invention includes: an acoustic analysis unit that acoustically analyzes the speech signal of input speech and outputs a time series of acoustic features; an acoustic standard pattern storage unit that stores, for each language, acoustic standard patterns representing standard acoustic features; an acoustic data matching unit that creates a phoneme label string of the input speech by collating the time series of acoustic features received from the acoustic analysis unit with the acoustic standard patterns stored in the acoustic standard pattern storage unit; a user dictionary storage unit that stores a user dictionary in which the phoneme label string created by the acoustic data matching unit is registered; a language storage unit that stores the language of the phoneme label string registered in the user dictionary; a language switching unit that switches the language; a mapping table storage unit that stores a mapping table defining the correspondence between phoneme labels of different languages; and a phoneme label string conversion unit that refers to the mapping table stored in the mapping table storage unit and converts the phoneme label string registered in the user dictionary from a phoneme label string of the language stored in the language storage unit into a phoneme label string of the language selected by the language switching unit.

According to the present invention, the device includes a user dictionary in which the phoneme label string of input speech is registered and a mapping table that defines the correspondence between phoneme labels of different languages. By referring to the mapping table, it converts the phoneme label string registered in the user dictionary from a phoneme label string of the language in use when the user dictionary was created into a phoneme label string of the language after switching.
Because the registered vocabulary can thus be converted quickly for the new language by referring to the mapping table even when the language is switched, no large-capacity memory for storing uttered speech is needed, phoneme label strings need not be created in advance for all languages, and the time needed to create the phoneme label string for each language is shortened.

FIG. 1 is a block diagram showing the configuration of a recognition dictionary creation device according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing the flow of the user dictionary registration operation by the recognition dictionary creation device of Embodiment 1.
FIG. 3 is a flowchart showing the flow of the user dictionary registration operation after language switching by the recognition dictionary creation device of Embodiment 1.
FIG. 4 is a block diagram showing the configuration of a speech recognition device according to Embodiment 2 of the present invention.
FIG. 5 is a flowchart showing the flow of operation of the speech recognition device of Embodiment 2.
FIG. 6 is a block diagram showing the configuration of a speech synthesis device according to Embodiment 3 of the present invention.
FIG. 7 is a flowchart showing the flow of operation of the speech synthesis device of Embodiment 3.
FIG. 8 is a block diagram showing the configuration of a recognition dictionary creation device according to Embodiment 4 of the present invention.
FIG. 9 is a flowchart showing the flow of the user dictionary registration operation by the recognition dictionary creation device of Embodiment 4.
FIG. 10 is a flowchart showing the flow of the user dictionary registration operation after language switching by the recognition dictionary creation device of Embodiment 4.

Hereinafter, in order to explain the present invention in more detail, modes for carrying out the invention will be described with reference to the accompanying drawings.
Embodiment 1.
FIG. 1 is a block diagram showing the configuration of a recognition dictionary creation device according to Embodiment 1 of the present invention. In FIG. 1, the recognition dictionary creation device 1 of Embodiment 1 includes a microphone 2a, a voice capturing unit 2, an acoustic analysis unit 3, acoustic standard patterns 4 for each language, an acoustic data matching unit 5, a user dictionary registration unit (user dictionary storage unit) 6, a user dictionary creation language storage unit (language storage unit) 7, a language switching unit 8, a phoneme label string conversion unit 9, and an inter-language acoustic data mapping table storage unit (mapping table storage unit) 10.

The voice capturing unit 2 converts the speech picked up by the microphone 2a into a digital signal. The acoustic analysis unit 3 analyzes the speech signal digitized by the voice capturing unit 2 and converts it into a time series of acoustic features. For example, it analyzes the speech signal at fixed time intervals and calculates an acoustic feature quantity (acoustic feature vector) representing the characteristics of the speech.
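As a rough illustration of analysis at fixed time intervals, the following Python sketch cuts a digitized signal into overlapping frames and computes a small feature vector for each frame. The frame length, the hop size, and the two features used (log energy and zero-crossing count) are assumptions chosen for illustration; the specification does not state which acoustic features the acoustic analysis unit 3 computes, and a practical system would typically use a richer feature vector.

```python
import math

def analyze(samples, frame_len=400, hop=160):
    """Return one (log_energy, zero_crossings) vector per frame.

    frame_len and hop are illustrative values (25 ms and 10 ms at 16 kHz),
    not values taken from the specification.
    """
    features = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        features.append((math.log(energy + 1e-10), crossings))
    return features

# One second of a 440 Hz tone at 16 kHz, a synthetic stand-in for microphone input.
signal = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
features = analyze(signal)
```

The result is a time series of feature vectors, one per frame, which is the form of data handed to the acoustic data matching unit 5.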

The acoustic standard patterns 4 are standard acoustic features (standard models that represent the properties of the acoustic feature quantities of speech fragments) corresponding to the individual phoneme label strings of language X (X = 1, 2, 3, ...), modeled, for example, by HMMs (hidden Markov models) with the phoneme as the unit. The acoustic data matching unit 5 collates the time series of acoustic features of the input speech obtained by the acoustic analysis unit 3 with the acoustic standard patterns 4 of language X and, from the phoneme label strings associated with the standard acoustic features that make up the acoustic standard patterns 4, creates the phoneme label string most similar to the input speech.
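The idea of collating a feature time series with per-phoneme standard patterns can be sketched as follows. This is not HMM decoding: it is a much simpler nearest-template labelling of each frame, followed by merging of repeated labels, and is given only to show the shape of the input and output. The one-dimensional pattern values and the use of "#" as a boundary marker at both ends are assumptions made for this example.

```python
from itertools import groupby

# Hypothetical standard patterns: one reference feature value per phoneme label.
STANDARD_PATTERNS = {"/m/": 1.0, "/a/": 5.0, "/i/": 9.0}

def match(features, patterns=STANDARD_PATTERNS):
    """Label each frame with its nearest pattern, then merge consecutive repeats."""
    per_frame = [min(patterns, key=lambda p: abs(patterns[p] - f)) for f in features]
    merged = [label for label, _ in groupby(per_frame)]
    # "#" marks the start and end of the string, following the examples in the text.
    return ["#"] + merged + ["#"]

labels = match([1.1, 0.9, 4.8, 5.2, 5.1, 8.7])
```

An HMM-based matcher would instead search for the most likely state sequence over the whole utterance, but it would likewise return a phoneme label string in the label set of the set language.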

The user dictionary registration unit 6 holds the user dictionary and stores in it the phoneme label string of the input speech created by the acoustic data matching unit 5. The user dictionary creation language storage unit 7 stores the set language, that is, the language that was set as the speech recognition language when the user dictionary was created. The language switching unit 8 switches the set language used as the speech recognition language.

The phoneme label string conversion unit 9 uses an inter-language acoustic data mapping table to convert a phoneme label string, expressed in the language in use when it was registered in the user dictionary, into a phoneme label string of the language selected by the language switching unit 8. The inter-language acoustic data mapping table storage unit 10 stores inter-language acoustic data mapping tables, each of which gives, for a pair of different languages, the correspondence between the phoneme labels of those languages.

When a phoneme label of one language cannot be expressed in the other language, it is associated with a similar phoneme label among those that the other language can express. For example, the English phoneme label /l/ cannot be expressed in Japanese. The inter-language acoustic data mapping table for Japanese and English therefore associates the English phoneme label /l/ with the Japanese phoneme label /r/, whose pronunciation is similar.

The voice capturing unit 2, the acoustic analysis unit 3, the acoustic standard patterns 4, the acoustic data matching unit 5, the user dictionary registration unit 6, the user dictionary creation language storage unit 7, the language switching unit 8, the phoneme label string conversion unit 9, and the inter-language acoustic data mapping table storage unit 10 can be realized on a computer as concrete means in which hardware and software cooperate, by storing a recognition dictionary creation program according to the gist of the present invention in the computer and having the CPU execute it. The storage areas used by the acoustic standard patterns 4, the user dictionary registration unit 6, the user dictionary creation language storage unit 7, and the inter-language acoustic data mapping table storage unit 10 are built on a storage device mounted in the computer, such as a hard disk device or an external storage medium.

Next, the operation will be described.
FIG. 2 is a flowchart showing the flow of the user dictionary registration operation by the recognition dictionary creation device of Embodiment 1.
After instructing the start of user dictionary creation with an input device (step ST1), the user utters the vocabulary to be registered. Assume, for example, that the personal name "Michael" is uttered. The voice capturing unit 2 captures the speech uttered by the user through the microphone 2a, converts the input speech into a digital signal, and outputs it to the acoustic analysis unit 3 (step ST2).

Next, the user dictionary creation language storage unit 7 checks the language currently set in the acoustic data matching unit 5, that is, the set language at the time of user dictionary registration (step ST3), and stores it in itself (step ST4). The set language is the language set in advance as the target of speech recognition or speech synthesis in a speech recognition device or speech synthesis device that uses the recognition dictionary creation device 1. In the example of FIG. 2, the set language is English. The acoustic analysis unit 3 acoustically analyzes the speech signal received from the voice capturing unit 2 in step ST2 and converts it into a time series of acoustic features (step ST5).

The acoustic data matching unit 5 reads the acoustic standard patterns 4 corresponding to the language set in itself (the set language) and collates them with the time series of acoustic features of the input speech obtained by the acoustic analysis unit 3. From the phoneme label strings associated with the standard acoustic features that make up the acoustic standard patterns 4, it creates the optimum phoneme label string representing the input speech, namely the one most similar to the time series of acoustic features of the input speech (step ST6). For example, when the input speech is "Michael" and the set language is English, the phoneme label string "#, /m/, /a/, /i/, /k/, /l/, #" is obtained, as shown in FIG. 2.

The user dictionary registration unit 6 registers the phoneme label string of the input speech created by the acoustic data matching unit 5 in the user dictionary (step ST7). This creates a user dictionary in which the phoneme label string corresponding to the registered vocabulary text in the set language is registered.
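The bookkeeping performed in steps ST3, ST4 and ST7 can be sketched in Python as below. The class name, field names and language code "en" are hypothetical; the sketch assumes the phoneme label string has already been produced by acoustic matching (step ST6) and only shows that the dictionary keeps the label string together with the language that was set at registration time, not the recorded speech.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RecognitionDictionary:
    # Plays the role of the user dictionary creation language storage unit 7.
    creation_language: Optional[str] = None
    # Plays the role of the user dictionary held by the registration unit 6.
    entries: List[List[str]] = field(default_factory=list)

    def register(self, phoneme_labels, set_language):
        self.creation_language = set_language        # steps ST3 and ST4
        self.entries.append(list(phoneme_labels))    # step ST7

dictionary = RecognitionDictionary()
dictionary.register(["#", "/m/", "/a/", "/i/", "/k/", "/l/", "#"], set_language="en")
```

Storing the creation language alongside the entries is what later allows the correct mapping table to be chosen when the language is switched.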

Next, the operation when the set language is switched will be described.
FIG. 3 is a flowchart showing the flow of the user dictionary registration operation after language switching by the recognition dictionary creation device of Embodiment 1. It shows the case where the language is switched after the user dictionary registration shown in FIG. 2 has been executed.
For example, when the user specifies a new language to the language switching unit 8 with an input device, the language switching unit 8 sets the language after switching in the phoneme label string conversion unit 9 (step ST1a). Here, it is assumed that the language has been switched to Japanese.
The phoneme label string conversion unit 9 reads the language stored in the user dictionary creation language storage unit 7 and checks the set language at the time of user dictionary registration (step ST2a). As described above, in FIG. 2 the set language at the time of user dictionary registration is English.

Next, using the set language at the time of user dictionary registration confirmed in step ST2a and the language after switching specified by the language switching unit 8, the phoneme label string conversion unit 9 searches the inter-language acoustic data mapping table storage unit 10 and reads the inter-language acoustic data mapping table corresponding to that pair of languages.

As shown in FIG. 3, the inter-language acoustic data mapping table is table data giving the correspondence between English phoneme labels and Japanese phoneme labels. For example, in FIG. 3, the three different English phoneme labels with similar pronunciations indicated by symbol A include labels that cannot be expressed in Japanese. In this case, they are associated with the one Japanese phoneme label (/a/) whose pronunciation is similar to that of the labels indicated by symbol A. Likewise, because the English phoneme label /l/ cannot be expressed in Japanese, it is associated with the Japanese phoneme label /r/, whose pronunciation is similar.
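A table of this kind can be represented as a simple dictionary, as in the Python sketch below. Only the /l/ to /r/ entry and the many-to-one mapping onto /a/ come from the text; the three labels indicated by symbol A are not named in this passage, so "/ae/" and "/ah/" are placeholders, and the Japanese label inventory shown is a small made-up excerpt.

```python
# Hypothetical excerpt of the Japanese phoneme label inventory.
JAPANESE_LABELS = {"/a/", "/i/", "/k/", "/m/", "/r/"}

# Hypothetical excerpt of an English -> Japanese mapping table.
EN_TO_JA = {
    "/a/": "/a/",
    "/ae/": "/a/",   # placeholder label: several English vowels share one target
    "/ah/": "/a/",   # placeholder label
    "/i/": "/i/",
    "/k/": "/k/",
    "/m/": "/m/",
    "/l/": "/r/",    # Japanese has no /l/, so the similar /r/ is used
}

def targets_are_expressible(table, target_labels):
    """Check that every mapped label exists in the target language."""
    return all(target in target_labels for target in table.values())

sources_of_a = sorted(src for src, dst in EN_TO_JA.items() if dst == "/a/")
```

Because several source labels may share one target label, the table is generally not invertible; a separate table would be needed for the opposite direction.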

Based on the inter-language acoustic data mapping table read from the inter-language acoustic data mapping table storage unit 10, the phoneme label string conversion unit 9 converts the phoneme label string registered in the user dictionary into a phoneme label string of the language after switching (step ST3a).
For example, as shown in FIG. 3, the English phoneme label string for "Michael", "#, /m/, /a/, /i/, /k/, /l/, #", is converted into the Japanese phoneme label string "#, /m/, /a/, /i/, /k/, /r/, #" according to the correspondence in the English-Japanese inter-language acoustic data mapping table.
A method of creating the inter-language acoustic data mapping table is disclosed, for example, in Reference Document 1 below.
(Reference Document 1): JP 2007-155833 A

The user dictionary registration unit 6 stores the phoneme label string converted by the phoneme label string conversion unit 9 in step ST3a back into the user dictionary (step ST4a). In FIG. 3, the registered vocabulary is "Michael" and the language after switching is Japanese, so the Japanese phoneme label string "#, /m/, /a/, /i/, /k/, /r/, #" is stored as one registered word.
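Steps ST1a to ST4a amount to a table lookup applied to every stored label, which the following Python sketch illustrates. The function name, the language codes, and the choice to key tables by a (source, target) pair are assumptions for this example, as is passing unmapped symbols such as "#" through unchanged.

```python
# Hypothetical table store keyed by (creation language, language after switching).
MAPPING_TABLES = {
    ("en", "ja"): {"/m/": "/m/", "/a/": "/a/", "/i/": "/i/", "/k/": "/k/", "/l/": "/r/"},
}

def switch_language(entries, creation_language, new_language, tables=MAPPING_TABLES):
    """Convert every registered phoneme label string to the new language."""
    table = tables[(creation_language, new_language)]   # table lookup after step ST2a
    # Step ST3a: map label by label; symbols without an entry (e.g. "#") are kept.
    return [[table.get(label, label) for label in entry] for entry in entries]

english_entries = [["#", "/m/", "/a/", "/i/", "/k/", "/l/", "#"]]
japanese_entries = switch_language(english_entries, "en", "ja")   # stored back in ST4a
```

No audio is touched at any point: the cost of a language switch is one dictionary lookup per stored label, which is why the conversion is fast compared with re-running acoustic matching.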

As described above, Embodiment 1 includes a user dictionary in which the phoneme label string of input speech is registered and an inter-language acoustic data mapping table that defines the correspondence between phoneme labels of different languages. By referring to the inter-language acoustic data mapping table, it converts the phoneme label string registered in the user dictionary from a phoneme label string of the language in use when the user dictionary was created into a phoneme label string of the language after switching.
With this configuration, even when the set language has been changed since the user dictionary was registered, a user dictionary for the new language can be created simply by converting the phoneme label string according to the inter-language acoustic data mapping table, and the processing time needed to create the phoneme label string of the corresponding language can be shortened considerably.
Furthermore, even though a phoneme label string is created each time the language is changed, the uttered speech need not be stored. Only the phoneme label string from the time of user dictionary registration is stored, and phoneme labels need not be created in advance for all the languages that may be used. A large-capacity memory is therefore unnecessary as well.

Embodiment 2.
FIG. 4 is a block diagram showing the configuration of a speech recognition device according to Embodiment 2 of the present invention. It shows a speech recognition device that uses the recognition dictionary creation device according to Embodiment 1. In FIG. 4, the speech recognition device 1A according to Embodiment 2 includes, in addition to the configuration of the recognition dictionary creation device 1 described in Embodiment 1, a dictionary collation unit 11, a general dictionary 12 expressed with the acoustic standard patterns of each language, and a recognition result output unit 13. In FIG. 4, components that operate in the same or a similar manner as in FIG. 1 are given the same reference numerals, and their description is omitted.

The dictionary collation unit 11 collates the phoneme label string of the input speech with the vocabulary of the general dictionary 12, expressed with the acoustic standard patterns of the set language, and with the vocabulary registered in the user dictionary of the user dictionary registration unit 6, and identifies, from the vocabulary of the general dictionary 12 and the user dictionary, the vocabulary most similar to the phoneme label string of the input speech. The general dictionary 12 is a dictionary expressed with the acoustic standard patterns of language X (X = 1, 2, 3, ...), in which a large vocabulary (phoneme label strings) such as place names in that language is registered. The recognition result output unit 13 outputs the speech recognition result, namely the vocabulary most similar to the phoneme label string of the input speech obtained as the result of collation by the dictionary collation unit 11.

The dictionary collation unit 11, the general dictionary 12 expressed with the acoustic standard patterns of each language, and the recognition result output unit 13 can be realized on a computer as concrete means in which hardware and software cooperate, by storing a speech recognition program according to the gist of the present invention in the computer and having the CPU execute it. The storage areas used for the acoustic standard patterns 4 and the general dictionary 12 are built on a storage device mounted in the computer, such as a hard disk device or an external storage medium.

Next, the operation will be described.
FIG. 5 is a flowchart showing the flow of operation of the speech recognition device of Embodiment 2.
After instructing the start of speech recognition with an input device (step ST1b), the user utters the speech to be recognized. Assume, for example, that the personal name "Michael" is uttered. The voice capturing unit 2 captures the speech uttered by the user through the microphone 2a, converts the input speech into a digital signal, and outputs it to the acoustic analysis unit 3 (step ST2b). The acoustic analysis unit 3 acoustically analyzes the speech signal received from the voice capturing unit 2 in step ST2b and converts it into a time series of acoustic features.

The acoustic data matching unit 5 reads the language stored in the user dictionary creation language storage unit 7 and checks the set language at the time of user dictionary registration (step ST3b). In FIG. 5, it is assumed that the set language at the time of user dictionary registration was Japanese.
Next, from the time series of acoustic features of the input speech received from the acoustic analysis unit 3 and the acoustic standard patterns 4 of the set language, the acoustic data matching unit 5 creates a phoneme label string of the set language for the input speech (step ST4b). For example, when the input speech is "Michael" and the set language is Japanese, "#, /m/, /a/, /i/, /k/, /r/, #" is obtained as the phoneme label string expressed with the Japanese acoustic standard patterns.

Next, the dictionary collation unit 11 collates the phoneme label string of the input speech created by the acoustic data matching unit 5 with the vocabulary of the general dictionary 12, expressed with the acoustic standard patterns 4 of the set language, and with the vocabulary registered in the user dictionary of the user dictionary registration unit 6, and identifies, from the vocabulary of the general dictionary 12 and the user dictionary, the vocabulary most similar to the phoneme label string of the input speech (step ST5b). The recognition result output unit 13 outputs the vocabulary most similar to the phoneme label string of the input speech obtained as the result of collation by the dictionary collation unit 11 (step ST6b).

As shown in FIG. 5, a large vocabulary such as place names is registered as phoneme label strings in the general dictionary 12, which is expressed in the acoustic standard pattern of the set language (here, Japanese). In the user dictionary, as shown in Embodiment 1, arbitrary vocabulary uttered by the user is registered as phoneme label strings. Here, when "#, /m/, /a/, /i/, /k/, /r/, #" is registered as registered word 1 of the user dictionary, the dictionary collation unit 11 identifies registered word 1 as the vocabulary most similar to the phoneme label string of the input speech, and the recognition result output unit 13 outputs registered word 1 as the recognition result.
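The collation step (ST5b) can be sketched as a nearest-match search over both dictionaries. The text does not specify the similarity measure, so the sketch below assumes a plain edit distance over phoneme labels; the function names, the dictionary layout, and the general dictionary entries are hypothetical and for illustration only.

```python
from typing import Dict, List, Tuple

Phonemes = List[str]


def edit_distance(a: Phonemes, b: Phonemes) -> int:
    """Levenshtein distance between two phoneme label sequences."""
    prev = list(range(len(b) + 1))
    for i, label_a in enumerate(a, start=1):
        curr = [i]
        for j, label_b in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if label_a == label_b else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def most_similar(
    input_labels: Phonemes,
    general_dictionary: Dict[str, Phonemes],
    user_dictionary: Dict[str, Phonemes],
) -> Tuple[str, int]:
    """Return the entry (from either dictionary) closest to the input labels."""
    candidates = {**general_dictionary, **user_dictionary}
    return min(
        ((word, edit_distance(input_labels, labels))
         for word, labels in candidates.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical general dictionary entries (labels are illustrative only).
general_dictionary = {
    "Osaka": ["#", "o", "o", "s", "a", "k", "a", "#"],
    "Kyoto": ["#", "k", "y", "o", "o", "t", "o", "#"],
}
# Registered word 1, as given in the text.
user_dictionary = {"registered word 1": ["#", "m", "a", "i", "k", "r", "#"]}

input_labels = ["#", "m", "a", "i", "k", "r", "#"]
best_word, best_distance = most_similar(
    input_labels, general_dictionary, user_dictionary
)
```

Under this assumed measure, an input that exactly reproduces the registered label string yields registered word 1 at distance 0. A real recognizer would more likely score acoustic likelihoods than count label edits.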

As described above, Embodiment 2 includes, in addition to the configuration of the recognition dictionary creation device of Embodiment 1: a general dictionary storage unit that stores the general dictionary 12; the dictionary collation unit 11, which collates the phoneme label string of the input speech created by the acoustic data matching unit 5 against the general dictionary 12 and the user dictionary and identifies, from the general dictionary 12 and the user dictionary, the vocabulary most similar to the phoneme label string of the input speech; and the recognition result output unit 13, which outputs the vocabulary identified by the dictionary collation unit 11 as the speech recognition result. Therefore, in addition to the effects of Embodiment 1, a speech recognition device 1A that performs speech recognition using the user dictionary can be provided.

Embodiment 3.
FIG. 6 is a block diagram showing the configuration of a speech synthesis device according to Embodiment 3 of the present invention, and shows a speech synthesis device that uses the recognition dictionary creation device according to Embodiment 1. In FIG. 6, the speech synthesis device 1B according to Embodiment 3 includes, in addition to the configuration of the recognition dictionary creation device 1 shown in Embodiment 1 and the general dictionary 12 expressed in the acoustic standard pattern of each language shown in Embodiment 2, a text input unit 14, a registered word part detection unit 15, a registered word part phoneme label string replacement unit (registered vocabulary replacement unit) 16, an other-part phoneme label string replacement unit (general dictionary replacement unit) 17, and a speech synthesis unit 18. In FIG. 6, components that operate in the same or a similar manner as those in FIGS. 1 and 4 are given the same reference numerals, and their description is omitted.

The text input unit 14 is a component for inputting text to be converted into speech. The registered word part detection unit 15 is a component that detects registered words of the user dictionary in the input text received from the text input unit 14. The registered word part phoneme label string replacement unit 16 is a component that replaces each registered word detected by the registered word part detection unit 15 with the phoneme label string retrieved from the user dictionary. The other-part phoneme label string replacement unit 17 is a component that receives, via the registered word part phoneme label string replacement unit 16, the portion of the input text other than the registered words detected by the registered word part detection unit 15, and replaces the words in that portion with phoneme label strings retrieved from the general dictionary 12 expressed in the acoustic standard pattern of the set language. The speech synthesis unit 18 is a component that generates synthesized speech of the input text from the phoneme label string of the input text obtained by the phoneme label string replacement units 16 and 17.

Note that the text input unit 14, the registered word part detection unit 15, the registered word part phoneme label string replacement unit 16, the other-part phoneme label string replacement unit 17, and the speech synthesis unit 18 can be realized on a computer as concrete means in which hardware and software cooperate, by storing a speech synthesis program according to the gist of the present invention in the computer and having the CPU execute it. Furthermore, the storage areas used for the acoustic standard pattern 4 and the general dictionary 12 are built on a storage device installed in the computer, such as a hard disk device or an external storage medium.

Next, the operation will be described.
FIG. 7 is a flowchart showing the flow of operation of the speech synthesis device according to Embodiment 3.
The user inputs the text to be converted into speech using the text input unit 14 (step ST1c). At this time, an identifier that marks registered words of the user dictionary is set. For example, as shown in FIG. 7, when registered word 1 of the user dictionary is entered as text, double parentheses, which serve as the registered word identifier, are placed before and after registered word 1.

The registered word part detection unit 15 receives the input text from the text input unit 14 and detects registered words using the registered word identifier set in the input text (step ST2c). In the example of FIG. 7, registered word 1, which is enclosed in double parentheses, is detected.
Next, the registered word part phoneme label string replacement unit 16 replaces the registered word detected by the registered word part detection unit 15 with the phoneme label string retrieved from the user dictionary (step ST3c). As a result, registered word 1 is replaced with its corresponding phoneme label string, "#, /m/, /a/, /i/, /k/, /r/, #".

The other-part phoneme label string replacement unit 17 receives, via the registered word part phoneme label string replacement unit 16, the portion of the input text other than the registered word detected by the registered word part detection unit 15, and replaces the words in that portion with phoneme label strings retrieved from the general dictionary 12 of the set language (step ST4c). Here, the set language is assumed to be Japanese, and the parts of the input text other than the registered word, namely the particle "wa" (は), the noun "Osaka" (大阪), the particle "ni" (に), and the verb "itta" (いった), are each replaced with the corresponding phoneme label string registered in the Japanese general dictionary 12, as shown in FIG. 7.

The speech synthesis unit 18 generates synthesized speech of the input text from the phoneme label string of the input text obtained by the registered word part phoneme label string replacement unit 16 and the other-part phoneme label string replacement unit 17 (step ST5c). In the example of FIG. 7, the synthesized speech "Maikuru wa Osaka ni itta" (マイクルは大阪に行った, "Michael went to Osaka") is output. Here, the parts other than registered word 1 are uttered with Japanese phoneme labels, whereas registered word 1, "Maikuru" (マイクル, "Michael"), is uttered in an English-like manner because, as shown in Embodiment 1, the set language at the time it was registered in the user dictionary was English.
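Steps ST2c to ST4c can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the double-parenthesis identifier is written as `((...))`, that the user dictionary is keyed by the text inside the marker, and that the remaining text is segmented by greedy longest match against the general dictionary. All names and the general dictionary labels are hypothetical, and the `#` boundary markers are omitted for brevity.

```python
import re
from typing import Dict, List

# Assumed textual form of the registered word identifier: ((word))
MARKER = re.compile(r"\(\((.+?)\)\)")


def lookup_general(segment: str, general_dictionary: Dict[str, List[str]]) -> List[str]:
    """Replace unmarked text with general dictionary labels (greedy longest match)."""
    labels: List[str] = []
    i = 0
    while i < len(segment):
        for j in range(len(segment), i, -1):
            word = segment[i:j]
            if word in general_dictionary:
                labels += general_dictionary[word]
                i = j
                break
        else:
            raise KeyError(f"no general dictionary entry at: {segment[i:]!r}")
    return labels


def to_phoneme_labels(
    text: str,
    user_dictionary: Dict[str, List[str]],
    general_dictionary: Dict[str, List[str]],
) -> List[str]:
    """Detect marked registered words, then replace every part with phoneme labels."""
    labels: List[str] = []
    pos = 0
    for match in MARKER.finditer(text):
        labels += lookup_general(text[pos:match.start()], general_dictionary)
        labels += user_dictionary[match.group(1)]  # registered word part
        pos = match.end()
    labels += lookup_general(text[pos:], general_dictionary)
    return labels


user_dictionary = {"Michael": ["m", "a", "i", "k", "r"]}
general_dictionary = {  # illustrative labels only
    "は": ["w", "a"],
    "大阪": ["o", "o", "s", "a", "k", "a"],
    "に": ["n", "i"],
    "いった": ["i", "q", "t", "a"],
}

labels = to_phoneme_labels("((Michael))は大阪にいった", user_dictionary, general_dictionary)
```

The resulting label list would then be handed to the synthesis stage (ST5c), which is outside the scope of this sketch.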

As described above, Embodiment 3 includes, in addition to the configuration of the recognition dictionary creation device of Embodiment 1: the text input unit 14 for inputting text; the registered word part detection unit 15, which detects, in the character string of the text input from the text input unit 14, the vocabulary part corresponding to a phoneme label string registered in the user dictionary; the registered word part phoneme label string replacement unit 16, which replaces the vocabulary part detected by the registered word part detection unit 15 with the corresponding phoneme label string acquired from the user dictionary; the other-part phoneme label string replacement unit 17, which replaces the part of the character string other than the vocabulary part detected by the registered word part detection unit 15 with the corresponding phoneme label strings of the general dictionary 12; and the speech synthesis unit 18, which generates synthesized speech of the text from the phoneme label string of the text obtained by the registered word part phoneme label string replacement unit 16 and the other-part phoneme label string replacement unit 17.
With this configuration, in addition to the effects of Embodiment 1, a speech synthesis device 1B that performs speech synthesis using the user dictionary can be provided.

Embodiment 4.
FIG. 8 is a block diagram showing the configuration of a recognition dictionary creation device according to Embodiment 4 of the present invention. In FIG. 8, the recognition dictionary creation device 1a of Embodiment 4 includes a registration-time acoustic pattern setting unit 19 in place of the user dictionary creation language storage unit 7 of Embodiment 1. The registration-time acoustic pattern setting unit 19 is a component that sets a predetermined language, registered in it in advance, as the language of the acoustic standard pattern 4 used in the processing of the acoustic data matching unit 5, regardless of the set language of the speech recognition device or speech synthesis device that uses the recognition dictionary creation device 1a. This predetermined language is registered in the registration-time acoustic pattern setting unit 19 in advance, independently of the set language. In FIG. 8, components that operate in the same or a similar manner as those shown in FIG. 1 are given the same reference numerals, and their description is omitted.

Next, the operation will be described.
FIG. 9 is a flowchart showing the flow of the user dictionary registration operation performed by the recognition dictionary creation device of Embodiment 4.
After instructing the start of user dictionary creation using the input device (step ST1d), the user utters the vocabulary to be registered. For example, assume that the personal name "Michael" is uttered. The voice capturing unit 2 captures the speech uttered by the user via the microphone 2a, converts this input speech into a digital signal, and outputs it to the acoustic analysis unit 3 (step ST2d).

Subsequently, the registration-time acoustic pattern setting unit 19 sets, in the acoustic data matching unit 5, the predetermined language registered in it in advance, instead of the set language of the system (step ST3d). In the example of FIG. 9, the predetermined language is English. The acoustic analysis unit 3 acoustically analyzes the speech signal input from the voice capturing unit 2 in step ST2d and converts it into a time series of acoustic features (step ST4d).

The acoustic data matching unit 5 reads the acoustic standard pattern 4 corresponding to the predetermined language set by the registration-time acoustic pattern setting unit 19, and creates the optimal phoneme label string representing the input speech from the acoustic standard pattern 4 of that language and the time series of acoustic features of the input speech obtained by the acoustic analysis unit 3 (step ST5d). When the input speech is "Michael" and the predetermined language is English, the phoneme label string "#, /m/, /a/, /i/, /k/, /l/, #" is obtained, as shown in FIG. 9.

The user dictionary registration unit 6 registers the phoneme label string of the input speech created by the acoustic data matching unit 5 in the user dictionary (step ST6d).
Next, based on the inter-language acoustic data mapping table read from the inter-language acoustic data mapping table storage unit 10, the phoneme label string conversion unit 9 associates the phoneme label string of the predetermined language, obtained as described above for the input speech (registered vocabulary), with the phoneme labels of the set language currently set in the system. It converts the phoneme label string of the registered vocabulary, which was registered in the user dictionary in the predetermined language, into a phoneme label string of the set language, and registers the result in the user dictionary registration unit 6 as the current user dictionary (step ST7d).

Next, the operation when the set language is switched will be described.
FIG. 10 is a flowchart showing the flow of the user dictionary registration operation after language switching performed by the recognition dictionary creation device of Embodiment 4, for the case where the language is switched after the user dictionary registration shown in FIG. 9 has been executed.
When the user specifies a new language to the language switching unit 8 using the input device, the language switching unit 8 sets the switched-to language in the phoneme label string conversion unit 9 (step ST1e). Here, it is assumed that the language has been switched to Japanese.

Using the switched-to language specified by the language switching unit 8 and the predetermined language, the phoneme label string conversion unit 9 searches the inter-language acoustic data mapping table storage unit 10 and reads the inter-language acoustic data mapping table corresponding to the predetermined language used at user dictionary registration and the switched-to language. Based on this table, it converts the phoneme label string of the predetermined language registered in the user dictionary into a phoneme label string of the switched-to language (step ST2e).
For example, the phoneme label string "#, /m/, /a/, /i/, /k/, /l/, #" of "Michael" in English, the predetermined language, is converted into the Japanese phoneme label string "#, /m/, /a/, /i/, /k/, /r/, #" based on the correspondences in the inter-language acoustic data mapping table between English and Japanese, the switched-to language.
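The conversion in step ST2e amounts to a per-label lookup in the table for the (source language, target language) pair. The following sketch assumes the table is a simple one-to-one label map and that labels without an entry pass through unchanged; both are assumptions, as are the names and language codes used. Only the /l/ to /r/ correspondence is taken from the example in the text.

```python
from typing import Dict, List, Tuple

MappingTable = Dict[str, str]

# One table per (source language, target language) pair.
# Only the l -> r entry comes from the example in the text.
mapping_tables: Dict[Tuple[str, str], MappingTable] = {
    ("en", "ja"): {"l": "r"},
}


def convert_labels(
    labels: List[str],
    source_language: str,
    target_language: str,
    tables: Dict[Tuple[str, str], MappingTable],
) -> List[str]:
    """Convert a phoneme label string from one language's label set to another's."""
    if source_language == target_language:
        return list(labels)
    table = tables[(source_language, target_language)]
    # Assumption: labels with no table entry are shared and kept as they are.
    return [table.get(label, label) for label in labels]


english_michael = ["#", "m", "a", "i", "k", "l", "#"]
japanese_michael = convert_labels(english_michael, "en", "ja", mapping_tables)
```

With the single table entry above, `japanese_michael` matches the Japanese label string given in the text. A realistic table would also need to handle labels that map to several target labels or to none.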

The user dictionary registration unit 6 additionally stores, in the user dictionary, the phoneme label string converted by the phoneme label string conversion unit 9 in step ST2e (step ST3e). In FIG. 10, the registered vocabulary text is "Michael" and the switched-to language is Japanese, so the Japanese phoneme label string "#, /m/, /a/, /i/, /k/, /r/, #" is stored as the registered word.

As described above, Embodiment 4 includes the user dictionary in which the phoneme label string of the input speech is registered, the inter-language acoustic data mapping table in which the correspondences between phoneme labels of different languages are defined, and the registration-time acoustic pattern setting unit 19, which selects the acoustic standard pattern of a preset language from among the acoustic standard patterns. With reference to the inter-language acoustic data mapping table, the phoneme label string registered in the user dictionary is converted from a phoneme label string of the language selected by the registration-time acoustic pattern setting unit 19 into a phoneme label string of the switched-to language.
In Embodiment 1, when N languages can be set as the target language of the vocabulary registered in the user dictionary, inter-language acoustic data mapping tables are needed for all (N×(N−1))/2 combinations of the language at the time of registration and the settable languages. With the configuration of Embodiment 4, it suffices to provide inter-language acoustic data mapping tables for the (N−1) combinations of the single predetermined language set by the registration-time acoustic pattern setting unit 19 and the other settable languages, so the data size of the inter-language acoustic data mapping tables can be reduced.
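As a quick check of the two table counts stated above, the short sketch below evaluates both expressions for a given number of settable languages N. The function names are hypothetical.

```python
def tables_for_all_pairs(n_languages: int) -> int:
    """Embodiment 1: one table per unordered pair of languages, N(N-1)/2."""
    return n_languages * (n_languages - 1) // 2


def tables_for_fixed_registration_language(n_languages: int) -> int:
    """Embodiment 4: one table per language other than the predetermined one, N-1."""
    return n_languages - 1


n = 5
all_pairs = tables_for_all_pairs(n)                          # 10 tables
fixed_language = tables_for_fixed_registration_language(n)   # 4 tables
```

For N = 5 this gives 10 tables versus 4, and the gap widens quadratically as N grows.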

Embodiments 2 and 3 describe the case where the speech recognition device and the speech synthesis device are configured using the recognition dictionary creation device 1 according to Embodiment 1. However, in the configurations shown in FIGS. 4 and 6, the speech recognition device and the speech synthesis device may instead be configured with the recognition dictionary creation device 1a according to Embodiment 4, shown in FIG. 8, in place of the recognition dictionary creation device according to Embodiment 1. This makes it possible to provide a speech recognition device and a speech synthesis device that also obtain the effects of Embodiment 4.

The recognition dictionary creation device according to the present invention does not require a large-capacity memory for storing uttered speech, does not need to create phoneme label strings for all languages in advance, and can shorten the time needed to create the phoneme label string for each language. It is therefore suitable for speech recognition devices and speech synthesis devices in in-vehicle equipment.

Claims

An acoustic analysis unit that acoustically analyzes a voice signal of an input voice and outputs a time series of acoustic features;
An acoustic standard pattern storage unit for storing acoustic standard patterns indicating standard acoustic features for each language;
An acoustic data matching unit that creates a phoneme label sequence of the input speech by collating the time series of acoustic features of the input speech input from the acoustic analysis unit against the acoustic standard pattern stored in the acoustic standard pattern storage unit;
A user dictionary storage unit that stores a user dictionary in which the phoneme label sequence of the input speech created by the acoustic data matching unit is registered;
A language storage unit for storing the language of the phoneme label sequence registered in the user dictionary;
A language switching unit for switching languages;
A mapping table storage unit that stores a mapping table in which correspondences between phoneme labels between languages are defined;
A recognition dictionary creation device comprising the above units and a phoneme label string conversion unit that, with reference to the mapping table stored in the mapping table storage unit, converts the phoneme label sequence registered in the user dictionary from the phoneme label sequence of the language stored in the language storage unit into a phoneme label sequence of the language switched to by the language switching unit.

An acoustic analysis unit that acoustically analyzes a voice signal of an input voice and outputs a time series of acoustic features;
An acoustic standard pattern storage unit for storing acoustic standard patterns indicating standard acoustic features for each language;
An acoustic data matching unit that creates a phoneme label sequence of the input speech by collating the time series of acoustic features of the input speech input from the acoustic analysis unit against the acoustic standard pattern stored in the acoustic standard pattern storage unit;
A user dictionary storage unit that stores a user dictionary in which the phoneme label sequence of the input speech created by the acoustic data matching unit is registered;
A language storage unit for storing the language of the phoneme label sequence registered in the user dictionary;
A language switching unit for switching languages;
A mapping table storage unit that stores a mapping table in which correspondences between phoneme labels between languages are defined;
A phoneme label string conversion unit that, with reference to the mapping table stored in the mapping table storage unit, converts the phoneme label sequence registered in the user dictionary from the phoneme label sequence of the language stored in the language storage unit into a phoneme label sequence of the language switched to by the language switching unit;
A general dictionary storage unit for storing a general dictionary of vocabulary expressed in the acoustic standard pattern;
A dictionary collation unit that collates the phoneme label sequence of the input speech created by the acoustic data matching unit against the general dictionary and the user dictionary, and identifies, from the general dictionary and the user dictionary, the vocabulary most similar to the phoneme label sequence of the input speech;
A speech recognition apparatus comprising: a recognition result output unit that outputs the vocabulary specified by the dictionary collation unit as a speech recognition result.

An acoustic analysis unit that acoustically analyzes a voice signal of an input voice and outputs a time series of acoustic features;
An acoustic standard pattern storage unit for storing acoustic standard patterns indicating standard acoustic features for each language;
An acoustic data matching unit that creates a phoneme label sequence of the input speech by collating the time series of acoustic features of the input speech input from the acoustic analysis unit against the acoustic standard pattern stored in the acoustic standard pattern storage unit;
A user dictionary storage unit that stores a user dictionary in which the phoneme label sequence of the input speech created by the acoustic data matching unit is registered;
A language storage unit for storing the language of the phoneme label sequence registered in the user dictionary;
A language switching unit for switching languages;
A mapping table storage unit that stores a mapping table in which correspondences between phoneme labels between languages are defined;
A phoneme label string conversion unit that, with reference to the mapping table stored in the mapping table storage unit, converts the phoneme label sequence registered in the user dictionary from the phoneme label sequence of the language stored in the language storage unit into a phoneme label sequence of the language switched to by the language switching unit;
A text input unit for inputting text;
A registered word part detection unit for detecting a vocabulary part corresponding to a phoneme label string registered in the user dictionary from a character string of text input from the text input unit;
A registered vocabulary replacement unit that replaces the vocabulary part detected by the registered word part detection unit with a phoneme label string corresponding to the vocabulary part acquired from the user dictionary;
A general dictionary replacement unit that replaces a part of the text string other than the vocabulary part detected by the registered word part detection unit with a phoneme label string of a corresponding vocabulary of the general dictionary;
A speech synthesis device comprising: a speech synthesis unit that generates synthesized speech of the text from the phoneme label string of the text obtained by the registered vocabulary replacement unit and the general dictionary replacement unit.

An acoustic analysis unit that acoustically analyzes a voice signal of an input voice and outputs a time series of acoustic features;
An acoustic standard pattern storage unit for storing acoustic standard patterns indicating standard acoustic features for each language;
An acoustic standard pattern setting unit for selecting an acoustic standard pattern in a preset language from the acoustic standard patterns stored in the acoustic standard pattern storage unit;
An acoustic data matching unit that creates a phoneme label sequence of the input speech by collating the time series of acoustic features of the input speech input from the acoustic analysis unit against the acoustic standard pattern of the language selected by the acoustic standard pattern setting unit;
A user dictionary storage unit that stores a user dictionary in which the phoneme label sequence of the input speech created by the acoustic data matching unit is registered;
A language switching unit for switching languages;
A mapping table storage unit that stores a mapping table in which correspondences between phoneme labels between languages are defined;
A recognition dictionary creation device comprising the above units and a phoneme label string conversion unit that, with reference to the mapping table stored in the mapping table storage unit, converts the phoneme label sequence registered in the user dictionary from the phoneme label sequence of the language selected by the acoustic standard pattern setting unit into a phoneme label sequence of the language switched to by the language switching unit.

An acoustic analysis unit that acoustically analyzes a voice signal of an input voice and outputs a time series of acoustic features;
An acoustic standard pattern storage unit for storing acoustic standard patterns indicating standard acoustic features for each language;
An acoustic standard pattern setting unit for selecting an acoustic standard pattern in a preset language from the acoustic standard patterns stored in the acoustic standard pattern storage unit;
An acoustic data matching unit that creates a phoneme label sequence of the input speech by collating the time series of acoustic features of the input speech input from the acoustic analysis unit against the acoustic standard pattern of the language selected by the acoustic standard pattern setting unit;
A user dictionary storage unit that stores a user dictionary in which the phoneme label sequence of the input speech created by the acoustic data matching unit is registered;
A language switching unit for switching languages;
A mapping table storage unit that stores a mapping table in which correspondences between phoneme labels between languages are defined;
A phoneme label string conversion unit that, with reference to the mapping table stored in the mapping table storage unit, converts the phoneme label sequence registered in the user dictionary from the phoneme label sequence of the language selected by the acoustic standard pattern setting unit into a phoneme label sequence of the language switched to by the language switching unit;
A general dictionary storage unit for storing a general dictionary of vocabulary expressed in the acoustic standard pattern;
A dictionary collation unit that collates the phoneme label sequence of the input speech created by the acoustic data matching unit against the general dictionary and the user dictionary, and identifies, from the general dictionary and the user dictionary, the vocabulary most similar to the phoneme label sequence of the input speech;
A speech recognition apparatus comprising: a recognition result output unit that outputs the vocabulary specified by the dictionary collation unit as a speech recognition result.

An acoustic analysis unit that acoustically analyzes a voice signal of an input voice and outputs a time series of acoustic features;
An acoustic standard pattern storage unit for storing acoustic standard patterns indicating standard acoustic features for each language;
An acoustic standard pattern setting unit for selecting an acoustic standard pattern in a preset language from the acoustic standard patterns stored in the acoustic standard pattern storage unit;
An acoustic data matching unit that generates a phoneme label string of the input speech by collating the time series of acoustic features of the input speech, received from the acoustic analysis unit, with the acoustic standard pattern of the language selected by the acoustic standard pattern setting unit;
A user dictionary storage unit that stores a user dictionary in which the phoneme label string of the input speech created by the acoustic data matching unit is registered;
A language switching unit for switching languages;
A mapping table storage unit that stores a mapping table defining correspondences between the phoneme labels of different languages;
A phoneme label string conversion unit that, with reference to the mapping table stored in the mapping table storage unit, converts the phoneme label string registered in the user dictionary from a phoneme label string of the language selected by the acoustic standard pattern setting unit into a phoneme label string of the language switched to by the language switching unit;
A text input unit for entering text;
A registered word part detection unit for detecting a vocabulary part corresponding to a phoneme label string registered in the user dictionary from a character string of text input from the text input unit;
A registered vocabulary replacement unit that replaces the vocabulary part detected by the registered word part detection unit with a phoneme label string corresponding to the vocabulary part acquired from the user dictionary;
A general dictionary replacement unit that replaces the part of the character string of the text other than the vocabulary part detected by the registered word part detection unit with the phoneme label string of the corresponding vocabulary in the general dictionary;
A speech synthesis device comprising the above units and a speech synthesis unit that generates synthesized speech of the text from the phoneme label string of the text obtained by the registered vocabulary replacement unit and the general dictionary replacement unit.
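
The detection and replacement steps recited above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: whitespace tokenization, dictionary entries keyed by a written word, the function name `text_to_labels`, and the sample phoneme labels are all hypothetical.

```python
def text_to_labels(text, user_dict, general_dict):
    """Turn input text into one phoneme label string for a synthesizer.

    Tokens found in the user dictionary are treated as registered word parts
    and replaced with their registered labels; all other tokens are replaced
    with labels from the general dictionary.
    """
    labels = []
    for token in text.split():  # assumption: words are separated by whitespace
        if token in user_dict:  # registered word part detection
            labels.extend(user_dict[token])  # registered vocabulary replacement
        elif token in general_dict:
            labels.extend(general_dict[token])  # general dictionary replacement
        else:
            raise KeyError(f"no phoneme labels available for {token!r}")
    return labels


# Usage with invented entries: "Michael" stands for a user-registered word.
user_dict = {"Michael": ["m", "a", "i", "k", "e", "r"]}
general_dict = {"call": ["k", "o", "o", "r"]}
label_string = text_to_labels("call Michael", user_dict, general_dict)
```

The user dictionary is checked first so that a registered entry takes precedence over a general dictionary entry with the same spelling; that ordering is a design choice of this sketch.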