JP6376486B2 - Acoustic model generation apparatus, acoustic model generation method, and program

Info

Publication number: JP6376486B2
Application number: JP2013171321A
Authority: JP
Inventors: 雅弘西光; 繁樹松田; 智織堀; 亮輔磯谷; 健花沢
Original assignee: NEC Corp; National Institute of Information and Communications Technology
Current assignee: NEC Corp; National Institute of Information and Communications Technology
Priority date: 2013-08-21
Filing date: 2013-08-21
Publication date: 2018-08-22
Anticipated expiration: 2033-08-21
Also published as: JP2015040946A

Description

The present invention relates to, for example, an acoustic model generation apparatus that generates an acoustic model used for speech recognition processing.

Conventionally, there has been an acoustic model generation apparatus that can perform gender identification without being affected by non-speech sections included in the data to be identified (see, for example, Patent Document 1).

Further, in conventional speech recognition systems, recognition accuracy deteriorated because of a mismatch between the speech used to generate the acoustic model and the speech uttered to the speech recognition system. That is, the speech that could be prepared for generating an acoustic model was limited to, for example, speech recorded in a recording room or speech read aloud from a manuscript. Acoustic model adaptation techniques and the like existed to address this problem.

Acoustic model adaptation is a technique that uses speech accumulated in a speech recognition system (also called "speech uttered in the utterance environment") to perform adaptation processing on an acoustic model generated from speech recorded in a recording room or the like, thereby generating an acoustic model that can be used for speech recognition in diverse environments (see, for example, Non-Patent Document 1). The speech accumulated in the speech recognition system is diverse and includes, for example, speech mixed with background noise and speech in different utterance styles. Adaptation processing is a known technique, so a detailed description is omitted.

Patent Document 1: JP 2013-57789 A (page 1, FIG. 1, etc.)

Non-Patent Document 1: Isotani et al., "Nationwide speech translation field experiment and model adaptation of speech recognition using actual usage data", IEICE Transactions D, Vol. J96-D, No. 1, pp. 209-220

However, conventional acoustic model adaptation requires speech uttered in the utterance environment. For a language with no accumulated speech, this approach cannot be used, and an acoustic model that improves speech recognition accuracy could not be generated.

The present invention has been made to solve this problem. Its object is to generate an acoustic model that can improve speech recognition accuracy even for a language for which no speech data from a suitable environment, such as the utterance environment, exists.

The acoustic model generation apparatus of the first invention is an acoustic model generation apparatus that generates an acoustic model of a target language for speech recognition, comprising: a target language new acoustic model storage unit capable of storing a target language new acoustic model, which is a second acoustic model of the target language; an acoustic model generation unit that generates the target language new acoustic model from a target language old acoustic model or from one or more other language new acoustic models, using one or more pieces of correlation information from among (a) one or more pieces of first correlation information, which is information on the relationship between one or more other language old acoustic models, each being a first acoustic model of one of one or more other languages different from the target language, and one or more other language new acoustic models, each being a second acoustic model of one of the other languages, and (b) one or more pieces of second correlation information, which is information on the relationship between the one or more other language old acoustic models and the target language old acoustic model, which is a first acoustic model of the target language; and an acoustic model storage unit that accumulates the target language new acoustic model generated by the acoustic model generation unit in the target language new acoustic model storage unit.

With this configuration, an acoustic model that corresponds to the language and improves speech recognition accuracy can be generated even for a language for which no speech data from a suitable environment, such as the utterance environment, exists.

In the acoustic model generation apparatus of the second invention, relative to the first invention, the acoustic model generation unit comprises: a target language old acoustic model storage unit capable of storing the target language old acoustic model; a first correlation information storage unit capable of storing one or more pieces of first correlation information; and acoustic model generation means for generating the target language new acoustic model from the target language old acoustic model stored in the target language old acoustic model storage unit, using the one or more pieces of first correlation information.

With this configuration, by using the correlation between the other language old acoustic model and the other language new acoustic model, an acoustic model that corresponds to the language and improves speech recognition accuracy can be generated even for a language for which no speech data from a suitable environment, such as the utterance environment, exists.

In the acoustic model generation apparatus of the third invention, relative to the first invention, the acoustic model generation unit comprises: an other language new acoustic model storage unit capable of storing one or more other language new acoustic models; a second correlation information storage unit capable of storing one or more pieces of second correlation information; and acoustic model generation means for generating the target language new acoustic model from the one or more other language new acoustic models stored in the other language new acoustic model storage unit, using the one or more pieces of second correlation information.

With this configuration, by using the correlation between the other language old acoustic model and the target language new acoustic model, an acoustic model that corresponds to the language and improves speech recognition accuracy can be generated even for a language for which no speech data from a suitable environment, such as the utterance environment, exists.

In the acoustic model generation apparatus of the fourth invention, relative to the first invention, the acoustic model generation unit comprises: a target language old acoustic model storage unit capable of storing the target language old acoustic model; an other language new acoustic model storage unit capable of storing one or more other language new acoustic models; a first correlation information storage unit capable of storing one or more pieces of first correlation information; a second correlation information storage unit capable of storing one or more pieces of second correlation information; and acoustic model generation means for generating the target language new acoustic model from the target language old acoustic model, from the one or more other language new acoustic models, or from both, using the one or more pieces of first correlation information and the one or more pieces of second correlation information.

With this configuration, by using the correlation between the other language old acoustic model and the other language new acoustic model, and the correlation between the other language old acoustic model and the target language old acoustic model, an acoustic model that corresponds to the language and improves speech recognition accuracy can be generated even for a language for which no speech data from a suitable environment, such as the utterance environment, exists.

In the acoustic model generation apparatus of the fifth invention, relative to the fourth invention, the acoustic model generation unit further comprises selection means for selecting one of two or more algorithms for generating the target language new acoustic model, according to the data held by the target language old acoustic model or by the one or more other language new acoustic models. The acoustic model generation means generates the target language new acoustic model according to the algorithm selected by the selection means, using one or more pieces of correlation information from among the first correlation information and the second correlation information.

With this configuration, by using the correlation between the other language old acoustic model and the other language new acoustic model, and the correlation between the other language old acoustic model and the target language old acoustic model, with an algorithm suited to the data in question, an acoustic model that corresponds to the language and improves speech recognition accuracy can be generated even for a language for which no speech data from a suitable environment, such as the utterance environment, exists.

In the acoustic model generation apparatus of the sixth invention, relative to any of the first to fifth inventions: the other language old acoustic model is an acoustic model of the other language before adaptation processing, or an acoustic model generated from data different from that of the other language new acoustic model; the other language new acoustic model is an acoustic model of the other language after adaptation processing, or an acoustic model generated from data different from that of the other language old acoustic model; the target language old acoustic model is an acoustic model of the target language before adaptation processing, or an acoustic model generated from data similar to that of the other language old acoustic model; and the target language new acoustic model is an acoustic model of the target language after adaptation processing.

In the acoustic model generation apparatus of the seventh invention, relative to the sixth invention, the first correlation information is information obtained from one or more conversion functions, each being the difference between a vector corresponding to one of the one or more other language old acoustic models and a vector corresponding to one of the one or more other language new acoustic models, and the second correlation information is information obtained from one or more conversion functions, each being the difference between a vector corresponding to one of the one or more other language old acoustic models and the vector corresponding to the target language old acoustic model.

In the acoustic model generation apparatus of the eighth invention, relative to the seventh invention, the acoustic model generation unit generates the target language new acoustic model by mapping the vector corresponding to the target language old acoustic model using the conversion function of the first correlation information, or generates an other language new acoustic model by mapping from the one or more other language new acoustic models stored in the other language new acoustic model storage unit using the conversion function of the second correlation information.

According to the acoustic model generation apparatus of the present invention, an acoustic model that corresponds to the language and improves speech recognition accuracy can be generated even for a language for which no speech data from a suitable environment, such as the utterance environment, exists.

Block diagram of the acoustic model generation apparatus 1 in Embodiment 1 of the present invention
Flowchart explaining the operation of the acoustic model generation apparatus 1
Conceptual diagram explaining the operation of the acoustic model generation apparatus 1
Diagram briefly explaining the processing of the acoustic model generation unit 12
Diagram showing the concept of the processing of the acoustic model generation unit 12
Diagram showing the amount of test data used to evaluate each language
Diagram showing the total amount of data used for training and adaptation processing in each language
Diagram showing the experimental results
Diagram showing experimental results for adaptation processing and the like
Diagram showing the experimental results
Block diagram of the acoustic model generation apparatus 2 in Embodiment 2 of the present invention
Flowchart explaining the operation of the acoustic model generation apparatus 2
Conceptual diagram explaining the operation of the acoustic model generation apparatus 2
Diagram explaining a specific operation of the acoustic model generation apparatus 2
Block diagram of the acoustic model generation apparatus 3 in Embodiment 3 of the present invention
Flowchart explaining the operation of the acoustic model generation apparatus 3
Diagram showing the selection information management table
Overview of a computer system that implements the acoustic model generation apparatus of the present invention
Block diagram of the computer system

Hereinafter, embodiments of an acoustic model generation apparatus and the like are described with reference to the drawings. Components given the same reference numerals in the embodiments perform the same operations, so repeated description of them may be omitted.

(Embodiment 1)
The present embodiment describes an acoustic model generation apparatus that uses an other language old acoustic model, which is a first acoustic model of another language, and an other language new acoustic model, which is a second acoustic model of that other language, to generate a target language new acoustic model, which is a second acoustic model of the target language, from a target language old acoustic model, which is a first acoustic model of the target language. The target language is the language of the acoustic model that the acoustic model generation apparatus generates, and is the language of the speech recognized using that acoustic model. The target language is synonymous with "TargetLanguage" described later, and the other language is synonymous with "SourceLanguage" described later. An other language is a language different from the target language of the acoustic model to be generated. There may be one other language or two or more.

The other language old acoustic model is, for example, an acoustic model of the other language before adaptation processing is applied. It may also be an acoustic model generated from data different from that of the other language new acoustic model. The other language new acoustic model is, for example, an acoustic model of the other language after adaptation processing is applied. It may also be, for example, an acoustic model generated from data different from that of the other language old acoustic model. The target language old acoustic model is, for example, an acoustic model of the target language before adaptation processing is applied. It may also be, for example, an acoustic model generated from data similar to that of the other language old acoustic model. The target language new acoustic model is, for example, an acoustic model of the target language after adaptation processing is applied. Here, adaptation processing normally means processing that converts the parameters of an acoustic model used for speech recognition of one language, using accumulated speech of that language. The accumulated speech is preferably speech of that language accumulated in the environment in which speech recognition is performed. Adaptation processing is a conventional technique described in Non-Patent Document 1 and elsewhere, so a detailed description is omitted.

Alternatively, the other language old acoustic model may be, for example, an acoustic model of read speech in the other language, and the other language new acoustic model may be, for example, an acoustic model of spontaneous spoken speech in the other language. Likewise, the target language old acoustic model may be, for example, an acoustic model of read speech in the target language, and the target language new acoustic model may be, for example, an acoustic model of spontaneous spoken speech in the target language.

More specifically, the present embodiment describes an acoustic model generation apparatus that generates the target language new acoustic model from the target language old acoustic model, using one or more pieces of first correlation information indicating the correlation between one or more other language old acoustic models and one or more other language new acoustic models.

An acoustic model is a model of the acoustic features of the speech to be recognized. For example, it uses a hidden Markov model (HMM) and represents the output probability distribution of each HMM state as a Gaussian mixture model (GMM). The information (parameters) held by the acoustic model includes, for example, the state transition probabilities between the HMM states of each symbol such as a phoneme, and the means and variances of the Gaussian distributions in the GMM of each state. In speech recognition, feature vectors of tens to hundreds of dimensions obtained by frequency analysis of the speech are commonly used, so the mean and variance of each Gaussian distribution are vectors of tens to hundreds of dimensions.
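The parameter layout described above can be pictured with a small data structure. The following is a minimal sketch under stated assumptions, not the apparatus's actual representation: the class names `Gaussian`, `HmmState`, and `AcousticModel`, the diagonal-covariance layout, and the `mean_vector` helper that concatenates all Gaussian means into a single vector are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Gaussian:
    weight: float          # mixture weight within the state's GMM
    mean: List[float]      # one entry per feature dimension
    variance: List[float]  # diagonal covariance (an assumption of this sketch)


@dataclass
class HmmState:
    mixture: List[Gaussian]  # GMM modelling the state's output distribution


@dataclass
class AcousticModel:
    # HMM states per symbol (e.g. a phoneme), in state order
    states: Dict[str, List[HmmState]] = field(default_factory=dict)
    # transitions[symbol][i][j]: probability of moving from state i to state j
    transitions: Dict[str, List[List[float]]] = field(default_factory=dict)

    def mean_vector(self) -> List[float]:
        """Concatenate every Gaussian mean in a fixed (sorted-symbol) order."""
        flat: List[float] = []
        for symbol in sorted(self.states):
            for state in self.states[symbol]:
                for gaussian in state.mixture:
                    flat.extend(gaussian.mean)
        return flat


# Toy model: one phoneme, one state, two 3-dimensional Gaussians.
toy = AcousticModel(
    states={"a": [HmmState([
        Gaussian(0.6, [0.0, 1.0, 2.0], [1.0, 1.0, 1.0]),
        Gaussian(0.4, [3.0, 4.0, 5.0], [1.0, 1.0, 1.0]),
    ])]},
    transitions={"a": [[1.0]]},
)
print(toy.mean_vector())  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
```

In a real system each mean would have tens to hundreds of entries, so a flattened vector of this kind becomes very long; the toy dimensions are kept small only for readability.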

FIG. 1 is a block diagram of the acoustic model generation apparatus 1 in the present embodiment. The acoustic model generation apparatus 1 includes a target language new acoustic model storage unit 11, an acoustic model generation unit 12, and an acoustic model storage unit 13.

The acoustic model generation unit 12 includes a target language old acoustic model storage unit 121, an other language old acoustic model storage unit 122, an other language new acoustic model storage unit 123, a first correlation information storage unit 124, first correlation information generation means 125, and acoustic model generation means 126.

The target language new acoustic model storage unit 11 can store the target language new acoustic model.

The acoustic model generation unit 12 uses one or more other language old acoustic models of one or more languages and one or more other language new acoustic models of one or more languages to generate the target language new acoustic model from the target language old acoustic model or from the one or more other language new acoustic models.

More specifically, the acoustic model generation unit 12 generates the target language new acoustic model from the target language old acoustic model or from one or more other language new acoustic models, using one or more pieces of correlation information from among: one or more pieces of first correlation information, which is information on the relationship between one or more other language old acoustic models and one or more other language new acoustic models; and one or more pieces of second correlation information, which is information on the relationship between one or more other language old acoustic models and the target language old acoustic model. The first correlation information can also be described as information obtained from one or more conversion functions, each being the difference between a vector corresponding to one of the other language old acoustic models and a vector corresponding to one of the other language new acoustic models. Likewise, the second correlation information can be described as information obtained from one or more conversion functions, each being the difference between a vector corresponding to one of the other language old acoustic models and the vector corresponding to the target language old acoustic model.

The acoustic model generation unit 12 may also generate the target language new acoustic model by mapping the vector corresponding to the target language old acoustic model using the conversion function of the first correlation information.
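Read together with the description of the first correlation information as a difference of vectors, this mapping can be written as a simple translation. The following is one possible reading, not a formula stated at this point in the text: $\mu_s^{I}$ and $\mu_s^{R}$ denote the vectors of the old and new acoustic models of an other language $s$, while $\mu_t^{I}$, $\mu_t^{R}$ (target language old and new model vectors) and $f_s$ (the conversion function) are symbols assumed here for illustration.

```latex
V_s = \mu_s^{R} - \mu_s^{I}, \qquad
f_s(x) = x + V_s, \qquad
\mu_t^{R} = f_s\!\left(\mu_t^{I}\right) = \mu_t^{I} + V_s
```

Under this reading, the change that adaptation produced in the other language is transferred unchanged to the target language, which presupposes that the two model vectors have corresponding components.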

Furthermore, the present embodiment describes the case in which the acoustic model generation unit 12 generates the target language new acoustic model from the target language old acoustic model using one or more pieces of first correlation information.

The other language new acoustic model is, for example, an acoustic model obtained by applying adaptation processing to the other language old acoustic model.

The target language old acoustic model storage unit 121, which is part of the acoustic model generation unit 12, can store the target language old acoustic model.

The other language old acoustic model storage unit 122 can store one or more other language old acoustic models of one or more other languages.

The other language new acoustic model storage unit 123 can store one or more other language new acoustic models of one or more other languages.

The first correlation information storage unit 124 can store one or more pieces of first correlation information. The first correlation information is information on the relationship between an other language old acoustic model stored in the other language old acoustic model storage unit 122 and an other language new acoustic model stored in the other language new acoustic model storage unit 123. Here, the information on the relationship is normally information on the difference between the two acoustic models. That is, the first correlation information is normally a vector indicating the difference between the vector held by the other language old acoustic model and the vector held by the other language new acoustic model. A vector here is a set of parameters. The information on the relationship may be any information that indicates the relationship between the two acoustic models.

The first correlation information generation means 125 generates one or more pieces of first correlation information using one or more other language old acoustic models and one or more other language new acoustic models. Specifically, the first correlation information generation means 125, for example, calculates the difference between the vector held by the other language old acoustic model and the vector held by the other language new acoustic model, and obtains the vector corresponding to the first correlation information.
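The difference calculation can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionaries stand in for the storage units 122, 123, and 124, the language keys and numeric values are toy data, the function name `generate_first_correlation` is hypothetical, and the sign convention (new minus old) follows the example V_s = μ_s^R − μ_s^I given for step S203.

```python
from typing import Dict, List

# Hypothetical in-memory stand-ins for storage units 122, 123 and 124,
# keyed by language name. The numbers are toy values, not real model data.
other_old_store: Dict[str, List[float]] = {"en": [0.0, 1.0], "zh": [2.0, 2.0]}
other_new_store: Dict[str, List[float]] = {"en": [0.5, 1.5], "zh": [2.0, 3.0]}
first_correlation_store: Dict[str, List[float]] = {}


def generate_first_correlation(language: str) -> List[float]:
    """Return new-minus-old for one other language and record the result."""
    old_vec = other_old_store[language]
    new_vec = other_new_store[language]
    if len(old_vec) != len(new_vec):
        raise ValueError("old and new model vectors must have equal length")
    diff = [n - o for n, o in zip(new_vec, old_vec)]
    first_correlation_store[language] = diff
    return diff


for lang in other_old_store:
    generate_first_correlation(lang)

print(first_correlation_store)  # {'en': [0.5, 0.5], 'zh': [0.0, 1.0]}
```

One difference vector is kept per other language, which matches the statement that one or more pieces of first correlation information may be generated.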

The acoustic model generation means 126 generates the target language new acoustic model from the target language old acoustic model stored in the target language old acoustic model storage unit 121, using one or more pieces of first correlation information. For example, the acoustic model generation means 126 adds the vector that is the first correlation information to the vector corresponding to the target language old acoustic model, and obtains the resulting new vector as the target language new acoustic model.
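The addition described above can be sketched as follows. This is a sketch under stated assumptions rather than the means 126 itself: the function name `apply_first_correlation` is hypothetical, all vectors are assumed to have corresponding components of equal length, and combining several difference vectors by a plain average is an assumption made here, since this passage does not say how multiple pieces of first correlation information are combined.

```python
from typing import List, Sequence


def apply_first_correlation(
    target_old: Sequence[float],
    correlations: Sequence[Sequence[float]],
) -> List[float]:
    """Shift the target language old model vector by the mean difference vector."""
    if not correlations:
        raise ValueError("at least one first-correlation vector is required")
    dim = len(target_old)
    if any(len(v) != dim for v in correlations):
        raise ValueError("all vectors must share the target vector's dimension")
    # Assumption: several difference vectors are combined by a plain average.
    mean_shift = [
        sum(v[i] for v in correlations) / len(correlations) for i in range(dim)
    ]
    return [x + d for x, d in zip(target_old, mean_shift)]


target_old = [1.0, 2.0, 3.0]
v_first = [0.5, -0.5, 1.0]   # hypothetical difference vector for one other language
v_second = [1.5, 0.5, 0.0]   # hypothetical difference vector for another
target_new = apply_first_correlation(target_old, [v_first, v_second])
print(target_new)  # [2.0, 2.0, 3.5]
```

With a single difference vector the function reduces to the plain addition described in the text.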

The acoustic model storage unit 13 accumulates the target language new acoustic model generated by the acoustic model generation unit 12 in the target language new acoustic model storage unit 11.

The target language new acoustic model storage unit 11, the target language old acoustic model storage unit 121, the other language old acoustic model storage unit 122, the other language new acoustic model storage unit 123, and the first correlation information storage unit 124 are preferably non-volatile recording media, but can also be implemented with volatile recording media.

How the target language new acoustic model and the like come to be stored in the target language new acoustic model storage unit 11 and the like does not matter. For example, they may be stored there via a recording medium, they may be stored there after being transmitted over a communication line or the like, or they may be stored there after being input via an input device.

音響モデル生成部１２、第一相関情報生成手段１２５、音響モデル生成手段１２６、および音響モデル蓄積部１３は、通常、ＭＰＵやメモリ等から実現され得る。音響モデル生成部１２の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The acoustic model generation unit 12, the first correlation information generation unit 125, the acoustic model generation unit 126, and the acoustic model storage unit 13 can be usually realized by an MPU, a memory, or the like. The processing procedure of the acoustic model generation unit 12 is usually realized by software, and the software is recorded in a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).

次に、音響モデル生成装置１の動作について、図２のフローチャートを用いて説明する。 Next, the operation of the acoustic model generation apparatus 1 will be described with reference to the flowchart of FIG. 2.

（ステップＳ２０１）第一相関情報生成手段１２５は、他言語旧音響モデル格納部１２２から他言語旧音響モデルを取得する。 (Step S201) The first correlation information generation unit 125 acquires the other language old acoustic model from the other language old acoustic model storage unit 122.

（ステップＳ２０２）第一相関情報生成手段１２５は、他言語新音響モデル格納部１２３から他言語新音響モデルを取得する。 (Step S202) The first correlation information generation means 125 acquires another language new acoustic model from the other language new acoustic model storage unit 123.

（ステップＳ２０３）第一相関情報生成手段１２５は、ステップＳ２０１で取得した他言語旧音響モデルに対応するベクトル（μ_ｓ ^Ｉ）とステップＳ２０２で取得した他言語新音響モデルに対応するベクトル（μ_ｓ ^Ｒ）との差分を示す情報である第一相関情報（例えば、Ｖ_ｓ＝μ_ｓ ^Ｒ−μ_ｓ ^Ｉ）を算出する。 (Step S203) The first correlation information generation unit 125 calculates first correlation information (for example, V_s = μ_s^R − μ_s^I), which is information indicating the difference between the vector (μ_s^I) corresponding to the other language old acoustic model acquired in step S201 and the vector (μ_s^R) corresponding to the other language new acoustic model acquired in step S202.

（ステップＳ２０４）第一相関情報生成手段１２５は、ステップＳ２０３で算出した第一相関情報（Ｖ_ｓ）を、第一相関情報格納部１２４に蓄積する。 (Step S204) The first correlation information generation means 125 accumulates the first correlation information (V _s ) calculated in Step S203 in the first correlation information storage unit 124.

（ステップＳ２０５）音響モデル生成手段１２６は、音響モデルを生成するか否かを判断する。音響モデルを生成する場合はステップＳ２０６に行き、音響モデルを生成しない場合はステップＳ２０５に戻る。なお、例えば、ユーザ指示の受け付けにより音響モデルを生成しても良いし、第一相関情報の蓄積等をトリガーとして音響モデルを生成しても良い。 (Step S205) The acoustic model generation means 126 determines whether to generate an acoustic model. If an acoustic model is to be generated, the process goes to step S206. If an acoustic model is not to be generated, the process returns to step S205. Note that, for example, an acoustic model may be generated by receiving a user instruction, or an acoustic model may be generated with the accumulation of the first correlation information as a trigger.

（ステップＳ２０６）音響モデル生成手段１２６は、対象言語旧音響モデル格納部１２１から対象言語旧音響モデルを取得する。 (Step S206) The acoustic model generation means 126 acquires the target language old acoustic model from the target language old acoustic model storage unit 121.

（ステップＳ２０７）音響モデル生成手段１２６は、第一相関情報格納部１２４から第一相関情報（Ｖ_ｓ）を取得する。 (Step S207) The acoustic model generation means 126 acquires the first correlation information (V _s ) from the first correlation information storage unit 124.

（ステップＳ２０８）音響モデル生成手段１２６は、ステップＳ２０６で取得した対象言語旧音響モデルに対して、ステップＳ２０７で取得した第一相関情報を適用し、対象言語新音響モデルを生成する。音響モデル生成手段１２６は、例えば、対象言語旧音響モデルに対応するベクトル（μ_ｔ ^Ｉ）に、第一相関情報（Ｖ_ｓ）を加え、対象言語新音響モデル（μ_ｔ ^Ｒ＝μ_ｔ ^Ｉ＋Ｖ_ｓ）を取得する。 (Step S208) The acoustic model generation unit 126 applies the first correlation information acquired in step S207 to the target language old acoustic model acquired in step S206, and generates a target language new acoustic model. For example, the acoustic model generation unit 126 adds the first correlation information (V_s) to the vector (μ_t^I) corresponding to the target language old acoustic model, and acquires the target language new acoustic model (μ_t^R = μ_t^I + V_s).

（ステップＳ２０９）音響モデル蓄積部１３は、ステップＳ２０８で生成された新音響モデル（μ_ｔ ^Ｒ）を、対象言語新音響モデル格納部１１に蓄積し、処理を終了する。 (Step S209) The acoustic model storage unit 13 stores the new acoustic model (μ _t ^R ) generated in step S208 in the target language new acoustic model storage unit 11, and ends the process.

なお、図２のフローチャートにおいて、他言語が一つの場合について説明したが、他言語が２以上でも良い。かかる場合、第一相関情報生成手段１２５は、２以上の他言語の２以上の第一相関情報を生成する。また、音響モデル生成手段１２６は、２以上の第一相関情報を用いて、対象言語旧音響モデルから対象言語新音響モデルを生成する。 In the flowchart of FIG. 2, the case where there is one other language has been described. However, two or more other languages may be used. In this case, the first correlation information generation unit 125 generates two or more first correlation information in two or more other languages. The acoustic model generation unit 126 generates a new target language acoustic model from the target language old acoustic model using two or more pieces of first correlation information.

また、図２のフローチャートのステップＳ２０５は無くても良い。つまり、第一相関情報の蓄積の後、直ちに音響モデルの生成処理を行なっても良いことは言うまでもない。 Further, step S205 in the flowchart of FIG. 2 may be omitted. That is, it goes without saying that the acoustic model generation processing may be performed immediately after the first correlation information is accumulated.
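The core of the flowchart (steps S203 and S208) can be sketched as plain vector arithmetic. This is a minimal illustration, not the implementation of the apparatus: the function names and the numeric values are hypothetical, and each acoustic model is assumed to be representable as a flat list of parameters of equal length.

```python
from typing import List, Sequence

Vector = Sequence[float]


def first_correlation(mu_src_old: Vector, mu_src_new: Vector) -> List[float]:
    """Step S203: V_s = mu_s^R - mu_s^I (new minus old, other language)."""
    return [new - old for old, new in zip(mu_src_old, mu_src_new)]


def apply_correlation(mu_tgt_old: Vector, v_s: Vector) -> List[float]:
    """Step S208: mu_t^R = mu_t^I + V_s (shift the target-language model)."""
    return [m + v for m, v in zip(mu_tgt_old, v_s)]


# Hypothetical parameter vectors, chosen only for illustration.
mu_src_old = [0.25, -0.5, 1.0]   # other language, before adaptation
mu_src_new = [0.5, 0.0, 0.75]    # other language, after adaptation
mu_tgt_old = [1.0, 1.0, 1.0]     # target language, before adaptation

v_s = first_correlation(mu_src_old, mu_src_new)
mu_tgt_new = apply_correlation(mu_tgt_old, v_s)
print(v_s)         # [0.25, 0.5, -0.25]
print(mu_tgt_new)  # [1.25, 1.5, 0.75]
```

The sketch covers only the single-other-language case; how two or more pieces of first correlation information would be combined is not specified here and is left out.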

以下、本実施の形態における音響モデル生成装置１の具体的な動作について説明する。 Hereinafter, a specific operation of the acoustic model generation device 1 according to the present embodiment will be described.

まず、音響モデル生成装置１の動作の概念を説明する。図３は、音響モデル生成装置１の動作を説明する概念図である。音響モデル生成装置１は、適応前と適応後のモデルの相関関係を利用し、音響モデルを生成する。 First, the concept of the operation of the acoustic model generation device 1 will be described. FIG. 3 is a conceptual diagram for explaining the operation of the acoustic model generation device 1. The acoustic model generation device 1 generates an acoustic model using the correlation between models before adaptation and after adaptation.

音響モデル生成装置１の図示しない第二音響モデル生成手段は、他言語（ＳｏｕｒｃｅＬａｎｇｕａｇｅ）の音響モデル生成用音声（ｓｏｕｒｃｅ）３０１から、音声モデル生成処理３０２により、他言語旧音響モデル３０３を生成する。他言語旧音響モデル３０３は、図３の「ＬａｂＡＭ（ｓｏｕｒｃｅ）」である。なお、音声モデル生成処理３０２は公知技術であるので詳細な説明を省略する。次に、音響モデル生成装置１の図示しない適応処理手段は、蓄積音声（ｓｏｕｒｃｅ）３０４を用いた適応処理（音響モデル生成（適応））３０５により、他言語旧音響モデル３０３から他言語新音響モデル３０６を生成する。他言語新音響モデル３０６は、図３の「ＦｌｄＡＭ（ｓｏｕｒｃｅ）」である。そして、第一相関情報生成手段１２５は、他言語旧音響モデル３０３と他言語新音響モデル３０６との差分である第一相関情報ｆ（３０７）を算出する。 The second acoustic model generation unit (not shown) of the acoustic model generation device 1 generates the other language old acoustic model 303 from the acoustic model generation speech 301 of another language (Source Language) by the speech model generation processing 302. . The other language old acoustic model 303 is “Lab AM (source)” in FIG. 3. The voice model generation process 302 is a known technique, and thus detailed description thereof is omitted. Next, the adaptation processing means (not shown) of the acoustic model generation apparatus 1 performs an adaptation process (acoustic model generation (adaptation)) 305 using the accumulated speech 304 to generate another language new acoustic model from the other language old acoustic model 303. 306 is generated. The other language new acoustic model 306 is “Fld AM (source)” in FIG. 3. Then, the first correlation information generation unit 125 calculates first correlation information f (307) that is a difference between the other language old acoustic model 303 and the other language new acoustic model 306.

次に、音響モデル生成装置１の図示しない第二音響モデル生成手段は、対象言語（ＴａｒｇｅｔＬａｎｇｕａｇｅ）の音響モデル生成用音声（ｔａｒｇｅｔ）３０８から、音声モデル生成処理３０９により、対象言語旧音響モデル３１０を生成する。対象言語旧音響モデル３１０は、図３の「ＬａｂＡＭ（ｔａｒｇｅｔ）」である。ここで、対象言語の蓄積音声（ｔａｒｇｅｔ）３１１は存在しない時、対象言語旧音響モデル３１０に対して音響モデル生成（適応）３１２の処理は行えない。つまり、図３の破線は、存在しないデータまたは行えない処理を示す。そして、音響モデル生成手段１２６は、対象言語旧音響モデル３１０に対して、第一相関情報ｆ（３１３）を適用し、対象言語新音響モデル３１４を生成する。この生成した対象言語新音響モデルが作りたいモデルである。また、図３の対象言語新音響モデル３１４は、「ＦｌｄＡＭ（ｔａｒｇｅｔ）」である。 Next, a second acoustic model generation unit (not shown) of the acoustic model generation device 1 performs a target language old acoustic model 310 from a target language (Target Language) acoustic model generation speech 308 by a speech model generation process 309. Is generated. The target language old acoustic model 310 is “Lab AM (target)” in FIG. 3. Here, when the target language accumulated speech 311 does not exist, the acoustic model generation (adaptation) 312 cannot be performed on the target language old acoustic model 310. That is, the broken line in FIG. 3 indicates data that does not exist or processing that cannot be performed. Then, the acoustic model generation unit 126 applies the first correlation information f (313) to the target language old acoustic model 310 to generate the target language new acoustic model 314. This generated target language new acoustic model is a model to be created. Also, the target language new acoustic model 314 of FIG. 3 is “Fld AM (target)”.

以下、音響モデル生成部１２の処理について、２つの具体例および実験結果について説明する。 Hereinafter, two specific examples and experimental results will be described for the processing of the acoustic model generation unit 12.

（具体例１） (Specific example 1)
まず、図４を用いて、音響モデル生成部１２の処理を簡潔に説明する。具体例１において、音声のある一つの状態が２次元の正規分布でモデル化されるものとし、正規分布の平均のみを適応する場合について説明する。 First, the processing of the acoustic model generation unit 12 will be briefly described with reference to FIG. 4. Specific example 1 assumes that one state of speech is modeled by a two-dimensional normal distribution, and describes the case where only the mean of the normal distribution is adapted.

今、他言語（ＳｏｕｒｃｅＬａｎｇｕａｇｅ）のベースラインモデルＳ_Ｉを、平均「μ_ｓ ^Ｉ＝（１，１／２）」分散σ_ｓ ^Ｉの２次元正規分布とする。なお、ベースラインモデルＳ_Ｉは、他言語旧音響モデルである。また、蓄積音声で適応された他言語の適応モデルＳ_Ｒを、平均「μ_ｓ ^Ｒ＝（０，１）」、分散σ_ｓ ^Ｒ（＝σ_ｓ ^Ｉ）をもつ２次元正規分布とする。なお、適応モデルＳ_Ｒは、他言語新音響モデルである。 Now, let the baseline model S_I of the other language (Source Language) be a two-dimensional normal distribution with mean "μ_s^I = (1, 1/2)" and variance σ_s^I. The baseline model S_I is the other language old acoustic model. Further, let the adapted model S_R of the other language, adapted with the accumulated speech, be a two-dimensional normal distribution with mean "μ_s^R = (0, 1)" and variance σ_s^R (= σ_s^I). The adapted model S_R is the other language new acoustic model.

そして、このとき、第一相関情報生成手段１２５は、適応モデルＳ_ＲとベースラインモデルＳ_Ｉの平均ベクトルの差分Ｖ_ｓ（図４の４１）を以下の式により算出し、「Ｖ_ｓ＝μ_ｓ ^Ｒ−μ_ｓ ^Ｉ＝（１，１／２）」を得る。 At this time, the first correlation information generation unit 125 calculates the difference V_s (41 in FIG. 4) between the mean vectors of the adapted model S_R and the baseline model S_I by the following equation, obtaining "V_s = μ_s^R − μ_s^I = (1, 1/2)".

また、対象言語（ＴａｒｇｅｔＬａｎｇｕａｇｅ）のベースラインモデルＴ_Ｉを、平均「μ_ｔ ^Ｉ＝（０，０）」、分散σ_ｔ ^Ｉをもつ２次元正規分布とする。なお、ベースラインモデルＴ_Ｉは、対象言語旧音響モデルである。 In addition, the baseline model T _I of the target language (Target Language) is a two-dimensional normal distribution having an average “μ _t ^I = (0, 0)” and a variance σ _t ^I. In addition, the baseline model T _I is a target language old acoustic model.

そして、他言語の平均ベクトルの差分Ｖ_ｓをそのまま用いて適応する場合、音響モデル生成手段１２６は、平均「μ_ｔ ^Ｒ＝μ_ｔ ^Ｉ＋Ｖ_ｓ＝（１，１／２）」、分散σ_ｔ ^Ｒ（＝σ_ｔ ^Ｉ）をもつ２次元正規分布を取得し、これを対象言語の適応モデルＴ_Ｒとする。なお、適応モデルＴ_Ｒは、対象言語新音響モデルである。 Then, when adaptation is performed using the difference V_s of the mean vectors of the other language as it is, the acoustic model generation unit 126 acquires a two-dimensional normal distribution with mean "μ_t^R = μ_t^I + V_s = (1, 1/2)" and variance σ_t^R (= σ_t^I), and takes this as the adapted model T_R of the target language. The adapted model T_R is the target language new acoustic model.
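The last step of specific example 1 can be traced in a few lines. The sketch below takes V_s = (1, 1/2) and μ_t^I = (0, 0) as stated in the text; the `Gaussian2D` class, the diagonal-variance representation, and the variance values are hypothetical and serve only to show that the mean is shifted while the variance is carried over unchanged.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Gaussian2D:
    mean: Tuple[float, float]
    var: Tuple[float, float]  # diagonal variance (an assumption of this sketch)


def shift_mean(model: Gaussian2D, v: Tuple[float, float]) -> Gaussian2D:
    """Adapt only the mean: mu^R = mu^I + V. The variance is kept as is."""
    return Gaussian2D(
        mean=(model.mean[0] + v[0], model.mean[1] + v[1]),
        var=model.var,
    )


v_s = (1.0, 0.5)                                   # V_s = (1, 1/2) from the text
t_i = Gaussian2D(mean=(0.0, 0.0), var=(1.0, 1.0))  # T_I; variance is hypothetical
t_r = shift_mean(t_i, v_s)                         # T_R

print(t_r.mean)  # (1.0, 0.5)
print(t_r.var)   # (1.0, 1.0)
```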

なお、具体例１において、音声の一の状態を２次元の正規分布でモデル化されている、としたが、２次元の正規分布に限られず、数十次元の混合正規分布等でモデル化されていることはさらに好適である。 In the specific example 1, it is assumed that one state of speech is modeled by a two-dimensional normal distribution. However, the state is not limited to a two-dimensional normal distribution, and is modeled by a mixed normal distribution of several tens of dimensions. It is further preferable.

また、混合正規分布でモデル化されているとも限らず、例えばニューラルネットワークを用いた音響モデルの場合においても、２つの音響モデルの差分である第一相関情報を用いて適応することができる。 In addition, the model is not necessarily modeled with a mixed normal distribution. For example, in the case of an acoustic model using a neural network, adaptation can be performed using the first correlation information that is a difference between two acoustic models.

（具体例２）
具体例２において、音響モデル生成装置１の図示しない適応処理手段は、ＭＡＰ適応法に基づき、他言語（ＳｏｕｒｃｅＬａｎｇｕａｇｅ）の他言語旧音響モデル（「他言語の初期の音響モデル」とも言える。）と、他言語の蓄積音声を用いて、他言語新音響モデル（「他言語の適応された音響モデル」とも言える。）を生成する。ここで、他言語旧音響モデルのｓ番目のガウス分布の平均ベクトルをμ_ｓ ^Ｉ、他言語新音響モデルのｓ番目のガウス分布平均ベクトルをμ_ｓ ^Ｒとする。 (Specific example 2) In specific example 2, the adaptation processing unit (not shown) of the acoustic model generation apparatus 1 generates the other language new acoustic model (also called the "adapted acoustic model of the other language") based on the MAP adaptation method, using the other language old acoustic model (also called the "initial acoustic model of the other language") of the other language (Source Language) and the accumulated speech of the other language. Here, the mean vector of the s-th Gaussian distribution of the other language old acoustic model is denoted μ_s^I, and the mean vector of the s-th Gaussian distribution of the other language new acoustic model is denoted μ_s^R.

ＭＡＰ適応法では、他言語新音響モデルの平均ベクトルを適応するとき、平均ベクトル（μ_ｓ ^Ｒ）は、他言語旧音響モデルの各平均ベクトル（μ_ｓ ^Ｉ）を、事前分布の平均ベクトルとし、以下の数式１により算出される。 In the MAP adaptation method, when the mean vector of the other language new acoustic model is adapted, the mean vector (μ_s^R) is calculated by Equation 1 below, taking each mean vector (μ_s^I) of the other language old acoustic model as the mean vector of the prior distribution.

数式１において、ｍ_ｓは蓄積音声から得られるｓ番目のガウス分布の最尤推定値である。ｎは、対応するガウス分布に関する蓄積音声から得られる学習サンプルの総数である。また、τは、事前分布と蓄積音声から得られるサンプルとの相対的なバランスを調整するパラメータである。 In Equation 1, m _s is the maximum likelihood estimate of the sth Gaussian distribution obtained from the stored speech. n is the total number of learning samples obtained from the accumulated speech for the corresponding Gaussian distribution. Also, τ is a parameter for adjusting the relative balance between the prior distribution and the sample obtained from the accumulated speech.
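The image for Equation 1 is not reproduced in this text. As a sketch, assuming the equation takes the standard MAP mean-update form (an assumption, although it is consistent with the definitions of m_s, n, and τ given above), it can be written as:

```latex
\mu_s^{R} = \frac{\tau\,\mu_s^{I} + n\,m_s}{\tau + n}
```

Under this form, a large n pulls μ_s^R toward the maximum likelihood estimate m_s, while a large τ keeps it close to the prior mean μ_s^I.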

そして、具体例２において、以下のように差分ベクトルを求める。具体例２における処理の概念を図５に示す。 And in the specific example 2, a difference vector is calculated | required as follows. The concept of the process in the specific example 2 is shown in FIG.

第一相関情報生成手段１２５は、他言語旧音響モデル（μ_ｓ ^Ｉ）と他言語新音響モデルを（μ_ｓ ^Ｒ）との差である第一相関情報（Ｖ_ｓ）を、数式２に示すように算出する。この第一相関情報（Ｖ_ｓ）は、他言語の平均ベクトルの遷移ベクトルである。 The first correlation information generation unit 125 calculates the first correlation information (V_s), which is the difference between the other language old acoustic model (μ_s^I) and the other language new acoustic model (μ_s^R), as shown in Equation 2 (V_s = μ_s^R − μ_s^I). This first correlation information (V_s) is the transition vector of the mean vector of the other language.

ここで、ｓ∈Ｋ₁（Ｋ₁は、他言語のトレーニングデータのガウス分布セットである。） Here, s ∈ K₁ (K₁ is the set of Gaussian distributions of the training data of the other language).

第一相関情報生成手段１２５は、数式１の平均ベクトル（μ_ｓ ^Ｒ）を数式２に代入することにより、遷移ベクトルである第一相関情報を算出する（数式３参照）。 The first correlation information generation unit 125 calculates the first correlation information, which is a transition vector, by substituting the mean vector (μ_s^R) of Equation 1 into Equation 2 (see Equation 3).

数式３において、ＭＡＰ適用法により得られる遷移ベクトル（Ｖ_ｓ）は、「Ｖ_ｓ ^ＭＬ＝（ｍ_ｓ−μ_ｓ ^Ｉ）」と表され、最尤（ＭＬ）推定により算出される。 In Equation 3, the transition vector (V_s) obtained by the MAP adaptation method is expressed in terms of "V_s^ML = (m_s − μ_s^I)", which is calculated by maximum likelihood (ML) estimation.

また、以下の数式４において、ＭＡＰ適用法による遷移ベクトルは、重み係数によるＭＬ推定を用いて修正された遷移ベクトル（Ｖ_ｓ ^ML）によって得られることを示している。なお、重み係数は、学習サンプルの総数ｎに依存する。 Further, Equation 4 below shows that the transition vector of the MAP adaptation method is obtained by scaling the ML-estimated transition vector (V_s^ML) by a weighting factor. The weighting factor depends on the total number n of learning samples.
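The relationship between the MAP-adapted mean and the transition vector can be checked numerically. The sketch below assumes the standard MAP mean update μ_s^R = (τ μ_s^I + n m_s) / (τ + n); substituting it into V_s = μ_s^R − μ_s^I gives V_s = n / (n + τ) · (m_s − μ_s^I), that is, the ML transition vector scaled by a weight that depends on n. The function names and numeric values are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def map_mean(mu_prior: Vector, m_ml: Vector, n: float, tau: float) -> List[float]:
    """Assumed MAP mean update: (tau * mu^I + n * m) / (tau + n)."""
    return [(tau * p + n * m) / (tau + n) for p, m in zip(mu_prior, m_ml)]


def ml_transition(mu_prior: Vector, m_ml: Vector) -> List[float]:
    """V^ML = m - mu^I."""
    return [m - p for p, m in zip(mu_prior, m_ml)]


def map_transition(mu_prior: Vector, m_ml: Vector, n: float, tau: float) -> List[float]:
    """V = n / (n + tau) * V^ML, the ML transition vector scaled by a weight."""
    weight = n / (n + tau)
    return [weight * v for v in ml_transition(mu_prior, m_ml)]


# Hypothetical values: prior mean, ML estimate, sample count, and tau.
mu_i = [0.0, 2.0]
m_s = [4.0, 0.0]
n, tau = 30.0, 10.0

mu_r = map_mean(mu_i, m_s, n, tau)
v_direct = [r - i for i, r in zip(mu_i, mu_r)]   # mu^R - mu^I
v_scaled = map_transition(mu_i, m_s, n, tau)     # weight * V^ML

print(mu_r)      # [3.0, 0.5]
print(v_direct)  # [3.0, -1.5]
print(v_scaled)  # [3.0, -1.5]
```

With few adaptation samples (small n) the weight approaches zero and the model barely moves; with many samples it approaches one and the transition vector approaches the ML estimate.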

次に、他言語と同様に、対象言語旧音響モデル（「対象言語の初期のモデル」とも言える。）のガウス分布の平均ベクトルは、音響モデル学習により生成される。 Next, as in other languages, an average vector of the Gaussian distribution of the target language old acoustic model (also referred to as “an initial model of the target language”) is generated by acoustic model learning.

ここで、対象言語旧音響モデルのガウス分布のｔ番目の平均ベクトルをμ_ｔ ^Ｉとする。なお、ここで、対象言語の適応処理のための蓄積音声のデータが存在しないので、対象言語の各ガウス分布の遷移ベクトルは、他言語の遷移ベクトルによって推定される。 Here, the t-th average vector of the Gaussian distribution of the target language old acoustic model is denoted by μ _t ^I. Here, since there is no accumulated voice data for the target language adaptation processing, the transition vector of each Gaussian distribution of the target language is estimated by the transition vector of the other language.

対象言語の遷移ベクトル（μ_ｔ ^Ｉ）におけるｔは、「ｔ∈Ｋ_２」である。ここで、Ｋ_２は、対象言語のガウス分布セットである。μ_ｔ ^Ｉの中の遷移ベクトル（Ｖ_ｔ）は、学習された遷移ベクトルＶ_ｓの以下の数式５により補間される。 The index t of the target language vector (μ_t^I) satisfies "t ∈ K₂". Here, K₂ is the set of Gaussian distributions of the target language. The transition vector (V_t) for μ_t^I is interpolated from the learned transition vectors V_s by Equation 5 below.

数式５において、Ｎ（ｔ）は、ベクトル（μ_ｔ ^Ｉ）のＫの近傍にあるガウス分布のセットである。λ_ｔ，ｓｋは、重み係数であり、μ_ｔ ^Ｉとμ_ｓｋ ^Ｉとの距離に依存する。ベクトル（μ_ｔ ^Ｉ）に遷移ベクトルＶ_ｔが加算され、ベクトル（μ_ｔ ^Ｒ）が取得される（数式６参照）。数式５において、ｓ_ｋは、ｋ番目のｓ［ｓ∈Ｋ₁（Ｋ₁は、他言語のトレーニングデータのガウス分布セットである。）］である。 In Equation 5, N(t) is the set of the K Gaussian distributions nearest to the vector (μ_t^I). λ_{t,s_k} is a weighting factor that depends on the distance between μ_t^I and μ_{s_k}^I. The transition vector V_t is added to the vector (μ_t^I) to obtain the vector (μ_t^R) (see Equation 6). In Equation 5, s_k is the k-th s [s ∈ K₁ (K₁ is the set of Gaussian distributions of the training data of the other language)].

なお、例えば、Ｋの近接するガウス分布のセットは、従来技術であるKullback-Leibler divergence (KL-divergence)（「S. Kullback, and R. A. Leibler, "On information and sufficiency," Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79-86, 1951. 」参照）により取得される。 For example, the set of the K nearest Gaussian distributions is obtained using the Kullback-Leibler divergence (KL divergence), a conventional technique (see S. Kullback and R. A. Leibler, "On information and sufficiency," Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79-86, 1951).
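The interpolation step can be sketched as follows. Several points are assumptions of this sketch rather than statements of the text: Equation 5 is taken to be a weighted sum V_t = Σ λ_{t,s_k} V_{s_k} over the K nearest other-language Gaussians, the Gaussians are taken to have diagonal covariance, and the weight is a hypothetical normalized exp(−d / f), since the exact form of the weighting equation is not reproduced here. Only the closed-form KL divergence between two diagonal Gaussians is standard.

```python
import math
from typing import List, Sequence, Tuple

# A Gaussian is (mean, diagonal variance); diagonal covariance is an assumption.
Gaussian = Tuple[Sequence[float], Sequence[float]]


def kl_divergence(p: Gaussian, q: Gaussian) -> float:
    """KL(p || q) for diagonal-covariance Gaussians (closed form)."""
    (mu_p, var_p), (mu_q, var_q) = p, q
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )


def interpolate_transition(
    target: Gaussian,
    sources: Sequence[Gaussian],
    transitions: Sequence[Sequence[float]],
    k: int = 10,
    f: float = 3.0,
) -> List[float]:
    """Estimate V_t from the transition vectors of the k nearest source Gaussians."""
    # Rank other-language (old-model) Gaussians by KL divergence from the target.
    nearest = sorted(
        (kl_divergence(target, src), idx) for idx, src in enumerate(sources)
    )[:k]
    # Hypothetical weight: exp(-d / f), normalized over the neighbors.
    raw = [math.exp(-d / f) for d, _ in nearest]
    total = sum(raw)
    v_t = [0.0] * len(target[0])
    for w, (_, idx) in zip(raw, nearest):
        for j, v in enumerate(transitions[idx]):
            v_t[j] += (w / total) * v
    return v_t


def adapt_target_mean(target: Gaussian, v_t: Sequence[float]) -> List[float]:
    """mu_t^R = mu_t^I + V_t."""
    return [m + v for m, v in zip(target[0], v_t)]


# Hypothetical toy data: one target Gaussian and three other-language Gaussians.
target = ([0.0, 0.0], [1.0, 1.0])
sources = [([0.0, 0.0], [1.0, 1.0]), ([2.0, 0.0], [1.0, 1.0]), ([10.0, 10.0], [1.0, 1.0])]
transitions = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

v_t = interpolate_transition(target, sources, transitions, k=2, f=3.0)
mu_t_new = adapt_target_mean(target, v_t)
print(v_t, mu_t_new)
```

In the toy data the distant third Gaussian is excluded by k=2, and the closer of the two remaining neighbors receives the larger weight.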

また、重み係数（λ_ａ，ｂ）は、例えば、以下の数式７により算出される。 Further, the weighting factor (λ_{a,b}) is calculated by, for example, Equation 7 below.

数式７において、ｄ_ａ，ｂは、KL-divergenceに基づいて算出される、ベクトル（μ_ａ ^Ｉ）とベクトル（μ_ｂ ^Ｒ）との距離であり、ｆは重み調整のためのパラメータである。 In Equation 7, d_{a,b} is the distance between the vector (μ_a^I) and the vector (μ_b^R), calculated based on KL divergence, and f is a parameter for weight adjustment.

（実験） (Experiment)

以下、実験結果について説明する。本実験では、上記の具体例２の方法で動作する音響モデル生成装置１を用いた。また、本実験において、他言語は日本語であり、対象言語はインドネシア語である。つまり、日本語の適応処理前の音響モデル、日本語の適応処理後の音響モデル、およびインドネシア語の適応処理前の音響モデルが、予め存在する。 Hereinafter, experimental results will be described. In this experiment, the acoustic model generation apparatus 1 that operates according to the method of the specific example 2 is used. In this experiment, the other language is Japanese and the target language is Indonesian. In other words, an acoustic model before the Japanese adaptation process, an acoustic model after the Japanese adaptation process, and an acoustic model before the Indonesian adaptation process exist in advance.

各言語の評価のためのテストデータの量について、図６に示す。本実験において、２つの発話データを用いた。一つは、旅行会話基本表現コーパス（ＢＴＥＣ）であり、他は現実の環境で記録された音声データ（ＶＴｌｏｇ）である。ＢＴＥＣは、クリーンな環境で取得された旅行会話基本表現の音声データである。なお、クリーンな環境で取得された音声データとは、例えば、録音室で収録した音声データ、原稿を読上げた際に取得された音声データ等である。また、ＶＴｌｏｇは、ＶｏｉｃｅＴｒａ（ＵＲＬ「http://mastar.jp/translation/index.html」参照）により記録された音声データであり、ノイズを含んだ音声データや、種々の発話スタイルの音声データを含む。また、図６において、「時間」は記録時間（単位：時間）、「発話」は発話数を示す。 The amount of test data for evaluation in each language is shown in FIG. In this experiment, two utterance data were used. One is a travel conversation basic expression corpus (BTEC), and the other is voice data (VTlog) recorded in an actual environment. BTEC is voice data of travel conversation basic expressions acquired in a clean environment. The sound data acquired in a clean environment is, for example, sound data recorded in a recording room, sound data acquired when a document is read out, and the like. VTlog is voice data recorded by VoiceTra (refer to URL “http://mastar.jp/translation/index.html”). It contains voice data including noise and voice data of various utterance styles. Including. In FIG. 6, “time” indicates a recording time (unit: time), and “utterance” indicates the number of utterances.

また、図７は、各言語の学習、および適応処理に使用されたデータの総量を示す表である。学習データは、実験室で発話した音声データ（図７の「学習」の列のデータ）、および実環境で発話した音声データであり、ＶｏｉｃｅＴｒａにより記録された音声データ（図７の「適応処理（ＶＴｌｏｇ）」の列のデータ）を含む。音響モデルは、各言語の学習データにより学習された３状態のＬｅｆｔ−ｔｏ−Ｒｉｇｈｔ、性別非依存ＨＭＭである。また、状態数はインドネシア語が５０００状態、日本語が５００状態であり、状態共有手法として、決定木ベースのクラスタリング手法を使用した。また、インドネシア語に対して、状態ごとに４つのガウス分布を使用し、日本語に対して、状態ごとに１６のガウス分布を使用した。 FIG. 7 is a table showing the total amount of data used for learning and adaptive processing in each language. The learning data is voice data uttered in the laboratory (data in the column “Learning” in FIG. 7) and voice data uttered in the real environment, and is recorded by Voice Tra (“adaptive processing ( VTlog) "column data). The acoustic model is a three-state Left-to-Right, gender-independent HMM learned from learning data in each language. The number of states is 5000 in Indonesian and 500 in Japanese, and a decision tree-based clustering method is used as a state sharing method. For Indonesian, 4 Gaussian distributions were used for each state, and for Japanese, 16 Gaussian distributions were used for each state.

また、各言語の言語モデル（ＬＭｓ）は、ＢＴＥＣコーパスを用いて学習した。 Moreover, the language model (LMs) of each language was learned using the BTEC corpus.

図８は、実験結果を示す。ＢＴＥＣの単語誤り率（ＷＥＲ）は、日本語では１７．７４％であり、インドネシア語では１５．９７％であった。 FIG. 8 shows the experimental results. BTEC had a word error rate (WER) of 17.74% in Japanese and 15.97% in Indonesian.

一方、学習モデルとテスト音声が大きく異なるため、ＶＴｌｏｇのＷＥＲは、日本語では３７．７５％、インドネシア語では５５．３１％であった。この実験結果により、学習モデルとテスト音声の不整合によって精度の低下が引き起こされることが分かる。 On the other hand, because the learning model and test speech differ greatly, the VTlog WER was 37.75% in Japanese and 55.31% in Indonesian. From this experimental result, it can be seen that a mismatch in the learning model and the test speech causes a decrease in accuracy.

次に、上記ミスマッチを低減するために、ＶｏｉｃｅＴｒａによって記録された音声データである、日本語の実発話環境での蓄積音声を用いて、日本語の音響モデルに対してＭＡＰ適応を行った。図９は、適応実験の結果を示す。ＶＴｌｏｇのＷＥＲは２４．６６％となり、ベースライン（３７．７５％）と比較して大幅に改善された。このことは、実発話と整合する音声データを用いて音響モデルを適応させることの効果を示す。 Next, in order to reduce the mismatch, MAP adaptation was performed on the Japanese acoustic model using the accumulated speech in the Japanese actual speech environment, which is speech data recorded by VoiceTra. FIG. 9 shows the results of the adaptation experiment. The VTlog WER was 24.66%, a significant improvement over the baseline (37.75%). This shows the effect of adapting the acoustic model using speech data that matches the actual utterance.

次に、上記の具体例２の方法について評価した。評価において、パラメータを実験的に「τ＝１０」「ｆ＝３」「ｋ＜＝１０」と決定した。図１０は、実験結果を示す。ＶＴｌｏｇのＷＥＲは５５．３１％から５０．４０％に改善し、誤り削減率（ERR）８．９％を達成した（図１０の「Ｐｒｏｐｏｓｅｄ」の行を参照のこと）。この結果は、以下の我々の仮説を検証したことになる。我々の仮説は、他言語（ここでは日本語）の遷移ベクトルによって推定された遷移ベクトルを対象言語（ここではインドネシア語）の音響モデルに適用し、認識精度を改善することである。これにより、音響モデル生成装置１の方法は、実発話に関する対象言語（ここではインドネシア語）の蓄積音声用いず、対象言語の音響モデルを実発話環境へ適応する。 Next, the method of the specific example 2 was evaluated. In the evaluation, the parameters were experimentally determined as “τ = 10”, “f = 3”, and “k <= 10”. FIG. 10 shows the experimental results. The VTlog WER improved from 55.31% to 50.40% and achieved an error reduction rate (ERR) of 8.9% (see “Proposed” row in FIG. 10). This result verified our hypothesis below. Our hypothesis is to improve the recognition accuracy by applying the transition vector estimated by the transition vector of another language (here, Japanese) to the acoustic model of the target language (here, Indonesian). Thereby, the method of the acoustic model generation device 1 adapts the acoustic model of the target language to the actual speech environment without using the accumulated speech of the target language (here Indonesian language) related to the actual speech.
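The reported error reduction rate follows directly from the two word error rates. The short check below is only a worked calculation; the function name is hypothetical, and the second figure (for the Japanese MAP adaptation) is computed here from the WERs quoted earlier rather than taken from the text.

```python
def relative_error_reduction(wer_before: float, wer_after: float) -> float:
    """Relative WER reduction in percent: (before - after) / before * 100."""
    return (wer_before - wer_after) / wer_before * 100.0


# Indonesian VTlog, proposed method: 55.31% -> 50.40%
err_proposed = round(relative_error_reduction(55.31, 50.40), 1)
# Japanese VTlog, MAP adaptation with accumulated speech: 37.75% -> 24.66%
err_map_ja = round(relative_error_reduction(37.75, 24.66), 1)

print(err_proposed)  # 8.9
print(err_map_ja)    # 34.7
```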

なお、具体例２において、音声の一の状態を混合正規分布とし、その平均をＭＡＰ適応法に基づき適応することで、他言語新音響モデルを生成するとしたが、平均以外の音響モデルのパラメータ、例えば正規分布の分散、ＨＭＭの状態遷移確率などの他のパラメータも同様に適応可能である。また、音響モデルは混合正規分布でモデル化されているとも限らず、例えばニューラルネットワークを用いた音響モデルの場合においても、２つの音響モデルの差分である第一相関情報を用いて適応することができる。 In specific example 2, one state of speech is modeled as a mixed normal distribution, and the other language new acoustic model is generated by adapting its mean based on the MAP adaptation method. However, acoustic model parameters other than the mean, such as the variance of the normal distribution and the state transition probabilities of the HMM, can be adapted in the same way. Also, the acoustic model is not necessarily modeled with a mixed normal distribution. For example, even in the case of an acoustic model using a neural network, adaptation can be performed using the first correlation information, which is the difference between two acoustic models.

以上、本実施の形態によれば、発話環境等の適した環境における音声データが存在しない言語でも、当該言語に対応する音響モデルであり、音声認識精度を上げる音響モデルを生成できる。 As described above, according to the present embodiment, even in a language that does not have voice data in a suitable environment such as a speech environment, an acoustic model corresponding to the language can be generated, and an acoustic model that improves voice recognition accuracy can be generated.

また、本実施の形態によれば、他言語の適応処理前の音響モデルと他言語の適応処理後の音響モデルとの相関関係を利用することにより、発話環境等の適した環境における音声データが存在しない言語でも、当該言語に対応する音響モデルであり、音声認識精度を上げる音響モデルを生成できる。 In addition, according to the present embodiment, by using the correlation between the acoustic model before the adaptation process of the other language and the acoustic model after the adaptation process of the other language, the voice data in a suitable environment such as the speech environment is obtained. Even a language that does not exist is an acoustic model corresponding to the language, and an acoustic model that improves speech recognition accuracy can be generated.

なお、本実施の形態における処理は、ソフトウェアで実現しても良い。そして、このソフトウェアをソフトウェアダウンロード等により配布しても良い。また、このソフトウェアをＣＤ−ＲＯＭなどの記録媒体に記録して流布しても良い。なお、このことは、本明細書における他の実施の形態においても該当する。なお、本実施の形態における音響モデル生成装置を実現するソフトウェアは、以下のようなプログラムである。つまり、このプログラムは、コンピュータを、対象言語とは異なる１以上の各他言語の第一の音響モデルである１以上の各他言語旧音響モデルと前記１以上の各他言語の第二の音響モデルである１以上の各他言語新音響モデルとの関係に関する情報である第一相関情報、または前記１以上の各他言語旧音響モデルと前記対象言語の第一の音響モデルである対象言語旧音響モデルとの関係に関する情報である第二相関情報のうちの、いずれか１以上の相関情報を用いて、対象言語旧音響モデルまたは１以上の他言語新音響モデルから、対象言語新音響モデルを生成する音響モデル生成部と、前記音響モデル生成部が生成した対象言語新音響モデルを記録媒体に蓄積する音響モデル蓄積部として機能させるためのプログラムである。 Note that the processing in the present embodiment may be realized by software. This software may be distributed by software download or the like, or may be recorded on a recording medium such as a CD-ROM and distributed. This also applies to the other embodiments in this specification. The software that implements the acoustic model generation apparatus according to the present embodiment is the following program. That is, this program causes a computer to function as: an acoustic model generation unit that generates a target language new acoustic model from a target language old acoustic model or one or more other language new acoustic models, using one or more pieces of correlation information out of first correlation information, which is information on the relationship between one or more other language old acoustic models (the first acoustic models of one or more other languages different from the target language) and one or more other language new acoustic models (the second acoustic models of those other languages), and second correlation information, which is information on the relationship between the one or more other language old acoustic models and the target language old acoustic model (the first acoustic model of the target language); and an acoustic model storage unit that stores the target language new acoustic model generated by the acoustic model generation unit in a recording medium.

また、上記プログラムにおいて、前記音響モデル生成部は、対象言語旧音響モデルを格納し得る対象言語旧音響モデル格納部と、第一相関情報を格納し得る第一相関情報格納部と、前記第一相関情報を用いて、前記対象言語旧音響モデル格納部に格納されている対象言語旧音響モデルから対象言語新音響モデルを生成する音響モデル生成手段とを具備するものとして、コンピュータを機能させることは好適である。 In the program, the acoustic model generation unit includes a target language old acoustic model storage unit that can store a target language old acoustic model, a first correlation information storage unit that can store first correlation information, and the first Using the correlation information, the acoustic model generation means for generating the target language new acoustic model from the target language old acoustic model stored in the target language old acoustic model storage unit, and causing the computer to function Is preferred.

（実施の形態２） (Embodiment 2)
本実施の形態において、他言語旧音響モデルと対象言語旧音響モデルとを用いて、他言語新音響モデルから、対象言語新音響モデルを生成する音響モデル生成装置について説明する。 In this embodiment, an acoustic model generation apparatus that generates a target language new acoustic model from another language new acoustic model, using the other language old acoustic model and the target language old acoustic model, will be described.

さらに具体的には、本実施の形態において、他言語旧音響モデルと対象言語旧音響モデルとの相関関係を示す第二相関情報を用いて、他言語新音響モデルから、対象言語新音響モデルを生成する音響モデル生成装置について説明する。 More specifically, in the present embodiment, the target language new acoustic model is obtained from the other language new acoustic model using the second correlation information indicating the correlation between the other language old acoustic model and the target language old acoustic model. An acoustic model generation device to be generated will be described.

図１１は、本実施の形態における音響モデル生成装置２のブロック図である。音響モデル生成装置２は、対象言語新音響モデル格納部１１、音響モデル生成部２２、および音響モデル蓄積部１３を備える。 FIG. 11 is a block diagram of the acoustic model generation apparatus 2 in the present embodiment. The acoustic model generation device 2 includes a target language new acoustic model storage unit 11, an acoustic model generation unit 22, and an acoustic model storage unit 13.

音響モデル生成部２２は、対象言語旧音響モデル格納部１２１、他言語旧音響モデル格納部１２２、他言語新音響モデル格納部１２３、第二相関情報格納部２２４、第二相関情報生成手段２２５、および音響モデル生成手段２２６を備える。 The acoustic model generation unit 22 includes a target language old acoustic model storage unit 121, another language old acoustic model storage unit 122, another language new acoustic model storage unit 123, a second correlation information storage unit 224, a second correlation information generation unit 225, And an acoustic model generation means 226.

音響モデル生成部２２は、１または２以上の他言語旧音響モデルと１または２以上の他言語新音響モデルとの関係に関する情報である１または２以上の第一相関情報、または１または２以上の他言語旧音響モデルと対象言語旧音響モデルとの関係に関する情報である１または２以上の第二相関情報のうちの、いずれか１以上の相関情報を用いて、対象言語旧音響モデルまたは他言語新音響モデルから、対象言語新音響モデルを生成する。音響モデル生成部２２は、他言語新音響モデル格納部１２３に格納されている１以上の他言語新音響モデルから第二相関情報の変換関数を用いて写像することにより対象言語新音響モデルを生成しても良い。 The acoustic model generation unit 22 generates a target language new acoustic model from the target language old acoustic model or the other language new acoustic model, using one or more pieces of correlation information out of one or more pieces of first correlation information, which is information on the relationship between one or more other language old acoustic models and one or more other language new acoustic models, and one or more pieces of second correlation information, which is information on the relationship between one or more other language old acoustic models and the target language old acoustic model. The acoustic model generation unit 22 may generate the target language new acoustic model by mapping from one or more other language new acoustic models stored in the other language new acoustic model storage unit 123, using the conversion function of the second correlation information.

さらに、本実施の形態において、音響モデル生成部２２は、１または２以上の他言語新音響モデルから、１または２以上の第二相関情報を用いて、対象言語新音響モデルを生成する場合について説明する。 Furthermore, in the present embodiment, the acoustic model generation unit 22 generates a target language new acoustic model from one or more other language new acoustic models using one or more second correlation information. explain.

第二相関情報格納部２２４は、１または２以上の第二相関情報を格納し得る。第二相関情報は、他言語旧音響モデルと対象言語旧音響モデルとの関係に関する情報である。ここで、関係に関する情報とは、通常、２つの音響モデルの差分についての情報である。つまり、第二相関情報は、通常、他言語旧音響モデルに対応するベクトルと対象言語旧音響モデルに対応するベクトルとの差を示すベクトルである。なお、ベクトルは、パラメータ集合である。第二相関情報の構造は、第一相関情報の構造と同じで良い。 The second correlation information storage unit 224 can store one or more second correlation information. The second correlation information is information related to the relationship between the other language old acoustic model and the target language old acoustic model. Here, the information on the relationship is usually information on the difference between the two acoustic models. That is, the second correlation information is usually a vector indicating a difference between a vector corresponding to the other language old acoustic model and a vector corresponding to the target language old acoustic model. A vector is a parameter set. The structure of the second correlation information may be the same as the structure of the first correlation information.

第二相関情報生成手段２２５は、１または２以上の各他言語旧音響モデルと対象言語旧音響モデルとを用いて、１または２以上の第二相関情報を生成する。具体的には、第二相関情報生成手段２２５は、例えば、１または２以上の各他言語旧音響モデルに対応するベクトルと対象言語旧音響モデルに対応するベクトルとの差を算出し、１または２以上の各第二相関情報に対応するベクトルを取得する。 The second correlation information generation unit 225 generates one or more second correlation information using one or more other language old acoustic models and the target language old acoustic model. Specifically, the second correlation information generation unit 225 calculates, for example, a difference between a vector corresponding to one or two or more other language old acoustic models and a vector corresponding to the target language old acoustic model. Vectors corresponding to two or more pieces of second correlation information are acquired.

The acoustic model generation means 226 generates the target language new acoustic model from the other language new acoustic model stored in the other language new acoustic model storage unit 123, using the one or more pieces of second correlation information. For example, the acoustic model generation means 226 adds the vector that is the second correlation information to the vector corresponding to the other language new acoustic model, and obtains the resulting new vector as the target language new acoustic model.

音響モデル生成部２２、第二相関情報生成手段２２５、および音響モデル生成手段２２６は、通常、ＭＰＵやメモリ等から実現され得る。音響モデル生成部２２等の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The acoustic model generation unit 22, the second correlation information generation unit 225, and the acoustic model generation unit 226 can be usually realized by an MPU, a memory, or the like. The processing procedure of the acoustic model generation unit 22 and the like is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).

第二相関情報格納部２２４は、不揮発性の記録媒体が好適であるが、揮発性の記録媒体でも実現可能である。第二相関情報格納部２２４に第二相関情報が記憶される過程は問わない。例えば、記録媒体を介して第二相関情報が第二相関情報格納部２２４で記憶されるようになってもよく、通信回線等を介して送信された第二相関情報が第二相関情報格納部２２４で記憶されるようになってもよく、あるいは、入力デバイスを介して入力された第二相関情報が第二相関情報格納部２２４で記憶されるようになってもよい。 The second correlation information storage unit 224 is preferably a non-volatile recording medium, but can also be realized by a volatile recording medium. The process in which the second correlation information is stored in the second correlation information storage unit 224 does not matter. For example, the second correlation information may be stored in the second correlation information storage unit 224 via the recording medium, and the second correlation information transmitted via the communication line or the like is stored in the second correlation information storage unit. 224 may be stored, or the second correlation information input via the input device may be stored in the second correlation information storage unit 224.

Next, the operation of the acoustic model generation device 2 will be described with reference to the flowchart of FIG. 12.

(Step S1201) The second correlation information generation means 225 acquires the other language old acoustic model from the other language old acoustic model storage unit 122.

（ステップＳ１２０２）第二相関情報生成手段２２５は、対象言語旧音響モデル格納部１２１から対象言語旧音響モデルを取得する。 (Step S1202) The second correlation information generation unit 225 acquires the target language old acoustic model from the target language old acoustic model storage unit 121.

(Step S1203) The second correlation information generation means 225 calculates the second correlation information (for example, $V_I = \mu_t^I - \mu_s^I$), which is information indicating the difference between the vector ($\mu_s^I$) corresponding to the other language old acoustic model acquired in step S1201 and the vector ($\mu_t^I$) corresponding to the target language old acoustic model acquired in step S1202.

（ステップＳ１２０４）第二相関情報生成手段２２５は、ステップＳ１２０３で算出した第二相関情報（Ｖ_Ｉ）を、第二相関情報格納部２２４に蓄積する。 (Step S1204) The second correlation information generation unit 225 accumulates the second correlation information (V _I ) calculated in Step S1203 in the second correlation information storage unit 224.

（ステップＳ１２０５）音響モデル生成手段２２６は、音響モデルを生成するか否かを判断する。音響モデルを生成する場合はステップＳ１２０６に行き、音響モデルを生成しない場合はステップＳ１２０５に戻る。なお、例えば、ユーザ指示の受け付けにより音響モデルを生成しても良いし、第二相関情報の蓄積等をトリガーとして音響モデルを生成しても良い。 (Step S1205) The acoustic model generation means 226 determines whether to generate an acoustic model. If an acoustic model is to be generated, the process goes to step S1206. If an acoustic model is not to be generated, the process returns to step S1205. Note that, for example, an acoustic model may be generated by accepting a user instruction, or an acoustic model may be generated with the accumulation of second correlation information or the like as a trigger.

(Step S1206) The acoustic model generation means 226 acquires the other language new acoustic model ($\mu_s^R$) from the other language new acoustic model storage unit 123.

（ステップＳ１２０７）音響モデル生成手段２２６は、第二相関情報格納部２２４から第二相関情報（Ｖ_Ｉ）を取得する。 (Step S1207) The acoustic model generation unit 226 acquires the second correlation information (V _I ) from the second correlation information storage unit 224.

(Step S1208) The acoustic model generation means 226 applies the second correlation information acquired in step S1207 to the other language new acoustic model acquired in step S1206, and generates the target language new acoustic model. For example, the acoustic model generation means 226 adds the second correlation information ($V_I$) to the vector ($\mu_s^R$) corresponding to the other language new acoustic model, and obtains the target language new acoustic model ($\mu_t^R = \mu_s^R + V_I$).

（ステップＳ１２０９）音響モデル蓄積部１３は、ステップＳ１２０８で生成された新音響モデル（μ_ｔ ^Ｒ）を、対象言語新音響モデル格納部１１に蓄積し、処理を終了する。 (Step S1209) The acoustic model storage unit 13 stores the new acoustic model (μ _t ^R ) generated in step S1208 in the target language new acoustic model storage unit 11, and ends the process.
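The flow of steps S1201 to S1209 can be sketched as follows. This is a minimal illustration, assuming each acoustic model is reduced to a flat list of parameters (for example, a mean vector). The function names `second_correlation` and `generate_target_new` are hypothetical and do not appear in the specification.

```python
from typing import List, Sequence

Vector = List[float]


def second_correlation(mu_src_old: Sequence[float],
                       mu_tgt_old: Sequence[float]) -> Vector:
    """Steps S1201-S1203: V_I = mu_t^I - mu_s^I (element-wise difference)."""
    if len(mu_src_old) != len(mu_tgt_old):
        raise ValueError("model vectors must have the same dimension")
    return [t - s for s, t in zip(mu_src_old, mu_tgt_old)]


def generate_target_new(mu_src_new: Sequence[float],
                        v_i: Sequence[float]) -> Vector:
    """Steps S1206-S1208: add V_I to the other language new acoustic model."""
    if len(mu_src_new) != len(v_i):
        raise ValueError("model vector and V_I must have the same dimension")
    return [m + v for m, v in zip(mu_src_new, v_i)]


# Illustrative vectors only; real models would hold many more parameters.
v_i = second_correlation(mu_src_old=[0.0, 0.5], mu_tgt_old=[0.0, 0.0])
mu_tgt_new = generate_target_new(mu_src_new=[1.0, 1.0], v_i=v_i)
```

Storing `v_i` between the two calls plays the role of the second correlation information storage unit 224; persisting `mu_tgt_new` would correspond to step S1209.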

なお、図１２のフローチャートにおいて、他言語が一つの場合について説明したが、他言語が２以上でも良い。かかる場合、第二相関情報生成手段２２５は、２以上の他言語の２以上の第二相関情報を生成する。また、音響モデル生成手段２２６は、２以上の第二相関情報を用いて、対象言語旧音響モデルから対象言語新音響モデルを生成する。音響モデル生成手段２２６は、例えば、２以上の第二相関情報の平均ベクトルを取得し、当該平均ベクトルを対象言語旧音響モデルに対応するベクトルに加算し、対象言語新音響モデルを算出する。 In the flowchart of FIG. 12, the case where there is one other language has been described, but the number of other languages may be two or more. In this case, the second correlation information generation unit 225 generates two or more second correlation information in two or more other languages. The acoustic model generation unit 226 generates a target language new acoustic model from the target language old acoustic model using two or more pieces of second correlation information. For example, the acoustic model generation unit 226 acquires an average vector of two or more pieces of second correlation information, adds the average vector to a vector corresponding to the target language old acoustic model, and calculates a target language new acoustic model.

また、図１２のフローチャートのステップＳ１２０５は無くても良い。つまり、第二相関情報の蓄積の後、直ちに音響モデルの生成処理を行なっても良いことは言うまでもない。 Further, step S1205 in the flowchart of FIG. 12 may be omitted. That is, it goes without saying that an acoustic model generation process may be performed immediately after the second correlation information is accumulated.

Hereinafter, a specific operation of the acoustic model generation device 2 in the present embodiment will be described. First, the concept of the operation of the acoustic model generation device 2 will be described. FIG. 13 is a conceptual diagram illustrating the operation of the acoustic model generation device 2.

音響モデル生成装置２は、ここでは、適応前モデルの言語間の相関関係を利用し、音響モデルを生成する。 Here, the acoustic model generation apparatus 2 generates an acoustic model by using the correlation between the languages of the pre-adaptation model.

Second acoustic model generation means (not shown) of the acoustic model generation device 2 generates the other language old acoustic model 1303 from the acoustic model generation speech (source) 1301 of the other language (Source Language) by the speech model generation processing 1302. The other language old acoustic model 1303 is “Lab AM (source)” in FIG. 13. Next, adaptation processing means (not shown) of the acoustic model generation device 2 generates the other language new acoustic model 1306 from the other language old acoustic model 1303 by the adaptation processing (acoustic model generation (adaptation)) 1305 using the accumulated speech (source) 1304. The other language new acoustic model 1306 is “Fld AM (source)” in FIG. 13.

Next, the second acoustic model generation means (not shown) of the acoustic model generation device 2 generates the target language old acoustic model 1309 from the acoustic model generation speech (target) 1307 of the target language (Target Language) by the speech model generation processing 1308. The target language old acoustic model 1309 is “Lab AM (target)” in FIG. 13. Here, since the accumulated speech (target) 1310 of the target language does not exist, the acoustic model generation (adaptation) 1311 cannot be performed on the target language old acoustic model 1309. That is, the broken lines in FIG. 13 indicate data that does not exist or processing that cannot be performed.

そして、第二相関情報生成手段２２５は、他言語旧音響モデル１３０３と対象言語旧音響モデル１３０９とを用いて、第二相関情報ｇ（１３１２）を生成する。具体的には、第二相関情報生成手段２２５は、例えば、他言語旧音響モデル１３０３に対応するベクトルと対象言語旧音響モデル１３０９に対応するベクトルとの差を算出する。このベクトルの差であるベクトルが第二相関情報ｇである。 Then, the second correlation information generation means 225 generates the second correlation information g (1312) using the other language old acoustic model 1303 and the target language old acoustic model 1309. Specifically, the second correlation information generation unit 225 calculates, for example, a difference between a vector corresponding to the other language old acoustic model 1303 and a vector corresponding to the target language old acoustic model 1309. A vector which is the difference between the vectors is the second correlation information g.

Next, the acoustic model generation means 226 applies the second correlation information g to the other language new acoustic model 1306, and generates the target language new acoustic model 1313. Specifically, the acoustic model generation means 226 adds the vector corresponding to the second correlation information g to the vector corresponding to the other language new acoustic model 1306, and generates the target language new acoustic model 1313. The generated target language new acoustic model is the model to be created. The target language new acoustic model 1313 in FIG. 13 is “Fld AM (target)”.

以下、音響モデル生成部２２の処理について、さらなる具体例を説明する。 Hereinafter, further specific examples of the processing of the acoustic model generation unit 22 will be described.

(Specific example)
Here, a specific operation of the acoustic model generation device 2 will be described with reference to FIG. 14. In FIG. 14, the baseline model $S_I$ of the other language (Source Language) is a two-dimensional normal distribution with mean $\mu_s^I = (0, 1/2)$ and variance $\sigma_s^I$. The baseline model $S_I$ is the other language old acoustic model. The adapted model $S_R$ of the other language, adapted with the accumulated speech, is a two-dimensional normal distribution with mean $\mu_s^R = (1, 1)$ and variance $\sigma_s^R$ ($= \sigma_s^I$). The adapted model $S_R$ is the other language new acoustic model. Further, the target language old acoustic model $T_I$ is a two-dimensional normal distribution with mean $\mu_t^I = (0, 0)$ and variance $\sigma_t^I$. The target language new acoustic model is denoted $T_R$. It is assumed that the other language old acoustic model ($S_I$), the other language new acoustic model ($S_R$), and the target language old acoustic model ($T_I$) have been acquired by the processing described in specific example 1 of Embodiment 1.

In this situation, the second correlation information generation means 225 acquires the difference ($V_I$) between the mean vectors of $S_I$ and $T_I$ as $V_I = \mu_t^I - \mu_s^I = (0, -1/2)$. This second correlation information is the arrow 141 in FIG. 14.

Next, the acoustic model generation means 226 applies this mean vector difference ($V_I$) to the other language new acoustic model (142 in FIG. 14), and obtains a two-dimensional normal distribution with mean $\mu_t^R = \mu_s^R + V_I = (1, 1/2)$ and variance $\sigma_t^R$ ($= \sigma_t^I$). This two-dimensional normal distribution is the adapted model $T_R$ of the target language. $T_R$ is the target language new acoustic model.

In this specific example, as in specific example 1, one state of speech is modeled by a two-dimensional normal distribution; however, the model is not limited to a two-dimensional normal distribution, and modeling with a mixture of normal distributions of several tens of dimensions or the like is more preferable. Moreover, the model need not be a mixture of normal distributions; for example, even for an acoustic model using a neural network, adaptation can be performed using the second correlation information, which is the difference between two acoustic models.
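The arithmetic of this specific example can be checked with a short sketch. It assumes each model is a single Gaussian held in a hypothetical `Gaussian` dataclass, uses exact fractions, and carries the target language variance over unchanged, as in the text. The variance values are placeholders, since the specification gives no numbers for them.

```python
from dataclasses import dataclass
from fractions import Fraction as F
from typing import Tuple

Vec = Tuple[F, ...]


@dataclass(frozen=True)
class Gaussian:
    mean: Vec
    variance: Vec  # placeholder values; the specification gives none


def adapt_with_second_correlation(s_i: Gaussian, s_r: Gaussian,
                                  t_i: Gaussian) -> Gaussian:
    # Second correlation information: V_I = mu_t^I - mu_s^I
    v_i = tuple(t - s for t, s in zip(t_i.mean, s_i.mean))
    # Apply V_I to the mean of the other language new acoustic model S_R
    mean = tuple(m + v for m, v in zip(s_r.mean, v_i))
    # The variance of T_R is taken to equal the variance of T_I
    return Gaussian(mean=mean, variance=t_i.variance)


unit = (F(1), F(1))                       # hypothetical variance placeholder
S_I = Gaussian((F(0), F(1, 2)), unit)     # other language old acoustic model
S_R = Gaussian((F(1), F(1)), unit)        # other language new acoustic model
T_I = Gaussian((F(0), F(0)), unit)        # target language old acoustic model
T_R = adapt_with_second_correlation(S_I, S_R, T_I)
```

Here `T_R.mean` evaluates to (1, 1/2), the mean that the text gives for the target language new acoustic model.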

As described above, according to the present embodiment, by using the correlation between the acoustic model of the other language before adaptation processing and the acoustic model of the target language before adaptation processing, it is possible to generate, even for a language for which no speech data from a suitable environment such as the utterance environment exists, an acoustic model that corresponds to that language and improves speech recognition accuracy.

The software that implements the acoustic model generation device 2 in the present embodiment is the following program. That is, this program causes a computer to function as: an acoustic model generation unit that generates a target language new acoustic model from a target language old acoustic model or one or more other language new acoustic models, using one or more pieces of correlation information selected from first correlation information and second correlation information; and an acoustic model storage unit that stores the target language new acoustic model generated by the acoustic model generation unit in a recording medium. Here, the first correlation information is information on the relationship between one or more other language old acoustic models, each being a first acoustic model of one of one or more other languages different from the target language, and one or more other language new acoustic models, each being a second acoustic model of one of the one or more other languages. The second correlation information is information on the relationship between the one or more other language old acoustic models and the target language old acoustic model, which is a first acoustic model of the target language.

また、上記プログラムにおいて、前記音響モデル生成部は、前記他言語新音響モデルを格納し得る他言語新音響モデル格納部と、第二相関情報を格納し得る第二相関情報格納部と、前記第二相関情報を用いて、前記他言語新音響モデル格納部に格納されている他言語新音響モデルから対象言語新音響モデルを生成する音響モデル生成手段とを具備するものとして、コンピュータを機能させることは好適である。 Further, in the above program, the acoustic model generation unit includes an other language new acoustic model storage unit that can store the other language new acoustic model, a second correlation information storage unit that can store second correlation information, and the first Using the two-correlation information, the computer functions as an acoustic model generation unit that generates a target language new acoustic model from the other language new acoustic model stored in the other language new acoustic model storage unit. Is preferred.

(Embodiment 3)
The present embodiment describes an acoustic model generation device that generates a target language new acoustic model from the target language old acoustic model, an other language new acoustic model, or both the target language old acoustic model and an other language new acoustic model, using one or more pieces of correlation information among one or more pieces of first correlation information and one or more pieces of second correlation information.

さらに具体的には、本実施の形態において、第一相関情報と第二相関情報の用い方（アルゴリズム）が動的に変化する音響モデル生成装置について説明する。 More specifically, an acoustic model generation apparatus in which the usage (algorithm) of the first correlation information and the second correlation information dynamically changes in the present embodiment will be described.

図１５は、本実施の形態における音響モデル生成装置３のブロック図である。 FIG. 15 is a block diagram of the acoustic model generation device 3 in the present embodiment.

音響モデル生成装置３は、対象言語新音響モデル格納部１１、音響モデル生成部３２、および音響モデル蓄積部１３を備える。 The acoustic model generation device 3 includes a target language new acoustic model storage unit 11, an acoustic model generation unit 32, and an acoustic model storage unit 13.

音響モデル生成部３２は、対象言語旧音響モデル格納部１２１、他言語旧音響モデル格納部１２２、他言語新音響モデル格納部１２３、第一相関情報格納部１２４、第一相関情報生成手段１２５、第二相関情報格納部２２４、第二相関情報生成手段２２５、選択手段３２１、および音響モデル生成手段３２６を備える。また、選択手段３２１は、選択情報管理部３２１１を備える。 The acoustic model generation unit 32 includes a target language old acoustic model storage unit 121, another language old acoustic model storage unit 122, another language new acoustic model storage unit 123, a first correlation information storage unit 124, a first correlation information generation unit 125, A second correlation information storage unit 224, a second correlation information generation unit 225, a selection unit 321 and an acoustic model generation unit 326 are provided. The selection unit 321 includes a selection information management unit 3211.

音響モデル生成部３２は、他言語旧音響モデルと他言語新音響モデルとの関係に関する情報である１または２以上の第一相関情報、または他言語旧音響モデルと対象言語旧音響モデルとの関係に関する情報である１または２以上の第二相関情報のうちの、いずれか１以上の相関情報を用いて、対象言語旧音響モデルまたは他言語新音響モデルから、対象言語新音響モデルを生成する。 The acoustic model generation unit 32 is one or more first correlation information that is information related to the relationship between the other language old acoustic model and the other language new acoustic model, or the relationship between the other language old acoustic model and the target language old acoustic model. A target language new acoustic model is generated from the target language old acoustic model or the other language new acoustic model by using any one or more correlation information of one or two or more second correlation information that is information regarding.

Furthermore, in the present embodiment, the acoustic model generation unit 32 generates the target language new acoustic model from the target language old acoustic model, an other language new acoustic model, or both, using one or more pieces of first correlation information and one or more pieces of second correlation information.

The selection means 321 selects one of two or more algorithms for generating the target language new acoustic model, according to the data contained in the target language old acoustic model or the other language new acoustic model. For example, the first algorithm generates the target language new acoustic model from the target language old acoustic model using the first correlation information. For example, the second algorithm generates the target language new acoustic model from the other language new acoustic model using the second correlation information. Further, for example, the third algorithm generates the target language new acoustic model from the target language old acoustic model and the other language new acoustic model using the first correlation information and the second correlation information.

選択情報管理部３２１１は、選択手段３２１がアルゴリズムを決定するための情報である１以上の選択情報を格納し得る。選択情報は、例えば、音素を識別する音素識別子と、アルゴリズムを識別するアルゴリズム識別子の対の情報である。なお、選択情報は、音素より細かい単位で、アルゴリズムを切替える選択情報を有しても良い。また、選択情報は、音素より荒い単位で、アルゴリズムを切替える選択情報を有しても良い。 The selection information management unit 3211 can store one or more selection information that is information for the selection means 321 to determine an algorithm. The selection information is, for example, information on a pair of a phoneme identifier that identifies a phoneme and an algorithm identifier that identifies an algorithm. Note that the selection information may include selection information for switching the algorithm in units smaller than phonemes. Further, the selection information may include selection information for switching the algorithm in units rougher than phonemes.
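The selection information described above can be pictured as a lookup table from phoneme identifier to algorithm identifier. The sketch below is illustrative only: the table entries, the `Algorithm` enum, and `select_algorithm` are hypothetical, and raising an error for a phoneme with no entry is an assumption, since the specification does not say what happens in that case.

```python
from enum import IntEnum
from typing import Dict, Mapping


class Algorithm(IntEnum):
    FIRST = 1   # first correlation information, from the target language old model
    SECOND = 2  # second correlation information, from the other language new model
    THIRD = 3   # both pieces of correlation information, from both models


# Hypothetical selection information: phoneme identifier -> algorithm identifier.
SELECTION_INFO: Dict[str, Algorithm] = {
    "a": Algorithm.THIRD,
    "i": Algorithm.FIRST,
    "u": Algorithm.SECOND,
}


def select_algorithm(phoneme_id: str,
                     table: Mapping[str, Algorithm] = SELECTION_INFO) -> Algorithm:
    """Return the algorithm identifier paired with the given phoneme identifier."""
    try:
        return table[phoneme_id]
    except KeyError:
        # Assumed behaviour: a missing entry is treated as an error.
        raise KeyError(f"no selection information for phoneme {phoneme_id!r}") from None
```

A table keyed on a finer unit (for example, an HMM state) or a coarser unit (for example, a phoneme class) would have the same shape with a different key type.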

音響モデル生成手段３２６は、１または２以上の第一相関情報と１または２以上の第二相関情報のうちの１以上の相関情報を用いて、対象言語旧音響モデル、または他言語新音響モデル、または対象言語旧音響モデルと他言語新音響モデルとから、対象言語新音響モデルを生成する。 The acoustic model generation means 326 uses the one or more pieces of first correlation information and one or more pieces of correlation information among the one or more pieces of second correlation information, and uses the target language old acoustic model or other language new acoustic model. Alternatively, the target language new acoustic model is generated from the target language old acoustic model and the other language new acoustic model.

さらに具体的には、音響モデル生成手段３２６は、選択手段３２１が選択した一のアルゴリズムに従って、１または２以上の第一相関情報と１または２以上の第二相関情報のうちの１以上の相関情報を用いて、対象言語新音響モデルを生成する。 More specifically, the acoustic model generation unit 326 performs one or more correlations among one or more first correlation information and one or two or more second correlation information according to one algorithm selected by the selection unit 321. The target language new acoustic model is generated using the information.

音響モデル生成部３２、選択手段３２１、および音響モデル生成手段３２６は、通常、ＭＰＵやメモリ等から実現され得る。音響モデル生成部３２等の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The acoustic model generation unit 32, the selection unit 321 and the acoustic model generation unit 326 can usually be realized by an MPU, a memory, or the like. The processing procedure of the acoustic model generation unit 32 and the like is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).

選択情報管理部３２１１は、不揮発性の記録媒体が好適であるが、揮発性の記録媒体でも実現可能である。選択情報管理部３２１１に選択情報が記憶される過程は問わない。例えば、記録媒体を介して選択情報が選択情報管理部３２１１で記憶されるようになってもよく、通信回線等を介して送信された選択情報が選択情報管理部３２１１で記憶されるようになってもよく、あるいは、入力デバイスを介して入力された選択情報が選択情報管理部３２１１で記憶されるようになってもよい。 The selection information management unit 3211 is preferably a nonvolatile recording medium, but can also be realized by a volatile recording medium. The process in which the selection information is stored in the selection information management unit 3211 is not limited. For example, the selection information may be stored in the selection information management unit 3211 via a recording medium, and the selection information transmitted via a communication line or the like is stored in the selection information management unit 3211. Alternatively, the selection information input via the input device may be stored in the selection information management unit 3211.

Next, the operation of the acoustic model generation device 3 will be described with reference to the flowchart of FIG. 16. The processing of accumulating the first correlation information in the first correlation information storage unit 124 and the processing of accumulating the second correlation information in the second correlation information storage unit 224 were described in Embodiments 1 and 2, so their description is omitted here. The flowchart of FIG. 16 is used to describe the processing of generating the target language new acoustic model from the target language old acoustic model and/or the other language new acoustic model. Description of steps in the flowchart of FIG. 16 that are the same as steps in the flowchart of FIG. 2 is omitted.

（ステップＳ１６０１）選択手段３２１は、カウンタｉに１を代入する。 (Step S1601) The selection unit 321 substitutes 1 for the counter i.

（ステップＳ１６０２）選択手段３２１は、対象言語旧音響モデルまたは他言語新音響モデルの中に、ｉ番目の処理単位（例えば、ｉ番目の音素）が存在するか否かを判断する。ｉ番目の処理単位が存在すればステップＳ１６０３に行き、存在しなければ処理を終了する。 (Step S1602) The selection unit 321 determines whether or not the i-th processing unit (for example, the i-th phoneme) exists in the target language old acoustic model or the other language new acoustic model. If the i-th processing unit exists, the process goes to step S1603, and if it does not exist, the process ends.

（ステップＳ１６０３）選択手段３２１は、対象言語旧音響モデルまたは他言語新音響モデルの中のｉ番目の処理単位の処理単位識別子（例えば、音素識別子の「ａ」）を取得する。 (Step S1603) The selection unit 321 acquires the processing unit identifier (for example, “a” of the phoneme identifier) of the i-th processing unit in the target language old acoustic model or the other language new acoustic model.

（ステップＳ１６０４）選択手段３２１は、ステップＳ１６０３で取得した処理単位識別子に対応するアルゴリズム識別子を選択情報管理部３２１１から取得する。 (Step S1604) The selection unit 321 acquires an algorithm identifier corresponding to the processing unit identifier acquired in step S1603 from the selection information management unit 3211.

(Step S1605) The acoustic model generation means 326 determines whether the algorithm identifier acquired in step S1604 is information indicating the first algorithm. If it is the first algorithm, the process goes to step S206; otherwise, the process goes to step S1606. Here, the first algorithm is the algorithm described in Embodiment 1, in which the acoustic model generation unit 12 generates the target language new acoustic model from the target language old acoustic model using the first correlation information.

(Step S1606) The acoustic model generation means 326 determines whether the algorithm identifier acquired in step S1604 is information indicating the second algorithm. If it is the second algorithm, the process goes to step S1206; otherwise, the process goes to step S1608. If it is not the second algorithm, it is the third algorithm. The second algorithm is the algorithm described in Embodiment 2, in which the acoustic model generation unit 22 generates the target language new acoustic model from the other language new acoustic model using the second correlation information. The third algorithm generates the target language new acoustic model from the target language old acoustic model and the other language new acoustic model using the first correlation information and the second correlation information.

（ステップＳ１６０７）選択手段３２１は、カウンタｉを１インクリメントし、ステップＳ１６０２に戻る。 (Step S1607) The selection unit 321 increments the counter i by 1, and returns to step S1602.

（ステップＳ１６０８）音響モデル生成手段３２６は、第一相関情報と第二相関情報とを用いて、対象言語旧音響モデルおよび他言語新音響モデルから、対象言語新音響モデルを生成する。ステップＳ２０９に行く。 (Step S1608) The acoustic model generation means 326 generates a target language new acoustic model from the target language old acoustic model and the other language new acoustic model using the first correlation information and the second correlation information. Go to step S209.
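The loop of FIG. 16 can be sketched as a per-unit dispatch. This is a simplified illustration under several assumptions that are not in the specification: each model is a dictionary from processing unit identifier (for example, a phoneme identifier) to a parameter vector, the same identifiers are shared by the target language and the other language, the correlation information is computed on the fly instead of being read from the storage units 124 and 224, and the third algorithm takes the plain average of the two candidates. All names are hypothetical.

```python
from typing import Dict, List, Mapping

Vector = List[float]
Model = Dict[str, Vector]  # processing unit identifier -> parameter vector


def _add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def _sub(a: Vector, b: Vector) -> Vector:
    return [x - y for x, y in zip(a, b)]


def generate_target_new_model(tgt_old: Model, src_old: Model, src_new: Model,
                              selection_info: Mapping[str, int]) -> Model:
    tgt_new: Model = {}
    # S1601, S1602, S1607: visit each processing unit in turn.
    for unit_id, mu_t_old in tgt_old.items():
        # S1603, S1604: look up the algorithm identifier for this unit.
        algorithm_id = selection_info[unit_id]
        v_s = _sub(src_new[unit_id], src_old[unit_id])  # first correlation information
        v_i = _sub(mu_t_old, src_old[unit_id])          # second correlation information
        if algorithm_id == 1:      # S1605: first algorithm
            mu = _add(mu_t_old, v_s)
        elif algorithm_id == 2:    # S1606: second algorithm
            mu = _add(src_new[unit_id], v_i)
        else:                      # S1608: third algorithm
            r1 = _add(mu_t_old, v_s)
            r2 = _add(src_new[unit_id], v_i)
            mu = [(x + y) / 2 for x, y in zip(r1, r2)]
        tgt_new[unit_id] = mu
    return tgt_new
```

As in the flowchart, any identifier that is neither the first nor the second algorithm falls through to the third.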

以下、本実施の形態における音響モデル生成装置３の具体的な動作について説明する。ここで、選択情報管理部３２１１は、図１７に示す選択情報管理表を格納している。選択情報管理表は、「音素識別子」「アルゴリズム識別子」を有するレコードを、２以上、格納している。また、アルゴリズム識別子「１」に対応する音素に対しては上記の第一のアルゴリズムを実行することを意味し、アルゴリズム識別子「２」に対応する音素に対しては上記の第二のアルゴリズムを実行することを意味し、アルゴリズム識別子「３」に対応する音素に対しては上記の第三のアルゴリズムを実行することを意味する。 Hereinafter, a specific operation of the acoustic model generation device 3 in the present embodiment will be described. Here, the selection information management unit 3211 stores the selection information management table shown in FIG. The selection information management table stores two or more records having “phoneme identifier” and “algorithm identifier”. In addition, the first algorithm is executed for the phoneme corresponding to the algorithm identifier “1”, and the second algorithm is executed for the phoneme corresponding to the algorithm identifier “2”. This means that the third algorithm is executed for the phoneme corresponding to the algorithm identifier “3”.

以下、音響モデル生成部３２の動作について説明する。まず、選択手段３２１は、対象言語旧音響モデルの中の１番目の音素の音素識別子「ａ」を取得した、とする。次に、選択手段３２１は、音素識別子「ａ」と対になるアルゴリズム識別子「３」を、選択情報管理表から取得する。 Hereinafter, the operation of the acoustic model generation unit 32 will be described. First, it is assumed that the selection unit 321 has acquired the phoneme identifier “a” of the first phoneme in the target language old acoustic model. Next, the selection unit 321 acquires the algorithm identifier “3” paired with the phoneme identifier “a” from the selection information management table.

そして、音響モデル生成手段３２６は、アルゴリズム識別子「３」に従って、第三のアルゴリズムを、以下のように実行する。なお、ここでは、実施の形態１の具体例１等と同様に、音響モデルを２次元正規分布である、とする。 Then, the acoustic model generation means 326 executes the third algorithm as follows, in accordance with the algorithm identifier “3”. Here, as in specific example 1 of the first embodiment, the acoustic model is assumed to be a two-dimensional normal distribution.

つまり、音響モデル生成手段３２６は、第一相関情報格納部１２４の第一相関情報「Ｖ_ｓ＝μ_ｓ^Ｒ−μ_ｓ^Ｉ＝（１，１／２）」を取得する。 That is, the acoustic model generation means 326 acquires the first correlation information “V_s = μ_s^R − μ_s^I = (1, 1/2)” from the first correlation information storage unit 124.

また、音響モデル生成手段３２６は、対象言語（ＴａｒｇｅｔＬａｎｇｕａｇｅ）のベースラインモデルＴ_Ｉ（平均μ_ｔ^Ｉ＝（０，０）、分散σ_ｔ^Ｉ）である２次元正規分布を、対象言語旧音響モデル格納部１２１から取得する。 In addition, the acoustic model generation means 326 acquires, from the target language old acoustic model storage unit 121, the two-dimensional normal distribution that is the baseline model T_I of the target language (mean μ_t^I = (0, 0), variance σ_t^I).

次に、音響モデル生成手段３２６は、平均「μ_ｔ^Ｒ１＝μ_ｔ^Ｉ＋Ｖ_ｓ＝（１，１／２）」、分散σ_ｔ^Ｒ（＝σ_ｔ^Ｉ）をもつ２次元正規分布を取得する。 Next, the acoustic model generation means 326 acquires a two-dimensional normal distribution having the mean “μ_t^R1 = μ_t^I + V_s = (1, 1/2)” and the variance σ_t^R (= σ_t^I).

次に、音響モデル生成手段３２６は、第二相関情報格納部２２４の第二相関情報「Ｖ_Ｉ＝μ_ｔ^Ｉ−μ_ｓ^Ｉ＝（０，−１／２）」を取得する。 Next, the acoustic model generation means 326 acquires the second correlation information “V_I = μ_t^I − μ_s^I = (0, −1/2)” from the second correlation information storage unit 224.

また、音響モデル生成手段３２６は、この平均ベクトルの差分（Ｖ_Ｉ）を、他言語新音響モデルに適用し、平均「μ_ｔ^Ｒ２＝μ_ｓ^Ｒ＋Ｖ_Ｉ＝（１，１／２）」、分散σ_ｔ^Ｒ（＝σ_ｔ^Ｉ）をもつ２次元正規分布を取得する。 Further, the acoustic model generation means 326 applies this mean-vector difference (V_I) to the other language new acoustic model, and acquires a two-dimensional normal distribution having the mean “μ_t^R2 = μ_s^R + V_I = (1, 1/2)” and the variance σ_t^R (= σ_t^I).

次に、音響モデル生成手段３２６は、「１／２（μ_ｔ^Ｒ１＋μ_ｔ^Ｒ２）」を実行し、最終的な対象言語新音響モデルμ_ｔ^Ｒを得る。なお、ここで、音響モデル生成手段３２６は、μ_ｔ^Ｒ１とμ_ｔ^Ｒ２との適用を５０％、５０％としたが、異なる重みを付けて、音響モデルを生成しても良い。 Next, the acoustic model generation means 326 computes “1/2 (μ_t^R1 + μ_t^R2)” to obtain the mean μ_t^R of the final target language new acoustic model. Here, the acoustic model generation means 326 weights μ_t^R1 and μ_t^R2 equally at 50% each, but it may generate the acoustic model using different weights.

なお、本具体例において、具体例１と同様に、音声の一の状態を２次元の正規分布でモデル化されている、としたが、２次元の正規分布に限られず、数十次元の混合正規分布等でモデル化されていることはさらに好適である。また、混合正規分布でモデル化されているとも限らず、例えばニューラルネットワークを用いた音響モデルの場合においても、２つの音響モデルの差分である第二相関情報を用いて適応することができる。 In this specific example, as in specific example 1, one state of speech is assumed to be modeled by a two-dimensional normal distribution. However, the model is not limited to a two-dimensional normal distribution, and modeling with, for example, a Gaussian mixture distribution of several tens of dimensions is more preferable. Nor is the model limited to a Gaussian mixture distribution: even in the case of an acoustic model using a neural network, for example, adaptation can be performed using the second correlation information, which is the difference between two acoustic models.
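
The arithmetic of the third algorithm in this specific example can be checked with the short sketch below. It is an illustration under stated assumptions, not the implementation of the embodiment: the helper names `add`, `sub`, and `blend` and the weight parameter `w` are hypothetical, and the other-language means μ_s^I and μ_s^R are not given directly in the text but are derived here from the definitions of V_I and V_s.

```python
from fractions import Fraction as F
from typing import Tuple

Vec = Tuple[F, F]


def add(u: Vec, v: Vec) -> Vec:
    """Element-wise sum of two 2-D mean vectors."""
    return (u[0] + v[0], u[1] + v[1])


def sub(u: Vec, v: Vec) -> Vec:
    """Element-wise difference of two 2-D mean vectors."""
    return (u[0] - v[0], u[1] - v[1])


def blend(u: Vec, v: Vec, w: F = F(1, 2)) -> Vec:
    """Weighted combination w*u + (1-w)*v; w = 1/2 gives the 50%/50% case."""
    return (w * u[0] + (1 - w) * v[0], w * u[1] + (1 - w) * v[1])


# Values stated in the specific example.
mu_t_I: Vec = (F(0), F(0))      # mean of the target language old acoustic model
V_s: Vec = (F(1), F(1, 2))      # first correlation information
V_I: Vec = (F(0), F(-1, 2))     # second correlation information

# Other-language means implied by V_I = mu_t_I - mu_s_I and V_s = mu_s_R - mu_s_I.
mu_s_I = sub(mu_t_I, V_I)       # (0, 1/2)
mu_s_R = add(mu_s_I, V_s)       # (1, 1)

mu_t_R1 = add(mu_t_I, V_s)      # shift the target language old model by V_s
mu_t_R2 = add(mu_s_R, V_I)      # shift the other language new model by V_I
mu_t_R = blend(mu_t_R1, mu_t_R2)  # equal-weight average of the two estimates

print(mu_t_R1, mu_t_R2, mu_t_R)
```

With these values both estimates coincide at (1, 1/2), so the average is also (1, 1/2); passing a different `w` to `blend` corresponds to the unequal weighting mentioned above. Exact fractions are used only so that the numbers match the text without rounding.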

以上、本実施の形態によれば、他言語の適応処理前の音響モデルと他言語の適応処理後の音響モデルとの相関関係、および他言語の適応処理前の音響モデルと対象言語の適応処理前の音響モデルとの相関関係を利用することにより、発話環境等の適した環境における音声データが存在しない言語でも、当該言語に対応する音響モデルであり、音声認識精度を上げる音響モデルを生成できる。 As described above, according to the present embodiment, by using the correlation between the acoustic model of another language before adaptation processing and the acoustic model of that language after adaptation processing, together with the correlation between the acoustic model of the other language before adaptation processing and the acoustic model of the target language before adaptation processing, an acoustic model that corresponds to a language and improves speech recognition accuracy can be generated even for a language for which no speech data from a suitable environment, such as the utterance environment, exists.

なお、本実施の形態における音響モデル生成装置を実現するソフトウェアは、以下のようなプログラムである。つまり、このプログラムは、コンピュータを、対象言語とは異なる１以上の各他言語の第一の音響モデルである１以上の各他言語旧音響モデルと前記１以上の各他言語の第二の音響モデルである１以上の各他言語新音響モデルとの関係に関する情報である第一相関情報、または前記１以上の各他言語旧音響モデルと前記対象言語の第一の音響モデルである対象言語旧音響モデルとの関係に関する情報である第二相関情報のうちの、いずれか１以上の相関情報を用いて、対象言語旧音響モデルまたは１以上の他言語新音響モデルから、対象言語新音響モデルを生成する音響モデル生成部と、前記音響モデル生成部が生成した対象言語新音響モデルを記録媒体に蓄積する音響モデル蓄積部として機能させるためのプログラムである。 Note that the software that implements the acoustic model generation device according to the present embodiment is the following program. That is, this program causes a computer to function as: an acoustic model generation unit that generates a target language new acoustic model from a target language old acoustic model or one or more other language new acoustic models, using one or more pieces of correlation information from among first correlation information, which is information on the relationship between one or more other language old acoustic models, each being a first acoustic model of one of one or more other languages different from the target language, and the one or more other language new acoustic models, each being a second acoustic model of one of the one or more other languages, and second correlation information, which is information on the relationship between the one or more other language old acoustic models and the target language old acoustic model, which is a first acoustic model of the target language; and an acoustic model storage unit that stores the target language new acoustic model generated by the acoustic model generation unit in a recording medium.

また、上記プログラムにおいて、前記音響モデル生成部は、対象言語旧音響モデルを格納し得る対象言語旧音響モデル格納部と、他言語新音響モデルを格納し得る他言語新音響モデル格納部と、第一相関情報を格納し得る第一相関情報格納部と、第二相関情報を格納し得る第二相関情報格納部と、前記第一相関情報と前記第二相関情報とを用いて、前記対象言語旧音響モデル、または前記他言語新音響モデル、または前記対象言語旧音響モデルと前記他言語新音響モデルとから、対象言語新音響モデルを生成する音響モデル生成手段とを具備するものとして、コンピュータを機能させることは好適である。 In the above program, it is preferable that the computer is caused to function such that the acoustic model generation unit includes: a target language old acoustic model storage unit capable of storing the target language old acoustic model; an other language new acoustic model storage unit capable of storing the other language new acoustic model; a first correlation information storage unit capable of storing the first correlation information; a second correlation information storage unit capable of storing the second correlation information; and acoustic model generation means that generates the target language new acoustic model from the target language old acoustic model, from the other language new acoustic model, or from both the target language old acoustic model and the other language new acoustic model, using the first correlation information and the second correlation information.

また、上記プログラムにおいて、前記音響モデル生成部は、対象言語新音響モデルを生成する２以上のアルゴリズムのうち、前記対象言語旧音響モデルまたは前記他言語新音響モデルが有するデータに応じて、いずれか一のアルゴリズムを選択する選択手段をさらに具備し、前記音響モデル生成手段は、前記選択手段が選択した前記一のアルゴリズムに従って、前記第一相関情報と前記第二相関情報のうちの１以上の相関情報を用いて、前記対象言語新音響モデルを生成するものとして、コンピュータを機能させることは好適である。 In the above program, it is also preferable that the computer is caused to function such that the acoustic model generation unit further includes selection means that selects one of two or more algorithms for generating the target language new acoustic model in accordance with data of the target language old acoustic model or the other language new acoustic model, and the acoustic model generation means generates the target language new acoustic model in accordance with the one algorithm selected by the selection means, using one or more pieces of correlation information from among the first correlation information and the second correlation information.

また、図１８は、本明細書で述べたプログラムを実行して、上述した種々の実施の形態の音響モデル生成装置を実現するコンピュータの外観を示す。上述の実施の形態は、コンピュータハードウェア及びその上で実行されるコンピュータプログラムで実現され得る。図１８は、このコンピュータシステム４００の概観図であり、図１９は、システム４００のブロック図である。 FIG. 18 shows the external appearance of a computer that executes the program described in this specification to realize the acoustic model generation device according to the various embodiments described above. The above-described embodiments can be realized by computer hardware and a computer program executed thereon. FIG. 18 is an overview diagram of this computer system 400, and FIG. 19 is a block diagram of the system 400.

図１８において、コンピュータシステム４００は、ＣＤ−ＲＯＭドライブを含むコンピュータ４０１と、キーボード４０２と、マウス４０３と、モニタ４０４とを含む。 In FIG. 18, a computer system 400 includes a computer 401 including a CD-ROM drive, a keyboard 402, a mouse 403, and a monitor 404.

図１９において、コンピュータ４０１は、ＣＤ−ＲＯＭドライブ４０１２に加えて、ＭＰＵ４０１３と、バス４０１４と、ＲＯＭ４０１５と、ＲＡＭ４０１６と、ハードディスク４０１７とを含む。なお、バス４０１４は、ＭＰＵ４０１３やＣＤ−ＲＯＭドライブ４０１２に接続されている。また、ＲＯＭ４０１５には、ブートアッププログラム等のプログラムが記憶されている。また、ＲＡＭ４０１６は、ＭＰＵ４０１３に接続され、アプリケーションプログラムの命令を一時的に記憶するとともに一時記憶空間を提供するためのものである。また、ハードディスク４０１７は、アプリケーションプログラム、システムプログラム、及びデータを記憶するためのものである。ここでは、図示しないが、コンピュータ４０１は、さらに、ＬＡＮへの接続を提供するネットワークカードを含んでも良い。 In FIG. 19, a computer 401 includes an MPU 4013, a bus 4014, a ROM 4015, a RAM 4016, and a hard disk 4017 in addition to a CD-ROM drive 4012. The bus 4014 is connected to the MPU 4013 and the CD-ROM drive 4012. The ROM 4015 stores a program such as a bootup program. A RAM 4016 is connected to the MPU 4013, and temporarily stores application program instructions and provides a temporary storage space. The hard disk 4017 is for storing application programs, system programs, and data. Although not shown here, the computer 401 may further include a network card that provides connection to a LAN.

コンピュータシステム４００に、上述した実施の形態の音響モデル生成装置の機能を実行させるプログラムは、ＣＤ−ＲＯＭ４１０１に記憶されて、ＣＤ−ＲＯＭドライブ４０１２に挿入され、さらにハードディスク４０１７に転送されても良い。これに代えて、プログラムは、図示しないネットワークを介してコンピュータ４０１に送信され、ハードディスク４０１７に記憶されても良い。プログラムは実行の際にＲＡＭ４０１６にロードされる。プログラムは、ＣＤ−ＲＯＭ４１０１またはネットワークから直接、ロードされても良い。 A program that causes the computer system 400 to execute the functions of the acoustic model generation apparatus according to the above-described embodiment may be stored in the CD-ROM 4101, inserted into the CD-ROM drive 4012, and further transferred to the hard disk 4017. Alternatively, the program may be transmitted to the computer 401 via a network (not shown) and stored in the hard disk 4017. The program is loaded into the RAM 4016 at the time of execution. The program may be loaded directly from the CD-ROM 4101 or the network.

プログラムは、コンピュータ４０１に、上述した実施の形態の音響モデル生成装置の機能を実行させるオペレーティングシステム、またはサードパーティープログラム等は、必ずしも含まなくても良い。プログラムは、制御された態様で適切な機能（モジュール）を呼び出し、所望の結果が得られるようにする命令の部分のみを含んでいれば良い。コンピュータシステム４００がどのように動作するかは周知であり、詳細な説明は省略する。 The program does not necessarily include an operating system or a third-party program that causes the computer 401 to execute the functions of the acoustic model generation apparatus according to the above-described embodiment. The program only needs to include an instruction portion that calls an appropriate function (module) in a controlled manner and obtains a desired result. How the computer system 400 operates is well known and will not be described in detail.

また、上記プログラムを実行するコンピュータは、単数であってもよく、複数であってもよい。すなわち、集中処理を行ってもよく、あるいは分散処理を行ってもよい。 Further, the computer that executes the program may be singular or plural. That is, centralized processing may be performed, or distributed processing may be performed.

また、上記各実施の形態において、各処理（各機能）は、単一の装置（システム）によって集中処理されることによって実現されてもよく、あるいは、複数の装置によって分散処理されることによって実現されてもよい。 In each of the above embodiments, each process (each function) may be realized by centralized processing by a single device (system), or by distributed processing by a plurality of devices. May be.

本発明は、以上の実施の形態に限定されることなく、種々の変更が可能であり、それらも本発明の範囲内に包含されるものであることは言うまでもない。 The present invention is not limited to the above-described embodiments, and various modifications are possible, and it goes without saying that these are also included in the scope of the present invention.

以上のように、本発明にかかる音響モデル生成装置は、発話環境等の適した環境における音声データが存在しない言語でも、当該言語に対応する音響モデルであり、音声認識精度を上げる音響モデルを生成できる、という効果を有し、音響モデル生成装置等として有用である。 As described above, the acoustic model generation device according to the present invention has the effect that, even for a language for which no speech data from a suitable environment such as the utterance environment exists, it can generate an acoustic model that corresponds to the language and improves speech recognition accuracy, and it is useful as an acoustic model generation device and the like.

１、２、３　音響モデル生成装置 1, 2, 3 Acoustic model generation device
１１　対象言語新音響モデル格納部 11 Target language new acoustic model storage unit
１２、２２、３２　音響モデル生成部 12, 22, 32 Acoustic model generation unit
１３　音響モデル蓄積部 13 Acoustic model storage unit
１２１　対象言語旧音響モデル格納部 121 Target language old acoustic model storage unit
１２２　他言語旧音響モデル格納部 122 Other language old acoustic model storage unit
１２３　他言語新音響モデル格納部 123 Other language new acoustic model storage unit
１２４　第一相関情報格納部 124 First correlation information storage unit
１２５　第一相関情報生成手段 125 First correlation information generation means
１２６、２２６、３２６　音響モデル生成手段 126, 226, 326 Acoustic model generation means
２２４　第二相関情報格納部 224 Second correlation information storage unit
２２５　第二相関情報生成手段 225 Second correlation information generation means
３２１　選択手段 321 Selection means

Claims

An acoustic model generation device that generates an acoustic model of a target language for speech recognition, comprising:
a target language new acoustic model storage unit capable of storing a target language new acoustic model, which is a second acoustic model of the target language;
an acoustic model generation unit that generates the target language new acoustic model from a target language old acoustic model or one or more other language new acoustic models, using one or more pieces of correlation information from among first correlation information, which is information on a relationship between one or more other language old acoustic models, each being a first acoustic model of one of one or more other languages different from the target language, and the one or more other language new acoustic models, each being a second acoustic model of one of the one or more other languages, and second correlation information, which is information on a relationship between the one or more other language old acoustic models and the target language old acoustic model, which is a first acoustic model of the target language; and
an acoustic model storage unit that stores the target language new acoustic model generated by the acoustic model generation unit in the target language new acoustic model storage unit,
wherein the acoustic model generation unit comprises:
a target language old acoustic model storage unit capable of storing the target language old acoustic model;
an other language new acoustic model storage unit capable of storing the one or more other language new acoustic models;
a first correlation information storage unit capable of storing one or more pieces of first correlation information;
a second correlation information storage unit capable of storing one or more pieces of second correlation information;
selection means for selecting one of two or more algorithms for generating the target language new acoustic model, in accordance with data of the target language old acoustic model or the one or more other language new acoustic models; and
acoustic model generation means for generating the target language new acoustic model from the target language old acoustic model, from the one or more other language new acoustic models, or from both the target language old acoustic model and the one or more other language new acoustic models, using the one or more pieces of first correlation information and the one or more pieces of second correlation information,
and wherein the acoustic model generation means
generates the target language new acoustic model in accordance with the one algorithm selected by the selection means, using one or more pieces of correlation information from among the one or more pieces of first correlation information and the one or more pieces of second correlation information.

The acoustic model generation device according to claim 1, wherein
the other language old acoustic model is an acoustic model of the other language before adaptation processing, or an acoustic model generated with data different from that of the other language new acoustic model,
the other language new acoustic model is an acoustic model of the other language after adaptation processing, or an acoustic model generated with data different from that of the other language old acoustic model,
the target language old acoustic model is an acoustic model of the target language before adaptation processing, or an acoustic model generated with data similar to that of the other language old acoustic model, and
the target language new acoustic model is an acoustic model of the target language after adaptation processing.

The acoustic model generation device according to claim 2, wherein
the first correlation information is information obtained from one or more conversion functions that are differences between one or more vectors corresponding to the one or more other language old acoustic models and one or more vectors corresponding to the one or more other language new acoustic models, and
the second correlation information is information obtained from one or more conversion functions that are differences between a vector corresponding to each of the one or more other language old acoustic models and a vector corresponding to the target language old acoustic model.

The acoustic model generation device according to claim 3, wherein
the acoustic model generation unit generates the target language new acoustic model by mapping a vector corresponding to the target language old acoustic model using a conversion function of the first correlation information.

An acoustic model generation method realized by an acoustic model storage unit and an acoustic model generation unit that includes a target language old acoustic model storage unit capable of storing a target language old acoustic model, an other language new acoustic model storage unit capable of storing one or more other language new acoustic models, a first correlation information storage unit capable of storing one or more pieces of first correlation information, and a second correlation information storage unit capable of storing one or more pieces of second correlation information, the method comprising:
an acoustic model generation step in which the acoustic model generation unit generates a target language new acoustic model from the target language old acoustic model or the one or more other language new acoustic models, using one or more pieces of correlation information from among the one or more pieces of first correlation information, which are information on a relationship between one or more other language old acoustic models, each being a first acoustic model of one of one or more other languages different from the target language, and the one or more other language new acoustic models, each being a second acoustic model of one of the one or more other languages, and the one or more pieces of second correlation information, which are information on a relationship between the one or more other language old acoustic models and the target language old acoustic model, which is a first acoustic model of the target language; and
an acoustic model storage step in which the acoustic model storage unit stores the target language new acoustic model generated in the acoustic model generation step in a recording medium,
wherein the acoustic model generation step comprises:
a selection sub-step of selecting one of two or more algorithms for generating the target language new acoustic model, in accordance with data of the target language old acoustic model or the one or more other language new acoustic models; and
an acoustic model generation sub-step of generating the target language new acoustic model from the target language old acoustic model, from the one or more other language new acoustic models, or from both the target language old acoustic model and the one or more other language new acoustic models, using the one or more pieces of first correlation information and the one or more pieces of second correlation information,
and wherein, in the acoustic model generation sub-step,
the target language new acoustic model is generated in accordance with the one algorithm selected in the selection sub-step, using one or more pieces of correlation information from among the one or more pieces of first correlation information and the one or more pieces of second correlation information.

A program for causing a computer, which can access a target language old acoustic model storage unit capable of storing a target language old acoustic model, an other language new acoustic model storage unit capable of storing one or more other language new acoustic models, a first correlation information storage unit capable of storing one or more pieces of first correlation information, and a second correlation information storage unit capable of storing one or more pieces of second correlation information, to function as:
an acoustic model generation unit that generates a target language new acoustic model from the target language old acoustic model or the one or more other language new acoustic models, using one or more pieces of correlation information from among the one or more pieces of first correlation information, which are information on a relationship between one or more other language old acoustic models, each being a first acoustic model of one of one or more other languages different from the target language, and the one or more other language new acoustic models, each being a second acoustic model of one of the one or more other languages, and the one or more pieces of second correlation information, which are information on a relationship between the one or more other language old acoustic models and the target language old acoustic model, which is a first acoustic model of the target language; and
an acoustic model storage unit that stores the target language new acoustic model generated by the acoustic model generation unit in a recording medium,
wherein the program causes the computer to function such that the acoustic model generation unit comprises:
selection means for selecting one of two or more algorithms for generating the target language new acoustic model, in accordance with data of the target language old acoustic model or the one or more other language new acoustic models; and
acoustic model generation means for generating the target language new acoustic model from the target language old acoustic model, from the one or more other language new acoustic models, or from both the target language old acoustic model and the one or more other language new acoustic models, using the one or more pieces of first correlation information and the one or more pieces of second correlation information,
and such that the acoustic model generation means
generates the target language new acoustic model in accordance with the one algorithm selected by the selection means, using one or more pieces of correlation information from among the one or more pieces of first correlation information and the one or more pieces of second correlation information.