JP6546070B2

JP6546070B2 - Acoustic model learning device, speech recognition device, acoustic model learning method, speech recognition method, and program

Info

Publication number: JP6546070B2
Application number: JP2015220304A
Authority: JP
Inventors: 祐太河内; 浩和政瀧
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-11-10
Filing date: 2015-11-10
Publication date: 2019-07-17
Anticipated expiration: 2035-11-10
Also published as: JP2017090660A

Description

The present invention relates to speech recognition technology, and more particularly to a technique for learning an acoustic model used to recognize non-native speech.

Compared with native speech, non-native speech exhibits acoustic properties that are not found in native speech, such as misreadings and vowel insertions, and that depend on the speaker's language experience, native language, and so on (see, for example, Non-Patent Document 1). These properties peculiar to non-native speech adversely affect the discrimination performance of the discriminator (acoustic model) that computes acoustic scores to discriminate the phonemes of the input speech. For this reason, improving accuracy has been harder for non-native speech recognition than for native speech recognition.

One technique for improving the recognition accuracy of non-native speech is GMM-HMM speech recognition for non-native speakers (see, for example, Non-Patent Document 2). In this technique, labels in which native teachers have manually rated the correctness of pronunciation are added to a non-native speech data set, the learning data is divided based on these pronunciation evaluation values, and a separate acoustic model is learned for each pronunciation level. Each model can thus specialize in the pronunciation differences that arise from language experience, which improves speech recognition accuracy.

There is also DNN-HMM speech recognition for non-native speakers, which recognizes non-native speech using a multilayer neural network acoustic model, a type of acoustic model that achieves high recognition rates in speech recognition devices in general (see, for example, Non-Patent Document 3).

Non-Patent Document 1: Tatsuya Kawahara, Nobuaki Minematsu, "Foreign Language Learning Support Using Speech Information Processing Technology", Transactions of the Institute of Electronics, Information and Communication Engineers D, vol. J96-D, no. 7, pp. 1549-1565, 2013
Non-Patent Document 2: Takuya Anzai, Seongjun Hahm (咸聖俊), Akinori Ito, "A Study on Acoustic Models Considering the Pronunciation Level of Japanese Learners of English", Proceedings of the Acoustical Society of Japan, 2011
Non-Patent Document 3: Hiroshi Kibishi, Seiichi Nakagawa, "Recognition of Japanese English Speech by DNN-HMM", Proceedings of the Acoustical Society of Japan, 2013

In non-native speech recognition that uses pronunciation evaluation values assigned manually by native teachers, the values had to be set by hand for the speech data used in acoustic model learning. This approach has two problems. First, because the pronunciation evaluation value is determined subjectively, it is not necessarily reliable, and not all utterances are necessarily evaluated by the same standard. Second, relying on native teachers incurs cost. In addition, unlike GMM-HMM speech recognition, DNN-HMM speech recognition, which uses a multilayer neural network as the acoustic model, has no effective adaptation method comparable to MLLR (Maximum Likelihood Linear Regression), so acoustic model learning must be redone. If the learning data is then divided according to the pronunciation evaluation value, a drop in the recognition rate caused by the reduced amount of learning data cannot be avoided. Consequently, in DNN-HMM speech recognition, the approach of using pronunciation evaluation values in the same way as in GMM-HMM speech recognition could not improve the recognition rate.

In view of the above, an object of the present invention is to provide a technique for learning an acoustic model that can recognize non-native speech with high accuracy and that is applicable even to DNN-HMM speech recognition.

To solve the above problems, an acoustic model learning device according to a first aspect of the present invention includes: a learning data storage unit that stores learning data in which a learning input feature quantity is associated with transcription data representing the utterance content of learning speech data, the learning input feature quantity being obtained by combining a non-native feature quantity, which represents the non-nativeness of the speaker and is extracted from the learning speech data, with an acoustic feature quantity extracted from the learning speech data; and an acoustic model learning unit that learns an acoustic model using the learning data.

A speech recognition device according to a second aspect of the present invention includes: an acoustic model storage unit that stores the acoustic model generated by the acoustic model learning device; a non-nativeness extraction unit that extracts, from input speech data, a non-native feature quantity representing the non-nativeness of the speaker; an acoustic feature quantity extraction unit that extracts an acoustic feature quantity from the input speech data; and a speech recognition unit that inputs a recognition input feature quantity, obtained by combining the non-native feature quantity with the acoustic feature quantity, into the acoustic model to obtain a speech recognition result for the input speech data.

The acoustic model learning technique of the present invention extracts a non-native feature quantity that expresses non-nativeness with high objectivity, without relying on native teachers who have linguistic expertise, and learns an acoustic model from learning data in which that feature quantity is combined with an acoustic feature quantity. As a result, non-native speech can be recognized with high accuracy even in DNN-HMM speech recognition, where the speech recognition rate could not previously be improved by using pronunciation evaluation values.

FIG. 1 is a diagram illustrating the functional configuration of the learning data creation device.
FIG. 2 is a diagram illustrating the processing procedure of the learning data creation method.
FIG. 3 is a diagram showing a specific example of learning utterance data.
FIG. 4 is a diagram showing a specific example of learning data.
FIG. 5 is a diagram illustrating the functional configuration of the acoustic model learning device.
FIG. 6 is a diagram illustrating the processing procedure of the acoustic model learning method.
FIG. 7 is a diagram illustrating the functional configuration of the speech recognition device.
FIG. 8 is a diagram illustrating the processing procedure of the speech recognition method.

Hereinafter, embodiments of the present invention will be described in detail. In the drawings, components having the same function are denoted by the same reference numerals, and redundant description is omitted.

The embodiment of the present invention is a speech recognition system composed of the following three devices. The first device is a learning data creation device that adds a non-native feature quantity extracted from learning speech data to an acoustic feature quantity to generate learning data used for acoustic model learning. The second device is an acoustic model learning device that learns an acoustic model using that learning data. The third device is a speech recognition device that adds a non-native feature quantity extracted from the input speech data to be recognized to an acoustic feature quantity and performs speech recognition using the learned acoustic model.

These devices do not necessarily have to be three separate units; the device configuration can be changed freely by changing which device hosts each component. For example, the acoustic model learning device may be configured to include each unit of the learning data creation device, giving a single acoustic model learning device that executes everything from creation of the learning data to learning of the acoustic model. Likewise, the speech recognition device may be configured to include each unit of the learning data creation device and each unit of the acoustic model learning device, giving a single speech recognition device that executes everything from creation of the learning data to speech recognition.

Each of the learning data creation device, the acoustic model learning device, and the speech recognition device of the embodiment is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM: Random Access Memory), and the like. Each device executes each process, for example, under the control of the central processing unit. Data input to each device and data obtained in each process are stored, for example, in the main storage device, and the data stored in the main storage device is read out as needed and used in other processes. At least some of the processing units of each device may be implemented in hardware such as an integrated circuit. Each storage unit of each device can be implemented, for example, by a main storage device such as a RAM, by an auxiliary storage device composed of a hard disk, an optical disk, or a semiconductor memory element such as a flash memory, or by middleware such as a relational database or a key-value store.

As shown in FIG. 1, the learning data creation device of the embodiment includes a learning speech storage unit 10, a non-nativeness extraction unit 11, an acoustic feature quantity extraction unit 12, a learning data generation unit 13, and a learning data storage unit 14. The learning speech storage unit 10 and the learning data storage unit 14 do not necessarily have to be provided in the learning data creation device itself; the device may instead be configured to read from and write to a learning speech storage unit 10 and a learning data storage unit 14 provided in another, external device via communication means such as a network. The learning data creation method of the embodiment is realized by this learning data creation device performing the processing of each step shown in FIG. 2.

The learning speech storage unit 10 stores learning utterance data used to learn the acoustic model. As shown in FIG. 3, each item of learning utterance data associates an "identification number" that uniquely identifies the item, "speech data" giving the path to an audio file in which a non-native speaker's utterance is recorded, and "transcription data" in which the utterance content of the speech data is transcribed.
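The record layout described for FIG. 3 can be sketched as a small data structure. This is only an illustrative sketch: the class name, field names, file paths, and example texts below are hypothetical and do not come from the specification, which only names the three associated items.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LearningUtterance:
    """One item of learning utterance data (the three fields described for FIG. 3)."""
    identification_number: int  # uniquely identifies the item
    speech_data: str            # path to the recorded audio file
    transcription_data: str     # transcribed utterance content


# Hypothetical example records; paths and texts are made up for illustration.
learning_speech_store = {
    u.identification_number: u
    for u in [
        LearningUtterance(1, "audio/utt0001.wav", "good morning"),
        LearningUtterance(2, "audio/utt0002.wav", "thank you very much"),
    ]
}
```

Keying the store by identification number mirrors the way the later steps pair each extracted feature quantity with its identification number.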

In step S11, the non-nativeness extraction unit 11 extracts, from the speech data of the learning utterance data, a non-native feature quantity that expresses the non-nativeness of the speaker. The extracted non-native feature quantity is paired with the identification number of the learning utterance data and input to the learning data generation unit 13.

The non-native feature quantity is a quantity, expressed as a continuous or discrete value or vector, that directly or indirectly reflects information peculiar to a non-native speaker, such as the speaker's language experience, correctness of pronunciation, type of native language, and region of origin. The non-nativeness extraction unit may be, for example, a discriminator, a neural network, or a machine learning device trained in advance to distinguish between, or to evaluate, the speech of native speakers and the speech of non-native speakers. In that case, the intermediate processing result or the output obtained when an utterance is input to a machine learning device, such as a multilayer neural network or an SVM (Support Vector Machine) that performs discrimination, regression, or auto-encoding, may be used as the non-native feature quantity. As the intermediate processing result, for example, the output values of an intermediate layer other than the final output layer of a multilayer neural network may be used. Learning of the discriminator or the like may use speech data of native and non-native utterances, information about the non-native speakers, and information such as the words and phonemes of the utterances. The learning algorithm may be either supervised or unsupervised.

As a specific example, a trained language discrimination model may be used as the non-nativeness extraction unit, and the score of the language discrimination result may be output as the non-native feature quantity. The score of the language discrimination result is, for example, a set of score values indicating the likeness to each language. Another example of the score is an evaluation value from 0 to 1 that approaches 0 the closer the speech is to a first language and approaches 1 the closer it is to a second language. Alternatively, the non-nativeness extraction unit may have an acoustic model for native speakers (that is, an acoustic model learned from native speech), and the score obtained by evaluating the input speech data with this acoustic model may be used as the non-native feature quantity. As yet another example, the non-nativeness extraction unit may have a model for native speech recognition (that is, speech recognition whose recognition target is native speech), and the recognition confidence obtained when the input speech data is recognized with this model may be used as the non-native feature quantity.
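The two kinds of language discrimination score mentioned above can be illustrated with a minimal sketch. It assumes that some language discrimination model has already produced one log score per language; the softmax normalization, the function names, and the language labels are assumptions made for illustration, not details given in the specification.

```python
import math


def language_likeness(log_scores):
    """Normalize per-language log scores into likeness values that sum to 1 (softmax)."""
    top = max(log_scores.values())
    exps = {lang: math.exp(score - top) for lang, score in log_scores.items()}
    total = sum(exps.values())
    return {lang: value / total for lang, value in exps.items()}


def two_language_score(log_first, log_second):
    """Return a value in [0, 1]: near 0 when closer to the first language,
    near 1 when closer to the second language."""
    return language_likeness({"first": log_first, "second": log_second})["second"]
```

Either the full vector of likeness values or the single 0-to-1 value could then serve as the non-native feature quantity in this sketch.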

In step S12, the acoustic feature quantity extraction unit 12 extracts an acoustic feature quantity from the speech data of the learning utterance data. The acoustic feature quantity may be any acoustic feature quantity used in acoustic model learning for speech recognition, or a transformation of one: for example, mel-frequency cepstral coefficients, such coefficients after a transformation such as normalization, or a combination of several temporally adjacent feature quantities. The extracted acoustic feature quantity is paired with the identification number of the learning utterance data and input to the learning data generation unit 13.

In step S13, the learning data generation unit 13 combines the non-native feature quantity output by the non-nativeness extraction unit 11 with the acoustic feature quantity output by the acoustic feature quantity extraction unit 12, matching them by the identification number paired with each feature quantity, to generate a learning input feature quantity. Combining here means appending one feature quantity after the other. The order of the two feature quantities is determined in advance. For example, when the acoustic feature quantity "xxx" and the non-native feature quantity "yyy" are extracted, the learning input feature quantity is "xxxyyy", obtained by joining "xxx" and "yyy" in that order as they are. The learning data generation unit 13 then generates learning data by associating, as shown in FIG. 4, an "identification number" that uniquely identifies each item, the generated "learning input feature quantity", and the "transcription data" of the learning utterance data. The generated learning data is stored in the learning data storage unit 14.
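The combining of step S13 can be sketched as follows. The sketch assumes that the acoustic feature quantity is a list of per-frame vectors and that the non-native feature quantity is a single vector per utterance that is appended to every frame; the specification does not fix this granularity, and all function names and values are hypothetical.

```python
def combine_features(acoustic_frames, non_native_feature):
    """Append the non-native feature after the acoustic feature of each frame.
    The order (acoustic first, non-native second) is fixed in advance."""
    return [list(frame) + list(non_native_feature) for frame in acoustic_frames]


def generate_learning_data(acoustic_by_id, non_native_by_id, transcription_by_id):
    """Join the two extractor outputs and the transcriptions on the identification number."""
    learning_data = []
    for ident in sorted(acoustic_by_id):
        if ident not in non_native_by_id or ident not in transcription_by_id:
            continue  # skip items that lack a matching counterpart
        learning_data.append({
            "identification_number": ident,
            "learning_input_feature": combine_features(
                acoustic_by_id[ident], non_native_by_id[ident]),
            "transcription_data": transcription_by_id[ident],
        })
    return learning_data


# Toy values, made up for illustration.
acoustic = {1: [[0.1, 0.2], [0.3, 0.4]], 2: [[0.5, 0.6]]}
non_native = {1: [0.9], 2: [0.2]}
transcripts = {1: "good morning", 2: "thank you"}
learning_data = generate_learning_data(acoustic, non_native, transcripts)
```

Each resulting record carries the three items described for FIG. 4.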

The above embodiment describes an example in which the two feature quantities are combined to form the learning data for acoustic model learning. However, as long as the condition that the acoustic feature quantity is included in the learning data is satisfied, the process of obtaining the learning data from the two feature quantities is not limited to this. For example, a value obtained by inputting the two feature quantities into a predetermined function may be added after (or before) the acoustic feature quantity. The predetermined function may, for example, perform normalization or combine several temporally adjacent feature quantities, or it may input the feature quantities into another machine learning device trained in advance and use the intermediate processing result or output of that device as the output of the function. Alternatively, the learning data for acoustic model learning may be obtained by first combining the acoustic feature quantity with the non-native feature quantity and then applying processing such as normalization or combining multiple frames.
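As one possible reading of the last variation, the sketch below applies normalization and multi-frame combining to frames that have already been combined. Per-dimension mean and variance normalization and symmetric context splicing with edge repetition are choices made here for illustration; the specification only mentions "normalization" and "combining multiple frames" in general terms.

```python
def mean_variance_normalize(frames):
    """Normalize each dimension to zero mean and unit variance over the utterance."""
    count, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / count for d in range(dim)]
    # A constant dimension has zero deviation; divide by 1.0 instead to avoid 0/0.
    stds = [
        (sum((f[d] - means[d]) ** 2 for f in frames) / count) ** 0.5 or 1.0
        for d in range(dim)
    ]
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in frames]


def splice_frames(frames, context):
    """Concatenate each frame with `context` frames on both sides,
    repeating the first or last frame at the utterance edges."""
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window = []
        for offset in range(-context, context + 1):
            window.extend(frames[min(max(t + offset, 0), last)])
        spliced.append(window)
    return spliced
```

For example, `splice_frames(mean_variance_normalize(combined), 1)` would triple the dimensionality of each combined frame.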

The above embodiment describes an example in which identification numbers are assigned to associate the feature quantities, the speech data, and the transcription data. Instead of associating each item of data with an identification number, the embodiment may be modified so that the same speech data is input to both the non-nativeness extraction unit and the acoustic feature quantity extraction unit, and the transcription data corresponding to that speech data is associated with the resulting feature quantities. The learning data can then be generated without using identification number information.

In the above embodiment, the transcription data is acquired directly as the equivalent of the teacher data used in acoustic model learning, but it may be converted in advance into a different symbol format, such as "symbols corresponding to phonemes". For example, it may be converted into symbols that express readings or sounds, such as hiragana, katakana, phonemes, monophones, triphones, clustered triphones, or state numbers, or into numbers corresponding to them. The conversion may be performed by a person, or by using a separate speech recognition decoder, acoustic model, or the like. For example, the conversion may use the forced alignment processing conventionally used in the field of DNN speech recognition.
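A dictionary lookup is one simple way to picture the conversion of transcription data into symbols and corresponding numbers. The sketch below is only that: the lexicon entries and phoneme labels are hypothetical, and it does not perform forced alignment, which would additionally assign each symbol to specific frames.

```python
# Hypothetical pronunciation lexicon; entries are made up for illustration.
LEXICON = {
    "good": ["g", "uh", "d"],
    "morning": ["m", "ao", "r", "n", "ih", "ng"],
}


def build_symbol_table(lexicon):
    """Assign a number to every phoneme symbol that appears in the lexicon."""
    symbols = sorted({p for pronunciation in lexicon.values() for p in pronunciation})
    return {symbol: number for number, symbol in enumerate(symbols)}


def to_phonemes(transcription, lexicon):
    """Convert a word-level transcription into a phoneme symbol sequence.
    Raises KeyError for a word that is not in the lexicon."""
    phonemes = []
    for word in transcription.lower().split():
        phonemes.extend(lexicon[word])
    return phonemes


def to_symbol_numbers(phonemes, symbol_table):
    """Replace each phoneme symbol with its corresponding number."""
    return [symbol_table[p] for p in phonemes]
```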

As shown in FIG. 5, the acoustic model learning device of the embodiment includes a learning data storage unit 14, an acoustic model learning unit 15, and an acoustic model storage unit 16. The learning data storage unit 14 and the acoustic model storage unit 16 do not necessarily have to be provided in the acoustic model learning device itself; the device may instead be configured to read from and write to a learning data storage unit 14 and an acoustic model storage unit 16 provided in another device via communication means such as a network. The acoustic model learning method of the embodiment is realized by this acoustic model learning device performing the processing of each step shown in FIG. 6.

The learning data storage unit 14 stores the learning data generated by the learning data creation device. As described above, the learning data associates an identification number that uniquely identifies each item, a learning input feature quantity obtained by combining the non-native feature quantity and the acoustic feature quantity extracted from the speech data of the learning utterance data, and transcription data in which the utterance content of the speech data is transcribed.

In step S15, the acoustic model learning unit 15 acquires the learning input feature quantity and the corresponding transcription data from the learning data stored in the learning data storage unit 14, and uses them to learn the acoustic model parameters used for speech recognition. One example of a model expressed by acoustic model parameters is a model that converts a waveform into symbols corresponding to phonemes. As the "symbols corresponding to phonemes", it is possible to use, for example, triphones clustered using a different acoustic model created in advance, or state numbers representing them.
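To make the learning step concrete, the following sketch trains a single-layer softmax classifier on combined input frames. It is a deliberately small stand-in and not the multilayer neural network of the DNN-HMM system; it also assumes that each frame already has a symbol number as its label (for example, from a prior alignment). Function names, hyperparameters, and toy data are all hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities in a numerically stable way."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frame_scores(weights, biases, x):
    """Linear score of input vector x for every symbol."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def train_frame_classifier(inputs, labels, num_symbols, epochs=200, learning_rate=0.5):
    """Stochastic gradient descent on the cross-entropy loss, starting from zeros."""
    dim = len(inputs[0])
    weights = [[0.0] * dim for _ in range(num_symbols)]
    biases = [0.0] * num_symbols
    for _ in range(epochs):
        for x, label in zip(inputs, labels):
            probs = softmax(frame_scores(weights, biases, x))
            for k in range(num_symbols):
                grad = probs[k] - (1.0 if k == label else 0.0)
                for d in range(dim):
                    weights[k][d] -= learning_rate * grad * x[d]
                biases[k] -= learning_rate * grad
    return weights, biases


def predict_symbol(weights, biases, x):
    """Return the number of the highest-scoring symbol for input vector x."""
    scores = frame_scores(weights, biases, x)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy frames: two acoustic dimensions followed by one non-native dimension.
toy_inputs = [[0.0, 0.0, 0.2], [0.1, 0.0, 0.2], [1.0, 1.0, 0.2], [0.9, 1.0, 0.2]]
toy_labels = [0, 0, 1, 1]
toy_weights, toy_biases = train_frame_classifier(toy_inputs, toy_labels, num_symbols=2)
```

The point of the sketch is only that the non-native dimensions enter the model as ordinary input dimensions, so one model is learned from all of the data without dividing it by pronunciation level.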

As shown in FIG. 7, the speech recognition device of the embodiment includes an acoustic model storage unit 16, a language model storage unit 20, a non-nativeness extraction unit 11, an acoustic feature quantity extraction unit 12, a feature quantity combining unit 21, and a speech recognition unit 22. The speech recognition method of the embodiment is realized by this speech recognition device performing the processing of each step shown in FIG. 8.

The acoustic model storage unit 16 stores an acoustic model having the acoustic model parameters generated by the acoustic model learning device. The language model storage unit 20 stores a language model used for speech recognition.

In step S11, the non-nativeness extraction unit 11 extracts, from the input speech data, a non-native feature quantity that expresses the non-nativeness of the speaker. The input speech data is the speech data to be recognized, in which an utterance by a native or non-native speaker is recorded. The non-native feature quantity extracted here is of the same kind as the non-native feature quantity extracted by the learning data creation device. The extracted non-native feature quantity is input to the feature quantity combining unit 21.

In step S12, the acoustic feature quantity extraction unit 12 extracts an acoustic feature quantity from the input speech data. The acoustic feature quantity extracted here is of the same kind as the acoustic feature quantity extracted by the learning data creation device. The extracted acoustic feature quantity is input to the feature quantity combining unit 21.

In step S21, the feature quantity combining unit 21 combines the non-native feature quantity output by the non-nativeness extraction unit 11 with the acoustic feature quantity output by the acoustic feature quantity extraction unit 12, in the same order in which the learning data creation device combined the feature quantities, to generate a recognition input feature quantity. The generated recognition input feature quantity is input to the speech recognition unit 22.

In step S22, the speech recognition unit 22 uses the acoustic model stored in the acoustic model storage unit 16 to output time-series data of "symbols corresponding to phonemes" from the input recognition input feature quantity. When the speech recognition unit has a language model that outputs a speech recognition result (for example, text) from time-series data of "symbols corresponding to phonemes", the output of the acoustic model is input to the language model and the speech recognition result is output.
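Steps S21 and S22 can be sketched together as follows. The acoustic model is represented by a hypothetical callable that maps one combined input vector to a symbol number, and the language model stage is omitted; the threshold model and symbol labels are made up for illustration.

```python
def recognize_symbols(acoustic_frames, non_native_feature, acoustic_model, number_to_symbol):
    """Combine the features in the same order used at learning time (acoustic first,
    non-native second) and return the time series of symbols from the acoustic model."""
    symbols = []
    for frame in acoustic_frames:
        recognition_input = list(frame) + list(non_native_feature)
        symbols.append(number_to_symbol[acoustic_model(recognition_input)])
    return symbols


def toy_acoustic_model(x):
    """Stand-in for a learned acoustic model: thresholds the first dimension."""
    return 0 if x[0] < 0.5 else 1


symbol_series = recognize_symbols(
    [[0.1, 0.0], [0.2, 0.1], [0.9, 1.0]],  # acoustic feature per frame
    [0.7],                                  # non-native feature for the utterance
    toy_acoustic_model,
    {0: "a", 1: "i"},
)
```

If the combining order at recognition time differed from the order used when the learning data was created, the model would receive its input dimensions in the wrong positions, which is why the order is fixed in advance.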

As the non-native feature quantity used in learning, the correct-answer labels used to train the non-nativeness extraction unit may be used directly instead of the output of the non-nativeness extraction unit. The correct-answer labels may be, for example, information about the non-native speaker, such as the speaker's language experience, correctness of pronunciation, type of native language, and region of origin. At speech recognition time, a non-native feature quantity estimated from the input speech data may be used.

The above embodiment describes an example in which the learning data creation device and the speech recognition device each include the non-nativeness extraction unit. Alternatively, a non-native feature quantity extraction device may exist as an external device separate from the learning data creation device and the speech recognition device. This device may take the identification number and the speech data from the learning utterance data and provide the identification number and the non-native feature quantity to the learning data creation device and the speech recognition device.

In the above embodiment, only utterance data from non-native speakers is used as the learning utterance data, but utterance data from native speakers may also be included in the learning utterance data. Specifically, the native utterance data is input to the non-nativeness extraction unit, a non-native feature quantity for the native utterance is calculated, and it is combined with the acoustic feature quantity to generate learning data. The acoustic model is then learned using that learning data.

As described above, the acoustic model learning technology of the present invention extracts a non-native feature quantity that expresses non-nativeness with high objectivity, without manual work by a native teacher with linguistic expertise, and learns an acoustic model from learning data in which that feature quantity is combined with the acoustic feature quantity. The speech recognition technology of the present invention combines the non-native feature quantity extracted from the speech data to be recognized with the acoustic feature quantity and performs speech recognition using the trained acoustic model. With this configuration, non-native speech can be recognized with high accuracy even in DNN-HMM speech recognition, where pronunciation ratings could not previously be used to improve the recognition rate.

The present invention is not limited to the embodiments described above, and it goes without saying that modifications may be made as appropriate without departing from the spirit of the invention. The various processes described in the above embodiments need not be executed only in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device executing the processes or as needed.

[Program, recording medium]
When the various processing functions of each device described in the above embodiments are implemented by a computer, the processing content of the functions that each device should have is described by a program. By executing this program on a computer, the various processing functions of each device are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

This program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing a process, the computer reads the program stored in its own recording medium and executes the process according to the read program. As another execution form, the computer may read the program directly from the portable recording medium and execute the process according to it, or may execute the process according to the received program each time a program is transferred to it from the server computer. The above-described processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are realized only through execution instructions and result acquisition. Note that the program in this embodiment includes information that is provided for processing by a computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the apparatus is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be realized in hardware.

DESCRIPTION OF SYMBOLS
10 Learning speech storage unit
11 Non-native extraction unit
12 Acoustic feature quantity extraction unit
13 Learning data generation unit
14 Learning data storage unit
15 Acoustic model learning unit
16 Acoustic model storage unit
20 Language model storage unit
21 Feature quantity combination unit
22 Speech recognition unit

Claims

An acoustic model learning device comprising:
a learning data storage unit that stores learning data in which a learning input feature amount, obtained by combining a non-native feature amount representing the non-nativeness of a speaker extracted from learning voice data with an acoustic feature amount extracted from the learning voice data, is associated with transcription data representing the utterance content of the learning voice data; and
an acoustic model learning unit that learns an acoustic model using the learning data,
wherein the non-native feature amount is any one of: an intermediate processing result of language discrimination by a language discrimination model or a score of its language discrimination result; an intermediate processing result of evaluation by a native acoustic model or a score of its evaluation result; or a confidence of a recognition result by native speech recognition.

The acoustic model learning device according to claim 1, further comprising:
a non-native extraction unit that extracts the non-native feature amount from the learning voice data;
an acoustic feature amount extraction unit that extracts the acoustic feature amount from the learning voice data; and
a learning data generation unit that combines the non-native feature amount and the acoustic feature amount to generate the learning input feature amount, and associates the learning input feature amount with the transcription data to generate the learning data.

A speech recognition device comprising:
an acoustic model storage unit that stores an acoustic model generated by the acoustic model learning device according to claim 1 or 2;
a non-native extraction unit that extracts the non-native feature amount from input speech data;
an acoustic feature amount extraction unit that extracts an acoustic feature amount from the input speech data; and
a speech recognition unit that inputs, into the acoustic model, a recognition input feature amount combining the non-native feature amount and the acoustic feature amount, and obtains a speech recognition result for the input speech data.

An acoustic model learning method, wherein
a learning data storage unit stores learning data in which a learning input feature amount, obtained by combining a non-native feature amount representing the non-nativeness of a speaker extracted from learning voice data with an acoustic feature amount extracted from the learning voice data, is associated with transcription data representing the utterance content of the learning voice data,
the method comprising an acoustic model learning step in which an acoustic model learning unit learns an acoustic model using the learning data,
wherein the non-native feature amount is any one of: an intermediate processing result of language discrimination by a language discrimination model or a score of its language discrimination result; an intermediate processing result of evaluation by a native acoustic model or a score of its evaluation result; or a confidence of a recognition result by native speech recognition.

A speech recognition method, wherein
an acoustic model storage unit stores an acoustic model generated by the acoustic model learning method according to claim 4,
the method comprising:
a non-native extraction step in which a non-native extraction unit extracts the non-native feature amount from input speech data;
an acoustic feature amount extraction step of extracting an acoustic feature amount from the input speech data; and
a speech recognition step in which a speech recognition unit inputs, into the acoustic model, a recognition input feature amount obtained by combining the non-native feature amount and the acoustic feature amount, and obtains a speech recognition result for the input speech data.

A program for causing a computer to function as each unit of the acoustic model learning device according to claim 1 or 2, or as each unit of the speech recognition device according to claim 3.