JP2010204175A - Voice learning device and program - Google Patents

Voice learning device and program

Info

Publication number
JP2010204175A
JP2010204175A (application JP2009046762A)
Authority
JP
Japan
Prior art keywords
speech
learning
environment
identifier
utterance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2009046762A
Other languages
Japanese (ja)
Other versions
JP4972660B2 (en)
Inventor
Shoei Sato (庄衛 佐藤)
Toru Imai (亨 今井)
Current Assignee
Japan Broadcasting Corp
Original Assignee
Nippon Hoso Kyokai NHK
Japan Broadcasting Corp
Priority date
Filing date
Publication date
Application filed by Nippon Hoso Kyokai (NHK, Japan Broadcasting Corp)
Priority to JP2009046762A
Publication of JP2010204175A
Application granted
Publication of JP4972660B2
Expired - Fee Related
Anticipated expiration

Abstract

PROBLEM TO BE SOLVED: To provide a speech learning device that learns the acoustic models used for speech recognition.

SOLUTION: The speech learning device includes: an identifier-attached acoustic model generation section 24 that merges the environment-dependent acoustic models of a plurality of utterance environments, with an utterance-environment identifier attached to each phoneme label to identify the environment, into a single identifier-attached acoustic model; an utterance-environment-parallel speech recognition section 26 that recognizes the learning speech in parallel using the identifier-attached environment-dependent acoustic models and generates a recognition result; an identifier-attached transcription section 28 that creates a transcription of the recognition result with the utterance-environment identifiers attached; and an acoustic model discriminative learning section 29 that discriminatively trains the identifier-attached acoustic model using the learning speech and the identifier-attached transcription.

COPYRIGHT: (C) 2010, JPO & INPIT

Description

The present invention relates to speech recognition that uses, as an acoustic model, statistics of the acoustic features of each phoneme. In particular, it relates to a speech learning device and program that improve the training of multiple acoustic models so as to raise recognition accuracy when several utterance environments are mixed within a single speech segment, as in conversational speech.

Speech recognition with statistical acoustic models requires training data from which the statistics of the features appearing in each phoneme are estimated. The training data consist of pairs of speech and its corresponding transcription. Training a separate acoustic model for each utterance environment, such as for individual speakers or background sounds, is expected to improve recognition accuracy. Below, such a per-environment acoustic model is called an utterance-environment-dependent acoustic model.
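As a purely illustrative sketch of "statistics of the features appearing in each phoneme" (real systems estimate HMM-GMM or neural-network statistics over multidimensional feature vectors such as MFCCs, not a single scalar), the idea can be pictured as a per-phoneme mean and variance over aligned feature values:

```python
from statistics import mean, pvariance

def phoneme_stats(aligned):
    """aligned: list of (phoneme_label, feature_value) pairs obtained from a
    forced alignment. Returns per-phoneme (mean, variance) statistics."""
    by_phoneme = {}
    for phoneme, value in aligned:
        by_phoneme.setdefault(phoneme, []).append(value)
    return {ph: (mean(xs), pvariance(xs)) for ph, xs in by_phoneme.items()}

data = [("a", 1.0), ("a", 3.0), ("m", 2.0)]
print(phoneme_stats(data))  # → {'a': (2.0, 1.0), 'm': (2.0, 0.0)}
```

Training a separate such model per utterance environment would simply mean computing these statistics over each environment's subset of the aligned data.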

When training on or recognizing conversational speech, however, the recording often lacks adequate silent intervals between utterances, so it is difficult to split the audio by utterance environment, e.g. by gender or by speaker. For such speech, gender-parallel speech recognition is known, which uses gender-dependent acoustic models to improve recognition accuracy on audio in which male and female speakers are mixed (see, for example, Patent Document 1 and Non-Patent Document 1).

The statistics of an acoustic model are generally trained under the maximum-likelihood (ML) criterion, which maximizes the likelihood of the training speech; more recently, training based on the minimum phone error (MPE) criterion has been proposed (see, for example, Non-Patent Document 2).
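For reference, a standard formulation of the MPE criterion from the literature (not quoted from this patent) maximizes the expected phoneme accuracy of the recognition hypotheses over the training set:

```latex
\mathcal{F}_{\mathrm{MPE}}(\lambda)
  = \sum_{r=1}^{R}
    \frac{\sum_{s} p_\lambda(\mathcal{O}_r \mid s)^{\kappa}\, P(s)\, A(s, s_r)}
         {\sum_{s'} p_\lambda(\mathcal{O}_r \mid s')^{\kappa}\, P(s')}
```

Here \(\mathcal{O}_r\) is the r-th training utterance, \(s\) ranges over recognition hypotheses, \(A(s, s_r)\) is the raw phoneme accuracy of hypothesis \(s\) against the reference transcription \(s_r\), and \(\kappa\) is an acoustic probability scale. ML training instead maximizes only \(p_\lambda(\mathcal{O}_r \mid s_r)\).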

MPE-based training has also been confirmed to improve accuracy in the gender-parallel speech recognition described above. MPE training requires, in addition to the speech and transcription conventionally used for acoustic model training, a recognition result against which the recognition errors on that speech can be evaluated.

This recognition result should ideally contain as many plausible recognition errors as possible. To evaluate recognition errors efficiently, an algorithm has therefore been proposed that performs MPE training on the phoneme lattice obtained during hypothesis search (see, for example, Non-Patent Document 3).

To train an utterance-environment (speaker) dependent acoustic model, a technique that adapts a speaker-independent model trained on a large amount of data is effective (see, for example, Non-Patent Document 3). That work also proposes an adaptation technique that linearly transforms the acoustic model parameters under the MPE criterion; this linear transformation is called DLT (Discriminative Linear Transforms).

Patent Document 1: JP 2007-233149 A

Non-Patent Document 1: Imai et al., Proceedings of the Spring Meeting of the Acoustical Society of Japan, 1-1-16, 2008
Non-Patent Document 2: D. Povey, "Minimum phone error and I-smoothing for improved discriminative training", in Proc. ICASSP, 2002
Non-Patent Document 3: L. Wang et al., Computer Speech and Language, 22, 2008, pp. 256-272

With the methods described above, it is desirable to sort speech segments by utterance environment and train each acoustic model separately. In conversational and similar audio, however, speech from several utterance environments is mixed within a single segment, so a technique is needed for efficiently training multiple utterance-environment-dependent acoustic models simultaneously from such mixed segments.

In gender-parallel speech recognition, identification of the utterance environment, i.e. the speaker's gender, proceeds simultaneously with the search for recognition hypotheses. Misidentifying the gender can then cause word recognition errors. Gender-parallel speech recognition therefore needs a discriminative training criterion that accounts for speaker identification errors.

An object of the present invention is therefore to provide a speech learning device and program that improve the training of multiple acoustic models adapted to each utterance environment, and thereby raise recognition accuracy when searching multiple acoustic models in parallel.

The speech learning device of the present invention attaches an utterance-environment identifier to the phoneme labels of both the acoustic models and the training speech, and simultaneously trains multiple acoustic models, each adapted to one utterance environment, from automatically segmented speech in which several utterance environments are mixed. In other words, the multiple utterance-environment-dependent acoustic models are identified and trained automatically, without training each one separately.

Specifically, the speech learning device of the present invention is a device that trains the acoustic models used in speech recognition, comprising: an identifier-attached acoustic model generation section that merges the environment-dependent acoustic models of a plurality of utterance environments, with an utterance-environment identifier attached to each phoneme label to identify each environment, into a single identifier-attached acoustic model; an utterance-environment-parallel speech recognition section that uses the identifier-attached environment-dependent acoustic models to recognize, in parallel, training speech in which the utterance environments are mixed, and generates a recognition result; an identifier-attached transcription section that creates a transcription of the generated recognition result with the utterance-environment identifiers attached; and an acoustic model discriminative learning section that discriminatively trains the identifier-attached acoustic model using the training speech and the identifier-attached transcription, learning multiple identifier-attached acoustic models, each adapted to one utterance environment, from training speech in which utterances from those environments are mixed.

In the speech learning device of the present invention, the utterance-environment-parallel speech recognition section uses the identifier-attached environment-dependent acoustic models to recognize the mixed-environment training speech in parallel, automatically attaches identifiers to the recognition result, and thereby generates a recognition result with utterance-environment identifiers.

The speech learning device of the present invention may further comprise a post-training environment-dependent acoustic model generation section that removes the utterance-environment identifiers from the trained identifier-attached acoustic model of each utterance environment generated by the acoustic model discriminative learning section, and generates the trained environment-dependent acoustic models.

In the speech learning device of the present invention, the acoustic model discriminative learning section trains the multiple acoustic models using utterance-environment identifiers per gender or per speaker as the utterance environment.

In the speech learning device of the present invention, the utterance-environment-parallel speech recognition section attaches the utterance-environment identifiers to the hypothesis lattice produced during recognition, generating an identifier-attached hypothesis lattice, and the acoustic model discriminative learning section obtains this lattice from the recognition section and uses it, together with the training speech and the identifier-attached transcription, to learn the multiple identifier-attached acoustic models, each adapted to one utterance environment, from the mixed-environment training speech.

According to the present invention, the utterance-environment-dependent acoustic models can be trained collectively, without training each one individually, while improving the accuracy of the resulting trained models. This also makes it easier to extend the range of utterance-environment-dependent acoustic models, and, even for recognition of speech in which multiple speakers are mixed, reduces recognition errors caused by misidentifying the utterance environment, so that speakers are identified and their speech recognized accurately.

FIG. 1 shows a speech learning device according to an embodiment of the present invention. FIG. 2 is a flowchart showing the operation of the speech learning device of the embodiment. FIG. 3 shows examples of phoneme labels used in the speech learning device of the embodiment. FIG. 4 is a schematic diagram of a speech recognition device that recognizes unknown speech using acoustic models trained by the speech learning device of the embodiment. FIG. 5 outlines gender-parallel speech recognition, which is effective when, as in conversational audio, the speech of several speakers is mixed within one utterance interval. FIG. 6 shows an example of a lattice actually obtained from gender-parallel speech recognition. FIG. 7 shows (a) training phoneme labels with utterance environments attached, used for MPE discriminative training, and (b) an example of a hypothesis lattice used for MPE discriminative training.

An embodiment of the speech learning device of the present invention is described below.

[Device configuration]
FIG. 1 shows a speech learning device according to an embodiment of the present invention. The speech learning device 1 of this embodiment searches in parallel using multiple utterance-environment-dependent acoustic models, as in gender-parallel speech recognition.

The speech learning device 1 of this embodiment comprises a control unit 2 and a storage unit 3. The control unit 2 comprises an acoustic model input section 21, a speech input section 22, a user interface section 23, an identifier-attached acoustic model generation section 24, an utterance-environment-parallel speech recognition section 26, a recognition error correction section 27, an identifier-attached transcription section 28, an acoustic model discriminative learning section 29, and a post-training environment-dependent acoustic model generation section 30. The storage unit 3 stores and retrieves the data needed for processing under the control of the control unit 2, and the control unit 2, which may be implemented as a central processing unit (CPU), realizes each function by executing a control program stored, for example, in the storage unit 3.

The acoustic model input section 21 takes multiple prepared environment-dependent acoustic models, with an identifier attached to each phoneme label to distinguish the utterance environments (for example, a male voice, a female voice, and another male voice), and supplies each environment-dependent acoustic model to the utterance-environment-parallel speech recognition section 26 and to the identifier-attached acoustic model generation section 24. For example, the acoustic model of a male voice can be a first environment-dependent acoustic model, that of a female voice a second, and that of another male voice a third; additional environment-dependent acoustic models, such as for background sounds, can also be handled.

Corresponding to the input of the environment-dependent acoustic models, the speech input section 22 takes training speech in which multiple utterance environments are mixed and supplies it to the utterance-environment-parallel speech recognition section 26 and also to the acoustic model discriminative learning section 29 described later.

The user interface section 23 is an interface that supplies correction information to the recognition error correction section 27 so that the identification results of the utterance-environment-parallel speech recognition section 26 can be corrected as desired.

The identifier-attached acoustic model generation section 24 merges the environment-dependent acoustic models supplied from the acoustic model input section 21, with an identifier attached to each phoneme label to distinguish the utterance environments (for example, the male voice, female voice, and second male voice mentioned above), into a single identifier-attached acoustic model (see FIG. 3, described later). The environment-dependent acoustic models may be merged in any order.
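As a minimal sketch of this merging step, under the assumption that a model can be represented as a mapping from phoneme labels to their statistics (the dict representation and function name are illustrative, not from the patent), and using the "M_"/"F_" prefixes from the description:

```python
def merge_with_identifiers(env_models):
    """Merge per-environment models into one model whose phoneme labels carry
    an utterance-environment identifier prefix (e.g. "M_m+a")."""
    merged = {}
    for env_id, model in env_models.items():  # merge order is arbitrary
        for label, stats in model.items():
            merged[f"{env_id}_{label}"] = stats
    return merged

male = {"m+a": {"mean": 0.1}, "m-a+s": {"mean": 0.2}}
female = {"m+a": {"mean": 0.3}}
print(sorted(merge_with_identifiers({"M": male, "F": female})))
# → ['F_m+a', 'M_m+a', 'M_m-a+s']
```

Because every label carries its environment prefix, identically named phonemes from different environments coexist in the one merged model without collision.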

The utterance-environment-parallel speech recognition section 26 recognizes multiple utterance environments in parallel, as in gender-parallel speech recognition. Using the environment-dependent acoustic models supplied via the acoustic model input section 21, it recognizes, in parallel, the mixed-environment training speech supplied via the speech input section 22, with the utterance-environment identifier attached to each phoneme label of the training speech (see FIG. 8, described later). The acoustic model discriminative learning section 29 thus learns multiple identifier-attached acoustic models, each adapted to one utterance environment, from the mixed-environment training speech. Because the utterance-environment identifier of each word in the recognition result (for example, a per-gender identifier reflecting each speaker's gender) is attached to each phoneme automatically, the utterance-environment-parallel speech recognition section 26 yields both an identifier-attached recognition result and an identifier-attached hypothesis lattice, as shown in FIG. 7 described later. Using the identifier-attached hypothesis lattice in the discriminative training described later further reduces utterance-environment identification errors.
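A toy sketch of the parallel-recognition idea (illustrative only: a single scalar pitch feature and a squared-distance score stand in for real acoustic scoring, and the segment/word structure is assumed, not from the patent):

```python
def score(model, seg):
    # Toy stand-in for an acoustic log-score: negative squared distance of a
    # single pitch feature from the environment model's mean.
    return -(seg["pitch"] - model["pitch"]) ** 2

def parallel_recognize(env_models, segments):
    """Score each segment under every environment-dependent model in parallel
    and tag the winning hypothesis with that environment's identifier."""
    results = []
    for seg in segments:
        best_env = max(env_models, key=lambda env: score(env_models[env], seg))
        results.append((best_env, seg["word"]))
    return results

env_models = {"M": {"pitch": 120.0}, "F": {"pitch": 220.0}}
segments = [{"word": "hello", "pitch": 115.0}, {"word": "world", "pitch": 210.0}]
print(parallel_recognize(env_models, segments))
# → [('M', 'hello'), ('F', 'world')]
```

The real device performs this competition inside a single decoder pass over identifier-tagged phoneme networks, so the environment decision and the word hypothesis search happen together rather than per pre-cut segment.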

The recognition error correction section 27 corrects the identification results of the utterance-environment-parallel speech recognition section 26 as needed, based on the correction information supplied from the user interface section 23, and supplies the corrected identifier-attached recognition result to the identifier-attached transcription section 28.

Based on the (corrected) identifier-attached recognition result supplied from the recognition error correction section 27, the identifier-attached transcription section 28 creates a transcription of the generated recognition result with the utterance-environment identifiers attached.

The acoustic model discriminative learning section 29 discriminatively trains the identifier-attached acoustic model using the training speech supplied via the speech input section 22 and the identifier-attached transcription supplied via the identifier-attached transcription section 28. For this training, discriminative training under the minimum phone error (MPE) criterion, described later, is effective. The section 29 preferably also uses the identifier-attached hypothesis lattice obtained by the utterance-environment-parallel speech recognition section 26 (for example, a lattice in which male and female recognition hypotheses are mixed), so that the discriminative training takes speaker identification into account. The acoustic models can thereby be optimized for parallel speech recognition.

The post-training environment-dependent acoustic model generation section 30 removes the utterance-environment identifiers from the trained identifier-attached acoustic model of each utterance environment obtained from the acoustic model discriminative learning section 29, and generates the trained environment-dependent acoustic models; for example, it generates and outputs the trained first, second, and third environment-dependent acoustic models. The generated trained utterance-environment-dependent acoustic models can also be stored in the storage unit 3.
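The identifier removal can be sketched as the inverse of the merging step, again assuming the illustrative {tagged_label: stats} dict representation (not from the patent): splitting each label on its first "_" recovers one environment-dependent model per identifier.

```python
def split_by_identifier(tagged_model):
    """Undo the identifier prefixing after training, recovering one
    environment-dependent model per utterance-environment identifier."""
    models = {}
    for tagged_label, stats in tagged_model.items():
        env_id, label = tagged_label.split("_", 1)
        models.setdefault(env_id, {})[label] = stats
    return models

trained = {"M_m+a": {"mean": 0.1}, "F_m+a": {"mean": 0.3}, "M_s-u": {"mean": 0.2}}
print(split_by_identifier(trained))
# → {'M': {'m+a': {'mean': 0.1}, 's-u': {'mean': 0.2}}, 'F': {'m+a': {'mean': 0.3}}}
```

Splitting on only the first "_" keeps any underscores inside the phoneme label itself intact.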

Next, the operation of the speech learning device of the embodiment is described in detail. FIG. 2 is a flowchart showing the operation of the speech learning device according to the embodiment.

[Device operation]
In step S1, the acoustic model input section 21 takes the prepared environment-dependent acoustic models, each with an identifier attached to its phoneme labels to distinguish the utterance environments, and, corresponding to that input, the speech input section 22 takes training speech in which multiple utterance environments are mixed.

In step S2, the identifier-attached acoustic model generation section 24 merges the input environment-dependent acoustic models, with the utterance-environment identifiers attached to each phoneme label, into a single identifier-attached acoustic model. The utterance-environment-parallel speech recognition section 26 then uses the environment-dependent acoustic models supplied via the acoustic model input section 21 to recognize, in parallel, the mixed-environment training speech input via the speech input section 22, and generates an identifier-attached recognition result.

FIG. 3 shows examples of phoneme labels used in the speech learning device of the embodiment. As shown in the center of FIG. 3, an HMM can be drawn as states (circles) connected by transitions (arrows). As shown on the right of FIG. 3, a conventional phoneme label represents, for example, the phoneme "ま" as "m+a", and the consecutive phonemes of "ます" as "m+a", "m-a+s", "m-s+u", "s-u". By contrast, as shown on the left of FIG. 3, the identifier-attached phoneme labels of this embodiment prefix "M_" and "F_" to distinguish, for example, the two utterance environments of male and female: for a male phoneme sequence, the labels become "M_m+a", "M_m-a+s", "M_m-s+u", "M_s-u".
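The label prefixing in FIG. 3 can be sketched directly (illustrative code; the "M"/"F" identifiers and the triphone notation follow the example above):

```python
def tag_labels(env_id, labels):
    # Prefix each triphone label with the utterance-environment identifier.
    return [f"{env_id}_{label}" for label in labels]

masu = ["m+a", "m-a+s", "m-s+u", "s-u"]  # triphone labels for "ます"
print(tag_labels("M", masu))  # → ['M_m+a', 'M_m-a+s', 'M_m-s+u', 'M_s-u']
print(tag_labels("F", masu))  # → ['F_m+a', 'F_m-a+s', 'F_m-s+u', 'F_s-u']
```

Applying this tagging to every phoneme label of the training transcriptions, and to the labels of each environment's HMMs before merging, is what lets one joint training pass keep the per-environment statistics separate.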

When training an acoustic model, the utterance segmentation criteria of the training data should match those of the speech to be recognized. The speech learning device 1 of this embodiment therefore uses the result of recognizing the training speech with gender-parallel speech recognition, correcting recognition errors where necessary, to create appropriate training data.

Since the speech learning device 1 of this embodiment knows which phoneme network each word of the recognition result passed through, it automatically attaches the speaker's gender as the utterance-environment identifier. FIG. 3 shows identifier-attached phoneme labels for training when utterances of a male voice (M_) and a female voice (F_) are mixed. Furthermore, utterance-environment identifiers are likewise given to the acoustic models (triphone HMMs) and to the definitions of the states and transition probabilities shared among the HMMs, as with the training phoneme labels; the male-voice HMM and the female-voice HMM are merged, and adaptive training is performed for all of them at once using the identifier-attached training labels.

Referring again to FIG. 2, in step S3 the utterance-environment-parallel speech recognition section 26 optionally generates an identifier-attached hypothesis lattice in order to further reduce recognition errors caused by the parallel processing of multiple utterance environments.

In step S4, if the operator of the device examines the recognition result of the utterance-environment-parallel speech recognition section 26 and decides that correction is needed, the operator can correct the identification result as desired via the user interface section 23 in step S5.

In step S6, the identifier-attached transcription section 28 creates a transcription of the identifier-attached recognition result, based on the (corrected) identifier-attached recognition result supplied from the recognition error correction section 27.

In step S7, the acoustic model discriminative training unit 29 discriminatively trains the acoustic models with identifiers, using the training speech supplied via the speech input unit 22, the transcript with identifiers supplied via the transcription-with-identifier creation unit 28, and the hypothesis lattice with identifiers obtained by the utterance environment parallel speech recognition unit 26.

In step S8, the utterance environment identifiers are removed from the trained utterance-environment-dependent acoustic models with identifiers generated by the acoustic model discriminative training unit 29, and the result is stored in the storage unit 3.

Thus, when the acoustic models are trained, a transcript corresponding to the training speech is generated with, for example, male/female utterance environment identifiers attached. To create this transcript with identifiers, the result of recognizing the training speech in parallel with the male and female acoustic models is used. Although a transcript can be created easily from the parallel recognition result, it is also possible to transcribe the entire utterance content by hand. Since the parallel recognition result indicates the gender of the speaker for each recognized word, a gender identifier can be assigned to each phoneme automatically. The speech learning apparatus 1 of this embodiment also corrects errors in this recognition result with identifiers, as necessary, to create the transcript with identifiers.
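A minimal sketch of this automatic tagging (the data layout is hypothetical; in practice the gender attribute comes from the phoneme network each word passed through):

```python
# Hypothetical sketch: build an identifier-tagged transcript from a parallel
# recognition result in which each word carries a speaker-gender attribute.

def tag_transcript(recognized_words):
    """recognized_words: list of (phoneme_list, gender) pairs, where gender
    is "M" or "F" as decided by the phoneme network the word passed through."""
    labels = []
    for phonemes, gender in recognized_words:
        labels.extend(f"{gender}_{p}" for p in phonemes)
    return labels

result = [(["m", "a"], "F"), (["t", "a"], "M")]
labels = tag_transcript(result)  # ["F_m", "F_a", "M_t", "M_a"]
```

Any operator correction of the recognition result would simply change the `result` entries before tagging; the prefixing step itself stays the same.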

Furthermore, in the speech learning apparatus 1 of this embodiment, the acoustic model with identifiers that serves as the basis for training is created by automatically assigning gender-specific identifiers to each phoneme of the male and female acoustic models and merging them.

Furthermore, the speech learning apparatus 1 of this embodiment can discriminatively train the acoustic models with identifiers using the training speech and the created transcript with identifiers, and it performs discriminative training that takes speaker identification into account using a lattice in which the male and female recognition hypotheses obtained by parallel speech recognition are mixed. In this way, the acoustic models can be optimized for parallel speech recognition.

The speech learning apparatus 1 of this embodiment also generates, after training the acoustic models with identifiers, acoustic models whose phoneme labels have the identifiers removed. Using these trained male and female acoustic models, unknown speech can be recognized in parallel.

For example, FIG. 4 shows a schematic diagram of a speech recognition apparatus that performs speech recognition of unknown speech using acoustic models trained by the speech learning apparatus of one embodiment of the present invention. Like the speech learning apparatus 1, the speech recognition apparatus 101 includes a control unit 102 and a storage unit 103. The control unit 102 includes an acoustic model input unit 121, a speech input unit 122, and an utterance environment parallel speech recognition unit (gender parallel speech recognition unit) 126, which may be understood as corresponding to the acoustic model input unit 21, the speech input unit 22, and the utterance environment parallel speech recognition unit 26 described above, respectively. When the speech recognition apparatus 101 and the speech learning apparatus 1 are configured as a single apparatus, the control unit 102 and the storage unit 103 can also be configured as the control unit 2 and the storage unit 3 described above.

The acoustic model input unit 121 inputs each of the trained environment-dependent acoustic models stored in the storage unit 3 and supplies them to the utterance environment parallel speech recognition unit (gender parallel speech recognition unit) 126.

Corresponding to the input of the environment-dependent acoustic models, the speech input unit 122 inputs unknown speech in which a plurality of utterance environments are mixed and supplies it to the utterance environment parallel speech recognition unit (gender parallel speech recognition unit) 126.

The utterance environment parallel speech recognition unit (gender parallel speech recognition unit) 126 has a function of recognizing a plurality of utterance environments in parallel, as in gender parallel speech recognition. Using the plural types of environment-dependent acoustic models supplied via the acoustic model input unit 121, it performs speech recognition in parallel on the unknown speech, in which a plurality of utterance environments are mixed, supplied via the speech input unit 122, and obtains a recognition result.

Next, the utterance environment parallel speech recognition unit (gender parallel speech recognition unit) 126 is described briefly as an example of the utterance environment parallel speech recognition unit 26. For details of gender parallel speech recognition, see, for example, Patent Document 1 and Non-Patent Document 1.

[Gender parallel speech recognition]
FIG. 5 shows an outline of gender parallel speech recognition, which is effective when the voices of a plurality of speakers are mixed in a single utterance segment, as in conversational speech. In gender parallel speech recognition, the phoneme networks of word pronunciation dictionaries linked to the gender-dependent acoustic models are run in parallel, and the search is performed while allowing the gender attribute to switch at word boundaries.

The utterance environment parallel speech recognition unit (gender parallel speech recognition unit) 126 includes an utterance detection and gender change control unit 126a. When recognition starts, the utterance detection and gender change control unit 126a performs gender parallel phoneme recognition in which transitions between the male and female models are allowed and pruning is shared, quickly detects the start and end of each utterance using the cumulative phoneme likelihoods, and identifies the times at which the speaker attribute changes on the basis of that result.

Using the start and end times of the utterance and the speaker attribute change times, gender parallel large-vocabulary continuous speech recognition is then performed as shown in FIG. 5, again allowing transitions between the male and female models with shared pruning, and the word string of the recognition result is output using the cumulative phoneme likelihoods.

Specifically, when speech recognition starts (S12), the feature vectors of the input speech consist of cepstra, short-time power, and their dynamic features. From phoneme-environment-dependent acoustic models (triphones) trained on male speakers' speech in various acoustic environments, and from female acoustic models trained in the same way, a phoneme network forming words as shown in FIG. 5 is constructed using word bigrams. Here, in the phoneme network of the female utterance environment acoustic model, a word bigram (S14a) is constructed between silences (S13a, S15a) using the utterance start and end times, and in the phoneme network of the male utterance environment acoustic model, a word bigram (S14b) is constructed between silences (S13b, S15b) using the utterance start and end times; transitions between the male and female networks are allowed, and the speech recognition result is output (S16).
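The parallel network described above can be sketched as a small graph (node names such as `M_bigram` are invented labels for the S13 to S15 blocks, not identifiers from the patent):

```python
# Hypothetical sketch: a parallel search network in which each gender has its
# own silence -> word-bigram -> silence subnetwork, plus word-boundary links
# that let the speaker attribute switch between the two subnetworks.

def build_parallel_network(genders=("M", "F")):
    edges = []
    for g in genders:
        edges += [
            ("start", f"{g}_sil_in"),          # leading silence (S13a/S13b)
            (f"{g}_sil_in", f"{g}_bigram"),    # word bigram block (S14a/S14b)
            (f"{g}_bigram", f"{g}_sil_out"),   # trailing silence (S15a/S15b)
            (f"{g}_sil_out", "end"),           # output result (S16)
        ]
    # word-boundary links allowing the gender attribute to change mid-utterance
    for a in genders:
        for b in genders:
            if a != b:
                edges.append((f"{a}_bigram", f"{b}_bigram"))
    return edges

net = build_parallel_network()
```

The cross links between the two bigram blocks are what make a male-to-female (or female-to-male) switch possible at a word boundary during the search.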

The utterance environment parallel speech recognition unit (gender parallel speech recognition unit) 126 can also assign attribute information to each word of the recognition result (hypothesis).

On the other hand, in the recognition result produced at training time in the speech learning apparatus 1 of this embodiment, the utterance environment identifiers are attached to the phoneme labels. It is therefore possible to obtain not only the maximum likelihood word hypothesis sequence and the speaker attribute indicating which phoneme network each word was recognized through, but also the search paths that survived pruning, in the form of a lattice with identifiers. This lattice with identifiers is used for discriminative training of the acoustic models. For example, FIG. 6 shows an example of a lattice actually obtained from gender parallel speech recognition in the speech learning apparatus of one embodiment of the present invention. In FIG. 6, each branch of the arcs connecting the numbered nodes is labeled with a hypothesized phoneme and a speaker attribute indicating whether it passed through the male or the female phoneme network; the prefixes "M_" and "F_" attached to the phoneme labels are utterance environment identifiers indicating male and female gender, respectively, and can also serve as the speaker attribute.
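As a small illustration (the arc tuples below are hypothetical), the identifier prefix on each lattice arc label can be read back directly as the speaker attribute:

```python
# Hypothetical sketch: arcs of a lattice with identifiers, stored as
# (from_node, to_node, label) tuples; the environment identifier prefix on
# each label doubles as the speaker attribute of that arc.

def speaker_attribute(arc_label):
    """Split an identifier-prefixed label like "M_t" into (attribute, phoneme)."""
    prefix, _, phoneme = arc_label.partition("_")
    return prefix, phoneme

arcs = [(0, 1, "M_t"), (1, 2, "M_a"), (0, 1, "F_m")]
attributes = [speaker_attribute(label) for _, _, label in arcs]
```

Keeping the attribute inside the label, rather than in a side table, is what lets the same lattice feed discriminative training without extra bookkeeping.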

Next, discriminative training in the acoustic model discriminative training unit 29 is described.

[Discriminative training]
For discriminative training in the acoustic model discriminative training unit 29, discriminative training using the minimum phone error (MPE) criterion is effective (see, for example, Non-Patent Document 1). In MPE-based discriminative training, the posterior probability of each branch of the phoneme lattice obtained from the recognition result is computed, and the parameters of the acoustic model are estimated so that the expected number of phoneme recognition errors becomes small. The data required for this training are the training speech, the correct phoneme sequence corresponding to that speech (the recognition result), and a phoneme hypothesis lattice as shown in FIG. 5.
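A minimal numerical sketch of the MPE idea, with invented posteriors and accuracies: the expected phone accuracy of the lattice is the posterior-weighted sum of the per-arc accuracies, and training raises the likelihood of arcs scoring above that expectation while lowering the rest:

```python
# Hypothetical sketch of the MPE objective on a tiny lattice. Each arc carries
# a posterior probability (from the lattice forward-backward pass) and a phone
# accuracy; both values here are invented for illustration.

def expected_accuracy(arcs):
    """arcs: list of (posterior, accuracy) pairs; posteriors are assumed
    already normalized, as they would be after a forward-backward pass."""
    return sum(p * acc for p, acc in arcs)

arcs = [(0.6, 1.0), (0.3, 0.0), (0.1, -1.0)]
avg = expected_accuracy(arcs)
# Arcs more accurate than the lattice average get their likelihood raised;
# the others get it lowered.
raise_likelihood = [acc > avg for _, acc in arcs]
```

This is only the criterion side of MPE; the actual parameter update (extended Baum-Welch over HMM statistics) is beyond this sketch.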

To train utterance environment (speaker) dependent acoustic models, it is effective to adapt speaker-independent acoustic models trained on a large amount of data (see, for example, Non-Patent Document 3).

The speech learning apparatus 1 of this embodiment assigns utterance environment identifiers to the phoneme labels of the speech data used for acoustic model training, assigns the same utterance environment identifiers to the acoustic model of each phoneme, and simultaneously trains the acoustic models corresponding to a plurality of utterance environments from speech segments in which the plural utterance environments are mixed.

Also, because the speech learning apparatus 1 of this embodiment creates the acoustic model to which the utterance environment identifiers are assigned by merging (integrating) a plurality of utterance-environment-dependent acoustic models using the utterance environment identifiers, the plural acoustic models can be trained at once (collectively) using this merged model.

Also, the speech learning apparatus 1 of this embodiment assigns gender- or speaker-specific identifiers as the utterance environment identifiers and, correcting the recognition result of gender parallel speech recognition if necessary, creates the phoneme labels of the training speech; this also makes it easy to determine the phone accuracy of the correct phoneme sequence corresponding to the training speech.

To introduce such discriminative training into acoustic model training with utterance environment identifiers, the speech learning apparatus 1 of this embodiment preferably uses a lattice of recognition hypotheses that includes recognition errors. The hypotheses obtained from a plurality of acoustic models, as in gender parallel speech recognition, can be searched jointly at once to obtain a lattice and generate the hypothesis lattice. Because each phoneme on this lattice is also given an utterance environment identifier when the hypothesis lattice is generated, recognition errors between phonemes of different utterance environments can be identified as model errors, and the statistics of the acoustic models can be trained so that recognition errors of the utterance environment decrease.

More specifically, the discriminative training performed by the acoustic model discriminative training unit 29 uses a hypothesis lattice obtained from utterance environment parallel speech recognition, as shown in FIG. 6. The hypothesis lattice obtained from the utterance environment parallel speech recognition unit 26 also carries the utterance environment identifier that each hypothesis passed through. For example, when performing gender parallel speech recognition, the same utterance environment identifiers as in the phoneme labels with identifiers shown in FIG. 3 are assigned to the hypothesis lattice. Note that in conventional gender parallel speech recognition, as can be seen from the example shown in FIG. 5, male and female speech are mixed within the hypothesis lattice, and these gender phoneme errors can cause word recognition errors.

FIG. 7(a) shows training phoneme labels with the utterance environment assigned, used for MPE-based discriminative training, and FIG. 7(b) shows an example of a hypothesis lattice used for MPE-based discriminative training. According to this embodiment, discriminative training under the MPE criterion can take gender identification errors into account in addition to phoneme errors. In FIG. 7(b), the training phonemes that take the utterance environment into account are shown above the branches of the lattice, and the phone accuracy of each branch is shown below it.

In discriminative training, a branch whose phone accuracy is 1.0 is trained so that its likelihood becomes higher, and a phoneme whose accuracy is 0.0 or -1.0 is trained so that its likelihood becomes lower. For example, when phoneme substitution and deletion errors are evaluated as the phone accuracy, the accuracy takes values from 0.0 to 1.0; when phoneme insertion errors are additionally evaluated, the accuracy takes values from -1.0 to 1.0.

For example, in the example of FIG. 7(b), the paths "F_m, F_a, F_s, F_u, F_sp" and "M_t, M_a, M_d, M_a" contain no phoneme errors, but because their gender is wrong they are given a phone accuracy of 0.0; that is, they are trained so that the likelihood with respect to the feature vectors representing the observed features becomes lower. In this way, an acoustic model with a high ability to discriminate male and female phonemes can be trained.
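A hedged sketch of this scoring rule (the function is illustrative, not the patent's exact procedure): because the labels being compared carry the gender identifier, a correct phoneme with the wrong gender prefix scores the same 0.0 as a substitution:

```python
# Hypothetical sketch: per-arc phone accuracy against the reference label,
# treating a gender-identifier mismatch the same as a wrong phoneme.

def phone_accuracy(hyp, ref, inserted=False):
    """hyp/ref are identifier-prefixed labels like "M_a". An inserted
    hypothesis phone scores -1.0; a substitution (including a correct phoneme
    with the wrong gender identifier) scores 0.0; an exact match scores 1.0."""
    if inserted:
        return -1.0
    return 1.0 if hyp == ref else 0.0

# Path "F_m F_a" against reference "M_m M_a": every phoneme is right but the
# gender identifier is wrong, so each arc gets accuracy 0.0.
scores = [phone_accuracy(h, r) for h, r in zip(["F_m", "F_a"], ["M_m", "M_a"])]
```

Scoring on the full identifier-prefixed label is what makes a gender confusion count as a model error during MPE training.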

By using acoustic models trained in this way, speech training can, for example, be performed collectively without training each utterance-environment-dependent acoustic model individually, and the accuracy of the trained utterance-environment-dependent acoustic models can be increased. This has the further effect of making it easy to extend the range of utterance-environment-dependent acoustic models, and even in speech recognition where a plurality of speakers are mixed, it reduces the recognition errors caused by misidentification of the utterance environment, so that speakers can be identified accurately and their speech recognized.

An example of the adaptation process in discriminative training of the acoustic models is now described.

As described above, according to the speech learning apparatus of this embodiment, adapting the acoustic models by introducing a DLT estimated on the basis of the MPE criterion improves the word error rate (WER) for plural gender-specific phoneme classes, compared with the case where the adaptation of this embodiment is not performed (MLLR).

As one aspect of the present invention, the speech learning apparatus 1 can be configured as a computer. A program for realizing the functions of the acoustic model input unit 21, the speech input unit 22, the user interface unit 23, the acoustic model-with-identifier generation unit 24, the utterance environment parallel speech recognition unit 26, the recognition error correction unit 27, the transcription-with-identifier creation unit 28, the acoustic model discriminative training unit 29, and the post-training environment-dependent acoustic model generation unit 30 is stored in the storage unit 3 provided inside or outside the computer. The information and data used for each control can also be stored in this storage unit. Such a storage unit can be realized by an external storage device such as an external hard disk, or by an internal storage device such as a ROM or RAM. The control unit that executes the program can be realized by a central processing unit (CPU) or the like. That is, the CPU can read, as appropriate, the program describing the processing for realizing the function of each component from the storage unit and realize each apparatus on the computer. The function of any of these means may also be realized wholly or partly in hardware.

In the embodiment described above, the program describing the processing for realizing the functions of the speech learning apparatus 1 can also be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording device, or a semiconductor memory.

The speech learning apparatus 1 may also receive input via a portable recording medium such as a DVD or CD-ROM, or via an interface dedicated to video input.

The speech learning apparatus 1 of the above embodiment has been described as a representative example, but it will be apparent to those skilled in the art that many changes and substitutions can be made within the spirit and scope of the present invention. Accordingly, the present invention should not be construed as limited by the embodiment described above, but only by the claims.

According to the present invention, speech training can be performed collectively without training each utterance-environment-dependent acoustic model individually, and the accuracy of the trained utterance-environment-dependent acoustic models can be increased, which is useful for any speech recognition application.

DESCRIPTION OF SYMBOLS
1 speech learning apparatus
2 control unit
3 storage unit
21 acoustic model input unit
22 speech input unit
23 user interface unit
24 acoustic model-with-identifier generation unit
26 utterance environment parallel speech recognition unit
27 recognition error correction unit
28 transcription-with-identifier creation unit
29 acoustic model discriminative training unit
30 post-training environment-dependent acoustic model generation unit
101 speech recognition apparatus
102 control unit
103 storage unit
121 acoustic model input unit
122 speech input unit
126 utterance environment parallel speech recognition unit (gender parallel speech recognition unit)
126a utterance detection and gender change control unit

Claims (5)

1. A speech learning apparatus for training acoustic models used for speech recognition, comprising:
an acoustic model-with-identifier generation unit that merges the environment-dependent acoustic models for each of a plurality of utterance environments, with an utterance environment identifier for identifying each utterance environment attached to each phoneme label of each environment-dependent acoustic model, to generate a set of acoustic models with identifiers;
an utterance environment parallel speech recognition unit that, using each of the environment-dependent acoustic models for the plurality of utterance environments to which the utterance environment identifiers are attached, performs speech recognition in parallel on training speech in which the plurality of utterance environments are mixed and generates a recognition result;
a transcription-with-identifier creation unit that creates a transcript of the generated recognition result with the utterance environment identifiers attached; and
an acoustic model discriminative training unit that discriminatively trains the acoustic models with identifiers using the training speech and the transcript with identifiers,
wherein the acoustic model discriminative training unit trains, from the training speech in which utterances of the plurality of utterance environments are mixed, a plurality of acoustic models with identifiers adapted for each utterance environment.

2. The speech learning apparatus according to claim 1, wherein the utterance environment parallel speech recognition unit performs speech recognition in parallel on the training speech in which the plurality of utterance environments are mixed, using each of the environment-dependent acoustic models for the plurality of utterance environments to which the utterance environment identifiers are attached, and automatically attaches utterance environment identifiers to the recognition result to generate a recognition result with identifiers.

3. The speech learning apparatus according to claim 1, further comprising a post-training environment-dependent acoustic model generation unit that removes the utterance environment identifiers from the trained acoustic models with identifiers for each utterance environment generated by the acoustic model discriminative training unit, and generates the plurality of trained utterance-environment-dependent acoustic models.

4. The speech learning apparatus according to any one of claims 1 to 3, wherein the acoustic model discriminative training unit trains the plurality of acoustic models using utterance environment identifiers for each gender or each speaker as the utterance environment.

5. The speech learning apparatus according to any one of claims 1 to 4, wherein the utterance environment parallel speech recognition unit attaches the utterance environment identifiers to a hypothesis lattice in the speech recognition to generate a hypothesis lattice with identifiers, and the acoustic model discriminative training unit obtains the hypothesis lattice with identifiers from the utterance environment parallel speech recognition unit and, using the hypothesis lattice with identifiers, the training speech, and the transcript with identifiers, trains a plurality of acoustic models with identifiers adapted for each utterance environment from the training speech in which utterances of the plurality of utterance environments are mixed.
JP2009046762A 2009-02-27 2009-02-27 Speech learning apparatus and program Expired - Fee Related JP4972660B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2009046762A JP4972660B2 (en) 2009-02-27 2009-02-27 Speech learning apparatus and program

Publications (2)

Publication Number Publication Date
JP2010204175A true JP2010204175A (en) 2010-09-16
JP4972660B2 JP4972660B2 (en) 2012-07-11

Family

ID=42965758

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002236494A (en) * 2001-02-09 2002-08-23 Denso Corp Speech section discriminator, speech recognizer, program and recording medium
JP2005345772A (en) * 2004-06-03 2005-12-15 Nippon Telegr & Teleph Corp <Ntt> Voice recognition method, device carrying out the method, program, and its recording medium
JP2006106300A (en) * 2004-10-05 2006-04-20 Mitsubishi Electric Corp Speech recognition device and program therefor
JP2007233149A (en) * 2006-03-02 2007-09-13 Nippon Hoso Kyokai <Nhk> Voice recognition device and voice recognition program

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9123327B2 (en) 2011-12-26 2015-09-01 Denso Corporation Voice recognition apparatus for recognizing a command portion and a data portion of a voice input
WO2015022761A1 (en) 2013-08-13 2015-02-19 Mitsubishi Electric Corporation Pattern recognition apparatus and pattern recognition method
US9336770B2 (en) 2013-08-13 2016-05-10 Mitsubishi Electric Corporation Pattern recognition apparatus for creating multiple systems and combining the multiple systems to improve recognition performance and pattern recognition method

Also Published As

Publication number Publication date
JP4972660B2 (en) 2012-07-11

Similar Documents

Publication Publication Date Title
US10643609B1 (en) Selecting speech inputs
US8140336B2 (en) Speech recognition system with huge vocabulary
US9916826B1 (en) Targeted detection of regions in speech processing data streams
KR101237799B1 (en) Improving the robustness to environmental changes of a context dependent speech recognizer
JP5149107B2 (en) Sound processing apparatus and program
JP4869268B2 (en) Acoustic model learning apparatus and program
JP2005208643A (en) System and method for automatic speech recognition learning using user correction
JP2005227758A (en) Automatic identification of telephone caller based on voice characteristic
US9240181B2 (en) Automatic collection of speaker name pronunciations
JP2015520410A (en) Performance improvement based on negative example (anti-word) for speech recognition
JP6690484B2 (en) Computer program for voice recognition, voice recognition device and voice recognition method
US20170270923A1 (en) Voice processing device and voice processing method
JP2004333543A (en) System and method for speech interaction
JP5184467B2 (en) Adaptive acoustic model generation apparatus and program
JP6027754B2 (en) Adaptation device, speech recognition device, and program thereof
JP4972660B2 (en) Speech learning apparatus and program
JP2010054574A (en) Device for estimating speaker change, speaker identifying device and computer program
Nouza et al. Fast keyword spotting in telephone speech
JP2005091504A (en) Voice recognition device
Bansal et al. A joint decoding algorithm for multiple-example-based addition of words to a pronunciation lexicon
Kessens et al. Modeling pronunciation variation for ASR: Comparing criteria for rule selection
Kalantari et al. Incorporating visual information for spoken term detection
Gollan Efficient setup of acoustic models for large vocabulary continuous speech recognition
JP2009210942A (en) Voice reproduction system, voice reproduction method, and program
KR20050063986A (en) Speaker depedent speech recognition sysetem using eigenvoice coefficients and method thereof

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20110317

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20120119

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20120313

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20120409

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20150413

Year of fee payment: 3

R150 Certificate of patent or registration of utility model

Ref document number: 4972660

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250


LAPS Cancellation because of no payment of annual fees