JP6189818B2

JP6189818B2 - Acoustic feature amount conversion device, acoustic model adaptation device, acoustic feature amount conversion method, acoustic model adaptation method, and program

Info

Publication number: JP6189818B2
Application number: JP2014236637A
Authority: JP
Inventors: 孝典芦原; 太一浅見; 裕司青野; 阪内　澄宇; 澄宇阪内
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-11-21
Filing date: 2014-11-21
Publication date: 2017-08-30
Anticipated expiration: 2034-11-21
Also published as: JP2016099507A

Description

The present invention relates to a technique for converting the acoustic feature quantities used for acoustic model training when speech recognition based on an acoustic model is adapted to various recognition target tasks.

Patent Document 1 describes a technique for adapting an acoustic model to the task targeted by speech recognition (hereinafter, the recognition target task) in order to ensure practical speech recognition performance. Here, a recognition target task is a task whose acoustic characteristics, such as the speaker, the noise type, and the speaking style, differ from those of the original acoustic model.

In general, speech recognition performance rises and falls with the amount of training data available for the recognition target task. In other words, when sufficient training data for the recognition target task does not exist, adapting the acoustic model with conventional techniques often fails to yield a satisfactory recognition rate. The usual remedy is to collect enough speech from the recognition target task and transcribe it to obtain the desired amount of training data, but this requires enormous cost in money and time. Moreover, transcription-based collection is possible only if enough speech from the task can be obtained, and that is not the case for every task. For example, it is difficult to obtain a sufficient amount of speech for tasks such as dialects or English spoken by Japanese speakers.

Non-Patent Document 1 describes a method that artificially creates speaker variation in the training data by running Vocal Tract Length Normalization (VTLN) with multiple values of the warping factor. VTLN itself is described in Non-Patent Document 2.

Patent Document 1: JP 2007-249051 A

Non-Patent Document 1: N. Jaitly, G. E. Hinton, “Vocal Tract Length Perturbation (VTLP) improves speech recognition”, ICML Workshop on Deep Learning for Audio, Speech, and Language Processing, 2013.
Non-Patent Document 2: E. Eide, H. Gish, “A parametric approach to vocal tract length normalization”, ICASSP, 1996.

However, the technique of Non-Patent Document 1 has two major problems. First, because VTLN is a linear transformation, it can perform only a very coarse conversion. Second, VTLN is intended specifically for converting the speaker's voice quality and cannot perform any other kind of conversion.

An object of the present invention is to improve the recognition rate even when sufficient training data cannot be obtained, by artificially creating training data for adapting an acoustic model to the recognition target task.

To solve the above problem, the acoustic feature quantity conversion device according to the first aspect of the present invention includes: a feature quantity extraction unit that generates a target acoustic feature quantity sequence extracted from a target speech signal relating to the task to be recognized, and a reference acoustic feature quantity sequence extracted from a reference speech signal whose utterance content corresponds to that of the target speech signal; a feature quantity matching unit that generates a matched target acoustic feature quantity sequence and a matched reference acoustic feature quantity sequence by matching the correspondence between the target acoustic feature quantity sequence and the reference acoustic feature quantity sequence based on the similarity of each feature quantity; a conversion model generation unit that uses the matched target acoustic feature quantity sequence and the matched reference acoustic feature quantity sequence to learn a conversion model that outputs an acoustic feature quantity sequence in which the acoustic characteristics of the input acoustic feature quantities have been converted based on the correspondence between the target acoustic feature quantity sequence and the reference acoustic feature quantity sequence; and a pseudo feature quantity generation unit that uses the conversion model to generate a pseudo acoustic feature quantity sequence by converting the acoustic characteristics of the reference acoustic feature quantity sequence.

The acoustic model adaptation device according to the second aspect of the present invention includes: an acoustic feature quantity storage unit that stores a target acoustic feature quantity sequence extracted from a target speech signal relating to the task to be recognized; a pseudo acoustic feature quantity storage unit that stores a pseudo acoustic feature quantity sequence obtained by using the conversion model generated by the acoustic feature quantity conversion device to convert the acoustic characteristics of an acoustic feature quantity sequence extracted from a speech signal whose acoustic characteristics differ from those of the task to be recognized; and an acoustic model learning unit that trains an acoustic model using the target acoustic feature quantity sequence and the pseudo acoustic feature quantity sequence.

Even when sufficient training data for the recognition target task cannot be obtained, the acoustic feature quantity conversion technique of the present invention artificially creates training data by means of feature quantity conversion with a neural network, and adapts the acoustic model using this pseudo training data as well. This produces an acoustic model that is better adapted to the recognition target task than one adapted with only the small amount of training data originally available, and the recognition rate improves.

FIG. 1 is a diagram illustrating the functional configuration of the acoustic feature quantity conversion device and the acoustic model adaptation device of the first embodiment.
FIG. 2 is a diagram illustrating the processing flow of the acoustic feature quantity conversion method and the acoustic model adaptation method of the first embodiment.
FIG. 3 is a diagram illustrating the functional configuration of the acoustic feature quantity conversion device and the acoustic model adaptation device of the second embodiment.
FIG. 4 is a diagram illustrating the processing flow of the acoustic feature quantity conversion method and the acoustic model adaptation method of the second embodiment.

To solve the problems of the prior art described above, the present invention uses a neural network for the acoustic feature quantity conversion process. Neural networks are described, for example, in Ryohei Nakano, “Basic Mathematics of Neural Information Processing” (ニューラル情報処理の基礎数理), 数理工学社, 2005 (Reference 1). Unlike VTLN, a neural network performs nonlinear processing and can therefore represent very complex mappings. In addition to converting the speaker's voice quality, a neural network can handle other acoustic characteristics such as noise type and speaking style, which makes it more versatile than VTLN.

<Key points of the invention>
The present invention assumes a situation in which the training data for the recognition target task is not sufficient to adapt the acoustic model. The acoustic model is adapted broadly as follows.
(1) Given a small amount of training data B originally obtained for the recognition target task, and a sufficient amount of training data A that does not belong to the recognition target task, a converter is generated that converts the acoustic feature quantities of training data A into the acoustic feature quantities of training data B. The converter uses a neural network.
(2) The converter is applied to training data A to create a sufficient amount of pseudo training data C.
(3) The original training data B and the pseudo training data C are used in a learning process that adapts the acoustic model to the recognition target task.
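The data flow of steps (2) and (3) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_adaptation_set` and the type aliases are hypothetical, and the converter is passed in as an arbitrary frame-to-frame function standing in for the trained neural network.

```python
from typing import Callable, List, Sequence

Frame = List[float]        # one acoustic feature vector (hypothetical alias)
Utterance = List[Frame]    # one feature sequence (hypothetical alias)


def build_adaptation_set(
    data_a: Sequence[Utterance],
    data_b: Sequence[Utterance],
    converter: Callable[[Frame], Frame],
) -> List[Utterance]:
    """Return training data B followed by pseudo training data C.

    C is obtained by converting every frame of the plentiful data A
    with the converter learned in step (1).
    """
    pseudo_c = [[converter(frame) for frame in utt] for utt in data_a]
    return list(data_b) + pseudo_c
```

The returned list is what an acoustic model adaptation step would consume in place of B alone.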

Embodiments of the present invention are described in detail below. In the drawings, components having the same function are given the same reference numeral, and duplicate description is omitted.
[First embodiment]
The first embodiment comprises an acoustic feature quantity conversion device and method that, when complete parallel data exists, learn a neural network for converting acoustic feature quantities and use that network to create pseudo training data, and an acoustic model adaptation device that adapts an acoustic model using that training data. Parallel data is a pair of acoustic feature quantity sequences that have the same utterance content but different acoustic characteristics. Examples of acoustic characteristics include the speaker, the noise type, and the speaking style.

As shown in FIG. 1, the acoustic feature quantity conversion device 1 of the first embodiment includes, for example, an input terminal 10, a speech signal acquisition unit 11, a label assignment unit 12, a feature quantity extraction unit 13, a feature quantity matching unit 14, a conversion model generation unit 15, a pseudo feature quantity generation unit 16, a speech signal storage unit 21, a feature quantity storage unit 22, a conversion model storage unit 23, and a pseudo feature quantity storage unit 24.

As shown in FIG. 1, the acoustic model adaptation device 2 of the first embodiment includes, for example, an acoustic model learning unit 17 in addition to the components of the acoustic feature quantity conversion device 1. FIG. 1 illustrates a configuration in which the acoustic model adaptation device 2 contains all components of the acoustic feature quantity conversion device 1, but a configuration is also possible that includes, besides the acoustic model learning unit 17, only the feature quantity storage unit 22 and the pseudo feature quantity storage unit 24 in which the output of an external acoustic feature quantity conversion device 1 has been stored.

The acoustic feature quantity conversion device 1 and the acoustic model adaptation device 2 are special devices configured by loading a special program into a known or dedicated computer having, for example, a central processing unit (CPU) and a main storage device (RAM: Random Access Memory). The devices execute each process, for example, under the control of the central processing unit. Data input to the devices and data obtained in each process are stored, for example, in the main storage device, and are read out as needed for use in other processes. At least some of the processing units of the acoustic feature quantity conversion device 1 and the acoustic model adaptation device 2 may be implemented in hardware such as an integrated circuit.

Each storage unit of the acoustic feature quantity conversion device 1 and the acoustic model adaptation device 2 can be implemented, for example, as a main storage device such as RAM (Random Access Memory), as an auxiliary storage device such as a hard disk, an optical disk, or a semiconductor memory element such as flash memory, or as middleware such as a relational database or a key-value store. The storage units only need to be logically separate and may reside on a single physical storage device.

The processing procedure of the acoustic feature quantity conversion method of the first embodiment is described with reference to FIG. 2.

In step S10, an acoustic signal to be used as training data is input to the input terminal 10 of the acoustic feature quantity conversion device 1. The training data includes a speech signal relating to the recognition target task (hereinafter, the target speech signal) and a speech signal that has the same utterance content as the target speech signal but different acoustic characteristics (hereinafter, the reference speech signal). The input speech signal may be human speech picked up in real time by a sound collecting means such as a microphone connected to the input terminal 10. Alternatively, human speech may be recorded beforehand onto a recording medium such as nonvolatile memory or a hard disk drive with a recording means such as an IC recorder or the recording function of a smartphone, and input by playing it back through a playback means connected to the input terminal 10.

In step S11, the speech signal acquisition unit 11 converts the analog input speech signal into a digital signal. If a digital speech signal is input from the input terminal 10, the speech signal acquisition unit 11 need not be provided. The digital input speech signal is stored in the speech signal storage unit 21.

In step S12, the label assignment unit 12 reads the target speech signal and the reference speech signal stored in the speech signal storage unit 21 and assigns to each a label representing its acoustic characteristics. The labeled target speech signal and reference speech signal are sent to the feature quantity extraction unit 13. An acoustic characteristic is an attribute of the speech signal that the recognition target task requires, for example the speaker, the noise type, or the speaking style. Labels can be assigned, for example, by (1) having the user specify in advance the usage scene in which the speech is recorded, (2) automatic acquisition from login authentication, the application in use, or the like, or (3) automatic acquisition through classification by clustering.

(1) In the user-specified method, the user specifies information when the speech is recorded, and this gives the speech signal in question its acoustic characteristic label. For the speaker, the user specifies who is speaking (for example, personal attributes such as gender, age, place of residence, language, and name). For the noise type, the user specifies where the speech takes place (for example, a usage environment such as inside a car, on the street, or in a meeting room).

(2) Automatic acquisition from an application or the like works as follows. For the speaker, a speaker label is assigned, for example, by requiring user login authentication before speech is recorded. For the noise type, a label is assigned according to the kind of application in use. For example, speech recognition in a car navigation system could be labeled as in-vehicle noise, and a game that uses speech recognition could be labeled as noise emitted by a television. After the noise type has been obtained from the application in use, it may be classified more finely according to the noise level. The noise level can be, for example, the absolute sound pressure level of the noise measured with a sound level meter, or the S/N ratio based on the recorded sound pressure levels of the recorded speech and the noise. The classification may be refined further by adding temporal information such as the date and time at which the speech signal was recorded, or spatial information such as the location.
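As an illustration of refining a noise-type label by noise level, the sketch below computes an S/N ratio in decibels from the mean power of a speech segment and a noise-only segment, then appends a coarse band to the label. The function names, the band names, and the 10 dB and 20 dB thresholds are assumptions made for this example; the source does not specify them.

```python
import math
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Mean squared amplitude of a segment of samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """S/N ratio in dB; assumes the noise segment has nonzero power."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))


def noise_level_label(noise_type: str, snr: float) -> str:
    """Append a coarse S/N band to a noise-type label.

    The thresholds below are hypothetical and would be tuned per task.
    """
    if snr >= 20.0:
        band = "high_snr"
    elif snr >= 10.0:
        band = "mid_snr"
    else:
        band = "low_snr"
    return f"{noise_type}/{band}"
```

For instance, `noise_level_label("in_vehicle", snr_db(speech, noise))` would yield a label that distinguishes quiet and loud in-vehicle recordings.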

(3) In the method of automatic acquisition by clustering, speakers, recording environments, speaking styles, and so on are clustered with, for example, the well-known K-means method, and labels such as "speaker 1", "speaker 2", "speaker 3", ..., "recording environment 1", "recording environment 2", "recording environment 3", ..., and "speaking style 1", "speaking style 2", "speaking style 3", ... are assigned. The acoustic feature quantities used for clustering are, for example, dimensions 1 to 12 of the Mel-Frequency Cepstrum Coefficients (MFCC) obtained by short-time frame analysis of the speech signal, dynamic parameters such as their dynamic features ΔMFCC and ΔΔMFCC, and power, Δpower, and ΔΔpower. For speaker clustering, for example, these acoustic feature quantities are averaged over each utterance section, and clustering is performed on the averages. Speakers with similar acoustic feature quantities may then be assigned to the same cluster, but because they can be grouped as speakers with the same tendency, this does not affect the performance of the feature quantity conversion neural network described later. For noise-type clustering, for example, acoustic feature quantities are extracted in the same way as for speakers from the sections other than utterance sections (that is, the sections that represent the recording environment), averaged over those sections, and clustered. For speaking-style clustering, one possible method is to define a read-speech class and a spontaneous-speech class in advance and to classify the input speech automatically into one of the two using GMM supervectors. Such speaking-style clustering is described in T. Asami, R. Masumura, H. Masataki, S. Sakauchi, “Read and Spontaneous Speech Classification Based on Variance of GMM Supervectors”, Interspeech 2014, pp. 2375-2379, 2014 (Reference 2). Once clustering has been performed as above, each cluster is tied to a particular speaker, recording environment, or speaking style. The acoustic signals corresponding to utterance sections, non-utterance sections, or speaking styles that are highly similar to a cluster can therefore be identified and given a label for the acoustic feature quantity extraction performed by the feature quantity extraction unit 13 described later.
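The speaker-clustering procedure described above (average the frame-level features over each utterance section, then cluster the averages with K-means) can be sketched in plain Python as follows. This is a toy version under stated assumptions: the function names, the fixed iteration count, and the seeded random initialization are choices made for this example, and a production system would more likely use an established K-means implementation.

```python
import random
from typing import List, Sequence

Vector = List[float]


def utterance_mean(frames: Sequence[Sequence[float]]) -> Vector:
    """Average frame-level feature vectors over one utterance section."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Vector], k: int, iters: int = 20, seed: int = 0) -> List[int]:
    """Return a cluster index for each point (basic Lloyd iterations)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(list(points), k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = utterance_mean(members)
    return assign
```

Labels in the style of the text can then be derived as `[f"speaker {a + 1}" for a in kmeans(means, k)]`, where `means` holds one `utterance_mean` vector per utterance.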

Labeling methods (1) to (3) above may be combined to assign several kinds of labels. For example, the speaker label can be acquired automatically through user login authentication, the recording environment label can be specified by the user, and the speaking style label can be assigned by clustering.

In step S13, the feature quantity extraction unit 13 extracts acoustic feature quantities from the labeled target speech signal and the labeled reference speech signal, producing a labeled acoustic feature quantity sequence for each. The labeled target acoustic feature quantity sequence and reference acoustic feature quantity sequence are stored in the feature quantity storage unit 22. The acoustic feature quantities extracted are, for example, dimensions 1 to 12 of the Mel-Frequency Cepstrum Coefficients (MFCC) obtained by short-time frame analysis of the speech signal, dynamic parameters such as their dynamic features ΔMFCC and ΔΔMFCC, and power, Δpower, and ΔΔpower. Cepstral Mean Normalization (CMN) may also be applied to the MFCCs. The acoustic feature quantities are not limited to MFCC and power; other parameters used for speech recognition may be used.
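To make the Δ, ΔΔ, and CMN operations concrete, the following sketch applies them to a sequence of already-computed static feature vectors (for example MFCCs plus power). The source does not say how the dynamic features are calculated; this example assumes the commonly used regression formula over a window of ±2 frames with edge frames repeated, and all function names are hypothetical.

```python
from typing import List, Sequence

Frames = List[List[float]]


def cmn(frames: Sequence[Sequence[float]]) -> Frames:
    """Cepstral mean normalization: subtract the per-dimension utterance mean."""
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]


def delta(frames: Sequence[Sequence[float]], window: int = 2) -> Frames:
    """Regression-based dynamic features (an assumed, commonly used definition).

    d_t = sum_n n * (c_{t+n} - c_{t-n}) / (2 * sum_n n^2), n = 1..window,
    with the first and last frames repeated at the edges.
    """
    num_frames, dim = len(frames), len(frames[0])
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    out: Frames = []
    for t in range(num_frames):
        row = []
        for d in range(dim):
            num = 0.0
            for n in range(1, window + 1):
                nxt = frames[min(t + n, num_frames - 1)][d]
                prv = frames[max(t - n, 0)][d]
                num += n * (nxt - prv)
            row.append(num / denom)
        out.append(row)
    return out


def add_dynamic_features(frames: Sequence[Sequence[float]]) -> Frames:
    """Concatenate static, delta, and delta-delta features frame by frame."""
    d1 = delta(frames)
    d2 = delta(d1)
    return [list(s) + a + b for s, a, b in zip(frames, d1, d2)]
```

With 12 MFCCs and power as static input, `add_dynamic_features` would produce 39-dimensional vectors, which is a typical layout for the feature set listed in the text.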

The feature quantity storage unit 22 accumulates the labeled target acoustic feature quantity sequences and the labeled reference acoustic feature quantity sequences. As described above, a labeled target acoustic feature quantity sequence is a sequence of labeled acoustic feature quantities extracted from a target speech signal recorded in an environment that matches the speaker, noise type, speaking style, and so on of the recognition target task. A labeled reference acoustic feature quantity sequence is a sequence of labeled acoustic feature quantities extracted from a reference speech signal that does not belong to the recognition target task but is highly intelligible and available in large quantities. A labeled reference acoustic feature quantity sequence could be, for example, a sequence of acoustic feature quantities based on the speech signals recorded to build the original acoustic model used for speech recognition. It is desirable for the two acoustic feature quantity sequences to contain many utterances of the same words, or of words that are not identical but sound similar (for example, "genki" and "tenki"). The utterance content of each acoustic feature quantity sequence is assumed to be fully known.

In step S14, the feature quantity matching unit 14 matches, along the time axis, the labeled target acoustic feature quantity sequence and the labeled reference acoustic feature quantity sequence stored in the feature quantity storage unit 22, based on the degree of similarity between feature quantities in units of short-time frames. The matched target acoustic feature quantity sequence and reference acoustic feature quantity sequence are sent to the conversion model generation unit 15. The labeled target and reference acoustic feature quantity sequences have the same utterance content but different labels for the speaker, noise type, speaking style, and so on. In general, utterances with the same content can still differ in duration. Because training a neural network requires frame-by-frame correspondence even for identical utterance content, the sequences must be matched on the time axis before the neural network that converts acoustic feature quantities is trained. One matching method is the well-known dynamic time warping method, which is described in Seiichi Uchida, “DP Matching Overview: Basics and Various Extensions”, IEICE Technical Report, PRMU2006-166, pp. 31-36, 2006 (Reference 3). From the DP path produced by this matching, a time-series correspondence between highly similar feature quantities is obtained for the feature quantities contained in the two labeled acoustic feature quantity sequences.
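The frame-level matching step can be illustrated with a basic dynamic time warping routine that returns the DP path as a list of index pairs. This is a minimal sketch, assuming Euclidean distance between frames as the (dis)similarity measure and the simplest step pattern (diagonal, vertical, horizontal); the source does not fix these details, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def frame_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two frames (smaller means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_align(src: Sequence[Sequence[float]],
              tgt: Sequence[Sequence[float]]) -> List[Tuple[int, int]]:
    """Return the DP path as (src_index, tgt_index) pairs in time order."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(src[i - 1], tgt[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return path
```

Given a path, the matched sequences are simply `[src[i] for i, _ in path]` and `[tgt[j] for _, j in path]`, which have equal length and can serve as input/output pairs for training.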

In step S15, the conversion model generation unit 15 uses the matched target acoustic feature quantity sequence and reference acoustic feature quantity sequence to learn a conversion model that outputs an acoustic feature quantity sequence in which the acoustic characteristics of an input acoustic feature quantity sequence have been converted based on the correspondence between the target acoustic feature quantity sequence and the reference acoustic feature quantity sequence. In this embodiment the conversion model is a neural network, referred to below as the feature quantity conversion neural network. The feature quantity conversion neural network is stored in the conversion model storage unit 23. The neural network can be trained, for example, with the well-known error backpropagation method or stochastic gradient descent. Error backpropagation and stochastic gradient descent are described in Masahiro Araki, “Introduction to Machine Learning with Free Software” (フリーソフトではじめる機械学習入門), Morikita Publishing, 2014.

A specific method of training the feature amount conversion neural network is as follows. For example, for a feature amount conversion neural network that converts speaker A to speaker B, the utterances of speaker A and the utterances of speaker B are each converted into matched acoustic feature amount sequences. The network is then trained with speaker A's matched acoustic feature amount sequence as the input and speaker B's matched acoustic feature amount sequence, which corresponds to the input on the time axis, as the output. Feature amount conversion neural networks for other acoustic characteristics (for example, noise type or speaking style) are trained in the same way. For noise type, for example, utterances under noise type A and utterances under noise type B are each converted into matched acoustic feature amount sequences, and the feature amount conversion neural network is generated by training on those matched sequences. Although the description above generates one feature amount conversion neural network per acoustic characteristic, a feature amount conversion neural network that converts several characteristics jointly may also be generated. That is, a network may be generated that converts an acoustic feature amount sequence uttered by speaker A under noise type A into an acoustic feature amount sequence uttered by speaker B under noise type B. In this case, the network is trained with the matched acoustic feature amount sequence labeled with speaker A and noise type A as the input and the matched acoustic feature amount sequence labeled with speaker B and noise type B, which corresponds to the input on the time axis, as the output.
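As a non-limiting illustration of the training described above, the following Python sketch trains a small network on matched frame pairs by error backpropagation with stochastic gradient descent. The network size, learning rate, squared-error loss, and the toy data (in which the "speaker B" frames are simply a fixed transform of the "speaker A" frames) are assumptions made for this example and are not specified in the embodiment. All function names are hypothetical.

```python
import math
import random

def init_network(n_in, n_hidden, n_out, seed=0):
    """Create a one-hidden-layer network with small random weights."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {"W1": mat(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "W2": mat(n_out, n_hidden), "b2": [0.0] * n_out}

def forward(net, x):
    """Return (hidden activations, output frame) for one input frame."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(net["W1"], net["b1"])]
    y = [sum(w * hi for w, hi in zip(row, h)) + b
         for row, b in zip(net["W2"], net["b2"])]
    return h, y

def sgd_step(net, x, target, lr):
    """One backpropagation update on a single matched frame pair."""
    h, y = forward(net, x)
    err = [yi - ti for yi, ti in zip(y, target)]  # gradient of 0.5 * ||y - t||^2
    # Propagate the error to the hidden layer before W2 is modified.
    dh = [sum(err[k] * net["W2"][k][j] for k in range(len(err))) * (1.0 - h[j] ** 2)
          for j in range(len(h))]
    for k, e in enumerate(err):
        for j, hj in enumerate(h):
            net["W2"][k][j] -= lr * e * hj
        net["b2"][k] -= lr * e
    for j, d in enumerate(dh):
        for i, xi in enumerate(x):
            net["W1"][j][i] -= lr * d * xi
        net["b1"][j] -= lr * d
    return 0.5 * sum(e * e for e in err)

def train(net, pairs, epochs=200, lr=0.05, seed=0):
    """Train on (input frame, output frame) pairs; return mean loss per epoch."""
    rng = random.Random(seed)
    pairs = list(pairs)
    history = []
    for _ in range(epochs):
        rng.shuffle(pairs)
        total = sum(sgd_step(net, x, t, lr) for x, t in pairs)
        history.append(total / len(pairs))
    return history

# Toy matched data (assumption): frame t of "speaker A" pairs with frame t of "speaker B".
data_rng = random.Random(1)
src_frames = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(40)]
tgt_frames = [[0.5 * a + 0.2, -0.3 * b + 0.1 * a] for a, b in src_frames]

net = init_network(n_in=2, n_hidden=4, n_out=2)
history = train(net, zip(src_frames, tgt_frames))
```

In this sketch, a jointly converting network (for example, speaker A under noise type A to speaker B under noise type B) would be trained in the same way, with only the choice of matched input and output sequences changed.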

In step S16, the pseudo feature amount generation unit 16 converts an acoustic feature amount sequence extracted from a speech signal whose acoustic characteristics differ from those of the recognition target task, using a feature amount conversion neural network stored in the conversion model storage unit 23, and thereby generates a pseudo acoustic feature amount sequence. At that time, the feature amount conversion neural network whose type matches the pseudo acoustic feature amounts to be output is selected for the input acoustic feature amount sequence. For example, to convert the acoustic characteristics of speaker A into those of speaker B, the feature amount conversion neural network that converts speaker A to speaker B is selected. The generated pseudo acoustic feature amount sequence is stored in the pseudo feature amount storage unit 24.
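The selection of a conversion network by type in step S16 can be sketched, purely as an illustration, as a lookup keyed by the source and target labels. The registry, the label strings, and the stand-in conversion functions below are hypothetical; in the embodiment the stored models are trained feature amount conversion neural networks.

```python
from typing import Callable, Dict, List, Tuple

Frame = List[float]
Converter = Callable[[Frame], Frame]

# Hypothetical stand-in for the conversion model storage unit 23:
# each entry maps (source label, target label) to a frame-wise converter.
conversion_models: Dict[Tuple[str, str], Converter] = {
    ("speaker_A", "speaker_B"): lambda f: [0.5 * v + 0.1 for v in f],
    ("noise_A", "noise_B"): lambda f: [v - 0.2 for v in f],
}

def generate_pseudo_features(frames: List[Frame],
                             source_label: str,
                             target_label: str) -> List[Frame]:
    """Select the converter matching the requested type and apply it per frame."""
    key = (source_label, target_label)
    if key not in conversion_models:
        raise KeyError(f"no conversion model for {source_label} -> {target_label}")
    convert = conversion_models[key]
    return [convert(frame) for frame in frames]

# Convert two frames of speaker A into pseudo frames of speaker B.
pseudo = generate_pseudo_features([[1.0, 2.0], [0.0, -1.0]], "speaker_A", "speaker_B")
```

Raising an error when no matching model exists is a design choice of this sketch; the embodiment only states that the matching type is selected.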

Next, the processing procedure of the acoustic model adaptation method of the first embodiment is described with reference to FIG. 2.

The feature amount storage unit 22 stores the target acoustic feature amount sequence generated from the target speech signal of the recognition target task by the acoustic feature amount conversion method described above.

The pseudo feature amount storage unit 24 stores the pseudo acoustic feature amount sequence obtained by converting, with the acoustic feature amount conversion method described above, an acoustic feature amount sequence extracted from a speech signal whose acoustic characteristics differ from those of the recognition target task.

In step S17, the acoustic model learning unit 17 learns an acoustic model using the target acoustic feature amount sequence stored in the feature amount storage unit 22 and the pseudo acoustic feature amount sequence stored in the pseudo feature amount storage unit 24. GMM-HMMs and the like are used as acoustic models in speech recognition. Methods for adapting an acoustic model to a recognition target task are described in, for example, Koichi Shinoda, "Speaker Adaptation Techniques for Speech Recognition Using Probabilistic Models", IEICE Transactions, J87-D-II(2), pp. 371-386, 2004 (Reference 4).

As described above, even when sufficient speech signals for the recognition target task cannot be prepared, the acoustic feature amount conversion device and method of the first embodiment make it possible to prepare sufficient training data. They do so by converting the acoustic characteristics of speech signals whose acoustic characteristics differ from those of the recognition target task and thereby generating pseudo acoustic feature amounts. As a result, the recognition rate of the acoustic model adapted to the recognition target task improves.

[Second Embodiment]
The second embodiment consists of an acoustic feature amount conversion device and method that learn a neural network for converting acoustic feature amounts when complete parallel data does not exist and use that neural network to create pseudo training data, and an acoustic model adaptation device that adapts an acoustic model using that training data.

In the first embodiment, the feature amount conversion neural network is trained on parallel data, that is, utterances with the same content but different acoustic characteristics such as speaker, noise type, and speaking style. Such parallel data rarely exists, and even where it does exist it is difficult to collect in large quantities, so data sparseness can become a problem. The second embodiment therefore considers the use of non-parallel data. Alignment on the time axis by dynamic time warping or similar methods presupposes identical utterance content, that is, parallel data. For non-parallel data, the state number of each analysis frame of each utterance (a state number on the hidden Markov model described later) is estimated in advance, and the feature amount conversion neural network is generated by associating frames through those state numbers. Consequently, the processing of the feature amount matching unit in the second embodiment differs from that in the first embodiment.

As shown in FIG. 3, the acoustic feature amount conversion device 3 of the second embodiment includes, as in the first embodiment, the input terminal 10, the speech signal acquisition unit 11, the label assignment unit 12, the feature amount extraction unit 13, the conversion model generation unit 15, the pseudo feature amount generation unit 16, the acoustic model learning unit 17, the speech signal storage unit 21, the feature amount storage unit 22, the conversion model storage unit 23, and the pseudo feature amount storage unit 24. It additionally includes, for example, an utterance forced alignment unit 18 and a feature amount matching unit 19.

The acoustic feature amount conversion method of the second embodiment is described with reference to FIG. 4. The description below focuses on the differences from the first embodiment.

In step S18, the utterance forced alignment unit 18 performs forced alignment on the target acoustic feature amount sequence and the reference acoustic feature amount sequence stored in the feature amount storage unit 22, generating an aligned target acoustic feature amount sequence and an aligned reference acoustic feature amount sequence. The generated aligned sequences are sent to the feature amount matching unit 19. Forced alignment assumes that the utterance content of an acoustic feature amount sequence is known. It runs speech recognition against the correct text matching that content and, by observing the state transitions during recognition, assigns to the feature amounts of each input analysis frame the corresponding state number of a hidden Markov model (HMM). In speech recognition, hidden Markov models are often used for phoneme recognition, and state numbers are considered up to the triphone level. A triphone is a triple of phonemes that includes the phonemes before and after the phoneme to be classified. With triphones, three phonemes such as "a-k-a" are treated as one state number. Similarly, a monophone treats a single phoneme as one state number, and a biphone treats a pair of two phonemes as one state number. Forced alignment is carried out with the correct text using the Viterbi algorithm or the like. Hidden Markov models and the Viterbi algorithm in speech recognition are described in Shikano et al., "IT Text Speech Recognition System", Ohmsha, 2001.
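The following Python sketch illustrates one simple form of Viterbi forced alignment under stated assumptions: the state sequence implied by the correct text is already known, the model is strictly left-to-right (each frame either stays in the current state or advances to the next), transition probabilities are ignored, and per-frame log-likelihoods are supplied as a table. The function name and the numbers are hypothetical and are not taken from the embodiment.

```python
def forced_align(log_likes):
    """Assign each frame to one state of a known left-to-right state sequence.

    log_likes[t][s] is the log-likelihood of frame t under the s-th state of
    the sequence derived from the correct text. Returns one state index per
    frame, starting at state 0 and ending at the last state.
    """
    n_frames, n_states = len(log_likes), len(log_likes[0])
    if n_frames < n_states:
        raise ValueError("need at least one frame per state")
    neg_inf = float("-inf")
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = log_likes[0][0]  # the path must start in the first state
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s]
            advance = score[t - 1][s - 1] if s > 0 else neg_inf
            if advance > stay:
                best, back[t][s] = advance, s - 1
            else:
                best, back[t][s] = stay, s
            if best > neg_inf:
                score[t][s] = best + log_likes[t][s]
    # Trace back from the last state at the last frame.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

# Five frames scored against a three-state sequence (illustrative values).
log_likes = [
    [-1.0, -5.0, -9.0],
    [-1.0, -4.0, -9.0],
    [-6.0, -1.0, -7.0],
    [-8.0, -2.0, -1.0],
    [-9.0, -6.0, -1.0],
]
alignment = forced_align(log_likes)
```

Mapping each returned index back to the identifier of the corresponding state (for example, a triphone state) yields the per-frame state numbers that the subsequent matching step uses.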

In step S19, the feature amount matching unit 19 matches the aligned target acoustic feature amount sequence and the aligned reference acoustic feature amount sequence by the state numbers assigned to each. The matched target acoustic feature amount sequence and the matched reference acoustic feature amount sequence are sent to the conversion model generation unit 15. For example, from the aligned acoustic feature amount sequence of speaker A and the aligned acoustic feature amount sequence of speaker B, analysis frames whose utterance content differs but whose state numbers are the same are output as matched acoustic feature amount sequences. For instance, when "tenki" (天気, "weather") uttered by speaker A and "genki" (元気, "in good health") uttered by speaker B are compared at the phoneme level, the utterance content differs. However, in "t/e/ng/k/i" and "g/e/ng/k/i", phonemes such as "ng" are identical and have the same preceding and following phonemes, so they are assigned the same state number.
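The state-number matching of step S19 can be sketched as follows. This is an illustrative sketch only: the triphone-style labels follow the "a-k-a" notation used above, one frame per state is assumed for brevity, and the rule for pairing frames when the two utterances contain different numbers of frames for a state (cycling through the reference frames) is an assumption of this example, since the embodiment does not specify it.

```python
from collections import defaultdict

def match_by_state(frames_a, states_a, frames_b, states_b):
    """Pair frames of two aligned utterances that share the same state number."""
    by_state_b = defaultdict(list)
    for frame, state in zip(frames_b, states_b):
        by_state_b[state].append(frame)
    used = defaultdict(int)  # how many frames of each state were paired so far
    pairs = []
    for frame, state in zip(frames_a, states_a):
        candidates = by_state_b.get(state)
        if not candidates:
            continue  # state does not occur in the other utterance
        partner = candidates[used[state] % len(candidates)]
        used[state] += 1
        pairs.append((frame, partner))
    return pairs

# "tenki" by speaker A and "genki" by speaker B, one frame per state.
states_a = ["sil-t-e", "t-e-ng", "e-ng-k", "ng-k-i", "k-i-sil"]
states_b = ["sil-g-e", "g-e-ng", "e-ng-k", "ng-k-i", "k-i-sil"]
frames_a = [[0.1], [0.2], [0.3], [0.4], [0.5]]
frames_b = [[1.1], [1.2], [1.3], [1.4], [1.5]]

pairs = match_by_state(frames_a, states_a, frames_b, states_b)
```

Only the frames whose phoneme context is shared by both words are paired; the frames for the differing initial consonants and their neighbors are left out, which is why the matched sequences do not cover the whole utterance.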

The feature amount matching unit 19 thus obtains two matched acoustic feature amount sequences whose state-transition alignments coincide, although they do not cover the entire utterance. A state-transition alignment is not necessarily the same physical quantity as a correspondence on the time axis. However, because it likewise describes the correspondence between two acoustic feature amount sequences, it can be handled in the same way as a correspondence on the time axis.

As described above, even when sufficient parallel data cannot be prepared, the acoustic feature amount conversion device and method of the second embodiment can use matched acoustic feature amount sequences whose state-transition alignments coincide as a result of forced alignment. They can therefore obtain pseudo acoustic feature amount sequences similar to those of the acoustic feature amount conversion device and method of the first embodiment. Accordingly, as in the first embodiment, the recognition rate of the acoustic model adapted to the recognition target task improves.

The present invention is not limited to the embodiments described above, and it goes without saying that it may be modified as appropriate without departing from the spirit of the invention. The various processes described in the above embodiments need not be executed in time series in the order described. They may also be executed in parallel or individually, according to the processing capability of the device that executes them or as needed.

[Program, Recording Medium]
When the various processing functions of each device described in the above embodiments are implemented by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on a computer implements the various processing functions of each device on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed by, for example, selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it. The computer may also execute processing according to the received program sequentially each time the program is transferred to it from the server computer. Alternatively, the processing described above may be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are implemented only through execution instructions and result acquisition. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be implemented in hardware.

DESCRIPTION OF SYMBOLS
1, 3 Acoustic feature amount conversion device
2 Acoustic model adaptation device
10 Input terminal
11 Speech signal acquisition unit
12 Label assignment unit
13 Feature amount extraction unit
14, 19 Feature amount matching unit
15 Conversion model generation unit
16 Pseudo feature amount generation unit
17 Acoustic model learning unit
18 Utterance forced alignment unit
21 Speech signal storage unit
22 Feature amount storage unit
23 Conversion model storage unit
24 Pseudo feature amount storage unit

Claims

An acoustic feature amount conversion device including:
a feature amount extraction unit that generates a target acoustic feature amount sequence extracted from a target speech signal, including an unclean speech signal, related to a task to be recognized, and a reference acoustic feature amount sequence extracted from a reference speech signal, including an unclean speech signal, whose utterance content corresponds to that of the target speech signal;
an utterance forced alignment unit that performs speech recognition based on a probability model on the target acoustic feature amount sequence and the reference acoustic feature amount sequence and assigns state numbers of the probability model;
a feature amount matching unit that generates a matched target acoustic feature amount sequence and a matched reference acoustic feature amount sequence in which the correspondence between the target acoustic feature amount sequence and the reference acoustic feature amount sequence has been matched based on phoneme sequences whose transitions of the state numbers coincide;
a conversion model generation unit that uses the matched target acoustic feature amount sequence and the matched reference acoustic feature amount sequence to learn a conversion model that outputs an acoustic feature amount sequence obtained by converting the acoustic characteristics of input acoustic feature amounts based on the correspondence; and
a pseudo feature amount generation unit that uses the conversion model to generate a pseudo acoustic feature amount sequence by converting the acoustic characteristics of an acoustic feature amount sequence extracted from a speech signal whose acoustic characteristics differ from those of the task.

The acoustic feature amount conversion device according to claim 1,
wherein the conversion model is a neural network that converts the acoustic characteristics of the input acoustic feature amounts based on the correspondence between the target acoustic feature amount sequence and the reference acoustic feature amount sequence.

An acoustic model adaptation device including:
an acoustic feature amount storage unit that stores a target acoustic feature amount sequence extracted from a target speech signal related to a task to be recognized;
a pseudo acoustic feature amount storage unit that stores a pseudo acoustic feature amount sequence obtained by converting, using a conversion model generated by the acoustic feature amount conversion device according to claim 1 or 2, the acoustic characteristics of an acoustic feature amount sequence extracted from a speech signal whose acoustic characteristics differ from those of the task; and
an acoustic model learning unit that learns an acoustic model using the target acoustic feature amount sequence and the pseudo acoustic feature amount sequence.

An acoustic feature amount conversion method including:
a feature amount extraction step in which a feature amount extraction unit generates a target acoustic feature amount sequence extracted from a target speech signal, including an unclean speech signal, related to a task to be recognized, and a reference acoustic feature amount sequence extracted from a reference speech signal, including an unclean speech signal, whose utterance content corresponds to that of the target speech signal;
an utterance forced alignment step in which an utterance forced alignment unit performs speech recognition based on a probability model on the target acoustic feature amount sequence and the reference acoustic feature amount sequence and assigns state numbers of the probability model;
a feature amount matching step in which a feature amount matching unit generates a matched target acoustic feature amount sequence and a matched reference acoustic feature amount sequence in which the correspondence between the target acoustic feature amount sequence and the reference acoustic feature amount sequence has been matched based on phoneme sequences whose transitions of the state numbers coincide;
a conversion model generation step in which a conversion model generation unit uses the matched target acoustic feature amount sequence and the matched reference acoustic feature amount sequence to learn a conversion model that outputs an acoustic feature amount sequence obtained by converting the acoustic characteristics of input acoustic feature amounts based on the correspondence; and
a pseudo feature amount generation step in which a pseudo feature amount generation unit uses the conversion model to generate a pseudo acoustic feature amount sequence by converting the acoustic characteristics of an acoustic feature amount sequence extracted from a speech signal whose acoustic characteristics differ from those of the task.

An acoustic model adaptation method in which
an acoustic feature amount storage unit stores a target acoustic feature amount sequence extracted from a target speech signal related to a task to be recognized, and
a pseudo acoustic feature amount storage unit stores a pseudo acoustic feature amount sequence obtained by converting, using a conversion model generated by the acoustic feature amount conversion method according to claim 4, the acoustic characteristics of an acoustic feature amount sequence extracted from a speech signal whose acoustic characteristics differ from those of the task,
the method including an acoustic model learning step in which an acoustic model learning unit learns an acoustic model using the target acoustic feature amount sequence and the pseudo acoustic feature amount sequence.

A program for causing a computer to function as the acoustic feature amount conversion device according to claim 1 or 2 or the acoustic model adaptation device according to claim 3.