JP6831767B2 - Speech recognition method, device, and program

Info

Publication number: JP6831767B2
Application number: JP2017198997A
Authority: JP
Inventors: 信行西澤
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2017-10-13
Filing date: 2017-10-13
Publication date: 2021-02-17
Anticipated expiration: 2037-10-13
Also published as: JP2019074580A

Description

The present invention relates to a speech recognition method, device, and program, and in particular to a speech recognition method, device, and program that transform the input speech in advance, with a small amount of computation, before applying it to an existing speech recognition algorithm so that the recognition rate of that algorithm improves.

As features for pattern matching in speech recognition, it is common to cut out sections of a speech waveform (for example, a speech waveform represented as discrete-time series data sampled at roughly 8 kHz to 32 kHz; the same applies hereinafter) that are several tens of milliseconds to a hundred and several tens of milliseconds long, and to use vectors that represent the spectral envelope characteristics of those sections.

One example of such a spectral envelope characteristic is the frequency versus log-power characteristic obtained by connecting, with a smooth curve, the peak values of the harmonic components (which arise mainly from the periodicity of the speech waveform) in the log power spectrum obtained by, for example, taking the logarithm of a discrete Fourier transform result.

Mel-frequency cepstral coefficients (MFCC) are one such vector. Hereinafter, a vector representing the characteristics of the speech in a section centered on a given time is referred to as an acoustic feature vector.

A speech recognition system computes this acoustic feature vector at time intervals of about several tens of milliseconds (the section cut out to compute an acoustic feature vector is usually longer than the interval between successive cut-outs, so the sections overlap), performs pattern matching on the resulting time series, and outputs a speech recognition result. Such speech feature vectors are known to represent the phonemic content of speech particularly well while being relatively insensitive to differences in speaker identity and fundamental frequency.

Patent Document 1: Japanese Unexamined Patent Publication No. 2008-58696

Speech feature vectors such as MFCC are relatively insensitive to speaker identity, but they are still affected by it. Therefore, when the speaker whose speech was used to create the patterns for pattern matching differs from the speaker actually being recognized, the speaker-dependent influence that appears in the speech feature vectors can hinder pattern matching. For this reason, speaker-independent speech recognition systems do not model single points in the speech feature vector space; instead, they use the voices of many people as training data and model the distribution of the speech feature vectors, so that the speech of a variety of people can be recognized.

However, even with such a method it is difficult to cover the speech characteristics of every person, and there are cases in which speech recognition is difficult. Therefore, when the distribution of the feature vectors of the input speaker's speech is known, an affine transformation or the like is sometimes applied to the model distributions or to the feature vectors, transforming them so that the speech can be recognized. This method is called speaker adaptation. A similar problem can arise even for the same speaker when the transfer characteristics of the speech differ, and it can be handled by the same method.

However, the input of a speech recognition system is generally a speech waveform and its output is a speech recognition result. Except in some special systems, such as distributed speech recognition (DSR) systems in which the client extracts speech feature vectors and sends them to a server, the speech feature vectors need not be accessible from outside.

Consequently, in many cases the speech feature vector extraction is completely embedded in the speech recognition system. When an application uses an existing speech recognition system, the application cannot freely apply the above affine transformation unless that system provides input/output functions for speech feature vectors. Likewise, the application cannot use an affine transformation function unless an interface for controlling the transformation is exposed.

Furthermore, even if such an interface is available, it is usually not standardized, except in special cases such as the DSR systems mentioned above. When the application replaces the speech recognition system, the interface part must therefore be rebuilt as well.

One conceivable approach to these technical problems is to transform the speech waveform into a form that is easier to recognize using the voice quality conversion technique described in Patent Document 1, and to input the transformed speech waveform to the speech recognition system. Many voice quality conversion techniques first extract features in the same way as speech recognition, apply a suitable transformation to them, and then synthesize a speech waveform from the transformed features by signal processing, thereby converting the voice quality of the input speech.

Such a method can improve the speech recognition rate without changing the speech recognition system. However, the waveform synthesis step in voice quality conversion requires a computationally expensive filter, so there has been a technical problem that its computational cost is relatively high.

An object of the present invention is to solve the above technical problem and to provide a speech recognition method, device, and program that, in a system that uses voice quality conversion to transform speech features in advance into a form that is easier to recognize, reduce the overall amount of computation by eliminating the relatively expensive processing required for voice quality conversion.

To achieve the above object, the present invention is a speech recognition device that transforms the voice quality of input speech before speech recognition, and it is characterized by the following configurations.

(1) The device comprises means for extracting features from the input speech, means for transforming the features, and means for generating a speech waveform based on the transformed features, wherein the means for generating the speech waveform does not reproduce features that are not considered in the speech recognition process.

(2) The device further comprises means for performing speech recognition based on the generated speech waveform.

(3) The means for generating the speech waveform generates a speech waveform whose fundamental period differs from that of the input speech.

(4) The means for generating the speech waveform generates a speech waveform whose fundamental period is equal to the processing section length of the waveform generation processing or to that length divided by an integer.

(5) The means for generating the speech waveform generates the speech waveform by processing equivalent to summing a plurality of sine functions.

(6) The means for generating the speech waveform generates a speech waveform in which one period of the speech waveform is repeated a predetermined number of times.

According to the present invention, the following effects are achieved.

(1) When a speech waveform with a high speech recognition rate is generated from the input speech, features that are not considered in the subsequent speech recognition are not reproduced; instead, features that prioritize a reduction in computation are adopted. The speech recognition rate can therefore be improved while suppressing the increase in computation.

(2) If the speech recognition execution unit is provided in the subsequent stage as an integrated configuration, then, when voice quality conversion is used to improve the speech recognition rate, a speech recognition device can be configured with a smaller increase in computation than when a voice quality conversion device that outputs speech waveforms intended for human listening is connected in cascade with a speech recognition device that takes speech waveforms as input.

(3) If the speech recognition execution unit is separated, speech recognition with a high recognition rate can be realized using an existing, general-purpose speech recognition device.

(4) The fundamental period of the input speech is not considered in speech recognition, and setting the fundamental period to a predetermined value reduces the computation required to generate the speech waveform. The speech recognition rate can therefore be improved while suppressing the increase in computation.

(5) Because the generated speech waveform has a fundamental period equal to the processing section length of the waveform generation processing or to that length divided by an integer, the discrete-time Fourier transform and its inverse can be realized by the fast Fourier transform.

(6) Because the speech waveform is generated by processing equivalent to summing a plurality of sine functions, the concentration of harmonic energy at specific times that occurs when cosine functions are summed can be prevented, and speech recognition with a higher signal-to-noise ratio becomes possible for the same number of quantization bits of the speech waveform.

(7) Because a speech waveform in which one period is repeated a predetermined number of times is generated, computation can be reduced by not performing the waveform generation processing during the repetitions.

FIG. 1 is a functional block diagram showing the configuration of the main part of a speech recognition device according to one embodiment of the present invention.
FIG. 2 is a diagram showing the relationship between normalized frequency (horizontal axis) and log power spectrum (vertical axis) for the discrete-time Fourier transform of a window function.
FIG. 3 is a diagram showing the relationship between normalized frequency (horizontal axis) and log power spectrum (vertical axis) when the speech waveform is a periodic waveform of frequency f0 = fs/N.
FIG. 4 is a functional block diagram showing the configuration of the main part of a speech recognition device according to another embodiment of the present invention.

Embodiments of the present invention are described in detail below with reference to the drawings. An overview of the invention is given first, followed by a specific description of the embodiments.

In speech recognition, the acoustic feature vector extracted from the input speech often consists only of values corresponding to the spectral envelope characteristics. In that case, direct information about the fundamental frequency of the periodic speech waveform is discarded. Therefore, for such a speech recognition system, the fundamental frequency of the input speech need not accurately represent the fundamental frequency of the human voice, and a frequency that is convenient for the waveform synthesis processing may be used as the fundamental frequency.

Such processing can be realized by performing the speech waveform synthesis with a discrete-time Fourier transform whose length in the time domain equals the fundamental period of the synthesized speech waveform. Setting the window length (processing section length) of the discrete-time Fourier transform to a number of samples that is a power of two then allows the transform and its inverse to be realized by the fast Fourier transform.

In this embodiment, a suitable window function such as a rectangular window or a Hanning window is used to truncate the discrete-time Fourier transform to a finite duration (number of samples). In the following, the sampling frequency is denoted fs and the time-domain window length of the discrete-time Fourier transform is N points.

Here, let f0 = fs/N, and suppose that in the frequency domain the speech consists only of components at integer multiples of f0 (where the multiplier k satisfies |k| < fs/(2×f0)); this corresponds to the speech waveform being a stationary periodic waveform of frequency f0. Then the N sample values of the windowed discrete-time signal are obtained exactly from the values at the points k×f0 in the frequency domain by an N-point inverse discrete Fourier transform, and if N is a power of two the fast Fourier transform is readily applied to that computation. The reason is as follows.

In general, windowing in the discrete-time domain is equivalent to convolution with the discrete-time Fourier transform of the window function in the frequency domain. When the window length is N samples, the discrete-time Fourier transform of the window function is zero at the points n×fs/N (n: a nonzero integer).

On the normalized frequency axis (= f/fs), the amplitude is zero at the points n/N. For example, FIG. 2 shows the power spectrum of the Fourier transform of a rectangular window function with N = 16; the horizontal axis is normalized frequency and the vertical axis is the log power spectrum (dB). It can be seen that the power spectrum is zero (infinitely small on the logarithmic axis) at the normalized frequencies that are multiples of 1/16, excluding 0.
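
The zeros described for FIG. 2 can be checked numerically. The following is a minimal sketch, not part of the original disclosure, that evaluates the discrete-time Fourier transform of a length-N rectangular window directly from its definition; the function name `rect_window_dtft_power` is hypothetical.

```python
import cmath
import math


def rect_window_dtft_power(n_points: int, norm_freq: float) -> float:
    """Power |W(f)|^2 of the DTFT of a length-N rectangular window.

    norm_freq is the normalized frequency f/fs.
    """
    w = sum(cmath.exp(-2j * math.pi * norm_freq * n) for n in range(n_points))
    return abs(w) ** 2


N = 16
for m in range(4):
    on_grid = rect_window_dtft_power(N, m / N)           # multiple of 1/N
    off_grid = rect_window_dtft_power(N, (m + 0.5) / N)  # halfway between
    print(f"f/fs = {m}/{N}: {on_grid:.3e}    f/fs = {m + 0.5}/{N}: {off_grid:.3e}")
```

At f/fs = 0 the power is N squared; at every other multiple of 1/N it vanishes up to floating-point rounding, while the values halfway between stay clearly nonzero.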

On the other hand, when the speech waveform is a periodic waveform of frequency f0 = fs/N and is assumed stationary, the frequency domain contains only its harmonic components, that is, components at integer multiples of f0 (= fs/N), which form a sum of line spectra. On the normalized frequency axis, the waveform therefore consists only of components at integer multiples of 1/N, for example as shown in FIG. 3.

Now consider the convolution of the discrete-time Fourier transform of the N-point window function with the discrete-time Fourier transform of a stationary periodic waveform of frequency f0. The power spectrum at a normalized frequency k/N (k: integer) is determined by the power of the line spectrum at that frequency. This is because the components that arise from convolution with any other line spectrum are exactly zero at the frequencies where the remaining line spectra exist, so the value there is unaffected by them.

In other words, the power spectrum at normalized frequency k/N is determined solely by the harmonic component of the periodic waveform at that same frequency. Conversely, when the period is N samples, the periodic waveform corresponding to a power spectrum is determined solely by the values at the normalized frequencies k/N.

From the above, when the period of the periodic waveform equals the window length, windowing in the time domain has no effect. Stationarity of the periodic waveform is assumed here; if N corresponds to a short time of about a few milliseconds, the waveform may be regarded as stationary for practical purposes, given how quickly speech typically changes over time.
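
The argument above can be illustrated with a small numerical check. This is a sketch under stated assumptions rather than the patented implementation: the harmonic amplitudes are made-up values, the helper names `dft` and `periodic_frame` are hypothetical, and a plain O(N^2) DFT is used for clarity.

```python
import cmath
import math


def dft(x):
    """Plain N-point DFT of a real or complex sequence."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
            for k in range(n)]


def periodic_frame(amps, n_points):
    """One period (N samples) of a waveform whose k-th harmonic, at k*fs/N,
    has amplitude amps[k-1]. No window multiplication is applied, which
    corresponds to a rectangular window of length N."""
    return [sum(a * math.sin(2 * math.pi * k * i / n_points)
                for k, a in enumerate(amps, start=1))
            for i in range(n_points)]


N = 16
amps = [1.0, 0.5, 0.25, 0.125]  # hypothetical harmonic amplitudes
spectrum = dft(periodic_frame(amps, N))

# Bin k carries only harmonic k: |X[k]| = amps[k-1] * N / 2.
recovered = [2 * abs(spectrum[k]) / N for k in range(1, len(amps) + 1)]
print(recovered)
```

Each recovered amplitude matches the amplitude that was put in, and the bins above the highest harmonic stay at zero up to rounding, which is the "no windowing effect" property in numerical form.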

Therefore, if the period of the periodic waveform is known and stationarity can be assumed, the effect of the convolution need not be considered when choosing the window function, and a rectangular window can be used. Using a rectangular window means that the multiplications for windowing are in practice unnecessary.

From the standpoint of processing speed and time resolution, a shorter N is better, and f0, which determines N, should be as large (as high in frequency) as the speech recognition system can handle. For example, when fs is 16 kHz, setting N to 64 gives f0 = 250 Hz. This is within the range of normally observed human fundamental frequencies, and many speech recognition systems accept periodic waveforms with such a fundamental frequency as input. N also determines the frequency resolution, which decreases as N decreases; however, considering the MFCC order commonly used in speech recognition systems (for example, about 12), N = 64 is large enough to represent the spectral envelope characteristics that MFCCs of that order can express.

In this case N corresponds to 4 ms, which is sufficiently shorter than the analysis period of several tens of milliseconds typical in speech recognition. Even if the converted speech waveform is generated at a 4 ms period, for example so that the sections cut out by the rectangular window are contiguous, the effect on the representation of temporal change is small.
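
The figures quoted above follow directly from fs and N. The helper below is a hypothetical convenience function, not something defined in the original text; it also reports the largest harmonic index k allowed by the condition |k| < fs/(2×f0).

```python
def frame_parameters(fs_hz: float, n_points: int):
    """Return (f0 in Hz, frame length in ms, largest harmonic index k)."""
    f0_hz = fs_hz / n_points               # f0 = fs / N
    frame_ms = 1000.0 * n_points / fs_hz   # duration of N samples
    max_harmonic = (n_points - 1) // 2     # largest k with k < N / 2
    return f0_hz, frame_ms, max_harmonic


print(frame_parameters(16000, 64))  # (250.0, 4.0, 31)
```

With fs = 16 kHz and N = 64, the synthesized waveform has a 250 Hz fundamental, each frame lasts 4 ms, and up to 31 harmonics fit below the Nyquist frequency.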

Furthermore, when the time resolution of the waveform generation in the voice quality conversion output is higher than the time resolution of the speech recognition system, lowering it has little effect. The amount of processing can be reduced by outputting one period of the waveform repeatedly and, in the meantime, not performing waveform generation processing that involves a discrete Fourier transform. With naive processing, however, a periodic waveform that has continued for several milliseconds to several tens of milliseconds switches abruptly to the next one. To avoid this effect, a method generally called overlap-add is effective: the repetition of the waveform before the switch is extended past the switch, the repetition of the waveform after the switch is extended back before it, so that the two overlap, and a weighted sum of the two is computed in the overlapping section so that one waveform gradually changes into the other.
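
A minimal sketch of such a gradual switch is given below. It is one possible reading of the overlap-add idea, with assumptions labeled: the cross-fade weights are linear, the overlap is placed immediately after the switch point, and the function name `repeat_with_crossfade` and its parameters are hypothetical.

```python
def repeat_with_crossfade(period_a, period_b, repeats, overlap):
    """Repeat period_a `repeats` times, then period_b `repeats` times,
    cross-fading linearly over `overlap` samples after the switch point.

    Both periods have the same length N, so extending either waveform by
    repetition is just indexing modulo N.
    """
    n = len(period_a)
    if len(period_b) != n or not 0 < overlap <= n * repeats:
        raise ValueError("periods must match and overlap must fit in one run")
    switch = n * repeats
    out = []
    for i in range(2 * switch):
        a = period_a[i % n]  # earlier waveform, extended past the switch
        b = period_b[i % n]  # later waveform, extended by repetition
        if i < switch:
            w = 0.0
        elif i < switch + overlap:
            w = (i - switch + 1) / (overlap + 1)  # linear ramp, assumed
        else:
            w = 1.0
        out.append((1.0 - w) * a + w * b)  # weighted sum in the overlap
    return out


faded = repeat_with_crossfade([1.0] * 4, [0.0] * 4, repeats=2, overlap=3)
print(faded[6:12])  # [1.0, 1.0, 0.75, 0.5, 0.25, 0.0]
```

Only the two single-period waveforms need to be synthesized; everything else is indexing and a weighted sum, which is where the saving in inverse-transform work comes from.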

In speech waveform synthesis, let S(i, ω) (ω: angular frequency) denote the log power spectrum of the spectral envelope characteristic at time i. The synthesized speech x(i) is then obtained by summing cosine functions of the harmonic components, as in equation (1).

x(i) = Σ_k {exp(S(i, 2π×k×fs/N)/2) × cos(2π×k×fs/N×i)} …(1)

In this case, the energy of all harmonic components concentrates at the times i = n×N (n: integer), and the amplitude of x(i) becomes very large there. If the input level to the speech recognition system is set with reference to these points, the amplitude is relatively small everywhere else, which is disadvantageous in terms of signal-to-noise ratio.

In this embodiment, therefore, the synthesized speech x(i) is obtained by summing sine functions, as in equation (2).

x(i) = Σ_k {exp(S(i, 2π×k×fs/N)/2) × sin(2π×k×fs/N×i)} …(2)
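
The difference between equations (1) and (2) can be seen by comparing peak amplitudes. The sketch below makes several assumptions that are not in the original: i is treated as a sample index so that the phase of the k-th harmonic is 2πki/N, the envelope is frozen over the frame, and a flat log power spectrum is used as a made-up test input. The name `synthesize` is hypothetical.

```python
import math


def synthesize(log_power, n_points, use_sine):
    """One period of the harmonic sum with amplitudes exp(S_k / 2).

    log_power[k-1] is the log power S of the envelope at the k-th harmonic.
    """
    wave = math.sin if use_sine else math.cos
    return [sum(math.exp(s / 2) * wave(2 * math.pi * k * i / n_points)
                for k, s in enumerate(log_power, start=1))
            for i in range(n_points)]


N = 64
log_power = [0.0] * 31  # flat envelope, hypothetical test input
cos_wave = synthesize(log_power, N, use_sine=False)
sin_wave = synthesize(log_power, N, use_sine=True)

peak_cos = max(abs(v) for v in cos_wave)
peak_sin = max(abs(v) for v in sin_wave)
energy_cos = sum(v * v for v in cos_wave)
energy_sin = sum(v * v for v in sin_wave)
print(peak_cos, peak_sin, energy_cos, energy_sin)
```

Both waveforms carry the same energy per period, but the cosine sum reaches the full sum of the amplitudes at i = 0, whereas the sine sum peaks lower, leaving more headroom for the rest of the waveform at a given quantization level.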

Many speech recognition systems use only the power spectrum for speech feature extraction and do not consider the phase components of the speech waveform. Inputting a waveform whose phase has been changed in this way therefore does not affect the speech recognition system. When N is a power of two, these calculations can be performed easily and quickly by the fast Fourier transform.
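
As an illustration of the last point, the sine sum of equation (2) for one frame can be produced by a single N-point inverse FFT. This is a sketch only: the recursive radix-2 routine is a textbook implementation chosen for brevity, the amplitudes are made-up values, i is again assumed to be a sample index, and the function names are hypothetical.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    tw = [cmath.exp(-2j * math.pi * k / n) * odd[k] for k in range(n // 2)]
    return ([even[k] + tw[k] for k in range(n // 2)]
            + [even[k] - tw[k] for k in range(n // 2)])


def ifft(spec):
    """Inverse FFT computed from the forward FFT by conjugation."""
    n = len(spec)
    return [v.conjugate() / n for v in fft([s.conjugate() for s in spec])]


def synthesize_period_fft(amps, n_points):
    """One period of sum_k amps[k-1] * sin(2*pi*k*i/N) via an inverse FFT.

    Requires len(amps) < N / 2. A sine of amplitude a at bin k corresponds
    to X[k] = -j*a*N/2 and X[N-k] = +j*a*N/2.
    """
    spec = [0j] * n_points
    for k, a in enumerate(amps, start=1):
        spec[k] = -0.5j * a * n_points
        spec[n_points - k] = 0.5j * a * n_points
    return [v.real for v in ifft(spec)]


N = 64
amps = [1.0, 0.6, 0.3]  # hypothetical harmonic amplitudes
frame = synthesize_period_fft(amps, N)
direct = [sum(a * math.sin(2 * math.pi * k * i / N)
              for k, a in enumerate(amps, start=1)) for i in range(N)]
print(max(abs(f - d) for f, d in zip(frame, direct)))
```

The printed difference between the FFT-based frame and the direct sum is at the level of floating-point rounding. Switching to cosine phase would only change the phase of each spectral value, so the power spectrum seen by a recognizer is unchanged.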

FIG. 1 is a functional block diagram showing the configuration of the main part of a speech recognition device 1 according to one embodiment of the present invention. The device comprises a voice quality conversion unit 2 that converts the voice quality of the input speech and a speech recognition unit 3 that performs speech recognition on the converted speech. An existing speech recognition system, which takes a speech waveform as input and outputs a speech recognition result, can be used as the speech recognition unit 3.

The speech recognition device 1 or the voice quality conversion unit 2 can be configured by installing an application (program) that realizes each function on a general-purpose computer or server. Alternatively, it can be configured as a dedicated or single-function machine in which part of the application is implemented in hardware or ROM.

The voice quality conversion unit 2 comprises a speech feature vector extraction unit 21, a speech feature transformation unit 22, and a speech waveform generation unit 23. The speech feature vector extraction unit 21 extracts time-series data of speech feature vectors from the input speech waveform and outputs it. In this embodiment, the unit includes a speech feature combining unit 211 that combines the speech features for a plurality of times into one vector, so that a single vector represents the temporal change of the speech features.

For example, let v(t), v(t+1), and v(t+2) be the three vectors representing the speech features at consecutive times t, t+1, and t+2. The speech feature combining unit 211 concatenates these three vectors and transforms the result with a preset transformation matrix W to obtain the vector v'(t) given by equation (3), which it can output as the speech feature vector. Here ^T denotes transposition.

v'(t) = W [v(t) v(t+1) v(t+2)]^T …(3)

With this configuration, a speech feature vector can be constructed that also takes into account short-term temporal changes of the speech features (here, across three vectors).
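
Equation (3) amounts to stacking three frames and applying one matrix product. The sketch below uses plain lists; the function name `combine_features` and the example matrix are hypothetical. In this made-up W, the first two rows copy the center frame and the last two rows form half the difference between the outer frames, which is only one of many possible choices.

```python
def combine_features(v_t, v_t1, v_t2, w_matrix):
    """Compute v'(t) = W [v(t) v(t+1) v(t+2)]^T for list-based vectors."""
    stacked = list(v_t) + list(v_t1) + list(v_t2)
    if any(len(row) != len(stacked) for row in w_matrix):
        raise ValueError("each row of W must match the stacked length")
    return [sum(w * s for w, s in zip(row, stacked)) for row in w_matrix]


# Hypothetical 4x6 matrix for 2-dimensional frame features.
W = [
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],   # center frame, component 0
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],   # center frame, component 1
    [-0.5, 0.0, 0.0, 0.0, 0.5, 0.0],  # slope of component 0
    [0.0, -0.5, 0.0, 0.0, 0.0, 0.5],  # slope of component 1
]
print(combine_features([1.0, 10.0], [2.0, 20.0], [3.0, 30.0], W))
# [2.0, 20.0, 1.0, 10.0]
```

The output dimension is set by the number of rows of W, so the combined vector need not have three times the dimension of a single frame.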

The speech feature transformation unit 22 converts the speech feature vector for each time output by the speech feature vector extraction unit 21, based on separately input conversion control information for speaker adaptation, into a speech feature vector that is expected to improve the recognition rate, and outputs it as the feature vector needed to synthesize the speech waveform.

The format of the output feature vector may be the same as or different from that of the input feature vector. The transformation by the speech feature transformation unit 22 can be realized by, for example, an affine transformation. In this case, speech data of a specific speaker with a high speech recognition rate is prepared in advance, and the voice quality of the input speaker is converted to the voice quality of that specific speaker.
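
An affine transformation of a feature vector has the form y = A x + b. The following sketch shows only the mechanics; the matrix and offset are made-up numbers, not parameters estimated from any speaker, and the function name `affine_transform` is hypothetical. How A and b are derived from the specific speaker's data is not covered here.

```python
def affine_transform(feature, a_matrix, b_vector):
    """Return A x + b for a list-based feature vector x."""
    if len(a_matrix) != len(b_vector):
        raise ValueError("A and b must have the same number of rows")
    return [sum(a * x for a, x in zip(row, feature)) + b
            for row, b in zip(a_matrix, b_vector)]


A = [[2.0, 0.0],
     [0.0, 0.5]]   # hypothetical scaling
b = [0.1, -1.0]    # hypothetical offset
print(affine_transform([1.0, 2.0], A, b))
```

Because A need not be square, the same form also covers the case in which the output feature vector has a different dimension from the input.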

Alternatively, a plurality of transformations expected to improve the recognition rate of the speech recognition unit 3 may be prepared in advance, for example by creating them from the speech data of other speakers. Each prepared transformation is then applied experimentally to (a plurality of) speech data items from the target speaker for which the correct answers are known, the recognition rate of the speech recognition unit 3 is measured on each result, and the transformation with the highest recognition rate is selected. The conversion information can be determined in this way.
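
The selection procedure can be sketched as a simple search over candidates. Everything named here is an assumption for illustration: `select_best_transform` is hypothetical, the recognizer is passed in as a callable because the real speech recognition unit is a black box, and the recognition rate is taken to be the fraction of utterances recognized exactly, which is only one possible measure.

```python
def select_best_transform(candidates, labeled_utterances, recognize):
    """Pick the candidate conversion with the highest recognition rate.

    candidates: dict mapping a name to a callable waveform -> waveform.
    labeled_utterances: list of (waveform, correct_text) pairs.
    recognize: callable waveform -> recognized text (the black-box recognizer).
    """
    def recognition_rate(convert):
        correct = sum(1 for wave, text in labeled_utterances
                      if recognize(convert(wave)) == text)
        return correct / len(labeled_utterances)

    rates = {name: recognition_rate(fn) for name, fn in candidates.items()}
    best = max(rates, key=rates.get)
    return best, rates


# Toy stand-ins: "waveforms" are numbers and the recognizer is a threshold.
toy_recognize = lambda wave: "a" if wave > 0 else "b"
toy_data = [(1, "a"), (2, "a"), (-1, "a")]
toy_candidates = {
    "identity": lambda w: w,
    "negate": lambda w: -w,
    "shift": lambda w: w + 5,
}
print(select_best_transform(toy_candidates, toy_data, toy_recognize)[0])  # shift
```

In practice each candidate would be a full voice quality conversion and the callable would wrap the actual speech recognition unit 3.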

The speech waveform generation unit 23 obtains the spectral envelope characteristics of the speech from the feature vector output by the speech feature transformation unit 22, synthesizes a speech waveform that reproduces those characteristics, and sends the result to the speech recognition unit 3.

In this embodiment, as described above, a fundamental period determination unit 231 is provided in view of the fact that the fundamental period of the input speech is not considered in the speech recognition by the speech recognition unit 3. During speech synthesis, the fundamental period determination unit 231 does not reproduce the fundamental period of the input; it converts it to a fundamental period to which the fast Fourier transform is easily applied. That is, in this embodiment a speech waveform is generated whose fundamental period is equal to the processing section length N of the waveform generation processing or to N divided by an integer.
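
When the fundamental period is N divided by an integer m, the fundamental frequency is m×fs/N and every harmonic still falls on a point of the N-point transform grid. The helpers below are hypothetical and only enumerate which bins are occupied; the parameter name `divisor` stands for m.

```python
def fundamental_hz(fs_hz: float, n_points: int, divisor: int) -> float:
    """Fundamental frequency when the period is N / divisor samples."""
    return divisor * fs_hz / n_points


def harmonic_bins(n_points: int, divisor: int) -> list[int]:
    """Indices of the N-point transform bins (below N/2) that carry harmonics
    when the fundamental period is N / divisor samples."""
    if divisor < 1:
        raise ValueError("divisor must be a positive integer")
    return list(range(divisor, (n_points + 1) // 2, divisor))


print(fundamental_hz(16000, 64, 2), harmonic_bins(64, 2)[:5])
# 500.0 [2, 4, 6, 8, 10]
```

With m = 1 every bin from 1 to 31 is used; with m = 2 only the even bins are used and the fundamental doubles to 500 Hz, while the transform length stays at N = 64.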

また、本実施形態では音声波形生成部２３に音声波形加工部２３２を設け、１周期の音声波形を所定回数繰り返す音声波形を生成するようにした。したがって、その間は音声波形生成処理を行わなくすることで計算量を削減できるようになる。 Further, in the present embodiment, the voice waveform processing unit 232 is provided in the voice waveform generation unit 23 so as to generate a voice waveform in which one cycle of the voice waveform is repeated a predetermined number of times. The amount of calculation can therefore be reduced, because no voice waveform generation processing needs to be performed while the cycle is being repeated.
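
The saving from repetition can be sketched as follows: one cycle is synthesized once and then reused. The function names `synthesize_cycle` and `repeat_cycle`, the period of 64 samples, the harmonic amplitudes, and the repeat count of 4 are hypothetical.

```python
import math


def synthesize_cycle(period, amplitudes):
    """One period of a harmonic signal; amplitudes[k-1] scales harmonic k."""
    return [
        sum(a * math.sin(2.0 * math.pi * k * n / period)
            for k, a in enumerate(amplitudes, start=1))
        for n in range(period)
    ]


def repeat_cycle(cycle, repeats):
    """Concatenate the same cycle `repeats` times without re-synthesizing it."""
    return cycle * repeats


cycle = synthesize_cycle(64, [1.0, 0.5, 0.25])  # computed once
waveform = repeat_cycle(cycle, 4)               # reused for 4 periods
```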

さらに、本実施形態では音声波形生成部２３に正弦関数合成部２３３を設け、調波成分を合成する際、複数の正弦関数の足し合わせに相当する処理により音声波形を生成するようにしている。これにより、調波成分のエネルギーが特定の時刻に集中することを防止できるので、信号対雑音比の高い音声認識が可能になる。 Further, in the present embodiment, the sine function synthesis unit 233 is provided in the voice waveform generation unit 23, and when the harmonic components are synthesized, the voice waveform is generated by a process corresponding to the addition of a plurality of sine functions. This prevents the energy of the harmonic components from being concentrated at a specific time, so that voice recognition with a high signal-to-noise ratio becomes possible.
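
The effect of summing sinusoids with different phases can be illustrated with the sketch below, which compares the peak amplitude of a harmonic sum whose components are all phase-aligned with one whose phases are spread. The Schroeder-style phase rule, the eight unit-amplitude harmonics, and the period of 64 samples are assumptions for illustration; the phases actually used by the sine function synthesis unit 233 are not stated in this passage.

```python
import math


def harmonic_sum(period, phases):
    """Sum of unit-amplitude cosines, one per harmonic, over one period."""
    return [
        sum(math.cos(2.0 * math.pi * k * n / period + phi)
            for k, phi in enumerate(phases, start=1))
        for n in range(period)
    ]


def peak(signal):
    """Largest absolute sample value."""
    return max(abs(x) for x in signal)


K = 8        # number of harmonics (hypothetical)
PERIOD = 64  # basic period in samples (hypothetical)

aligned = harmonic_sum(PERIOD, [0.0] * K)
# Schroeder-style phases: an assumed choice that spreads the peaks in time.
spread = harmonic_sum(
    PERIOD, [-math.pi * k * (k - 1) / K for k in range(1, K + 1)]
)

peak_aligned = peak(aligned)  # all harmonics add up at n = 0
peak_spread = peak(spread)    # lower, since the harmonics no longer align
```

Both signals carry the same energy over one period; only its distribution in time differs, which is the property the paragraph above relies on.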

前記音声認識部３は、音声波形生成部２３が合成した音声波形に対する音声認識処理を行って音声認識結果を出力する。 The voice recognition unit 3 performs voice recognition processing on the voice waveform synthesized by the voice waveform generation unit 23 and outputs the voice recognition result.

本実施形態によれば、入力音声に基づいて音声認識率の高い音声波形を生成する際に、後段の音声認識において考慮されない特徴量については、これを再現せず、計算量の削減を優先させた特徴量を採用するので、計算量の増加を抑えながら音声認識率を向上させることができる。 According to the present embodiment, when a voice waveform with a high voice recognition rate is generated from the input voice, feature amounts that are not considered in the subsequent voice recognition are not reproduced; instead, feature amounts that prioritize a reduction in the amount of calculation are adopted. The voice recognition rate can therefore be improved while suppressing an increase in the amount of calculation.

また、本実施形態では、音声認識の実行部を後段に設けて一体構成としたので、人による聴取を目的とした音声波形を出力する声質変換装置と、音声波形を入力とする音声認識装置とを縦続に接続した場合よりも計算量の増加を抑えながら音声認識率を向上させる音声認識装置を構成できるようになる。 Further, in the present embodiment, the voice recognition execution unit is provided in the subsequent stage to form an integrated configuration. It is therefore possible to configure a voice recognition device that improves the voice recognition rate while suppressing the increase in the amount of calculation, compared with the case where a voice quality conversion device that outputs a voice waveform intended for human listening and a voice recognition device that takes a voice waveform as input are connected in cascade.

なお、上記の実施形態では、窓長（処理区間長）Ｎと音声波形生成部２３で合成する音声波形の基本周期とが等しいものとして説明したが、本発明はこれのみに限定されるものではなく、窓長Nは合成する音声波形の基本周期の整数倍であっても良い。これにより、離散フーリエ変換で行う際の変換長が長くなるため処理量的な不利は生じるが、この場合でも窓関数の影響を避けることができる。 In the above embodiment, the window length (processing section length) N and the basic period of the voice waveform synthesized by the voice waveform generation unit 23 were described as being equal. However, the present invention is not limited to this, and the window length N may be an integer multiple of the basic period of the voice waveform to be synthesized. In that case the transform length of the discrete Fourier transform becomes longer, which is a disadvantage in terms of processing amount, but the influence of the window function can still be avoided.
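
The remark about the window function can be checked numerically with the sketch below. It applies a plain discrete Fourier transform (rectangular window) to two sinusoids: one whose period divides the window length N, and one whose period does not. The values N = 32 and the periods 16 and 13 are hypothetical, and the direct O(N^2) DFT is used only to keep the example self-contained.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of the plain O(N^2) DFT with a rectangular window."""
    n_total = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * m * n / n_total)
                for n, x in enumerate(signal)))
        for m in range(n_total)
    ]


N = 32  # window length (hypothetical)
on_grid = [math.sin(2 * math.pi * n / 16) for n in range(N)]   # period 16 = N / 2
off_grid = [math.sin(2 * math.pi * n / 13) for n in range(N)]  # 13 does not divide N

mag_on = dft_magnitudes(on_grid)
mag_off = dft_magnitudes(off_grid)

# Total magnitude outside bins 2 and N - 2 (the component and its mirror image).
leak_on = sum(v for m, v in enumerate(mag_on) if m not in (2, N - 2))
leak_off = sum(v for m, v in enumerate(mag_off) if m not in (2, N - 2))
```

When N is an integer multiple of the period, the component falls exactly on one DFT bin and `leak_on` is zero up to rounding error, whereas `leak_off` is clearly non-zero.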

また、上記の実施形態では、音声認識装置１が声質変換部２を内蔵する場合を例にして説明したが、本発明はこれのみに限定されるものではなく、図４に示したように、声質変換部２を含む声質変換装置４と音声認識部３を含む音声認識装置５とを分離し、声質変換装置４が生成した音声波形を、有線、無線またはネットワーク経由で音声認識装置５へ入力させるようにしても良い。 Further, in the above embodiment, the case where the voice recognition device 1 incorporates the voice quality conversion unit 2 was described as an example, but the present invention is not limited to this. As shown in FIG. 4, a voice quality conversion device 4 including the voice quality conversion unit 2 and a voice recognition device 5 including the voice recognition unit 3 may be separated, and the voice waveform generated by the voice quality conversion device 4 may be input to the voice recognition device 5 by wire, wirelessly, or via a network.

このような分離構造とすれば、既存、汎用の音声認識装置を用いて認識率の高い音声認識を実現できるようになる。 With such a separation structure, it becomes possible to realize voice recognition with a high recognition rate by using an existing or general-purpose voice recognition device.

１…音声認識装置，２…声質変換部，３…音声認識部，４…特徴変形部，５…音声波形生成部，２１…音声特徴ベクトル抽出部，２２…音声特徴変形部，２３…音声波形生成部 1 ... voice recognition device, 2 ... voice quality conversion unit, 3 ... voice recognition unit, 4 ... feature transformation unit, 5 ... voice waveform generation unit, 21 ... voice feature vector extraction unit, 22 ... voice feature transformation unit, 23 ... voice waveform generation unit

Claims

A voice recognition device that transforms the voice quality of input voice before voice recognition, comprising:
means for extracting a feature amount from the input voice;
means for transforming the feature amount; and
means for generating a voice waveform based on the transformed feature amount,
wherein the means for generating a voice waveform generates a voice waveform whose basic period is equal to the processing section length of the waveform generation process or to that length divided by an integer.

A voice recognition device that transforms the voice quality of input voice before voice recognition, comprising:
means for extracting a feature amount from the input voice;
means for transforming the feature amount; and
means for generating a voice waveform based on the transformed feature amount,
wherein the means for generating a voice waveform generates the voice waveform by a process corresponding to the addition of a plurality of sine functions.

A voice recognition device that transforms the voice quality of input voice before voice recognition, comprising:
means for extracting a feature amount from the input voice;
means for transforming the feature amount; and
means for generating a voice waveform based on the transformed feature amount,
wherein the means for generating a voice waveform generates a voice waveform in which a voice waveform of one cycle is repeated a predetermined number of times.

The voice recognition device according to any one of claims 1 to 3, further comprising means for performing voice recognition based on the generated voice waveform.

A voice recognition method in which a computer transforms the voice quality of input voice before voice recognition, the method comprising:
extracting a feature amount from the input voice;
transforming the feature amount; and
generating a voice waveform based on the transformed feature amount,
wherein, when the voice waveform is generated, a voice waveform whose basic period is equal to the processing section length of the waveform generation process or to that length divided by an integer is generated.

A voice recognition method in which a computer transforms the voice quality of input voice before voice recognition, the method comprising:
extracting a feature amount from the input voice;
transforming the feature amount; and
generating a voice waveform based on the transformed feature amount,
wherein, when the voice waveform is generated, the voice waveform is generated by a process corresponding to the addition of a plurality of sine functions.

A voice recognition method in which a computer transforms the voice quality of input voice before voice recognition, the method comprising:
extracting a feature amount from the input voice;
transforming the feature amount; and
generating a voice waveform based on the transformed feature amount,
wherein, when the voice waveform is generated, a voice waveform in which a voice waveform of one cycle is repeated a predetermined number of times is generated.

The voice recognition method according to any one of claims 5 to 7, wherein voice recognition is executed based on the generated voice waveform.

A voice recognition program that transforms the voice quality of input voice before voice recognition, the program causing a computer to execute:
a procedure of extracting a feature amount from the input voice;
a procedure of transforming the feature amount; and
a procedure of generating a voice waveform based on the transformed feature amount,
wherein, in the procedure of generating a voice waveform, a voice waveform whose basic period is equal to the processing section length of the waveform generation process or to that length divided by an integer is generated.

A voice recognition program that transforms the voice quality of input voice before voice recognition, the program causing a computer to execute:
a procedure of extracting a feature amount from the input voice;
a procedure of transforming the feature amount; and
a procedure of generating a voice waveform based on the transformed feature amount,
wherein, in the procedure of generating a voice waveform, the voice waveform is generated by a process corresponding to the addition of a plurality of sine functions.

A voice recognition program that transforms the voice quality of input voice before voice recognition, the program causing a computer to execute:
a procedure of extracting a feature amount from the input voice;
a procedure of transforming the feature amount; and
a procedure of generating a voice waveform based on the transformed feature amount,
wherein, in the procedure of generating a voice waveform, a voice waveform in which a voice waveform of one cycle is repeated a predetermined number of times is generated.

The voice recognition program according to any one of claims 9 to 11, further causing the computer to execute a procedure of performing voice recognition based on the generated voice waveform.