JP2019008206A

JP2019008206A - Voice band extension device, voice band extension statistical model learning device and program thereof

Info

Publication number: JP2019008206A
Application number: JP2017125132A
Authority: JP
Inventors: Nobumasa Seiyama (清山 信正); Kiyoshi Kurihara (栗原 清); Reiko Saito (齋藤 礼子); Atsushi Imai (今井 篤); Toru Tsugi (都木 徹)
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp; NHK Engineering System Inc
Current assignee: Japan Broadcasting Corp; NHK Engineering System Inc
Priority date: 2017-06-27
Filing date: 2017-06-27
Publication date: 2019-01-17

Abstract

To provide a voice band extension device capable of extending the band of a voice signal.
SOLUTION: A voice band extension device 1 comprises sampling frequency converting means 11 for converting a broadband voice signal by sampling frequency conversion and generating a narrowband voice signal, narrowband acoustic feature quantity extracting means 13 for extracting a narrowband acoustic feature quantity from the narrowband voice signal, broadband acoustic feature quantity extracting means 12 for extracting a broadband acoustic feature quantity from the broadband voice signal, statistical model learning means 14 for learning a statistical model that takes the narrowband acoustic feature quantity as input and outputs the broadband acoustic feature quantity, narrowband acoustic feature quantity extracting means 16 for extracting a narrowband acoustic feature quantity from a voice signal to be extended, broadband acoustic feature quantity generating means 17 for generating a broadband acoustic feature quantity from the extracted narrowband acoustic feature quantity by using the statistical model, and voice synthesizing means 18 for generating a band-extended voice signal by using the generated broadband acoustic feature quantity.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a voice band extension device that extends the band of a voice signal, a voice band extension statistical model learning device, and programs therefor.

Methods have conventionally been developed for extending a narrowband voice signal, such as a telephone voice signal, into a wideband voice signal. These band extension methods generally fall into two approaches.
One approach is a processing-and-shaping method: the narrowband voice signal (up to 4 kHz) is upsampled, the signal shifted into the high band (4 kHz to 8 kHz) is processed and shaped, and the result is added to the narrowband voice signal to generate a wideband voice signal.
The other approach is an analysis-synthesis method: the narrowband voice signal is analyzed and separated into vocal tract information and a sound source signal, each of which is extended into the high band and then resynthesized.

An example of the processing-and-shaping method is the method described in Patent Document 1.
The method of Patent Document 1 divides a narrowband voice signal into frames, computes a frequency-domain spectrum by time-frequency conversion, and obtains a mapping function for generating high-band components from the low-band components of the spectrum. It then uses the mapping function to generate the high-band components from the low-band components and combines them with the low-band components to produce a wideband spectrum. Finally, it applies frequency-time conversion to the wideband spectrum and concatenates the resulting time signals to generate a wideband voice signal.

An example of the analysis-synthesis method is the method described in Patent Document 2.
The method of Patent Document 2 divides a wideband voice signal for learning, and a narrowband voice signal obtained by band-limiting that wideband voice signal, into frames, and separates each voice signal into vocal tract characteristics and a sound source signal (vocal cord waveform) by acoustic analysis based on linear predictive coding. It then vector-quantizes the vocal tract characteristics and sound source signals to generate codebooks that represent them, and associates in advance the codebook generated from the narrowband voice signal with the codebook generated from the wideband voice signal.

The method of Patent Document 2 then divides the narrowband voice signal whose band is to be extended into frames and obtains its vocal tract characteristics and sound source signal by acoustic analysis. After that, it takes the wideband codebook entries that correspond to the codebook entries representing the vocal tract characteristics and sound source signal obtained from the narrowband voice signal, synthesizes speech from the wideband vocal tract characteristics and sound source signal, and concatenates the resulting time signals to generate a wideband voice signal.

Patent Document 1: Japanese Patent No. 5423684
Patent Document 2: Japanese Patent No. 4132154

In the method described in Patent Document 1, the narrowband voice signal is time-frequency converted, high-band components are generated, and the result is frequency-time converted to obtain the time signal of the wideband voice. In other words, this method uses only the frequency-domain representation of the voice signal, that is, its amplitude, and does not take phase into account. Because the phase cannot be extended to the high-band components, the sound quality of the generated wideband voice signal is degraded.

In the method described in Patent Document 2, the voice signal is separated into vocal tract characteristics and a sound source signal, and the narrowband and wideband voice signals are then associated with each other through codebooks that take discrete values. Because the codebooks are discrete, the sound quality of the generated wideband voice signal is degraded.

Moreover, both of these approaches assume band extension of narrowband voice signals at telephone quality and aim to make telephone speech easier to hear. When the conventional approaches are applied to general voice signals other than telephone speech, for example voice signals used to train a statistical model for speech synthesis or voice signals in a speech database used for speech synthesis, the problems described above prevent them from generating a high-quality voice signal.

The present invention has been made in view of these problems, and its object is to provide a voice band extension device, a voice band extension statistical model learning device, and programs therefor that can extend the band of a voice signal with high quality.

To solve the above problem, a voice band extension device according to the present invention is a voice band extension device that extends the band of a voice signal, and comprises sampling frequency conversion means, narrowband acoustic feature extraction means, wideband acoustic feature extraction means, statistical model learning means, second narrowband acoustic feature extraction means, wideband acoustic feature generation means, and speech synthesis means.

In this configuration, the voice band extension device uses the sampling frequency conversion means to convert the sampling frequency of a wideband voice signal having the target post-extension band by downsampling, thereby generating a narrowband voice signal having the pre-extension band. The wideband voice signal is the learning data input during pre-learning. The sampling frequency conversion means thus generates a narrowband voice signal that corresponds to the wideband voice signal serving as learning data.
The voice band extension device then uses the narrowband acoustic feature extraction means to extract a first acoustic feature from the narrowband voice signal frame by frame, and uses the wideband acoustic feature extraction means to extract a second acoustic feature from the wideband voice signal frame by frame. These acoustic features are, for example, spectral parameters and sound source parameters for each frame.
The voice band extension device then uses the statistical model learning means to learn a statistical model that takes the first acoustic feature as input and outputs the second acoustic feature. A deep neural network can be used as this statistical model.
With this configuration, the voice band extension device learns the statistical model as pre-learning.

Next, the voice band extension device uses the second narrowband acoustic feature extraction means to extract a third acoustic feature frame by frame from an extension target voice signal having the same band as the narrowband voice signal.
The voice band extension device then uses the wideband acoustic feature generation means to generate, frame by frame, a fourth acoustic feature, which is a wideband acoustic feature, from the third acoustic feature using the statistical model learned in pre-learning.
The voice band extension device then uses the speech synthesis means to perform speech synthesis with the fourth acoustic feature, thereby generating a band-extended voice signal.
In this way, the voice band extension device can generate, from the narrowband voice signal to be extended, a voice signal whose band has been extended to the wide band.

To solve the above problem, a voice band extension statistical model learning device according to the present invention is a device that learns a statistical model used for extending the band of a voice signal, and comprises sampling frequency conversion means, narrowband acoustic feature extraction means, wideband acoustic feature extraction means, and statistical model learning means.

In this configuration, the voice band extension statistical model learning device uses the sampling frequency conversion means to convert the sampling frequency of a wideband voice signal having the target post-extension band by downsampling, thereby generating a narrowband voice signal having the pre-extension band.
The voice band extension statistical model learning device then uses the narrowband acoustic feature extraction means to extract a first acoustic feature from the narrowband voice signal frame by frame, and uses the wideband acoustic feature extraction means to extract a second acoustic feature from the wideband voice signal frame by frame.
The voice band extension statistical model learning device then uses the statistical model learning means to learn a statistical model that takes the first acoustic feature as input and outputs the second acoustic feature.

To solve the above problem, a voice band extension device according to the present invention is a voice band extension device that extends the band of a voice signal using the statistical model learned by the voice band extension statistical model learning device, and comprises narrowband acoustic feature extraction means, wideband acoustic feature generation means, and speech synthesis means.

In this configuration, the voice band extension device uses the narrowband acoustic feature extraction means to extract a third acoustic feature frame by frame from an extension target voice signal having the same band as the narrowband voice signal from which the input-side acoustic features of the statistical model were extracted.
The voice band extension device then uses the wideband acoustic feature generation means to generate, frame by frame, a fourth acoustic feature, which is a wideband acoustic feature, from the third acoustic feature using the statistical model.
The voice band extension device then uses the speech synthesis means to perform speech synthesis with the fourth acoustic feature, thereby generating a band-extended voice signal.

The voice band extension device can be implemented by running, on a computer, a voice band extension program that causes the computer to function as each means of the voice band extension device.
Likewise, the voice band extension statistical model learning device can be implemented by running, on a computer, a voice band extension statistical model learning program that causes the computer to function as each means of the voice band extension statistical model learning device.

The present invention provides the following advantageous effects.
According to the present invention, the per-frame acoustic features before and after band extension can be learned as a statistical model from a voice signal having the target post-extension band and a voice signal having the pre-extension band generated from that voice signal.
The present invention can therefore reflect the time-series acoustic features of the pre-extension voice signal in the band-extended voice signal, and can generate a high-quality voice signal.

A block diagram showing the configuration of a voice band extension device according to an embodiment of the present invention.
A neural network diagram showing a configuration example of the statistical model that the voice band extension device according to the embodiment learns in advance.
A flowchart showing the pre-learning operation for the statistical model in the voice band extension device according to the embodiment.
A flowchart showing the band extension operation using the statistical model in the voice band extension device according to the embodiment.
Spectrograms of voice signals: (a) a wideband voice signal, (b) a narrowband voice signal converted from the wideband voice signal of (a), and (c) a band-extended voice signal obtained by extending the narrowband voice signal of (b) with the voice band extension device.
A block diagram showing the configuration of a voice band extension statistical model learning device according to a modification of the present invention.
A block diagram showing the configuration of a voice band extension device according to a modification of the present invention.

Embodiments of the present invention are described below with reference to the drawings.
[Configuration of the Voice Band Extension Device]
First, the configuration of a voice band extension device 1 according to an embodiment of the present invention is described with reference to FIG. 1.

The voice band extension device 1 learns a statistical model for extending the band of a voice signal and uses the learned statistical model to extend the band of a voice signal. Hereinafter, the pre-extension band that is the target of extension is called the narrow band, and the target post-extension band is called the wide band; a voice signal in the narrow band is called a narrowband voice signal, and a voice signal in the wide band is called a wideband voice signal.

During pre-learning, in which the statistical model is learned, the voice band extension device 1 receives wideband voice signals as learning data. The learning data are wideband voice signals of multiple spoken sentences, for example about several thousand sentences (several hours of speech).
During band extension, the voice band extension device 1 receives the narrowband voice signal whose band is to be extended.
Here, as an example, the wideband voice signal is sampled at a sampling frequency of 48 kHz and the narrowband voice signal at a sampling frequency of 16 kHz, both with 16-bit quantization.

As shown in FIG. 1, the voice band extension device 1 comprises switching means 10, sampling frequency conversion means 11, wideband acoustic feature extraction means 12, narrowband acoustic feature extraction means 13, statistical model learning means 14, statistical model storage means 15, narrowband acoustic feature extraction means (second narrowband acoustic feature extraction means) 16, wideband acoustic feature generation means 17, and speech synthesis means 18.

The switching means 10 switches the output path of the externally input voice signal between a mode for pre-learning the statistical model and a mode for extending the band of a voice signal. The switching means 10 switches in response to an instruction given through an external input device (a switch or the like, not shown).
When "pre-learning" is specified as the mode, the switching means 10 outputs the input voice signal, which is a wideband voice signal, to the sampling frequency conversion means 11 and the wideband acoustic feature extraction means 12.
When "band extension" is specified as the mode, the switching means 10 outputs the input voice signal, which is a narrowband voice signal, to the narrowband acoustic feature extraction means 16.
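The routing performed by the switching means 10 can be illustrated with a minimal sketch. The names `Mode` and `route_input` and the destination keys are hypothetical and chosen only for illustration; the embodiment does not prescribe any particular implementation.

```python
from enum import Enum


class Mode(Enum):
    PRE_LEARNING = "pre-learning"
    BAND_EXTENSION = "band extension"


def route_input(mode, voice_signal):
    """Return a dict mapping (hypothetical) destination names to the signal they receive."""
    if mode is Mode.PRE_LEARNING:
        # The wideband learning signal feeds both means 11 and means 12.
        return {
            "sampling_frequency_conversion_11": voice_signal,
            "wideband_feature_extraction_12": voice_signal,
        }
    if mode is Mode.BAND_EXTENSION:
        # The narrowband signal to be extended feeds means 16 only.
        return {"narrowband_feature_extraction_16": voice_signal}
    raise ValueError(f"unknown mode: {mode!r}")
```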

The sampling frequency conversion means 11 downsamples the wideband voice signal, converting its sampling frequency to that of the narrowband voice signal.
Here, the sampling frequency conversion means 11 downsamples a wideband voice signal with a sampling frequency of 48 kHz into a narrowband voice signal with a sampling frequency of 16 kHz.
The sampling frequency conversion means 11 outputs the converted narrowband voice signal to the narrowband acoustic feature extraction means 13.
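As an illustration of the 48 kHz to 16 kHz conversion, the following is a minimal sketch of decimation by a factor of 3. The embodiment does not specify the downsampling algorithm; the windowed-sinc low-pass filter, the tap count, and the function names `lowpass_fir` and `downsample` are assumptions made here for illustration.

```python
import math


def lowpass_fir(num_taps, cutoff):
    """Hamming-windowed sinc low-pass filter.

    cutoff is a fraction of the input sampling rate (0 < cutoff < 0.5).
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        sinc = 2 * cutoff if x == 0 else math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # normalize to unity gain at DC


def downsample(signal, factor=3, num_taps=121):
    """Low-pass filter the signal, then keep every `factor`-th sample."""
    # Cutoff slightly below the new Nyquist frequency (0.5 / factor of the input rate)
    # so that components that would alias are attenuated first.
    taps = lowpass_fir(num_taps, cutoff=0.45 / factor)
    half = num_taps // 2
    out = []
    for center in range(0, len(signal), factor):
        acc = 0.0
        for k, tap in enumerate(taps):
            idx = center + k - half
            if 0 <= idx < len(signal):  # samples outside the signal are treated as zero
                acc += tap * signal[idx]
        out.append(acc)
    return out
```

With `factor=3`, an input sampled at 48 kHz yields an output at 16 kHz, and content above roughly 8 kHz is removed before samples are dropped.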

The wideband acoustic feature extraction means 12 extracts acoustic features (wideband acoustic features) from the wideband voice signal input during pre-learning. It cuts frames of a predetermined frame length (for example, 25 ms) out of the voice signal (wideband voice signal) at a predetermined frame period (for example, 5 ms) and performs acoustic analysis to extract spectral parameters and sound source parameters as the acoustic features.
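The frame segmentation can be sketched as follows, using the example values of a 25 ms frame length and a 5 ms frame period. The function name `frame_signal` and the choice to drop a trailing partial frame are assumptions; an analysis tool such as WORLD handles framing internally in its own way.

```python
def frame_signal(signal, sample_rate, frame_ms=25.0, shift_ms=5.0):
    """Cut a signal into overlapping frames of frame_ms, advancing by shift_ms."""
    frame_len = int(round(sample_rate * frame_ms / 1000))
    shift = int(round(sample_rate * shift_ms / 1000))
    frames = []
    start = 0
    # Assumption: a trailing segment shorter than one full frame is discarded.
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += shift
    return frames
```

At 48 kHz a frame holds 1200 samples and advances by 240 samples; at 16 kHz the same settings give 400 and 80 samples. If the same frame length and period are used for both signals, one second of speech yields the same number of frames at either rate, so narrowband and wideband frames can be paired one to one.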

The spectral parameters represent the spectral envelope of the speech, for example mel-cepstral coefficients or line spectral pairs (LSP).
The sound source parameters represent the intensity, pitch period, and the like of the speech, for example the logarithmic pitch frequency and band aperiodicity components.

Any common technique can be used to extract acoustic features from a voice signal, so a description is omitted here; for example, the technique disclosed in Reference 1 below (WORLD) can be used.
(Reference 1) Masanori Morise, Takanobu Nishiura, Hideki Kawahara, "Proposal and Basic Evaluation of the High-Quality Speech Analysis, Conversion and Synthesis System WORLD: Effects of Fundamental Frequency and Spectral Envelope Control on Perceived Quality," Acoustical Society of Japan, Technical Committee on Hearing, Vol. 41, No. 7, pp. 555-560, Toyama, Oct. 2011.

The wideband acoustic feature extraction means 12 extracts parameters with a predetermined number of dimensions. For example, for each frame it extracts, as the wideband acoustic feature, static features totaling 66 dimensions: 60-dimensional mel-cepstral coefficients, a 1-dimensional logarithmic pitch frequency, and 5-dimensional band aperiodicity components.
The wideband acoustic feature extracted by the wideband acoustic feature extraction means 12 serves as the teacher data (ground-truth data) when the statistical model learning means 14 learns the statistical model.
The wideband acoustic feature extraction means 12 outputs the extracted 66-dimensional wideband acoustic feature to the statistical model learning means 14.

The narrowband acoustic feature extraction means 13 extracts acoustic features (narrowband acoustic features) from the narrowband voice signal converted by the sampling frequency conversion means 11.
Like the wideband acoustic feature extraction means 12, the narrowband acoustic feature extraction means 13 extracts parameters with a predetermined number of dimensions from the voice signal as acoustic features. However, in addition to the static feature parameters, the narrowband acoustic feature extraction means 13 also extracts dynamic features as acoustic features.

For example, for each frame the narrowband acoustic feature extraction means 13 extracts static features totaling 62 dimensions: 60-dimensional mel-cepstral coefficients, a 1-dimensional logarithmic pitch frequency, and a 1-dimensional band aperiodicity component.
The narrowband acoustic feature extraction means 13 further generates, from the static features obtained for each frame, dynamic features that capture change in the time direction. Here it obtains 124-dimensional dynamic features consisting of the first-order difference of the static features in the time direction (the difference from the static features one frame later) and the second-order difference (the difference from the static features two frames later).

The narrowband acoustic feature extraction means 13 outputs the extracted narrowband acoustic feature to the statistical model learning means 14. Here, it outputs a 186-dimensional narrowband acoustic feature that combines the 62-dimensional static features and the 124-dimensional dynamic features.
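The assembly of the 186-dimensional narrowband feature from 62-dimensional static features can be sketched as below. The differences follow the description above (static features one frame and two frames later, minus the current frame). The function name `add_dynamic_features` and the handling of the final frames, where later frames do not exist and the last frame is reused, are assumptions not stated in the embodiment.

```python
def add_dynamic_features(static_frames):
    """Append first- and second-order time differences to each static feature vector.

    static_frames: list of equal-length lists, one per frame (62 values each in the example).
    Returns one vector per frame of three times the static length (186 in the example).
    """
    num_frames = len(static_frames)
    out = []
    for t, cur in enumerate(static_frames):
        # Assumption: past the end of the utterance, the last frame is repeated.
        next1 = static_frames[min(t + 1, num_frames - 1)]
        next2 = static_frames[min(t + 2, num_frames - 1)]
        delta1 = [b - a for a, b in zip(cur, next1)]  # difference from one frame later
        delta2 = [b - a for a, b in zip(cur, next2)]  # difference from two frames later
        out.append(list(cur) + delta1 + delta2)
    return out
```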

Like the wideband acoustic feature extraction means 12, the narrowband acoustic feature extraction means 13 may use only the static features as the narrowband acoustic feature. By also extracting dynamic features, however, it can capture changes in the time direction as acoustic features and thus extract the acoustic characteristics more accurately.

The statistical model learning means 14 learns a statistical model that takes the narrowband acoustic feature as input and outputs the wideband acoustic feature. Using the wideband acoustic feature extracted by the wideband acoustic feature extraction means 12 as teacher data, it learns the statistical model so that the output obtained from the narrowband acoustic feature extracted by the narrowband acoustic feature extraction means 13 approximates the teacher data.

The statistical model learning means 14 learns a deep neural network (DNN) as the statistical model.
FIG. 2 shows an example of a statistical model configured as a DNN. Specifically, the statistical model learning means 14 learns a statistical model M using a feed-forward neural network (FFNN) composed of the input layer L1, hidden layers L2, and output layer L3 shown in FIG. 2.
The structure of the statistical model M is not particularly limited; here, the input layer L1 has 186 dimensions, matching the narrowband acoustic feature, the hidden layers L2 consist of three layers of 256 dimensions each, and the output layer L3 has 66 dimensions, matching the wideband acoustic feature.

The statistical model learning means 14 inputs the narrowband acoustic feature (here, 186 dimensions) extracted by the narrowband acoustic feature extraction means 13 to the input layer L1 in FIG. 2. In the hidden layers L2, the value of each element of the narrowband acoustic feature input to the input layer L1 is weighted and propagated, and in the output layer L3 the wideband acoustic feature (here, 66 dimensions) is estimated; the statistical model learning means 14 learns the statistical model M that performs this estimation. The activation function in the hidden layers L2 is, for example, the rectified linear function (ReLU).
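The forward computation of the example network (186 inputs, three hidden layers of 256 units with ReLU, 66 outputs) can be sketched as follows. The bias terms, the linear output layer, and the random initialization are assumptions; the embodiment states only the layer sizes and the hidden-layer activation. The names `LAYER_SIZES`, `init_params`, and `forward` are illustrative.

```python
import random

LAYER_SIZES = [186, 256, 256, 256, 66]  # input L1, three hidden layers L2, output L3


def init_params(sizes, seed=0):
    """Create one (weights, biases) pair per layer transition with random weights."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = (2.0 / n_in) ** 0.5  # assumption: He-style scaling, commonly used with ReLU
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params


def forward(params, x):
    """Propagate one feature vector through the network and return the output vector."""
    h = x
    last = len(params) - 1
    for i, (weights, biases) in enumerate(params):
        z = [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(weights, biases)]
        # ReLU on hidden layers; the output layer is left linear (assumption).
        h = z if i == last else [max(0.0, v) for v in z]
    return h
```

Calling `forward(init_params(LAYER_SIZES), feature)` with a 186-element list returns a 66-element list, corresponding to one estimated wideband feature vector per frame.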

Here, the statistical model learning means 14 learns the weights of each layer as the model parameters of the statistical model M so that the error between the broadband acoustic feature quantity serving as teacher data, extracted by the broadband acoustic feature quantity extraction means 12, and the estimated broadband acoustic feature quantity output from the output layer L3 approaches "0". The loss function used to compute the error is, for example, the mean squared error. The statistical model M is trained by, for example, error backpropagation.
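As a rough illustration of learning with a mean squared error loss and backpropagation, the sketch below trains a deliberately tiny network (2 inputs, 3 ReLU hidden units, 2 outputs, no biases) on a single example. The network size, the learning rate, the plain gradient-descent update, and the names `mse` and `train_step` are all assumptions chosen to keep the example short; the source does not specify an optimizer.

```python
def mse(estimate, target):
    """Mean squared error between two equal-length vectors."""
    return sum((e - t) ** 2 for e, t in zip(estimate, target)) / len(target)


def train_step(w1, w2, x, target, lr=0.05):
    """One backpropagation update for a bias-free network x -> ReLU(w1 x) -> w2 h."""
    # Forward pass.
    pre = [sum(w * v for w, v in zip(row, x)) for row in w1]
    hidden = [max(0.0, p) for p in pre]
    out = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    loss = mse(out, target)

    # Backward pass: gradient at the output, then propagated to the hidden layer.
    d_out = [2.0 * (o - t) / len(target) for o, t in zip(out, target)]
    d_hidden = [sum(d_out[k] * w2[k][j] for k in range(len(w2))) for j in range(len(hidden))]
    d_pre = [d if p > 0.0 else 0.0 for d, p in zip(d_hidden, pre)]

    # Gradient-descent updates, applied in place.
    for k, row in enumerate(w2):
        for j in range(len(row)):
            row[j] -= lr * d_out[k] * hidden[j]
    for j, row in enumerate(w1):
        for i in range(len(row)):
            row[i] -= lr * d_pre[j] * x[i]
    return loss  # loss measured before this update


# Hand-picked starting weights so every hidden unit is active for this input.
w1 = [[0.5, 0.2], [-0.3, 0.8], [0.1, 0.1]]
w2 = [[0.1, -0.2, 0.3], [0.2, 0.1, -0.1]]
x, target = [1.0, 0.5], [1.0, -1.0]
losses = [train_step(w1, w2, x, target) for _ in range(100)]
print(losses[0], losses[-1])
```

In the patent's setting, `x` would be a 186-dimensional narrowband frame and `target` the matching 66-dimensional broadband frame used as teacher data.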

In pre-learning, the statistical model learning means 14 repeats learning while broadband and narrowband acoustic feature quantities are being input. Alternatively, learning may be terminated once it has been performed a predetermined number of times or once the parameter error has converged to within a predetermined tolerance.
The statistical model learning means 14 writes the learned statistical model M (model parameters) to the statistical model storage means 15.
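The two stopping conditions mentioned above (a fixed number of iterations, or the error falling within a preset tolerance) can be sketched as a small loop. The names `train_until_converged` and `make_toy_step` are hypothetical, and the toy step function merely halves an error value so the behavior is easy to follow; a real step would run one learning update and report its error.

```python
def train_until_converged(step_fn, max_iterations=1000, tolerance=1e-6):
    """Call step_fn() until the error is within tolerance or the iteration cap is hit.

    step_fn is a hypothetical callable that performs one learning update
    and returns the current error as a float.
    """
    error = None
    for iteration in range(1, max_iterations + 1):
        error = step_fn()
        if error <= tolerance:
            return iteration, error, "converged"
    return max_iterations, error, "max_iterations"


def make_toy_step(start=1.0, decay=0.5):
    """Toy stand-in for a learning update: the error halves on every call."""
    state = {"error": start}

    def step():
        state["error"] *= decay
        return state["error"]

    return step


converged = train_until_converged(make_toy_step(), max_iterations=100, tolerance=1e-3)
capped = train_until_converged(make_toy_step(), max_iterations=5, tolerance=1e-3)
print(converged, capped)
```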

The statistical model storage means 15 stores the statistical model learned by the statistical model learning means 14. It can be configured from a general storage device such as a hard disk or semiconductor memory.
The statistical model (model parameters) stored in the statistical model storage means 15 is referred to by the broadband acoustic feature quantity generation means 17.
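One simple way to realize this write-then-refer relationship is to serialize the model parameters to a file. The sketch below assumes a JSON layout with a list of layers, each holding `weights` and `biases`; that layout, the file name, and the function names `save_model` and `load_model` are illustrative choices, since the source only says the parameters are stored on a general storage device.

```python
import json
import os
import tempfile


def save_model(path, layers):
    """Write a list of {"weights": ..., "biases": ...} layers to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"layers": layers}, f)


def load_model(path):
    """Read the layer list back for use at band extension time."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)["layers"]


# Tiny placeholder parameters, not a trained model.
layers = [{"weights": [[0.1, -0.2], [0.3, 0.4]], "biases": [0.0, 0.5]}]
path = os.path.join(tempfile.mkdtemp(), "statistical_model.json")
save_model(path, layers)
restored = load_model(path)
print(restored == layers)
```

Keeping the parameters in a portable file also fits the case where learning and band extension run on separate machines.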

The narrowband acoustic feature quantity extraction means (second narrowband acoustic feature quantity extraction means) 16 extracts an acoustic feature quantity (narrowband acoustic feature quantity) from the extension target voice signal (narrowband voice signal) input at the time of band extension. It has the same function as the narrowband acoustic feature quantity extraction means 13 and extracts the same type of narrowband acoustic feature quantity from the narrowband voice signal.

Here, the narrowband acoustic feature quantity extraction means 16 extracts a 186-dimensional narrowband acoustic feature quantity that combines 62-dimensional static characteristics and 124-dimensional dynamic characteristics, the same as the narrowband acoustic feature quantity extraction means 13, and outputs it to the broadband acoustic feature quantity generation means 17.
Because the narrowband acoustic feature quantity extraction means 16 and the narrowband acoustic feature quantity extraction means 13 have the same function, they may be implemented as a single means whose input and output are switched.

The broadband acoustic feature quantity generation means 17 generates a broadband acoustic feature quantity from a narrowband acoustic feature quantity using the statistical model.
Referring to the statistical model stored in the statistical model storage means 15, specifically its model parameters, the broadband acoustic feature quantity generation means 17 generates a broadband acoustic feature quantity from the narrowband acoustic feature quantity extracted by the narrowband acoustic feature quantity extraction means 16. For example, in the case of the statistical model M shown in FIG. 2, the broadband acoustic feature quantity generation means 17 takes the narrowband acoustic feature quantity as the input to the input layer L1 and calculates the broadband acoustic feature quantity as the output of the output layer L3.
In this way, the broadband acoustic feature quantity generation means 17 generates, for each frame, the broadband acoustic feature quantity corresponding to the narrowband acoustic feature quantity.
The broadband acoustic feature quantity generation means 17 outputs the generated broadband acoustic feature quantity to the voice synthesis means 18.

The voice synthesis means 18 performs voice synthesis using the broadband acoustic feature quantity generated by the broadband acoustic feature quantity generation means 17 and generates a voice signal (band-extended voice signal).
The voice synthesis means 18 feeds a sound source waveform approximated by the sound source parameters (here, log pitch frequency and band aperiodicity) into a vocal tract filter that represents the resonance characteristics of the vocal tract using the spectral parameters (here, mel-cepstral coefficients), and performs voice synthesis by synthesizing a voice waveform for each frame.
Voice synthesis from these acoustic feature quantities may use any common vocoder technique; for example, the method disclosed in Reference 1 mentioned above (WORLD) can be used.

The configuration of the voice band extension device 1 according to the embodiment of the present invention has been described above. The voice band extension device 1 can be operated by a program (voice band extension program) that causes a computer to function as each of the means described above.

With the voice band extension device 1 configured as described above, at the time of pre-learning the device can learn, from a broadband voice signal, a statistical model that converts a narrowband acoustic feature quantity into a broadband acoustic feature quantity.
At the time of band extension, the voice band extension device 1 can use the statistical model to generate, from a narrowband voice signal, a band-extended voice signal (broadband voice signal) whose band has been extended.
By using acoustic feature quantities in this way, the voice band extension device 1 can take the continuous features of each frame into account and can generate a higher-quality voice signal than conventional methods.

[Operation of the Voice Band Extension Device]
Next, the operation of the voice band extension device 1 according to the embodiment of the present invention will be described with reference to FIGS. 3 and 4. The operation is described separately for pre-learning, in which the statistical model is learned in advance, and for band extension, in which the band of a voice signal is extended.

(Pre-learning)
First, the pre-learning operation of the voice band extension device 1 will be described with reference to FIG. 3 (and FIG. 1 as appropriate). It is assumed that the output path of the switching means 10 has been switched to the pre-learning mode, so that the input broadband voice signal is output to the sampling frequency conversion means 11 and the broadband acoustic feature quantity extraction means 12.

In step S1, the sampling frequency conversion means 11 generates a narrowband voice signal by converting the sampling frequency of the broadband voice signal input as learning data from broadband (for example, 48 kHz) to narrowband (for example, 16 kHz).
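A sampling frequency conversion from 48 kHz to 16 kHz amounts to decimation by a factor of 3. The sketch below uses one common approach, assumed here rather than taken from the source: a Hamming-windowed sinc low-pass filter followed by keeping every third sample. The filter length, the cutoff placement at the new Nyquist frequency, and the names `lowpass_fir` and `decimate` are illustrative choices.

```python
import math


def lowpass_fir(num_taps, cutoff):
    """Hamming-windowed sinc low-pass filter; cutoff is a fraction of the sampling rate."""
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        t = n - mid
        ideal = 2.0 * cutoff if t == 0 else math.sin(2.0 * math.pi * cutoff * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [tap / total for tap in taps]  # normalize to unity gain at DC


def decimate(signal, factor, num_taps=63):
    """Low-pass filter around the new Nyquist frequency, then keep every factor-th sample."""
    taps = lowpass_fir(num_taps, 0.5 / factor)
    half = num_taps // 2
    output = []
    for center in range(0, len(signal), factor):
        acc = 0.0
        for k, tap in enumerate(taps):
            idx = center + k - half
            if 0 <= idx < len(signal):  # samples outside the signal count as zero
                acc += tap * signal[idx]
        output.append(acc)
    return output


fs_in = 48000
tone_1k = [math.sin(2 * math.pi * 1000 * n / fs_in) for n in range(4800)]
tone_12k = [math.sin(2 * math.pi * 12000 * n / fs_in) for n in range(4800)]
low = decimate(tone_1k, 3)    # 1 kHz lies below the new 8 kHz limit and is kept
high = decimate(tone_12k, 3)  # 12 kHz lies above it and is strongly attenuated
print(len(low), max(abs(v) for v in low[100:1500]), max(abs(v) for v in high[100:1500]))
```

Filtering before discarding samples matters because components above 8 kHz would otherwise alias into the narrowband signal.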

In step S2, the narrowband acoustic feature quantity extraction means 13 extracts, from the narrowband voice signal generated in step S1, parameters of a predetermined number of dimensions: for example, for each frame, a 186-dimensional narrowband acoustic feature quantity consisting of 62-dimensional static characteristics (60-dimensional mel-cepstral coefficients, a 1-dimensional log pitch frequency, and a 1-dimensional band aperiodicity) and 124-dimensional dynamic characteristics (the first-order and second-order differences of the static characteristics in the time direction).
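The dimension count works out as 62 static values plus 62 first-order and 62 second-order differences, giving 186 per frame. The sketch below shows one way to build such a vector. The exact difference formula is not given in the source, so a simple backward difference with zeros for the first frame is assumed, and the names `time_difference` and `add_dynamic_features` are hypothetical.

```python
def time_difference(frames):
    """Backward difference along time; the first frame gets zeros (assumed convention)."""
    if not frames:
        return []
    diffs = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        diffs.append([c - p for p, c in zip(prev, cur)])
    return diffs


def add_dynamic_features(static_frames):
    """Concatenate static, first-difference, and second-difference values per frame."""
    delta1 = time_difference(static_frames)
    delta2 = time_difference(delta1)
    return [s + d1 + d2 for s, d1, d2 in zip(static_frames, delta1, delta2)]


# Four placeholder frames of 62 static values each (frame t holds the value t).
static = [[float(t)] * 62 for t in range(4)]
features = add_dynamic_features(static)
print(len(features), len(features[0]))
```

In practice the 62 static values would come from a vocoder-style analysis of each frame rather than from placeholder numbers.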

In step S3, the broadband acoustic feature quantity extraction means 12 extracts, from the broadband voice signal input as learning data, parameters of a predetermined number of dimensions as the broadband acoustic feature quantity: for example, for each frame, 66-dimensional static characteristics consisting of 60-dimensional mel-cepstral coefficients, a 1-dimensional log pitch frequency, and a 5-dimensional band aperiodicity.
Step S3 may be performed before steps S1 and S2, or in parallel with them.
As a result, the narrowband acoustic feature quantity and the broadband acoustic feature quantity generated from the same broadband voice signal are obtained in frame-by-frame correspondence.

Then, in step S4, the statistical model learning means 14 learns a statistical model that converts the narrowband acoustic feature quantity into the broadband acoustic feature quantity, using the broadband acoustic feature quantity extracted in step S3 as teacher data. Here, the statistical model learning means 14 inputs the narrowband acoustic feature quantity extracted in step S2 to the input layer L1 of the feed-forward neural network in FIG. 2. In the hidden layer L2, the value of each element of the narrowband acoustic feature quantity input to the input layer L1 is weighted and propagated, and the statistical model M is learned so that the error between the output of the output layer L3 and the broadband acoustic feature quantity extracted in step S3 approaches "0".
Through the above operation, the voice band extension device 1 can generate a statistical model that estimates, from the acoustic feature quantity of a narrowband voice signal, the acoustic feature quantity of the corresponding broadband voice signal.

(Band extension)
Next, the band extension operation of the voice band extension device 1 will be described with reference to FIG. 4 (and FIG. 1 as appropriate). It is assumed that the output path of the switching means 10 has been switched to the band extension mode, so that the input narrowband voice signal is output to the narrowband acoustic feature quantity extraction means 16.

In step S10, the narrowband acoustic feature quantity extraction means 16 extracts, from the narrowband voice signal to be band-extended, parameters of a predetermined number of dimensions: for example, for each frame, a 186-dimensional narrowband acoustic feature quantity consisting of 62-dimensional static characteristics (60-dimensional mel-cepstral coefficients, a 1-dimensional log pitch frequency, and a 1-dimensional band aperiodicity) and 124-dimensional dynamic characteristics (the first-order and second-order differences of the static characteristics in the time direction). The narrowband acoustic feature quantity extracted in step S10 is of the same type and has the same number of dimensions as the one extracted in step S2 of FIG. 3.

In step S11, the broadband acoustic feature quantity generation means 17 generates a broadband acoustic feature quantity from the narrowband acoustic feature quantity extracted in step S10, using the statistical model learned in pre-learning (see FIG. 3).

Then, in step S12, the voice synthesis means 18 performs voice synthesis by synthesizing a voice waveform for each frame using the broadband acoustic feature quantity generated in step S11, and generates a broadband (band-extended) voice signal.
Through the above operation, the voice band extension device 1 can use the statistical model to generate, from a narrowband voice signal, a broadband voice signal whose band has been extended.
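The flow of steps S10 to S12 can be summarized as three stages applied in sequence. In the sketch below the stage functions are passed in as arguments; `extend_band` and the three stub functions are hypothetical placeholders that only demonstrate the data flow, not real feature extraction, model inference, or vocoder synthesis.

```python
def extend_band(narrowband_signal, extract_features, model_forward, synthesize):
    """Run steps S10 to S12 with caller-supplied (hypothetical) stage functions."""
    narrow_features = extract_features(narrowband_signal)         # S10: one vector per frame
    broad_features = [model_forward(f) for f in narrow_features]  # S11: frame-by-frame mapping
    return synthesize(broad_features)                             # S12: waveform synthesis


def stub_extract(signal, frame_len=4):
    """Stand-in for feature extraction: cut the signal into fixed-length frames."""
    return [signal[i:i + frame_len] for i in range(0, len(signal), frame_len)]


def stub_model(frame):
    """Stand-in for the learned statistical model: doubles every value."""
    return [2.0 * v for v in frame]


def stub_synthesize(frames):
    """Stand-in for vocoder synthesis: concatenates the frames."""
    return [v for frame in frames for v in frame]


extended = extend_band(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], stub_extract, stub_model, stub_synthesize
)
print(extended)
```

Passing the stages as arguments mirrors the way the device separates extraction, generation, and synthesis into distinct means.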

[Performance Evaluation of the Voice Band Extension Device]
Next, the performance evaluation of the voice band extension device 1 will be described with reference to FIG. 5. FIG. 5 shows spectrograms of voice signals; the horizontal axis is time [s] and the vertical axis is frequency [Hz].
FIG. 5(a) is the spectrogram of a broadband voice signal with a sampling frequency of 48 kHz (by the sampling theorem, the maximum observable frequency is 24 kHz). This broadband voice signal was not used for pre-learning.

FIG. 5(b) is the spectrogram of the narrowband voice signal obtained by downsampling the broadband voice signal of FIG. 5(a) to a sampling frequency of 16 kHz.
FIG. 5(c) is the spectrogram of the band-extended voice signal obtained by inputting the narrowband voice signal of FIG. 5(b) to the voice band extension device 1, for which pre-learning had already been completed, and extending the band to a sampling frequency of 48 kHz.

As shown in FIG. 5(b), the narrowband voice signal contains only components up to 8 kHz as a result of downsampling.
However, as shown in FIG. 5(c), in the band-extended voice signal produced by the voice band extension device 1, components at 8 kHz and above are restored, and the original broadband voice signal shown in FIG. 5(a) is restored with good accuracy.
As described above, the voice band extension device 1 can extend the band of a voice signal easily and with high quality.

Next, another performance evaluation of the voice band extension device 1 will be described with reference to Table 1. Table 1 compares performance across different band extension methods.

Table 1 shows the mean squared error with respect to the original broadband voice signal for two band-extended voice signals: one obtained by upsampling, to the sampling frequency of the original broadband voice signal, the narrowband voice signal generated by downsampling that broadband voice signal; and one obtained by extension with the voice band extension device 1 according to the present invention.
Specifically, the spectral parameters of each voice signal were calculated, and the mean squared error of the corresponding spectral parameters was computed between the broadband voice signal and the band-extended voice signal (upsampling), and between the broadband voice signal and the band-extended voice signal (present invention). Here, static and dynamic characteristics were used as the narrowband acoustic feature quantity for band extension by the voice band extension device 1. In both cases, the measurement used voice signals for 350 evaluation sentences that were not used in pre-learning.
As shown in Table 1, the mean squared error is smaller for the band-extended voice signal produced by the voice band extension device 1 (present invention) than for the upsampled band-extended voice signal, which indicates that the voice band extension device 1 can estimate the broadband voice signal accurately.

Further, another performance evaluation of the voice band extension device 1 will be described with reference to Table 2. Table 2 compares performance across different narrowband acoustic feature quantities extracted by the narrowband acoustic feature quantity extraction means 13 and 16.

Table 2 shows the minimum mean squared error between the correct broadband voice signal and the band-extended voice signal generated by the voice band extension device 1, for the case where the narrowband acoustic feature quantity extraction means 13 and 16 use only static characteristics and the case where they use both static and dynamic characteristics. As in the evaluation of Table 1, the measurement used voice signals for 350 evaluation sentences that were not used in pre-learning.
As shown in Table 2, using a narrowband acoustic feature quantity that combines static and dynamic characteristics allows the broadband voice signal to be estimated more accurately than using static characteristics alone.

[Modifications]
The configuration, operation, and evaluation of the voice band extension device 1 according to the embodiment of the present invention have been described above. However, the present invention is not limited to this embodiment.
The voice band extension device 1 performs two operations in a single device: pre-learning, which learns the statistical model, and band extension, which extends the band of a voice signal using the statistical model. However, these operations may be performed by separate devices.
Specifically, a device that performs the pre-learning of the statistical model can be configured as the voice band extension statistical model learning device 2 shown in FIG. 6.

As shown in FIG. 6, the voice band extension statistical model learning device 2 includes the sampling frequency conversion means 11, the broadband acoustic feature quantity extraction means 12, the narrowband acoustic feature quantity extraction means 13, the statistical model learning means 14, and the statistical model storage means 15. This configuration is obtained by removing the switching means 10, the narrowband acoustic feature quantity extraction means 16, the broadband acoustic feature quantity generation means 17, and the voice synthesis means 18 from the configuration of the voice band extension device 1 described with reference to FIG. 1.
The voice band extension statistical model learning device 2 performs only the pre-learning operation of learning the statistical model.
The operation of the voice band extension statistical model learning device 2 is the same as the operation described with reference to FIG. 3.
The voice band extension statistical model learning device 2 can be operated by a program (voice band extension statistical model learning program) that causes a computer to function as each of the means described above.

A device that performs the band extension operation, which extends the band of a voice signal using the statistical model, can be configured as the voice band extension device 1B shown in FIG. 7.
The voice band extension device 1B includes the statistical model storage means 15, the narrowband acoustic feature quantity extraction means 16, the broadband acoustic feature quantity generation means 17, and the voice synthesis means 18. This configuration is obtained by removing the switching means 10, the sampling frequency conversion means 11, the broadband acoustic feature quantity extraction means 12, the narrowband acoustic feature quantity extraction means 13, and the statistical model learning means 14 from the configuration of the voice band extension device 1 described with reference to FIG. 1. The statistical model stored in the statistical model storage means 15 is the one learned by the voice band extension statistical model learning device 2 of FIG. 6.
The voice band extension device 1B performs only the band extension operation of extending the band of a voice signal.
The operation of the voice band extension device 1B is the same as the operation described with reference to FIG. 4.
The voice band extension device 1B can be operated by a program (voice band extension program) that causes a computer to function as each of the means described above.

By running the learning operation, which learns the statistical model, and the band extension operation, which extends the band of a voice signal using the statistical model, on different devices (the voice band extension statistical model learning device 2 and the voice band extension device 1B), a statistical model learned by one voice band extension statistical model learning device 2 can be used by a plurality of voice band extension devices 1B.

DESCRIPTION OF SYMBOLS
1, 1B: Voice band extension device
2: Voice band extension statistical model learning device
10: Switching means
11: Sampling frequency conversion means
12: Broadband acoustic feature quantity extraction means
13: Narrowband acoustic feature quantity extraction means
14: Statistical model learning means
15: Statistical model storage means
16: Narrowband acoustic feature quantity extraction means (second narrowband acoustic feature quantity extraction means)
17: Broadband acoustic feature quantity generation means
18: Voice synthesis means

Claims

A voice band extension device for extending the band of a voice signal, comprising:
sampling frequency conversion means for converting the sampling frequency of a broadband voice signal having the target extended band to generate a narrowband voice signal having the pre-extension band;
narrowband acoustic feature quantity extraction means for extracting a first acoustic feature quantity from the narrowband voice signal in units of frames;
broadband acoustic feature quantity extraction means for extracting a second acoustic feature quantity from the broadband voice signal in units of frames;
statistical model learning means for learning a statistical model that takes the first acoustic feature quantity as input and outputs the second acoustic feature quantity;
second narrowband acoustic feature quantity extraction means for extracting a third acoustic feature quantity in units of frames from an extension target voice signal having the same band as the narrowband voice signal;
broadband acoustic feature quantity generation means for generating, using the statistical model, a fourth acoustic feature quantity, which is a broadband acoustic feature quantity, in units of frames from the third acoustic feature quantity; and
voice synthesis means for generating a band-extended voice signal by performing voice synthesis using the fourth acoustic feature quantity.

The voice band extension device according to claim 1, wherein the statistical model learning means learns a deep neural network as the statistical model.

The voice band extension device according to claim 1 or 2, wherein the acoustic feature quantity is a static characteristic of a spectral parameter and a sound source parameter for each frame.

The voice band extension device described, wherein the narrowband acoustic feature quantity extraction means and the second narrowband acoustic feature quantity extraction means extract, in addition to the static characteristic, a difference between frames of the static characteristic as a dynamic characteristic.

A voice band extension statistical model learning device for learning a statistical model used to extend the band of a voice signal, comprising:
sampling frequency conversion means for converting the sampling frequency of a broadband voice signal having the target extended band to generate a narrowband voice signal having the pre-extension band;
narrowband acoustic feature quantity extraction means for extracting a first acoustic feature quantity from the narrowband voice signal in units of frames;
broadband acoustic feature quantity extraction means for extracting a second acoustic feature quantity from the broadband voice signal in units of frames; and
statistical model learning means for learning a statistical model that takes the first acoustic feature quantity as input and outputs the second acoustic feature quantity.

A voice band extension device for extending the band of a voice signal using a statistical model learned by the voice band extension statistical model learning device according to claim 5, comprising:
narrowband acoustic feature quantity extraction means for extracting a third acoustic feature quantity in units of frames from an extension target voice signal having the same band as the narrowband voice signal from which the acoustic feature quantity on the input side of the statistical model was extracted;
broadband acoustic feature quantity generation means for generating, using the statistical model, a fourth acoustic feature quantity, which is a broadband acoustic feature quantity, in units of frames from the third acoustic feature quantity; and
voice synthesis means for generating a band-extended voice signal by performing voice synthesis using the fourth acoustic feature quantity.

A voice band extension program for causing a computer to function as the voice band extension device according to any one of claims 1, 2, 3, 4, and 6.

A voice band extension statistical model learning program for causing a computer to function as the voice band extension statistical model learning device according to claim 5.