JP2005099509A

JP2005099509A - Voice processing device, voice processing method, and voice processing program

Info

Publication number: JP2005099509A
Application number: JP2003334130A
Authority: JP
Inventors: Yasuo Yoshioka; 靖雄吉岡; Rosukosu Alex; ロスコスアレックス
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2003-09-25
Filing date: 2003-09-25
Publication date: 2005-04-14
Anticipated expiration: 2023-09-25
Also published as: JP3901144B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice processing device, a voice processing method, and a voice processing program that apply vibrato accompanied by natural tone variation.

SOLUTION: A template creation part 900 creates a template set TS for a target voice that is supplied from a target input part 950 and to which vibrato has been applied. The pitch, gain, and spectral inclination (slope) of each frame of the target voice are recorded in the template set TS. A vibrato applying part 400 changes the pitch, gain, and slope of a singing voice (input voice) of an amateur singer or the like under the control of a vibrato application control part 700, performs an inverse FFT on the spectrum obtained from the changed pitch, gain, and slope, and supplies the resulting voice (namely, a voice having the changed pitch, gain, and slope) to a voice output part 500.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a voice processing device, a voice processing method, and a voice processing program capable of adding vibrato to an input singing voice or the like.

Vibrato, one of the techniques of singing, gives the singing voice periodic fluctuations in pitch and amplitude. Applying vibrato to a long sustained note gives it rich expression (variation in sound). Conversely, if a long note is sung without vibrato, the sound varies little and the singing tends to become monotonous. However, because vibrato is an advanced technique, it is difficult for an amateur singer who is not used to singing to sing with a clean vibrato.
Against this background, various devices that automatically add vibrato to the singing voice of an amateur singer have been proposed. For example, Patent Document 1 below discloses a device that, rather than mechanically adding vibrato of a fixed magnitude, generates a modulation signal according to the state of the input singing voice signal, such as its pitch, volume, and the duration of the same note, and adds vibrato by modulating the pitch and amplitude of the singing voice signal with this modulation signal.

Patent Document 1: Japanese Patent Laid-Open No. 9-044158

However, the device disclosed in Patent Document 1 generates the modulation signal from a synthetic signal such as a sine wave or triangular wave produced by an LFO (Low Frequency Oscillator). It therefore cannot reproduce the subtle pitch and amplitude fluctuations of vibrato sung by the singer to be imitated (the target singer), nor can it produce natural changes in timbre. Some conventional examples use a sampled waveform of the target singer's actual vibrato instead of the sine wave or the like, but the natural changes in pitch, amplitude, and timbre of a singing voice cannot be reproduced from a single vibrato waveform.

The present invention has been made in view of the circumstances described above, and its object is to provide a voice processing device, a voice processing method, and a voice processing program capable of adding vibrato accompanied by natural timbre changes.

To solve the problem described above, a voice processing device according to the present invention comprises: storage means for storing the pitch and the spectral slope of a target voice to which vibrato has been applied; input means for inputting a voice; pitch extraction means for extracting the pitch of the input voice; pitch changing means for changing the extracted pitch of the voice according to the pitch of the target voice; spectral slope changing means for changing the spectral slope of the voice, obtained by spectral analysis of the input voice, according to the spectral slope of the target voice; and output means for outputting a voice having the changed pitch and the changed spectral slope.

With this configuration, when the voice of an amateur singer or the like is input, the pitch of the input voice is changed according to the pitch of the vibrato-bearing target voice stored in advance in the storage means, and the spectral slope of the input voice is changed according to the spectral slope of the target voice. Changing the spectral slope makes it possible to change only the timbre without changing the overall volume (compare FIG. 10 with FIG. 5, and FIG. 11 with FIG. 5). Vibrato showing natural timbre changes, as if the target singer were singing, can thus be added to the voice of an amateur singer or the like.

Here, the storage means may store the gain of the target voice in addition to its pitch and spectral slope, and the device may further comprise gain extraction means for extracting the gain of the input voice and gain changing means for changing the extracted gain according to the gain of the target voice, with the output means outputting a voice having the changed pitch, the changed spectral slope, and the changed gain.

The storage means may also store a plurality of pitches of the vibrato-bearing target voice together with the spectral slope corresponding to each pitch. In that case, the pitch changing means selects, from the plurality of target voice pitches, the pitch closest to the extracted pitch of the voice and changes the extracted pitch according to the selected target voice pitch. The spectral slope changing means selects, from the plurality of target voice spectral slopes, the one corresponding to the target voice pitch selected by the pitch changing means and changes the spectral slope of the voice according to the selected slope.

Furthermore, the storage means may store the pitch and spectral slope in each region obtained when the vibrato-bearing target voice is divided into at least a vibrato attack region and a vibrato body region.

As described above, the present invention makes it possible not only to reproduce the subtle pitch and gain fluctuations of the vibrato added by the target singer but also to faithfully reproduce the natural changes in timbre.

Embodiments of the present invention will be described below with reference to the drawings.
A. Embodiment
A-1. Overall Configuration
FIG. 1 shows the configuration of a voice processing device 100 according to the present embodiment.
The voice input unit (input means) 200 comprises a microphone or the like and inputs the voice of an amateur singer or the like into the voice processing device 100. In the following description, the voice of an amateur singer or the like input via the voice input unit 200 is simply called the input voice for convenience.
The spectrum/pitch analysis unit (pitch extraction means, gain extraction means) 300 performs FFT (Fast Fourier Transform)-based spectral analysis and the like on the input voice supplied from the voice input unit 200 in units of frames (about 5 to 10 ms) and extracts the pitch, spectrum, and so on of each frame. The per-frame pitch extraction is not limited to spectral analysis; any well-known method may be used.
The parameter input unit 600 inputs parameters for controlling the vibrato added to the input voice (for example, a desired vibrato rate) into the voice processing device 100. These parameters (hereinafter, Vib parameters) are entered by the amateur singer or the like by operating operation buttons or the like (not shown).
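The frame-wise analysis performed by the spectrum/pitch analysis unit 300 can be pictured with the following minimal sketch. It is only an illustration under stated assumptions: the function names are hypothetical, a naive DFT stands in for the FFT, frames do not overlap, and picking the strongest bin is a crude stand-in for whatever pitch extraction method the device actually uses.

```python
import cmath
import math

def split_frames(samples, sample_rate, frame_ms=10.0):
    """Cut the signal into consecutive, non-overlapping frames of frame_ms (assumption: no overlap)."""
    n = int(sample_rate * frame_ms / 1000.0)
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]

def magnitude_spectrum_db(frame):
    """Magnitudes in dB for bins 0..N/2, computed with a naive DFT (stand-in for an FFT)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        mags.append(20.0 * math.log10(max(abs(acc) / n, 1e-12)))
    return mags

def peak_frequency(frame, sample_rate):
    """Frequency of the strongest non-DC bin; a rough pitch estimate for illustration only."""
    mags = magnitude_spectrum_db(frame)
    k = max(range(1, len(mags)), key=lambda i: mags[i])
    return k * sample_rate / len(frame)
```

With 10 ms frames the bin spacing is 100 Hz, so a practical analyzer would need a finer pitch estimate than this peak pick; the sketch only shows the per-frame structure of the analysis.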

The target input unit (target input means) 950 inputs the voice of the target singer singing with vibrato (hereinafter, the target voice) and supplies it to the template creation unit 900.
The template creation unit (target pitch extraction means, target spectral slope calculation means) 900 creates a template set TS for the vibrato-bearing target voice supplied from the target input unit 950 and stores it in the vibrato database 800. The creation of the template set TS is described in detail later.
The vibrato database (storage means) 800 stores a plurality of template sets TS for voices of the target singer singing with vibrato at various pitches. With this configuration, when vibrato is added to the input voice, the template set TS closest to the pitch of the input voice can be selected and used, which makes it possible to add realistic vibrato.

The vibrato addition control unit (pitch changing means, spectral slope changing means, gain changing means) 700 selects from the vibrato database 800 the template set TS closest to the pitch of the input voice and controls the vibrato added to the input voice based on the selected template set TS and the Vib parameters (described in detail later).
The vibrato adding unit (pitch changing means, spectral slope changing means, gain changing means) 400, under the control of the vibrato addition control unit 700, changes the pitch of the input voice as well as the gain and the spectral slope (hereinafter simply "slope" where appropriate) obtained by analyzing the spectrum of the input voice, and performs an inverse FFT on the spectrum determined by the changed pitch, gain, and slope (described in detail later). The vibrato adding unit 400 then supplies the voice obtained by the inverse FFT (that is, a voice having the changed pitch, gain, and slope) to the voice output unit 500.
The voice output unit (voice output means) 500 comprises a speaker or the like and outputs the voice supplied from the vibrato adding unit 400 to the outside. Using the voice processing device 100 described above, the voice of an amateur singer without vibrato can be converted into a voice showing natural timbre changes, as if the target singer were singing.
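The selection of the template set TS closest to the pitch of the input voice could look like the sketch below. The `TemplateSet` class and its fields are hypothetical, and comparing against the start pitch of each template set is an assumption; the text does not state which pitch value of a template set is used for the comparison.

```python
from dataclasses import dataclass

@dataclass
class TemplateSet:
    name: str                # hypothetical label for illustration
    start_pitch_cent: float  # assumed comparison key: start pitch of the template set

def select_template_set(database, input_pitch_cent):
    """Return the template set whose pitch is nearest to the input pitch (in cents)."""
    return min(database, key=lambda ts: abs(ts.start_pitch_cent - input_pitch_cent))
```

For example, with template sets recorded at 5700, 6000, and 6300 cents, an input pitch of 6100 cents selects the 6000-cent set.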

A-2. Creation of the Template Set TS
On receiving the vibrato-bearing target voice from the target input unit 950, the template creation unit 900 performs FFT-based spectral analysis and the like on the target voice frame by frame and obtains the pitch, gain, and slope of each frame. As with the input voice, the per-frame pitch may be obtained by any well-known method, not only by spectral analysis.

FIG. 2 shows a pitch waveform, that is, the waveform traced by the pitch of the vibrato-bearing target voice. As shown in FIG. 2, the template creation unit 900 divides one vibrato-bearing voice waveform into a vibrato attack region (VA region), a vibrato body region (VB region), and a vibrato release region (VR region). Specifically, the VA region is the onset of the vibrato (from the point where the pitch begins to vary with vibrato until just before the variation becomes periodic), the VB region is the part following the vibrato attack in which the pitch varies periodically, and the VR region is the part following the vibrato body until the sound dies away. The boundary between regions is placed at a local maximum of the pitch (a time point at a pitch peak), taking into account how the vibrato period changes and similar factors.

Vibrato could in principle be added using only the VB-region data (that is, pitch, gain, and slope), but this embodiment uses the data of the VA, VB, and VR regions in order to add more realistic vibrato. Having divided one target voice waveform into these regions, the template creation unit 900 creates a template for each region. In the following description, the template created for the VA region is called the VA template, the one for the VB region the VB template, and the one for the VR region the VR template; the combination of these region templates is called the template set TS.

A-2-1. Extraction of Pitch
FIG. 3 shows the pitch waveform of the VA region. The template creation unit 900 first takes out the pitch waveform of each frame in the VA region shown in FIG. 3 in order and, by analyzing each pitch waveform, extracts the pitch of each frame of the target voice in order. At this time, along with the pitch of each frame, the template creation unit 900 also generates the additional information listed below and records it all in the VA template. The additional information includes the start vibrato depth (mBeginDepth [cent]), end vibrato depth (mEndDepth [cent]), start vibrato rate (mBeginRate [Hz]), end vibrato rate (mEndRate [Hz]), template section length (mDuration [s]), and start pitch (mPitch [cent]).

As shown in FIG. 3, the start vibrato depth (mBeginDepth [cent]) is the difference between the maximum and minimum pitch in the start vibrato period, and the end vibrato depth (mEndDepth [cent]) is the difference between the maximum and minimum pitch in the end vibrato period. A vibrato period here means, for example, the time (s) from one local maximum of the pitch to the next. In FIG. 3 the pitch waveform of the VA region consists of a start vibrato period and an end vibrato period (that is, the VA region is two periods long), but depending on the frequency the VA region may be, for example, three or more periods long. Thus, although the number of periods spanned by the VA region varies with the frequency, it is always the region leading up to the VB region, in which the vibrato varies periodically.

The start vibrato rate (mBeginRate [Hz]) is the reciprocal of the start vibrato period (= 1 / start vibrato period), and the end vibrato rate (mEndRate [Hz]) is the reciprocal of the end vibrato period (= 1 / end vibrato period). The template section length (mDuration [s]) is the duration of the template, and the start pitch (mPitch [cent]) is the pitch of the first frame (vibrato period) of the VA region. This additional information is used when the template is modified to obtain a desired vibrato rate or the like (described later).
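As a small worked illustration of the depth and rate definitions, the sketch below computes both from the per-frame pitches of one vibrato period. The function names are hypothetical, and it assumes that a period is passed as the frames from one pitch maximum up to, but not including, the next maximum, so that its duration is the frame count times the frame length.

```python
def vibrato_depth_cent(cycle_pitch_cent):
    """Depth of one vibrato period: maximum pitch minus minimum pitch, in cents."""
    return max(cycle_pitch_cent) - min(cycle_pitch_cent)

def vibrato_rate_hz(cycle_pitch_cent, frame_s):
    """Rate of one vibrato period: reciprocal of its duration in seconds."""
    period_s = len(cycle_pitch_cent) * frame_s
    return 1.0 / period_s
```

For a period of four frames at 0.05 s per frame with pitches 6030, 6000, 5970, 6000 cents, the depth is 60 cents and the rate is 5 Hz.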

Similarly, FIG. 4 shows the pitch waveform of the VB region. As described above, the VB region is the part following the vibrato attack in which the pitch varies periodically. Looping the VB-region data for as long as vibrato is to be added yields vibrato longer than the template section length. The information that the template creation unit 900 records in the VB template is the same as that recorded in the VA template, so its description is omitted. The VR region can be described in the same way as the VA and VB regions, so its description is also omitted.

A-2-2. Extraction of Gain and Slope
FIG. 5 illustrates the spectrum of one frame in the VA region, and FIG. 6 explains the curve representing the slope of the spectrum shown in FIG. 5. In each figure, the horizontal axis is the frequency f [Hz] and the vertical axis is the amplitude value Magnitude [dB]. In FIG. 5, the spectral envelope is drawn as a solid line and the curve representing the slope (hereinafter, Ecurve) as a broken line. First, EcurveMag, the amplitude value of Ecurve, can be expressed by equation (1) below.
EcurveMag(f) = Gain + 100 * (e^(-Slope * f) - 1) ... (1)
Gain: gain of the frame
Slope: slope of the frame

Illustrated more concretely, this equation looks like FIG. 6. That is, Ecurve starts from Gain [dB] at frequency f = 0 [Hz] and approaches the asymptote (Gain - 100) [dB] as f increases. How steeply Ecurve falls depends on the slope. Equation (1) shows that the gain purely changes the magnitude of the signal, while the slope controls its frequency characteristic (change in timbre), as described later.
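Equation (1) can be evaluated directly, as in the following minimal sketch. The function name is hypothetical, and a positive Slope is assumed so that the curve decays toward the (Gain - 100) asymptote described in the text.

```python
import math

def ecurve_mag_db(f_hz, gain_db, slope):
    """Equation (1): Gain + 100 * (exp(-Slope * f) - 1), in dB (slope assumed positive)."""
    return gain_db + 100.0 * (math.exp(-slope * f_hz) - 1.0)
```

At f = 0 the function returns Gain itself, and for large f it approaches Gain - 100, matching the behavior described for FIG. 6. A larger slope makes the curve reach the asymptote at a lower frequency.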

By analyzing the spectrum of each frame in the VA region in this way, the template creation unit 900 obtains the gain and slope of each frame. At this time, along with the per-frame gain and slope, the template creation unit 900 generates additional information, namely the start gain (mGain [dB]), start slope (mSlope), start tremolo depth (mBeginTremoloDepth [dB]), and end tremolo depth (mEndTremoloDepth [dB]) (none of which are shown in the figures), and records it in the VA template. Here, the start gain (mGain [dB]) is the gain of the first frame of the VA region, and the start slope (mSlope) is the spectral slope of the first frame of the VA region. The start tremolo depth (mBeginTremoloDepth [dB]) is the difference between the maximum and minimum gain in the start vibrato period described above, and the end tremolo depth (mEndTremoloDepth [dB]) is the difference between the maximum and minimum gain in the end vibrato period.

Like the additional information obtained from the pitch waveform, this additional information is used when the template is modified to obtain a desired tremolo depth or the like (described later). The per-frame gain and slope of the VB region and their additional information are generated in the same way as for the VA region and recorded in the VB template. Likewise, the per-frame gain and slope of the VR region and their additional information are generated and recorded in the VR template.

The template creation unit 900 stores the template set TS, consisting of the VA, VB, and VR templates created in this way, in the vibrato database 800. As a result, the vibrato database 800 holds data indicating the pitch, spectral slope, and gain in each region obtained when the vibrato-bearing target voice is divided into the VA, VB, and VR regions.

A-3. Method of Adding Vibrato
Vibrato is added to the input voice using the template set TS basically by modifying the spectrum of each frame of the input voice based on the delta values (ΔPitch [cent], ΔGain [dB], ΔSlope) taken relative to the start pitch (mPitch [cent]), start gain (mGain [dB]), and start slope (mSlope) of the VA region.

Because each delta value is computed relative to the data values of the VA region, no discontinuity arises where the vibrato attack, vibrato body, and vibrato release are joined. At the start of the vibrato the VA-region data is used once, followed by the VB-region data. The VB-region data is looped as described above, which yields vibrato longer than the template section length of the VB region. At the end of the vibrato, the VR-region data is used once after the VB-region data. The VR-region data need not be used, however; the VB-region data may instead be repeated until the end of the vibrato. The pitch change, gain change, and slope change produced when vibrato is added to the input voice using the template set TS are described below, assuming that vibrato is added using the VB-region data.
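The ordering of template data described here (VA once, VB looped, VR optionally once at the end) can be sketched as follows. This is an illustration only: the function name is hypothetical, the total vibrato length is assumed to be known in advance, the VB data is assumed to be non-empty, and dropping the release when the vibrato is too short is a choice made for this sketch, not something stated in the text.

```python
def template_frame_sequence(va, vb, vr, total_frames, use_release=True):
    """Return total_frames template frames: VA once, VB looped, then optionally VR once."""
    tail = list(vr) if use_release else []
    out = list(va[:total_frames])
    body_len = total_frames - len(out) - len(tail)
    if body_len < 0:
        # Too short to fit the release (sketch-only choice): skip VR and fill with VB.
        tail = []
        body_len = total_frames - len(out)
    out += [vb[i % len(vb)] for i in range(body_len)]  # loop the VB data
    return out + tail
```

Setting `use_release=False` corresponds to the variant in which the VB-region data is repeated until the end of the vibrato.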

A-3-1. Pitch Change
Let Time [s] denote the current time measured from the start of the vibrato, and let DBPitch(Time) [cent] denote the pitch of the VB region at Time [s]. The pitch delta value (ΔPitch [cent]) is then given by equation (2) below, where mPitch [cent] is the start pitch of the VA region, as described above.
ΔPitch = DBPitch(Time) - mPitch ... (2)

As this equation shows, ΔPitch is the deviation of the current VB-region pitch (DBPitch(Time) [cent]) from the start pitch of the VA region (mPitch [cent]). The ΔPitch obtained from the VB template in this way is finally added to the pitch of the input voice (input Pitch [cent]) by equation (3) below, which gives the pitch of the output voice after vibrato has been added (output Pitch [cent]).
output Pitch = input Pitch + ΔPitch ... (3)
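Equations (2) and (3) amount to adding the template's pitch deviation to the input pitch, frame by frame. The sketch below shows this with hypothetical function names; indexing the VB pitch data by frame number with wrap-around stands in for DBPitch(Time), and the cents-to-ratio helper uses the standard definition of the cent (1200 cents per octave).

```python
def delta_pitch_cent(db_pitch_cent, m_pitch_cent):
    """Equation (2): deviation of the template pitch from the VA start pitch."""
    return db_pitch_cent - m_pitch_cent

def output_pitch_cent(input_pitch_cent, vb_pitch_cent, m_pitch_cent, frame_index):
    """Equation (3): input pitch plus the template deviation for this frame (VB data looped)."""
    db_pitch = vb_pitch_cent[frame_index % len(vb_pitch_cent)]
    return input_pitch_cent + delta_pitch_cent(db_pitch, m_pitch_cent)

def cents_to_ratio(cents):
    """Frequency ratio corresponding to an interval in cents."""
    return 2.0 ** (cents / 1200.0)
```

Because only the deviation is transferred, the input voice keeps its own pitch center while inheriting the shape of the target singer's vibrato.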

A-3-2. Gain Change
As above, let DBGain(Time) denote the gain of the VB region at Time [s]. The gain delta value (ΔGain [dB]) is then given by equation (4) below, where mGain [dB] is the start gain of the VA region, as described above.
ΔGain = DBGain(Time) - mGain ... (4)
The ΔGain obtained from the VB template in this way is finally added to the gain of the input voice (input Gain [dB]) by equation (5) below, which gives the gain of the output voice after vibrato has been added (output Gain [dB]).
output Gain = input Gain + ΔGain ... (5)

FIGS. 7 and 8 show how the spectral envelope and Ecurve change when Gain is changed: FIG. 7 shows the change when Gain is increased (ΔGain positive), and FIG. 8 the change when Gain is decreased (ΔGain negative). The spectral envelopes and Ecurves in FIGS. 7 and 8 are both referenced to those shown in FIG. 5 above.
As a comparison of FIG. 7 with FIG. 5 and of FIG. 8 with FIG. 5 makes clear, changing Gain (the intercept of Ecurve in each figure) merely shifts Ecurve and the spectral envelope up or down with their shapes preserved; the shape of the spectral envelope itself does not change. In other words, changing Gain changes only the overall volume (a larger Gain gives a higher overall volume, a smaller Gain a lower one); the timbre does not change.
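The uniform shift described for FIGS. 7 and 8 can be illustrated by applying ΔGain from equation (4) to every bin magnitude of a frame. The function name is hypothetical, and applying the same offset to all bins is this sketch's reading of equation (5) together with the statement that the envelope shape is preserved.

```python
def apply_gain_delta(input_bins_db, db_gain_db, m_gain_db):
    """Shift every bin magnitude [dB] by delta_gain = DBGain(Time) - mGain (equations (4), (5))."""
    delta_gain = db_gain_db - m_gain_db
    return [mag + delta_gain for mag in input_bins_db]
```

The differences between neighboring bins are unchanged by this operation, which is why only the volume changes and the timbre does not.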

A-3-3. Slope Change
As above, let DBSlope(Time) denote the slope of the VB region at Time [s]. The slope delta value (ΔSlope) is then given by equation (6) below, where mSlope is the start slope of the VA region, as described above.
ΔSlope = DBSlope(Time) / mSlope ... (6)
The ΔSlope obtained in this way and mSlope are then substituted into equation (7) below to obtain Diff [dB] for each frequency Freq [Hz] of the input voice.
Diff = 100 * (exp(mSlope * ΔSlope * Freq) - 1.0) - 100 * (exp(mSlope * Freq) - 1.0) ... (7)

Diff obtained in this way is then substituted into equation (8) below to obtain the output binMagnitude [dB]. In equation (8), the input binMagnitude [dB] is the magnitude of the component at frequency Freq [Hz] in the spectrum of the input voice (see FIG. 9).
Output binMagnitude = Input binMagnitude + Diff ... (8)
This calculation is performed for every bin (frequency) to obtain the output binMagnitude [dB] of each bin. As a result, the slope of the spectrum obtained by spectral analysis of the input voice changes over time.
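Equations (6) to (8) can be sketched in Python as follows. This is only an illustrative sketch: the function names and the representation of the spectrum as parallel lists of bin frequencies and magnitudes are assumptions, not part of the embodiment.

```python
import math

def slope_diff_db(freq_hz, m_slope, db_slope_at_time):
    """Diff [dB] for one bin, following equations (6) and (7)."""
    delta_slope = db_slope_at_time / m_slope                 # equation (6)
    changed = 100.0 * (math.exp(m_slope * delta_slope * freq_hz) - 1.0)
    original = 100.0 * (math.exp(m_slope * freq_hz) - 1.0)
    return changed - original                                # equation (7)

def apply_slope_change(bin_freqs_hz, bin_mags_db, m_slope, db_slope_at_time):
    """Output binMagnitude [dB] for every bin, following equation (8)."""
    return [mag + slope_diff_db(freq, m_slope, db_slope_at_time)
            for freq, mag in zip(bin_freqs_hz, bin_mags_db)]
```

In this sketch Diff is zero at Freq = 0 for any ΔSlope, which matches the statement that changing Slope leaves the intercept (Gain) of ECurve unchanged.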

FIGS. 10 and 11 show how the spectral envelope changes when the Slope of ECurve is changed by the above operation. FIG. 10 shows the change when Slope is increased, and FIG. 11 shows the change when Slope is decreased. The spectral envelopes and ECurves in FIGS. 10 and 11 are both drawn relative to those shown in FIG. 5.
As a comparison of FIG. 10 with FIG. 5 and of FIG. 11 with FIG. 5 makes clear, changing Slope leaves Gain (the intercept of ECurve in each figure), which represents the overall volume, unchanged, but the shape of the spectral envelope changes with the Slope of ECurve, and this changes the timbre. More specifically, when Slope is increased as shown in FIG. 10, the high-frequency part of the spectrum is suppressed, giving a muffled timbre. Conversely, when Slope is decreased as shown in FIG. 11, the spectrum is present evenly from the low range to the high range, giving a bright timbre.

Next, a method of using this VB template to obtain a desired vibrato rate, a desired pitch depth (depth of the pitch wave), and a desired tremolo depth (depth of the gain wave) is described.

A-4. Obtaining a desired vibrato rate
To obtain a desired vibrato rate, the reading time (speed) of the template is first changed according to equations (9) and (10) below. In these equations, VibRate [Hz] is the desired vibrato rate, and mBeginRate [Hz] and mEndRate [Hz] are the start and end vibrato rates of the template, respectively. Time [s] is the time measured from the start of template reading (taken as 0), and NewTime [s] is the reading time of the VB template obtained from equation (10).
VibRateFactor = VibRate / {(mBeginRate + mEndRate) / 2} ... (9)
NewTime = Time * VibRateFactor ... (10)
Changing the reading time (speed) of the template in this way realizes the desired pitch variation.
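As an illustration, equations (9) and (10) amount to the following Python sketch. The function names are hypothetical and chosen only for this example.

```python
def vib_rate_factor(vib_rate_hz, m_begin_rate_hz, m_end_rate_hz):
    """Equation (9): desired rate divided by the template's mean rate."""
    return vib_rate_hz / ((m_begin_rate_hz + m_end_rate_hz) / 2.0)

def new_time(time_s, factor):
    """Equation (10): template reading time after rate scaling."""
    return time_s * factor
```

For a template whose rate runs from 5 Hz to 7 Hz, requesting 6 Hz gives a factor of 1 and the template is read at its recorded speed; requesting 12 Hz gives a factor of 2 and the template is read twice as fast.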

A-5. Obtaining a desired pitch depth
For the pitch depth, a new ΔPitch is obtained from equation (11) below, which realizes the desired pitch depth. In equation (11), PitchDepth [cent] is the desired pitch depth, and mBeginDepth [cent] and mEndDepth [cent] are the start and end vibrato (pitch) depths of the VB template, respectively. Time [s] is the template reading time measured from the start of template reading (taken as 0), and ΔPitch(Time) [cent] is the delta value of the pitch at Time [s].
ΔPitch = ΔPitch(Time) * PitchDepth / {(mBeginDepth + mEndDepth) / 2} ... (11)

However, when the template reading time has been changed to NewTime [s] as described above, the new ΔPitch is obtained from equation (11)' below.
ΔPitch = ΔPitch(NewTime) * PitchDepth / {(mBeginDepth + mEndDepth) / 2} ... (11)'

A-6. Obtaining a desired tremolo depth
For the tremolo depth, a new ΔGain [dB] is obtained from equation (12) below, which realizes the desired tremolo depth. In equation (12), TremoloDepth [dB] is the desired tremolo depth, and mBeginTremoloDepth [dB] and mEndTremoloDepth [dB] are the start and end tremolo depths of the VB template, respectively. Time [s] is the template reading time measured from the start of template reading (taken as 0), and ΔGain(Time) [dB] is the delta value of the gain at Time [s].
ΔGain = ΔGain(Time) * TremoloDepth / {(mBeginTremoloDepth + mEndTremoloDepth) / 2} ... (12)

However, when the template reading time has been changed to NewTime [s] as described above, the new ΔGain is obtained from equation (12)' below.
ΔGain = ΔGain(NewTime) * TremoloDepth / {(mBeginTremoloDepth + mEndTremoloDepth) / 2} ... (12)'

Because the change in timbre is considered to be linked mainly to loudness, a new Diff [dB] is obtained from equation (13) below. Diff [dB] on the right-hand side of equation (13) is the Diff [dB] obtained from equation (7) above.
Diff = Diff * TremoloDepth / {(mBeginTremoloDepth + mEndTremoloDepth) / 2} ... (13)

However, when the template reading time has been changed to NewTime [s] as described above, Diff [dB] at NewTime [s] is obtained using equations (6) and (7) above, and this Diff [dB] is substituted into the right-hand side of equation (13) to obtain the new Diff [dB].
The above description used the case of adding vibrato with the VB template as an example. Adding vibrato with the VA template or the VR template can be described in the same way, so further description is omitted.
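The depth scaling of equations (11) to (13) can be sketched in Python as below. The TemplateDepths container and the scale_deltas function are hypothetical names introduced for this example; the delta values passed in are assumed to have been read from the template already (at Time or NewTime).

```python
from dataclasses import dataclass

@dataclass
class TemplateDepths:
    m_begin_depth: float          # start pitch depth of the template [cent]
    m_end_depth: float            # end pitch depth of the template [cent]
    m_begin_tremolo_depth: float  # start tremolo depth of the template [dB]
    m_end_tremolo_depth: float    # end tremolo depth of the template [dB]

def scale_deltas(delta_pitch, delta_gain, diff_per_bin,
                 pitch_depth, tremolo_depth, tpl):
    """Scale template deltas to the requested pitch and tremolo depths."""
    pitch_k = pitch_depth / ((tpl.m_begin_depth + tpl.m_end_depth) / 2.0)
    tremolo_k = tremolo_depth / (
        (tpl.m_begin_tremolo_depth + tpl.m_end_tremolo_depth) / 2.0)
    new_delta_pitch = delta_pitch * pitch_k              # equation (11)
    new_delta_gain = delta_gain * tremolo_k              # equation (12)
    new_diff = [d * tremolo_k for d in diff_per_bin]     # equation (13)
    return new_delta_pitch, new_delta_gain, new_diff
```

Note that the timbre term Diff is scaled by the tremolo factor rather than the pitch factor, reflecting the statement that timbre change is linked mainly to loudness.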

A-7. Operation of the embodiment
The operation of adding vibrato using the voice processing device 100 according to this embodiment is described below. The following description assumes that vibrato is added without using the data of the VR region.
FIGS. 12 and 13 are flowcharts showing the vibrato addition process executed by the vibrato addition control unit 700. When the pitch of the input voice is detected and the voice is determined to be voiced, the following process for adding vibrato to that voice starts. The vibrato addition control unit 700 receives, from the parameter input unit 600, the Vib parameters entered by an amateur singer or the like to control the vibrato (step S1). The Vib parameters entered at the parameter input unit 600 include, for example, the vibrato delay (VibDelay [s]), which is the length of time from detection of a voiced sound until vibrato is actually applied; the vibrato rate (VibRate [Hz]), which sets the period of the vibrato; the pitch depth (PitchDepth [cent]), which is the depth of the pitch fluctuation in the vibrato; and the tremolo depth (TremoloDepth [dB]), which is the depth of the volume fluctuation in the vibrato. The start of vibrato application (that is, the start of applying the vibrato attack portion) may be triggered by detection of a voiced sound as described above, or alternatively by detection of a pitch change. With the latter arrangement, when voiced sounds of different pitches are input in succession, for example, the vibrato attack portion can be applied only to the voice at the first pitch, or to both the voice at the first pitch and the voice at the next pitch. These are of course only examples, and how the result of pitch-change detection is used to start vibrato application can be set as appropriate.

Next, the vibrato addition control unit 700 initializes the algorithm for adding vibrato. Here, for example, the flags VibAttackFlag and VibBodyFlag are set to "1" (step S2). These flags determine whether the VA template or the VB template is used: when VibAttackFlag is set to "1", the VA template is used; when VibAttackFlag is "0" and VibBodyFlag is set to "1", the VB template is used; and when both VibAttackFlag and VibBodyFlag are set to "0", vibrato addition has ended. As described above, vibrato addition starts once the time indicated by the vibrato delay (VibDelay [s]) of the Vib parameters has elapsed after the input voice was detected as a voiced sound.
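The meaning of the two flags can be summarized by a small Python sketch. The function name and the phase labels are assumptions made for illustration only.

```python
def current_phase(vib_attack_flag: int, vib_body_flag: int) -> str:
    """Map the two flags to the template in use, as described for step S2."""
    if vib_attack_flag == 1:
        return "attack"   # VA template is used
    if vib_body_flag == 1:
        return "body"     # VB template is used
    return "done"         # both flags are 0: vibrato addition has ended
```

Immediately after step S2 both flags are "1", so the process starts with the VA template.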

Proceeding to step S3, the vibrato addition control unit 700 selects, from the multiple template sets TS stored in the vibrato database 800, the template set TS closest to the pitch, detected in step S1, of the input voice supplied via the voice input unit 200 (in this operation example, a template set TS consisting of a VA template and a VB template) (step S3). The creation of the template sets TS was described in detail in A-2 above and is not repeated here.
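The nearest-pitch selection of step S3 might look like the following Python sketch. It assumes, purely for illustration, that each template set is a dictionary carrying a representative pitch under a hypothetical "pitch_cent" key; the source does not say how the pitch of a template set is stored.

```python
def select_template_set(template_sets, input_pitch_cent):
    """Return the template set whose pitch is closest to the input pitch."""
    return min(template_sets,
               key=lambda ts: abs(ts["pitch_cent"] - input_pitch_cent))
```

With template sets recorded at 4800, 6000, and 7200 cents, an input pitch of 6100 cents would select the set recorded at 6000 cents.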

The vibrato addition control unit 700 then acquires the time length of the selected VA template (VibAttackDuration [s]) and the time length of the VB template (VibBodyDuration [s]) (step S4), and proceeds to step S5.

In step S5, the vibrato addition control unit 700 checks the value of the flag VibAttackFlag. If the value is "1", the process proceeds to step S6; if it is "0", the process proceeds to step S10.

In step S6, the vibrato addition control unit 700 reads the VA template from the template set TS selected in step S3 and uses it as DBdata (DBPitch, DBGain, DBSlope). It also calculates VibRateFactor using equation (9) above.

Then, in step S7, the vibrato addition control unit 700 uses the calculated VibRateFactor to compute the template reading time from equation (10) above, obtaining NewTime [s].

Proceeding to step S8, the vibrato addition control unit 700 determines whether the template reading time NewTime [s] obtained above exceeds the time length VibAttackDuration [s] of the VA template (see step S4). If it determines that NewTime [s] exceeds VibAttackDuration [s] (that is, the VA template has been used through to its end) (step S8; YES), it proceeds to step S9. If it determines that NewTime [s] does not exceed VibAttackDuration [s] (step S8; NO), it proceeds to step S15 to add vibrato using the VA template.

A-7-1. When the process proceeds to step S15 (step S8; NO)
Proceeding to step S15, the vibrato addition control unit 700 obtains the value of each parameter (Pitch, Gain, Slope) at time NewTime [s] from the DBdata of the VA template. If NewTime [s] falls between the frame times at which the VA template holds actual data, the value of each parameter is obtained by interpolating (for example, linearly) between the values in the frames before and after NewTime [s].
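The linear interpolation mentioned for step S15 can be sketched in Python as follows. The data layout (a sorted list of frame times and a parallel list of (Pitch, Gain, Slope) tuples) and the clamping at the template ends are assumptions for this example, not details given in the source.

```python
from bisect import bisect_right

def interpolate_frame(frame_times, frames, t):
    """Linearly interpolate (pitch, gain, slope) at time t.

    frame_times must be sorted in ascending order; frames[i] holds the
    parameter tuple recorded at frame_times[i].
    """
    if t <= frame_times[0]:
        return frames[0]          # clamp before the first frame (assumption)
    if t >= frame_times[-1]:
        return frames[-1]         # clamp after the last frame (assumption)
    i = bisect_right(frame_times, t)            # first frame strictly after t
    t0, t1 = frame_times[i - 1], frame_times[i]
    w = (t - t0) / (t1 - t0)                    # 0 at t0, approaching 1 at t1
    return tuple(a + w * (b - a) for a, b in zip(frames[i - 1], frames[i]))
```

When t coincides with a frame time, the weight w is 0 and the recorded values of that frame are returned unchanged.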

The vibrato addition control unit 700 then obtains the delta value of each parameter (ΔPitch, ΔGain, ΔSlope) (step S16). As described in A-5 and A-6 above, the values of PitchDepth [cent] and TremoloDepth [dB] are reflected when these delta values are obtained (see equations (11) to (13) above).

Proceeding to step S17, the vibrato addition control unit 700 obtains the coefficient Muldelta shown in FIG. 14. Muldelta is a coefficient for making the vibrato converge by gradually reducing the Δ value of each parameter from the time at which a vibrato end condition is satisfied (for example, when the pitch of the input voice is no longer detected; details are given later). If vibrato were simply stopped at the moment the desired vibrato duration was reached, without such processing, the pitch and other parameters would change abruptly at that moment. In this embodiment, the coefficient Muldelta is obtained to avoid this problem. As the graph in FIG. 14 shows, before the time at which the vibrato end condition is satisfied, Muldelta is set to "1" and no convergence processing is performed.
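A possible shape for Muldelta is sketched below in Python. The text only states that the coefficient is 1 before the end condition and then decreases gradually to 0; the linear ramp and the release_s parameter used here are assumptions, since the exact curve of FIG. 14 is not reproduced in the text.

```python
def mul_delta(time_s, vib_end_time_s=None, release_s=0.2):
    """Convergence coefficient: 1 until the end condition, then ramp to 0.

    vib_end_time_s is None while no end condition has been met.
    release_s (hypothetical) is the time taken to fade from 1 to 0.
    """
    if vib_end_time_s is None or time_s <= vib_end_time_s:
        return 1.0
    remaining = 1.0 - (time_s - vib_end_time_s) / release_s
    return max(0.0, remaining)
```

Multiplying ΔPitch, ΔGain, and ΔSlope by this value leaves them untouched during normal vibrato and shrinks them smoothly to zero after the end condition, avoiding the abrupt change described above.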

Proceeding to step S18, the vibrato addition control unit 700 multiplies each delta value (ΔPitch, ΔGain, ΔSlope) obtained in step S16 by the coefficient Muldelta obtained in step S17. It then supplies the result to the vibrato adding unit 400 and returns to step S7. On receiving the result from the vibrato addition control unit 700, the vibrato adding unit 400 applies a spectral transformation to the previously obtained pitch, gain, and slope of the input voice according to the delta values (ΔPitch, ΔGain, ΔSlope) in that result. That is, the vibrato adding unit 400 calculates the output Pitch [cent] using equation (3) above, the output Gain [dB] using equation (5) above, and the output binMagnitude [dB] using equations (7) and (8) above, and from these obtains a spectrum whose pitch, gain, and slope have been changed. The vibrato adding unit 400 then applies an inverse FFT to this spectrum and supplies the resulting voice to the voice output unit 500. As a result, the voice output unit 500 outputs a voice with a natural timbre change, as if the target singer were singing.

A-7-2. When the process proceeds to step S9 (step S8; YES)
If, on the other hand, the vibrato addition control unit 700 determines in step S8 that the template reading time NewTime [s] exceeds the time length VibAttackDuration [s] of the VA template, it proceeds to step S9, sets the flag VibAttackFlag to "0" to end the vibrato attack, and records the time Time [s] at that moment as VibAttackEndTime [s].
The vibrato addition control unit 700 then checks the flag VibBodyFlag to determine whether to use the VB template (step S10). If the value of VibBodyFlag is "0" (step S10; NO), it ends the vibrato addition process; if the value is "1" (step S10; YES), it proceeds to step S11.

On confirming that VibAttackFlag is set to "0" and VibBodyFlag is set to "1", the vibrato addition control unit 700 reads the VB template and uses it as DBdata (DBPitch, DBGain, DBSlope).

Then, in step S12, the vibrato addition control unit 700 calculates VibRateFactor using equation (9) above and computes the template reading time from equation (10) above, obtaining NewTime [s]. Care is needed, however, when calculating NewTime [s] once the VB template has looped. Having obtained NewTime [s], the vibrato addition control unit 700 determines whether the vibrato should be ended (step S13). Specifically, it determines that the vibrato should be ended when, for example, the pitch of the input voice is no longer detected (such as when no voiced sound is detected), an instruction to end the vibrato is given via an operation unit (not shown), or the volume of the input voice falls to or below a threshold.
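The source does not say how NewTime [s] is handled when the VB template loops. One plausible reading, shown here strictly as an assumption, is to measure time from VibAttackEndTime [s], scale it by VibRateFactor, and wrap the result to the template length VibBodyDuration [s]. The function name is hypothetical.

```python
def body_read_time(time_s, vib_attack_end_time_s, factor, vib_body_duration_s):
    """Reading time inside a looping VB template (assumed wrap-around)."""
    elapsed = (time_s - vib_attack_end_time_s) * factor   # cf. equation (10)
    return elapsed % vib_body_duration_s                  # restart at loop end
```

A simple wrap like this only sounds continuous if the template's end joins smoothly onto its beginning, which is presumably the reason the text calls for care at this point.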

If the vibrato addition control unit 700 determines that the vibrato should not be ended (step S13; NO), it proceeds to step S15'. If it determines that the vibrato should be ended (step S13; YES), it proceeds to step S14, sets the flag VibConFlag, which indicates whether the vibrato should be made to converge, to "1", and then proceeds to step S15'. The vibrato is not made to converge while VibConFlag is "0" and is made to converge while VibConFlag is "1" (see step S17 for vibrato convergence).

Proceeding to step S15', the vibrato addition control unit 700 obtains the value of each parameter (Pitch, Gain, Slope) at time NewTime [s] from the DBdata of the VB template. If NewTime [s] falls between the frame times at which the VB template holds actual data, the value of each parameter is obtained by interpolating (for example, linearly) between the values in the frames before and after NewTime [s]. The vibrato addition control unit 700 then obtains the delta value of each parameter (ΔPitch, ΔGain, ΔSlope) (step S16'). As described in A-5 and A-6 above, the values of PitchDepth [cent] and TremoloDepth [dB] are reflected when these delta values are obtained (see equations (11) to (13) above).

Proceeding to step S17', the vibrato addition control unit 700 refers to the value of the flag VibConFlag to determine whether the time at which the vibrato end condition was satisfied has been reached. If VibConFlag is "0" and it determines that this time has not yet been reached, it proceeds to step S18' and obtains the coefficient Muldelta = "1" (that is, a value that does not make the vibrato converge). It then proceeds to step S19' and multiplies each delta value (ΔPitch, ΔGain, ΔSlope) obtained in step S16' by the coefficient Muldelta = "1" obtained in step S18'. It supplies the result to the vibrato adding unit 400 and returns to step S12.

If, on the other hand, the vibrato addition control unit 700 determines in step S17' that VibConFlag is "1" and the time at which the vibrato end condition was satisfied has been reached, it proceeds to step S18'' and obtains a coefficient Muldelta that gradually reduces the Δ value of each parameter and makes the vibrato converge (see the Vib end time shown in FIG. 14). It then proceeds to step S19'', multiplies each delta value (ΔPitch, ΔGain, ΔSlope) obtained in step S16' by the value of Muldelta obtained in step S18'', and supplies the result to the vibrato adding unit 400. Proceeding to step S20'', the vibrato addition control unit 700 determines whether the coefficient Muldelta obtained in step S18'' has converged to "0". If it determines that Muldelta has not yet converged to "0" (step S20''; NO), it returns to step S12 and repeats the processing described above. When, during this repetition, it determines that the Muldelta obtained in step S18'' has converged to "0" (step S20''; YES), it sets the value of the flag VibBodyFlag to "0" (step S21'') and ends the processing described above. The operation after the result is supplied from the vibrato addition control unit 700 to the vibrato adding unit 400 is the same as when vibrato is added using the VA template, so its description is omitted.

The above description covered adding vibrato without using the data of the VR region, but vibrato may of course be added using the data of the VR region (that is, using the VR template). In that case, control is performed so that vibrato is added using the VR template once the vibrato end condition described above is satisfied. When the VR template is used, there is no need to compute the coefficient Muldelta for making the vibrato converge, which reduces the processing load on the control unit of the voice processing device 100.

As described above, this embodiment reproduces not only the subtle pitch and gain fluctuations of the vibrato applied by the target singer but also the natural change in timbre. This makes it possible to add to an amateur singer's voice a more realistic vibrato than conventional karaoke devices and the like could achieve.
In addition, in this embodiment the data obtained by analyzing the voice of the target singer with vibrato is divided into and held as a vibrato attack region, a vibrato body region, and a vibrato release region, so a more realistic vibrato can be added than with a karaoke device or the like that holds only the data of the vibrato body region.

Furthermore, in this embodiment, even when the template of the vibrato release region is not used, the vibrato is attenuated gradually by means of the coefficient Muldelta. This allows the vibrato to end naturally.
Also, in this embodiment, multiple template sets TS obtained by analyzing the target singer's voice singing with vibrato at various pitches are stored in the vibrato database. With this configuration, when vibrato is added to an input voice, the template set TS closest to the pitch of that input voice can be selected and used, which makes it possible to add realistic vibrato.

Although this embodiment has been described using a singing voice as an example, it can also be applied to, for example, ordinary conversational speech or instrument sounds. The functions of the units of the voice processing device 100 described above are realized by a program stored in a ROM or the like, so the program may be recorded on a recording medium such as a CD-ROM and distributed, or distributed via a communication network such as the Internet. The functions of the units of the voice processing device 100 may of course also be realized in hardware.

FIG. 1 is a diagram showing the configuration of the voice processing apparatus according to the present embodiment.
FIG. 2 is a diagram showing the pitch waveform of a target voice with vibrato applied according to the embodiment.
FIG. 3 is a diagram showing the pitch waveform of the VA area according to the embodiment.
FIG. 4 is a diagram showing the pitch waveform of the VB area according to the embodiment.
FIG. 5 is a diagram illustrating the spectrum of a frame in the VA area according to the embodiment.
FIG. 6 is a diagram for explaining the curve representing the slope of the spectrum shown in FIG. 5 according to the embodiment.
FIG. 7 is a diagram showing how the spectrum envelope and ECurve change when the magnitude of Gain is changed according to the embodiment.
FIG. 8 is a diagram showing how the spectrum envelope and ECurve change when the magnitude of Gain is changed according to the embodiment.
FIG. 9 is a diagram illustrating the input binMagnitude according to the embodiment.
FIG. 10 is a diagram showing how the spectrum envelope changes when the Slope of ECurve is changed according to the embodiment.
FIG. 11 is a diagram showing how the spectrum envelope changes when the Slope of ECurve is changed according to the embodiment.
FIG. 12 is a flowchart showing the vibrato addition process according to the embodiment.
FIG. 13 is a flowchart showing the vibrato addition process according to the embodiment.
FIG. 14 is a diagram showing the relationship between the coefficient Muldelta and vibrato time according to the embodiment.

Explanation of symbols

100 ... voice processing apparatus, 200 ... voice input unit, 300 ... spectrum/pitch analysis unit, 400 ... vibrato adding unit, 500 ... voice output unit, 600 ... parameter input unit, 700 ... vibrato addition control unit, 800 ... vibrato database, 900 ... template creation unit, 950 ... target input unit, TS ... template set.

Claims

A voice processing apparatus comprising:
storage means for storing the pitch and the spectrum slope of a target voice to which vibrato is applied;
input means for inputting a voice;
pitch extraction means for extracting the pitch of the input voice;
pitch changing means for changing the extracted pitch of the voice according to the pitch of the target voice;
spectrum slope changing means for changing, according to the spectrum slope of the target voice, the slope of the spectrum of the voice obtained by performing spectrum analysis on the input voice; and
output means for outputting a voice having the changed pitch and the changed spectrum slope.

The voice processing apparatus according to claim 1, wherein the storage means stores a gain of the target voice in addition to the pitch and the spectrum slope of the target voice,
the apparatus further comprises gain extraction means for extracting the gain of the input voice, and gain changing means for changing the extracted gain of the voice according to the gain of the target voice, and
the output means outputs a voice having the changed pitch, the changed spectrum slope, and the changed gain.

The voice processing apparatus according to claim 1, wherein the storage means stores a plurality of types of pitches of the target voice to which vibrato is applied and a spectrum slope corresponding to each pitch,
the pitch changing means selects, from the plurality of types of pitches of the target voice, the pitch closest to the extracted pitch of the voice, and changes the extracted pitch of the voice according to the selected pitch of the target voice, and
the spectrum slope changing means selects, from the plurality of types of spectrum slopes of the target voice, the spectrum slope corresponding to the pitch of the target voice selected by the pitch changing means, and changes the slope of the spectrum of the voice according to the selected spectrum slope of the target voice.

The voice processing apparatus, wherein the storage means stores the pitch and the spectrum slope in each area obtained when the target voice to which vibrato is applied is divided into at least a vibrato attack area and a vibrato body area.

The voice processing apparatus according to any one of claims 3 and 4, further comprising:
target input means for inputting the target voice to which vibrato is applied;
target pitch extraction means for extracting the pitch of the input target voice and storing it in the storage means; and
target spectrum slope calculation means for calculating the slope of the spectrum of the target voice by performing spectrum analysis on the input target voice, and storing the slope in the storage means.
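One way to picture the calculation of a spectrum slope from an analyzed frame is an ordinary least-squares line fitted to the magnitude in dB against frequency. This is a swapped-in illustration, not necessarily the curve the embodiment uses. The function name, the dB/Hz unit, and the magnitude floor of 1e-12 are all assumptions.

```python
import math


def spectrum_slope_db_per_hz(freqs_hz, magnitudes):
    """Least-squares slope of 20*log10(magnitude) versus frequency.

    Assumption for illustration: the slope is expressed in dB per Hz.
    """
    n = len(freqs_hz)
    if n < 2 or n != len(magnitudes):
        raise ValueError("need at least two bins and matching lengths")
    # Floor the magnitudes so that silent bins do not produce log10(0).
    db = [20.0 * math.log10(max(m, 1e-12)) for m in magnitudes]
    mean_f = sum(freqs_hz) / n
    mean_db = sum(db) / n
    var_f = sum((f - mean_f) ** 2 for f in freqs_hz)
    if var_f == 0.0:
        raise ValueError("all bins share the same frequency")
    cov = sum((f - mean_f) * (d - mean_db) for f, d in zip(freqs_hz, db))
    return cov / var_f
```

Computing this value for each frame of the target voice would give one slope per frame, which could then be stored alongside the frame's pitch.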

A voice processing method comprising:
a storing step of storing the pitch and the spectrum slope of a target voice to which vibrato is applied;
an input step of inputting a voice;
a pitch extraction step of extracting the pitch of the input voice;
a pitch changing step of changing the extracted pitch of the voice according to the pitch of the target voice;
a spectrum slope changing step of changing, according to the spectrum slope of the target voice, the slope of the spectrum of the voice obtained by performing spectrum analysis on the input voice; and
an output step of outputting a voice having the changed pitch and the changed spectrum slope.

A voice processing program for causing a computer to function as:
storage means for storing the pitch and the spectrum slope of a target voice to which vibrato is applied;
input means for inputting a voice;
pitch extraction means for extracting the pitch of the input voice;
pitch changing means for changing the extracted pitch of the voice according to the pitch of the target voice;
spectrum slope changing means for changing, according to the spectrum slope of the target voice, the slope of the spectrum of the voice obtained by performing spectrum analysis on the input voice; and
output means for outputting a voice having the changed pitch and the changed spectrum slope.
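As a rough illustration of what a spectrum slope changing means might do to one analyzed frame, the sketch below tilts the bin magnitudes so that a spectrum with the input slope takes on the target slope. The function name, the dB/Hz unit, and the purely linear tilt are assumptions. The claims do not prescribe this computation.

```python
def apply_slope_change(freqs_hz, magnitudes, input_slope, target_slope):
    """Tilt linear magnitudes by the slope difference (assumed unit: dB/Hz).

    Each bin is scaled by the gain that a straight line with slope
    (target_slope - input_slope) would give at that bin's frequency.
    """
    if len(freqs_hz) != len(magnitudes):
        raise ValueError("frequency and magnitude lists must match in length")
    delta = target_slope - input_slope
    return [m * 10.0 ** (delta * f / 20.0) for f, m in zip(freqs_hz, magnitudes)]


# Usage with made-up values: steepen a flat two-bin spectrum by 6 dB per kHz
tilted = apply_slope_change([0.0, 1000.0], [1.0, 1.0], 0.0, -0.006)
```

The bin at 0 Hz is left unchanged and higher bins are attenuated progressively, so the overall brightness of the frame follows the target voice while the frame's fine structure is kept.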