JP7200483B2

JP7200483B2 - Speech processing method, speech processing device and program

Info

Publication number: JP7200483B2
Application number: JP2018043118A
Authority: JP
Inventors: 竜之介大道; 啓嘉山
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2018-03-09
Filing date: 2018-03-09
Publication date: 2023-01-10
Anticipated expiration: 2038-03-09
Also published as: JP2019159014A

Description

The present invention relates to a technique for processing a sound signal representing a voice.

Various techniques for adding vocal expressions, such as singing expressions, to a voice have been proposed. For example, Patent Literature 1 discloses a technique that shifts each harmonic component of a sound signal in the frequency domain, thereby converting the voice represented by that signal into a voice with a distinctive quality, such as a rough voice or a hoarse voice.

Patent Literature 1: Japanese Unexamined Patent Application Publication No. 2014-2338

However, the technique of Patent Literature 1 leaves room for further improvement in generating a perceptually natural voice. In view of these circumstances, an object of the present invention is to synthesize a perceptually natural voice.

To solve the above problem, a voice processing method according to a preferred aspect of the present invention extends a processing period within a first sound signal representing a singing voice according to the time length of an expression period, and transforms the first sound signal, after the processing period has been extended, according to that expression period. The expression period is the portion of a second sound signal that is to be applied to the transformation of the first sound signal, and the second sound signal represents a reference voice whose acoustic characteristics differ from those of the singing voice.

To solve the above problem, a voice processing device according to a preferred aspect of the present invention includes a synthesis processing unit that extends a processing period within a first sound signal representing a singing voice according to the time length of an expression period, and transforms the first sound signal, after the processing period has been extended, according to that expression period. The expression period is the portion of a second sound signal that is to be applied to the transformation of the first sound signal, and the second sound signal represents a reference voice whose acoustic characteristics differ from those of the singing voice.

FIG. 1 is a block diagram illustrating the configuration of a sound processing device according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating the functional configuration of the sound processing device.
FIG. 3 is an explanatory diagram of stationary periods in the first sound signal.
FIG. 4 is a flowchart illustrating a specific procedure of the signal analysis processing.
FIG. 5 shows the change over time of the fundamental frequency immediately after the singing voice starts sounding.
FIG. 6 shows the change over time of the fundamental frequency immediately before the singing voice stops sounding.
FIG. 7 is a flowchart illustrating a specific procedure of the release processing.
FIG. 8 is an explanatory diagram of the release processing.
FIG. 9 is an explanatory diagram of the spectral envelope outline.
FIG. 10 is a flowchart illustrating a specific procedure of the attack processing.
FIG. 11 is an explanatory diagram of the attack processing.

FIG. 1 is a block diagram illustrating the configuration of a sound processing device 100 according to a preferred embodiment of the present invention. The sound processing device 100 of this embodiment is a signal processing device that adds various sound expressions to the voice of a user singing a piece of music (hereinafter "singing voice"). A sound expression is an acoustic characteristic added to the singing voice (an example of the first sound). In the context of singing, a sound expression is a musical expression or nuance relating to vocalization (that is, singing). Specifically, singing expressions such as vocal fry, growl, or a hoarse voice are suitable examples of sound expressions. A sound expression may also be described as a voice quality.

A sound expression is especially noticeable in the part where the volume rises immediately after sounding starts (hereinafter "attack part") and in the part where the volume falls immediately before sounding ends (hereinafter "release part"). In view of this tendency, the present embodiment adds sound expressions to the attack part and the release part of the singing voice in particular.

As illustrated in FIG. 1, the sound processing device 100 is implemented by a computer system that includes a control device 11, a storage device 12, an operation device 13, and a sound emitting device 14. For example, a portable information terminal such as a mobile phone or smartphone, or a portable or stationary information terminal such as a personal computer, is suitable as the sound processing device 100. The operation device 13 is an input device that receives instructions from the user. For example, a set of controls operated by the user, or a touch panel that detects contact by the user, is suitable as the operation device 13.

The control device 11 is a processing circuit such as a CPU (Central Processing Unit) and executes various arithmetic and control processes. The control device 11 of this embodiment generates a third sound signal Y representing a voice obtained by adding a sound expression to the singing voice (hereinafter "modified sound"). The sound emitting device 14 is, for example, a speaker or headphones, and emits the modified sound represented by the third sound signal Y generated by the control device 11. The D/A converter that converts the third sound signal Y from digital to analog is omitted from the figure for convenience. Although FIG. 1 illustrates a configuration in which the sound processing device 100 includes the sound emitting device 14, a sound emitting device 14 separate from the sound processing device 100 may instead be connected to the sound processing device 100 by wire or wirelessly.

The storage device 12 is a memory composed of a known recording medium, such as a magnetic recording medium or a semiconductor recording medium, and stores the program executed by the control device 11 and the various data used by the control device 11. The storage device 12 may be composed of a combination of several types of recording media. Alternatively, a storage device 12 separate from the sound processing device 100 (for example, cloud storage) may be provided, with the control device 11 writing to and reading from it over a communication network. In other words, the storage device 12 may be omitted from the sound processing device 100.

The storage device 12 of this embodiment stores a first sound signal X1 and a second sound signal X2. The first sound signal X1 is an acoustic signal representing the singing voice of the user of the sound processing device 100 singing a piece of music. The second sound signal X2 is an acoustic signal representing a voice sung with sound expressions by someone other than the user, for example a professional singer (hereinafter "reference voice"). The first sound signal X1 and the second sound signal X2 differ in acoustic characteristics (for example, voice quality). The sound processing device 100 of this embodiment generates the third sound signal Y of the modified sound by adding the sound expression of the reference voice (an example of the second sound) represented by the second sound signal X2 to the singing voice represented by the first sound signal X1. The singing voice and the reference voice may be of the same piece of music or of different pieces. The description above assumes that the singing voice and the reference voice are produced by different people, but they may be produced by the same person. For example, the singing voice may be a voice sung by the user without sound expressions, and the reference voice may be a voice sung by the same user with singing expressions added.

FIG. 2 is a block diagram illustrating the functional configuration of the control device 11. As illustrated in FIG. 2, by executing the program stored in the storage device 12, the control device 11 implements a plurality of functions (a signal analysis unit 21 and a synthesis processing unit 22) for generating the third sound signal Y from the first sound signal X1 and the second sound signal X2. The functions of the control device 11 may be implemented by a plurality of separately configured devices, and some or all of the functions of the control device 11 may be implemented by dedicated electronic circuits.

The signal analysis unit 21 generates analysis data D1 by analyzing the first sound signal X1 and generates analysis data D2 by analyzing the second sound signal X2. The analysis data D1 and D2 generated by the signal analysis unit 21 are stored in the storage device 12.

The analysis data D1 represents a plurality of stationary periods Q1 in the first sound signal X1. As illustrated in FIG. 3, each stationary period Q1 indicated by the analysis data D1 is a variable-length period of the first sound signal X1 in which the fundamental frequency f1 and the spectral shape are stable over time. The analysis data D1 specifies the time of the start point (hereinafter "start time") T1_S and the time of the end point (hereinafter "end time") T1_E of each stationary period Q1. Between two consecutive notes in a piece of music, the fundamental frequency f1 or the spectral shape (that is, the phoneme) often changes. Each stationary period Q1 is therefore likely to correspond to one note in the piece.

Similarly, the analysis data D2 represents a plurality of stationary periods Q2 in the second sound signal X2. Each stationary period Q2 is a variable-length period of the second sound signal X2 in which the fundamental frequency f2 and the spectral shape are stable over time. The analysis data D2 specifies the start time T2_S and the end time T2_E of each stationary period Q2. Like the stationary periods Q1, each stationary period Q2 is likely to correspond to one note in the piece.

FIG. 4 is a flowchart of the processing by which the signal analysis unit 21 analyzes the first sound signal X1 (hereinafter "signal analysis processing") S0. The signal analysis processing S0 of FIG. 4 is started, for example, in response to an instruction from the user to the operation device 13. As illustrated in FIG. 4, the signal analysis unit 21 calculates the fundamental frequency f1 of the first sound signal X1 for each of a plurality of unit periods (frames) on the time axis (S01). Any known technique may be used to calculate the fundamental frequency f1. Each unit period is sufficiently short compared with the time length expected of a stationary period Q1.

The signal analysis unit 21 calculates, for each unit period, a mel-cepstrum M1 representing the spectral shape of the first sound signal X1 (S02). The mel-cepstrum M1 is expressed by a plurality of coefficients representing the envelope of the frequency spectrum of the first sound signal X1. The mel-cepstrum M1 can also be described as a feature representing the phoneme of the singing voice. Any known technique may be used to calculate the mel-cepstrum M1. MFCCs (Mel-Frequency Cepstrum Coefficients) may be calculated instead of the mel-cepstrum M1 as the feature representing the spectral shape of the first sound signal X1.

The signal analysis unit 21 estimates, for each unit period, the voicing of the singing voice represented by the first sound signal X1 (S03). That is, it determines whether the singing voice is voiced or unvoiced. Any known technique may be used to estimate voicing (voiced/unvoiced). The calculation of the fundamental frequency f1 (S01), the calculation of the mel-cepstrum M1 (S02), and the estimation of voicing (S03) may be performed in any order and are not limited to the order illustrated above.

The signal analysis unit 21 calculates, for each unit period, a first index δ1 indicating the degree of temporal change in the fundamental frequency f1 (S04). For example, the difference in the fundamental frequency f1 between two consecutive unit periods is calculated as the first index δ1. The more pronounced the temporal change in the fundamental frequency f1, the larger the first index δ1.

The signal analysis unit 21 calculates, for each unit period, a second index δ2 indicating the degree of temporal change in the mel-cepstrum M1 (S05). For example, a value obtained by combining (for example, adding or averaging) the per-coefficient differences of the mel-cepstrum M1 between two consecutive unit periods over the plurality of coefficients is suitable as the second index δ2. The more pronounced the temporal change in the spectral shape of the singing voice, the larger the second index δ2. For example, the second index δ2 takes a large value near the point where the phoneme of the singing voice changes.
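The two per-frame indices (S04 and S05) can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment: the function names `first_index` and `second_index` are hypothetical, absolute differences are assumed, and averaging is chosen as the way to combine the per-coefficient differences (the text permits addition or averaging).

```python
def first_index(f0):
    """delta1 per frame: absolute difference in fundamental frequency
    between each frame and the previous one (the first frame gets 0.0)."""
    return [0.0] + [abs(b - a) for a, b in zip(f0, f0[1:])]


def second_index(mcep):
    """delta2 per frame: mean absolute difference of the mel-cepstrum
    coefficients between each frame and the previous one (assumed choice)."""
    out = [0.0]
    for prev, cur in zip(mcep, mcep[1:]):
        diffs = [abs(c - p) for p, c in zip(prev, cur)]
        out.append(sum(diffs) / len(diffs))
    return out


# Made-up values: four frames, two mel-cepstrum coefficients per frame.
f0 = [220.0, 220.0, 222.0, 247.0]
mcep = [[1.0, 0.5], [1.0, 0.5], [1.0, 0.5], [0.0, 1.5]]
d1 = first_index(f0)
d2 = second_index(mcep)
```

Both indices jump at the last frame, where the pitch and the spectral shape change together, as would be expected at a note or phoneme boundary.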

The signal analysis unit 21 calculates, for each unit period, a variation index Δ based on the first index δ1 and the second index δ2 (S06). For example, the weighted sum of the first index δ1 and the second index δ2 is calculated as the variation index Δ for each unit period. The weight of each of the first index δ1 and the second index δ2 is set to a predetermined fixed value or to a variable value that follows an instruction from the user to the operation device 13. As the above description shows, the larger the temporal variation in the fundamental frequency f1 or the mel-cepstrum M1 (that is, the spectral shape) of the first sound signal X1, the larger the variation index Δ tends to be.
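As a small worked example of step S06, the weighted sum can be written as below. The function name `variation_index` and the weight values are assumptions chosen for illustration; the text only states that the weights are fixed or user-adjustable.

```python
def variation_index(delta1, delta2, w1=1.0, w2=1.0):
    """Per-frame weighted sum of the pitch-change index (delta1)
    and the spectral-change index (delta2)."""
    return [w1 * a + w2 * b for a, b in zip(delta1, delta2)]


# Hypothetical weights: pitch differences are in Hz and therefore scaled down.
delta = variation_index([0.0, 2.0, 25.0], [0.0, 0.5, 1.0], w1=0.5, w2=2.0)
```

Because δ1 and δ2 are in different units, the weights effectively set how much each kind of change contributes to the decision that a frame is unstable.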

The signal analysis unit 21 identifies a plurality of stationary periods Q1 in the first sound signal X1 (S07). The signal analysis unit 21 of this embodiment identifies the stationary periods Q1 according to the result of the voicing estimation for the singing voice (S03) and the variation index Δ. Specifically, the signal analysis unit 21 defines as a stationary period Q1 each run of consecutive unit periods in which the singing voice is estimated to be voiced and the variation index Δ is below a predetermined threshold. Unit periods in which the singing voice is estimated to be unvoiced, or in which the variation index Δ exceeds the threshold, are excluded from the stationary periods Q1. Having defined the stationary periods Q1 of the first sound signal X1 by the above procedure, the signal analysis unit 21 stores in the storage device 12 the analysis data D1 specifying the start time T1_S and the end time T1_E of each stationary period Q1 (S08).
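The segmentation in step S07 can be sketched as a single pass over the frames. This is a minimal sketch under assumptions: the function name, the fixed frame length `frame_sec`, and the treatment of a frame whose index equals the threshold (counted here as not stationary) are not specified in the text.

```python
def stationary_periods(voiced, delta, threshold, frame_sec):
    """Return (start_time, end_time) pairs for runs of consecutive frames
    that are voiced and whose variation index is below the threshold.
    `voiced` and `delta` are assumed to have the same length."""
    periods = []
    start = None  # index of the first frame of the current run, if any
    for i, (v, d) in enumerate(zip(voiced, delta)):
        stable = v and d < threshold
        if stable and start is None:
            start = i
        elif not stable and start is not None:
            periods.append((start * frame_sec, i * frame_sec))
            start = None
    if start is not None:  # a run that lasts until the final frame
        periods.append((start * frame_sec, len(voiced) * frame_sec))
    return periods


voiced = [False, True, True, True, True, True, False]
delta = [0.0, 9.0, 1.0, 1.0, 8.0, 1.0, 0.0]
periods = stationary_periods(voiced, delta, threshold=5.0, frame_sec=0.5)
```

Each returned pair plays the role of (T1_S, T1_E) for one stationary period. The voiced but unstable frames that surround the runs are the candidates for the attack and release parts.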

The signal analysis unit 21 generates the analysis data D2 by also executing the signal analysis processing S0 described above on the second sound signal X2 representing the reference voice. Specifically, for each unit period of the second sound signal X2, the signal analysis unit 21 calculates the fundamental frequency f2 (S01), calculates the mel-cepstrum M2 (S02), and estimates the voicing (voiced/unvoiced) (S03). The signal analysis unit 21 then calculates a variation index Δ based on a first index δ1 indicating the degree of temporal change in the fundamental frequency f2 and a second index δ2 indicating the degree of temporal change in the mel-cepstrum M2 (S04 to S06). The signal analysis unit 21 identifies each stationary period Q2 of the second sound signal X2 according to the result of the voicing estimation for the reference voice (S03) and the variation index Δ (S07), and stores in the storage device 12 the analysis data D2 specifying the start time T2_S and the end time T2_E of each stationary period Q2 (S08). The analysis data D1 and D2 may be edited in response to instructions from the user to the operation device 13.

The synthesis processing unit 22 of FIG. 2 transforms the analysis data D1 of the first sound signal X1 using the analysis data D2 of the second sound signal X2. The synthesis processing unit 22 of this embodiment includes an attack processing unit 31, a release processing unit 32, and a voice synthesis unit 33. The attack processing unit 31 executes attack processing S1, which adds the sound expression of the attack part of the second sound signal X2 to the first sound signal X1. The release processing unit 32 executes release processing S2, which adds the sound expression of the release part of the second sound signal X2 to the first sound signal X1. The voice synthesis unit 33 synthesizes the third sound signal Y of the modified sound from the analysis data processed by the attack processing unit 31 and the release processing unit 32.

FIG. 5 shows the change over time of the fundamental frequency f1 immediately after the singing voice starts sounding. As illustrated in FIG. 5, a voiced period Va exists immediately before a stationary period Q1. The voiced period Va is a period of voiced sound that precedes the stationary period Q1, in which the acoustic characteristics of the singing voice (for example, the fundamental frequency f1 or the spectral shape) fluctuate unstably just before the stationary period Q1. For example, for the stationary period Q1 immediately after the singing voice starts sounding, the attack part from the time τ1_A at which sounding starts to the start time T1_S of that stationary period Q1 corresponds to the voiced period Va. The description above concerns the singing voice, but in the reference voice a voiced period Va likewise exists immediately before a stationary period Q2. In the attack processing S1, the synthesis processing unit 22 (specifically, the attack processing unit 31) adds the sound expression of the attack part of the second sound signal X2 to the voiced period Va and the immediately following stationary period Q1 of the first sound signal X1.

FIG. 6 shows the change over time of the fundamental frequency f1 immediately before the singing voice stops sounding. As illustrated in FIG. 6, a voiced period Vr exists immediately after a stationary period Q1. The voiced period Vr is a period of voiced sound that follows the stationary period Q1, in which the acoustic characteristics of the singing voice (for example, the fundamental frequency f1 or the spectral shape) fluctuate unstably just after the stationary period Q1. For example, for the stationary period Q1 immediately before the singing voice stops sounding, the release part from the end time T1_E of that stationary period Q1 to the time τ1_R at which the singing voice falls silent corresponds to the voiced period Vr. The description above concerns the singing voice, but in the reference voice a voiced period Vr likewise exists immediately after a stationary period Q2. In the release processing S2, the synthesis processing unit 22 (specifically, the release processing unit 32) adds the sound expression of the release part of the second sound signal X2 to the voiced period Vr and the immediately preceding stationary period Q1 of the first sound signal X1.

<Release processing S2>
FIG. 7 is a flowchart illustrating the specific content of the release processing S2 executed by the release processing unit 32. The release processing S2 of FIG. 7 is executed for each stationary period Q1 of the first sound signal X1.

When the release processing S2 starts, the release processing unit 32 determines whether to add the sound expression of the release part of the second sound signal X2 to the stationary period Q1 being processed in the first sound signal X1 (S21). Specifically, the release processing unit 32 determines that the sound expression of the release part is not to be added to a stationary period Q1 that meets any of conditions Cr1 to Cr3 below. The conditions for deciding whether to add a sound expression to a stationary period Q1 of the first sound signal X1 are not, however, limited to these examples.
[Condition Cr1] The time length of the stationary period Q1 is below a predetermined value.
[Condition Cr2] The time length of the unvoiced period immediately after the stationary period Q1 is below a predetermined value.
[Condition Cr3] The time length of the voiced period Vr following the stationary period Q1 exceeds a predetermined value.

It is difficult to add a sound expression with a natural voice quality to a stationary period Q1 that is very short. Therefore, when the time length of a stationary period Q1 is below the predetermined value (condition Cr1), the release processing unit 32 excludes that stationary period Q1 from the targets of sound expression. When a very short unvoiced period exists immediately after a stationary period Q1, that unvoiced period may be an unvoiced consonant in the middle of the singing voice, and adding a sound expression around an unvoiced consonant tends to sound unnatural. In view of this tendency, when the time length of the unvoiced period immediately after a stationary period Q1 is below the predetermined value (condition Cr2), the release processing unit 32 excludes that stationary period Q1 from the targets of sound expression. Furthermore, when the voiced period Vr immediately after a stationary period Q1 is sufficiently long, the singing voice very likely already carries sufficient sound expression. Therefore, when the voiced period Vr following a stationary period Q1 is sufficiently long (condition Cr3), the release processing unit 32 excludes that stationary period Q1 from the targets of sound expression. When it determines that no sound expression is to be added to the stationary period Q1 of the first sound signal X1 (S21: NO), the release processing unit 32 ends the release processing S2 without executing the processing detailed below (S22 to S26).
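The decision in step S21 can be illustrated by a small predicate. This is a sketch under assumptions: the function name and the three threshold values are hypothetical, because the text refers only to "predetermined values" without giving them.

```python
def should_add_release(q1_len, unvoiced_len_after, vr_len,
                       min_q1=0.2, min_unvoiced=0.1, max_vr=0.3):
    """Return False when any of conditions Cr1 to Cr3 holds.
    All durations are in seconds; the default thresholds are made up."""
    cr1 = q1_len < min_q1                    # stationary period too short
    cr2 = unvoiced_len_after < min_unvoiced  # likely a mid-phrase unvoiced consonant
    cr3 = vr_len > max_vr                    # expression probably already present
    return not (cr1 or cr2 or cr3)


ok = should_add_release(q1_len=0.5, unvoiced_len_after=0.4, vr_len=0.1)
too_short = should_add_release(q1_len=0.1, unvoiced_len_after=0.4, vr_len=0.1)
```

When the predicate returns False for a stationary period, the remaining steps (S22 to S26) are skipped for that period.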

When it determines that the sound expression of the release part of the second sound signal X2 is to be added to the stationary period Q1 of the first sound signal X1 (S21: YES), the release processing unit 32 selects, from the plurality of stationary periods Q2 of the second sound signal X2, the stationary period Q2 corresponding to the sound expression to be added to the stationary period Q1 (S22). Specifically, the release processing unit 32 selects a stationary period Q2 whose context within the piece of music is similar to that of the stationary period Q1 being processed. Examples of the context considered for one stationary period (hereinafter "stationary period of interest") are the time length of the stationary period of interest, the time length of the stationary period immediately after it, the pitch difference between the stationary period of interest and the stationary period immediately after it, the pitch of the stationary period of interest, and the time length of the silent period immediately before it. The release processing unit 32 selects the stationary period Q2 whose context, as exemplified above, differs least from that of the stationary period Q1.
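The selection in step S22 can be sketched as a nearest-neighbour search over context features. The text does not define how the "difference" between contexts is measured, so the weighted sum of absolute differences used here, the `Context` class, its field names, and the units are all assumptions made for illustration.

```python
from dataclasses import dataclass, astuple


@dataclass
class Context:
    length: float          # duration of the stationary period of interest (s)
    next_length: float     # duration of the following stationary period (s)
    pitch_diff: float      # pitch difference to the following period (semitones)
    pitch: float           # pitch of the period of interest (semitone number)
    silence_before: float  # duration of the preceding silent period (s)


def context_distance(a, b, weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of absolute feature differences (an assumed metric)."""
    return sum(w * abs(x - y)
               for w, x, y in zip(weights, astuple(a), astuple(b)))


def select_q2(q1, candidates):
    """Index of the reference period whose context is closest to q1."""
    return min(range(len(candidates)),
               key=lambda i: context_distance(q1, candidates[i]))


q1 = Context(1.0, 0.5, 2.0, 60.0, 0.0)
candidates = [Context(0.3, 0.5, 7.0, 72.0, 0.0),
              Context(0.9, 0.6, 2.0, 62.0, 0.1)]
best = select_q2(q1, candidates)
```

In practice the weights would need tuning, since durations in seconds and pitches in semitones are not directly comparable.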

The release processing unit 32 executes processing (S23-S26) for adding the sound expression corresponding to the steady period Q2 selected by the above procedure to the first sound signal X1 (analysis data D1). FIG. 8 is an explanatory diagram of the processing by which the release processing unit 32 adds the sound expression of the release portion to the first sound signal X1.

FIG. 8 shows, for each of the first sound signal X1, the second sound signal X2, and the third sound signal Y after transformation, the waveform on the time axis together with the temporal change of the fundamental frequency. In FIG. 8, the following are known information: the start time T1_S and end time T1_E of the steady period Q1 of the singing voice; the end time τ1_R of the voiced period Vr immediately after that steady period Q1; the start time τ1_A of the voiced period Va corresponding to the note immediately after that steady period Q1; the start time T2_S and end time T2_E of the steady period Q2 of the reference voice; and the end time τ2_R of the voiced period Vr immediately after that steady period Q2.

The release processing unit 32 adjusts the positional relationship on the time axis between the steady period Q1 being processed and the steady period Q2 selected in step S22 (S23). Specifically, the release processing unit 32 adjusts the position of the steady period Q2 on the time axis to a position referenced to an end point (T1_S or T1_E) of the steady period Q1. As illustrated in FIG. 8, the release processing unit 32 of the present embodiment places the second sound signal X2 (steady period Q2) on the time axis of the first sound signal X1 so that the end time T2_E of the steady period Q2 coincides with the end time T1_E of the steady period Q1.

＜Extension of processing period Z1_R (S24)＞
The release processing unit 32 expands or contracts, on the time axis, the period Z1_R of the first sound signal X1 to which the sound expression of the second sound signal X2 is added (hereinafter the "processing period") (S24). As illustrated in FIG. 8, the processing period Z1_R runs from the time Tm_R at which addition of the sound expression starts (hereinafter the "synthesis start time") to the end time τ1_R of the voiced period Vr immediately after the steady period Q1. The synthesis start time Tm_R is the later of the start time T1_S of the steady period Q1 of the singing voice and the start time T2_S of the steady period Q2 of the reference voice. As in the example of FIG. 8, when the start time T2_S of the steady period Q2 lies after the start time T1_S of the steady period Q1, the start time T2_S is set as the synthesis start time Tm_R. However, the synthesis start time Tm_R is not limited to the start time T2_S.

As illustrated in FIG. 8, the release processing unit 32 of the present embodiment extends the processing period Z1_R of the first sound signal X1 according to the time length of the expression period Z2_R of the second sound signal X2. The expression period Z2_R is the period of the second sound signal X2 that represents the sound expression of the release portion, and it is used to add that sound expression to the first sound signal X1. As illustrated in FIG. 8, the expression period Z2_R runs from the synthesis start time Tm_R to the end time τ2_R of the voiced period Vr immediately after the steady period Q2.
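The alignment of step S23 and the two periods Z1_R and Z2_R reduce to a few comparisons on the known time stamps. The sketch below assumes that the Q2 time stamps are given on the second signal's own time axis and must be shifted onto the first signal's axis; the function and variable names are illustrative.

```python
from typing import Tuple

Period = Tuple[float, float]  # (start time, end time) in seconds


def release_periods(
    t1_s: float, t1_e: float, tau1_r: float,
    t2_s: float, t2_e: float, tau2_r: float,
) -> Tuple[float, Period, Period]:
    """Align Q2 to Q1 at their end times and derive Z1_R and Z2_R.

    t2_s, t2_e and tau2_r are on the time axis of X2 before alignment (assumption).
    """
    offset = t1_e - t2_e             # S23: make T2_E coincide with T1_E
    t2_s_aligned = t2_s + offset
    tau2_r_aligned = tau2_r + offset
    tm_r = max(t1_s, t2_s_aligned)   # synthesis start time: the later start time
    z1_r = (tm_r, tau1_r)            # processing period of the singing voice
    z2_r = (tm_r, tau2_r_aligned)    # expression period of the reference voice
    return offset, z1_r, z2_r
```

When the reference singer sustains the release longer than the user, the second element of `z2_r` is later than that of `z1_r`, which is the case in which Z1_R is extended.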

A reference voice sung by a skilled performer such as a professional singer carries sufficient sound expression over a suitable length of time, whereas a singing voice sung by a user unaccustomed to singing tends to have sound expression that is too short. Under this tendency, as illustrated in FIG. 8, the expression period Z2_R of the reference voice is longer than the processing period Z1_R of the singing voice. Therefore, the release processing unit 32 of the present embodiment extends the processing period Z1_R of the first sound signal X1 to the time length of the expression period Z2_R of the second sound signal X2.

The extension of the processing period Z1_R is realized by a process (mapping) that associates each time t1 of the first sound signal X1 (singing voice) with a time t of the transformed third sound signal Y (modified sound). FIG. 8 shows the correspondence between the time t1 of the singing voice (vertical axis) and the time t of the modified sound (horizontal axis).

The time t1 in the correspondence of FIG. 8 is the time of the first sound signal X1 that corresponds to the time t of the modified sound. The reference line L drawn as a chain line in FIG. 8 represents the state in which the first sound signal X1 is neither expanded nor contracted (t1 = t). A section in which the gradient of the singing-voice time t1 with respect to the modified-sound time t is smaller than that of the reference line L is a section in which the first sound signal X1 is extended. A section in which the gradient of the time t1 with respect to the time t is larger than that of the reference line L is a section in which the singing voice is contracted.

The correspondence between the time t1 and the time t is expressed by the nonlinear functions of formulas (1a) to (1c).
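Formulas (1a) to (1c) themselves are not reproduced in this text. As an assumption drawn only from the surrounding description (no change before the time T_R at which extension starts, stretching governed by a nonlinear function η between T_R and the end of the expression period, and shortening of the silent gap up to τ1_A), one piecewise mapping that is consistent with it would be:

```latex
t_1 =
\begin{cases}
t, & t < T_R \quad \text{(1a)} \\[4pt]
T_R + \eta\!\left(\dfrac{t - T_R}{\tau_{2\_R} - T_R}\right)\,(\tau_{1\_R} - T_R), & T_R \le t < \tau_{2\_R} \quad \text{(1b)} \\[10pt]
\tau_{1\_R} + \dfrac{t - \tau_{2\_R}}{\tau_{1\_A} - \tau_{2\_R}}\,(\tau_{1\_A} - \tau_{1\_R}), & \tau_{2\_R} \le t < \tau_{1\_A} \quad \text{(1c)}
\end{cases}
```

With η(u) = u², the gradient of t1 with respect to t is zero at t = T_R and grows toward τ2_R, so this form stretches most strongly near T_R and progressively less toward the end of the processing period. The exact expressions in the granted patent may differ.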

As illustrated in FIG. 8, the time T_R is a predetermined time located between the synthesis start time Tm_R and the end time τ1_R of the processing period Z1_R. For example, the later of the midpoint ((T1_S + T1_E)/2) between the start time T1_S and the end time T1_E of the steady period Q1 and the synthesis start time Tm_R is set as the time T_R. As understood from formula (1a), the part of the processing period Z1_R before the time T_R is neither expanded nor contracted. That is, extension of the processing period Z1_R starts at the time T_R.

As understood from formula (1b), the part of the processing period Z1_R after the time T_R is extended on the time axis so that the degree of extension is large near the time T_R and becomes smaller toward the end time τ1_R. The function η(t) in formula (1b) is a nonlinear function that extends the processing period Z1_R more strongly toward the front of the time axis and reduces the degree of extension toward the rear. Specifically, a quadratic function of the time t (η(t) = t²), for example, is suitable as the function η(t). As described above, in the present embodiment the processing period Z1_R is extended on the time axis so that the degree of extension becomes smaller at positions closer to the end time τ1_R of the processing period Z1_R. The acoustic characteristics of the singing voice near the end time τ1_R can therefore be sufficiently preserved in the modified sound. At positions close to the time T_R, an audible unnaturalness caused by the extension tends to be harder to perceive than near the end time τ1_R. Therefore, even if the degree of extension is increased near the time T_R as in the above example, the audible naturalness of the modified sound is hardly reduced. In the first sound signal X1, the period from the end time τ2_R of the expression period Z2_R to the start time τ1_A of the next voiced period Va is shortened on the time axis, as understood from formula (1c). Since no voice exists in the period from the end time τ2_R to the start time τ1_A, this shortening may be achieved by deleting part of the first sound signal X1.
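The time mapping described for the release portion can be sketched as follows. Because formulas (1a) to (1c) are not reproduced in this text, the piecewise form used here is an assumption chosen to match the prose: identity before T_R, a quadratic η between T_R and τ2_R, and a linear compression of the silent gap up to τ1_A. The function and argument names are illustrative.

```python
def eta(u: float) -> float:
    """Nonlinear function of formula (1b); the text suggests a quadratic."""
    return u * u


def map_release_time(t: float, t_r: float, tau1_r: float,
                     tau2_r: float, tau1_a: float) -> float:
    """Map a time t of the modified sound Y to a time t1 of the singing voice X1.

    Assumes t_r < tau1_r < tau2_r < tau1_a, all on the time axis of X1.
    """
    if t < t_r:
        return t                               # (1a): no expansion or contraction
    if t < tau2_r:
        u = (t - t_r) / (tau2_r - t_r)         # 0 at T_R, 1 at tau2_R
        return t_r + eta(u) * (tau1_r - t_r)   # (1b): strongest stretch near T_R
    if t < tau1_a:
        v = (t - tau2_r) / (tau1_a - tau2_r)
        return tau1_r + v * (tau1_a - tau1_r)  # (1c): silent gap is shortened
    return t                                   # the next voiced period is untouched
```

The mapping is continuous at each boundary: Y at time τ2_R reads X1 at time τ1_R, and the two time axes meet again at τ1_A, so the following note keeps its original onset.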

As in the above example, the processing period Z1_R of the singing voice is extended to the time length of the expression period Z2_R of the reference voice. The expression period Z2_R of the reference voice, on the other hand, is neither expanded nor contracted on the time axis. That is, the time t2 of the second sound signal X2 (after placement) that corresponds to the time t of the modified sound coincides with that time t (t2 = t). As described above, in the present embodiment the processing period Z1_R of the singing voice is extended according to the time length of the expression period Z2_R, so the second sound signal X2 does not need to be extended. The sound expression of the release portion represented by the second sound signal X2 can therefore be added to the first sound signal X1 accurately.

After extending the processing period Z1_R by the procedure described above, the release processing unit 32 transforms the extended processing period Z1_R of the first sound signal X1 according to the expression period Z2_R of the second sound signal X2 (S25-S26). Specifically, synthesis of the fundamental frequency (S25) and synthesis of the spectral envelope outline (S26) are executed between the extended processing period Z1_R of the singing voice and the expression period Z2_R of the reference voice.

＜Synthesis of fundamental frequency (S25)＞
The release processing unit 32 calculates the fundamental frequency F(t) of the third sound signal Y at each time t by the computation of formula (2).
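Formula (2) itself is not reproduced in this text. Read against the term-by-term description of its second and third terms, it presumably has the following form (a reconstruction, not a quotation from the patent):

```latex
F(t) = f_1(t_1) \;-\; \lambda_1\bigl(f_1(t_1) - F_1(t_1)\bigr) \;+\; \lambda_2\bigl(f_2(t_2) - F_2(t_2)\bigr) \qquad (2)
```

Here t1 is the singing-voice time mapped from t, and t2 = t because the reference voice is not stretched.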

The smoothed fundamental frequency F1(t1) in formula (2) is the frequency obtained by smoothing the time series of the fundamental frequency f1(t1) of the first sound signal X1 on the time axis. Similarly, the smoothed fundamental frequency F2(t2) in formula (2) is the frequency obtained by smoothing the time series of the fundamental frequency f2(t2) of the second sound signal X2 on the time axis. The coefficients λ1 and λ2 in formula (2) are set to non-negative values of 1 or less (0 ≤ λ1 ≤ 1, 0 ≤ λ2 ≤ 1).

As understood from formula (2), the second term of formula (2) subtracts the difference between the fundamental frequency f1(t1) of the singing voice and the smoothed fundamental frequency F1(t1) from the fundamental frequency f1(t1) of the first sound signal X1, to a degree set by the coefficient λ1. The third term of formula (2) adds the difference between the fundamental frequency f2(t2) of the reference voice and the smoothed fundamental frequency F2(t2) to the fundamental frequency f1(t1) of the first sound signal X1, to a degree set by the coefficient λ2. As understood from this description, the release processing unit 32 functions as an element that replaces the difference between the fundamental frequency f1(t1) of the singing voice and the smoothed fundamental frequency F1(t1) with the difference between the fundamental frequency f2(t2) of the reference voice and the smoothed fundamental frequency F2(t2). That is, the temporal change of the fundamental frequency f1(t1) within the extended processing period Z1_R of the first sound signal X1 approaches the temporal change of the fundamental frequency f2(t2) within the expression period Z2_R of the second sound signal X2.
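The fundamental-frequency synthesis can be sketched per frame as follows. Two assumptions are made that the text does not state: both input series are already sampled on the frame grid of the output signal Y (that is, f1 has been time-mapped), and the smoothing is a simple moving average, since the smoothing method is not specified. All names are illustrative.

```python
from typing import List, Sequence


def moving_average(x: Sequence[float], half_width: int) -> List[float]:
    """Smooth a series with a centred window, truncated at the edges (assumed smoother)."""
    out = []
    for i in range(len(x)):
        lo = max(0, i - half_width)
        hi = min(len(x), i + half_width + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def synthesize_f0(f1: Sequence[float], f2: Sequence[float],
                  lam1: float, lam2: float, half_width: int = 2) -> List[float]:
    """F = f1 - lam1 * (f1 - F1) + lam2 * (f2 - F2), frame by frame."""
    big_f1 = moving_average(f1, half_width)  # smoothed F1
    big_f2 = moving_average(f2, half_width)  # smoothed F2
    return [a - lam1 * (a - sa) + lam2 * (b - sb)
            for a, sa, b, sb in zip(f1, big_f1, f2, big_f2)]
```

With λ1 = λ2 = 0 the singing voice's own contour is returned unchanged; with λ1 = λ2 = 1 its fine fluctuation is fully replaced by that of the reference voice.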

＜Synthesis of spectral envelope outline (S26)＞
The release processing unit 32 synthesizes the spectral envelope outlines between the extended processing period Z1_R of the singing voice and the expression period Z2_R of the reference voice. As illustrated in FIG. 9, the spectral envelope outline G1 of the first sound signal X1 is the intensity distribution obtained by further smoothing, in the frequency domain, the spectral envelope g2, which is itself the outline of the frequency spectrum g1 of the first sound signal X1. Specifically, the spectral envelope outline G1 is the intensity distribution obtained by smoothing the spectral envelope g2 to the point where phonemic character (differences that depend on the phoneme) and individuality (differences that depend on the speaker) can no longer be perceived. For example, the spectral envelope outline G1 is expressed by a predetermined number of low-order coefficients among the plurality of mel-cepstrum coefficients representing the spectral envelope g2. The above description focuses on the spectral envelope outline G1 of the first sound signal X1, but the same applies to the spectral envelope outline G2 of the second sound signal X2.

The release processing unit 32 calculates the spectral envelope outline G(t) of the third sound signal Y at each time t (hereinafter the "synthesized spectral envelope outline") by the computation of formula (3).
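Formula (3) itself is not reproduced in this text. By analogy with formula (2), and judging from the description of its second and third terms, it presumably has the following form (a reconstruction, not a quotation from the patent):

```latex
G(t) = G_1(t_1) \;-\; \mu_1\bigl(G_1(t_1) - G_{1\_ref}\bigr) \;+\; \mu_2\bigl(G_2(t_2) - G_{2\_ref}\bigr) \qquad (3)
```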

The symbol G1_ref in formula (3) is a reference spectral envelope outline. Of the plurality of spectral envelope outlines G1 of the first sound signal X1, the one at a specific point in time is used as the reference spectral envelope outline G1_ref (an example of the first reference spectral envelope outline). Specifically, the reference spectral envelope outline G1_ref is the spectral envelope outline G1(Tm_R) of the first sound signal X1 at the synthesis start time Tm_R (an example of the first time point). That is, the time at which the reference spectral envelope outline G1_ref is extracted is the later of the start time T1_S of the steady period Q1 and the start time T2_S of the steady period Q2. The time at which the reference spectral envelope outline G1_ref is extracted is not limited to the synthesis start time Tm_R. For example, the spectral envelope outline G1 at any time within the steady period Q1 may be used as the reference spectral envelope outline G1_ref.

Similarly, the reference spectral envelope outline G2_ref in formula (3) is, of the plurality of spectral envelope outlines G2 of the second sound signal X2, the one at a specific point in time. Specifically, the reference spectral envelope outline G2_ref is the spectral envelope outline G2(Tm_R) of the second sound signal X2 at the synthesis start time Tm_R (an example of the second time point). That is, the time at which the reference spectral envelope outline G2_ref is extracted is the later of the start time T1_S of the steady period Q1 and the start time T2_S of the steady period Q2. The time at which the reference spectral envelope outline G2_ref is extracted is not limited to the synthesis start time Tm_R. For example, the spectral envelope outline G2 at any time within the steady period Q1 may be used as the reference spectral envelope outline G2_ref.

The coefficients μ1 and μ2 in formula (3) are set to non-negative values of 1 or less (0 ≤ μ1 ≤ 1, 0 ≤ μ2 ≤ 1). The second term of formula (3) subtracts the difference between the spectral envelope outline G1(t1) of the singing voice and the reference spectral envelope outline G1_ref from the spectral envelope outline G1(t1) of the first sound signal X1, to a degree set by the coefficient μ1 (an example of the first coefficient). The third term of formula (3) adds the difference between the spectral envelope outline G2(t2) of the reference voice and the reference spectral envelope outline G2_ref to the spectral envelope outline G1(t1) of the first sound signal X1, to a degree set by the coefficient μ2 (an example of the second coefficient). As understood from this description, the release processing unit 32 functions as an element that replaces the difference between the spectral envelope outline G1(t1) of the singing voice and the reference spectral envelope outline G1_ref (an example of the first difference) with the difference between the spectral envelope outline G2(t2) of the reference voice and the reference spectral envelope outline G2_ref (an example of the second difference).
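The envelope-outline synthesis follows the same subtract-and-add pattern as the fundamental frequency, applied to a coefficient vector instead of a scalar. The sketch below assumes each outline is a short list of low-order mel-cepstrum coefficients for one frame and that the form G = G1 - μ1(G1 - G1_ref) + μ2(G2 - G2_ref) holds, as the description of the terms implies; the names are illustrative.

```python
from typing import List, Sequence


def synthesize_envelope(g1: Sequence[float], g2: Sequence[float],
                        g1_ref: Sequence[float], g2_ref: Sequence[float],
                        mu1: float, mu2: float) -> List[float]:
    """Combine envelope outlines for one frame, coefficient by coefficient.

    g1, g2         : outlines G1(t1) and G2(t2) for the current frame
    g1_ref, g2_ref : reference outlines taken at the synthesis start time
    """
    return [a - mu1 * (a - a_ref) + mu2 * (b - b_ref)
            for a, b, a_ref, b_ref in zip(g1, g2, g1_ref, g2_ref)]
```

With μ1 = μ2 = 1 the result equals G1_ref plus the reference voice's deviation from G2_ref, so the singer's baseline timbre is kept while the movement of the envelope comes from the reference voice.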

＜Attack processing S1＞
FIG. 10 is a flowchart illustrating the specific content of the attack processing S1 executed by the attack processing unit 31. The attack processing S1 of FIG. 10 is executed for each steady period Q1 of the first sound signal X1. The specific procedure of the attack processing S1 is the same as that of the release processing S2.

When the attack processing S1 starts, the attack processing unit 31 determines whether to add the sound expression of the attack portion of the second sound signal X2 to the steady period Q1 being processed in the first sound signal X1 (S11). Specifically, the attack processing unit 31 determines that the sound expression of the attack portion is not added to a steady period Q1 that meets any of conditions Ca1 to Ca5 below. However, the conditions for determining whether to add a sound expression to the steady period Q1 of the first sound signal X1 are not limited to these examples.
[Condition Ca1] The time length of the steady period Q1 is below a predetermined value.
[Condition Ca2] The fluctuation range of the smoothed fundamental frequency f1 within the steady period Q1 exceeds a predetermined value.
[Condition Ca3] The fluctuation range of the smoothed fundamental frequency f1 within a period of predetermined length that includes the start point of the steady period Q1 exceeds a predetermined value.
[Condition Ca4] The time length of the voiced period Va immediately before the steady period Q1 exceeds a predetermined value.
[Condition Ca5] The fluctuation range of the fundamental frequency f1 in the voiced period Va immediately before the steady period Q1 exceeds a predetermined value.

Condition Ca1, like condition Cr1 described above, reflects the fact that it is difficult to add a sound expression with natural voice quality to a steady period Q1 that is very short. Also, when the fundamental frequency f1 fluctuates greatly within the steady period Q1, it is highly likely that sufficient sound expression is already present in the singing voice. Therefore, a steady period Q1 in which the fluctuation range of the smoothed fundamental frequency f1 exceeds a predetermined value is excluded from the targets to which a sound expression is added (condition Ca2). Condition Ca3 has the same content as condition Ca2 but focuses on the part of the steady period Q1 that is closest to the attack portion. Furthermore, when the voiced period Va immediately before the steady period Q1 is sufficiently long, or when the fundamental frequency f1 fluctuates greatly within the voiced period Va, it is highly likely that sufficient sound expression is already present in the singing voice. Therefore, a steady period Q1 whose immediately preceding voiced period Va is longer than a predetermined value (condition Ca4) and a steady period Q1 for which the fluctuation range of the fundamental frequency f1 within the voiced period Va exceeds a predetermined value (condition Ca5) are excluded from the targets. When it determines that no sound expression is to be added to the steady period Q1 (S11: NO), the attack processing unit 31 ends the attack processing S1 without executing the processing described in detail below (S12-S16).
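Conditions Ca1 to Ca5 can be sketched as a predicate in the same way as the release-side check. The field names, the units, and every threshold value below are hypothetical; the text only refers to "predetermined values".

```python
from dataclasses import dataclass


@dataclass
class AttackContext:
    steady_len: float         # length of steady period Q1 [s]
    f0_range_steady: float    # range of smoothed f1 within Q1 [cents]
    f0_range_head: float      # range of smoothed f1 in the opening span of Q1 [cents]
    attack_voiced_len: float  # length of voiced period Va just before Q1 [s]
    f0_range_attack: float    # range of f1 within Va [cents]


# Hypothetical thresholds (the text does not give concrete values).
MIN_STEADY_LEN = 0.2
MAX_F0_RANGE_STEADY = 100.0
MAX_F0_RANGE_HEAD = 50.0
MAX_ATTACK_VOICED_LEN = 0.3
MAX_F0_RANGE_ATTACK = 200.0


def should_add_attack(ctx: AttackContext) -> bool:
    """Return True when step S11 would decide to add the attack expression."""
    excluded = (
        ctx.steady_len < MIN_STEADY_LEN,                # Ca1
        ctx.f0_range_steady > MAX_F0_RANGE_STEADY,      # Ca2
        ctx.f0_range_head > MAX_F0_RANGE_HEAD,          # Ca3
        ctx.attack_voiced_len > MAX_ATTACK_VOICED_LEN,  # Ca4
        ctx.f0_range_attack > MAX_F0_RANGE_ATTACK,      # Ca5
    )
    return not any(excluded)
```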

When it determines that the sound expression of the attack portion of the second sound signal X2 is to be added to the steady period Q1 of the first sound signal X1 (S11: YES), the attack processing unit 31 selects, from among the plurality of steady periods Q2 of the second sound signal X2, the steady period Q2 corresponding to the sound expression to be added to the steady period Q1 (S12). The attack processing unit 31 selects the steady period Q2 in the same way as the release processing unit 32.

The attack processing unit 31 executes processing (S13-S16) for adding the sound expression corresponding to the steady period Q2 selected by the above procedure to the first sound signal X1. FIG. 11 is an explanatory diagram of the processing by which the attack processing unit 31 adds the sound expression of the attack portion to the first sound signal X1.

The attack processing unit 31 adjusts the positional relationship on the time axis between the steady period Q1 being processed and the steady period Q2 selected in step S12 (S13). Specifically, as illustrated in FIG. 11, the attack processing unit 31 places the second sound signal X2 (steady period Q2) on the time axis of the first sound signal X1 so that the start time T2_S of the steady period Q2 coincides with the start time T1_S of the steady period Q1.

＜Extension of processing period Z1_A＞
The attack processing unit 31 extends, on the time axis, the processing period Z1_A of the first sound signal X1 to which the sound expression of the second sound signal X2 is added (S14). The processing period Z1_A runs from the start time τ1_A of the voiced period Va immediately before the steady period Q1 to the time Tm_A at which addition of the sound expression ends (hereinafter the "synthesis end time"). The synthesis end time Tm_A is, for example, the start time T1_S of the steady period Q1 (which coincides with the start time T2_S of the steady period Q2). That is, in the attack processing S1, the voiced period Va preceding the steady period Q1 is extended as the processing period Z1_A. As described above, the steady period Q1 is a period corresponding to a note of the music. With a configuration that extends the voiced period Va and does not extend the steady period Q1, changes in the start time T1_S of the steady period Q1 are suppressed. That is, the possibility that the beginning of a note in the singing voice shifts forward or backward can be reduced.

As illustrated in FIG. 11, the attack processing unit 31 of the present embodiment extends the processing period Z1_A of the first sound signal X1 according to the time length of the expression period Z2_A of the second sound signal X2. The expression period Z2_A is the period of the second sound signal X2 that represents the sound expression of the attack portion, and it is used to add that sound expression to the first sound signal X1. As illustrated in FIG. 11, the expression period Z2_A is the voiced period Va immediately before the steady period Q2.

Specifically, the attack processing unit 31 extends the processing period Z1_A of the first sound signal X1 to the time length of the expression period Z2_A of the second sound signal X2. FIG. 11 shows the correspondence between the time t1 of the singing voice (vertical axis) and the time t of the modified sound (horizontal axis).

As illustrated in FIG. 11, in the present embodiment the processing period Z1_A is extended on the time axis so that the degree of extension becomes smaller at positions closer to the start time τ1_A of the processing period Z1_A. The acoustic characteristics of the singing voice near the start time τ1_A can therefore be sufficiently preserved in the modified sound. The expression period Z2_A of the reference voice, on the other hand, is neither expanded nor contracted on the time axis. The sound expression of the attack portion represented by the second sound signal X2 can therefore be added to the first sound signal X1 accurately.
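The text gives no formula for the attack-side time mapping. As an illustration only, the sketch below mirrors the release-side idea with a quadratic η: the stretch is strongest just before the synthesis end time Tm_A and weakest near the start of the period. The symbol `tau2_a` (start of the expression period Z2_A after alignment) and the function names are hypothetical and do not appear in the source.

```python
def eta(u: float) -> float:
    """Quadratic shaping function, assumed by analogy with the release portion."""
    return u * u


def map_attack_time(t: float, tau1_a: float, tau2_a: float, tm_a: float) -> float:
    """Map a time t of the modified sound Y to a time t1 of the singing voice X1.

    Assumes tau2_a < tau1_a < tm_a on the time axis of X1, i.e. the reference
    attack (Z2_A) is longer than the singer's attack (Z1_A).
    """
    if t >= tm_a:
        return t  # the steady period Q1 itself is not stretched
    if t < tau2_a:
        raise ValueError("t lies before the expression period Z2_A")
    u = (tm_a - t) / (tm_a - tau2_a)          # 1 at the start of Z2_A, 0 at Tm_A
    return tm_a - eta(u) * (tm_a - tau1_a)    # reaches tau1_a at the start of Z2_A
```

Because the mapping is the identity from Tm_A onward, the start time T1_S of the note stays where the singer placed it, which matches the stated aim of not moving the beginning of the note.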

After extending the processing period Z1_A by the procedure described above, the attack processing unit 31 transforms the extended processing period Z1_A of the first sound signal X1 according to the expression period Z2_A of the second sound signal X2 (S15-S16). Specifically, synthesis of the fundamental frequency (S15) and synthesis of the spectral envelope outline (S16) are executed between the extended processing period Z1_A of the singing voice and the expression period Z2_A of the reference voice.

Specifically, the attack processing unit 31 calculates the fundamental frequency F(t) of the third sound signal Y from the fundamental frequency f1(t1) of the first sound signal X1 and the fundamental frequency f2(t2) of the second sound signal X2 by the same computation as formula (2) above. That is, the attack processing unit 31 calculates the fundamental frequency F(t) of the third sound signal Y by subtracting the difference between the fundamental frequency f1(t1) and the smoothed fundamental frequency F1(t1) from the fundamental frequency f1(t1) of the first sound signal X1 to a degree set by the coefficient λ1, and adding the difference between the fundamental frequency f2(t2) and the smoothed fundamental frequency F2(t2) to the fundamental frequency f1(t1) of the first sound signal X1 to a degree set by the coefficient λ2. The temporal change of the fundamental frequency f1(t1) within the extended processing period Z1_A of the first sound signal X1 therefore approaches the temporal change of the fundamental frequency f2(t2) within the expression period Z2_A of the second sound signal X2.

The attack processing unit 31 also synthesizes spectral envelope outlines between the extended processing period Z1_A of the singing voice and the expression period Z2_A of the reference voice. Specifically, the attack processing unit 31 calculates the synthesized spectral envelope outline G(t) of the third sound signal Y from the spectral envelope outline G1(t1) of the first sound signal X1 and the spectral envelope outline G2(t2) of the second sound signal X2 by a calculation similar to formula (3) above. The reference spectral envelope outline G1_ref applied to formula (3) in the attack processing S1 is the spectral envelope outline G1(Tm_A) of the first sound signal X1 at the synthesis end time Tm_A (an example of the first time point). That is, the time point at which the reference spectral envelope outline G1_ref is extracted is located at the start time T1_S of the steady period Q1.

Similarly, the reference spectral envelope outline G2_ref applied to formula (3) in the attack processing S1 is the spectral envelope outline G2(Tm_A) of the second sound signal X2 at the synthesis end time Tm_A (an example of the second time point). That is, the time point at which the reference spectral envelope outline G2_ref is extracted is located at the start time T1_S of the steady period Q1.
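The role of the two reference outlines can be illustrated with a short sketch. It assumes that each spectral envelope outline is stored as a list of values per frequency bin (for example, log amplitudes) and that the two differences are weighted by scalar coefficients; the names `mu1`, `mu2` and `synthesize_envelope` are hypothetical, since formula (3) itself is not reproduced in this passage.

```python
from typing import List, Sequence


def synthesize_envelope(
    g1: Sequence[float],
    g1_ref: Sequence[float],
    g2: Sequence[float],
    g2_ref: Sequence[float],
    mu1: float,
    mu2: float,
) -> List[float]:
    """One frame of G = G1 - mu1 * (G1 - G1_ref) + mu2 * (G2 - G2_ref).

    g1, g2:         envelope outlines of X1 and X2 for the current frame
    g1_ref, g2_ref: reference outlines taken at the synthesis end time
    mu1, mu2:       assumed scalar weights for the two differences
    """
    n = len(g1)
    if not (len(g1_ref) == len(g2) == len(g2_ref) == n):
        raise ValueError("all outlines must have the same number of bins")
    return [
        a - mu1 * (a - a_ref) + mu2 * (b - b_ref)
        for a, a_ref, b, b_ref in zip(g1, g1_ref, g2, g2_ref)
    ]
```

At the frame where both references are taken, both differences are zero, so the result equals the outline of X1 whatever the weights are. This is why the envelope stays continuous at the boundary with the steady period.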

As understood from the above description, each of the attack processing unit 31 and the release processing unit 32 of this embodiment transforms the first sound signal X1 (analysis data D1) using the second sound signal X2 (analysis data D2) at a position on the time axis referenced to an end point (the start time T1_S or the end time T1_E) of the steady period Q1. The attack processing S1 and the release processing S2 illustrated above generate a time series of the fundamental frequency F(t) and a time series of the synthesized spectral envelope outline G(t) of the third sound signal Y representing the transformed sound. The speech synthesis unit 33 in FIG. 2 generates the third sound signal Y from these two time series.

The speech synthesis unit 33 in FIG. 2 synthesizes the third sound signal Y of the transformed sound using the results of the attack processing S1 and the release processing S2 (that is, the transformed analysis data). Specifically, the speech synthesis unit 33 adjusts each frequency spectrum g1 calculated from the first sound signal X1 so that it follows the synthesized spectral envelope outline G(t), and adjusts the fundamental frequency f1 of the first sound signal X1 to the fundamental frequency F(t). The adjustment of the frequency spectrum g1 and the fundamental frequency f1 is performed, for example, in the frequency domain. The speech synthesis unit 33 synthesizes the third sound signal Y by converting the adjusted frequency spectra into the time domain.

As described above, in this embodiment, the difference (G1(t1)-G1_ref) between the spectral envelope outline G1(t1) of the first sound signal X1 and the reference spectral envelope outline G1_ref, and the difference (G2(t2)-G2_ref) between the spectral envelope outline G2(t2) of the second sound signal X2 and the reference spectral envelope outline G2_ref, are synthesized into the spectral envelope outline G1(t1) of the first sound signal X1. It is therefore possible to generate an audibly natural transformed sound whose acoustic characteristics are continuous at the boundaries between the periods of the first sound signal X1 that are transformed using the second sound signal X2 (the processing periods Z1_A and Z1_R) and the periods before and after them.

Furthermore, in this embodiment, the steady period Q1 of the first sound signal X1, in which the fundamental frequency f1 and the spectral shape are temporally stable, is identified, and the first sound signal X1 is transformed using the second sound signal X2 positioned with reference to an end point (the start time T1_S or the end time T1_E) of the steady period Q1. Consequently, an appropriate period of the first sound signal X1 is transformed in accordance with the second sound signal X2, and an audibly natural transformed sound can be generated.

In this embodiment, the processing periods (Z1_A, Z1_R) of the first sound signal X1 are extended in accordance with the time lengths of the expression periods (Z2_A, Z2_R) of the second sound signal X2, so the second sound signal X2 does not need to be extended. Consequently, the acoustic characteristics (for example, sound expressions) of the reference voice are added to the first sound signal X1 accurately, and an audibly natural transformed sound can be generated.

<Modifications>
Specific modifications that may be added to each of the aspects illustrated above are described below. Two or more aspects freely selected from the following examples may be combined as appropriate to the extent that they do not contradict one another.

(1) In the embodiment described above, the steady period Q1 of the first sound signal X1 is identified using the variation index Δ calculated from the first index δ1 and the second index δ2, but the method of identifying the steady period Q1 in accordance with the first index δ1 and the second index δ2 is not limited to that example. For example, the signal analysis unit 21 identifies a first provisional period according to the first index δ1 and a second provisional period according to the second index δ2. The first provisional period is, for example, a voiced period in which the first index δ1 is below a threshold. That is, a period in which the fundamental frequency f1 is temporally stable is identified as the first provisional period. The second provisional period is, for example, a voiced period in which the second index δ2 is below a threshold. That is, a period in which the spectral shape is temporally stable is identified as the second provisional period. The signal analysis unit 21 identifies the period in which the first provisional period and the second provisional period overlap each other as the steady period Q1. That is, a period of the first sound signal X1 in which both the fundamental frequency f1 and the spectral shape are temporally stable is identified as the steady period Q1. As understood from the above description, the calculation of the variation index Δ may be omitted when identifying the steady period Q1. Although the above description focuses on identifying the steady period Q1, the same applies to identifying the steady period Q2 in the second sound signal X2.
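The overlap-based identification in modification (1) can be sketched as follows. The sketch assumes that the indices δ1 and δ2 and a voiced/unvoiced flag are available per analysis frame, and that one threshold is used for each index; the function names and the half-open (start, end) frame-index representation are hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def stable_mask(delta: Sequence[float], voiced: Sequence[bool], threshold: float) -> List[bool]:
    """True for voiced frames whose index is below the threshold (a provisional period)."""
    return [v and d < threshold for d, v in zip(delta, voiced)]


def mask_to_intervals(mask: Sequence[bool]) -> List[Tuple[int, int]]:
    """Convert a frame mask into half-open (start, end) frame intervals."""
    intervals: List[Tuple[int, int]] = []
    start = None
    for i, flag in enumerate(mask):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            intervals.append((start, i))
            start = None
    if start is not None:
        intervals.append((start, len(mask)))
    return intervals


def steady_periods(
    delta1: Sequence[float],
    delta2: Sequence[float],
    voiced: Sequence[bool],
    th1: float,
    th2: float,
) -> List[Tuple[int, int]]:
    """Steady periods as the overlap of the two provisional periods."""
    first = stable_mask(delta1, voiced, th1)    # fundamental frequency is stable
    second = stable_mask(delta2, voiced, th2)   # spectral shape is stable
    return mask_to_intervals([a and b for a, b in zip(first, second)])
```

Because the overlap is taken frame by frame, no combined variation index needs to be computed, which matches the remark that the calculation of Δ may be omitted.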

(2) In the embodiment described above, a period of the first sound signal X1 in which both the fundamental frequency f1 and the spectral shape are temporally stable is identified as the steady period Q1, but a period of the first sound signal X1 in which only one of the fundamental frequency f1 and the spectral shape is temporally stable may be identified as the steady period Q1. Similarly, a period of the second sound signal X2 in which one of the fundamental frequency f2 and the spectral shape is temporally stable may be identified as the steady period Q2.

(3) In the embodiment described above, the spectral envelope outline G1 of the first sound signal X1 at the synthesis start time Tm_R or the synthesis end time Tm_A is used as the reference spectral envelope outline G1_ref, but the time point at which the reference spectral envelope outline G1_ref is extracted (the first time point) is not limited to that example. For example, the spectral envelope outline G1 at an end point (the start time T1_S or the end time T1_E) of the steady period Q1 may be used as the reference spectral envelope outline G1_ref. However, the first time point at which the reference spectral envelope outline G1_ref is extracted is preferably a time point within the steady period Q1, in which the spectral shape of the first sound signal X1 is stable.

The same applies to the reference spectral envelope outline G2_ref. That is, in the embodiment described above, the spectral envelope outline G2 of the second sound signal X2 at the synthesis start time Tm_R or the synthesis end time Tm_A is used as the reference spectral envelope outline G2_ref, but the time point at which the reference spectral envelope outline G2_ref is extracted (the second time point) is not limited to that example. For example, the spectral envelope outline G2 at an end point (the start time T2_S or the end time T2_E) of the steady period Q2 may be used as the reference spectral envelope outline G2_ref. However, the second time point at which the reference spectral envelope outline G2_ref is extracted is preferably a time point within the steady period Q2, in which the spectral shape of the second sound signal X2 is stable.

In addition, the first time point at which the reference spectral envelope outline G1_ref is extracted from the first sound signal X1 and the second time point at which the reference spectral envelope outline G2_ref is extracted from the second sound signal X2 may be different time points on the time axis.

(4) In the embodiment described above, the first sound signal X1 representing a singing voice sung by the user of the sound processing device 100 is processed, but the voice represented by the first sound signal X1 is not limited to a singing voice sung by the user. For example, a first sound signal X1 synthesized by a known speech synthesis technique of the unit-concatenation type or the statistical-model type may be processed. A first sound signal X1 read from a recording medium such as an optical disc may also be processed. The second sound signal X2 may likewise be acquired by any method.

Furthermore, the sound represented by the first sound signal X1 and the second sound signal X2 is not limited to speech in the narrow sense (that is, linguistic sounds uttered by humans). For example, the present invention also applies when various sound expressions (for example, performance expressions) are added to a first sound signal X1 representing the performance sound of a musical instrument. For example, a performance expression such as vibrato is added, using the second sound signal X2, to a first sound signal X1 representing a monotonous performance sound to which no performance expression has been added.

<Appendix>
For example, the following configurations can be understood from the embodiments illustrated above.

A speech processing method according to a preferred aspect (first aspect) of the present invention extends a processing period of a first sound signal representing a singing voice in accordance with the time length of an expression period that is to be applied to the transformation of the first sound signal and that belongs to a second sound signal representing a reference voice having acoustic characteristics different from those of the singing voice, and transforms the first sound signal after the extension of the processing period in accordance with the expression period of the second sound signal. In this aspect, the processing period of the first sound signal is extended in accordance with the time length of the expression period of the second sound signal, so the second sound signal representing the reference voice does not need to be extended or shortened. It is therefore possible to add the acoustic characteristics of the reference voice to the first sound signal accurately.

In a preferred example of the first aspect (second aspect), the processing period is a period including a release portion of the singing voice, and in extending the processing period, the processing period is extended so that the degree of extension becomes smaller at positions closer to the end point of the processing period. According to this aspect, because the degree of extension is smaller at positions closer to the end point of the processing period, the acoustic characteristics of the reference voice can be added while the acoustic characteristics near the end point of the singing voice are maintained.

In a preferred example of the first aspect (third aspect), the processing period is a period including an attack portion of the singing voice, and in extending the processing period, the processing period is extended so that the degree of extension becomes smaller at positions closer to the start point of the processing period. According to this aspect, because the degree of extension is smaller at positions closer to the start point of the processing period, the acoustic characteristics of the reference voice can be added while the acoustic characteristics near the start point of the singing voice are maintained.

A speech processing device according to a preferred aspect (fourth aspect) of the present invention includes a synthesis processing unit that extends a processing period of a first sound signal representing a singing voice in accordance with the time length of an expression period that is to be applied to the transformation of the first sound signal and that belongs to a second sound signal representing a reference voice having acoustic characteristics different from those of the singing voice, and that transforms the first sound signal after the extension of the processing period in accordance with the expression period of the second sound signal.

In a preferred example of the fourth aspect (fifth aspect), the processing period is a period including a release portion of the singing voice, and the synthesis processing unit extends the processing period so that the degree of extension becomes smaller at positions closer to the end point of the processing period.

In a preferred example of the fourth aspect (sixth aspect), the processing period is a period including an attack portion of the singing voice, and the synthesis processing unit extends the processing period so that the degree of extension becomes smaller at positions closer to the start point of the processing period.
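The non-uniform extension in the second, third, fifth and sixth aspects can be pictured as a mapping from a time t in the extended period to a time t1 in the original processing period. The sketch below uses a power-law mapping purely as an assumption chosen for illustration; this passage does not state the mapping actually used, and the function names are hypothetical. In the release case the local degree of extension falls to 1 (no stretching) at the end point, and the attack case is its mirror image.

```python
def release_source_time(t: float, len_src: float, len_dst: float) -> float:
    """Map time t in the extended period [0, len_dst] to [0, len_src].

    Assumed mapping: t1 = len_src * (t / len_dst) ** (len_dst / len_src).
    Its slope is 1 at t = len_dst, so the signal is least stretched near
    the end point and most stretched near the start point.
    """
    if len_src <= 0 or len_dst < len_src:
        raise ValueError("require 0 < len_src <= len_dst")
    if not 0 <= t <= len_dst:
        raise ValueError("t must lie inside the extended period")
    exponent = len_dst / len_src
    return len_src * (t / len_dst) ** exponent


def attack_source_time(t: float, len_src: float, len_dst: float) -> float:
    """Mirror image of the release mapping: least stretched near the start point."""
    return len_src - release_source_time(len_dst - t, len_src, len_dst)
```

For example, extending a 1-second release to 2 seconds maps the midpoint t = 1.0 of the extended period to t1 = 0.25 in the original, so most of the added duration falls in the early part of the processing period. With this particular mapping the stretching grows without bound toward the opposite end point, which a practical implementation would probably want to limit.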

Description of reference signs: 100: speech processing device, 11: control device, 12: storage device, 13: operation device, 14: sound emitting device, 21: signal analysis unit, 22: synthesis processing unit, 31: attack processing unit, 32: release processing unit, 33: speech synthesis unit.

Claims

1. A speech processing method implemented by a computer, the method comprising:
in a state in which the positions on the time axis of a first steady period and a second steady period have been adjusted so that the end point of the first steady period, in which the fundamental frequency and the spectral shape are temporally stable in a first sound signal representing a singing voice, coincides on the time axis with the end point of the second steady period, in which the fundamental frequency and the spectral shape are temporally stable in a second sound signal representing a reference voice having acoustic characteristics different from those of the singing voice, extending a processing period of the first sound signal, running from a time point earlier than the end point of the first steady period by a specific time to a time point at which the singing voice falls silent, to the time length of an expression period of the second sound signal, running from a time point earlier than the end point of the second steady period by the specific time to a time point at which the reference voice falls silent; and
adding acoustic characteristics relating to voice quality in the expression period of the second sound signal to the processing period in the extended first sound signal.

2. The speech processing method according to claim 1, wherein, in extending the processing period, the processing period is extended so that the degree of extension becomes smaller at positions closer to the end point of the processing period.

3. A speech processing method implemented by a computer, the method comprising:
extending a processing period of a first sound signal representing a singing voice, running from a time point at which the singing voice starts to the start point of a first steady period in which the fundamental frequency and the spectral shape are temporally stable in the first sound signal, to the time length of an expression period of a second sound signal representing a reference voice having acoustic characteristics different from those of the singing voice, running from a time point at which the reference voice starts to the start point of a second steady period in which the fundamental frequency and the spectral shape are temporally stable in the second sound signal; and
adding acoustic characteristics relating to voice quality in the expression period of the second sound signal to the processing period in the extended first sound signal.

4. The speech processing method according to claim 3, wherein, in extending the processing period, the processing period is extended so that the degree of extension becomes smaller at positions closer to the start point of the processing period.

5. A speech processing device comprising a synthesis processing unit that, in a state in which the positions on the time axis of a first steady period and a second steady period have been adjusted so that the end point of the first steady period, in which the fundamental frequency and the spectral shape are temporally stable in a first sound signal representing a singing voice, coincides on the time axis with the end point of the second steady period, in which the fundamental frequency and the spectral shape are temporally stable in a second sound signal representing a reference voice having acoustic characteristics different from those of the singing voice, extends a processing period of the first sound signal, running from a time point earlier than the end point of the first steady period by a specific time to a time point at which the singing voice falls silent, to the time length of an expression period of the second sound signal, running from a time point earlier than the end point of the second steady period by the specific time to a time point at which the reference voice falls silent, and adds acoustic characteristics relating to voice quality in the expression period of the second sound signal to the processing period in the extended first sound signal.

6. The speech processing device according to claim 5, wherein the synthesis processing unit extends the processing period so that the degree of extension becomes smaller at positions closer to the end point of the processing period.

7. A speech processing device comprising a synthesis processing unit that extends a processing period of a first sound signal representing a singing voice, running from a time point at which the singing voice starts to the start point of a first steady period in which the fundamental frequency and the spectral shape are temporally stable in the first sound signal, to the time length of an expression period of a second sound signal representing a reference voice having acoustic characteristics different from those of the singing voice, running from a time point at which the reference voice starts to the start point of a second steady period in which the fundamental frequency and the spectral shape are temporally stable in the second sound signal, and adds acoustic characteristics relating to voice quality in the expression period of the second sound signal to the processing period in the extended first sound signal.

8. The speech processing device according to claim 7, wherein the synthesis processing unit extends the processing period so that the degree of extension becomes smaller at positions closer to the start point of the processing period.

9. A program that causes a computer to function as a synthesis processing unit that, in a state in which the positions on the time axis of a first steady period and a second steady period have been adjusted so that the end point of the first steady period, in which the fundamental frequency and the spectral shape are temporally stable in a first sound signal representing a singing voice, coincides on the time axis with the end point of the second steady period, in which the fundamental frequency and the spectral shape are temporally stable in a second sound signal representing a reference voice having acoustic characteristics different from those of the singing voice, extends a processing period of the first sound signal, running from a time point earlier than the end point of the first steady period by a specific time to a time point at which the singing voice falls silent, to the time length of an expression period of the second sound signal, running from a time point earlier than the end point of the second steady period by the specific time to a time point at which the reference voice falls silent, and adds acoustic characteristics relating to voice quality in the expression period of the second sound signal to the processing period in the extended first sound signal.

10. A program that causes a computer to function as a synthesis processing unit that extends a processing period of a first sound signal representing a singing voice, running from a time point at which the singing voice starts to the start point of a first steady period in which the fundamental frequency and the spectral shape are temporally stable in the first sound signal, to the time length of an expression period of a second sound signal representing a reference voice having acoustic characteristics different from those of the singing voice, running from a time point at which the reference voice starts to the start point of a second steady period in which the fundamental frequency and the spectral shape are temporally stable in the second sound signal, and adds acoustic characteristics relating to voice quality in the expression period of the second sound signal to the processing period in the extended first sound signal.