JP2022065566A

JP2022065566A - Method for synthesizing voice and program

Info

Publication number: JP2022065566A
Application number: JP2020174248A
Authority: JP
Inventors: 竜之介大道; Ryunosuke Daido; 慶二郎才野; Keijiro Saino
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2020-10-15
Filing date: 2020-10-15
Publication date: 2022-04-27

Abstract

To provide a voice synthesis method that makes it easy to add, to acoustic data of a specific sound quality, further acoustic data of the same sound quality. SOLUTION: The voice synthesis method includes: preparing a score encoder 111 that generates an intermediate feature amount MF1 from a score feature amount of score data D1, an acoustic encoder 121 that generates an intermediate feature amount MF2 from an acoustic feature amount of acoustic data D2, and an acoustic decoder 133 that generates an acoustic feature amount AFS based on the MF1 and the MF2; receiving auxiliary learning acoustic data D2_T and training the acoustic decoder 133, using the MF2 generated by the acoustic encoder from the acoustic feature amount of D2_T together with that acoustic feature amount, so that the decoder generates an acoustic feature amount AFS close to the acoustic feature amount of D2_T; and generating an acoustic feature amount AFS by processing, with the trained acoustic decoder 133, the MF1 that the score encoder generates from score data D1 arranged on the time axis of D2_T. SELECTED DRAWING: Figure 2

Description

The present invention relates to a voice synthesis method and a program. In this specification, "voice" means sound in general and is not limited to the human voice.

Voice synthesizers that synthesize the singing voice of a specific singer or the performance sound of a specific musical instrument are known. A voice synthesizer based on machine learning is trained on acoustic data of a specific singer or instrument, paired with score data, as teacher data. Once trained on the acoustic data of a specific singer or instrument, the synthesizer synthesizes and outputs the singing voice of that singer or the performance sound of that instrument when the user supplies score data. Patent Document 1 below discloses a technique for synthesizing a singing voice using machine learning. A technique that converts the voice quality of a singing voice by using singing voice synthesis is also known.

Japanese Unexamined Patent Publication No. 2019-101094

A user may want to add a short passage of singing or spoken lines, in the same sound quality as the singer, to a track on which that singer's singing voice is recorded, or to make a small correction to the singing on that track. Likewise, a user may want to add a short performance passage with the same sound quality as the instrument to a track on which an instrument's performance sound is recorded, or to make a small correction to the performance on that track. In those cases, the singer's singing or the instrument's performance had to be re-recorded for that part of the track.

A voice synthesizer based on machine learning can be trained to synthesize the singing voice of a desired singer or the performance sound of a desired instrument. For that training, however, acoustic data of the singer's singing or the instrument's performance is not enough: labeling work is also needed to prepare score data corresponding to that acoustic data.

An object of the present invention is to provide a voice synthesis method that makes it easy to add acoustic data of the same sound quality to recorded acoustic data of a specific sound quality, or to partially modify the acoustic data while preserving that sound quality.

A voice synthesis method according to one aspect of the present invention is a computer-implemented voice synthesis method that: prepares a score encoder that generates a first intermediate feature amount from a score feature amount of score data, an acoustic encoder that generates a second intermediate feature amount from an acoustic feature amount of acoustic data, and an acoustic decoder that generates an acoustic feature amount based on the first intermediate feature amount or the second intermediate feature amount; receives auxiliary learning acoustic data; performs auxiliary training of the acoustic decoder, using the second intermediate feature amount generated by the acoustic encoder from the acoustic feature amount of the auxiliary learning acoustic data and the acoustic feature amount of the auxiliary learning acoustic data, so that the acoustic decoder generates an acoustic feature amount close to the acoustic feature amount of the auxiliary learning acoustic data; receives, via a user interface, score data arranged on the time axis of the auxiliary learning acoustic data; and generates an acoustic feature amount by processing, with the auxiliary-trained acoustic decoder, the first intermediate feature amount that the score encoder generates from the arranged score data.

A voice synthesis program according to another aspect of the present invention is a program that causes a computer to execute a voice synthesis method. Based on the program, the computer: prepares a score encoder that generates a first intermediate feature amount from a score feature amount of score data, an acoustic encoder that generates a second intermediate feature amount from an acoustic feature amount of acoustic data, and an acoustic decoder that generates an acoustic feature amount based on the first intermediate feature amount or the second intermediate feature amount; receives auxiliary learning acoustic data; performs auxiliary training of the acoustic decoder, using the second intermediate feature amount generated by the acoustic encoder from the acoustic feature amount of the auxiliary learning acoustic data and the acoustic feature amount of the auxiliary learning acoustic data, so that the acoustic decoder generates an acoustic feature amount close to the acoustic feature amount of the auxiliary learning acoustic data; receives, via a user interface, score data arranged on the time axis of the auxiliary learning acoustic data; and generates an acoustic feature amount by processing, with the auxiliary-trained acoustic decoder, the first intermediate feature amount that the score encoder generates from the arranged score data.

The present invention provides a voice synthesis method that makes it easy to add acoustic data of the same sound quality to recorded acoustic data of a specific sound quality, or to partially modify the acoustic data while preserving that sound quality.

FIG. 1 is a configuration diagram of the voice synthesizer according to the embodiment.
FIG. 2 is a functional block diagram of the voice synthesizer according to the embodiment.
FIG. 3 is a diagram showing the data used by the voice synthesizer.
FIG. 4 is a flowchart showing the basic training method according to the embodiment.
FIG. 5 is a flowchart showing the voice synthesis method according to the embodiment.
FIG. 6 is a diagram showing the user interface of the voice synthesizer.
FIG. 7 is a diagram showing the user interface of the voice synthesizer.
FIG. 8 is a flowchart showing the acoustic decoder training method according to the embodiment.
FIG. 9 is a diagram showing the user interface of the voice synthesizer.
FIG. 10 is a diagram showing the user interface of the voice synthesizer.
FIG. 11 is a diagram showing the user interface of the voice synthesizer.

(1) Configuration of the voice synthesizer
The voice synthesizer according to an embodiment of the present invention is described in detail below with reference to the drawings. FIG. 1 is a configuration diagram showing the voice synthesizer 1 according to the embodiment. As shown in FIG. 1, the voice synthesizer 1 includes a CPU (Central Processing Unit) 11, a RAM (Random Access Memory) 12, a ROM (Read Only Memory) 13, an operation unit 14, a display unit 15, a storage device 16, a sound system 17, a device interface 18, and a communication interface 19. The voice synthesizer 1 is implemented on, for example, a personal computer, a tablet terminal, or a smartphone.

The CPU 11 consists of one or more processors and controls the voice synthesizer 1 as a whole. The RAM 12 is used as a work area when the CPU 11 executes a program. The ROM 13 stores a control program and the like. The operation unit 14 accepts the user's operations on the voice synthesizer 1 and is, for example, a mouse or a keyboard. The display unit 15 displays the user interface of the voice synthesizer 1. The operation unit 14 and the display unit 15 may be configured together as a touch panel display. The sound system 17 includes a sound source, functions for D/A conversion and amplification of audio signals, a speaker that outputs the audio signal converted to analog, and the like. The device interface 18 is an interface through which the CPU 11 accesses a storage medium RM such as a CD-ROM or a semiconductor memory. The communication interface 19 is an interface through which the CPU 11 connects to a network such as the Internet.

The storage device 16 stores a voice synthesis program P1, a training program P2, score data D1, and acoustic data D2. The voice synthesis program P1 is a program for generating voice-synthesized acoustic data or sound-quality-converted acoustic data. The training program P2 is a program for training the encoders and the acoustic decoder used for voice synthesis or sound quality conversion.

The score data D1 is data that defines a musical piece. It includes information on the pitch and intensity of each note, information on the phonemes within each note (for singing only), information on the sounding period of each note, information on performance markings, and the like. The acoustic data D2 is waveform data of sound, for example waveform data of singing or of an instrument sound. In the voice synthesizer 1, the content of one song is generated using the score data D1 and the acoustic data D2.

(2) Functional configuration of the voice synthesizer
FIG. 2 is a functional block diagram of the voice synthesizer 1. As shown in FIG. 2, the voice synthesizer 1 includes a control unit 100. The control unit 100 includes a conversion unit 110, a score encoder 111, a pitch model 112, an analysis unit 120, an acoustic encoder 121, a switching unit 131, a switching unit 132, an acoustic decoder 133, and a vocoder 134. In FIG. 2, the control unit 100 is a functional unit realized when the CPU 11 executes the voice synthesis program P1 while using the RAM 12 as a work area. That is, the conversion unit 110, the score encoder 111, the pitch model 112, the analysis unit 120, the acoustic encoder 121, the switching unit 131, the switching unit 132, the acoustic decoder 133, and the vocoder 134 are functional units realized when the voice synthesis program P1 is executed by the CPU 11. The score encoder 111, the acoustic encoder 121, and the acoustic decoder 133 learn their functions when the CPU 11 executes the training program P2 while using the RAM 12 as a work area.

The conversion unit 110 reads the score data D1 and generates various kinds of score feature data SF from it. The conversion unit 110 outputs the score feature data SF to the score encoder 111 and the pitch model 112. The score feature data SF that the score encoder 111 obtains from the conversion unit 110 is a factor that controls the sound quality at each time point, for example a context of pitch, intensity, phoneme label, and the like. The score feature data SF that the pitch model 112 obtains from the conversion unit 110 is a factor that controls the pitch at each time point, for example the context of a note specified by its pitch and sounding period. A context includes, in addition to the data at each time point, the data before it, the data after it, or both.
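As a rough illustration of the context idea described above, the following Python sketch builds one context per analysis frame from a list of notes. The `Note` class, the MIDI-style pitch numbers, the `frame_period` parameter, and the one-note look-behind and look-ahead are all assumptions made for this example; the specification does not prescribe a concrete representation of the score feature data SF.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Note:
    """Hypothetical note record; field names are illustrative only."""
    pitch: int                     # MIDI-style note number (assumed encoding)
    onset: float                   # start time in seconds
    duration: float                # sounding period in seconds
    phoneme: Optional[str] = None  # present only for singing


def frame_contexts(notes: List[Note], frame_period: float) -> List[dict]:
    """Return one context dict per frame: the current note plus its neighbours."""
    ordered = sorted(notes, key=lambda n: n.onset)
    if not ordered:
        return []
    end_time = max(n.onset + n.duration for n in ordered)
    n_frames = int(round(end_time / frame_period))
    contexts = []
    for i in range(n_frames):
        t = i * frame_period
        # Index of the note sounding at time t, or None during a rest.
        idx = next(
            (k for k, n in enumerate(ordered)
             if n.onset <= t < n.onset + n.duration),
            None,
        )
        if idx is None:
            contexts.append({"pitch": None, "phoneme": None,
                             "prev_pitch": None, "next_pitch": None})
            continue
        prev_note = ordered[idx - 1] if idx > 0 else None
        next_note = ordered[idx + 1] if idx + 1 < len(ordered) else None
        contexts.append({
            "pitch": ordered[idx].pitch,
            "phoneme": ordered[idx].phoneme,
            "prev_pitch": prev_note.pitch if prev_note else None,
            "next_pitch": next_note.pitch if next_note else None,
        })
    return contexts
```

Each frame carries the data for its own time point together with data from before and after it, which is the sense in which the text uses the word "context".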

The score encoder 111 generates intermediate feature data MF1 from the score feature data SF. The trained score encoder 111 is a statistical model that generates the intermediate feature data MF1 from the score feature data SF, and is defined by a plurality of variables 111_P stored in the storage device 16. In the present embodiment, a generative model that outputs the intermediate feature data MF1 corresponding to the score feature data SF is used as the score encoder 111. The generative model constituting the score encoder 111 is, for example, a convolutional neural network (CNN), a recurrent neural network (RNN), or a combination of the two. An autoregressive model or a model with attention may also be used.

The pitch model 112 reads the score feature data SF and generates from it the fundamental frequency F0 of the sound in the musical piece. The pitch model 112 outputs the generated fundamental frequency F0 to the switching unit 132. The trained pitch model 112 is a statistical model that generates the fundamental frequency F0 of the sound in the musical piece from the score feature data SF, and is defined by a plurality of variables 112_P stored in the storage device 16. In the present embodiment, a generative model that outputs the fundamental frequency F0 corresponding to the score feature data SF is used as the pitch model 112. The generative model constituting the pitch model 112 is, for example, a CNN, an RNN, or a combination of the two. An autoregressive model or a model with attention may also be used. Conversely, a simpler model such as a hidden Markov model or a random forest may be used.

The analysis unit 120 reads the acoustic data D2 and performs frequency analysis on it. Through this frequency analysis, the analysis unit 120 generates the fundamental frequency F0 and the acoustic feature data AF of the sound represented by the acoustic data D2. The acoustic feature data AF represents the frequency spectrum of the sound represented by the acoustic data D2 and is, for example, a mel-scale log-spectrum (MSLS). The analysis unit 120 outputs the fundamental frequency F0 to the switching unit 132 and outputs the acoustic feature data AF to the acoustic encoder 121.
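To make the "mel-scale log-spectrum" mentioned above concrete, here is a minimal pure-Python sketch that computes one such frame. It is not the analysis unit 120 itself: the Hann window, the naive DFT, the triangular filterbank, the common `2595 * log10(1 + f / 700)` mel formula, and all function and parameter names are assumptions chosen for illustration, and F0 estimation is not shown.

```python
import cmath
import math
from typing import List


def hz_to_mel(hz: float) -> float:
    """Convert Hz to mel using a widely used formula (an assumed choice)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def magnitude_spectrum(frame: List[float]) -> List[float]:
    """Hann-window the frame and return DFT magnitudes for bins 0..N/2."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def mel_filterbank(n_mels: int, n_fft: int, sample_rate: float) -> List[List[float]]:
    """Triangular filters whose centers are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_freqs = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, center, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for f in bin_freqs:
            if lo < f <= center:
                row.append((f - lo) / (center - lo))   # rising slope
            elif center < f < hi:
                row.append((hi - f) / (hi - center))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mel_log_spectrum(frame: List[float], sample_rate: float,
                     n_mels: int = 8, floor: float = 1e-10) -> List[float]:
    """Return the log of the mel-band energies for a single frame."""
    mags = magnitude_spectrum(frame)
    bank = mel_filterbank(n_mels, len(frame), sample_rate)
    return [math.log(max(sum(w * a for w, a in zip(row, mags)), floor))
            for row in bank]
```

A production system would normally use an FFT library and many more mel bands; the naive DFT here only keeps the example dependency-free.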

The acoustic encoder 121 generates intermediate feature data MF2 from the acoustic feature data AF. The trained acoustic encoder 121 is a statistical model that generates the intermediate feature data MF2 from the acoustic feature data AF, and is defined by a plurality of variables 121_P stored in the storage device 16. In the present embodiment, a generative model that outputs the intermediate feature data MF2 corresponding to the acoustic feature data AF is used as the acoustic encoder 121. The generative model constituting the acoustic encoder 121 is, for example, a CNN, an RNN, or a combination of the two.

The switching unit 131 receives the intermediate feature data MF1 from the score encoder 111 and the intermediate feature data MF2 from the acoustic encoder 121. It selectively outputs one of the two, either the MF1 from the score encoder 111 or the MF2 from the acoustic encoder 121, to the acoustic decoder 133.

The switching unit 132 receives a fundamental frequency F0 from the pitch model 112 and a fundamental frequency F0 from the analysis unit 120. It selectively outputs one of the two, either the F0 from the pitch model 112 or the F0 from the analysis unit 120, to the acoustic decoder 133.

The acoustic decoder 133 generates acoustic feature data AFS based on the intermediate feature data MF1 or the intermediate feature data MF2. The acoustic feature data AFS is a frequency amplitude spectrum, for example a mel-scale log-spectrum. The acoustic decoder 133 is a statistical model that generates the acoustic feature data AFS, and is defined by a plurality of variables 133_P stored in the storage device 16. In the present embodiment, a model that outputs the acoustic feature data AFS corresponding to the intermediate feature data MF1 or the intermediate feature data MF2 is used as the acoustic decoder 133. The model constituting the acoustic decoder 133 is, for example, a CNN, an RNN, or a combination of the two. An autoregressive model or a model with attention may also be used.

The vocoder 134 generates synthesized acoustic data D3 based on the acoustic feature data AFS input from the acoustic decoder 133. When the acoustic feature data AFS is a mel-scale log-spectrum, the vocoder 134 converts the mel-scale log-spectrum input from the acoustic decoder 133 into a time-domain acoustic signal to generate the synthesized acoustic data D3.
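The routing through the switching units 131 and 132, the acoustic decoder 133, and the vocoder 134 can be sketched as follows. This is only an illustration of the data flow: the `Source` enum, the `synthesize` function, the callable signatures, and the choice to drive both switching units from a single selector are assumptions of this example. The sound source ID argument reflects the decoder input described in the basic training section of this specification.

```python
from enum import Enum
from typing import Any, Callable, Optional


class Source(Enum):
    SCORE = "score"  # path via score encoder 111 and pitch model 112
    AUDIO = "audio"  # path via analysis unit 120 and acoustic encoder 121


def synthesize(
    source: Source,
    source_id: int,
    decoder: Callable[[int, Any, Any], Any],  # stands in for acoustic decoder 133
    vocoder: Callable[[Any], Any],            # stands in for vocoder 134
    *,
    mf1: Optional[Any] = None,
    f0_pitch_model: Optional[Any] = None,
    mf2: Optional[Any] = None,
    f0_analysis: Optional[Any] = None,
) -> Any:
    """Select one (intermediate feature, F0) pair, decode it, and vocode it."""
    # Switching unit 131: choose MF1 or MF2.
    mf = mf1 if source is Source.SCORE else mf2
    # Switching unit 132: choose the F0 from the pitch model or from the analysis.
    f0 = f0_pitch_model if source is Source.SCORE else f0_analysis
    if mf is None or f0 is None:
        raise ValueError(f"inputs for the {source.value} path are missing")
    afs = decoder(source_id, f0, mf)  # acoustic feature data AFS
    return vocoder(afs)               # synthesized acoustic data D3
```

Because both paths end at the same decoder, the same trained decoder can serve score-driven synthesis and audio-driven sound quality conversion.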

(3) Information used by the voice synthesizer
FIG. 3 shows the data used by the voice synthesizer 1. The voice synthesizer 1 uses the score data D1 and the acoustic data D2 as data for voice synthesis. As described above, the score data D1 is data that defines a musical piece. It includes information on the pitch and the like of each note, information on the phonemes within each note (for singing only), information on the sounding period of each note, information on performance markings, and the like. As described above, the acoustic data D2 is waveform data of sound, for example waveform data of singing or of an instrument sound. Each piece of singing waveform data is assigned a sound source ID indicating the singer who sang it, and each piece of instrument waveform data is assigned a sound source ID indicating the instrument. The sound source ID indicates the source that produced the sound represented by the waveform data.

The score data D1 used by the voice synthesizer 1 comprises basic learning score data D1_R and synthesis score data D1_S. The acoustic data D2 used by the voice synthesizer 1 comprises the corresponding basic learning acoustic data D2_R and synthesis acoustic data D2_S, together with auxiliary learning acoustic data D2_T. The basic learning score data D1_R corresponding to the basic learning acoustic data D2_R represents the score (note sequence and the like) of the performance in the basic learning acoustic data D2_R. The synthesis score data D1_S corresponding to the synthesis acoustic data D2_S represents the score (note sequence and the like) of the performance in the synthesis acoustic data D2_S. Although FIGS. 1 and 2 show the score data D1 and the acoustic data D2 in the storage device 16, what is actually stored as the score data D1 is the basic learning score data D1_R and the synthesis score data D1_S, and what is stored as the acoustic data D2 is the basic learning acoustic data D2_R, the synthesis acoustic data D2_S, and the auxiliary learning acoustic data D2_T.

The basic learning score data D1_R is data used to train the score encoder 111, the acoustic encoder 121, and the acoustic decoder 133. The basic learning acoustic data D2_R is likewise data used to train the score encoder 111, the acoustic encoder 121, and the acoustic decoder 133. Once the score encoder 111, the acoustic encoder 121, and the acoustic decoder 133 have been trained using the basic learning score data D1_R and the basic learning acoustic data D2_R, the voice synthesizer 1 is able to synthesize sound of the sound quality identified by a sound source ID.

The synthesis score data D1_S is data given to the voice synthesizer 1 once it is able to synthesize sound of a specific sound quality. Based on the synthesis score data D1_S, the voice synthesizer 1 generates synthesized acoustic data D3 of sound having the sound quality identified by a sound source ID. For example, in singing synthesis, given lyrics (phonemes) and a melody (note sequence), the voice synthesizer 1 can synthesize and output the singing voice of the singer identified by the sound source ID. In instrument sound synthesis, given a melody (note sequence), it can synthesize and output the performance sound of the instrument identified by the sound source ID.

The synthesis acoustic data D2_S is data given to the voice synthesizer 1 once it is able to synthesize sound of a specific sound quality. Based on the synthesis acoustic data D2_S, the voice synthesizer 1 generates synthesized acoustic data D3 of sound having the sound quality identified by a sound source ID. For example, given synthesis acoustic data D2_S of a singer or instrument with an arbitrary sound source ID, the voice synthesizer 1 synthesizes and outputs the singing voice of a singer or the performance sound of an instrument identified by a different sound source ID. Used in this way, the voice synthesizer 1 functions as a kind of sound quality converter.

The auxiliary learning acoustic data D2_T is data used to train the acoustic decoder 133. It is learning data for changing the sound quality that the acoustic decoder 133 synthesizes. Once the acoustic decoder 133 has been trained using the auxiliary learning acoustic data D2_T, the voice synthesizer 1 is able to synthesize the singing voice of a new, different singer.

(4) Basic training method
Next, the basic training method of the voice synthesizer 1 according to the present embodiment is described. FIG. 4 is a flowchart showing this basic training method. In the basic training, the score encoder 111, the acoustic encoder 121, and the acoustic decoder 133 of the voice synthesizer 1 are trained. The basic training method shown in FIG. 4 is realized when the CPU 11 executes the training program P2 for each processing step of the machine learning. In one processing step, acoustic data corresponding to a plurality of frames of the frequency analysis is processed.

Before the basic training method of FIG. 4 is executed, a plurality of sets of basic learning score data D1_R and corresponding basic learning acoustic data D2_R are prepared as teacher data for each sound source ID and stored in the storage device 16. The basic learning score data D1_R and the basic learning acoustic data D2_R prepared as teacher data are data for basic training on musical pieces of the sound quality identified by each sound source ID. The description here takes as an example the case in which the basic learning score data D1_R and the basic learning acoustic data D2_R are prepared for basic training on the singing voices of a plurality of singers identified by a plurality of sound source IDs.

In step S101, the CPU 11, acting as the conversion unit 110, generates score feature data SF based on the basic learning score data D1_R. In the present embodiment, data indicating phoneme labels, for example, is used as the score feature data SF that represents the features of the score for generating acoustic features. Next, in step S102, the CPU 11, acting as the analysis unit 120, generates acoustic feature data AF representing a frequency spectrum based on the basic learning acoustic data D2_R, whose sound quality is identified by a sound source ID. In the present embodiment, a mel-scale log-spectrum, for example, is used as the acoustic feature data AF. The processing of step S102 may be executed before the processing of step S101.

次に、ステップＳ１０３において、ＣＰＵ１１が、楽譜エンコーダ１１１を用いて、楽譜特徴データＳＦを処理して、中間特徴データＭＦ１を生成する。次に、ステップＳ１０４において、ＣＰＵ１１が、音響エンコーダ１２１を用いて、音響特徴データＡＦを処理して、中間特徴データＭＦ２を生成する。なお、ステップＳ１０４の処理をステップＳ１０３の処理の前に実行してもよい。 Next, in step S103, the CPU 11 processes the score feature data SF using the score encoder 111 to generate the intermediate feature data MF1. Next, in step S104, the CPU 11 processes the acoustic feature data AF using the acoustic encoder 121 to generate the intermediate feature data MF2. The process of step S104 may be executed before the process of step S103.

次に、ステップＳ１０５において、ＣＰＵ１１が、音響デコーダ１３３を用いて、基本学習用音響データＤ２＿Ｒの音源ＩＤと基本周波数Ｆ０と中間特徴データＭＦ１とを処理して音響特徴データＡＦＳ１を生成し、また、その音源ＩＤと基本周波数Ｆ０と中間特徴データＭＦ２とを処理して音響特徴データＡＦＳ２を生成する。本実施の形態においては、周波数スペクトルを示す音響特徴データＡＦＳとして、例えば、メル周波数対数スペクトルが用いられる。なお、音響デコーダ１３３は、音響デコードを実行するときに、切換部１３２から基本周波数Ｆ０を入力する。基本周波数Ｆ０は、入力データが基本学習用楽譜データＤ１＿Ｒである場合には、ピッチモデル１１２により生成され、入力データが基本学習用音響データＤ２＿Ｒである場合には、分析部１２０により生成される。また、音響デコーダ１３３は、音響デコードを実行するときに、歌手を特定する識別しとして音源ＩＤを入力する。これら基本周波数Ｆ０および音源ＩＤは、中間特徴データＭＦ１，ＭＦ２とともに、音響デコーダ１３３を構成する生成モデルへの入力値となる。 Next, in step S105, the CPU 11 uses the acoustic decoder 133 to process the sound source ID of the basic learning acoustic data D2_R, the fundamental frequency F0, and the intermediate feature data MF1 to generate the acoustic feature data AFS1, and likewise processes the same sound source ID, the fundamental frequency F0, and the intermediate feature data MF2 to generate the acoustic feature data AFS2. In the present embodiment, for example, a mel frequency logarithmic spectrum is used as the acoustic feature data AFS indicating the frequency spectrum. When executing acoustic decoding, the acoustic decoder 133 receives the fundamental frequency F0 from the switching unit 132. The fundamental frequency F0 is generated by the pitch model 112 when the input data is the basic learning score data D1_R, and by the analysis unit 120 when the input data is the basic learning acoustic data D2_R. The acoustic decoder 133 also receives the sound source ID as an identifier that specifies the singer. The fundamental frequency F0 and the sound source ID, together with the intermediate feature data MF1 and MF2, are the input values to the generative model constituting the acoustic decoder 133.

次に、ステップＳ１０６において、ＣＰＵ１１が、中間特徴データＭＦ１および中間特徴データＭＦ２が相互に近づくように、かつ、音響特徴データＡＦＳが正解である音響特徴データＡＦに近づくように、楽譜エンコーダ１１１、音響エンコーダ１２１および音響デコーダ１３３を訓練する。つまり、中間特徴データＭＦ１は楽譜特徴データＳＦ（例えば、音素ラベルを示す）から生成され、中間特徴データＭＦ２は周波数スペクトル（例えば、メル周波数対数スペクトル）から生成されるが、これら２つの中間特徴データＭＦ１，ＭＦ２の距離が相互に近づくように、楽譜エンコーダ１１１の生成モデルおよび音響エンコーダ１２１の生成モデルが訓練される。 Next, in step S106, the CPU 11 trains the score encoder 111, the acoustic encoder 121, and the acoustic decoder 133 so that the intermediate feature data MF1 and the intermediate feature data MF2 approach each other and so that the acoustic feature data AFS approaches the acoustic feature data AF, which is the correct answer. That is, the intermediate feature data MF1 is generated from the score feature data SF (for example, data indicating phoneme labels), and the intermediate feature data MF2 is generated from a frequency spectrum (for example, a mel frequency logarithmic spectrum); the generative model of the score encoder 111 and the generative model of the acoustic encoder 121 are trained so that the distance between these two intermediate feature data MF1 and MF2 becomes small.

具体的には、中間特徴データＭＦ１と中間特徴データＭＦ２の間の差を減らすように、その差のバックプロバケーションが実行され、楽譜エンコーダ１１１の変数１１１＿Ｐおよび音響エンコーダ１２１の変数１２１＿Ｐが更新される。中間特徴データＭＦ１および中間特徴データＭＦ２の差としては、例えば、これら２つのデータを表すベクトルのユークリッド距離が用いられる。並行して、音響デコーダ１３３から生成された音響特徴データＡＦＳが教師データである基本学習用音響データＤ２＿Ｒから生成された音響特徴データＡＦに近づくように、誤差のバックプロバケーションが実行され、楽譜エンコーダ１１１の変数１１１＿Ｐ、音響エンコーダ１２１の変数１２１＿Ｐおよび音響デコーダ１３３の変数１３３＿Ｐが更新される。 Specifically, backpropagation of the difference between the intermediate feature data MF1 and the intermediate feature data MF2 is executed so as to reduce that difference, and the variable 111_P of the score encoder 111 and the variable 121_P of the acoustic encoder 121 are updated. As the difference between the intermediate feature data MF1 and the intermediate feature data MF2, for example, the Euclidean distance between the vectors representing these two data is used. In parallel, backpropagation of the error is executed so that the acoustic feature data AFS generated by the acoustic decoder 133 approaches the acoustic feature data AF generated from the basic learning acoustic data D2_R, which is the teacher data, and the variable 111_P of the score encoder 111, the variable 121_P of the acoustic encoder 121, and the variable 133_P of the acoustic decoder 133 are updated.
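The two objectives of step S106 can be pictured as one scalar loss. The sketch below is only an illustration under stated assumptions: the description names the Euclidean distance between MF1 and MF2 as one example, but the use of mean squared error for the reconstruction term, the weighting factor `alpha`, and all function names are hypothetical and not taken from the source.

```python
import math

def euclidean_distance(mf1, mf2):
    """Euclidean distance between two intermediate feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mf1, mf2)))

def mean_squared_error(afs, af):
    """Error between generated features (AFS) and reference features (AF)."""
    return sum((a - b) ** 2 for a, b in zip(afs, af)) / len(af)

def basic_training_loss(mf1, mf2, afs1, afs2, af, alpha=1.0):
    """Latent-matching term plus both reconstruction terms.

    alpha is a hypothetical weight; the source does not state how the
    two objectives are balanced.
    """
    latent_term = euclidean_distance(mf1, mf2)
    recon_term = mean_squared_error(afs1, af) + mean_squared_error(afs2, af)
    return alpha * latent_term + recon_term
```

In a real system the gradient of such a loss would be backpropagated into the variables 111_P, 121_P, and 133_P by an automatic differentiation framework; that part is omitted here.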

１つの処理ステップ（ステップＳ１０１～Ｓ１０６）の学習処理を、複数の教師データである基本学習用楽譜データＤ１＿Ｒおよび基本学習用音響データＤ２＿Ｒについて、繰り返し実行することにより、楽譜エンコーダ１１１、音響エンコーダ１２１および音響デコーダ１３３が、各音源ＩＤで特定される特定の音質であり、楽譜特徴量に応じて音質が変化する音響データ（歌手の歌声や、楽器の演奏音に対応）を合成可能な状態に訓練される。具体的には、訓練済みの音声合成器１は、楽譜データＤ１に基づいて、楽譜エンコーダ１１１および音響デコーダ１３３を用いて、訓練済みの特定の音質の音声（歌声や楽器音）を合成可能である。また、訓練済みの音声合成器１は、音響データＤ２に基づいて、音響エンコーダ１２１および音響デコーダ１３３を用いて、訓練済みの特定の音質の音声（歌声や楽器音）を合成可能である。 By repeatedly executing the learning process of one processing step (steps S101 to S106) for the plurality of teacher data, that is, the basic learning score data D1_R and the basic learning acoustic data D2_R, the score encoder 111, the acoustic encoder 121, and the acoustic decoder 133 are trained to a state in which they can synthesize acoustic data (corresponding to a singer's singing voice or an instrument's performance sound) that has the specific sound quality specified by each sound source ID and whose sound quality changes according to the score feature amount. Specifically, the trained speech synthesizer 1 can synthesize voice (singing voice or instrument sound) of a trained specific sound quality from the score data D1 by using the score encoder 111 and the acoustic decoder 133. The trained speech synthesizer 1 can also synthesize voice (singing voice or instrument sound) of a trained specific sound quality from the acoustic data D2 by using the acoustic encoder 121 and the acoustic decoder 133.

上述したように、音響デコーダ１３３の訓練では、各基本学習用音響データＤ２＿Ｒに付与された音源ＩＤを入力値として利用する。したがって、音響デコーダ１３３は、複数の音源ＩＤの基本学習用音響データＤ２＿Ｒを利用することにより、複数の歌手の歌声や複数の楽器の演奏音を相互に区別して学習可能である。 As described above, the training of the acoustic decoder 133 uses the sound source ID assigned to each basic learning acoustic data D2_R as an input value. Therefore, by using the basic learning acoustic data D2_R of a plurality of sound source IDs, the acoustic decoder 133 can learn the singing voices of a plurality of singers and the performance sounds of a plurality of musical instruments while distinguishing them from one another.
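One simple way to picture how a sound source ID can act as a decoder input is to encode it as a one-hot vector and concatenate it with the fundamental frequency F0 and the intermediate feature for each frame. The description only states that the ID, F0, and MF1/MF2 are inputs; the one-hot encoding, the input layout, and the function names below are assumptions for illustration (a learned embedding would be another plausible choice).

```python
def one_hot(source_id, num_sources):
    """Encode a sound source ID as a one-hot vector (an assumed encoding)."""
    if not 0 <= source_id < num_sources:
        raise ValueError("unknown sound source ID")
    return [1.0 if i == source_id else 0.0 for i in range(num_sources)]

def decoder_input(source_id, f0, mf, num_sources):
    """Build one frame of decoder input: [one-hot ID | F0 | intermediate feature]."""
    return one_hot(source_id, num_sources) + [f0] + list(mf)
```

For example, `decoder_input(1, 220.0, [0.5, -0.5], 3)` yields a six-element frame whose first three entries select the second sound source.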

（５）音声合成方法
次に、本実施の形態に係る音声合成器１による、指定された音源ＩＤの音質の音声を合成する方法について説明する。図５は、本実施の形態に係る音声合成器１による音声合成方法を示すフローチャートである。図５で示される音声合成方法は、周波数分析のフレームに相当する時間ごとに、音声合成プログラムＰ１がＣＰＵ１１により実行されることにより実現される。説明の簡略化のため、ここでは合成用楽譜データＤ１＿Ｓからの基本周波数Ｆ０の生成と、合成用音響データＤ２＿Ｓからの基本周波数Ｆ０の生成とが、予め完了しているものとする。なお、それら基本周波数Ｆ０の生成を、図５の処理とパラレルに実行してもよい。 (5) Speech Synthesis Method Next, a method of synthesizing the sound quality of the designated sound source ID by the speech synthesizer 1 according to the present embodiment will be described. FIG. 5 is a flowchart showing a speech synthesis method by the speech synthesizer 1 according to the present embodiment. The speech synthesis method shown in FIG. 5 is realized by executing the speech synthesis program P1 by the CPU 11 every time corresponding to a frame of frequency analysis. For the sake of simplification of the description, it is assumed here that the generation of the fundamental frequency F0 from the musical score data D1_S for synthesis and the generation of the fundamental frequency F0 from the acoustic data D2_S for synthesis have been completed in advance. The generation of the fundamental frequency F0 may be executed in parallel with the process of FIG.

ステップＳ２０１において、変換部１１０としてのＣＰＵ１１が、ユーザインタフェースの時間軸上の当該フレームの時刻の前後に配置された合成用楽譜データＤ１＿Ｓを取得する。または、分析部１２０が、ユーザインタフェースの時間軸上の当該フレームの時刻の前後に配置された合成用音響データＤ２＿Ｓを取得する。図６は、音声合成プログラムＰ１が表示部１５に表示するユーザインタフェース２００を示す図である。本実施の形態においては、ユーザインタフェース２００として、例えば、時間軸と音高軸とを有するピアノロールが用いられる。図６に示すように、ユーザは、操作部１４を操作して、ピアノロールにおいて、所望の時刻および音高に対応する位置に、合成用楽譜データＤ１＿Ｓ（音符またはテキスト）および合成用音響データＤ２＿Ｓ（波形データ）を配置する。図の期間Ｔ１，Ｔ２およびＴ４においては、ユーザによって、合成用楽譜データＤ１＿Ｓが、ピアノロールに配置されている。期間Ｔ１において、ユーザは、音高を伴わないテキスト（曲中の語り）のみを配置している（ＴＴＳ機能）。期間Ｔ２およびＴ４において、ユーザは、音符（音高および発音期間）の時系列と、各音符で歌われる歌詞とを配置している（歌声合成機能）。図において、ブロック２０１は、音符の音高および発音期間を表している。また、ブロック２０１の下に、その音符で歌われる歌詞（音韻）が表示される。また、期間Ｔ３およびＴ５において、ユーザは、合成用音響データＤ２＿Ｓを、ピアノロールの所望の時刻位置に配置している（音質変換機能）。図において、波形２０２は、合成用音響データＤ２＿Ｓ（波形データ）の示す波形であり、音高軸方向の位置は任意である。或いは、波形２０２を、合成用音響データＤ２＿Ｓの基本周波数Ｆ０に対応する位置に自動配置してもよい。また、図では歌唱合成のために音符に加えて歌詞が配置されているが、楽器音合成では、歌詞やテキストの配置は必要ない。 In step S201, the CPU 11 as the conversion unit 110 acquires the score data D1_S for synthesis arranged before and after the time of the current frame on the time axis of the user interface. Alternatively, the analysis unit 120 acquires the acoustic data D2_S for synthesis arranged before and after the time of the current frame on the time axis of the user interface. FIG. 6 is a diagram showing a user interface 200 displayed on the display unit 15 by the speech synthesis program P1. In the present embodiment, for example, a piano roll having a time axis and a pitch axis is used as the user interface 200. As shown in FIG. 6, the user operates the operation unit 14 to place the score data D1_S for synthesis (notes or text) and the acoustic data D2_S for synthesis (waveform data) at positions on the piano roll corresponding to the desired time and pitch. In the periods T1, T2, and T4 in the figure, the user has placed the score data D1_S for synthesis on the piano roll. In the period T1, the user has placed only text without pitch (narration within the song) (TTS function). In the periods T2 and T4, the user has placed a time series of notes (pitch and sounding period) and the lyrics sung on each note (singing voice synthesis function). In the figure, block 201 represents the pitch and sounding period of a note. The lyrics (phonemes) sung on that note are displayed below the block 201. In the periods T3 and T5, the user has placed the acoustic data D2_S for synthesis at the desired time positions on the piano roll (sound quality conversion function). In the figure, the waveform 202 is the waveform indicated by the acoustic data D2_S for synthesis (waveform data), and its position in the pitch axis direction is arbitrary. Alternatively, the waveform 202 may be automatically placed at a position corresponding to the fundamental frequency F0 of the acoustic data D2_S for synthesis. Although lyrics are placed in addition to notes in the figure for singing synthesis, instrument sound synthesis does not require lyrics or text.

次に、ステップＳ２０２において、制御部１００であるＣＰＵ１１は、現時刻に取得したデータが合成用楽譜データＤ１＿Ｓであるか否かを判定する。取得したデータが合成用楽譜データＤ１＿Ｓ（音符）である場合、処理はステップＳ２０３に進む。ステップＳ２０３において、ＣＰＵ１１は、その合成用楽譜データＤ１＿Ｓから楽譜特徴データＳＦを生成し、楽譜エンコーダ１１１を用いて、その楽譜特徴データＳＦを処理して中間特徴データＭＦ１を生成する。楽譜特徴データＳＦは、歌唱合成なら音韻の特徴を示し、生成される歌唱の音質がその音韻に応じて制御される。また、楽器音合成なら、楽譜特徴データＳＦは音符の音高や強度を示し、生成される楽器音の音質がその音高や強度に応じて制御される。 Next, in step S202, the CPU 11 which is the control unit 100 determines whether or not the data acquired at the current time is the musical score data D1_S for synthesis. If the acquired data is the score data D1_S (musical notes) for composition, the process proceeds to step S203. In step S203, the CPU 11 generates the score feature data SF from the score data D1_S for synthesis, and processes the score feature data SF using the score encoder 111 to generate the intermediate feature data MF1. The musical score feature data SF shows the characteristics of the phonology in the case of singing synthesis, and the sound quality of the generated singing is controlled according to the phonology. Further, in the case of musical instrument sound synthesis, the musical score feature data SF indicates the pitch and intensity of the note, and the sound quality of the generated musical instrument sound is controlled according to the pitch and intensity.

次に、ステップＳ２０４において、制御部１００としてのＣＰＵ１１は、現時刻に取得したデータが合成用音響データＤ２＿Ｓであるか否かを判定する。取得したデータが合成用音響データＤ２＿Ｓ（波形データ）である場合、処理はステップＳ２０５に進む。ステップＳ２０５において、ＣＰＵ１１は、その合成用音響データＤ２＿Ｓから音響特徴量ＡＦ（周波数スペクトル）を生成し、音響エンコーダ１２１を用いて、その音響特徴量ＡＦを処理して中間特徴データＭＦ２を生成する。 Next, in step S204, the CPU 11 as the control unit 100 determines whether the data acquired at the current time is the acoustic data D2_S for synthesis. If the acquired data is the acoustic data D2_S for synthesis (waveform data), the process proceeds to step S205. In step S205, the CPU 11 generates an acoustic feature amount AF (frequency spectrum) from the acoustic data D2_S for synthesis and processes the acoustic feature amount AF using the acoustic encoder 121 to generate the intermediate feature data MF2.

ステップＳ２０３またはステップＳ２０５を実行した後、処理はステップＳ２０６に進む。ステップＳ２０６において、ＣＰＵ１１は、音響デコーダ１３３を用いて、その時点で指定されている音源ＩＤと、その時点の基本周波数Ｆ０と、その時点で生成された中間特徴データＭＦ１または中間特徴データＭＦ２とを処理して音響特徴データＡＦＳを生成する。基本訓練で生成される２つの中間特徴データが相互に近づくよう訓練されるので、音響特徴データＡＦから生成される中間特徴データＭＦ２は、楽譜特徴データから生成される中間特徴データＭＦ１と同様に、対応する楽譜の特徴を反映する。本実施の形態においては、音響デコーダ１３３は、順次生成される中間特徴データＭＦ１および中間特徴データＭＦ２を時間軸上で結合した上でデコード処理を実行し、音響特徴データＡＦＳを生成する。 After executing step S203 or step S205, the process proceeds to step S206. In step S206, the CPU 11 uses the acoustic decoder 133 to obtain the sound source ID specified at that time, the fundamental frequency F0 at that time, and the intermediate feature data MF1 or the intermediate feature data MF2 generated at that time. It is processed to generate acoustic feature data AFS. Since the two intermediate feature data generated by the basic training are trained to approach each other, the intermediate feature data MF2 generated from the acoustic feature data AF is similar to the intermediate feature data MF1 generated from the musical score feature data. Reflects the characteristics of the corresponding score. In the present embodiment, the acoustic decoder 133 combines the sequentially generated intermediate feature data MF1 and intermediate feature data MF2 on the time axis, executes decoding processing, and generates acoustic feature data AFS.
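The branching of steps S202 to S206 can be sketched as a small dispatch function: each timeline item goes through the encoder that matches its type, and both paths share one decoder. The `(kind, features)` item representation, the function names, and the toy encoders and decoder below are hypothetical stand-ins that exist only to exercise the control flow; they are not the trained models of the description.

```python
def synthesize_frame(item, source_id, f0, score_encoder, acoustic_encoder, decoder):
    """Route one timeline item through the matching encoder, then the shared decoder.

    item is a hypothetical (kind, features) pair; kind is "score" or "audio".
    """
    kind, features = item
    if kind == "score":      # S202 -> S203: score feature data SF -> MF1
        mf = score_encoder(features)
    elif kind == "audio":    # S204 -> S205: acoustic feature data AF -> MF2
        mf = acoustic_encoder(features)
    else:
        raise ValueError(f"unsupported item kind: {kind}")
    return decoder(source_id, f0, mf)  # S206: acoustic feature data AFS

# Toy stand-ins (assumptions) so the control flow can be run.
def toy_score_encoder(sf):
    return [x * 2.0 for x in sf]

def toy_acoustic_encoder(af):
    return [x + 1.0 for x in af]

def toy_decoder(source_id, f0, mf):
    return [source_id + f0 + x for x in mf]
```

Because both encoders are trained to produce nearby intermediate features, the decoder does not need to know which path produced its input.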

次に、ステップＳ２０７において、ボコーダ１３４としてのＣＰＵ１１が、周波数スペクトルを示す音響特徴データＡＦＳに基づいて、基本的に音源ＩＤが示す音質で、さらに、その音質が音韻や音高に応じて変化する波形データである合成音響データＤ３を生成する。中間特徴データＭＦ１および中間特徴データＭＦ２が時間軸上で結合された上で音響特徴データＡＦＳが生成されているため、曲中のつなぎが自然な合成音響データＤ３のコンテンツが生成される。図７は、音声合成処理結果を表示するユーザインタフェース２００を示す図である。図７において、期間Ｔ１～Ｔ５の全体において、生成された基本周波数（Ｆ０）２１１が表示されている。期間Ｔ１においては、合成音響データＤ３の波形２１２が基本周波数に重ねて表示されている。期間Ｔ３，Ｔ５においては、合成音響データＤ３の波形２１３が基本周波数に重ねて表示されている。 Next, in step S207, the CPU 11 as the vocoder 134 generates, based on the acoustic feature data AFS indicating the frequency spectrum, the synthetic acoustic data D3, which is waveform data that basically has the sound quality indicated by the sound source ID and whose sound quality further changes according to the phoneme and pitch. Because the acoustic feature data AFS is generated after the intermediate feature data MF1 and the intermediate feature data MF2 are joined on the time axis, the content of the synthetic acoustic data D3 has natural transitions within the song. FIG. 7 is a diagram showing the user interface 200 displaying the result of the speech synthesis process. In FIG. 7, the generated fundamental frequency (F0) 211 is displayed over the whole of the periods T1 to T5. In the period T1, the waveform 212 of the synthetic acoustic data D3 is displayed superimposed on the fundamental frequency. In the periods T3 and T5, the waveform 213 of the synthetic acoustic data D3 is displayed superimposed on the fundamental frequency.

（６）音響デコーダ訓練方法
図８は、本実施の形態に係る音声合成器１の補助訓練方法を示すフローチャートである。補助訓練では、音声合成器１が備える音響デコーダ１３３が訓練される。図８で示される補助訓練方法は、訓練プログラムＰ２が実行されることにより実現される。図８の補助訓練方法を実行する前に、教師データとして、所定の音源ＩＤで特定される新たな音質の補助学習用音響データＤ２＿Ｔが準備され、記憶装置１６に記憶される。教師データとして準備される補助学習用音響データＤ２＿Ｔは、基本訓練された音響デコーダ１３３の音質を変更するために準備されたデータである。補助学習用音響データＤ２＿Ｔは、通常、基本訓練に用いた基本学習用音響データＤ２＿Ｒとは異なる音響データであるが、その音響データの音源ＩＤは基本学習用音響データＤ２＿Ｒと同じ、つまり同じ歌手や同じ楽器の音響データであってもよい。つまり、音響デコーダ１３３に、新たな歌手や楽器の音質を学習させることも、既に学習済の歌手や楽器の音質を改善させることもできる。 (6) Acoustic Decoder Training Method FIG. 8 is a flowchart showing an auxiliary training method for the speech synthesizer 1 according to the present embodiment. In the auxiliary training, the acoustic decoder 133 included in the speech synthesizer 1 is trained. The auxiliary training method shown in FIG. 8 is realized by executing the training program P2. Before executing the auxiliary training method of FIG. 8, auxiliary learning acoustic data D2_T having a new sound quality specified by a predetermined sound source ID is prepared as teacher data and stored in the storage device 16. The auxiliary learning acoustic data D2_T prepared as teacher data is data prepared for changing the sound quality of the basic trained acoustic decoder 133. The auxiliary learning acoustic data D2_T is usually acoustic data different from the basic learning acoustic data D2_R used for basic training, but the sound source ID of the acoustic data is the same as the basic learning acoustic data D2_R, that is, the same singer. It may be acoustic data of the same instrument. That is, the sound decoder 133 can be made to learn the sound quality of a new singer or musical instrument, or the sound quality of a singer or musical instrument that has already been learned can be improved.

まず、ステップＳ３０１において、分析部１２０であるＣＰＵ１１が、補助学習用音響データＤ２＿Ｔに基づいて基本周波数Ｆ０と音響特徴データＡＦとを生成する。本実施の形態においては、補助学習用音響データＤ２＿Ｔの周波数スペクトルを示す音響特徴データＡＦとして、例えば、メル周波数対数スペクトルが用いられる。この音響デコーダ訓練では、補助学習用音響データＤ２＿Ｔだけを用いて、別の音質（例えば、新たな歌手の歌声）を生成モデルに学習させる。したがって、音響デコーダ訓練において楽譜データＤ１は不要である。つまり、ＣＰＵ１１は、音素ラベルのない補助学習用音響データＤ２＿Ｔを用いて音響デコーダ１３３を訓練する。 First, in step S301, the CPU 11 which is the analysis unit 120 generates the fundamental frequency F0 and the acoustic feature data AF based on the auxiliary learning acoustic data D2_T. In the present embodiment, for example, a mel frequency logarithmic spectrum is used as the acoustic feature data AF showing the frequency spectrum of the auxiliary learning acoustic data D2_T. In this acoustic decoder training, another sound quality (for example, the singing voice of a new singer) is trained by the generative model using only the auxiliary learning acoustic data D2_T. Therefore, the score data D1 is unnecessary in the acoustic decoder training. That is, the CPU 11 trains the acoustic decoder 133 using the auxiliary learning acoustic data D2_T without a phoneme label.

次に、ステップＳ３０２において、ＣＰＵ１１は、音響エンコーダ１２１を用いて、音響特徴データＡＦを処理して、中間特徴データＭＦ２を生成する。続いて、ステップＳ３０３において、ＣＰＵ１１が、音響デコーダ１３３を用いて、補助学習用音響データＤ２＿Ｔの音源ＩＤと基本周波数Ｆ０と中間特徴データＭＦ２とを処理して、音響特徴データＡＦＳを生成する。続いて、ステップＳ３０４において、ＣＰＵ１１が、音響特徴データＡＦＳが補助学習用音響データＤ２＿Ｔから生成された音響特徴データＡＦに近づくように、音響デコーダ１３３を訓練する。つまり、楽譜エンコーダ１１１および音響エンコーダ１２１は訓練せず、音響デコーダ１３３のみを訓練する。このように、本実施の形態の補助訓練方法によれば、訓練に音素ラベルのない補助学習用音響データＤ２＿Ｔを使えるので、教師データを準備する手間とコストをかけずに音響デコーダ１３３を訓練できる。 Next, in step S302, the CPU 11 processes the acoustic feature data AF using the acoustic encoder 121 to generate the intermediate feature data MF2. Subsequently, in step S303, the CPU 11 uses the acoustic decoder 133 to process the sound source ID of the auxiliary learning acoustic data D2_T, the fundamental frequency F0, and the intermediate feature data MF2 to generate the acoustic feature data AFS. Subsequently, in step S304, the CPU 11 trains the acoustic decoder 133 so that the acoustic feature data AFS approaches the acoustic feature data AF generated from the auxiliary learning acoustic data D2_T. That is, the score encoder 111 and the acoustic encoder 121 are not trained; only the acoustic decoder 133 is trained. As described above, according to the auxiliary training method of the present embodiment, the auxiliary learning acoustic data D2_T without phoneme labels can be used for training, so the acoustic decoder 133 can be trained without the effort and cost of preparing teacher data.
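The decoder-only update of steps S302 to S304 can be illustrated with a deliberately tiny model: a scalar "encoder" whose weight is frozen and a scalar linear "decoder" whose weight and bias are fitted by gradient descent on the squared error. Everything in this sketch (scalar features, the linear model, the learning rate, the function name) is an assumption for illustration; the actual models in the description are generative networks.

```python
def auxiliary_train(af_frames, enc_weight, dec_weight, dec_bias, lr=0.05, epochs=200):
    """Fit only the decoder parameters; the encoder weight is read but never updated."""
    for _ in range(epochs):
        for af in af_frames:
            mf2 = enc_weight * af               # S302: frozen acoustic encoder
            afs = dec_weight * mf2 + dec_bias   # S303: decoder output
            err = afs - af                      # S304: error against the target AF
            dec_weight -= lr * 2.0 * err * mf2  # gradient of err**2 w.r.t. dec_weight
            dec_bias -= lr * 2.0 * err          # gradient of err**2 w.r.t. dec_bias
    return enc_weight, dec_weight, dec_bias
```

Since the target is the same acoustic feature that was fed to the encoder, no score data or phoneme labels appear anywhere in the loop, which mirrors the point made in the paragraph above.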

図９は、音響デコーダの訓練方法に係るユーザインタフェース２００を示す図である。ユーザの録音指示に応じて、ＣＰＵ１１は、例えば１曲分の歌手の歌声や楽器の演奏音を新たに録音し音源ＩＤを付与する。その音源が学習済であれば、それと同じ音源ＩＤを付与し、未学習であれば新たな音源ＩＤを付与する。録音された１トラック分の波形データが補助学習用音響データＤ２＿Ｔである。この録音は、伴奏トラックを再生しながら行われても良い。図９において波形２２１は、補助学習用音響データＤ２＿Ｔの示す波形である。音響デコーダの補助訓練後であれば、ユーザが歌唱した音声や演奏した楽器音を、音声合成器１に接続されたマイクを介して直接取り込んでリアルタイムに音質変換処理してもよい。ＣＰＵ１１が、その補助学習用音響データＤ２＿Ｔを用いて図８の補助訓練処理を行うことで、音響デコーダ１３３は、新たな歌声や楽器音の性質を例えば１曲分学習し、その声質の歌声や楽器音を合成可能となる。図９は、さらに、ユーザの音符配置指示に応じて、ＣＰＵ１１が、録音された波形データの時間軸上の期間Ｔ１２に３つの音符（合成用楽譜データＤ１＿Ｓ）を配置した様子を示す。図では歌唱合成のために各音符の歌詞が入力されているが、楽器音合成であれば、歌詞は不要である。ＣＰＵ１１は、期間Ｔ１２について、補助訓練された音声合成器１を用いて、その合成用楽譜データＤ１＿Ｓを処理し、補助学習用音響データＤ２＿Ｔの音源ＩＤの示す音質の音声合成を行う。ＣＰＵ１１は、期間Ｔ１２は、音源ＩＤの示す音質で音声合成された合成音響データＤ３であり、区間Ｔ１１は、補助学習用音響データＤ２＿Ｔであるコンテンツを生成する。或いは、期間Ｔ１２は、音源ＩＤの示す音質で音声合成された合成音響データＤ３であり、区間Ｔ１１は、補助学習用音響データＤ２＿Ｔを入力として音声合成器１により合成されたその音源ＩＤの音質の合成音響データＤ３であるコンテンツを生成してもよい。 FIG. 9 is a diagram showing the user interface 200 for the acoustic decoder training method. In response to a recording instruction from the user, the CPU 11 newly records, for example, one song's worth of a singer's singing voice or an instrument's performance sound and assigns a sound source ID to it. If the sound source has already been learned, the same sound source ID is assigned; if it has not, a new sound source ID is assigned. The recorded waveform data for one track is the auxiliary learning acoustic data D2_T. This recording may be performed while an accompaniment track is played. In FIG. 9, the waveform 221 is the waveform indicated by the auxiliary learning acoustic data D2_T. After the auxiliary training of the acoustic decoder, the voice sung or the instrument sound played by the user may be captured directly through a microphone connected to the speech synthesizer 1 and subjected to sound quality conversion in real time. When the CPU 11 performs the auxiliary training process of FIG. 8 using the auxiliary learning acoustic data D2_T, the acoustic decoder 133 learns the properties of the new singing voice or instrument sound from, for example, one song, and becomes able to synthesize singing voices or instrument sounds of that sound quality. FIG. 9 further shows a state in which, in response to a note placement instruction from the user, the CPU 11 has placed three notes (score data D1_S for synthesis) in the period T12 on the time axis of the recorded waveform data. Although the lyrics of each note are entered in the figure for singing synthesis, lyrics are unnecessary for instrument sound synthesis. For the period T12, the CPU 11 processes the score data D1_S for synthesis using the auxiliary-trained speech synthesizer 1 and synthesizes voice of the sound quality indicated by the sound source ID of the auxiliary learning acoustic data D2_T. The CPU 11 generates content in which the period T12 is the synthetic acoustic data D3 synthesized with the sound quality indicated by the sound source ID and the section T11 is the auxiliary learning acoustic data D2_T. Alternatively, the CPU 11 may generate content in which the period T12 is the synthetic acoustic data D3 synthesized with the sound quality indicated by the sound source ID and the section T11 is synthetic acoustic data D3 of the sound quality of that sound source ID, synthesized by the speech synthesizer 1 with the auxiliary learning acoustic data D2_T as input.
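The ID assignment rule for a new recording (reuse the ID of an already learned sound source, otherwise allocate a new one) amounts to a lookup with a fallback. The sketch below assumes, purely for illustration, that sound sources are keyed by a name and that IDs are consecutive integers held in a dictionary; the description does not say how sources are identified or how IDs are numbered.

```python
def assign_source_id(source_name, registry):
    """Return the existing ID for a learned source, or register a new one.

    registry is a hypothetical dict mapping a source name to its integer ID.
    """
    if source_name in registry:
        return registry[source_name]   # already learned: reuse the same ID
    new_id = max(registry.values(), default=-1) + 1
    registry[source_name] = new_id     # not yet learned: allocate a new ID
    return new_id
```

Reusing an ID corresponds to refining a sound quality the decoder already knows, while a new ID corresponds to teaching it a new singer or instrument.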

（７）他の実施の形態
上述した本実施の形態の音声合成器１を用いることで、合成用楽譜データＤ１＿Ｓに基づいて音声合成された曲中に、ユーザの歌声や演奏した楽器音を挿入することも可能である。図１０は、音声合成器１において音声合成された曲を再生するユーザインタフェース２００を示している。期間Ｔ２１およびＴ２３は、ユーザにより合成用楽譜データＤ１＿Ｓが配置されており、ＣＰＵ１１によって、ユーザの指定した音源ＩＤの示す音質で歌唱合成が実行される。図１０に示すユーザインタフェース２００を表示させた状態で、ユーザがオーバーダビングの開始を指示すると、ＣＰＵ１１は、音声合成プログラムＰ１を実行して、その音源ＩＤの示す音質の合成音響データＤ３の再生を行う。このとき、ユーザインタフェース２００において現在時刻位置がタイムバー２１４によって示される。ユーザは、タイムバー２１４の位置を見ながら歌唱を行う。ユーザが歌唱した音声は、音声合成器１に接続されたマイクを介して収音され、合成用音響データＤ２＿Ｓとして記録される。図において、波形２０２は、合成用音響データＤ２＿Ｓの波形を示す。ＣＰＵ１１は、音響エンコーダ１２１および音響デコーダ１３３を用いて、合成用音響データＤ２＿Ｓを処理し、その音源ＩＤが示す音質の合成音響データＤ３を生成する。図１１は、合成音響データＤ３の波形２１５が結合されたユーザインタフェース２００を示す。ＣＰＵ１１は、期間Ｔ２１およびＴ２３は、合成用楽譜データＤ１＿Ｓから歌唱合成された音源ＩＤの示す音質の合成音響データＤ３であり、期間Ｔ２２は、ユーザ歌唱から歌唱合成されたその音源ＩＤの示す音質の合成音響データＤ３であるコンテンツを生成する。 (7) Other Embodiments By using the speech synthesizer 1 of the present embodiment described above, it is also possible to insert the user's singing voice or played instrument sound into a song synthesized from the score data D1_S for synthesis. FIG. 10 shows the user interface 200 for playing back a song synthesized by the speech synthesizer 1. In the periods T21 and T23, the user has placed the score data D1_S for synthesis, and the CPU 11 executes singing synthesis with the sound quality indicated by the sound source ID specified by the user. When the user instructs the start of overdubbing while the user interface 200 shown in FIG. 10 is displayed, the CPU 11 executes the speech synthesis program P1 and plays back the synthetic acoustic data D3 of the sound quality indicated by the sound source ID. At this time, the current time position is indicated by the time bar 214 in the user interface 200. The user sings while watching the position of the time bar 214. The voice sung by the user is picked up through a microphone connected to the speech synthesizer 1 and recorded as the acoustic data D2_S for synthesis. In the figure, the waveform 202 shows the waveform of the acoustic data D2_S for synthesis. The CPU 11 processes the acoustic data D2_S for synthesis using the acoustic encoder 121 and the acoustic decoder 133 and generates the synthetic acoustic data D3 of the sound quality indicated by the sound source ID. FIG. 11 shows the user interface 200 in which the waveform 215 of the synthetic acoustic data D3 has been joined. The CPU 11 generates content in which the periods T21 and T23 are the synthetic acoustic data D3 of the sound quality indicated by the sound source ID, synthesized from the score data D1_S for synthesis, and the period T22 is the synthetic acoustic data D3 of the sound quality indicated by that sound source ID, synthesized from the user's singing.

上述した実施の形態においては、音声合成器１が音源ＩＤで特定される歌手の歌声を合成する場合を例に説明した。本実施の形態の音声合成器１は、特定の歌手の歌声を合成する以外にも、様々な音質の音声を合成する用途に利用可能である。例えば、音声合成器１は、音源ＩＤで特定される楽器の演奏音を合成する用途に利用可能である。 In the above-described embodiment, the case where the voice synthesizer 1 synthesizes the singing voice of the singer specified by the sound source ID has been described as an example. The voice synthesizer 1 of the present embodiment can be used for synthesizing voices having various sound qualities in addition to synthesizing the singing voice of a specific singer. For example, the voice synthesizer 1 can be used for synthesizing the performance sound of a musical instrument specified by a sound source ID.

上述した実施の形態においては、合成用楽譜データＤ１＿Ｓに基づいて生成された中間特徴データＭＦ１と、合成用音響データＤ２＿Ｓに基づいて生成された中間特徴データＭＦ２とを時間軸上で結合した上で、音響特徴データＡＦＳを生成した。別の実施の形態として、中間特徴データＭＦ１に基づいて生成される音響特徴データＡＦＳと、中間特徴データＭＦ２に基づいて生成される音響特徴データＡＦＳとを結合した上で、合成音響データＤ３を生成してもよい。あるいは、別の実施の形態として、中間特徴データＭＦ１に基づいて生成される音響特徴データＡＦＳから合成音響データＤ３を生成し、中間特徴データＭＦ２に基づいて生成される音響特徴データＡＦＳから合成音響データＤ３を生成し、これら２つの合成音響データＤ３を結合してもよい。 In the above-described embodiment, the intermediate feature data MF1 generated based on the synthetic score data D1_S and the intermediate feature data MF2 generated based on the synthetic acoustic data D2_S are combined on the time axis. , Acoustic feature data AFS was generated. As another embodiment, the acoustic feature data AFS generated based on the intermediate feature data MF1 and the acoustic feature data AFS generated based on the intermediate feature data MF2 are combined to generate the synthetic acoustic data D3. You may. Alternatively, as another embodiment, the synthetic acoustic data D3 is generated from the acoustic feature data AFS generated based on the intermediate feature data MF1, and the synthetic acoustic data is generated from the acoustic feature data AFS generated based on the intermediate feature data MF2. D3 may be generated and these two synthetic acoustic data D3 may be combined.
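The difference between joining before decoding and joining after decoding can be seen with a toy decoder that looks at the previous frame. In the sketch below, the smoothing decoder, the scalar frames, and the function names are all assumptions chosen only to make the boundary effect visible; the real acoustic decoder 133 is a trained generative model.

```python
def join_then_decode(segments, decode):
    """Main embodiment: concatenate intermediate features on the time axis, decode once."""
    joined = [frame for seg in segments for frame in seg]
    return decode(joined)

def decode_then_join(segments, decode):
    """Alternative: decode each segment separately, then concatenate the outputs."""
    return [frame for seg in segments for frame in decode(seg)]

def toy_context_decoder(frames):
    """Toy decoder: average each frame with its predecessor (first frame unchanged)."""
    return [frames[i] if i == 0 else (frames[i] + frames[i - 1]) / 2.0
            for i in range(len(frames))]
```

With segments `[[0.0, 0.0], [4.0, 4.0]]`, decoding after joining gives `[0.0, 0.0, 2.0, 4.0]`, while decoding each segment first gives `[0.0, 0.0, 4.0, 4.0]`: only the former lets the decoder see across the segment boundary. The third variant in the paragraph above, joining the waveforms after the vocoder, would follow the same pattern as `decode_then_join` one stage later.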

本実施の形態の音声合成器１は、音素ラベルなしの合成用音響データＤ２＿Ｓを利用してある音源ＩＤで特定される歌手の歌声を合成することができる。これにより、音声合成器１を、クロス言語合成器として利用することが可能である。つまり、音響デコーダ１３３が、当該音源ＩＤについて日本語の音響データでのみ訓練されている場合であっても、別の音源ＩＤで英語の音響データで訓練されていれば、英語の歌詞の合成用音響データＤ２＿Ｓを与えることによって、当該音源ＩＤの音質での英語言語による歌唱を生成することが可能である。 The speech synthesizer 1 of the present embodiment can synthesize the singing voice of a singer specified by a given sound source ID by using the acoustic data D2_S for synthesis without phoneme labels. This makes it possible to use the speech synthesizer 1 as a cross-language synthesizer. That is, even if the acoustic decoder 133 has been trained only on Japanese acoustic data for that sound source ID, as long as it has been trained on English acoustic data for another sound source ID, supplying acoustic data D2_S for synthesis with English lyrics makes it possible to generate singing in English with the sound quality of that sound source ID.

上記の実施の形態においては音声合成プログラムＰ１および訓練プログラムＰ２は、記憶装置１６に記憶されている場合を例に説明した。音声合成プログラムＰ１および訓練プログラムＰ２は、コンピュータが読み取り可能な記録媒体ＲＭに格納された形態で提供され、記憶装置１６またはＲＯＭ１３にインストールされてもよい。また、音声合成器１が通信インタフェース１９を介してネットワークに接続されている場合、ネットワークに接続されたサーバから配信された音声合成プログラムＰ１または訓練プログラムＰ２が記憶装置１６またはＲＯＭ１３にインストールされてもよい。あるいは、ＣＰＵ１１が記憶媒体ＲＭにデバイスインタフェース１８を介してアクセスし、記憶媒体ＲＭＦに記憶されている音声合成プログラムＰ１または訓練プログラムＰ２を実行してもよい。 In the above embodiment, the case where the speech synthesis program P1 and the training program P2 are stored in the storage device 16 has been described as an example. The speech synthesis program P1 and the training program P2 may be provided in a form stored in a computer-readable recording medium RM and installed in the storage device 16 or the ROM 13. When the speech synthesizer 1 is connected to a network via the communication interface 19, the speech synthesis program P1 or the training program P2 distributed from a server connected to the network may be installed in the storage device 16 or the ROM 13. Alternatively, the CPU 11 may access the recording medium RM via the device interface 18 and execute the speech synthesis program P1 or the training program P2 stored in the recording medium RM.

（８）実施の形態の効果
以上説明したように、本実施の形態に係る音声合成方法は、コンピュータにより実現される音声合成方法であって、楽譜データＤ１の楽譜特徴量から第１中間特徴量（中間特徴データＭＦ１）を生成する楽譜エンコーダ１１１、音響データＤ２の音響特徴量から第２中間特徴量（中間特徴データＭＦ２）を生成する音響エンコーダ１２１、および、第１中間特徴量（中間特徴データＭＦ１）または第２中間特徴量（中間特徴データＭＦ２）に基づいて音響特徴量（音響特徴データＡＦＳ）を生成する音響デコーダ１３３を準備し、補助学習用音響データＤ２＿Ｔを受け取り、音響エンコーダ１２１を用いて補助学習用音響データＤ２＿Ｔの音響特徴量から生成される第２中間特徴量（中間特徴データＭＦ２）と、補助学習用音響データＤ２＿Ｔの音響特徴量とを用いて、前記補助学習用音響データＤ２＿Ｔの音響特徴量に近い音響特徴量（音響特徴データＡＦＳ）を生成するよう、音響デコーダ１３３を補助訓練し、ユーザインタフェース２００を介して、補助学習用音響データＤ２＿Ｔの時間軸上に配置された楽譜データＤ１を受け取り、楽譜エンコーダ１１１を用いて配置された楽譜データＤ１から生成される第１中間特徴量（中間特徴データＭＦ１）を、補助訓練済みの音響デコーダ１３３で処理することにより、音響特徴量（音響特徴データＡＦＳ）を生成する。これにより、録音された特定の音質の音響データに対して、同じ音質の音響データを追加することや、その音質を保ったまま音響データを部分的に修正することが容易に行える。 (8) Effect of the Embodiment As described above, the speech synthesis method according to the present embodiment is a computer-implemented speech synthesis method that includes: preparing a score encoder 111 that generates a first intermediate feature amount (intermediate feature data MF1) from the score feature amount of score data D1, an acoustic encoder 121 that generates a second intermediate feature amount (intermediate feature data MF2) from the acoustic feature amount of acoustic data D2, and an acoustic decoder 133 that generates an acoustic feature amount (acoustic feature data AFS) based on the first intermediate feature amount (intermediate feature data MF1) or the second intermediate feature amount (intermediate feature data MF2); receiving auxiliary learning acoustic data D2_T; auxiliary-training the acoustic decoder 133, using the second intermediate feature amount (intermediate feature data MF2) generated from the acoustic feature amount of the auxiliary learning acoustic data D2_T by the acoustic encoder 121 and the acoustic feature amount of the auxiliary learning acoustic data D2_T, so that it generates an acoustic feature amount (acoustic feature data AFS) close to the acoustic feature amount of the auxiliary learning acoustic data D2_T; receiving, via the user interface 200, score data D1 placed on the time axis of the auxiliary learning acoustic data D2_T; and generating an acoustic feature amount (acoustic feature data AFS) by processing, with the auxiliary-trained acoustic decoder 133, the first intermediate feature amount (intermediate feature data MF1) generated from the placed score data D1 by the score encoder 111. This makes it easy to add acoustic data of the same sound quality to recorded acoustic data of a specific sound quality, and to partially modify the acoustic data while maintaining that sound quality.

The preparing may include training the score encoder 111, the acoustic encoder 121, and the acoustic decoder 133 so that the first intermediate feature amount (intermediate feature data MF1) generated by the score encoder 111 based on basic learning score data D1_R and the second intermediate feature amount (intermediate feature data MF2) generated by the acoustic encoder 121 based on basic learning acoustic data D2_R approach each other, and so that the acoustic feature amount (acoustic feature data AFS) generated by the acoustic decoder 133 approaches the acoustic feature amount acquired from the basic learning acoustic data D2_R. As a result, the acoustic decoder 133 can generate acoustic feature data AFS from either the intermediate feature data MF1 generated based on the score data D1 or the intermediate feature data MF2 generated based on the acoustic data D2.
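
The two training goals named in this paragraph can be written as one combined objective. The following is a sketch under stated assumptions, not the loss used in the embodiment: the squared-error form, the weight `w_match`, the choice to decode both MF1 and MF2, and every function name are hypothetical, and the encoders and decoder are passed in as plain callables.

```python
import numpy as np


def basic_training_loss(sf, af, score_enc, acoustic_enc, decoder, w_match=1.0):
    """Combined objective for basic training (form and weight are assumed)."""
    mf1 = score_enc(sf)      # first intermediate feature amount (MF1)
    mf2 = acoustic_enc(af)   # second intermediate feature amount (MF2)
    # Term 1: MF1 and MF2 should approach each other.
    match = np.mean((mf1 - mf2) ** 2)
    # Term 2: decoded features should approach the reference acoustic features.
    recon = np.mean((decoder(mf1) - af) ** 2) + np.mean((decoder(mf2) - af) ** 2)
    return float(w_match * match + recon)


def identity(x):
    """Toy encoder/decoder that returns its input unchanged."""
    return x


def shifted(x):
    """Toy score encoder whose output is offset from the acoustic encoder."""
    return x + 1.0


# Toy features: two frames, two dimensions, shared by score and audio.
af = np.array([[0.0, 1.0], [2.0, 3.0]])
sf = af.copy()

loss_aligned = basic_training_loss(sf, af, identity, identity, identity)
loss_shifted = basic_training_loss(sf, af, shifted, identity, identity)
print(loss_aligned, loss_shifted)
```

Because both encoders are pushed toward the same intermediate representation, a decoder trained on this objective can accept either MF1 or MF2, which is the property the paragraph relies on.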

The acoustic decoder 133 may be trained based on a first identifier that specifies a timbre. This makes it possible to generate synthesized sound whose sound quality corresponds to the identifier.

The acoustic decoder 133 may also be trained, based on a second identifier that specifies a timbre, with a timbre different from the one specified by the first identifier. This makes it possible to generate synthesized sound with a different sound quality for each identifier.
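
One common way to condition a decoder on a timbre identifier is to look up an embedding vector for the identifier and feed it to the decoder alongside the intermediate features. The sketch below assumes that approach; the text above does not state how the identifier enters the acoustic decoder 133, and the embedding table, weight matrix, dimensions, and function name are all hypothetical.

```python
import numpy as np

MF_DIM, EMB_DIM, AF_DIM = 3, 2, 4

# Hypothetical embedding table: row index = timbre identifier.
TIMBRE_EMBEDDINGS = np.array([[1.0, 0.0],   # first identifier
                              [0.0, 1.0]])  # second identifier

# Fixed toy decoder weights so the example is deterministic.
W_DEC = np.arange((MF_DIM + EMB_DIM) * AF_DIM, dtype=float).reshape(
    MF_DIM + EMB_DIM, AF_DIM) / 10.0


def conditioned_decoder(mf, timbre_id):
    """Decode intermediate features with the timbre embedding appended."""
    emb = np.tile(TIMBRE_EMBEDDINGS[timbre_id], (mf.shape[0], 1))
    return np.concatenate([mf, emb], axis=1) @ W_DEC


mf = np.ones((5, MF_DIM))  # same intermediate features for both calls
afs_first = conditioned_decoder(mf, 0)
afs_second = conditioned_decoder(mf, 1)
print(afs_second - afs_first)
```

The same intermediate features yield different acoustic features depending only on the identifier, which corresponds to selecting the sound quality by identifier.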

The speech synthesis program according to the present embodiment is a program that causes a computer to execute a speech synthesis method. Based on the program, the computer prepares a score encoder 111 that generates a first intermediate feature amount (intermediate feature data MF1) from the score feature amount of score data D1, an acoustic encoder 121 that generates a second intermediate feature amount (intermediate feature data MF2) from the acoustic feature amount of acoustic data D2, and an acoustic decoder 133 that generates an acoustic feature amount (acoustic feature data AFS) based on the first intermediate feature amount (intermediate feature data MF1) or the second intermediate feature amount (intermediate feature data MF2). The computer then receives auxiliary learning acoustic data D2_T and performs auxiliary training of the acoustic decoder 133, using the second intermediate feature amount (intermediate feature data MF2) that the acoustic encoder 121 generates from the acoustic feature amount of the auxiliary learning acoustic data D2_T together with that acoustic feature amount itself, so that the acoustic decoder 133 generates an acoustic feature amount (acoustic feature data AFS) close to the acoustic feature amount of the auxiliary learning acoustic data D2_T. Finally, the computer receives, via the user interface 200, score data D1 arranged on the time axis of the auxiliary learning acoustic data D2_T, and generates an acoustic feature amount (acoustic feature data AFS) by processing, with the auxiliary-trained acoustic decoder 133, the first intermediate feature amount (intermediate feature data MF1) that the score encoder 111 generates from the arranged score data D1. This makes it easy to add acoustic data of the same sound quality to recorded acoustic data of a specific sound quality, and to partially modify the acoustic data while maintaining that sound quality.

100 ... Control unit, 110 ... Conversion unit, 111 ... Score encoder, 120 ... Analysis unit, 121 ... Acoustic encoder, 131 ... Switching unit, 133 ... Acoustic decoder, 134 ... Vocoder, D1 ... Score data, D2 ... Acoustic data, D3 ... Synthetic acoustic data, SF ... Score feature data, AF ... Acoustic feature data, MF1, MF2 ... Intermediate feature data, AFS ... Acoustic feature data

Claims

A speech synthesis method realized by a computer, the method comprising:
preparing a score encoder that generates a first intermediate feature amount from a score feature amount of score data, an acoustic encoder that generates a second intermediate feature amount from an acoustic feature amount of acoustic data, and an acoustic decoder that generates an acoustic feature amount based on the first intermediate feature amount or the second intermediate feature amount;
receiving auxiliary learning acoustic data;
performing auxiliary training of the acoustic decoder, using the second intermediate feature amount generated by the acoustic encoder from an acoustic feature amount of the auxiliary learning acoustic data and the acoustic feature amount of the auxiliary learning acoustic data, so that the acoustic decoder generates an acoustic feature amount close to the acoustic feature amount of the auxiliary learning acoustic data;
receiving, via a user interface, score data arranged on a time axis of the auxiliary learning acoustic data; and
generating an acoustic feature amount by processing, with the auxiliary-trained acoustic decoder, the first intermediate feature amount generated by the score encoder from the arranged score data.

The speech synthesis method according to claim 1, wherein the preparing includes
training the score encoder, the acoustic encoder, and the acoustic decoder so that the first intermediate feature amount generated by the score encoder based on basic learning score data and the second intermediate feature amount generated by the acoustic encoder based on basic learning acoustic data approach each other, and so that the acoustic feature amount generated by the acoustic decoder approaches an acoustic feature amount acquired from the basic learning acoustic data.

The speech synthesis method according to claim 1 or 2, wherein the acoustic decoder is trained based on a first identifier that specifies a timbre.

The speech synthesis method according to claim 3, wherein the acoustic decoder is trained, based on a second identifier that specifies a timbre, with a timbre different from the timbre specified by the first identifier.

A program that causes a computer to execute a speech synthesis method, wherein, based on the program, the computer:
prepares a score encoder that generates a first intermediate feature amount from a score feature amount of score data, an acoustic encoder that generates a second intermediate feature amount from an acoustic feature amount of acoustic data, and an acoustic decoder that generates an acoustic feature amount based on the first intermediate feature amount or the second intermediate feature amount;
receives auxiliary learning acoustic data;
performs auxiliary training of the acoustic decoder, using the second intermediate feature amount generated by the acoustic encoder from an acoustic feature amount of the auxiliary learning acoustic data and the acoustic feature amount of the auxiliary learning acoustic data, so that the acoustic decoder generates an acoustic feature amount close to the acoustic feature amount of the auxiliary learning acoustic data;
receives, via a user interface, score data arranged on a time axis of the auxiliary learning acoustic data; and
generates an acoustic feature amount by processing, with the auxiliary-trained acoustic decoder, the first intermediate feature amount generated by the score encoder from the arranged score data.