JP2020204755A - Speech processing device and speech processing method

Info

Publication number: JP2020204755A
Application number: JP2019141982A
Authority: JP
Inventors: Keiichi Tokuda (徳田恵一); Keiichiro Oura (大浦圭一郎); Kazuhiro Nakamura (中村和寛); Kei Hashimoto (橋本佳); Yoshihiko Nankaku (南角吉彦)
Original assignee: Nagoya Institute of Technology NUC; Techno Speech Inc
Current assignee: Nagoya Institute of Technology NUC; Techno Speech Inc
Priority date: 2019-08-01
Filing date: 2019-08-01
Publication date: 2020-12-24

Abstract

To provide techniques allowing production of a speech waveform with a desired pitch and high quality. SOLUTION: A speech processing device comprises: an acquisition unit to acquire an acoustic feature quantity for generating a speech waveform; and a generation unit to generate the speech waveform by inputting, into a neural network, the acoustic feature quantity and a periodic waveform signal that corresponds to a fundamental frequency of the speech waveform, and performing transform processing using information output by the neural network. The neural network outputs first information for generating an aperiodic component, and second information indicating a periodic component. The transform processing adds the second information to information obtained by arithmetic processing using the first information and an aperiodic waveform signal. SELECTED DRAWING: Figure 1

Description

The present invention relates to a speech processing device and a speech processing method.

Speech processing devices that generate speech waveforms using a neural network have become known in recent years. The techniques described in Non-Patent Documents 1 and 2 generate speech waveforms with a neural network that uses convolution. The technique described in Non-Patent Document 3 applies the techniques of Non-Patent Documents 1 and 2 to generate a speech waveform from acoustic features.

Non-Patent Document 1: A. van den Oord et al., "Wavenet: A Generative Model for Raw Audio", arXiv preprint arXiv:1609.03499, 2016
Non-Patent Document 2: A. van den Oord et al., "Parallel WaveNet: Fast High-Fidelity Speech Synthesis", arXiv preprint arXiv:1711.10433, 2017
Non-Patent Document 3: Akira Tamamori et al., "Speaker-dependent Wavenet vocoder", In: INTERSPEECH, pp. 1118-1122, Aug. 2017
Non-Patent Document 4: Takuhiro Kaneko et al., "CycleGAN-VC: Non-parallel Voice Conversion Using Cycle-Consistent Adversarial Networks", 5th EURASIP Conference on, 2016, pp. 2114-2118

The technique described in Non-Patent Document 3 generates a speech waveform using acoustic features, such as spectral information and fundamental frequency information, as auxiliary information. In the field of speech processing with such neural networks, there is demand for techniques that can generate high-quality speech waveforms and for techniques that can generate speech waveforms with a desired pitch.

The present invention has been made to solve the above-mentioned problem and can be realized in the following forms.

(1) According to one form of the present invention, a speech processing device is provided. The speech processing device includes an acquisition unit that acquires acoustic features for generating a speech waveform, and a generation unit that inputs into a neural network a periodic waveform signal corresponding to the fundamental frequency of the speech waveform together with the acoustic features, and generates the speech waveform by performing transform processing using the information output by the neural network. The neural network outputs first information for generating an aperiodic component and second information indicating a periodic component. The transform processing adds the second information to information obtained by arithmetic processing using the first information and an aperiodic waveform signal. Because the speech waveform is generated by adding the second information, which indicates the periodic component, to the information obtained by arithmetic processing using the first information and the aperiodic waveform signal, a speech processing device of this form can generate a high-quality speech waveform with a desired pitch.
(2) In the speech processing device of the above form, the first information may be information indicating the strength of the aperiodic component for each predetermined frequency band, and the transform processing may add the second information to information obtained by multiplying the aperiodic waveform signal of each frequency band by the corresponding first information. Because the speech waveform is generated by adding the second information to the per-band aperiodic waveform signals scaled by the corresponding first information, a speech processing device of this form can generate a high-quality speech waveform with a desired pitch.
(3) In the speech processing device of the above form, the generation unit may further input into the neural network a signal indicating the degree of periodicity of the speech waveform to be generated. A speech processing device of this form can generate a high-quality speech waveform in accordance with information about the excitation source, such as silent portions or unvoiced-consonant portions of the speech waveform to be generated.
(4) In the speech processing device of the above form, the generation unit may input a plurality of the periodic waveform signals having different phases into the neural network. A speech processing device of this form can more effectively generate a speech waveform having a desired fundamental frequency.
(5) The speech processing device of the above form may further include a learning unit that learns, by machine learning, the relationship among the acoustic features, the periodic waveform signal, the first information, and the second information, and reflects it in the neural network. A speech processing device of this form can learn the relationship between acoustic features and speech waveforms and reflect the learning result in the generation unit.

The present invention can be realized in various aspects. For example, it can be realized as a speech processing system that uses a speech processing device of this form, a method executed in an information processing device to implement the functions of the speech processing device or speech processing system, a computer program, a server device for distributing the computer program, or a non-transitory storage medium storing the computer program.

An explanatory diagram showing an outline of the speech processing device.
A diagram showing an example of various parameters of the acoustic features.
An explanatory diagram for describing the neural network in the first embodiment.
An explanatory diagram for describing the transform processing used in generating a speech waveform.
A flowchart showing the speech waveform generation processing.
An explanatory diagram showing another form of the neural network.
A diagram showing an example of a generated speech waveform.
A diagram showing the results of a subjective evaluation experiment.
A diagram showing an example of a periodic auxiliary signal.
A diagram showing an example of a plurality of periodic waveform signals having different phases.

A. First Embodiment:
FIG. 1 is an explanatory diagram showing an outline of a speech processing device 100 according to an embodiment of the present invention. The speech processing device 100 includes an acquisition unit 10, a generation unit 20, and a learning unit 30. The acquisition unit 10, the generation unit 20, and the learning unit 30 are implemented in software by one or more CPUs or GPUs executing a program stored in memory. Some or all of them may instead be implemented in hardware by circuits.

The acquisition unit 10 acquires acoustic features for generating a speech waveform. Details of the acoustic features are described later. The acquisition unit 10 may, for example, extract the acoustic features from the waveform of pre-recorded speech using a well-known speech analysis technique, or may acquire acoustic features generated in advance from the text or musical score to be uttered.

The generation unit 20 includes a noise source 21 and a bandpass filter unit 22. The noise source 21 generates an aperiodic waveform signal. An aperiodic waveform signal is a signal that represents noise, for example Gaussian noise. The bandpass filter unit 22 filters the aperiodic waveform signal generated by the noise source 21 for each predetermined frequency band, producing a plurality of aperiodic waveform signals with different frequency bands.
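
The noise source 21 and the bandpass filter unit 22 can be sketched as follows. This is a minimal illustration, not the filter design of the embodiment: it assumes Gaussian noise, equal-width bands, and an ideal DFT-based band split, and the names `gaussian_noise` and `split_into_bands` are hypothetical. It uses only the Python standard library, so the DFT is the naive quadratic-time form and suits short signals only.

```python
import cmath
import math
import random


def gaussian_noise(num_samples, seed=0):
    """Noise source: zero-mean, unit-variance Gaussian samples."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]


def split_into_bands(signal, num_bands):
    """Ideal band-pass filter bank built on a naive O(n^2) DFT.

    Returns num_bands real signals that occupy disjoint frequency ranges
    and sum, sample by sample, to the input signal.
    """
    n = len(signal)
    spectrum = [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]
    num_freqs = n // 2 + 1  # distinct frequencies from DC up to Nyquist
    bands = []
    for b in range(num_bands):
        lo = b * num_freqs / num_bands
        hi = (b + 1) * num_freqs / num_bands
        # min(k, n - k) maps a bin and its mirror to the same frequency index,
        # so each band keeps conjugate pairs together and stays real-valued.
        kept = [c if lo <= min(k, n - k) < hi else 0.0 for k, c in enumerate(spectrum)]
        bands.append([
            sum(c * cmath.exp(2j * math.pi * k * t / n) for k, c in enumerate(kept)).real / n
            for t in range(n)
        ])
    return bands


noise = gaussian_noise(96)
band_noise = split_into_bands(noise, num_bands=24)  # plays the role of nz1..nz24
```

Because the bands partition the spectrum, adding all 24 band signals back together recovers the original noise, which is a convenient sanity check for any filter bank used in this role.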

The generation unit 20 generates a speech waveform by performing transform processing using information output by a neural network that has a plurality of output channels. The generation unit 20 inputs into the input layer of the neural network a periodic waveform signal corresponding to the fundamental frequency of the speech waveform to be generated, inputs the acoustic features acquired by the acquisition unit 10 into the neural network as auxiliary information, and causes the network to output the first information and the second information.

The periodic waveform signal is a periodic signal that corresponds to the fundamental frequency of the speech waveform to be generated. It may reflect a speaking style, a singing style, or the like. For example, when a speech waveform with vibrato is to be generated, the periodic waveform signal may correspond to the fundamental frequency with the vibrato applied. The periodic waveform signal is, for example, a sine wave with the same frequency as the fundamental frequency of the speech waveform to be generated, or a cosine wave with a frequency one octave above that fundamental frequency. It may also be a non-sinusoidal signal such as a triangular wave, a sawtooth wave, a square wave, or a pulse wave. The fundamental frequency of the speech waveform to be generated may be obtained, for example, from the waveform of pre-recorded speech using a well-known speech analysis technique, or a fundamental frequency generated in advance from the text or musical score to be uttered may be used.
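
One way to build such a periodic waveform signal is to accumulate phase from a per-sample fundamental frequency contour, as sketched below. The function names, the 8000 Hz sample rate, and the vibrato depth and rate are illustrative assumptions; the sketch also assumes every sample has a positive fundamental frequency.

```python
import math


def periodic_signal(f0_hz, sample_rate, phase_offset=0.0):
    """Sine wave whose instantaneous frequency follows f0_hz, one value per sample."""
    phase = phase_offset
    samples = []
    for f0 in f0_hz:
        samples.append(math.sin(phase))
        phase += 2.0 * math.pi * f0 / sample_rate  # advance by one sample
    return samples


def vibrato_contour(base_hz, depth_hz, rate_hz, num_samples, sample_rate):
    """F0 contour that oscillates around base_hz, as for a vibrato passage."""
    return [
        base_hz + depth_hz * math.sin(2.0 * math.pi * rate_hz * t / sample_rate)
        for t in range(num_samples)
    ]


sample_rate = 8000
flat = [100.0] * 160                                    # steady 100 Hz contour
sine = periodic_signal(flat, sample_rate)               # same F0 as the target
cosine_octave_up = periodic_signal([2.0 * f for f in flat], sample_rate,
                                   phase_offset=math.pi / 2.0)
vibrato = periodic_signal(vibrato_contour(100.0, 3.0, 5.0, 160, sample_rate),
                          sample_rate)
```

A phase offset of pi/2 turns the sine into a cosine, so the same function can also produce several signals that differ only in phase.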

The generation unit 20 generates a speech waveform by performing transform processing using the first information and the second information output by the neural network and the aperiodic waveform signal generated by the noise source 21. In the present embodiment, the generation unit 20 performs the transform processing using the first information, the second information, and the aperiodic waveform signals generated by the bandpass filter unit 22. The first information is information for generating an aperiodic component; in the present embodiment it indicates the strength of the aperiodic component for each predetermined frequency band. The second information indicates a periodic component; more specifically, it is amplitude information obtained by sampling the periodic component at each sampling period. Details of the transform processing are described later.

The learning unit 30 learns the relationship among the acoustic features, the periodic waveform signal, the first information, and the second information by supervised machine learning or unsupervised machine learning (see, for example, Non-Patent Document 4), and optimizes the parameters used in the neural network. In supervised machine learning, for example, natural speech of the waveform to be generated is used as teacher data and is compared with the result of the transform processing performed with the first information and the second information. The learning unit 30 reflects the learning result in the neural network used by the generation unit 20. The generation unit 20 can thereby generate speech waveforms that reflect the learning result of the learning unit 30. The speech processing device 100 does not have to include the learning unit 30. In that case, the generation unit 20 can use a learning result obtained by an external learning device or the like that performs machine learning, and generate a speech waveform from the first information and the second information by the transform processing described later.

FIG. 2 is a diagram showing an example of various parameters of the acoustic features. In the present embodiment, the acoustic features are features of speech. Spectral parameters include the mel-cepstrum and line spectrum pairs (LSP). These are sometimes called spectral information. Sound source information includes the fundamental frequency. The fundamental frequency is generally handled as a logarithmic fundamental frequency, and its related parameters include a voiced/unvoiced flag and an aperiodicity index. Because unvoiced portions have no logarithmic fundamental frequency value, the voiced/unvoiced distinction may be made by, for example, writing a predetermined constant into the unvoiced portions instead of including a voiced/unvoiced flag in the sound source information. The fundamental frequency in the sound source information may be omitted because it is already contained in the periodic waveform signal described above. The voiced/unvoiced information may likewise be omitted because it is contained in the periodic auxiliary signal described later. The spectral information and the sound source information may reflect a speaking style, a singing style, or the like. For example, spectral information to which a loudness vibrato has been applied can be used.
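
The option of marking unvoiced portions with a predetermined constant can be sketched as follows. The sentinel value, the convention that an input of 0 Hz means "unvoiced", and the function names are all assumptions made for illustration.

```python
import math

UNVOICED_LOG_F0 = -1.0e10  # hypothetical sentinel, far outside any real log F0


def encode_log_f0(f0_hz):
    """Map an F0 contour in Hz to log F0, writing a constant where no pitch exists.

    Frames with f0 <= 0 are treated as unvoiced (an assumed input convention).
    """
    return [math.log(f0) if f0 > 0.0 else UNVOICED_LOG_F0 for f0 in f0_hz]


def is_voiced(log_f0):
    """Recover the voiced/unvoiced flag from the encoded value alone."""
    return log_f0 != UNVOICED_LOG_F0


encoded = encode_log_f0([0.0, 220.0, 440.0, 0.0])
flags = [is_voiced(v) for v in encoded]  # [False, True, True, False]
```

With this encoding a separate voiced/unvoiced track is unnecessary, since the flag can be read back from the log F0 track itself.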

FIG. 3 is an explanatory diagram for describing the neural network used by the generation unit 20. The neural network 200 includes a plurality of dilation layers L1 to L4. The number of dilation layers can be chosen freely. A "dilation layer" is also called an "expansion layer" or an "intermediate layer".

The dilation layer L1 is the layer into which information is input, and is hereinafter also called the "input layer". The dilation layer L1 performs initial arithmetic processing and convolution on the input signal, and the dilation layers L2 to L4 perform convolution on the information passed up from the layer below. Each layer contains a plurality of nodes.

The generation of the first information and the second information by the neural network 200 is described next. In FIG. 3, the first information is denoted "a1", "a2", and so on, and the second information is denoted "b". In the following, these pieces of information are also called "data". In the present embodiment, the neural network 200 generates 24 pieces of first information. The samples S1 to S8 of the periodic waveform signal undergo initial arithmetic processing in the dilation layer L1 and are then input to the nodes N1 to N8 in time-series order. Each of the nodes N1 to N8 of the dilation layer L1 convolves this information and passes the result to the dilation layer L2 above it. For convenience of illustration, FIG. 3 shows eight periodic waveform samples S1 to S8 being input to the dilation layer L1, but the number of input samples can be chosen freely and is, for example, 3000.

In the dilation layers L2 to L4, various operations are applied stage by stage, layer by layer, to the information passed up from the input layer L1. The acoustic features corresponding to each sample are input as auxiliary information AI to each of the nodes N1 to N8 of the input layer L1 and to each node of the dilation layers L2 to L4. In addition to the information passed from the layer below, samples of the periodic waveform signal may also be input to the dilation layers L2 to L4. Data DA is output by combining the data finally computed in the dilation layer L4 with the data of the rightmost node of each layer, that is, the node that receives the most recent data in the time series. In the present embodiment, the data DA consists of the first information a1 to a24, which indicates the strength of the aperiodic component in each of 24 frequency bands at the time of the input sample S8, and the second information b, which is the amplitude predicted as the periodic component of the speech sample at the time of sample S8. The neural network 200 of the present embodiment is structured so that samples closer in time tend to have a stronger influence on the output data DA. Specifically, sample S8 tends to influence the prediction of the data DA more than sample S1 does.
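
The way stacked dilation layers widen the span of input samples seen by one output can be sketched with a toy causal dilated convolution. The kernel size of 2 and the doubling dilations 1, 2, 4 are assumptions borrowed from WaveNet-style networks (Non-Patent Document 1); the text above does not specify them. Auxiliary inputs, nonlinearities, learned weights, and the connections from each layer to the output are left out.

```python
def causal_dilated_conv(x, w_past, w_now, dilation):
    """y[t] = w_past * x[t - dilation] + w_now * x[t], with zeros before t = 0."""
    return [
        w_past * (x[t - dilation] if t >= dilation else 0.0) + w_now * x[t]
        for t in range(len(x))
    ]


def receptive_field(dilations, kernel_size=2):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def run_stack(x, dilations):
    """Apply the layers in order; every weight is 1.0 so influence is easy to see."""
    for d in dilations:
        x = causal_dilated_conv(x, 1.0, 1.0, d)
    return x


dilations = [1, 2, 4]                     # assumed doubling pattern
impulse_in_range = [1.0] + [0.0] * 7      # impulse 7 samples before the output
impulse_too_old = [1.0] + [0.0] * 8       # impulse 8 samples before the output

print(receptive_field(dilations))                    # 8
print(run_stack(impulse_in_range, dilations)[-1])    # 1.0: the oldest sample still reaches the output
print(run_stack(impulse_too_old, dilations)[-1])     # 0.0: outside the receptive field
```

Under these assumptions, three dilated convolutions cover eight samples, which matches the eight inputs S1 to S8 drawn in FIG. 3; covering thousands of samples only requires more layers, not wider kernels.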

FIG. 4 is an explanatory diagram for describing the transform processing used in generating a speech waveform. The generation unit 20 generates the speech waveform by multiplying the per-band aperiodic waveform signals nz1 to nz24 generated by the bandpass filter unit 22 by the corresponding first information a1 to a24 and adding the second information b to the result. It suffices that all of the products and the second information b are summed: the products of a1 to a24 with nz1 to nz24 may be summed first and b added afterwards, or the products and b may all be summed at once. The frequency bands of the first information a1 to a24 and the aperiodic waveform signals nz1 to nz24 are, for example, bands divided at 1000 Hz intervals. The aperiodic waveform signals nz1 to nz24 are, for example, Gaussian noise signals with different frequency bands generated by the bandpass filter unit 22. Although the frequency range is divided into 24 bands in the present embodiment, the number of bands is not limited to this.
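
Written out per sample, the transform processing above is y[t] = a1[t] * nz1[t] + ... + a24[t] * nz24[t] + b[t]. A minimal sketch follows; the function name `transform` and the list-of-lists data layout are assumptions, and the example uses two bands instead of 24 so that the arithmetic can be checked by hand.

```python
def transform(first_info, band_noise, second_info):
    """Return y[t] = sum over k of a_k[t] * nz_k[t], plus b[t].

    first_info[k][t]: strength a_k of the aperiodic component in band k
    band_noise[k][t]: band-limited aperiodic waveform signal nz_k
    second_info[t]:   periodic-component amplitude b
    """
    if len(first_info) != len(band_noise):
        raise ValueError("one strength track is needed per noise band")
    return [
        sum(a[t] * nz[t] for a, nz in zip(first_info, band_noise)) + b
        for t, b in enumerate(second_info)
    ]


# Two bands and two samples keep the arithmetic easy to follow.
a = [[0.5, 0.5], [2.0, 0.0]]
nz = [[1.0, -1.0], [0.25, 4.0]]
b = [0.1, 0.2]
waveform = transform(a, nz, b)  # approximately [1.1, -0.3]
```

The second band contributes nothing to the second sample because its strength is 0.0 there, which is how the network can silence noise in a band at any given moment.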

FIG. 5 is a flowchart showing the speech waveform generation processing performed with the speech processing device 100 of the present embodiment. First, in step S100, the acquisition unit 10 acquires the acoustic features. Next, in step S110, the generation unit 20 inputs the acoustic features acquired in step S100 and a predetermined duration of the periodic waveform signal into the neural network, and causes it to output the first information, which indicates the strength of the aperiodic component for each predetermined frequency band, and the second information, which indicates the periodic component. Finally, in step S120, the generation unit 20 performs the transform processing using the information output by the neural network in step S110 and generates the speech waveform.
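
The three steps of the flowchart can be sketched as one function. Everything here is illustrative: `placeholder_network` is a hypothetical stand-in that merely passes the periodic signal through and applies a fixed noise gain, where the device would run the trained neural network, and the feature dictionary is an invented format.

```python
def generate_waveform(acquire_features, network, periodic_signal, band_noise):
    """Run steps S100, S110, and S120 in order and return the waveform samples."""
    features = acquire_features()                                   # S100
    first_info, second_info = network(periodic_signal, features)    # S110
    return [                                                        # S120
        sum(a[t] * nz[t] for a, nz in zip(first_info, band_noise)) + b
        for t, b in enumerate(second_info)
    ]


def placeholder_network(periodic_signal, features):
    """Stand-in for the trained network: fixed noise gain, periodic input passed through."""
    num_bands = 2
    gains = [[features["noise_gain"]] * len(periodic_signal) for _ in range(num_bands)]
    return gains, list(periodic_signal)


waveform = generate_waveform(
    acquire_features=lambda: {"noise_gain": 0.0},
    network=placeholder_network,
    periodic_signal=[0.0, 1.0, 0.0, -1.0],
    band_noise=[[0.3, -0.2, 0.1, 0.4], [0.0, 0.5, -0.5, 0.2]],
)
```

With the noise gain set to zero the output equals the periodic component alone, which makes the data flow between the steps easy to verify.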

According to the speech processing device 100 of the present embodiment described above, the generation unit 20 generates a speech waveform by adding the second information, which indicates the periodic component, to information obtained by arithmetic processing using the first information for generating the aperiodic component and the aperiodic waveform signal. More specifically, the per-band aperiodic waveform signals generated by the bandpass filter unit 22 are multiplied by the first information indicating the strength of the aperiodic component in the corresponding band, and the second information is added to the result, so a high-quality speech waveform with a desired pitch can be generated. In addition, speech waveforms can be generated faster than with an autoregressive neural network, which feeds its own output back into the network to predict the next data. The learning unit 30 can learn the relationship among the acoustic features, the periodic waveform signal, the first information, and the second information, and the learning result can be reflected in the generation unit 20. Furthermore, even for a speech waveform whose fundamental frequency lies far outside the range learned by the learning unit 30, the generation unit 20 inputs a periodic waveform signal corresponding to the fundamental frequency of the speech waveform to be generated into the input layer of the neural network, so a speech waveform with the desired pitch can be generated.

FIG. 6 is an explanatory diagram showing another form of the neural network. The neural network shown in FIG. 6 is formed by mirroring the structure of the neural network shown in FIG. 3 so that it is left-right symmetric. As in the first embodiment, samples of the periodic waveform signal are input to the input layer L1. In this form, the input layer L1 receives both samples of the periodic waveform signal that lie in the past and samples that lie in the future relative to the time of the output data DA. More specifically, the nodes N1 to N7 receive the past samples S1 to S7 after initial arithmetic processing, the node N8 receives the current sample S8 after initial arithmetic processing, and the nodes N9 to N15 receive the future samples S9 to S15 after initial arithmetic processing. As in the first embodiment, the acoustic features are input to each node as auxiliary information. The neural network 200 shown in FIG. 6 is structured so that samples closer in time tend to have a stronger influence on the output data DA. Specifically, sample S8 tends to influence the prediction of the data DA more strongly than sample S1 or sample S15 does. Because such a neural network receives not only past samples but also future samples of the periodic waveform signal relative to the data being generated, it can generate a higher-quality speech waveform.
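
Gathering the past, current, and future samples for one output time can be sketched as a symmetric window. The window size of 7 + 1 + 7 follows the FIG. 6 description; zero padding at the signal edges and the function name `context_window` are assumptions.

```python
def context_window(samples, t, past=7, future=7, pad=0.0):
    """Samples from t - past to t + future inclusive, padded where the signal ends."""
    return [
        samples[i] if 0 <= i < len(samples) else pad
        for i in range(t - past, t + future + 1)
    ]


signal = [float(i) for i in range(1, 16)]   # stands in for S1..S15
window = context_window(signal, t=7)        # centered on S8
edge = context_window(signal, t=0)          # first sample: no past available
```

Using future samples is possible here because the periodic waveform signal and the acoustic features are known for the whole utterance before generation starts; nothing in the window depends on previously generated output.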

FIG. 7 is a diagram showing an example of a speech waveform generated in the working example. The waveform in the top row is the target speech waveform, that is, the waveform the speech processing is intended to produce. The waveform in the middle row is the speech waveform generated in the working example. The waveform in the bottom row is the periodic waveform signal input to the neural network, a sine wave with the same fundamental frequency as the target speech waveform. As FIG. 7 shows, the speech waveform generated in the working example varies with the same period T and has the same fundamental frequency as the target speech waveform.

Experimental results:
FIG. 8 is a diagram showing the mean opinion score (MOS) obtained in a subjective evaluation experiment on the generated speech waveforms. In this experiment, the quality of synthesized speech from four methods was rated on a five-point scale: "1: very bad, 2: bad, 3: fair, 4: good, 5: very good". FIG. 8 shows the scores for two of the four methods. There were 16 subjects, and each subject rated 10 phrases per method drawn from the 10 songs in the test data. The synthesized speech waveforms under evaluation were generated from the same acoustic features for both methods.

In the working example, a voice waveform was generated using the voice processing device 100 of the first embodiment described above and the neural network shown in FIG. 6. In the comparative example, a voice waveform was generated by a vocoder technique using the WaveNet neural network (described in Non-Patent Document 1). The same acoustic features as in the working example were input to the WaveNet neural network. As shown in FIG. 8, the score of the working example was higher than that of the comparative example. That is, when the generation unit 20 generates a voice waveform according to the above embodiment, it can generate a higher-quality voice waveform. The remaining two methods, not shown in FIG. 8, are (1) the original human singing voice output as is, and (2) a method identical to the working example except that it uses a trained neural network obtained by optimizing the neural network of the working example through unsupervised machine learning (see, for example, Non-Patent Document 4) in the learning unit 30 of the voice processing device 100.

B. Second embodiment:
The generation unit 20 in the second embodiment differs from the first embodiment in that it generates the voice waveform by additionally inputting, to the input layer of the neural network, a signal indicating the degree of periodicity of the voice waveform to be generated (hereinafter referred to as the "periodic auxiliary signal"). Since the configuration of the voice processing device 100 of the second embodiment is the same as that of the first embodiment, its description is omitted.

In the present embodiment, the generation unit 20 inputs the periodic waveform signal and the periodic auxiliary signal to the input layer L1 of the neural network. That is, each node of the input layer of the neural network used by the generation unit 20 has two input channels. For example, samples of the periodic waveform signal are input to the first channel, and samples of the periodic auxiliary signal are input to the second channel. The order of the channels can be chosen arbitrarily.

The periodic auxiliary signal can be determined from the boundary positions where a periodic waveform starts and ends, and can be expressed as a value from 0 to 1, with aperiodic portions set to 0 and periodic portions set to 1. For example, at a boundary where the periodic waveform starts, the periodic auxiliary signal is linearly interpolated sample by sample from 0.0 to 1.0 over the span from 240 samples before to 240 samples after the unvoiced-to-voiced boundary. At a boundary where the periodic waveform ends, it is linearly interpolated sample by sample from 1.0 to 0.0 over the span from 240 samples before to 240 samples after the voiced-to-unvoiced boundary. FIG. 9 is a diagram showing an example of the periodic auxiliary signal. The periodic auxiliary signal may also be data obtained by linearly interpolating per-phoneme or per-frame values.
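The ramp construction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `periodic_auxiliary_signal`, the per-sample boolean `voiced` mask used as input, and the assumption that boundaries are far enough apart for the ramps not to overlap are all introduced here.

```python
import numpy as np

def periodic_auxiliary_signal(voiced, ramp=240):
    """Build a 0..1 periodic auxiliary signal from a per-sample voiced mask.

    Around each voiced/unvoiced boundary the value is linearly
    interpolated from `ramp` samples before to `ramp` samples after
    the boundary. Ramps are assumed not to overlap.
    """
    voiced = np.asarray(voiced, dtype=bool)
    aux = voiced.astype(float)
    n = len(aux)
    # A boundary index b is the first sample of the new state.
    boundaries = np.flatnonzero(voiced[1:] != voiced[:-1]) + 1
    for b in boundaries:
        start, end = (0.0, 1.0) if voiced[b] else (1.0, 0.0)
        full = np.linspace(start, end, 2 * ramp + 1)  # positions b-ramp .. b+ramp
        lo, hi = max(b - ramp, 0), min(b + ramp, n - 1)
        aux[lo : hi + 1] = full[lo - (b - ramp) : hi - (b - ramp) + 1]
    return aux

# Example: 1000 unvoiced samples followed by 1000 voiced samples.
mask = np.concatenate([np.zeros(1000, dtype=bool), np.ones(1000, dtype=bool)])
aux = periodic_auxiliary_signal(mask)
```

In this example the signal stays at 0.0 up to sample 760, passes through 0.5 at the boundary (sample 1000), and reaches 1.0 at sample 1240.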

According to the voice processing device 100 of the present embodiment described above, the generation unit 20 inputs the periodic waveform signal directly to the input layer of the neural network to generate the voice waveform, and can therefore generate a voice waveform having a desired fundamental frequency. In addition, because the generation unit 20 also inputs the periodic auxiliary signal to the input layer of the neural network, it can generate a high-quality voice waveform that reflects information on the excitation source, such as silent portions and unvoiced consonant portions of the voice waveform to be generated.

C. Third embodiment:
The generation unit 20 in the third embodiment differs from the first embodiment in that it generates the voice waveform by inputting a plurality of periodic waveform signals with different phases to the input layer of the neural network. Since the configuration of the voice processing device 100 of the third embodiment is the same as that of the first embodiment, its description is omitted.

FIG. 10 is a diagram showing an example of a plurality of periodic waveform signals with different phases. In the present embodiment, the generation unit 20 inputs a periodic waveform signal Ws and a periodic waveform signal Wc to the input layer L1 of the neural network. That is, each node of the input layer of the neural network used by the generation unit 20 has two input channels. Samples of the periodic waveform signal Ws are input to the first channel, and samples of the periodic waveform signal Wc are input to the second channel. The order of the channels can be chosen arbitrarily.

The periodic waveform signal Ws is a sine wave having the same fundamental frequency as the voice waveform to be generated, and the periodic waveform signal Wc is a cosine wave having the same fundamental frequency as the voice waveform to be generated. As shown in FIG. 10, the periodic waveform signal Ws takes the same amplitude A1 both at timing t1, when it is rising, and at timing t2, when it is falling. The periodic waveform signal Wc, however, has amplitude A2 at timing t1 and a different amplitude A3 at timing t2. Therefore, when Ws has amplitude A1, the neural network used by the generation unit 20 can uniquely determine that the signal is rising if Wc has amplitude A2 and falling if Wc has amplitude A3.
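The disambiguation described above can be checked numerically. The sketch below assumes a constant fundamental frequency and illustrative values (100 Hz at a 48 kHz sampling rate); the function names `sine_cosine_pair` and `is_rising` are hypothetical and only demonstrate why the cosine channel resolves the rising/falling ambiguity of the sine channel.

```python
import numpy as np

def sine_cosine_pair(f0_hz, sample_rate_hz, num_samples):
    """Return (Ws, Wc): sine and cosine waves at the same fundamental frequency."""
    t = np.arange(num_samples) / sample_rate_hz
    phase = 2.0 * np.pi * f0_hz * t
    return np.sin(phase), np.cos(phase)

def is_rising(wc_value):
    """The slope of sin(phase) is proportional to cos(phase),
    so Ws is rising exactly when Wc is positive."""
    return wc_value > 0.0

ws, wc = sine_cosine_pair(100.0, 48000.0, 480)  # exactly one period
t1, t2 = 40, 200                                # phases of 30 and 150 degrees
# Ws has the same value at t1 and t2, but Wc differs in sign.
same_ws = np.isclose(ws[t1], ws[t2])
rising_t1, rising_t2 = is_rising(wc[t1]), is_rising(wc[t2])
```

Here `ws[t1]` and `ws[t2]` are both about 0.5, while `wc[t1]` is positive (rising) and `wc[t2]` is negative (falling), which corresponds to the amplitudes A1, A2, and A3 in FIG. 10.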

According to the voice processing device 100 of the present embodiment described above, the generation unit 20 inputs a plurality of periodic waveform signals with different phases to the input layer of the neural network, so the neural network used by the generation unit 20 can uniquely determine whether a value of the periodic waveform signal is a rising value or a falling value. The generation unit 20 can therefore generate a voice waveform having a desired fundamental frequency more effectively, and can generate a higher-quality voice waveform.

D. Fourth embodiment:
In the fourth embodiment, the structure of the neural network used by the generation unit 20 differs from that of the first embodiment. Since the configuration of the voice processing device 100 of the fourth embodiment is the same as that of the first embodiment, its description is omitted.

In the present embodiment, the generation unit 20 outputs the first information and the second information using a neural network in which a plurality of the neural networks shown in FIG. 3 or FIG. 6 are stacked vertically. For example, when two neural networks are stacked, the generation unit 20 inputs the information output by the lower-stage neural network to the input layer L1 of the upper-stage neural network, which then outputs the first information and the second information. That is, as many lower-stage outputs are prepared as there are nodes in the input layer L1 of the upper-stage neural network.

According to the voice processing device 100 of the present embodiment described above, the generation unit 20 inputs the periodic waveform signal directly to the input layer of the neural network to generate the voice waveform, and can therefore generate a voice waveform having a desired fundamental frequency. In addition, the generation unit 20 obtains the first information and the second information using a neural network in which a plurality of neural networks are stacked. Compared with a single-stage neural network that receives the same number of samples, the neural network of each stage can therefore be made smaller. As a result, many samples can be input to generate the voice waveform without increasing the total number of parameters, so a higher-quality voice waveform can be generated.

E. Other embodiments:
(E1) In the above embodiments, the acoustic feature amount acquired by the acquisition unit 10 is a feature amount of singing voice. Instead, the acquisition unit 10 may acquire a feature amount of spoken language as the acoustic feature amount. In this form, a voice waveform of text-to-speech synthesis, rather than singing, can be generated. Voice waveforms that more accurately reproduce tone of voice, accent, intonation, the four tones of Chinese, and the like can also be generated. The acquisition unit 10 may also acquire a feature amount representing voice quality as the acoustic feature amount. The feature amount representing voice quality is an acoustic feature amount extracted from another person's voice. In this form, a voice waveform can be generated with voice quality conversion, which converts the acoustic features of one speaker into those of another speaker. When voice quality conversion is performed, the acoustic feature amount may be that of the voice to be converted (the conversion source) or that of the desired voice (the conversion target). The difference between these acoustic feature amounts may be used as the acoustic feature amount, or both may be used. As the periodic waveform signal, the neural network may receive the source voice itself, a periodic signal having the fundamental frequency of the source voice, a periodic signal that is the residual signal of the source voice, or a periodic signal having the fundamental frequency of the target voice. The acquisition unit 10 may also acquire a feature amount of an instrument sound as the acoustic feature amount and input it to the neural network as auxiliary information. In this form, a waveform of an instrument sound, rather than singing, can be generated. When generating a percussion sound, the acquisition unit 10 acquires a feature amount of the percussion sound, and a pulse signal that rises at the timings at which the percussion instrument is to sound is used as the periodic waveform signal. More specifically, to generate the waveform of an eight-beat hi-hat, a pulse signal that is 1 at every eighth note and 0 elsewhere is used.
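The eighth-note pulse signal for the hi-hat case can be sketched as below. The tempo, the sampling rate, and the function name `eighth_note_pulses` are assumptions for illustration; the source only specifies that the signal is 1 at every eighth note and 0 elsewhere.

```python
import numpy as np

def eighth_note_pulses(tempo_bpm, sample_rate_hz, num_samples):
    """Return a signal that is 1.0 at each eighth-note onset and 0.0 elsewhere.

    A quarter note lasts 60 / tempo_bpm seconds, so an eighth note
    lasts half of that.
    """
    samples_per_eighth = sample_rate_hz * 60.0 / tempo_bpm / 2.0
    pulses = np.zeros(num_samples)
    onsets = np.round(np.arange(0, num_samples, samples_per_eighth)).astype(int)
    onsets = onsets[onsets < num_samples]  # guard against rounding past the end
    pulses[onsets] = 1.0
    return pulses

# One second at 120 BPM and 48 kHz contains four eighth notes.
pulses = eighth_note_pulses(120.0, 48000, 48000)
```

At 120 BPM an eighth note spans 12000 samples at 48 kHz, so the pulses fall at samples 0, 12000, 24000, and 36000.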

(E2) In the above embodiments, the acquisition unit 10 may acquire the acoustic feature amount by converting musical score features or linguistic features, from which the voice waveform to be generated is derived, into voice features using a well-known conversion technique. The acquisition unit 10 may also use, as the acoustic feature amount, information obtained by converting the musical score features or linguistic features with an arbitrary neural network. Furthermore, the learning unit 30 may optimize the various parameters by simultaneously training the neural network used to convert the musical score features or linguistic features and the neural network that outputs the first information and the second information in the above embodiments.

(E3) In the above embodiments, the acoustic feature amount acquired by the acquisition unit 10 may include expression information in addition to sound source information and spectrum information. The expression information includes, for example, the period, amplitude, and presence or absence of pitch vibrato and of loudness vibrato in the case of singing; accent, intonation, and the like in the case of spoken language; and the degree and presence or absence of guitar string bending (choking) in the case of instrument sounds. Instead of including the presence or absence of pitch vibrato in the singing expression information, it may be indicated by another method, such as placing a predetermined constant in portions without pitch vibrato. Similarly, instead of including the presence or absence of loudness vibrato in the singing expression information, it may be indicated by a method such as placing a predetermined constant in portions without loudness vibrato.

(E4) In the above embodiments, each node of the input layer L1 of the neural network may have two or more input channels. For example, the input layer L1 may be given two input channels, with a sample of the periodic waveform signal input to the first channel and, to the second channel, the sample of the periodic waveform signal one time step before the sample input to the first channel. The neural network may also receive, on a plurality of input channels, a plurality of types of periodic waveform signal samples for the same time point, and output the first information and the second information for each channel. This makes it possible to generate voice waveforms representing multiple overlapping voices or chords.

(E5) In the above embodiments, the generation unit 20 inputs a periodic waveform signal corresponding to the fundamental frequency of the voice waveform to be generated to the input layer of the neural network, and inputs the acoustic feature amount to the neural network as auxiliary information. The generation unit 20 may additionally input the aperiodic waveform signal to the input layer of the neural network.

(E6) In the above embodiments, the noise source 21 generates Gaussian noise as the aperiodic waveform signal, but it is not limited to this and may generate a signal representing other noise. For example, the noise source 21 generates white noise.

(E7) In the above embodiments, the generation unit 20 outputs the first information and the second information using a single neural network. Instead, the generation unit 20 may use two neural networks, one outputting the first information and the other outputting the second information. In this form, the generation unit 20 may input the second information output by the other neural network to the input layer of the neural network that outputs the first information, as the periodic waveform signal corresponding to the fundamental frequency of the voice waveform to be generated.

(E8) In the above embodiments, the generation unit 20 may output, as the first information from the neural network, acoustic features such as a mel-cepstrum or LSP (line spectral pairs), which are information for generating the aperiodic component, and may perform a conversion process that generates the voice waveform by arithmetic processing using the first feature amount, the second feature amount, and the aperiodic waveform signal. For example, the generation unit 20 uses the neural network to output first information, which is a 24-dimensional mel-cepstrum of the aperiodic component, and second information, which is a one-dimensional periodic component. The generation unit 20 then generates the aperiodic component by convolving the first information with the aperiodic waveform signal generated by the noise source 21, and generates the voice waveform by adding the result to the second information.
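The convolve-then-add structure of (E8) can be sketched in simplified form. Note the substitution: the source convolves a mel-cepstrum-based filter with the noise, whereas this sketch assumes the first information has already been converted into a plain FIR impulse response, since the mel-cepstrum-to-filter conversion is outside the scope of the illustration. The function name `synthesize` and all values are hypothetical.

```python
import numpy as np

def synthesize(periodic, noise, impulse_response):
    """Shape the noise with a filter and add the periodic component.

    periodic: second information (periodic component), one value per sample.
    noise: aperiodic waveform signal from the noise source.
    impulse_response: FIR filter standing in for the first information
        (assumption: derived from the mel-cepstrum elsewhere).
    """
    periodic = np.asarray(periodic, dtype=float)
    # Truncate the full convolution so the output has the same length as the input.
    aperiodic = np.convolve(noise, impulse_response)[: len(periodic)]
    return periodic + aperiodic

rng = np.random.default_rng(0)
noise = rng.standard_normal(480)                          # Gaussian noise, as in the source
periodic = np.sin(2.0 * np.pi * 100.0 * np.arange(480) / 48000.0)
waveform = synthesize(periodic, noise, np.array([0.5, 0.5]))
```

The same structure applies to the band-wise variant in the claims, where each band-limited noise signal is multiplied by its strength value instead of being convolved with a filter.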

(E9) In the second embodiment, the generation unit 20 may additionally input periodic waveform signals with different phases to the input layer of the neural network to generate the voice waveform. That is, the second embodiment and the third embodiment may be combined. More specifically, the generation unit 20 can input to the input layer of the neural network, for example, a periodic waveform signal Ws that is a sine wave with the same fundamental frequency as the voice waveform to be generated, a periodic waveform signal Wc that is a cosine wave with the same fundamental frequency as the voice waveform to be generated, and the periodic auxiliary signal.

(E10) In the second embodiment, the periodic auxiliary signal may be determined according to, for example, the linguistic information of the voice waveform to be generated. "Linguistic information" is, for example, information on vowels and consonants. The linguistic information may be included in the acoustic feature amount. More specifically, the periodic auxiliary signal can be data in which silent portions and unvoiced consonant portions are 0.0, vowel portions are 0.9 or 1.0, and consonant portions that mix periodic and aperiodic components, such as /b/, /d/, and /g/, take values from 0.3 to 0.7.
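A phoneme-class lookup of this kind can be sketched as follows. The class names, the specific values chosen within the stated ranges (1.0 for vowels, 0.5 for mixed consonants), and the segment format are all assumptions for illustration; the source gives only the value ranges.

```python
import numpy as np

# Hypothetical class labels; values are picked from the ranges given in (E10).
AUX_BY_CLASS = {
    "silence": 0.0,
    "unvoiced_consonant": 0.0,
    "mixed_consonant": 0.5,   # e.g. /b/, /d/, /g/: anywhere in 0.3 to 0.7
    "vowel": 1.0,             # the source allows 0.9 or 1.0
}

def aux_from_segments(segments):
    """Expand (class_name, num_samples) segments into a per-sample signal."""
    return np.concatenate(
        [np.full(num_samples, AUX_BY_CLASS[name]) for name, num_samples in segments]
    )

aux = aux_from_segments([("silence", 3), ("mixed_consonant", 2), ("vowel", 4)])
```

A step-shaped signal like this could then be smoothed by linear interpolation between segments, in line with the statement in the second embodiment that per-phoneme values may be linearly interpolated.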

The present invention is not limited to the embodiments described above and can be realized in various configurations without departing from its spirit. For example, the technical features in the embodiments that correspond to the technical features of each form described in the Summary of the Invention may be replaced or combined as appropriate in order to solve some or all of the problems described above or to achieve some or all of the effects described above. Technical features that are not described as essential in this specification may be omitted as appropriate.

10 ... acquisition unit, 20 ... generation unit, 21 ... noise source, 22 ... band-pass filter unit, 30 ... learning unit, 100 ... voice processing device, 200 ... neural network, AI ... auxiliary information, DA ... data, L1 to L4 ... dilation layers, N1 to N15 ... nodes, S1 to S15 ... samples, a1 to a24 ... first information, b ... second information, nz1 to nz24 ... aperiodic waveform signals

Claims

A voice processing device comprising:
an acquisition unit that acquires an acoustic feature amount for generating a voice waveform; and
a generation unit that generates the voice waveform by inputting, to a neural network, a periodic waveform signal corresponding to the fundamental frequency of the voice waveform together with the acoustic feature amount, and by performing a conversion process using information output by the neural network,
wherein the neural network outputs first information for generating an aperiodic component and second information indicating a periodic component, and
the conversion process is a process of adding the second information to information obtained by arithmetic processing using the first information and an aperiodic waveform signal.

The voice processing device according to claim 1, wherein
the first information is information indicating the strength of the aperiodic component in each predetermined frequency band, and
the conversion process is a process of adding the second information to information obtained by multiplying the aperiodic waveform signal of each frequency band by the corresponding first information.

The voice processing device according to claim 1 or 2, wherein
the generation unit further inputs, to the neural network, a signal indicating the degree of periodicity of the voice waveform to be generated.

The voice processing device according to any one of claims 1 to 3, wherein
the generation unit inputs a plurality of the periodic waveform signals having different phases to the neural network.

The voice processing device according to any one of claims 1 to 4, further comprising
a learning unit that learns, by machine learning, the relationship among the acoustic feature amount, the periodic waveform signal, the first information, and the second information, and reflects the result in the neural network.

A voice processing method comprising:
an acquisition step of acquiring an acoustic feature amount for generating a voice waveform; and
a generation step of generating the voice waveform by inputting, to a neural network, a periodic waveform signal corresponding to the fundamental frequency of the voice waveform together with the acoustic feature amount, and by performing a conversion process using information output by the neural network,
wherein the neural network outputs first information for generating an aperiodic component and second information indicating a periodic component, and
the conversion process is a process of adding the second information to information obtained by arithmetic processing using the first information and an aperiodic waveform signal.