JP6552146B1 - Audio processing apparatus and audio processing method

Info

Publication number: JP6552146B1
Application number: JP2019009182A
Authority: JP
Inventors: 恵一徳田; 圭一郎大浦; 和寛中村
Original assignee: Techno Speech Inc
Current assignee: Techno Speech Inc
Priority date: 2019-01-23
Filing date: 2019-01-23
Publication date: 2019-07-31
Anticipated expiration: 2039-01-23
Also published as: JP2020118828A

Abstract

Problem: To provide a speech synthesis technique capable of synthesizing smooth and natural speech. Solution: A speech processing apparatus includes an acquisition unit that acquires a multidimensional first feature amount related to speech; a first conversion unit that converts the first feature amount into a multidimensional second feature amount for each predetermined first period; and a second conversion unit that, using a neural network capable of processing the second feature amount as a time series, converts the second feature amount into an acoustic feature amount for generating a speech waveform for each second period, which is longer than the first period. Selected drawing: FIG. 6

Description

The present invention relates to a speech processing apparatus and a speech processing method.

Among recent speech processing apparatuses, those that synthesize speech using a neural network are known. In the technique described in Patent Document 1, a speech waveform is synthesized using acoustic feature amounts generated by a neural network.

Patent Document 1: JP 2018-146803 A

Non-Patent Document 1: A. van den Oord et al., "WaveNet: A Generative Model for Raw Audio", arXiv preprint arXiv:1609.03499, 2016

However, in the technique described in Patent Document 1, the acoustic feature amounts are generated independently or sequentially on the time axis, so the temporal structure of speech cannot be expressed sufficiently, and mechanical, unnatural speech may be generated. A speech synthesis technique that can handle the temporal structure of speech appropriately and can synthesize smooth and natural speech has therefore been desired.

The present invention has been made to solve the problem described above, and can be realized in the following aspects.

(1) According to one aspect of the present invention, a speech processing apparatus is provided. The speech processing apparatus includes: an acquisition unit that acquires a multidimensional first feature amount related to speech; a first conversion unit that converts the first feature amount into a multidimensional second feature amount for each predetermined first period; and a second conversion unit that, using a neural network capable of processing the second feature amount as a time series, converts the second feature amount into an acoustic feature amount for generating a speech waveform for each second period longer than the first period. According to the speech processing apparatus of this aspect, the first feature amount is converted into the acoustic feature amount over each long period, so smooth and natural speech can be synthesized from this acoustic feature amount.
(2) In the speech processing apparatus of the above aspect, the second conversion unit may use a convolutional neural network as the neural network to convert the second feature amount into the acoustic feature amount. According to the speech processing apparatus of this aspect, the second feature amount can be converted into the acoustic feature amount with high quality using existing technology.
(3) In the speech processing apparatus of the above aspect, the second period may be of variable length. According to the speech processing apparatus of this aspect, conversion into an acoustic feature amount of arbitrary length is possible.
(4) In the speech processing apparatus of the above aspect, the second conversion unit may change the length of the second period according to a silent portion in the first feature amount. According to the speech processing apparatus of this aspect, for example, when a singing voice is synthesized, synthesis can be performed phrase by phrase.
(5) In the speech processing apparatus of the above aspect, the first conversion unit may convert the first feature amount into the second feature amount using a feedforward neural network. According to the speech processing apparatus of this aspect, the first feature amount can be converted into the second feature amount at high speed.
(6) In the speech processing apparatus of the above aspect, the second conversion unit may input a specific parameter included in the first feature amount into the neural network in addition to the second feature amount, and perform the conversion into the acoustic feature amount. According to the speech processing apparatus of this aspect, the specific parameter included in the first feature amount is added to the second feature amount as auxiliary information, so conversion into an acoustic feature amount that improves the accuracy of the synthesized speech is possible.
(7) In the speech processing apparatus of the above aspect, the parameter may include pitch information. According to the speech processing apparatus of this aspect, conversion into an acoustic feature amount that improves the sound quality of the synthesized speech is possible.
(8) In the speech processing apparatus of the above aspect, the pitch information of a silent portion in the first feature amount may be information interpolated from the preceding and following pitch information. According to the speech processing apparatus of this aspect, conversion into an acoustic feature amount that further improves the sound quality of the synthesized speech is possible.
(9) In the speech processing apparatus of the above aspect, the first feature amount may include at least one of a linguistic feature amount, a score feature amount, and a voice quality feature amount. According to the speech processing apparatus of this aspect, for example, the first feature amount can be converted into an acoustic feature amount for performing text-to-speech synthesis, singing voice synthesis, or voice quality conversion.
(10) The speech processing apparatus of the above aspect may further include a vocoder unit that generates a speech waveform using the acoustic feature amount. According to the speech processing apparatus of this aspect, synthesized speech can be generated using the acoustic feature amount.
(11) The speech processing apparatus of the above aspect may further include a learning unit that learns the relationship between the first feature amount and the acoustic feature amount by supervised machine learning and reflects the result in the neural network. According to the speech processing apparatus of this aspect, the relationship between the first feature amount and the acoustic feature amount can be learned, and the learning result can be reflected in the second conversion unit. When the first conversion unit performs conversion using a neural network, the learning result can also be reflected in the first conversion unit.
(12) In the speech processing apparatus of the above aspect, the second conversion unit converts the second feature amount into the acoustic feature amount by using it as two-dimensional data in which the data of each dimension of the second feature amount are arranged over the length of the second period. According to the speech processing apparatus of this aspect, changes in the time direction can be handled effectively.
(13) A speech processing apparatus including: an acquisition unit that acquires a multidimensional feature amount related to speech; and a conversion unit that, using a convolutional neural network, converts the feature amount into an acoustic feature amount for generating a speech waveform for each predetermined period, wherein the conversion unit converts the feature amount into the acoustic feature amount by using it as two-dimensional data in which the data of each dimension of the feature amount are arranged over the length of the period. According to the speech processing apparatus of this aspect, changes in the time direction can be handled effectively, and because the conversion into the acoustic feature amount is performed over each long period, smooth and natural speech can be synthesized from this acoustic feature amount.

The present invention can be realized in various forms. For example, it can be realized as a speech processing system using the speech processing apparatus of the above aspects, a method executed in an information processing apparatus to realize the functions of a speech synthesis apparatus or speech synthesis system, a computer program, a server apparatus for distributing the computer program, or a non-transitory storage medium storing the computer program.

FIG. 1 is an explanatory diagram showing an overview of a speech processing apparatus in one embodiment of the present invention.
FIG. 2 is a diagram showing an example of the various parameters in the first feature amount.
FIG. 3 is a diagram showing an example of the various parameters in the acoustic feature amount.
FIG. 4 is an explanatory diagram for describing machine learning by a deep neural network.
FIG. 5 is a flowchart showing the speech synthesis process.
FIG. 6 is an explanatory diagram schematically showing the speech synthesis process.
FIG. 7 is a diagram showing the results of a subjective evaluation experiment.
FIG. 8 is an explanatory diagram of the CNN in the second embodiment.
FIG. 9 is an explanatory diagram showing an example in which pitch information is interpolated.

A. First embodiment:
FIG. 1 is an explanatory diagram showing an overview of a speech processing apparatus 100 in one embodiment of the present invention. The speech processing apparatus 100 includes an acquisition unit 10, a first conversion unit 20, a second conversion unit 30, a vocoder unit 40, a learning unit 50, and an acoustic model 60. The acquisition unit 10, the first conversion unit 20, the second conversion unit 30, the vocoder unit 40, and the learning unit 50 are realized in software by one or more CPUs executing a program stored in memory. Some or all of these units may instead be realized in hardware by circuits.

The acquisition unit 10 acquires a multidimensional first feature amount related to speech. Details of the first feature amount are described later. The acquisition unit 10 may, for example, extract the first feature amount from the waveform of pre-recorded speech using a well-known speech recognition technique, or may acquire a first feature amount generated in advance according to the text or musical score to be uttered.

The first conversion unit 20 converts the first feature amount acquired by the acquisition unit 10 into a multidimensional second feature amount for each predetermined first period. The second feature amount is data that the second conversion unit 30 can handle easily when converting to the acoustic feature amount. In the present embodiment, the first conversion unit 20 converts the first feature amount into the multidimensional second feature amount using a feedforward neural network (FFNN). The first conversion unit 20 is not limited to an FFNN; it may use a recurrent neural network (RNN) having a recurrent structure, such as a long short-term memory (LSTM) network, or a hidden Markov model (HMM). These may also be used in combination.

The second conversion unit 30 converts the second feature amount produced by the first conversion unit 20 into an acoustic feature amount for generating a speech waveform, for each second period longer than the first period, using a neural network capable of processing the second feature amount as a time series. Details of the acoustic feature amount are described later. The second period may be of variable length. When the second period is of variable length, it is preferable to set it by changing it according to the silent portions in the first feature amount. This allows the acoustic feature amount to be generated phrase by phrase. In the present embodiment, the second conversion unit 30 converts the second feature amount into the acoustic feature amount using a convolutional neural network (CNN). When the second period is, for example, variable, the second conversion unit 30 uses a fully recurrent network (FRN) or a fully convolutional network (FCN). The second conversion unit 30 is not limited to a CNN and may use an RNN.
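As an illustration of setting a variable-length second period from silent portions, the following minimal sketch splits a sequence of per-frame silence flags into phrase segments. The function name `split_into_phrases`, the per-frame flag representation, and the `min_silence_frames` threshold are assumptions made for this example and do not appear in the specification.

```python
from typing import List, Tuple


def split_into_phrases(is_silent: List[bool],
                       min_silence_frames: int = 40) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, one per phrase.

    A phrase is closed when a run of at least `min_silence_frames` silent
    frames is seen (hypothetical rule); shorter silences stay inside it.
    """
    phrases: List[Tuple[int, int]] = []
    start = None          # first frame of the phrase being built
    silence_run = 0       # length of the current run of silent frames
    for i, silent in enumerate(is_silent):
        if silent:
            silence_run += 1
            if start is not None and silence_run == min_silence_frames:
                # Close the phrase at the frame where this silence began.
                phrases.append((start, i - silence_run + 1))
                start = None
        else:
            if start is None:
                start = i
            silence_run = 0
    if start is not None:
        # Drop any short trailing silence from the final phrase.
        phrases.append((start, len(is_silent) - silence_run))
    return phrases


flags = [True] * 3 + [False] * 5 + [True] * 2 + [False] * 4 + [True] * 6 + [False] * 3
segments = split_into_phrases(flags, min_silence_frames=4)
```

Each returned range would then serve as one second period, so the length handed to the second conversion unit differs from phrase to phrase.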

The vocoder unit 40 generates a speech waveform from the acoustic feature amount produced by the second conversion unit 30. The vocoder unit 40 may use, for example, a conventional vocoder technique, or a vocoder technique based on a neural network such as WaveNet (described in Non-Patent Document 1). The speech processing apparatus 100 need not include the vocoder unit 40. In that case, the speech waveform is generated by an external speech synthesis apparatus.

The learning unit 50 learns the relationship between the first feature amount and the acoustic feature amount by supervised machine learning. The learning unit 50 reflects the learning result in the statistical model of the first conversion unit 20, such as a neural network or a hidden Markov model (HMM), and in the neural network used by the second conversion unit 30. The first conversion unit 20 and the second conversion unit 30 can thereby perform conversion that reflects the learning result of the learning unit 50. The speech processing apparatus 100 need not include the learning unit 50. In that case, the first conversion unit 20 and the second conversion unit 30 can perform conversion that reflects a learning result obtained by, for example, an external learning apparatus that performs machine learning. When the first conversion unit 20 uses a neural network, more accurate learning is possible by connecting it to the neural network used by the second conversion unit 30 and training both simultaneously. Alternatively, the learning result may be reflected alternately in one or the other of the neural network used by the first conversion unit 20 and the neural network used by the second conversion unit 30.

FIG. 2 is a diagram showing an example of the multidimensional parameters included in the first feature amount acquired by the acquisition unit 10 in singing voice synthesis. In the present embodiment, the first feature amount is a score feature amount. The score information includes song information, phrase information, and note information. The note information includes, for example, the length and pitch of a note and the position of the note within a phrase. The linguistic information includes syllable information and phoneme information. The syllable information includes, for example, the number of phonemes and the position of the syllable within a note. The phoneme information includes, for example, the phoneme type (such as vowel, voiced consonant, or unvoiced consonant) and the position of the phoneme within a syllable. The duration information includes in-phoneme position information and in-state position information. The in-phoneme position information includes, for example, the length or proportion measured from the start of the phoneme. The in-state position information includes, for example, the length or proportion measured from the start of the state.

FIG. 3 is a diagram showing an example of the various parameters in the acoustic feature amount output by the second conversion unit 30. Spectral parameters include the mel-cepstrum and line spectrum pairs (LSP). These are sometimes called spectral information. As for the sound source, the fundamental frequency is generally handled as a logarithmic fundamental frequency, and its related parameters include the voiced/unvoiced distinction and an aperiodicity index. These are sometimes called sound source information. Because unvoiced portions have no logarithmic fundamental frequency value, the voiced/unvoiced distinction may, instead of being included in the sound source information, be made by a method such as placing a predetermined constant in the unvoiced portions. In addition to the sound source information and spectral information, the acoustic feature amount in the present embodiment also includes singing expression information.
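The option of marking unvoiced portions with a predetermined constant, rather than carrying a separate voiced/unvoiced flag, can be sketched as follows. The sentinel value `UNVOICED_LOG_F0`, the convention that an input of 0 Hz means unvoiced, and both function names are assumptions chosen for this example; the specification does not give a concrete constant.

```python
import math
from typing import List

# Hypothetical sentinel placed where no log F0 exists (unvoiced frames).
UNVOICED_LOG_F0 = -1.0e10


def encode_log_f0(f0_hz: List[float]) -> List[float]:
    """Map F0 in Hz to log F0, writing the sentinel for unvoiced frames.

    Assumes unvoiced frames are given as 0.0 Hz (an input convention
    chosen for this sketch).
    """
    return [math.log(f) if f > 0.0 else UNVOICED_LOG_F0 for f in f0_hz]


def is_voiced(log_f0: List[float]) -> List[bool]:
    """Recover the voiced/unvoiced distinction from the sentinel alone."""
    threshold = UNVOICED_LOG_F0 * 0.5  # far below any real log F0 value
    return [value > threshold for value in log_f0]


track = encode_log_f0([220.0, 0.0, 440.0])
voiced_flags = is_voiced(track)
```

With this encoding, a single stream carries both the pitch contour and the voiced/unvoiced decision.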

The singing expression information includes the period and amplitude of pitch vibrato and whether it is present, and the period and amplitude of loudness vibrato and whether it is present. Instead of including the presence or absence of pitch vibrato in the singing expression information, that distinction may be made by a method such as placing a predetermined constant in portions without pitch vibrato. Similarly, instead of including the presence or absence of loudness vibrato in the singing expression information, that distinction may be made by a method such as placing a predetermined constant in portions without loudness vibrato.

FIG. 4 is an explanatory diagram for describing the conversion of the first feature amount by a deep neural network. The deep neural network 200 is a network modeled on the learning mechanism of the human cranial nervous system. The deep neural network 200 includes an input layer L1, a plurality of intermediate layers L2, and an output layer L3. The number of intermediate layers L2 can be chosen arbitrarily.

The input layer L1 is the layer to which information is input. The intermediate layers L2 calculate feature amounts based on the information passed from the input layer L1. The output layer L3 outputs a result based on the information passed from the intermediate layers L2. Each layer contains a plurality of nodes.

Conversion by the deep neural network 200 is described next. In the present embodiment, the first conversion unit 20 uses the deep neural network 200 to convert the first feature amount into the second feature amount. When the plurality of parameters included in the first feature amount shown in FIG. 2 are input, the input layer L1 passes them to the intermediate layers L2. In the intermediate layers L2, various operations are applied layer by layer, in stages, to the parameters passed from the input layer L1. In the output layer L3, the finally computed parameters are output as the second feature amount shown in FIG. 3.
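The layer-by-layer computation described above can be sketched as a small feedforward pass applied to one frame of the first feature amount. The layer sizes, the tanh activation, the random initial weights, and all function names are assumptions made for this example; the specification does not state them.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights[out][in], bias[out])


def dense(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def ffnn_forward(x: List[float], layers: List[Layer]) -> List[float]:
    """Input layer -> intermediate layers (tanh) -> linear output layer."""
    h = x
    for index, (weights, bias) in enumerate(layers):
        h = dense(h, weights, bias)
        if index < len(layers) - 1:      # no activation on the output layer
            h = [math.tanh(v) for v in h]
    return h


def random_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Untrained placeholder weights; a real system would load learned ones."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
layers = [random_layer(8, 16, rng), random_layer(16, 16, rng), random_layer(16, 4, rng)]
first_feature = [0.5] * 8                      # one 5 ms frame (hypothetical values)
second_feature = ffnn_forward(first_feature, layers)
```

Because each frame is converted independently of its neighbors, this stage has no time dependence, which is consistent with the later remark that the FFNN allows fast conversion.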

FIG. 5 is a flowchart showing the speech synthesis process using the speech processing apparatus 100 of the present embodiment. First, in step S100, the acquisition unit 10 acquires the first feature amount. Next, in step S110, the first conversion unit 20 converts the first feature amount acquired in step S100 into the second feature amount. Then, in step S120, the second conversion unit 30 converts the second feature amount obtained in step S110 into the acoustic feature amount. Finally, in step S130, the vocoder unit 40 generates a speech waveform using the acoustic feature amount obtained in step S120.
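The flow of steps S100 to S130 can be sketched as a composition of three callables. The function `synthesize`, its signature, and the toy stand-ins below are hypothetical; they only show where the per-frame stage and the per-segment stage sit, not how the actual units are implemented.

```python
from typing import Callable, List

Frame = List[float]


def synthesize(
    first_features: List[Frame],                                # result of S100
    first_conversion: Callable[[Frame], Frame],                 # used in S110
    second_conversion: Callable[[List[Frame]], List[Frame]],    # used in S120
    vocoder: Callable[[List[Frame]], List[float]],              # used in S130
) -> List[float]:
    # S110: convert each first-period frame on its own.
    second_features = [first_conversion(frame) for frame in first_features]
    # S120: convert the whole second-period segment at once.
    acoustic_features = second_conversion(second_features)
    # S130: generate the waveform from the acoustic feature amounts.
    return vocoder(acoustic_features)


# Toy stand-ins so the sketch runs; they are not real models.
def toy_first_conversion(frame: Frame) -> Frame:
    return [2.0 * v for v in frame]


def toy_second_conversion(frames: List[Frame]) -> List[Frame]:
    return [list(f) for f in frames]


def toy_vocoder(frames: List[Frame]) -> List[float]:
    return [f[0] for f in frames]


waveform = synthesize([[1.0], [2.0], [3.0]],
                      toy_first_conversion, toy_second_conversion, toy_vocoder)
```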

FIG. 6 is an explanatory diagram schematically showing the speech synthesis process of FIG. 5. As shown in FIG. 6, in step S110 the first conversion unit 20 converts the first feature amount into the second feature amount using the FFNN, and in step S120 the second conversion unit 30 converts the second feature amount into the acoustic feature amount using the CNN. In the present embodiment, the first period of the first feature amount converted by the first conversion unit 20 is, for example, 5 milliseconds. The second period of the second feature amount converted by the second conversion unit 30 is, for example, 10 seconds. In other words, the second conversion unit 30 bundles 2000 second feature amounts together and converts them using the CNN. In the CNN, the second conversion unit 30 converts the second feature amount into the acoustic feature amount by using it as two-dimensional data D1, in which the data of each dimension of the second feature amount are arranged over the length of the second period. In the present embodiment, the two-dimensional data D1 is data in which 2000 second feature amounts are arranged in time-series order. That is, it is data expressed as [data of each dimension of the second feature amount] × [time]. The second feature amount is not limited to the two-dimensional data D1 and may be expressed as multidimensional data of three or more dimensions.

Because the notion of input data size in a CNN originates in image processing, it has three dimensions: height, width, and number of channels (number of filters). In the present embodiment, the height is 1, the width is the length of the second period, and the number of channels is the number of dimensions of the second feature amount. Inside the CNN there is a part that convolves the second feature amount by convolution. In addition, the CNN may have a part that reduces the size of the columns of the two-dimensional data D1 by convolution, and a part that enlarges the data back to the original number of periods in the second period by deconvolution (fractionally-strided convolution) or transposed convolution.
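To make the data layout concrete, the following sketch arranges per-frame second feature amounts as [dimension] × [time] data and applies a single convolution along the time axis, treating each feature dimension as a channel. The kernel size, zero padding, and function names are assumptions for illustration; the actual network structure is not given at this level of detail.

```python
from typing import List

FIRST_PERIOD_MS = 5
SECOND_PERIOD_MS = 10_000
FRAMES_PER_SEGMENT = SECOND_PERIOD_MS // FIRST_PERIOD_MS   # frames bundled per segment


def to_channels_by_time(frames: List[List[float]]) -> List[List[float]]:
    """Transpose [time][dim] frames into [dim][time] data (the layout of D1)."""
    return [list(column) for column in zip(*frames)]


def conv1d_same(x: List[List[float]],
                kernels: List[List[List[float]]],
                bias: List[float]) -> List[List[float]]:
    """Convolve along time with zero padding, keeping the length unchanged.

    x: [in_channels][T]; kernels: [out_channels][in_channels][K], K odd.
    As in common deep learning libraries, the kernel is not flipped.
    """
    in_channels, length = len(x), len(x[0])
    kernel_size = len(kernels[0][0])
    pad = kernel_size // 2
    output = []
    for o, kernel in enumerate(kernels):
        row = []
        for t in range(length):
            acc = bias[o]
            for c in range(in_channels):
                for k in range(kernel_size):
                    s = t + k - pad
                    if 0 <= s < length:
                        acc += kernel[c][k] * x[c][s]
            row.append(acc)
        output.append(row)
    return output


d1 = to_channels_by_time([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
smoothed = conv1d_same([[1.0, 2.0, 3.0, 4.0]], [[[1.0, 1.0, 1.0]]], [0.0])
```

Each output sample depends on neighboring frames, which is how a convolution over D1 can capture changes in the time direction that a frame-by-frame conversion cannot.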

According to the speech processing apparatus 100 of the present embodiment described above, conversion into the acoustic feature amount is performed for each second period, which is longer than the predetermined first period that is the time unit in which the first feature amount, such as the score feature amount, is expressed. Smooth and natural speech can therefore be synthesized from this acoustic feature amount. Further, the second conversion unit 30 converts the second feature amount into the acoustic feature amount by using it as two-dimensional data D1, in which the data of each dimension of the second feature amount are arranged for the number of first periods contained in the second period, so changes in the time direction can be handled effectively. More specifically, compared with, for example, using the second feature amount as one-dimensional data in which the data of every dimension for the second period are lined up, the change of each dimension's data in the time direction can be learned more effectively. Further, because the second conversion unit 30 performs the conversion using a CNN, the second feature amount can be converted into the acoustic feature amount with high quality using existing technology.

Further, in the present embodiment, the first conversion unit 20 converts the first feature amount into the second feature amount using the FFNN, so the conversion is fast.

Experimental results:
FIG. 7 is a diagram showing the mean opinion score (MOS) obtained in a subjective evaluation experiment on the generated speech waveforms. In this experiment, the quality of the synthesized speech of four methods was evaluated on a five-point subjective scale: 1: very bad, 2: bad, 3: fair, 4: good, 5: very good. There were 15 subjects, and each subject evaluated 10 phrases per method, taken from the 5 songs used as test data. The speech waveforms of the synthesized speech under evaluation were generated from the same first feature amount for all four methods.
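For reference, a mean opinion score is the arithmetic mean of the individual ratings. The sketch below shows the calculation and the number of ratings per method implied by the setup above; the rating values in the example are made up purely to exercise the function and are not results from the experiment.

```python
from typing import List

SUBJECTS = 15
PHRASES_PER_METHOD = 10
RATINGS_PER_METHOD = SUBJECTS * PHRASES_PER_METHOD   # ratings collected per method


def mean_opinion_score(ratings: List[int]) -> float:
    """Average of ratings on the 1 to 5 scale used in the experiment."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return sum(ratings) / len(ratings)


# Made-up ratings, for illustration only.
example_mos = mean_opinion_score([4, 5, 3, 4])
```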

In Examples 1 and 2, speech waveforms were generated using acoustic feature amounts obtained by converting the first feature amount with the speech processing apparatus 100 of the first embodiment described above; more specifically, acoustic feature amounts that the second conversion unit 30 obtained from the second feature amount using the CNN. In Comparative Examples 1 and 2, speech waveforms were generated using acoustic feature amounts that the second conversion unit 30 obtained from the second feature amount using an FFNN. In Example 1 and Comparative Example 1, the speech waveform was generated from the acoustic feature amounts using an MLSA filter, a conventional vocoder technique; in Example 2 and Comparative Example 2, it was generated from the acoustic feature amounts using WaveNet. As shown in FIG. 7, the scores of Examples 1 and 2, in which the second conversion unit 30 used the CNN, were higher than the scores of Comparative Examples 1 and 2, in which the second conversion unit 30 used the FFNN. In other words, when the second conversion unit 30 performs the conversion using a CNN in accordance with the above embodiment, the second feature amount can be converted into the acoustic feature amount with higher quality.

B. Second embodiment:
FIG. 8 is an explanatory diagram of the CNN in the second embodiment. The CNN of the second embodiment differs from that of the first embodiment in that, as indicated by hatching in FIG. 8, a specific parameter included in the first feature amount is input to the input layer in addition to the second feature amount. Since the configuration of the speech processing apparatus 100 of the second embodiment is the same as that of the first embodiment, its description is omitted.

In the present embodiment, the specific parameter included in the first feature amount is pitch information. “Pitch information” is the logarithmic fundamental frequency of the pitch in the score information. Preferably, the pitch information of silent parts in the first feature amount is interpolated from the pitch information of the first feature amount before and after them on the time axis. Other examples of the parameter include MIDI pitch numbers and phoneme information.

FIG. 9 is an explanatory diagram showing an example of interpolated pitch information. In FIG. 9, the vertical axis indicates the logarithmic fundamental frequency and the horizontal axis indicates time. In FIG. 9, the pitch information of the first feature amount for the n-th note n (n is an integer of 2 or more), which is a silent part, is linearly interpolated to pitch information P1 using the pitch information P0 of the first feature amount for note n−1 and the pitch information P2 of the first feature amount for note n+1. The interpolation of the pitch information is not limited to linear interpolation; other interpolation methods such as spline interpolation or Lagrange interpolation may be applied.
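The linear interpolation described for FIG. 9 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `interpolate_silent_pitch`, the frame-by-frame representation of the contour, the use of `None` to mark silent frames, and the numeric pitch values are all hypothetical.

```python
import math

def interpolate_silent_pitch(log_f0):
    """Fill silent frames (marked None) by linear interpolation between
    the nearest voiced frames before and after them on the time axis.

    Silent runs at the very start or end copy the nearest voiced value;
    this edge-case rule is an assumption, not taken from the text.
    """
    voiced = [i for i, v in enumerate(log_f0) if v is not None]
    if not voiced:
        raise ValueError("no voiced frame to interpolate from")
    out = list(log_f0)
    for i, value in enumerate(log_f0):
        if value is not None:
            continue
        prev = max((j for j in voiced if j < i), default=None)
        nxt = min((j for j in voiced if j > i), default=None)
        if prev is None:
            out[i] = log_f0[nxt]
        elif nxt is None:
            out[i] = log_f0[prev]
        else:
            weight = (i - prev) / (nxt - prev)
            out[i] = (1.0 - weight) * log_f0[prev] + weight * log_f0[nxt]
    return out

p0 = math.log(220.0)  # pitch information P0 of note n-1 (hypothetical value)
p2 = math.log(440.0)  # pitch information P2 of note n+1 (hypothetical value)
contour = [p0, p0, None, None, None, p2, p2]  # note n is silent
filled = interpolate_silent_pitch(contour)
```

In this sketch the middle silent frame lands exactly halfway between P0 and P2 on the logarithmic scale. Swapping in spline or Lagrange interpolation would only change the body of the final `else` branch.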

According to the speech processing apparatus 100 of the present embodiment described above, pitch information, which is a parameter included in the first feature amount, is added to the second feature amount as auxiliary information, so the conversion yields acoustic feature amounts that improve the sound quality of the synthesized speech. Note that the pitch information may be input to an intermediate layer instead of the input layer.
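One way to picture adding pitch information to the second feature amount is to append it as an extra row of the two-dimensional input to the CNN. The sketch below assumes a layout with one row per feature dimension and one column per frame; that layout, the function name `add_pitch_row`, and the sample values are hypothetical and chosen only for illustration.

```python
def add_pitch_row(block, pitch_per_frame):
    """Return a new 2-D block with the pitch contour appended as one
    extra row (rows: feature dimensions, columns: frames)."""
    num_frames = len(block[0])
    if any(len(row) != num_frames for row in block):
        raise ValueError("all rows must span the same number of frames")
    if len(pitch_per_frame) != num_frames:
        raise ValueError("pitch contour must have one value per frame")
    # Copy the rows so the caller's block is left unmodified.
    return [list(row) for row in block] + [list(pitch_per_frame)]

second_feature_block = [[0.1, 0.2, 0.3],   # dimension 0 over 3 frames
                        [0.4, 0.5, 0.6]]   # dimension 1 over 3 frames
pitch = [5.39, 5.42, 5.45]                 # log-F0 per frame (made-up values)
cnn_input = add_pitch_row(second_feature_block, pitch)
```

Feeding the pitch to an intermediate layer instead would amount to performing the same concatenation on that layer's input rather than on the network input.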

C. Other embodiments:
In the above embodiments, the first feature amount acquired by the acquisition unit 10 is a score feature amount. Instead, the acquisition unit 10 may acquire a language feature amount as the first feature amount. The language feature amount is a multidimensional parameter obtained by omitting the score information from the score feature amount shown in FIG. 2 and adding information such as part of speech and accent. According to this form, acoustic feature amounts can be generated for ordinary text-to-speech synthesis rather than for a singing voice. The acquisition unit 10 may also acquire a voice quality feature amount as the first feature amount. The voice quality feature amount is an acoustic feature amount extracted from another person's voice. According to this form, acoustic feature amounts can be generated for voice quality conversion, in which the acoustic feature amounts of one speaker are converted into those of another speaker.

Further, in the above embodiments, the speech processing apparatus 100 converts the first feature amount into acoustic feature amounts through the conversion by the first conversion unit 20 and the conversion by the second conversion unit 30. Instead, the second conversion unit 30 may convert the first feature amount directly into acoustic feature amounts. In this case, the second conversion unit 30 performs the conversion with the CNN, using the first feature amount as two-dimensional data in which the data of each dimension of the first feature amount is arranged over the length of a predetermined period.
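The arrangement of per-frame feature vectors into two-dimensional data can be sketched as below. This is only an illustration: the function name `to_two_dimensional`, the non-overlapping grouping, and the choice to leave the final block shorter when the frames do not divide evenly are assumptions, not details given in the text.

```python
def to_two_dimensional(frames, period_len):
    """Group per-frame feature vectors into 2-D blocks.

    Each block has one row per feature dimension and one column per
    frame, so the data of each dimension is laid out along time.
    """
    if period_len < 1:
        raise ValueError("period_len must be at least 1")
    blocks = []
    for start in range(0, len(frames), period_len):
        chunk = frames[start:start + period_len]
        # Transpose: rows become feature dimensions, columns become frames.
        blocks.append([list(row) for row in zip(*chunk)])
    return blocks

# Five frames of a 3-dimensional feature (made-up values).
frames = [[t, 10 * t, 100 * t] for t in range(5)]
blocks = to_two_dimensional(frames, period_len=2)
```

Each resulting block can then be treated like a single-channel image by a convolutional layer, with the kernel sliding along the time axis.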

Further, in the above embodiments, the first conversion unit 20 converts the first feature amount into the second feature amount using an FFNN. In the FFNN, the first conversion unit 20 may perform dropout, in which the information of randomly or arbitrarily selected nodes in the intermediate layer L2 is not passed on. This improves the robustness of the FFNN.
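Dropout on the outputs of an intermediate layer can be sketched as follows. The function name `dropout`, the drop probability, and the rescaling of surviving outputs (the common "inverted dropout" convention) are assumptions made for this illustration; the text only states that information from selected nodes is not passed on.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    """Zero each node's output with probability `drop_prob` in training.

    Surviving outputs are divided by (1 - drop_prob) so their expected
    sum is unchanged; this rescaling is an assumption of the sketch.
    Outside training the activations pass through untouched.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
hidden = [0.5, 1.0, 1.5, 2.0]  # made-up outputs of intermediate layer L2
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
```

Because a different subset of nodes is silenced on each training step, the network cannot rely on any single node, which is the usual explanation for the robustness gain.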

Further, in the above embodiments, the neural networks of the first conversion unit 20 and the second conversion unit 30 may have a skip structure, in which a path is added that passes a parameter input to a given layer on to the next layer without converting it. This allows the information of a given parameter to be propagated without loss. For example, the FFNN of the first conversion unit 20 may skip the pitch information P0 in the first feature amount, and the CNN of the second conversion unit 30 may convert the second feature amount, including the unconverted pitch information P0, into acoustic feature amounts. In addition, adding a skip structure to the neural network of the second conversion unit 30 allows the information of a given input parameter (for example, the pitch information of the score) to be propagated without loss even when the number of intermediate layers is increased.
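A skip path that carries one parameter past a layer unchanged can be sketched as below. The names `layer_with_skip` and `toy_transform`, the concatenation of the skipped value onto the layer output, and all numeric values are hypothetical; the toy transform merely stands in for a trained layer.

```python
def layer_with_skip(inputs, transform, skip_indices):
    """Apply `transform` to the inputs, then append the inputs at
    `skip_indices` unchanged so they reach the next layer as-is."""
    transformed = list(transform(inputs))
    passed_through = [inputs[i] for i in skip_indices]
    return transformed + passed_through

def toy_transform(values):
    # Hypothetical stand-in for a trained layer: scale, shift, then ReLU.
    return [max(0.0, 0.5 * v - 1.0) for v in values]

# The last entry plays the role of pitch information P0 (made-up value).
first_feature = [0.2, 3.0, 5.39]
second_feature = layer_with_skip(first_feature, toy_transform, skip_indices=[2])
```

Here the final element of `second_feature` is the original pitch value, untouched by the transform, so a later layer still sees it exactly as it was input.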

Further, in the above embodiments, the learning unit 50 may compute dynamic feature amounts, which are first-order or second-order derivatives used to take temporal variation into account, for both the acoustic feature amounts generated by the second conversion unit 30 and the teacher data, compare them, and reflect the learning result in the neural network. Because the relationship among the acoustic feature amounts across the temporal variation from one first period to the next is then taken into account more fully, smooth and natural speech can be synthesized. In the above embodiments, the second conversion unit 30 does not generate dynamic feature amounts, but it may do so. In this case, the vocoder unit 40 can generate the speech waveform by performing parameter generation that takes into account the relationship between the static feature amounts and the dynamic feature amounts included in the acoustic feature amounts. The vocoder unit 40 can thereby correct the static feature amounts in light of the dynamic feature amounts generated by the second conversion unit 30, so smoother and more natural speech can be synthesized. The learning unit 50 can also learn, by supervised machine learning, the relationship between the static feature amounts and the dynamic feature amounts in the acoustic feature amounts, including the dynamic feature amounts generated by the second conversion unit 30.
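Comparing generated and teacher data on both static and dynamic feature amounts can be sketched as follows. The finite-difference windows used for the first and second derivatives, the edge handling by repeating the end frames, the equal weighting of the three error terms, and every function name are assumptions of this illustration rather than details from the text.

```python
def delta(seq):
    """First-order dynamic feature: central difference, edges repeated."""
    n = len(seq)
    return [0.5 * (seq[min(t + 1, n - 1)] - seq[max(t - 1, 0)])
            for t in range(n)]

def delta_delta(seq):
    """Second-order dynamic feature: second difference, edges repeated."""
    n = len(seq)
    return [seq[max(t - 1, 0)] - 2.0 * seq[t] + seq[min(t + 1, n - 1)]
            for t in range(n)]

def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def loss_with_dynamics(predicted, target):
    """Sum of the errors on static, delta and delta-delta trajectories."""
    return (mse(predicted, target)
            + mse(delta(predicted), delta(target))
            + mse(delta_delta(predicted), delta_delta(target)))

target = [0.0, 1.0, 2.0, 3.0]    # teacher trajectory of one dimension (made up)
jittery = [0.0, 1.5, 1.5, 3.0]   # generated trajectory with uneven steps
static_only = mse(jittery, target)
with_dynamics = loss_with_dynamics(jittery, target)
```

The uneven trajectory is penalized more heavily once the dynamic terms are included, which is the mechanism by which such a loss favors smooth output.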

The present invention is not limited to the embodiments described above and can be realized in various configurations without departing from its spirit. For example, the technical features in the embodiments that correspond to the technical features in the forms described in the Summary of the Invention may be replaced or combined as appropriate in order to solve the problems described above or to achieve some or all of the effects described above. Any technical feature that is not described as essential in this specification may be deleted as appropriate.

DESCRIPTION OF SYMBOLS: 10: acquisition unit, 20: first conversion unit, 30: second conversion unit, 40: vocoder unit, 50: learning unit, 60: acoustic model, 100: speech processing apparatus, 200: deep neural network, D1: two-dimensional data, L1: input layer, L2: intermediate layer, L3: output layer

Claims

A speech processing apparatus comprising:
an acquisition unit that acquires a multidimensional first feature amount related to speech;
a first conversion unit that converts the first feature amount into a multidimensional second feature amount for each predetermined first period; and
a second conversion unit that, using a convolutional neural network capable of processing the second feature amount in time series, converts the second feature amount into an acoustic feature amount for generating a speech waveform, for each second period longer than the first period.

The speech processing apparatus according to claim 1,
wherein the second period has a variable length.

The speech processing apparatus according to claim 2,
wherein the second conversion unit changes the length of the second period according to a silent portion in the first feature amount.

The speech processing apparatus according to any one of claims 1 to 3,
wherein the first conversion unit converts the first feature amount into the second feature amount using a feedforward neural network.

The speech processing apparatus according to any one of claims 1 to 4,
wherein the second conversion unit inputs pitch information included in the first feature amount into the convolutional neural network in addition to the second feature amount, and converts them into the acoustic feature amount.

The speech processing apparatus according to claim 5,
wherein the pitch information of a silent portion in the first feature amount is information interpolated from the preceding and following pitch information.

The speech processing apparatus according to any one of claims 1 to 6,
wherein the first feature amount includes at least one of a language feature amount, a score feature amount, and a voice quality feature amount.

The speech processing apparatus according to any one of claims 1 to 7, further comprising:
a vocoder unit that generates a speech waveform using the acoustic feature amount.

The speech processing apparatus according to any one of claims 1 to 8, further comprising:
a learning unit that learns the relationship between the first feature amount and the acoustic feature amount by supervised machine learning and reflects the result in the convolutional neural network.

The speech processing apparatus according to any one of claims 1 to 9,
wherein the second conversion unit converts the second feature amount into the acoustic feature amount using two-dimensional data in which the data of each dimension of the second feature amount is arranged over the length of the second period.

A speech processing method comprising:
an acquisition step of acquiring a multidimensional first feature amount related to speech;
a first conversion step of converting the first feature amount into a multidimensional second feature amount for each predetermined first period; and
a second conversion step of converting, using a convolutional neural network capable of processing the second feature amount in time series, the second feature amount into an acoustic feature amount for generating a speech waveform, for each second period longer than the first period.

A speech processing apparatus comprising:
an acquisition unit that acquires a multidimensional feature amount related to speech; and
a conversion unit that, for each predetermined period, converts the feature amount into an acoustic feature amount for generating a speech waveform using a convolutional neural network,
wherein the conversion unit converts the feature amount into the acoustic feature amount using two-dimensional data in which the data of each dimension of the feature amount is arranged over the length of the period.