JP2020160319A

JP2020160319A - Voice synthesizing device, method and program

Info

Publication number: JP2020160319A
Application number: JP2019060654A
Authority: JP
Inventors: 信行西澤; Nobuyuki Nishizawa
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2019-03-27
Filing date: 2019-03-27
Publication date: 2020-10-01
Anticipated expiration: 2039-03-27
Also published as: JP6993376B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech synthesizer capable of estimating the speech features of a specific speaker with a predetermined accuracy without necessarily requiring a large amount of clean speech, which is difficult to collect in large quantities. SOLUTION: A speech synthesizer includes a first analysis unit 4 that analyzes first speech of a specific speaker, determined to be clean, to obtain first speaker characteristic information; a second analysis unit 5 that analyzes second speech of the specific speaker to obtain second speaker characteristic information; and a prediction unit 7 that predicts the speech features of the specific speaker uttering a designated text by applying a learning model to the first speaker characteristic information, the second speaker characteristic information, and an intermediate representation for speech synthesis of the designated text. SELECTED DRAWING: Figure 1

Description

The present invention relates to a speech synthesizer, method, and program capable of estimating the speech features of a specific speaker with a predetermined accuracy, without necessarily requiring a large amount of clean speech, which is difficult to collect in large quantities.

Speech synthesis is a technique for generating speech artificially. A typical application is text-to-speech (TTS) conversion. In Japanese, for example, the input text to TTS is usually a sentence written in a mixture of kanji and kana, and mapping characters directly to the features of the speech to be synthesized is difficult because the structure of that relationship is extremely complex. An abstracted intermediate representation is therefore used, and conversion proceeds in two stages: (1) from text to the intermediate representation, and (2) from the intermediate representation to speech features. A synthesized speech waveform is then obtained either by generating, through signal processing, a waveform that matches the speech feature information, or by selecting suitable waveforms from a store of waveforms prepared in advance.

(1) Intermediate representation
In what follows, speech synthesis symbols are assumed as the intermediate representation. Speech synthesis symbols can take various forms; one example is a notation that simultaneously describes the phonemes making up an utterance and its prosodic information, expressed mainly as pauses and pitch. In other words, they are symbols that describe spoken language. An example of such symbols is the JEITA (Japan Electronics and Information Technology Industries Association) standard IT-4006, "Symbols for Japanese Text-to-Speech Synthesis" (see Non-Patent Document 1). The speech synthesizer described below is a device that generates the speech waveform corresponding to an input given in such speech synthesis symbols.

(2) Obtaining speech features from the intermediate representation
Consider using such a speech synthesizer to reproduce the voice of a specific speaker (hereinafter, the speaker to be reproduced is called the target speaker). That is, consider generating, from the abstracted intermediate representation described above, synthesized speech that has the vocal characteristics of a specific speaker. The abstracted intermediate representation itself does not contain enough information to describe a speaker's characteristics. The simple approach is therefore to examine, in advance, the correspondence between the physical characteristics of the target speaker's speech (the waveform itself, or speech features extracted from the waveform by signal processing, such as the cepstrum (one way of representing frequency spectrum information) and the fundamental frequency) and the intermediate representation corresponding to that speech; to define, for each speaker of interest, conversion rules from the intermediate representation to speech features; to predict speech features from the intermediate representation using those rules; and to synthesize, by signal processing, the speech waveform corresponding to the prediction.
Because the intermediate representation is abstract and simple, the intermediate representation that directly corresponds to a given time carries only a limited amount of information. More complex conversion rules can be defined by also taking into account the intermediate representation at earlier and later times (not necessarily the immediately adjacent ones). Such information from times other than the target time is generally called context. In what follows, the intermediate representation includes the case where such context is taken into account.
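As an illustration of the context idea described above, the following minimal sketch attaches to each symbol of an intermediate representation the symbols found at a few relative positions. The function name `add_context`, the padding symbol, and the choice of offsets are assumptions made for this example; the source does not specify how context is encoded.

```python
from typing import List, Sequence

PAD = "<pad>"  # hypothetical padding symbol for positions outside the utterance


def add_context(symbols: Sequence[str],
                offsets: Sequence[int] = (-2, -1, 0, 1, 2)) -> List[List[str]]:
    """For each position t, collect the symbols at positions t + k for each offset k.

    Offsets need not be adjacent to t, which matches the remark that context
    is not limited to immediately neighboring symbols.
    """
    n = len(symbols)
    rows = []
    for t in range(n):
        row = []
        for k in offsets:
            j = t + k
            row.append(symbols[j] if 0 <= j < n else PAD)
        rows.append(row)
    return rows
```

Each returned row can then be treated as the context-aware description of one position and passed to whatever conversion rule or model maps it to speech features.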

(3) Learning to determine the conversion rules automatically
Because it is difficult to define by hand all of the complex conversion rules that arise when context is considered, methods that use machine learning, based on decision trees, neural networks, and the like, to determine the conversion rules from the intermediate representation to speech features are widely used. To perform this automatic determination of the conversion rules (also called training) with high accuracy, a large number of pairs of speech data and the intermediate representation describing it, context included, must be used as training data. That is, to reproduce a specific speaker's voice in speech synthesis, one collects a large amount of speech data from the target speaker, annotates all of it with the intermediate representation (by hand, for example), and learns the relationship between the two by machine learning. For example, HMM speech synthesis, a speech synthesis method based on hidden Markov models, uses decision trees to predict from the intermediate representation the parameters of an HMM that models the corresponding speech, and synthesizes speech from the predicted HMM (see Non-Patent Document 2). Many speech synthesis methods that use neural networks based on deep learning (DNNs) predict the corresponding speech features directly from the intermediate representation and synthesize speech from them.

Non-Patent Document 1: "Symbols for Japanese Text-to-Speech Synthesis," JEITA Standard IT-4006, Japan Electronics and Information Technology Industries Association, March 2010.
Non-Patent Document 2: Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, and Satoshi Imai, "Speech Synthesis Based on HMM Using Dynamic Features," IEICE Transactions (D-II), J79-D-II, 12, pp. 2184-2190, Dec. 1996.
Non-Patent Document 3: Najim Dehak, Patrick J. Kenny, Reda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-End Factor Analysis for Speaker Verification," IEEE Transactions on Audio, Speech, and Language Processing 19(4), pp. 788-798, June 2011.

However, training with the existing methods described above requires a large amount of speech data for each specific speaker whose voice is to be reproduced, so the effort needed for training becomes excessive. In practice, it is often difficult to collect a large amount of speech data from the specific speaker to be reproduced by speech synthesis.

The following approach can be considered as a way of addressing this problem of the prior art.

That is, speaker characteristics are described in some form that can be determined from a small amount of speech data, and a model is prepared in advance by using machine learning to learn, from the speech of many speakers, the relationship between the pair of speaker characteristic information and intermediate representation on the one hand and speech features on the other. Synthesized speech imitating a speaker's voice can then be created from as little speech data as is needed to estimate that speaker's characteristic information. (The term "intermediate representation" could be used in a sense that includes speaker characteristic information, but here the intermediate representation is assumed to consist basically of information obtained from text analysis alone.)

Of course, this approach requires the complex relationship among speaker characteristic information, intermediate representation, and speech features to be modeled in advance, which makes training the model more difficult. However, the data used to prepare the model need not include the target speaker's speech, so a highly accurate model can simply be trained beforehand.

On the other hand, estimating speaker characteristic information with high accuracy generally requires clean speech (with little background noise, reverberation, and so on). This is because background noise and reverberation strongly affect the extraction of speaker characteristic information, particularly the analysis of the speech spectrum, and estimating speaker characteristic information from speech containing heavy background noise or reverberation lowers the accuracy of the estimate. That effect ultimately degrades how faithfully the speaker is reproduced.

The speech used for pre-training must likewise be clean, but since it is used when the system is built, it can be collected relatively easily, for example by recording in a soundproof room. By contrast, demanding a large amount of clean speech from the target speaker as well constrains the user's recording environment and is generally undesirable.

In view of the above problems of the prior art, an object of the present invention is to provide a speech synthesizer, method, and program capable of estimating the speech features of a specific speaker with a predetermined accuracy without necessarily requiring a large amount of clean speech, which is difficult to collect in large quantities.

To achieve the above object, the present invention is a speech synthesizer comprising: a first analysis unit that analyzes first speech of a specific speaker, determined to be clean, to obtain first speaker characteristic information; a second analysis unit that analyzes second speech of the specific speaker to obtain second speaker characteristic information; and a prediction unit that applies a learning model to the first speaker characteristic information, the second speaker characteristic information, and an intermediate representation for speech synthesis of a designated text, thereby predicting the speech features of the specific speaker uttering the text. The invention is also a method and a program corresponding to the speech synthesizer.

According to the present invention, the speech features of a specific speaker can be predicted with a predetermined accuracy even when the first speech is not available in large quantities.

A functional block diagram of a speech synthesizer according to an embodiment.
A flowchart of the operation of the speech synthesizer according to an embodiment.
A functional block diagram of a learning device that trains the learning model the prediction unit needs for prediction.
A functional block diagram of a speech synthesizer whose prediction unit is configured according to a first modification.
A functional block diagram of a speech synthesizer whose prediction unit is configured according to a second modification.
A diagram showing the configuration of a general-purpose computer.

FIG. 1 is a functional block diagram of a speech synthesizer 10 according to an embodiment. As shown, the speech synthesizer 10 includes a collection unit 1, an identification unit 2, a designation unit 3, a first analysis unit 4, a second analysis unit 5, an analysis unit 6, a prediction unit 7, and a generation unit 8. FIG. 2 is a flowchart of the operation of the speech synthesizer 10 according to an embodiment. The processing of each unit of the speech synthesizer 10 in FIG. 1 is outlined below while walking through the steps of FIG. 2.

When the flow of FIG. 2 starts, in step S1 the collection unit 1 first collects a large amount of speech data as human utterance data. The identification unit 2 applies speaker identification and speech environment identification to the collected data and stores the identified speech data, and the flow proceeds to step S2.

The collection unit 1 may be implemented, for example, in a smart speaker (a kind of smart home appliance) that supports information retrieval and the operation of networked (smart) home appliances through interactive voice operation, as a component that stores the command speech and the like. Alternatively, it may be implemented as a component that obtains, over a network, speech recorded by such a smart speaker.

From the large amount of speech data collected by the collection unit 1, the identification unit 2 detects each section D(i) (i=1,2,...) in which a person is speaking, identifies for each section the speaker x (x=a,b,c,...) and the speech environment e (e=1 or 2, where e=1 denotes "clean speech" and e=2 denotes "non-clean speech"), and stores the identification results x=x(i) and e=e(i) in association with section D(i).

Any existing method may be used in the identification unit 2 to detect the utterance sections D(i), to identify the speaker x=x(i) in each section D(i), and to identify the speech environment e=e(i) of each section D(i) (clean speech or not). For example, utterance sections may be detected by threshold tests on zero crossings, amplitude, and the like. Alternatively, the user may be asked to indicate whether the speech is clean, for example by operating a switch. Speakers may be identified with existing speaker recognition technology such as i-vectors; for each candidate speaker x=a,b,c,..., speech from which the i-vector or the like is computed should be registered in advance. I-vectors are disclosed in Non-Patent Document 3 cited above. To avoid the effects of temporary voice quality changes caused by physical condition and the like, and of apparent voice quality changes caused by background noise, a criterion stricter than that used for general speaker identification may be set, and speech sections that fail it may be excluded from processing. A method similar to the speaker characteristic information estimation method described later may be used for this criterion.
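The amplitude-based threshold test mentioned above for detecting utterance sections can be sketched as follows. This is only a minimal illustration: the function name `detect_sections`, the frame length, and the threshold value are assumptions, and a practical detector would typically also use zero-crossing counts and some smoothing.

```python
from typing import List, Sequence, Tuple


def detect_sections(samples: Sequence[float],
                    frame_len: int = 160,
                    amp_threshold: float = 0.02) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of runs of frames judged to be speech.

    A frame counts as speech when its mean absolute amplitude is at or above
    amp_threshold (an assumed value). Trailing samples that do not fill a
    whole frame are ignored.
    """
    sections = []
    start = None
    n_frames = len(samples) // frame_len
    for f in range(n_frames):
        frame = samples[f * frame_len:(f + 1) * frame_len]
        level = sum(abs(v) for v in frame) / frame_len
        active = level >= amp_threshold
        if active and start is None:
            start = f * frame_len                    # a section D(i) begins
        elif not active and start is not None:
            sections.append((start, f * frame_len))  # the section ends
            start = None
    if start is not None:
        sections.append((start, n_frames * frame_len))
    return sections
```

Each returned pair plays the role of one section D(i), to which the speaker x(i) and environment e(i) would then be attached.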

For automatic identification of the speech environment e=e(i), a section in which the level of noise other than the speaker's utterance, due to ambient household sounds or other causes, is below a predetermined threshold is identified as e=1 (clean speech), and a section that fails this threshold test and is judged to contain noise at or above that level is identified as e=2 (non-clean speech). Any existing method, such as spectral analysis, may be used to determine whether the noise is at or above the threshold.
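A minimal sketch of this threshold test is given below. It assumes that samples containing only background noise (for example, taken just outside the utterance) are available and measures their level as RMS in decibels; both that measurement and the threshold of -50 dB are assumptions for illustration, since the source leaves the noise estimation method open.

```python
import math
from typing import Sequence

CLEAN = 1      # e = 1: clean speech
NOT_CLEAN = 2  # e = 2: non-clean speech


def rms_db(samples: Sequence[float]) -> float:
    """RMS level in dB relative to full scale (1.0); -inf for silence."""
    if not samples:
        return float("-inf")
    mean_sq = sum(v * v for v in samples) / len(samples)
    return 10.0 * math.log10(mean_sq) if mean_sq > 0 else float("-inf")


def classify_environment(noise_samples: Sequence[float],
                         threshold_db: float = -50.0) -> int:
    """Return e = 1 when the noise level is below the threshold, else e = 2."""
    return CLEAN if rms_db(noise_samples) < threshold_db else NOT_CLEAN
```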

A section in which multiple people are judged to be speaking simultaneously may be excluded from the sections D(i) stored by the identification unit 2. Alternatively, the loudest speaker x may be taken as the speaker identification result and e=2 (non-clean speech) given as the environment identification result.

In one embodiment, the identification unit 2 may additionally identify the text t(i) of the utterance in section D(i) and store it in association with section D(i). Any existing speech recognition method may be used to identify the text t(i).

In step S2, the designation unit 3 receives from a user of the speech synthesizer 10 the designation of the speaker x whose voice is to be synthesized, and the analysis unit 6 receives the designation of the text T to be synthesized; the flow then proceeds to step S3. For the sake of explanation, the designated speaker is assumed below to be "x=a" (a being a speaker identifier).

In step S3, the designation unit 3 first searches the large number of utterance sections D(i) stored in the identification unit 2, each with its identified speaker x(i) and speech environment e(i), and retrieves the speech data of the speaker x=a designated in step S2. It outputs the first speech V1(a) from this data to the first analysis unit 4 and the second speech V2(a) to the second analysis unit 5.

Specifically, among the utterance sections D(i), all those with x(i)=a (identified as uttered by the designated speaker a) and e(i)=1 (identified as clean speech free of noise) are retrieved as the first speech V1(a) and output to the first analysis unit 4. Likewise, all those with x(i)=a and e(i)=2 (identified as non-clean speech mixed with noise) are retrieved as the second speech V2(a) and output to the second analysis unit 5. In set notation, the first speech V1(a) and the second speech V2(a) are as follows.
V1(a) = {D(i) | x(i)=a, e(i)=1}
V2(a) = {D(i) | x(i)=a, e(i)=2}
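The two set definitions above translate directly into a filter over the stored sections. The sketch below assumes a hypothetical `Section` record holding the index i, the speaker label x(i), and the environment label e(i); the record layout is an illustration, not something the source prescribes.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Section:
    index: int    # i
    speaker: str  # x(i)
    env: int      # e(i): 1 = clean speech, 2 = non-clean speech


def split_by_environment(sections: Sequence[Section],
                         speaker: str) -> Tuple[List[Section], List[Section]]:
    """Return (V1, V2) for the given speaker.

    V1 = {D(i) | x(i) = speaker, e(i) = 1}
    V2 = {D(i) | x(i) = speaker, e(i) = 2}
    """
    v1 = [d for d in sections if d.speaker == speaker and d.env == 1]
    v2 = [d for d in sections if d.speaker == speaker and d.env == 2]
    return v1, v2
```

V1 would then be passed to the first analysis unit 4 and V2 to the second analysis unit 5.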

Next in step S3, the first analysis unit 4 analyzes the retrieved first speech V1(a) to obtain the first speaker characteristic information s1(a) of speaker a and outputs it to the prediction unit 7, and the second analysis unit 5 analyzes the retrieved second speech V2(a) to obtain the second speaker characteristic information s2(a) of speaker a and outputs it to the prediction unit 7. The flow then proceeds to step S4.

If the identification unit 2 has also obtained the text t(i) uttered in each section D(i) and associated it with that section, the first analysis unit 4 and the second analysis unit 5 can use this associated text information as well when analyzing the first speech V1(a) and the second speech V2(a) to obtain the first speaker characteristic information s1(a) and the second speaker characteristic information s2(a), respectively.

In addition, to suppress the effect of voice quality changes, the procedure may discard the first speech V1(a) and end processing when the first speaker characteristic information s1(a) produced by the first analysis unit 4 differs greatly from the previous result. The same applies to the second speaker characteristic information s2(a) produced by the second analysis unit 5. That is, on the premise that the collection unit 1 collects speech continuously, the first analysis unit 4 and the second analysis unit 5 may continually analyze the first and second speaker characteristic information with the first speech V1(a)_[m] and the second speech V2(a)_[m] newly collected in each period m (m=1,2,...; for example, one-week periods) added to the data. In doing so, the first speaker characteristic information s1(a)_[m-1] and the second speaker characteristic information s2(a)_[m-1], already obtained from all speech up to the previous period m-1 (excluding discarded speech), are compared with the first speaker characteristic information s1(a)_[m only] and the second speaker characteristic information s2(a)_[m only], obtained using only the first speech V1(a)_[m] and the second speech V2(a)_[m] newly collected in the current period m. If the difference is judged to be at or above a threshold, the first speech V1(a)_[m] and the second speech V2(a)_[m] newly collected in period m are regarded as reflecting a temporary change in voice quality (for example, because the speaker has a cold) and may be removed from the data used to analyze the first and second speaker characteristic information.
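The per-period check described above can be sketched as a distance test between two speaker characteristic vectors. The source does not say how the difference is measured; cosine distance and the threshold of 0.3 used here are assumptions chosen only to make the example concrete, as are the function names.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; raises ValueError for a zero-length vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("speaker vectors must be non-zero")
    return 1.0 - dot / (norm_u * norm_v)


def keep_period(s_prev: Sequence[float],
                s_only_m: Sequence[float],
                threshold: float = 0.3) -> bool:
    """Decide whether the speech collected in period m should be kept.

    s_prev   : information estimated from all speech up to period m-1
    s_only_m : information estimated from period m alone
    Returns False (discard period m) when the difference is at or above
    the threshold, which is treated as a temporary voice quality change.
    """
    return cosine_distance(s_prev, s_only_m) < threshold
```

The same test would be applied separately to the first and the second speaker characteristic information.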

Details of the processing by the first analysis unit 4 and the second analysis unit 5, which forms the latter half of step S3 above, are described later.

In step S4, the analysis unit 6 analyzes the text T designated in step S2 to obtain its intermediate representation im(T) and outputs it to the prediction unit 7; the flow then proceeds to step S5. The analysis may use, for example, a rule-based method for text, and the intermediate representation im(T) may be in a predefined format such as the "Symbols for Japanese Text-to-Speech Synthesis" of Non-Patent Document 1 cited above.

Steps S3 and S4 may be performed in the reverse order, or the two steps may be performed in parallel.

In step S5, the prediction unit 7 takes as input the first speaker characteristic information s1(a), the second speaker characteristic information s2(a), and the intermediate representation im(T) obtained in steps S3 and S4, predicts the speech features f(a,T) of speaker a uttering the text T, and outputs them to the generation unit 8. The flow then proceeds to step S6.

The speech features f(a,T) predicted by the prediction unit 7 may be any existing acoustic features, such as those based on an HMM model. For example, f(a,T) can be obtained as a time series of frames along the text T, each frame being a combination (vector) of static features, consisting of the fundamental frequency and mel-cepstral coefficients of a predetermined order, and dynamic features such as delta parameters.
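To make the combination of static and dynamic features concrete, the sketch below appends delta parameters to each frame of static features. The centered difference with weights (-0.5, 0, 0.5) and the repetition of edge frames are common choices but are assumptions here; the source does not specify the delta window.

```python
from typing import List, Sequence


def delta(static: Sequence[Sequence[float]]) -> List[List[float]]:
    """Delta parameters: 0.5 * (c[t+1] - c[t-1]), with edge frames repeated."""
    n = len(static)
    out = []
    for t in range(n):
        prev_frame = static[max(t - 1, 0)]
        next_frame = static[min(t + 1, n - 1)]
        out.append([0.5 * (b - a) for a, b in zip(prev_frame, next_frame)])
    return out


def with_delta(static: Sequence[Sequence[float]]) -> List[List[float]]:
    """Concatenate static and delta features frame by frame."""
    return [list(s) + d for s, d in zip(static, delta(static))]
```

Here each static frame would hold, for example, the fundamental frequency followed by the mel-cepstral coefficients, and the resulting frame sequence corresponds to one possible form of f(a,T).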

The prediction unit 7 can perform the actual prediction using a model trained in advance; details are given later.

In step S6, the generation unit 8 performs synthesis using as input the speech features f(a,T) obtained by the prediction unit 7 in step S5 and outputs a synthesized speech waveform W(a,T) of speaker a uttering the text T, and the flow of FIG. 2 ends. The waveform is generated by a predetermined waveform generation process suited to the type of speech features f(a,T) predicted by the prediction unit 7. For example, when the speech features are expressed as a combination of the fundamental frequency and mel-cepstral coefficients of a predetermined order, a synthesized speech waveform can be generated by constructing a digital filter whose spectral envelope is determined by the mel-cepstral coefficients and driving that filter with an impulse train whose frequency is the fundamental frequency.
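The source-filter idea in this step can be sketched in a few lines. The example below builds an impulse train at a constant fundamental frequency and passes it through a generic FIR filter. The FIR impulse response `h` is only a stand-in for the filter that would be derived from the mel-cepstral coefficients, and the constant fundamental frequency, the function names, and the parameter values are all assumptions for illustration.

```python
from typing import List, Sequence


def impulse_train(f0_hz: float, sample_rate: int, n_samples: int) -> List[float]:
    """Unit impulses spaced one pitch period apart (period rounded to samples)."""
    period = round(sample_rate / f0_hz)
    if period < 1:
        raise ValueError("f0_hz is too high for this sample rate")
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def fir_filter(x: Sequence[float], h: Sequence[float]) -> List[float]:
    """Direct-form FIR filtering: y[n] = sum_k h[k] * x[n - k]."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, hk in enumerate(h):
            if n - k >= 0:
                acc += hk * x[n - k]
        y.append(acc)
    return y


def synthesize(f0_hz: float, h: Sequence[float],
               sample_rate: int = 16000, n_samples: int = 1600) -> List[float]:
    """Drive the stand-in envelope filter h with an impulse train at f0_hz."""
    return fir_filter(impulse_train(f0_hz, sample_rate, n_samples), h)
```

In an actual system the fundamental frequency and the filter would change frame by frame following f(a,T); this sketch keeps both fixed to show only the excitation-plus-filter structure.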

Each step of FIG. 2 has been described above. The following describes, in this order, the processing of the first analysis unit 4 and the second analysis unit 5 in the latter half of step S3, whose details were deferred, and the model training that enables the prediction unit 7 to make predictions in step S5.

First, the rationale for using the first analysis unit 4 and the second analysis unit 5 separately is explained. The premise is as follows. The clean first speech V1(a) input to the first analysis unit 4 is difficult to obtain because it must be recorded in a quiet, noise-free environment, and the effort and cost of acquisition mean that little of it is expected to be available. By contrast, the non-clean second speech V2(a) input to the second analysis unit 5 is easy to obtain because the recording environment need not be quiet, and it is expected to be available in abundance.

The second voice is assumed to be easily obtainable in large quantities by the collection unit 1, for example as voice commands given to smart home appliances in an everyday noisy environment.

This embodiment actively exploits these preconditions on the available voice data. Speaker property information that can be obtained only from the first voice V1(a), of which there is only a small amount, is obtained as first speaker property information, and speaker property information that can also be obtained from the second voice V2(a), which is available in large quantities, is obtained as second speaker property information. In this way the voice feature f(a,T) of text T for the target speaker a is predicted efficiently.

Regarding speaker property information, the following first and second considerations can be made.

(First consideration)
Among the speaker property information, the spectral features of speech, for example, require clean voice for high-accuracy estimation, because the estimation result is strongly affected by noise, reverberation and the like in the recording environment. However, many spectral features are constrained by the shape of the human vocal organs and by physiological conditions of movement. Therefore, if a model of the acoustic features of human speech production has been learned with high accuracy from the voices of many different speakers, the spectral features of the target speaker can be expected to be estimated with high accuracy even when only a small amount of target speaker voice (the first voice set V1(a)) is available.

(Second consideration)
On the other hand, the speaker-specific pattern of power change over a long duration (several syllables or more), the pattern of fundamental frequency change, and the pattern of change in time-axis features such as syllable duration (hereinafter collectively called prosodic features) are difficult to estimate from a short amount of voice data. This is because they are inherently long-term features and because, in terms of the production mechanism, the speaker has a relatively large degree of freedom in how to vary them. However, compared with the spectral analysis needed to reproduce a voice, power change and fundamental frequency information can be extracted even in an environment containing some noise. In addition, extracting the pattern of change in time-axis features requires time information for phonemes and syllables obtained by speech recognition, but speech recognition itself does not require the spectral accuracy needed for voice quality reproduction. In other words, prosodic features can be estimated even from the input voice of a spoken dialogue system or the like (that is, from voice acquired in a noisy environment, such as the second voice V2(a)).

Based on the first and second considerations above, the first analysis unit 4 obtains, from the small set of clean first voice V1(a), a vector describing the spectral features of speaker a generalized in a text-independent form as the first speaker property information s1(a) (or, in line with the first consideration, a text-independent feature vector for features with characteristics similar to spectral features). The second analysis unit 5 obtains, from the large set of unclean second voice V2(a), a vector describing the prosodic features of speaker a generalized in a text-independent form as the second speaker property information s2(a) (or, in line with the second consideration, a text-independent feature vector for features with characteristics similar to prosodic features).

The first speaker property information and the second speaker property information may each be composed directly of physical feature data corresponding to a plurality of manually predefined impressions or the like (a multidimensional vector defined by physical quantities such as the mean cepstrum or the mean fundamental frequency). Alternatively, each may be a multidimensional vector defined by a statistical method, such as an i-vector or a bottleneck feature. Here, a bottleneck feature is the value at the bottleneck layer, for a given input, of a neural network that estimates the speaker (for example, speaker ID information represented as a one-hot vector) from voice features and whose network structure has a bottleneck layer (an intermediate layer with fewer units than the input and output layers). A vector combining both kinds may also be used.
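As an illustration of the first option above (physical feature data), the following minimal sketch builds a spectral vector from the mean cepstrum and a prosodic vector from the mean fundamental frequency. The function names, the choice of exactly these two statistics, and the convention that an F0 value of 0 marks an unvoiced frame are assumptions made for the example.

```python
def spectral_vector(cepstra):
    """Mean cepstrum over all frames.

    cepstra is a list of per-frame coefficient lists of equal length.
    """
    n_frames = len(cepstra)
    dim = len(cepstra[0])
    return [sum(frame[d] for frame in cepstra) / n_frames for d in range(dim)]


def prosodic_vector(f0_per_frame):
    """Mean fundamental frequency over voiced frames only.

    Frames with F0 == 0 are treated as unvoiced (assumption of this sketch).
    """
    voiced = [v for v in f0_per_frame if v > 0.0]
    mean_f0 = sum(voiced) / len(voiced) if voiced else 0.0
    return [mean_f0]


s1 = spectral_vector([[1.0, 2.0], [3.0, 4.0]])
s2 = prosodic_vector([0.0, 100.0, 140.0])
```

Because both vectors are averages over all available frames, they do not depend on which text was spoken, which matches the text-independent form required of the speaker property information.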

When speaker property information is obtained by a statistical method as described above, the first analysis unit 4 and the second analysis unit 5 that perform this processing (and the feature extraction unit 18 of FIG. 3 described later) are each regarded as a predictor that predicts speaker property information from a relatively small amount of voice, and these predictors are trained in advance on a separate large body of voice data. For example, in the error evaluation used when estimating i-vectors or bottleneck features (for bottleneck features, the error evaluation used when training the entire speaker-estimating neural network), the predictor used by the first analysis unit 4 can be trained with a heavy weight on the spectrum and a light weight on the fundamental frequency, and conversely the predictor used by the second analysis unit 5 can be trained with a light weight on the spectrum and a heavy weight on the fundamental frequency.

In this way, the first analysis unit 4 and the second analysis unit 5 analyze the first voice V1(a) and the second voice V2(a), respectively, by methods corresponding to the predefined first and second speaker property information (for example, methods using the different predictors that result from evaluating the training error differently depending on whether the spectrum or the fundamental frequency is emphasized). The first analysis unit 4 outputs the resulting first speaker property information to the prediction unit 7, and the second analysis unit 5 outputs the second speaker property information to the prediction unit 7.

FIG. 3 is a functional block diagram of the learning device 20, which trains the learning model M that the prediction unit 7 needs in order to perform prediction. As its overall operation, the learning device 20 outputs the learning model M by using a large amount of learning voice V[n] (n=1,2,…) as training data. As illustrated, the learning device 20 includes a processing unit 13, a learning first analysis unit 14, a learning second analysis unit 15, a learning analysis unit 16, a learning unit 17 and a feature extraction unit 18.

Of these, the processing unit 13, the learning first analysis unit 14, the learning second analysis unit 15, the learning analysis unit 16 and the feature extraction unit 18 serve to prepare, from each learning voice V[n] (n=1,2,…), the learning data L[n] (n=1,2,…) used by the learning unit 17. The learning data L[n] consists of the learning first speaker property information s1[n], the learning second speaker property information s2[n], the learning intermediate representation im(T[n]) and the learning voice feature f[n], all described later; that is, L[n]={s1[n], s2[n], im(T[n]), f[n]} (n=1,2,…).

Each learning voice V[n] (n=1,2,…) is prepared in a clean state (a state that would be judged as e=1 (clean voice) if its voice environment were identified by the identification unit 2 of the speech synthesizer 10). The learning voice V[n] is read into the processing unit 13, the learning analysis unit 16 and the feature extraction unit 18.

The processing unit 13 outputs the learning voice V[n] unchanged to the learning first analysis unit 14 as the learning first voice V1[n], and outputs the learning voice V[n] with noise superimposed to the learning second analysis unit 15 as the learning second voice V2[n]. The noise should be superimposed to such a degree that the learning second voice V2[n] would be judged as e=2 (unclean voice) if its voice environment were identified by the identification unit 2 of the speech synthesizer 10. However, when speaker property estimation uses a feature whose noise-induced errors tend to be random, as is seen for example in fundamental frequency estimation, the processing unit 13 may skip the noise superimposition, or the processing unit 13 may be omitted altogether, so that the clean voice is used as is as the learning second voice V2[n].
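The noise superimposition performed by the processing unit 13 can be sketched as follows. This is a minimal illustration under stated assumptions: the noise is white Gaussian, the degree of superimposition is controlled by a target signal-to-noise ratio, and the function name and the value of 10 dB are hypothetical. The description only requires enough noise for the result to be judged unclean.

```python
import math
import random


def add_noise(clean, snr_db, seed=0):
    """Return clean + scaled white Gaussian noise at the requested SNR (dB)."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Noise power needed so that 10*log10(p_clean / p_target) == snr_db.
    p_target = p_clean / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(p_target / p_noise)
    return [x + scale * n for x, n in zip(clean, noise)]


# V1[n]: clean learning voice (here a synthetic tone); V2[n]: noisy copy.
v1 = [math.sin(2.0 * math.pi * 5.0 * t / 200.0) for t in range(200)]
v2 = add_noise(v1, snr_db=10.0)
```

Deriving V2[n] from the same recording as V1[n] gives paired clean and unclean data for every speaker without collecting noisy recordings separately.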

The learning first analysis unit 14 performs the same processing as the first analysis unit 4 on the learning first voice V1[n]: it analyzes each set of voices of the same speaker within the learning first voice V1[n] to obtain the learning first speaker property information s1[n], and outputs it to the learning unit 17. The learning second analysis unit 15 performs the same processing as the second analysis unit 5 on the learning second voice V2[n]: it analyzes each set of voices of the same speaker within the learning second voice V2[n] to obtain the learning second speaker property information s2[n], and outputs it to the learning unit 17. This processing is performed per speaker, and the same s1[n] and s2[n] are output for all voices of the same speaker. (That is, speaker n is assumed to have been identified in advance by a label n assigned manually and/or by automatic identification, the collected voices of speaker n are used as the learning voice V[n], and the learning first analysis unit 14 and the learning second analysis unit 15 output the learning first speaker property information s1[n] and the learning second speaker property information s2[n], respectively.)

The learning analysis unit 16 obtains the text T[n] of the learning voice V[n] by speech recognition, then obtains the intermediate representation im(T[n]) of this text by the same processing as the analysis unit 6 of the speech synthesizer 10, and outputs it to the learning unit 17. Any existing method may be used for the speech recognition in the learning analysis unit 16. When the text T[n] is already associated with the learning voice V[n], for example through a speech corpus, the learning analysis unit 16 may skip obtaining the text T[n] by speech recognition. Alternatively, to avoid the effect of speech recognition errors, the intermediate representation im(T[n]) corresponding to the output of the analysis unit 6 of the speech synthesizer 10 may be created manually by having a person listen to the learning voice.

The feature extraction unit 18 analyzes the learning voice V[n], obtains the learning voice feature f[n] as data along the frame time series of that voice, and outputs it to the learning unit 17.

The voice feature f(a,T) output by the prediction unit 7 of the speech synthesizer 10, described earlier, is of the same kind as the learning voice feature f[n] output by the feature extraction unit 18, because the prediction unit 7 uses the model M trained by the learning device 20. Accordingly, the learning voice feature f[n] may also be obtained by an existing method, as mel-cepstrum coefficients or the like, as already described for the prediction unit 7.

The learning unit 17 performs training using the learning data L[n]={s1[n], s2[n], im(T[n]), f[n]} (n=1,2,…) obtained as described above, and outputs the model M used by the prediction unit 7. Specifically, it trains the model M used by the prediction unit 7 of the speech synthesizer 10 so that the model takes the learning first speaker property information s1[n], the learning second speaker property information s2[n] and the learning intermediate representation im(T[n]) as input and outputs the learning voice feature f[n]. The model M may be a decision tree whose explanatory variables are the set of s1[n], s2[n] and im(T[n]), or a DNN (deep neural network) that takes them as input. By training the model M on a large amount of learning voice V[n] (and the corresponding large amount of intermediate representations im(T[n])) from many speakers n, the prediction unit 7 can be expected to make predictions for an arbitrary speaker a (and an arbitrary text T).
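How one entry of the learning data could be turned into input and target pairs for the model M is sketched below. The sketch assumes that the intermediate representation has already been expanded to one vector per frame and aligned with the frames of f[n]; the description does not state how this alignment is done, and the function name is hypothetical. The model itself (decision tree or DNN) is left out.

```python
def build_training_pairs(s1, s2, im_frames, f_frames):
    """Pair each frame's model input (s1 + s2 + im frame) with its target f frame.

    s1, s2    : speaker property vectors, constant over all frames of a speaker
    im_frames : per-frame intermediate-representation vectors (assumed aligned)
    f_frames  : per-frame voice features extracted from the clean voice
    """
    if len(im_frames) != len(f_frames):
        raise ValueError("im_frames and f_frames must have the same length")
    pairs = []
    for im, f in zip(im_frames, f_frames):
        model_input = list(s1) + list(s2) + list(im)
        pairs.append((model_input, list(f)))
    return pairs


pairs = build_training_pairs(
    s1=[1.0],
    s2=[2.0, 3.0],
    im_frames=[[0.0, 1.0], [1.0, 0.0]],
    f_frames=[[5.0], [6.0]],
)
```

The speaker property vectors are repeated on every frame, so a single model can learn how the same text is realized differently by different speakers.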

Various modifications of the prediction configuration of the prediction unit 7 are possible. Modifications are described below with reference to FIGS. 4 and 5.

FIG. 4 is a functional block diagram of the speech synthesizer 10 in which the prediction unit 7 is configured according to a first modification. Apart from the prediction unit 7 being composed of a first predictor 71 and a second predictor 72 and performing the corresponding processing, it is the same as the speech synthesizer 10 of FIG. 1 and can operate according to the flow of FIG. 2. That is, except for the first predictor 71 and the second predictor 72 of the prediction unit 7 and the data input to and output from them, the operation is the same as described with reference to FIGS. 1 and 2, so duplicate description is omitted.

The first predictor 71 uses the first speaker property information s1(a) obtained from the first analysis unit 4 and the intermediate representation im(T) obtained from the analysis unit 6 to predict the first voice feature f1(a,T), a frame time series of feature values for speaker a uttering text T, and outputs it to the generation unit 8. The second predictor 72 uses the second speaker property information s2(a) obtained from the second analysis unit 5 and the intermediate representation im(T) obtained from the analysis unit 6 to predict the second voice feature f2(a,T), a frame time series of feature values for speaker a uttering text T, and outputs it to the generation unit 8.

Here, the first voice feature f1(a,T) output by the first predictor 71 can be a time series along text T of feature values highly correlated with the first speaker property information s1(a), for example spectral feature values. Likewise, the second voice feature f2(a,T) output by the second predictor 72 can be a time series along text T of feature values highly correlated with the second speaker property information s2(a), for example feature values related to the fundamental frequency.

The generation unit 8 combines the first voice feature f1(a,T) and the second voice feature f2(a,T) to obtain the voice feature f(a,T). (When each frame of f1(a,T) is P-dimensional and each frame of f2(a,T) is Q-dimensional, they are merged into a (P+Q)-dimensional vector.) The synthesized speech waveform W(a,T) can then be obtained from this voice feature f(a,T) by the same processing as in the embodiment of FIG. 1.
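The frame-wise merging of a P-dimensional f1(a,T) and a Q-dimensional f2(a,T) into a (P+Q)-dimensional f(a,T) can be sketched as follows. The function name is hypothetical, and the sketch assumes both predictors emit the same number of frames.

```python
def combine_features(f1_frames, f2_frames):
    """Concatenate the P-dim and Q-dim feature vectors of each frame."""
    if len(f1_frames) != len(f2_frames):
        raise ValueError("f1 and f2 must cover the same number of frames")
    return [list(a) + list(b) for a, b in zip(f1_frames, f2_frames)]


# P = 2 spectral values and Q = 1 fundamental-frequency value per frame.
f = combine_features([[0.1, 0.2], [0.3, 0.4]], [[120.0], [125.0]])
```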

As for training, the model M1 used by the first predictor 71 and the model M2 used by the second predictor 72 can be obtained in almost the same way as in FIG. 3, as follows. Using a large amount of clean or unclean learning voice V[n] (and the corresponding text T[n]), the first speaker property information s1[n], the second speaker property information s2[n], the intermediate representation im(T[n]), the first voice feature f1[n] and the second voice feature f2[n] are prepared as training data. The model M1 is trained on the clean learning voice V[n] to take s1[n] and im(T[n]) as input and output f1[n], and the model M2 is trained on the unclean learning voice V[n] to take s2[n] and im(T[n]) as input and output f2[n]. Machine learning (including deep learning) may be used for this training as well. This method assumes that s1[n] and f2[n], and likewise s2[n] and f1[n], are mutually independent, so it is considered disadvantageous when a large amount of training data is available. However, because the models M1 and M2, each smaller than the model M of the prediction unit 7 in FIG. 1, can be trained independently, it is effective when the amount of training data is limited.

FIG. 5 is a functional block diagram of the speech synthesizer 10 in which the prediction unit 7 is configured according to a second modification. Apart from the prediction unit 7 being composed of a third predictor 73 and a fourth predictor 74 and performing the corresponding processing, it is the same as the speech synthesizer 10 of FIG. 1 and can operate according to the flow of FIG. 2. That is, except for the third predictor 73 and the fourth predictor 74 of the prediction unit 7 and the data input to and output from them, the operation is the same as described with reference to FIGS. 1 and 2, so duplicate description is omitted.

The embodiment of FIG. 5 can take two forms: with or without the flow of line L1 (the flow that passes the intermediate representation im(T) obtained from the analysis unit 6 to the fourth predictor 74). The case with line L1 is described first.

The third predictor 73 uses the second speaker property information s2(a) obtained from the second analysis unit 5 and the intermediate representation im(T) obtained from the analysis unit 6 to predict the second voice feature f2(a,T), a frame time series of feature values for speaker a uttering text T, and outputs it to the fourth predictor 74. The fourth predictor 74 uses the first speaker property information s1(a) obtained from the first analysis unit 4, the second voice feature f2(a,T) obtained from the third predictor 73, and the intermediate representation im(T) obtained from the analysis unit 6 to predict the voice feature f(a,T), a frame time series of feature values for speaker a uttering text T, and outputs it to the generation unit 8.

Here, the second voice feature f2(a,T) output by the third predictor 73 can be a time series along text T of feature values highly correlated with the second speaker property information s2(a). The voice feature f(a,T) output by the fourth predictor 74, on the other hand, must be a time series along text T of feature values that include all features needed for speech synthesis.

As for training, the model M3 used by the third predictor 73 and the model M4 used by the fourth predictor 74 can be obtained in almost the same way as in FIG. 3, as follows. Using a large amount of clean or unclean learning voice V[n] (and the corresponding text T[n]), the first speaker property information s1[n], the second speaker property information s2[n], the intermediate representation im(T[n]), the first voice feature f1[n] and the second voice feature f2[n] are prepared as training data.

The model M3 is trained on the unclean learning voice V[n] to take the second speaker property information s2[n] and the intermediate representation im(T[n]) as input and output the second voice feature f2[n]. The model M4 can be trained to take the first speaker property information s1[n], the second voice feature f2[n] (obtained from unclean voice) and the intermediate representation im(T[n]) as input and output the voice feature f[n] (the combination, as vector elements, of the first voice feature f1[n] obtained from clean voice and the second voice feature f2[n] obtained from unclean voice). Machine learning (including deep learning) may be used for this training as well. When the second voice feature f2[n] includes the fundamental frequency of the voice, this configuration aims to model the influence of the fundamental frequency on the speech spectrum directly, rather than only through the indirect information carried by the intermediate representation im(T[n]).

The embodiment of FIG. 5 can also operate when line L1 is omitted, as follows.

The third predictor 73 uses the second speaker property information s2(a) obtained from the second analysis unit 5 and the intermediate representation im(T) obtained from the analysis unit 6 to predict the second voice feature f2(a,T), a frame time series of feature values for speaker a uttering text T, and outputs it to the fourth predictor 74. The fourth predictor 74 uses the first speaker property information s1(a) obtained from the first analysis unit 4 and the second voice feature f2(a,T) obtained from the third predictor 73 to predict the voice feature f(a,T), a frame time series of feature values for speaker a uttering text T, and outputs it to the generation unit 8.

In this case, not only the voice feature f(a,T) output by the fourth predictor 74 but also the second voice feature f2(a,T) output by the third predictor 73 must be a time series along text T of feature values that include all features needed for speech synthesis.

The model M3 is trained on the unclean learning voice V[n] to take the second speaker property information s2[n] and the intermediate representation im(T[n]) as input and output the second voice feature f2[n]. The model M4 can be trained to take the first speaker property information s1[n] and the second voice feature f2[n] (obtained from unclean voice) as input and output the voice feature f[n] (the combination, as vector elements, of the first voice feature f1[n] obtained from clean voice and the second voice feature f2[n] obtained from unclean voice). Machine learning may be used for this training as well. When unclean voice is available in even larger quantities, or when, for example, the background noise level is low enough that spectrum-related speaker property information can also be obtained with high accuracy from unclean voice, the prediction of voice features that depends heavily on the intermediate representation im(T[n]) can be performed with high accuracy by the third predictor 73 alone. Because the input of the fourth predictor 74 then does not include im(T[n]), the input-output relationship becomes simpler and the model M4 becomes easier to train.
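The data flow of the second modification, with and without line L1, can be summarized in a short sketch. The models are represented as plain callables; `model_m3`, `model_m4` and the flag `use_line_l1` are hypothetical names, and nothing here reflects how the predictors are actually implemented.

```python
def predict_cascade(model_m3, model_m4, s1, s2, im, use_line_l1=True):
    """Two-stage prediction: M3 yields f2, then M4 yields the full feature f.

    With line L1 the intermediate representation is also passed to M4;
    without it, M4 sees only s1 and the f2 predicted by M3.
    """
    f2 = model_m3(s2, im)
    if use_line_l1:
        return model_m4(s1, f2, im)
    return model_m4(s1, f2)
```

In the variant without line L1, every text-dependent cue reaches M4 only through f2, which is why f2 must then carry all features needed for synthesis.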

The description above assumes that speaker property information is obtained per speaker, but a speaker here need not correspond to an actual person. For example, utterances by the same person in different emotional expressions or styles may be treated as voices of different speakers, which makes it possible to synthesize speech with various emotions and styles by the same method. Conversely, different speakers whose speaker property information is similar may be treated as the same speaker, which increases the data available for estimating the speaker property information and improves estimation accuracy. Here, similarity of speaker property information can be defined, for example, as the distance between the vectors representing the speaker property information being no greater than a predetermined value.
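The distance-based definition of similar speakers can be illustrated with a minimal sketch. Euclidean distance, the greedy assignment to the first matching representative, the function name and the threshold value are all assumptions of the example; the description only requires that the distance be no greater than a predetermined value.

```python
import math


def group_similar_speakers(vectors, threshold):
    """Map each speaker ID to the representative of its group.

    vectors   : dict of speaker ID -> speaker property vector
    threshold : maximum Euclidean distance for two speakers to be merged
    """
    representatives = []
    assignment = {}
    for speaker_id, vec in vectors.items():
        for rep in representatives:
            if math.dist(vec, vectors[rep]) <= threshold:
                assignment[speaker_id] = rep
                break
        else:
            # No existing group is close enough: start a new one.
            representatives.append(speaker_id)
            assignment[speaker_id] = speaker_id
    return assignment


groups = group_similar_speakers(
    {"a": [0.0, 0.0], "b": [0.1, 0.0], "c": [5.0, 5.0]},
    threshold=0.5,
)
```

Voices of all speakers mapped to the same representative can then be pooled when estimating that group's speaker property information.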

FIG. 6 shows the hardware configuration of a general computer device 50; the voice synthesizer 10 can be realized as one or more computer devices 50 having this configuration. The computer device 50 includes: a CPU (central processing unit) 51 that executes predetermined instructions; a dedicated processor 52 (a GPU (graphics processing unit), a processor dedicated to deep learning, or the like) that executes some or all of the instructions of the CPU 51 in place of, or in cooperation with, the CPU 51; a RAM 53 serving as a main storage device that provides a work area for the CPU 51 and the dedicated processor 52; a ROM 54 serving as an auxiliary storage device; a communication interface 55; a display 56; a microphone 57 and a speaker 58; an input interface 59 that consists of a keyboard, mouse, touch panel, or the like and accepts operation input from a user; and a bus B for exchanging data among these components.

Each unit of the voice synthesizer 10 can be realized by the CPU 51 and/or the dedicated processor 52 reading from the ROM 54 and executing a predetermined program corresponding to the function of that unit. When display-related processing is performed, the display 56 also operates in conjunction; when communication-related processing for data transmission and reception is performed, the communication interface 55 also operates in conjunction; when processing related to voice recording is performed, the microphone 57 operates in conjunction; and when processing related to voice playback is performed, the speaker 58 operates in conjunction. For example, the synthesized voice obtained by the generation unit 8 may be played back and output from the speaker 58.

When the voice synthesizer 10 is realized as a system of two or more computer devices 50, the information required for each process may be transmitted and received via a network.

10 ... voice synthesizer, 4 ... first analysis unit, 5 ... second analysis unit, 7 ... analysis unit, 8 ... generation unit

Claims

A voice synthesizer comprising:
a first analysis unit that analyzes a first voice of a specific speaker, the first voice having been judged to be clean, to obtain first speaker information;
a second analysis unit that analyzes a second voice of the specific speaker to obtain second speaker information; and
a prediction unit that predicts a voice feature for when the specific speaker utters a designated text, by applying a learning model to the first speaker information, the second speaker information, and an intermediate representation for voice synthesis of the designated text.

The voice synthesizer according to claim 1, wherein the prediction unit:
predicts a first voice feature for when the specific speaker utters the text, by applying a learning model to the first speaker information and the intermediate representation;
predicts a second voice feature for when the specific speaker utters the text, by applying a learning model to the second speaker information and the intermediate representation; and
obtains the voice feature by combining the first voice feature and the second voice feature.

The voice synthesizer according to claim 1, wherein the prediction unit:
predicts a second voice feature for when the specific speaker utters the text, by applying a learning model to the second speaker information and the intermediate representation; and
obtains the voice feature by applying a learning model to the first speaker information and the second voice feature.

The voice synthesizer according to claim 1, wherein the prediction unit:
predicts a second voice feature for when the specific speaker utters the text, by applying a learning model to the second speaker information and the intermediate representation; and
obtains the voice feature by applying a learning model to the first speaker information, the intermediate representation, and the second voice feature.

The voice synthesizer according to any one of claims 1 to 4, wherein the first speaker information includes information on spectral features, and the second speaker information includes information on prosodic features.

The voice synthesizer according to any one of claims 1 to 5, further comprising a collection unit that collects operation voices directed by the specific speaker at smart home appliances,
wherein the second voice includes the operation voices.

The voice synthesizer according to any one of claims 1 to 6, further comprising a generation unit that synthesizes voice using the voice feature.

The voice synthesizer according to any one of claims 1 to 7, wherein the first analysis unit and/or the second analysis unit continuously acquire additional new first voice and/or second voice, respectively, to obtain the first speaker information and/or the second speaker information, and wherein, when the first speaker information and/or the second speaker information obtained using only the newly acquired first voice and/or second voice is determined by a threshold test to differ from the first speaker information and/or the second speaker information already obtained, the newly acquired first voice and/or second voice is excluded from the data used to obtain the first speaker information and/or the second speaker information.

A voice synthesis method comprising:
a first analysis stage of analyzing a first voice of a specific speaker, the first voice having been judged to be clean, to obtain first speaker information;
a second analysis stage of analyzing a second voice of the specific speaker to obtain second speaker information; and
a prediction stage of predicting a voice feature for when the specific speaker utters a designated text, by applying a learning model to the first speaker information, the second speaker information, and an intermediate representation for voice synthesis of the designated text.

A program that causes a computer to function as the voice synthesizer according to any one of claims 1 to 8.