JP2014026222A - Data generation device and data generation method

Info

Publication number: JP2014026222A
Application number: JP2012168473A
Authority: JP
Inventors: Noriaki Asemi; 典昭阿瀬見
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2012-07-30
Filing date: 2012-07-30
Publication date: 2014-02-06
Anticipated expiration: 2032-07-30
Also published as: JP6003352B2

Abstract

PROBLEM TO BE SOLVED: To diversify emotional expression by a synthesized voice in speech synthesis without increasing the amount of processing required to extract data containing a specified emotion.

SOLUTION: The data generation method comprises: executing clustering on all speech parameters PV (S130); estimating representative tag data TD on the basis of the tag data TG included in each classification cluster CL (S150); deriving, for each classification cluster CL, the difference between the average parameter CPV_A of that cluster and a reference parameter NPV as the representative parameter DPV of the cluster (S190); and generating and storing speech emotion data ET in which the representative tag data TD and the representative parameter DPV are associated with each classification cluster CL (S200).

Description

The present invention relates to a data generation device and a data generation method for generating voice emotion data.

Speech synthesizers that arbitrarily convert the voice quality of the synthesized sound they output are known (see, for example, Patent Document 1). Some speech synthesizers of this type use formant synthesis, in which voice parameters prepared in advance are adjusted to generate a speech waveform and, in turn, a synthesized sound.

When an emotion is to be imparted to a synthesized sound generated by such formant synthesis, the voice parameters must be adjusted on the basis of at least one piece of voice data. Voice data here means a single piece of data in which the content of the emotion a person had when uttering a voice and the voice parameters generated from the waveform of the voice uttered with that emotion are associated one-to-one in advance, for each person who uttered the voice. Such voice data is generally stored in a storage device to form a database.

In such a speech synthesizer, increasing the types of emotion that can be imparted to the synthesized sound requires increasing the types of emotion covered by the voice data and, consequently, the number of pieces of voice data.

Patent Document 1: JP 2004-38071 A

As described above, increasing the number of pieces of voice data stored in the storage device allows a conventional speech synthesizer to impart more types of emotion to the synthesized sound, that is, to diversify the emotional expression of the synthesized sound.

However, conventional voice data associates emotion content with voice parameters one-to-one for each person who uttered the voice. In the conventional technique, therefore, diversifying the types of emotion added to the synthesized sound requires preparing separate voice data for each speaker and storing it in the storage device, so the number of pieces of voice data could become enormous.

When the number of pieces of voice data stored in the storage device becomes enormous, the amount of processing the speech synthesizer needs in order to extract the voice data containing the emotion designated by the user during speech synthesis increases, and the time required to extract the target voice data becomes longer.

An object of the present invention is therefore to diversify the emotional expression of synthesized sound in speech synthesis while suppressing any increase in the amount of processing required to extract data containing the designated emotion.

The present invention, made to achieve the above object, relates to a data generation device.

The data generation device of the present invention includes parameter acquisition means, classification means, tag acquisition means, representative estimation means, parameter determination means, and data generation means.

The parameter acquisition means acquires the voice parameter included in each piece of voice data from a first storage device that stores at least two pieces of voice data. A voice parameter here is at least one feature quantity representing the waveform of a sound uttered by a person. Voice data here is data in which a voice parameter and tag data, which is information including the emotion of the person who uttered the sound represented by that voice parameter, are associated with each other for each person.

The classification means classifies the group of voice parameters acquired by the parameter acquisition means into at least two groups on the basis of the distribution of the voice parameters. With each group produced by the classification means treated as a classification cluster, the tag acquisition means acquires from the first storage device, for each classification cluster, the tag data associated with each voice parameter included in that cluster.

On the basis of the acquired tag data, the representative estimation means estimates, for each classification cluster, at least one piece of representative tag data, which is information including an emotion that represents the cluster. In addition, the parameter determination means determines, for each classification cluster, a representative parameter, which is a voice parameter representing that cluster, on the basis of the voice parameters included in the cluster.

The data generation means generates voice emotion data in which the determined representative parameter and the representative tag data estimated by the representative estimation means are associated with each other for each corresponding classification cluster, and stores the voice emotion data in a second storage device.

According to the data generation device of the present invention, even when the number of types of emotion that can be imparted to the synthesized sound output by a speech synthesizer is the same as before, the amount of voice emotion data stored in the second storage device can be reduced compared with the conventional technique.

As a result, using the second storage device that stores the voice emotion data generated by the data generation device of the present invention reduces the amount of processing the speech synthesizer needs in order to extract the voice emotion data containing the emotion designated by the user during speech synthesis and, in turn, shortens the time required for that extraction.

In other words, the present invention makes it possible to diversify the emotional expression of synthesized sound in speech synthesis while suppressing any increase in the amount of processing required to extract data containing the designated emotion.

In the parameter determination means of the present invention, averaging means may derive, for each classification cluster, an average parameter that is the average of the voice parameters included in that cluster, and difference derivation means may derive, for each classification cluster, the difference between the average parameter derived by the averaging means and a reference parameter, which represents the voice parameter at a prescribed reference, as the representative parameter.

With such a data generation device, the representative parameter can be the difference between the average parameter and the reference parameter.

If speech is synthesized by adjusting voice parameters using such voice emotion data, a synthesized sound carrying an emotion can be generated even when the reference parameter is the only voice parameter available.
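The difference-based representative parameter described above can be illustrated with a minimal Python sketch. The function names, the three-element feature layout, and the numeric values are hypothetical, and the additive adjustment in `apply_emotion` is an assumption about how a synthesizer might use the stored difference; the text only states that voice parameters are adjusted using the voice emotion data.

```python
from typing import List

Vector = List[float]


def representative_parameter(average: Vector, reference: Vector) -> Vector:
    """Difference between a cluster's average parameter and the reference parameter."""
    if len(average) != len(reference):
        raise ValueError("vectors must have the same dimension")
    return [a - r for a, r in zip(average, reference)]


def apply_emotion(reference: Vector, difference: Vector) -> Vector:
    """Shift the reference parameter by a stored difference (assumed additive)."""
    if len(reference) != len(difference):
        raise ValueError("vectors must have the same dimension")
    return [r + d for r, d in zip(reference, difference)]


# Hypothetical feature layout: [F0 in Hz, phoneme length in ms, power]
reference = [120.0, 80.0, 1.0]        # stands in for the reference parameter
cluster_average = [150.0, 70.0, 1.5]  # stands in for one cluster's average parameter

dpv = representative_parameter(cluster_average, reference)
adjusted = apply_emotion(reference, dpv)
```

Storing only the difference means that a synthesizer holding nothing but the reference parameter can still recover an emotional variant by applying the stored offset.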

Further, in the difference derivation means of the present invention, extraction means may extract from the first storage device each voice parameter associated with tag data indicating that the emotion is neutral (the speaker's natural state), and reference derivation means may derive the average of the voice parameters extracted by the extraction means as the reference parameter.

With such a data generation device, the tag data associated with the voice parameters used to derive the reference parameter is tag data indicating a neutral emotion.

If a neutral emotion is taken to be, for example, the emotion of an expressionless delivery such as that of a news program, or the emotion in ordinary conversation, voice parameters associated with such tag data can be collected easily, and the reference parameter can therefore be derived easily.

The present invention may also be embodied as a data generation method.

In that case, the data generation method of the present invention must include a parameter acquisition procedure for acquiring, from the first storage device, the voice parameter included in each piece of voice data, and a classification procedure for classifying the group of acquired voice parameters into at least two groups on the basis of the distribution of the voice parameters. The data generation method must further include a tag acquisition procedure for acquiring from the first storage device, for each classification cluster, the tag data associated with each voice parameter included in that cluster, and a representative estimation procedure for estimating, for each classification cluster and on the basis of the acquired tag data, at least one piece of representative tag data, which is information including an emotion that represents the cluster.

The data generation method must also include a parameter determination procedure for determining, for each classification cluster, a representative parameter, which is a voice parameter representing that cluster, on the basis of the voice parameters included in the cluster, and a data generation procedure for generating voice emotion data in which the determined representative parameter and the representative tag data estimated in the representative estimation procedure are associated with each other for each corresponding classification cluster, and storing the voice emotion data in the second storage device.

Such a data generation method provides the same effects as the data generation device according to claim 1.

FIG. 1 is a block diagram showing the schematic configuration of the speech synthesis system.
FIG. 2 is a flowchart showing the procedure of the voice emotion data generation process.
FIG. 3 is an explanatory diagram outlining the voice emotion data generation process.
FIG. 4 is an explanatory diagram outlining the voice emotion data generation process.
FIG. 5 is a flowchart showing the procedure of the speech synthesis process.

Embodiments of the present invention are described below with reference to the drawings.
<Speech synthesis system>
As shown in FIG. 1, the speech synthesis system 1 outputs speech synthesized from voice parameters PV registered in advance (that is, a synthesized sound) so that speech with the content designated by the user of the system is output. In this speech synthesis, sound properties including the emotion designated by the user are added to the synthesized sound on the basis of voice emotion data ET, which is described in detail later.

To achieve this, the speech synthesis system 1 includes at least one voice input device 10, at least one voice storage server 25, at least one information processing device 30, at least one data storage server 50, and at least one voice output terminal 60.

The voice input device 10 is a device to which voice is input. The voice storage server 25 stores voice data SD in which a voice parameter PV generated from the voice input through the voice input device 10 is associated with tag data TG representing the nature of that voice.

The information processing device 30 generates at least two pieces of voice emotion data ET on the basis of the group of voice data SD stored in the voice storage server 25. The data storage server 50 stores the voice emotion data ET generated by the information processing device 30.

The voice output terminal 60 outputs a synthesized sound produced by speech synthesis based on the voice parameters PV stored in the voice storage server 25 and the voice emotion data ET stored in the data storage server 50.
<Voice input device>
The voice input device 10 includes a communication unit 11, an input receiving unit 12, a display unit 13, a voice input unit 14, a voice output unit 15, a storage unit 17, and a control unit 20. The voice input device 10 may be configured, for example, as a well-known karaoke device or as some other device.

The communication unit 11 allows the voice input device 10 to communicate with external devices via a communication network. The communication network here includes, for example, a public wireless communication network and a network line.
The input receiving unit 12 is an input device that accepts information and commands entered through external operations. This input device includes, for example, keys, switches, and a remote-control receiver.

The display unit 13 is a display device that displays images including at least information represented by character codes. This display device includes, for example, a liquid crystal display or a CRT. The voice input unit 14 is a device that converts sound into an electrical signal and inputs it to the control unit 20, that is, a microphone.

The voice output unit 15 is a device that converts an electrical signal from the control unit 20 into sound and outputs it. The voice output unit 15 may be configured as a sound source module that outputs sound simulating a sound source on the basis of data defined by the MIDI (Musical Instrument Digital Interface) standard. Such a sound source module includes, for example, a MIDI sound source.

The storage unit 17 is a non-volatile storage device whose stored contents can be read and written.
The control unit 20 is built around a well-known computer having at least a ROM 21 that stores processing programs and data that must be retained even when the power is turned off, a RAM 22 that temporarily stores processing programs and data, and a CPU 23 that executes each process (various operations) according to the processing programs stored in the ROM 21 and the RAM 22.
<Voice storage server>
The voice storage server 25 is a device built around a non-volatile storage device whose stored contents can be read and written. The voice storage server 25 is connected to the voice input device 10, the information processing device 30, and the data storage server 50 via a communication network.

The voice storage server 25 stores at least two pieces of voice data SD. Voice data SD is data in which a voice parameter PV_i and tag data TG_i are associated with each other for each speaker. That is, data obtained by attaching an identification number (ID) identifying the speaker to the voice parameter PV_i and the tag data TG_i is generated as the voice data SD.
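One possible in-memory layout for a piece of voice data SD is sketched below in Python. The class names, field names, and sample values are hypothetical illustrations; the text specifies only that a speaker ID, a voice parameter PV_i, and tag data TG_i are associated with one another.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TagData:
    """Stands in for tag data TG_i (field names are assumptions)."""
    emotion: str
    speaker_features: Dict[str, str] = field(default_factory=dict)


@dataclass
class VoiceData:
    """Stands in for one piece of voice data SD."""
    speaker_id: int          # identification number (ID) of the speaker
    parameters: List[float]  # voice parameter PV_i as a feature vector
    tag: TagData             # tag data TG_i


# A store holding at least two pieces of voice data, as the text requires.
store = [
    VoiceData(1, [120.0, 80.0, 1.0], TagData("neutral", {"gender": "female"})),
    VoiceData(2, [150.0, 70.0, 1.5], TagData("joy", {"gender": "male"})),
]


def parameters_of(records: List[VoiceData]) -> List[List[float]]:
    """Collect only the voice parameters, as a later clustering step would need."""
    return [record.parameters for record in records]
```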

A voice parameter PV is prepared for each waveform of a sound uttered by a person and consists of at least one feature quantity representing that speech waveform i. These feature quantities are the speech features used in so-called formant synthesis and are prepared for each speaker. The feature quantities of the voice parameter PV include at least the fundamental frequency F0, mel-frequency cepstrum coefficients (MFCC), phoneme length, and power of each phoneme in the uttered speech, together with their time differences. The feature quantities of the voice parameter PV are prepared for each phoneme.

Methods for deriving the fundamental frequency F0, the MFCC, and the power are well known, so a detailed description is omitted here. For example, the fundamental frequency F0 may be derived using the autocorrelation of each phoneme's speech segment along the time axis, the autocorrelation of the frequency spectrum of each phoneme's speech segment, or the cepstrum method. The MFCC may be derived by applying time analysis windows to each phoneme's speech segment, performing frequency analysis (for example, an FFT) on each window, taking the logarithm of the magnitude at each frequency, and then performing a further frequency analysis on the result. The power may be derived by applying time analysis windows to each phoneme's speech segment and integrating the squared amplitude over time.
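Two of the derivations named above, time-axis autocorrelation for F0 and summed squared amplitude for power, can be sketched in Python as follows. This is a simplified illustration rather than the method of the embodiment: windowing is omitted, the pitch search range (`f_min`, `f_max`) is an assumed default, and the function names are hypothetical.

```python
import math
from typing import List


def power(segment: List[float]) -> float:
    """Sum of squared amplitudes (discrete counterpart of integrating amplitude squared)."""
    return sum(x * x for x in segment)


def estimate_f0(segment: List[float], sample_rate: float,
                f_min: float = 60.0, f_max: float = 400.0) -> float:
    """Return sample_rate / lag for the lag with the largest time-axis autocorrelation."""
    min_lag = max(1, int(sample_rate / f_max))
    max_lag = min(int(sample_rate / f_min), len(segment) - 1)
    if min_lag > max_lag:
        raise ValueError("segment too short for the requested pitch range")
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation at this lag.
        value = sum(segment[n] * segment[n + lag]
                    for n in range(len(segment) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


# Synthetic test segment: 0.1 s of a 100 Hz sine sampled at 8 kHz.
SAMPLE_RATE = 8000.0
segment = [math.sin(2.0 * math.pi * 100.0 * n / SAMPLE_RATE) for n in range(800)]
f0 = estimate_f0(segment, SAMPLE_RATE)
```

Restricting the lag search to a plausible pitch range keeps the estimator from locking onto multiples of the true period.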

Tag data TG is data representing the nature of the sound represented by a voice parameter PV and includes at least speaker feature information, which represents features of the speaker, and emotion information, which represents the speaker's emotion when the voice was uttered. The speaker feature information includes, for example, the speaker's gender and age. In addition to information representing the emotion itself, the emotion information may include information representing the scene, mood, or atmosphere at the time of utterance, and information needed to estimate the speaker's emotion.

The voice data SD may be generated by a person manually analyzing the voice input through the voice input device 10, or by an information processing device such as the voice input device 10 executing a program.

When the voice data SD is generated by executing a program on the voice input device 10 and the voice input device 10 is, for example, a well-known karaoke device, generation may proceed as follows using karaoke data (that is, MIDI data) prepared in advance for karaoke and representing the score of a song. The speech waveform is matched against the karaoke data to extract a speech segment for each syllable or phoneme, and a voice parameter PV is derived from each speech segment.

When the voice input device 10 is assumed to be a well-known karaoke device, one way to associate the voice parameter PV_i and the tag data TG_i for each speaker is to take the ID that the user enters when logging in to the voice input device 10, and that is linked to a song when the song is reserved, as the speaker's identification number and to associate it with the voice parameter PV_i and the tag data TG_i.
<Information processing device>
The information processing device 30 includes a communication unit 31, an input receiving unit 32, a display unit 33, a storage unit 34, and a control unit 40.

The communication unit 31 communicates with external devices via a communication network. The input receiving unit 32 is an input device that accepts information and commands entered through external operations. This input device includes, for example, a keyboard and a pointing device.

The display unit 33 is a display device that displays images. This display device includes, for example, a liquid crystal display or a CRT. The storage unit 34 is a non-volatile storage device whose stored contents can be read and written. This storage device includes, for example, a hard disk drive or flash memory.

The control unit 40 is built around a well-known computer having at least a ROM 41, a RAM 42, and a CPU 43.
The ROM 41 of the information processing device 30 stores a processing program with which the control unit 40 executes the voice emotion data generation process, which generates at least two pieces of voice emotion data ET on the basis of the group of voice data SD stored in the voice storage server 25.

That is, the information processing device 30 functions as the data generation device of the present invention by executing the voice emotion data generation process.
The data storage server 50 is a device built around a non-volatile storage device whose stored contents can be read and written, and is connected to at least the information processing device 30 via a communication network.
<Voice emotion data generation process>
As shown in FIG. 2, when the voice emotion data generation process is started, it first acquires all the voice parameters PV stored in the voice storage server 25 (S110).

Next, the number of clusters k, which is a value of at least 2, is acquired (S120). According to the acquired number of clusters k, clustering is executed to classify all the voice parameters PV acquired in S110 into k clusters (S130). The clustering in S130 may be performed by a well-known method such as k-means. Each cluster produced by the clustering is hereinafter referred to as a classification cluster CL.

That is, as shown in FIG. 3, a classification cluster CL is a group (set) of voice parameters PV that can be regarded as similar in the parameter space.
In S130, furthermore, each voice parameter PV included in a classification cluster CL is stored with an identification code N_{k,j} attached, for each classification cluster CL. Here, j is the index of a voice parameter PV within the classification cluster CL_k.
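The clustering of S130 and the per-cluster indexing can be sketched with a plain k-means (Lloyd's algorithm) in Python. The text names k-means only as one example of a well-known method; the deterministic initialization from the first k points, the iteration cap, the function names, and the sample data are all assumptions made for illustration.

```python
from typing import Dict, List

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points: List[Vector]) -> Vector:
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def k_means(points: List[Vector], k: int, iterations: int = 100) -> List[int]:
    """Return a cluster label for every point (Lloyd's algorithm)."""
    if not 2 <= k <= len(points):
        raise ValueError("k must be at least 2 and at most the number of points")
    centroids = [list(p) for p in points[:k]]  # assumed initialization
    labels = [-1] * len(points)
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean_vector(members)
    return labels


def group_by_cluster(labels: List[int]) -> Dict[int, List[int]]:
    """Map cluster k to the point indices it contains; list position plays the role of j."""
    groups: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    return groups


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0],
          [10.0, 11.0], [1.0, 0.0], [11.0, 10.0]]
labels = k_means(points, k=2)
groups = group_by_cluster(labels)
```

In `groups`, the pair of a cluster key and a list position corresponds to the role the identification code N_{k,j} plays in the text.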

In S120, the number of clusters k may be a value entered by the user, or it may be a value estimated from the spatial distribution of the voice parameters PV in the voice parameter PV group.

In the latter case, one way to estimate k is to use the distribution of the distance ED_i from the origin of the parameter space to each voice parameter PV_i and the distribution of the angle ANG_i formed by a reference vector RV in that space and each voice parameter PV_i. More specifically, k may be estimated as the number of peaks in the distribution of the distance ED_i multiplied by the number of peaks in the distribution of the angle ANG_i.

The distance ED_i may be obtained by equation (1) below, and the angle ANG_i by equation (2) below.

In equation (1), t() denotes the transpose of a vector, that is, the transposed matrix. The reference vector RV is a reference vector with an arbitrary direction.
A peak here is a local maximum of the respective distribution. The number of peaks may be obtained by differentiating the curve represented by the distribution of the distance ED_i or the angle ANG_i and counting the zero crossings of the result.
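The peak-counting estimate of k can be sketched in Python under stated assumptions. Equations (1) and (2) are not reproduced in this text, so the sketch assumes the Euclidean norm sqrt(t(PV_i) PV_i) for ED_i and the arccosine of the normalized inner product of RV and PV_i for ANG_i. The distributions are approximated by fixed-width histograms, and a peak is counted where the bin-to-bin difference changes from positive to negative. The bin count, function names, and sample data are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def norm(v: Vector) -> float:
    return math.sqrt(sum(x * x for x in v))


def angle_to(rv: Vector, p: Vector) -> float:
    """Angle between RV and a non-zero parameter vector (assumed form of equation (2))."""
    cosine = sum(a * b for a, b in zip(rv, p)) / (norm(rv) * norm(p))
    return math.acos(max(-1.0, min(1.0, cosine)))  # clamp rounding error


def histogram(values: List[float], bins: int) -> List[int]:
    low, high = min(values), max(values)
    width = (high - low) / bins or 1.0  # avoid zero width when all values are equal
    counts = [0] * bins
    for v in values:
        counts[min(int((v - low) / width), bins - 1)] += 1
    return counts


def count_peaks(counts: List[int]) -> int:
    """Count local maxima as positive-to-negative sign changes of the difference."""
    padded = [0] + counts + [0]
    peaks, rising = 0, False
    for previous, current in zip(padded, padded[1:]):
        diff = current - previous
        if diff > 0:
            rising = True
        elif diff < 0:
            if rising:
                peaks += 1
            rising = False
    return peaks


def estimate_cluster_count(points: List[Vector], rv: Vector, bins: int = 5) -> int:
    distance_peaks = count_peaks(histogram([norm(p) for p in points], bins))
    angle_peaks = count_peaks(histogram([angle_to(rv, p) for p in points], bins))
    return distance_peaks * angle_peaks


# Two distance groups (about 1 and about 5) and two directions (0 and 90 degrees).
points = [[1.0, 0.0], [1.1, 0.0], [0.0, 1.0], [0.0, 1.1],
          [5.0, 0.0], [5.1, 0.0], [0.0, 5.0], [0.0, 5.1]]
k = estimate_cluster_count(points, rv=[1.0, 0.0])
```

The result depends on the chosen bin count, so in practice the histogram resolution would need tuning to the data.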

Next, the voice emotion data generation process collects, for each classification cluster CL_k, the tag data TG associated with each voice parameter PV (N_{k,j}) included in that cluster (S140).

Then, on the basis of the tag data TG group collected in S140 for each classification cluster CL_k, the tag data TG that represents the cluster is estimated as the representative tag data TD (S150). The representative tag data TD may be estimated by building, for each classification cluster CL_k, a histogram of the tag data TG in its tag data TG group and taking the tag data TG with the highest frequency in that histogram as the representative tag data TD of the cluster.
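The histogram-based estimate of S150 amounts to taking the most frequent tag in each cluster. A minimal Python sketch follows; the tag strings and cluster contents are hypothetical, and tag data is simplified here to a single emotion label.

```python
from collections import Counter
from typing import Dict, List


def representative_tag(tags: List[str]) -> str:
    """Return the most frequent tag; ties go to the tag encountered first."""
    if not tags:
        raise ValueError("a cluster must contain at least one tag")
    return Counter(tags).most_common(1)[0][0]


# Tag data collected per classification cluster (as in S140).
cluster_tags: Dict[int, List[str]] = {
    0: ["joy", "joy", "neutral"],
    1: ["anger", "sadness", "anger", "anger"],
}

representative = {k: representative_tag(tags) for k, tags in cluster_tags.items()}
```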

Furthermore, for each classification cluster CL_k, cluster representative values, which are representative values of the voice parameters PV in that classification cluster CL_k, are derived (S160). The cluster representative values include an average parameter CPV_A_k, which is the mean of the voice parameters PV included in each classification cluster CL_k, and a variance parameter CPV_V_k, which is the variance of the voice parameters PV included in each classification cluster CL_k.

The average parameter CPV_A_k of each classification cluster CL_k is derived according to Equation (3) below, and the variance parameter CPV_V_k of each classification cluster CL_k is derived according to Equation (4) below.
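Equations (3) and (4) are not reproduced in this text. Assuming the ordinary definitions of mean and variance, they would take roughly the following form, where M_k (a symbol introduced here, not taken from the original) is the number of voice parameters in CL_k, PV(N_{k,j}) is the j-th of them, and the square is applied element by element when PV is a vector:

```latex
\begin{align}
CPV\_A_k &= \frac{1}{M_k}\sum_{j=1}^{M_k} PV(N_{k,j}) \tag{3}\\
CPV\_V_k &= \frac{1}{M_k}\sum_{j=1}^{M_k} \bigl(PV(N_{k,j}) - CPV\_A_k\bigr)^{2} \tag{4}
\end{align}
```

Whether the original normalizes the variance by M_k or by M_k - 1 cannot be determined from the surviving text.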

Subsequently, all voice parameters PV that satisfy a specific condition are acquired from the voice storage server 25 as neutral parameters PV (S170). The mean of the acquired neutral parameters PV is then derived as the reference parameter NPV according to Equation (5) below (S180).

Specifically, the specific condition in S170 is that the tag data TG indicates that the emotion is neutral, that is, a natural state. As a result, the reference parameter NPV is a voice parameter that represents a neutral emotion.

The specific condition is not limited to the tag data TG indicating a neutral emotion; it may instead be that no tag data TG is associated with the voice parameter. Alternatively, the specific condition may be such that all voice parameters PV stored in the voice storage server 25 satisfy it.

Subsequently, according to Equation (6) below, for each classification cluster CL_k, the difference between the average parameter CPV_A_k of that classification cluster CL_k and the reference parameter NPV is derived as the representative parameter DPV_k of that classification cluster CL_k (S190).

As shown in FIG. 4, the difference between the average parameter CPV_A_k of a classification cluster CL_k and the reference parameter NPV corresponds, in the space plane, to the distance from the coordinates represented by the reference parameter NPV to the coordinates represented by the respective average parameter CPV_A_k.
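Steps S160 to S190 can be illustrated with a short Python sketch. It assumes that Equation (5) is the arithmetic mean of the neutral parameters and that Equation (6) is the element-wise difference CPV_A_k - NPV, kept as a vector; neither equation is reproduced in this text, so both are assumptions. The function names and the sample values are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise arithmetic mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def representative_parameters(clusters, neutral_params):
    """Return {cluster id: DPV_k}, with DPV_k = CPV_A_k - NPV taken element by element."""
    npv = mean_vector(neutral_params)          # reference parameter NPV (S170, S180)
    result = {}
    for cluster_id, members in clusters.items():
        cpv_a = mean_vector(members)           # average parameter CPV_A_k (S160)
        result[cluster_id] = [a - b for a, b in zip(cpv_a, npv)]  # S190
    return result


clusters = {
    0: [[2.0, 4.0], [4.0, 6.0]],
    1: [[0.0, 0.0], [2.0, 2.0]],
}
neutral = [[1.0, 1.0], [1.0, 3.0]]
dpv = representative_parameters(clusters, neutral)
```

Storing the difference rather than the cluster mean itself means the result can later be applied as an offset to a voice parameter that carries no emotion.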

Furthermore, for each classification cluster CL_k, voice emotion data ET_k is generated in which the representative tag data TD_k, the representative parameter DPV_k, and the variance parameter CPV_V_k of the corresponding classification cluster CL_k are associated with one another, and the voice emotion data ET_k is stored in the data storage server 50 (S200).
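One way to picture the voice emotion data ET_k generated in S200 is as one record per classification cluster. The sketch below is an assumption about layout only: the class name, field names, and sample values are hypothetical, and writing the records to the data storage server 50 is left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class VoiceEmotionData:
    """One ET_k record: representative tag, representative parameter, optional variance."""
    cluster_id: int
    representative_tag: str                       # TD_k
    representative_param: Tuple[float, ...]       # DPV_k
    variance_param: Optional[Tuple[float, ...]] = None  # CPV_V_k, may be omitted


def build_voice_emotion_data(tags, dpvs, variances=None):
    """Assemble one record per cluster id, in ascending cluster id order."""
    records = []
    for cluster_id in sorted(tags):
        variance = tuple(variances[cluster_id]) if variances is not None else None
        records.append(
            VoiceEmotionData(cluster_id, tags[cluster_id], tuple(dpvs[cluster_id]), variance)
        )
    return records


records = build_voice_emotion_data(
    tags={0: "joy", 1: "sadness"},
    dpvs={0: [2.0, 3.0], 1: [0.0, -1.0]},
    variances={0: [1.0, 1.0], 1: [1.0, 1.0]},
)
```

The variance field is optional here because the description later notes that the variance parameter need not be included.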

Thereafter, the voice emotion data generation process ends.
<Voice output terminal>
As shown in FIG. 1, the voice output terminal 60 includes an information receiving unit 61, a display unit 62, a sound output unit 63, a communication unit 64, a storage unit 65, and a control unit 67. The voice output terminal 60 may be configured as, for example, a known portable terminal or a known information processing apparatus. The portable terminal referred to here includes mobile phones and personal digital assistants. The information processing apparatus includes so-called personal computers.

Of these, the information receiving unit 61 receives information input via an input device (not shown). The display unit 62 displays images based on commands from the control unit 67. The sound output unit 63 is a known device that outputs sound and includes, for example, a PCM sound source and a speaker.

The communication unit 64 allows the voice output terminal 60 to exchange information with external devices via a communication network. The storage unit 65 is a non-volatile storage device whose contents can be read and written, and it stores various processing programs and various data.

The control unit 67 is built around a known computer having at least a ROM, a RAM, and a CPU.
<Speech synthesis process>
Next, the speech synthesis process executed by the control unit 67 of the voice output terminal 60 will be described.

This speech synthesis process is started when a start command is input via the information receiving unit 61 of the voice output terminal 60.
As shown in FIG. 5, once started, the speech synthesis process first acquires the information input via the information receiving unit 61 (hereinafter referred to as input information) (S510). The input information acquired in S510 includes, for example, output text representing the content (wording) of the speech to be output as synthesized sound, and output property information representing the properties of the sound to be output as synthesized sound. The output property information corresponds to the tag data TG and includes speaker feature information and emotion information.

Subsequently, the voice parameters PV that correspond to each of the phonemes needed to output the output text acquired in S510 as synthesized sound, and that are associated with the representative tag data TD most similar to the speaker feature information in the output property information acquired in S510, are extracted from the voice storage server 25 (S520).

Furthermore, the voice emotion data ET that includes the representative tag data TD most similar to the emotion information in the output property information acquired in S510 is extracted from the data storage server 50 (S530).

Then, the voice parameters PV extracted in S520 are adjusted based on the voice emotion data ET extracted in S530 so that a synthesized sound matching the input information acquired in S510 is output (S540). Subsequently, speech synthesis is performed based on the voice parameters PV adjusted in S540 (S550). The speech synthesis in S550 uses a known speech synthesis technique based on formant synthesis.
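The text does not spell out how S530 measures similarity or how S540 applies the voice emotion data ET. One simple possibility, shown purely as an assumption, is to pick the record whose representative tag equals the requested emotion and to add its representative parameter DPV to the extracted voice parameter as an offset. The function names, the exact-match lookup, the `strength` factor, and the sample values below are all hypothetical.

```python
def select_emotion_offset(emotion_data, emotion):
    """Return the DPV whose representative tag equals `emotion`.

    Exact matching stands in for the unspecified "most similar" criterion of S530.
    """
    for tag, dpv in emotion_data:
        if tag == emotion:
            return dpv
    raise KeyError(f"no voice emotion data for emotion {emotion!r}")


def adjust_voice_parameter(pv, dpv, strength=1.0):
    """Shift a voice parameter by the emotion offset: PV + strength * DPV (assumed form of S540)."""
    return [p + strength * d for p, d in zip(pv, dpv)]


emotion_data = [("joy", [10.0, -20.0]), ("sadness", [-5.0, 15.0])]
offset = select_emotion_offset(emotion_data, "joy")
adjusted = adjust_voice_parameter([100.0, 200.0], offset, strength=0.5)
```

A scaling factor such as `strength` is one place where a variance parameter could inform fine adjustment, but how the embodiment actually uses it is not described here.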

Furthermore, the synthesized sound generated by the speech synthesis in S550 is output from the sound output unit 63 (S560).
Thereafter, the speech synthesis process ends.
[Effects of the embodiment]
As described above, the information processing apparatus 30 estimates the expression space regions (clusters CL) by statistically processing a certain number of voice parameters PV. Consequently, with the speech synthesis system 1, there is no need to prepare voice parameters PV for every emotion for each speaker.

Therefore, according to the information processing apparatus 30, even when the number of emotion types that can be imparted to the synthesized sound output from the voice output terminal is the same as in the conventional art, the amount of voice emotion data ET stored in the data storage server 50 can be reduced compared with the conventional art.

As a result, when the data storage server 50 storing the voice emotion data ET generated by the information processing apparatus 30 is used, the amount of processing required during speech synthesis to extract the voice emotion data ET containing the emotion specified by the user can be reduced, and the time required to extract that voice emotion data ET can in turn be shortened.

Moreover, the speech synthesis system 1 can reflect the natural variation of human speech for each expression.
In other words, according to the information processing apparatus 30 of the above embodiment, emotional expression by synthesized sound can be diversified during speech synthesis while suppressing any increase in the amount of processing required to extract the data containing the specified emotion.

Moreover, in the above voice emotion data generation process, the difference between the average parameter CPV_A and the reference parameter NPV is generated as the representative parameter DPV, and data in which this representative parameter DPV is associated with the representative tag data TD is used as the voice emotion data ET.

If speech is synthesized after adjusting the voice parameters PV using such voice emotion data ET, a synthesized sound with emotion added can be generated by speech synthesis even when the reference parameter NPV is the only voice parameter PV available.

Furthermore, in the above voice emotion data generation process, the variance parameter CPV_V is included in the voice emotion data ET.
Therefore, according to the speech synthesis process of the above embodiment, the amount by which the voice parameters PV are adjusted during speech synthesis can be finely tuned.

In the above voice emotion data generation process, the tag data TG associated with the voice parameters used to derive the reference parameter is tag data TG indicating that the emotion is neutral.

If this neutral emotion is taken to be, for example, the emotion of an expressionless delivery such as that heard in a news program, or the emotion of ordinary conversation, the voice parameters PV associated with such tag data TG can be collected easily, and the reference parameter can in turn be derived easily.

As described above, according to the speech synthesis system 1, adjusting the voice parameters PV according to the voice emotion data ET and then synthesizing speech makes it possible to generate a synthesized sound to which the emotion in that voice emotion data ET is added, regardless of whether or not the speaker is the original speaker.
[Other embodiments]
An embodiment of the present invention has been described above, but the present invention is not limited to that embodiment and can be implemented in various forms without departing from the gist of the present invention.

For example, in the above embodiment, the voice storage server 25 and the data storage server 50 are configured separately, but their configuration is not limited to this. That is, the voice storage server 25 and the data storage server 50 may be configured as a single common server.

In the above embodiment, the reference vector RV is a reference vector having an arbitrary direction, but the reference vector RV is not limited to this; a vector directed from the origin of the space plane toward the neutral parameter PV may be used as the reference vector RV.

Furthermore, in S130 of the voice emotion data generation process of the above embodiment, the clustering may be executed after screening has been performed using the tag data TG.
Moreover, the clustering executed in S130 is not limited to k-means, and other known clustering methods may be used.

Although the voice emotion data ET in the above embodiment includes the variance parameter CPV_V, the voice emotion data ET need not include the variance parameter CPV_V. That is, in the voice emotion data of the present invention, it is sufficient that at least the representative tag data TD_k and the representative parameter DPV_k are associated with each other for each corresponding classification cluster CL_k.
[Correspondence between the embodiment and the claims]
Finally, the relationship between the description of the above embodiment and the description of the claims will be explained.

S110 in the voice emotion data generation process of the above embodiment corresponds to the parameter acquisition means of the present invention, and S120 and S130 in the voice emotion data generation process correspond to the classification means. S140 in the voice emotion data generation process corresponds to the tag acquisition means of the present invention, and S150 corresponds to the representative estimation means. Further, S160 to S190 in the voice emotion data generation process correspond to the parameter determination means of the present invention, and S200 corresponds to the data generation means.

The voice storage server 25 in the above embodiment corresponds to the first storage device of the present invention, and the data storage server 50 corresponds to the second storage device of the present invention.
S160 in the voice emotion data generation process of the above embodiment corresponds to the averaging means of the present invention, and S190 corresponds to the difference derivation means of the present invention. Further, S170 in the voice emotion data generation process corresponds to the extraction means, and S180 corresponds to the reference derivation means.

[Description of reference signs]
1: Speech synthesis system
10: Voice input device
25: Voice storage server
30: Information processing apparatus
31: Communication unit
32: Input receiving unit
33: Display unit
34: Storage unit
40: Control unit
41: ROM
42: RAM
43: CPU
50: Data storage server
60: Voice output terminal
61: Information receiving unit
62: Display unit
63: Sound output unit
64: Communication unit
65: Storage unit
67: Control unit

Claims

1. A data generation device comprising:
parameter acquisition means for acquiring, from a first storage device storing at least two pieces of voice data, each of which associates a voice parameter that is at least one feature quantity representing the waveform of a sound uttered by a person with tag data that is information including the emotion of the person who uttered the sound represented by that voice parameter, the voice parameter included in each piece of the voice data;
classification means for classifying the group of voice parameters acquired by the parameter acquisition means into at least two groups based on the distribution of the voice parameters;
tag acquisition means for acquiring from the first storage device, with each of the groups classified by the classification means defined as a classification cluster, the tag data associated with each voice parameter included in each classification cluster, for each classification cluster;
representative estimation means for estimating, for each classification cluster and based on the tag data acquired by the tag acquisition means, at least one piece of representative tag data that is information representing an emotion representative of that classification cluster;
parameter determination means for determining, for each classification cluster, a representative parameter that is a voice parameter representative of that classification cluster, based on the voice parameters included in each classification cluster classified by the classification means; and
data generation means for generating voice emotion data in which the representative parameter determined by the parameter determination means and the representative tag data estimated by the representative estimation means are associated with each other for each corresponding classification cluster, and for storing the voice emotion data in a second storage device.

2. The data generation device according to claim 1, wherein the parameter determination means includes:
averaging means for deriving, for each classification cluster, an average parameter that is the average value of the voice parameters included in that classification cluster; and
difference derivation means for deriving, for each classification cluster, as the representative parameter, the difference between the average parameter derived by the averaging means and a reference parameter representing a voice parameter at a prescribed reference value.

3. The data generation device according to claim 2, wherein the difference derivation means includes:
extraction means for extracting, from the first storage device, each voice parameter associated with tag data indicating that the emotion is neutral; and
reference derivation means for deriving the average of the voice parameters extracted by the extraction means as the reference parameter.

4. A data generation method comprising:
a parameter acquisition procedure of acquiring, from a first storage device storing at least two pieces of voice data, each of which associates a voice parameter that is at least one feature quantity representing the waveform of a sound uttered by a person with tag data that is information including the emotion of the person who uttered the sound represented by that voice parameter, the voice parameter included in each piece of the voice data;
a classification procedure of classifying the group of voice parameters acquired in the parameter acquisition procedure into at least two groups based on the distribution of the voice parameters;
a tag acquisition procedure of acquiring from the first storage device, with each of the groups classified in the classification procedure defined as a classification cluster, the tag data associated with each voice parameter included in each classification cluster, for each classification cluster;
a representative estimation procedure of estimating, for each classification cluster and based on the tag data acquired in the tag acquisition procedure, at least one piece of representative tag data that is information including an emotion representative of that classification cluster;
a parameter determination procedure of determining, for each classification cluster, a representative parameter that is a voice parameter representative of that classification cluster, based on the voice parameters included in each classification cluster classified in the classification procedure; and
a data generation procedure of generating voice emotion data in which the representative parameter determined in the parameter determination procedure and the representative tag data estimated in the representative estimation procedure are associated with each other for each corresponding classification cluster, and storing the voice emotion data in a second storage device.