JP5845857B2

JP5845857B2 - Parameter extraction device, speech synthesis system

Info

Publication number: JP5845857B2
Application number: JP2011262297A
Authority: JP
Inventors: 典昭阿瀬見
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2011-11-30
Filing date: 2011-11-30
Publication date: 2016-01-20
Anticipated expiration: 2031-11-30
Also published as: JP2013114191A

Description

The present invention relates to a speech synthesis system that performs speech synthesis, and to a parameter extraction device that extracts from speech the speech parameters required for speech synthesis.

Conventionally, speech synthesizers generate, by speech synthesis, and output speech that renders externally input text with an emotion designated by an external operation (see Patent Document 1).

To achieve this, the speech synthesizer described in Patent Document 1 stores a plurality of emotion expression patterns, each composed of a language attribute vector, an acoustic attribute vector, and an emotion vector. From the stored patterns it extracts the one that expresses the emotion closest to the emotion designated by external input, and it performs speech synthesis on the input text according to the extracted pattern.

In Patent Document 1, the language attribute vector indicates attributes that build the social relationship between the speaker and the listener: emotions such as the likes and dislikes carried by linguistic content, meaning, or concepts; the speaker's attitude, such as a request or a command; and whether the relationship is one between equals, toward a superior, or between master and servant. The acoustic attribute vector indicates the acoustic feature quantities used to express that social relationship, such as average pitch, pitch dynamic range, and degree of glottal opening, which carry emotion, the speaker's attitude, and the kind of relationship. The emotion vector indicates the social relationship between the speaker and the listener: the emotion expressed by the speech as a whole, the speaker's attitude, such as a request or a command, and whether the relationship is one between equals, toward a superior, or between master and servant.

Patent Document 1: JP 2007-183421 A

The synthesized sound output by the speech synthesizer of Patent Document 1, however, is obtained by modifying, according to an emotion expression pattern, the speech parameters of a single standard voice quality prepared in advance.

Consequently, although that speech synthesizer can vary the emotion expressed by its synthesized sound, it is difficult to vary the gender, age, and voice quality of the person who appears to have uttered the sound.

In other words, because the speech synthesizer of Patent Document 1 synthesizes speech from the speech parameters of a single standard voice quality, it has the problem that the characteristics of the speaker of the output sound are difficult to diversify.

An object of the present invention is therefore to diversify the speaker characteristics of synthesized sound generated by speech synthesis.

In the parameter extraction device of the present invention, made to achieve the above object, content information acquisition means acquires utterance content information representing a character string of content to be uttered. Timing information acquisition means acquires utterance timing information that specifies the utterance start timing of at least one character in the character string represented by specific content information, that is, the utterance content information acquired by the content information acquisition means. Waveform acquisition means acquires a target waveform, which is the speech waveform uttered for the character string represented by the specific content information.

Syllable waveform extraction means then extracts syllable waveforms from the target waveform acquired by the waveform acquisition means, based at least on the utterance timing information acquired by the timing information acquisition means. Each syllable waveform is the speech waveform uttered for one of the syllables forming the character string represented by the specific content information. Parameter derivation means further derives, from each extracted syllable waveform, a speech parameter that is at least one predefined feature quantity.

Metadata generation means estimates the nature of the speech represented by the syllable waveforms, based on the specific content information and the utterance timing information acquired by the timing information acquisition means, and generates the estimation result as metadata. Parameter registration means stores the speech parameters derived by the parameter derivation means and the metadata generated by the metadata generation means in a first storage device, associated with each other for each corresponding syllable.

The feature quantities used as speech parameters in the present invention are those required to perform speech synthesis by formant synthesis, and include, for example, the fundamental frequency, mel-frequency cepstral coefficients (MFCC), power, and the time differences of each.
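As an illustration of the kind of feature quantities listed above, the following is a minimal sketch that derives frame power, its time difference, and a fundamental frequency estimate from a waveform. The function names, the non-overlapping framing, the autocorrelation method, and the 80 to 400 Hz search range are assumptions made for this example and are not taken from the patent; MFCC derivation is omitted.

```python
import math


def frame_power(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame (assumed framing)."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers


def time_difference(values):
    """First-order difference between consecutive frame values."""
    return [b - a for a, b in zip(values, values[1:])]


def fundamental_frequency(samples, sample_rate, f_min=80.0, f_max=400.0):
    """Estimate F0 as the autocorrelation peak within [f_min, f_max] (assumed method)."""
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


# Usage: a 200 Hz sine sampled at 8 kHz stands in for a syllable waveform.
tone = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(800)]
f0 = fundamental_frequency(tone, 8000)
powers = frame_power(tone, 200)
power_deltas = time_difference(powers)
```

A production extractor would normally use overlapping, windowed frames and a more robust pitch tracker; the sketch only shows how per-frame values and their time differences relate.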

With such a parameter extraction device, speech parameters of diverse speakers can be derived by deriving speech parameters from target waveforms obtained by having many people utter the content of the character string represented by the utterance content information. As a result, the parameter extraction device of the present invention can diversify the types of speech parameters.

The parameter extraction device of the present invention can also estimate the metadata automatically from the specific content information and the utterance timing information. Unlike conventional speech synthesizers, it therefore removes the need for users to enter the nature of the speech as metadata when they utter the content of the character string represented by the utterance content information.

From the above, performing speech synthesis with the speech parameters extracted by the parameter extraction device of the present invention can diversify the characteristics of the speaker who can be regarded as having uttered the synthesized sound.
The nature of speech in the present invention includes at least the emotion of the speaker when the speech was uttered, and is a concept that also covers, for example, mood and atmosphere. The nature of speech may further include information needed to estimate the emotion.

In the parameter extraction device of the present invention, score data acquisition means acquires score data that represents the score of a target song, which is one song, and that specifies at least the pitch and the performance start timing of each output sound output from a sound source module.

The content information acquisition means then acquires, as the utterance content information, the character string of lyric constituent characters making up the lyrics of the target song. The timing information acquisition means acquires, as the utterance timing information, lyric output timing in which the output timing of at least one lyric constituent character is associated with the performance start timing of the output sound corresponding to that character. In this case, the waveform acquisition means may acquire, as the target waveform, the waveform over time of the speech input while the target song is performed based on the score data, and the syllable waveform extraction means may extract, as the syllable waveforms, the speech waveforms in the sections of the target waveform that correspond to the individual output sounds.
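The extraction of syllable waveforms from the sections of the target waveform that correspond to individual output sounds can be sketched as follows. This is only an illustration: the function name, the representation of each section as a (start, end) pair in seconds from the start of the performance, and the rounding to sample indices are assumptions of this example.

```python
def extract_syllable_waveforms(target_waveform, sample_rate, note_sections):
    """Cut one waveform slice per (start_sec, end_sec) section of an output sound."""
    syllables = []
    for start_sec, end_sec in note_sections:
        # Convert times to sample indices and clamp them to the recording.
        start = max(0, round(start_sec * sample_rate))
        end = min(len(target_waveform), round(end_sec * sample_rate))
        syllables.append(target_waveform[start:end])
    return syllables


# Usage: a 1-second toy recording at 10 samples per second.
recording = list(range(10))
slices = extract_syllable_waveforms(recording, 10, [(0.0, 0.3), (0.5, 1.2)])
```

Sections that run past the end of the recording are truncated rather than rejected, which is a design choice of the sketch.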

With such a parameter extraction device, speech waveforms can be collected using, for example, a karaoke device to which speech is input while the target song is being performed based on the score data.

With such a parameter extraction device, phoneme parameters can be generated from speech sung on a karaoke device or the like.
In general, a song in a major key gives a bright impression, and a song in a minor key gives a sad impression. Lyrics follow the same pattern: songs in a major key tend to have bright lyrics, and songs in a minor key tend to have sad lyrics.

Accordingly, if the target song modulates partway through, the score data of the present invention may include a modulation flag that indicates the position on the time axis of the point of modulation, with the start of performance of the target song as the origin.
In this case, in the metadata generation means of the present invention, section specifying means specifies, based on the score data acquired by the score data acquisition means, the same-key sections, that is, the sections of the target song in which the same key continues. Tonic specifying means specifies as the tonic the last output sound along the time axis in each same-key section specified by the section specifying means. Pitch name frequency derivation means then derives, for each same-key section and starting from the pitch name of the tonic specified by the tonic specifying means, an appearing pitch name frequency that represents how often output sounds of the same pitch name occur in that section. Key estimation means collates each appearing pitch name frequency with key templates prepared in advance for each key, each template representing the distribution of pitch names usable in that key, and sets as the metadata the nature of speech corresponding to the key with the highest correlation.
Further, the key estimation means sets the nature of speech "bright" as the metadata if the key with the highest correlation is a major key, and the nature of speech "dark" as the metadata if it is a minor key.
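The key estimation steps described above (same-key sections, tonic taken from the last output sound, pitch name frequency counted from the tonic, template matching) can be sketched as follows. The binary major and natural-minor templates, the use of Pearson correlation, and all function and variable names are assumptions made for this illustration; the patent text does not specify the template values or the correlation measure.

```python
from bisect import bisect_right
from math import sqrt

# Assumed binary key templates: 1 marks a pitch name usable in the key,
# indexed by semitones above the tonic.
KEY_TEMPLATES = {
    "major": [1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1],
    "minor": [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0],
}
NATURE_OF_KEY = {"major": "bright", "minor": "dark"}


def split_same_key_sections(notes, modulation_times):
    """Group time-sorted (start_time, note_number) pairs by sorted modulation times."""
    sections = [[] for _ in range(len(modulation_times) + 1)]
    for start_time, note_number in notes:
        sections[bisect_right(modulation_times, start_time)].append(note_number)
    return [section for section in sections if section]


def pitch_name_frequency(note_numbers):
    """Count pitch names in semitones above the tonic (the section's last note)."""
    tonic = note_numbers[-1] % 12
    counts = [0] * 12
    for note_number in note_numbers:
        counts[(note_number - tonic) % 12] += 1
    return counts


def correlation(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def estimate_metadata(notes, modulation_times):
    """Return one 'bright' or 'dark' label per same-key section."""
    labels = []
    for section in split_same_key_sections(notes, modulation_times):
        counts = pitch_name_frequency(section)
        best_key = max(KEY_TEMPLATES,
                       key=lambda k: correlation(counts, KEY_TEMPLATES[k]))
        labels.append(NATURE_OF_KEY[best_key])
    return labels


# Usage: a C major phrase followed, after a modulation at t = 11, by an A minor phrase.
c_major = [60, 62, 64, 65, 67, 69, 71, 72, 67, 64, 60]
a_minor = [69, 71, 72, 74, 76, 77, 79, 81, 76, 72, 69]
notes = [(float(i), n) for i, n in enumerate(c_major)]
notes += [(11.0 + i, n) for i, n in enumerate(a_minor)]
labels = estimate_metadata(notes, [11.0])
```

Because the pitch name frequency is counted relative to the tonic, only two templates are needed in this sketch instead of one per key; a note that starts exactly at a modulation time is assigned to the new section.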

With such a parameter extraction device, the key of each same-key section of the target song can serve as metadata, and hence so can the emotion of the speaker when uttering the lyrics corresponding to that section. Moreover, such key specifying means can reliably specify the key of each same-key section.

Further, in the metadata generation means of the present invention, word division means may divide the character string represented by the utterance content information acquired by the content information acquisition means into word characters, each being a character string that constitutes a word, and metadata extraction means may extract from a word property table, as the metadata, the property information corresponding to the word represented by each word character.

With such a parameter extraction device, the property of each word can serve as metadata.
The word property table in the present invention is a table prepared in advance that associates property information representing the property of each word with identification information of that word. The property of a word here includes the meaning of the word and the emotion the word expresses.
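A word property table of this kind can be pictured as a simple lookup from word to property information. The sketch below is illustrative only: the table entries, the function name, and the whitespace-based word division are assumptions of this example, and Japanese lyrics would need a morphological analyzer for the word division step.

```python
# Hypothetical word property table: word -> emotion expressed by the word.
WORD_PROPERTY_TABLE = {
    "sunshine": "joy",
    "goodbye": "sadness",
    "storm": "fear",
}


def extract_word_metadata(text, table):
    """Split text into words and return (word, property) for words in the table."""
    words = text.lower().split()  # stand-in for real word division
    return [(word, table[word]) for word in words if word in table]


# Usage
word_metadata = extract_word_metadata("Goodbye my sunshine", WORD_PROPERTY_TABLE)
```

Words that are missing from the table are skipped here; whether to skip them or assign a neutral property is left open.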

Furthermore, the content information acquisition means of the present invention may acquire, as the utterance content information, sentence constituent characters, which are a character string making up at least one sentence. In this case, the timing information acquisition means may acquire, as the utterance timing information, information representing the output timing at which at least one of the sentence constituent characters is output to the outside, and the waveform acquisition means may acquire, as the target waveform, the waveform over time of the speech input while the character string of the sentence constituent characters is being output based on the utterance content information.

With such a parameter extraction device, the speech waveform produced when a sentence is read aloud can be acquired as the target waveform. That is, the target waveform can be acquired through, for example, a karaoke device that has a so-called after-recording function.

Outputting a character to the outside here includes at least displaying the character.
The present invention may also be embodied as a speech synthesis system.
In this case, the speech synthesis system of the present invention preferably includes a parameter extraction device and a synthesized sound output device.

Of these, the parameter extraction device has the content information acquisition means, the timing information acquisition means, the waveform acquisition means, the syllable waveform extraction means, the parameter derivation means, the metadata generation means, and the parameter registration means. In addition, parameter analysis means may analyze the speech parameters stored in the first storage device for each piece of metadata associated with them, generate a metadata correspondence table representing the range of each speech parameter corresponding to the metadata, and store the table in a second storage device.
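One way to picture the metadata correspondence table is as a mapping from each piece of metadata to the range of every speech parameter registered under it. The following sketch assumes that "range" means the minimum and maximum observed values; that interpretation, the record layout, and the names used are assumptions of this example rather than details given in the text.

```python
from collections import defaultdict


def build_metadata_table(records):
    """Map metadata -> {parameter name: (min, max)} over all registered records.

    records: iterable of (metadata, {parameter_name: value}) pairs.
    """
    table = defaultdict(dict)
    for metadata, parameters in records:
        for name, value in parameters.items():
            low, high = table[metadata].get(name, (value, value))
            table[metadata][name] = (min(low, value), max(high, value))
    return dict(table)


# Usage with made-up per-syllable parameters.
records = [
    ("bright", {"f0": 220.0, "power": 0.5}),
    ("bright", {"f0": 180.0, "power": 0.8}),
    ("dark", {"f0": 120.0, "power": 0.2}),
]
metadata_table = build_metadata_table(records)
```

A statistical summary such as mean and variance would be an equally plausible reading of "range"; the min and max form is used only because it is the simplest.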

In the synthesized sound output device, output property information acquisition means acquires externally input output property information representing the nature of a sound, and wording acquisition means acquires output wording representing externally input wording. Table acquisition means then acquires from the second storage device the metadata correspondence table containing the metadata corresponding to the output property information acquired by the output property information acquisition means, and acquires from the first storage device the speech parameters having information corresponding to the output property information.

Speech synthesis means then synthesizes speech from the speech parameters acquired by the table acquisition means, in accordance with the metadata correspondence table, so as to produce the output wording acquired by the wording acquisition means, and output means outputs the synthesized sound thus generated.
The parameter extraction device includes score data acquisition means that acquires score data representing the score of a target song, which is one song, and specifying at least the pitch and the performance start timing of each output sound output from a sound source module.
The content information acquisition means acquires, as the utterance content information, the character string of lyric constituent characters making up the lyrics of the target song, and the timing information acquisition means acquires, as the utterance timing information, lyric output timing in which the output timing of at least one lyric constituent character is associated with the performance start timing of the output sound corresponding to that character.
The waveform acquisition means acquires, as the target waveform, the waveform over time of the speech input while the target song is performed based on the score data.
Further, the syllable waveform extraction means extracts, as the syllable waveforms, the speech waveforms in the sections of the target waveform that correspond to the individual output sounds. If the target song modulates partway through, the score data includes a modulation flag indicating the position on the time axis of the point of modulation, with the start of performance of the target song as the origin.
The metadata generation means includes: section specifying means that specifies, based on the score data acquired by the score data acquisition means, the same-key sections, that is, the sections of the target song in which the same key continues; tonic specifying means that specifies as the tonic the last output sound along the time axis in each same-key section specified by the section specifying means; pitch name frequency derivation means that derives, for each same-key section and starting from the pitch name of the tonic specified by the tonic specifying means, an appearing pitch name frequency representing how often output sounds of the same pitch name occur in that section; and key estimation means that collates each appearing pitch name frequency with key templates prepared in advance for each key, each template representing the distribution of pitch names usable in that key, and sets as the metadata the nature of speech corresponding to the key with the highest correlation.
The key estimation means sets the nature of speech "bright" as the metadata if the key with the highest correlation is a major key, and the nature of speech "dark" as the metadata if it is a minor key.

With such a speech synthesis system, diverse synthesized sounds can be generated based on the metadata correspondence table and the speech parameters.
That is, the speech synthesis system of the present invention can diversify the characteristics of the speaker who can be regarded as having uttered the synthesized sound.

A block diagram showing the overall configuration of the speech synthesis system.
A flowchart showing the procedure of the speech parameter registration process.
A flowchart showing the procedure of the metadata estimation process in the first embodiment.
A diagram showing the processing content of the metadata estimation process.
A diagram showing the processing content of the metadata estimation process.
A diagram showing the processing content of the metadata estimation process.
A diagram outlining the speech parameter registration process in the first embodiment.
A flowchart showing the procedure of the speech analysis process.
A diagram illustrating a metadata correspondence table.
A flowchart showing the procedure of the speech synthesis process.
A flowchart showing the procedure of the metadata estimation process in the second embodiment.
A diagram outlining the speech parameter registration process in the second embodiment.

Embodiments of the present invention are described below with reference to the drawings.
[First embodiment]
<About the speech synthesis system>
FIG. 1 is a diagram showing the schematic configuration of a speech synthesis system to which the present invention is applied.

The speech synthesis system 1 to which the present invention is applied outputs speech (that is, synthesized sound) synthesized from speech parameters registered in advance, so that speech with the content designated by a user of the speech synthesis system 1 is output.

To achieve this, the speech synthesis system 1 includes a voice input device 10 for inputting speech; a MIDI storage server 25 that stores the speech input through the voice input device 10 (hereinafter, speech waveform data SV) and various data used for karaoke (hereinafter, music data MD); and an information processing device 30 that executes processing for generating at least speech parameters based on the speech waveform data SV and music data MD stored in the MIDI storage server 25. The speech synthesis system 1 further includes a data storage server 50 that stores the speech parameters generated by the information processing device 30, and a voice output terminal 60 that outputs synthesized sound synthesized from the speech parameters stored in the data storage server 50. The speech synthesis system 1 of this embodiment includes a plurality of voice output terminals 60.

That is, in the speech synthesis system 1 of this embodiment, the information processing device 30 generates at least speech parameters PM based on the speech waveform data SV and music data MD stored in the MIDI storage server 25, and stores them in the data storage server 50. The voice output terminal 60 then outputs synthesized sound synthesized from the speech parameters PM stored in the data storage server 50, so that speech with the content designated by the user through that voice output terminal 60 is output.

The speech parameters PM, described in detail later, are feature quantities of speech used for so-called formant synthesis, and include, for example, the fundamental frequency, mel-frequency cepstral coefficients (MFCC), and power at each syllable of the uttered speech, together with their time differences.
<About the MIDI storage server>
First, the MIDI storage server 25 is a device built around a storage device whose contents can be read and written, and is connected to the voice input device 10 via a communication network.

The MIDI storage server 25 stores at least music data MD prepared in advance for each song. The music data MD includes song MIDI data DM (corresponding to the score data in the claims) and a lyrics data group DL, which are associated with each other for each corresponding song.

Of these, the song MIDI data DM is data representing the score of one song in accordance with the well-known MIDI (Musical Instrument Digital Interface) standard, and is prepared in advance for each song. Each piece of song MIDI data DM has at least identification data that distinguishes the song, score tracks representing the score for each instrument used in the song, and a modulation flag representing the time at which the key changes in the song.

Each score track specifies, for each output sound output from the MIDI sound source, at least the pitch (the so-called note number) and the period during which the sound source module outputs the sound (hereinafter, note length). The note length is defined by the performance start timing (the so-called note-on timing), which is the time from the start of performance of the song until output of the sound begins, and the performance end timing (the so-called note-off timing), which is the time from the start of performance of the song until output of the sound ends.
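The relationship between note number, note-on timing, note-off timing, and note length can be sketched with a small record type. The class and field names are hypothetical, and times are expressed in seconds for simplicity, whereas actual MIDI files count time in ticks.

```python
from dataclasses import dataclass

PITCH_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


@dataclass(frozen=True)
class OutputSound:
    """One output sound of a score track (hypothetical record layout)."""
    note_number: int  # MIDI note number, 0 to 127
    note_on: float    # seconds from performance start until the sound begins
    note_off: float   # seconds from performance start until the sound ends

    @property
    def note_length(self) -> float:
        """Duration implied by the note-on and note-off timings."""
        return self.note_off - self.note_on

    @property
    def pitch_name(self) -> str:
        """Pitch name of the note, ignoring the octave."""
        return PITCH_NAMES[self.note_number % 12]


# Usage: the A above middle C, sounding from 1.5 s to 2.0 s.
sound = OutputSound(note_number=69, note_on=1.5, note_off=2.0)
```

Storing the two timings and deriving the length, rather than storing the length itself, mirrors how the text defines note length.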

Score tracks are prepared for each instrument, for example keyboard instruments (such as piano and pipe organ), string instruments (such as violin, viola, guitar, bass guitar, and koto), percussion instruments (such as vibraphone, drums, cymbals, timpani, and xylophone), and wind instruments (such as clarinet, trumpet, flute, and shakuhachi).

The lyrics data group DL, on the other hand, is data on the lyrics displayed on the display device of a well-known karaoke device. It includes lyrics telop data DT, which represents the characters making up the lyrics of the song (hereinafter, lyric constituent characters), and lyrics output data DO, which defines a timing correspondence that associates the lyric output timing, that is, the output timing of the lyric constituent characters, with the performance of the song MIDI data DM.

具体的に、本実施形態におけるタイミング対応関係は、楽曲ＭＩＤＩデータＤＭの演奏を開始するタイミングに、歌詞テロップデータＤＴの出力を開始するタイミングが対応付けられた上で、特定楽曲の時間軸に沿った各歌詞構成文字の歌詞出力タイミングが、楽曲ＭＩＤＩデータＤＭの演奏を開始からの経過時間によって規定されている。なお、ここでいう経過時間とは、例えば、表示された歌詞構成文字の色替えを実行するタイミングを表す時間であり、色替えの速度によって規定されている。また、ここでいう歌詞構成文字は、歌詞を構成する文字の各々であっても良いし、その文字の各々を時間軸に沿った特定の規則に従って一群とした文節やフレーズであっても良い。
〈音声入力装置の構成について〉
次に、音声入力装置１０について説明する。 Specifically, in the timing correspondence of the present embodiment, the timing at which output of the lyrics telop data DT starts is associated with the timing at which performance of the music MIDI data DM starts, and the lyrics output timing of each lyrics constituent character along the time axis of the specific music is defined by the elapsed time from the start of the performance of the music MIDI data DM. The elapsed time here is, for example, a time representing when the color change of a displayed lyrics constituent character is executed, and is defined by the color change speed. A lyrics constituent character may be each individual character of the lyrics, or a clause or phrase formed by grouping those characters according to a specific rule along the time axis.
<About the configuration of the voice input device>
Next, the voice input device 10 will be described.

音声入力装置１０は、通信部１１と、入力受付部１２と、表示部１３と、音声入力部１４と、音声出力部１５と、音源モジュール１６と、記憶部１７と、制御部２０とを備えている。すなわち、本実施形態における音声入力装置１０は、いわゆる周知のカラオケ装置として構成されている。 The voice input device 10 includes a communication unit 11, an input receiving unit 12, a display unit 13, a voice input unit 14, a voice output unit 15, a sound source module 16, a storage unit 17, and a control unit 20. That is, the voice input device 10 in the present embodiment is configured as a so-called well-known karaoke device.

このうち、通信部１１は、通信網（例えば、公衆無線通信網やネットワーク回線）を介して、音声入力装置１０が外部との間で通信を行う。入力受付部１２は、外部からの操作に従って情報や指令の入力を受け付ける入力機器（例えば、キーやスイッチ、リモコンの受付部など）である。 Among these, the communication unit 11 communicates with the outside through the communication network (for example, a public wireless communication network or a network line). The input reception unit 12 is an input device (for example, a key, a switch, a remote control reception unit, or the like) that receives input of information and commands in accordance with external operations.

表示部１３は、少なくとも、文字コードで示される情報を含む画像を表示する表示装置（例えば、液晶ディスプレイやＣＲＴ等）である。また、音声入力部１４は、音を電気信号に変換して制御部２０に入力する装置（いわゆるマイクロホン）である。音声出力部１５は、制御部２０からの電気信号を音に変換して出力する装置（いわゆるスピーカ）である。さらに、音源モジュール１６は、ＭＩＤＩ（ＭｕｓｉｃａｌＩｎｓｔｒｕｍｅｎｔＤｉｇｉｔａｌＩｎｔｅｒｆａｃｅ）規格によって規定されたデータに基づいて、音源からの音を模擬した音（即ち、出力音）を出力する装置（例えば、ＭＩＤＩ音源）である。 The display unit 13 is a display device (for example, a liquid crystal display or a CRT) that displays an image including at least information indicated by a character code. The voice input unit 14 is a device (a so-called microphone) that converts sound into an electrical signal and inputs the signal to the control unit 20. The voice output unit 15 is a device (a so-called speaker) that converts an electrical signal from the control unit 20 into sound and outputs the sound. Furthermore, the sound source module 16 is a device (for example, a MIDI sound source) that outputs a sound simulating the sound of a sound source (that is, an output sound), based on data defined by the MIDI (Musical Instrument Digital Interface) standard.

記憶部１７は、記憶内容を読み書き可能に構成された不揮発性の記憶装置（例えば、ハードディスク装置や、フラッシュメモリ）である。
また、制御部２０は、電源が切断されても記憶内容を保持する必要がある処理プログラムやデータを格納するＲＯＭ２１と、処理プログラムやデータを一時的に格納するＲＡＭ２２と、ＲＯＭ２１やＲＡＭ２２に記憶された処理プログラムに従って各処理（各種演算）を実行するＣＰＵ２３とを少なくとも有した周知のコンピュータを中心に構成されている。 The storage unit 17 is a non-volatile storage device (for example, a hard disk device or a flash memory) configured to be able to read and write stored contents.
The control unit 20 is built around a well-known computer having at least a ROM 21 that stores processing programs and data whose contents must be retained even when the power is turned off, a RAM 22 that temporarily stores processing programs and data, and a CPU 23 that executes each process (various operations) according to the processing programs stored in the ROM 21 and the RAM 22.

そして、ＲＯＭ２１には、周知のカラオケ演奏処理を制御部が実行する処理プログラムや、カラオケ演奏処理によって対象楽曲が演奏されている期間中に、音声入力部１４を介して入力された音声を音声波形データＳＶとして、当該対象楽曲を識別する楽曲識別情報と対応付けて、ＭＩＤＩ格納サーバ２５に格納する音声格納処理を制御部２０が実行する処理プログラムが記憶されている。 The ROM 21 stores a processing program by which the control unit executes a well-known karaoke performance process, and a processing program by which the control unit 20 executes a voice storage process. The voice storage process stores, in the MIDI storage server 25, the voice input via the voice input unit 14 while the target music is being played by the karaoke performance process, as voice waveform data SV associated with music identification information that identifies the target music.

つまり、音声入力装置１０では、カラオケ演奏処理に従って、入力受付部１２を介して指定された一つの楽曲（以下、対象楽曲とする）に対応する音楽データＭＤをＭＩＤＩ格納サーバ２５から取得して、当該音楽データＭＤ中の楽曲ＭＩＤＩデータＤＭに基づいて、対象楽曲を演奏すると共に、当該音楽データＭＤ中の歌詞データ群ＤＬに基づいて対象楽曲の歌詞を表示部１３に表示する。 That is, the voice input device 10 acquires music data MD corresponding to one piece of music (hereinafter referred to as a target song) designated via the input receiving unit 12 from the MIDI storage server 25 according to the karaoke performance process, The target music is played based on the music MIDI data DM in the music data MD, and the lyrics of the target music are displayed on the display unit 13 based on the lyrics data group DL in the music data MD.

さらに、音声入力装置１０では、カラオケ演奏処理の実行中に、音声入力部１４を介して入力された音声を音声波形データＳＶとして、当該対象楽曲を識別する楽曲識別情報（ここでは、音楽データＭＤそのもの）及び音声を入力した人物（以下、発声者とする）を識別する発声者識別情報（以下、発声者ＩＤと称す）と対応付けて、ＭＩＤＩ格納サーバ２５に格納する。なお、ＭＩＤＩ格納サーバ２５に格納される音声波形データＳＶには、発声者の特徴を表す発声者特徴情報も対応付けられており、この発声者特徴情報には、例えば、発声者の性別、年齢などを含む。
〈情報処理装置の構成について〉
次に、情報処理装置３０について説明する。 Furthermore, during execution of the karaoke performance process, the voice input device 10 stores the voice input via the voice input unit 14 in the MIDI storage server 25 as voice waveform data SV, in association with music identification information that identifies the target music (here, the music data MD itself) and speaker identification information (hereinafter, speaker ID) that identifies the person who input the voice (hereinafter, the speaker). The voice waveform data SV stored in the MIDI storage server 25 is also associated with speaker feature information representing the features of the speaker, which includes, for example, the speaker's gender and age.
<Configuration of information processing device>
Next, the information processing apparatus 30 will be described.

この情報処理装置３０は、通信部３１と、入力受付部３２と、表示部３３と、記憶部３４と、制御部４０とを備えている。
このうち、通信部３１は、通信網（例えば、公衆無線通信網やネットワーク回線）を介して外部との間で通信を行う。入力受付部３２は、外部からの操作に従って情報や指令の入力を受け付ける入力機器（例えば、キーボードやポインティングデバイス）である。 The information processing apparatus 30 includes a communication unit 31, an input reception unit 32, a display unit 33, a storage unit 34, and a control unit 40.
Among these, the communication unit 31 communicates with the outside via a communication network (for example, a public wireless communication network or a network line). The input receiving unit 32 is an input device (for example, a keyboard or a pointing device) that receives input of information and commands in accordance with an external operation.

表示部３３は、画像を表示する表示装置（例えば、液晶ディスプレイやＣＲＴ等）である。
記憶部３４は、記憶内容を読み書き可能に構成された不揮発性の記憶装置（例えば、ハードディスク装置や、フラッシュメモリ）である。また、制御部４０は、ＲＯＭ４１、ＲＡＭ４２、ＣＰＵ４３を少なくとも有した周知のコンピュータを中心に構成されている。 The display unit 33 is a display device (for example, a liquid crystal display or a CRT) that displays an image.
The storage unit 34 is a non-volatile storage device (for example, a hard disk device or a flash memory) configured to be able to read and write stored contents. The control unit 40 is configured around a known computer having at least a ROM 41, a RAM 42, and a CPU 43.

そして、情報処理装置３０のＲＯＭ４１には、ＭＩＤＩ格納サーバ２５に格納されている音声波形データＳＶ及び音楽データＭＤに基づいて生成した音声パラメータＰＭを、当該音声パラメータＰＭの生成源である音声の性質を表すメタデータと対応付けてデータ格納サーバ５０に格納する音声パラメータ登録処理を制御部４０が実行するための処理プログラムが記憶されている。 The ROM 41 of the information processing apparatus 30 stores a processing program by which the control unit 40 executes a voice parameter registration process. This process stores, in the data storage server 50, the voice parameter PM generated from the voice waveform data SV and the music data MD held in the MIDI storage server 25, in association with metadata representing the nature of the voice from which that voice parameter PM was generated.

さらに、情報処理装置３０のＲＯＭ４１には、音声パラメータ登録処理によってデータ格納サーバ５０に格納された音声パラメータＰＭを統計処理した結果に基づいて、メタデータに対応する音声パラメータＰＭの傾向を表すメタデータ対応テーブル（以下、表情テーブルＴＤと称す）を、当該音声パラメータＰＭと対応付けられたメタデータの種類ごとに作成し、データ格納サーバ５０に記憶する音声分析処理を制御部４０が実行するための処理プログラムが記憶されている。 The ROM 41 of the information processing apparatus 30 further stores a processing program by which the control unit 40 executes a voice analysis process. Based on the result of statistically processing the voice parameters PM stored in the data storage server 50 by the voice parameter registration process, this process creates a metadata correspondence table (hereinafter referred to as an expression table TD) representing the tendency of the voice parameters PM corresponding to the metadata, one table for each type of metadata associated with the voice parameters PM, and stores the table in the data storage server 50.

本実施形態において、メタデータとは、当該音声が発声されたときの発声者の感情を少なくとも含む、音声の性質を表すものであり、例えば、情緒や、雰囲気などを含む概念である。さらに、音声の性質には、感情を推定するために必要な情報を含んでも良い。 In the present embodiment, the metadata represents the nature of speech including at least the emotion of the speaker when the speech is uttered, and is a concept that includes, for example, emotion and atmosphere. Furthermore, information necessary for estimating emotions may be included in the nature of speech.

なお、データ格納サーバ５０は、記憶内容を読み書き可能に構成された記憶装置を中心に構成された装置であり、通信網を介して情報処理装置３０に接続されている。
〈音声パラメータ登録処理について〉
次に、情報処理装置３０が実行する音声パラメータ登録処理について説明する。 The data storage server 50 is a device that is mainly configured of a storage device that is configured to be able to read and write stored contents, and is connected to the information processing device 30 via a communication network.
<Voice parameter registration process>
Next, the voice parameter registration process executed by the information processing apparatus 30 will be described.

ここで、図２は、音声パラメータ登録処理の処理手順を示すフローチャートである。
この図２に示すように、音声パラメータ登録処理は、起動されると、入力受付部３２を介して指定された楽曲（以下、対象楽曲と称す）の楽曲ＭＩＤＩデータＤＭを取得する（Ｓ１１０）。続いて、対象楽曲の歌詞データ群ＤＬを取得し（Ｓ１２０）、対象楽曲に対応し、かつ入力受付部３２を介して指定された発声者ＩＤに対応する一つの音声波形データＳＶ（特許請求の範囲における対象波形に相当）を取得する（Ｓ１３０）。 Here, FIG. 2 is a flowchart showing the processing procedure of the voice parameter registration processing.
As shown in FIG. 2, when the voice parameter registration process is started, it acquires the music MIDI data DM of the music specified via the input receiving unit 32 (hereinafter referred to as the target music) (S110). It then acquires the lyrics data group DL of the target music (S120), and acquires one piece of voice waveform data SV (corresponding to the target waveform in the claims) that corresponds to the target music and to the speaker ID specified via the input receiving unit 32 (S130).

さらに、Ｓ１３０で取得した音声波形データＳＶにおいて、当該音声波形データＳＶの発声内容に含まれる音節それぞれに対応する区間での音声波形（以下、音節波形と称す）を特定する（Ｓ１４０）。 Furthermore, in the speech waveform data SV acquired in S130, a speech waveform (hereinafter referred to as a syllable waveform) in a section corresponding to each syllable included in the utterance content of the speech waveform data SV is specified (S140).

具体的に、本実施形態のＳ１４０では、Ｓ１１０で取得した楽曲ＭＩＤＩデータＤＭのうち、歌唱旋律を表す楽譜トラック（以下、メロディトラックと称す）に規定された各出力音の演奏開始タイミング及び演奏終了タイミングを抽出すると共に、各出力音に対応付けられた歌詞構成文字の音節を特定する。そして、音声波形データＳＶにおいて、各出力音の演奏開始タイミングから演奏終了タイミングまでの区間それぞれに対応する区間での音声波形を音節波形として特定する。なお、本実施形態のＳ１４０にて特定される音節波形それぞれは、当該音節波形にて発声した音節の内容と対応付けられたものである。 Specifically, in S140 of the present embodiment, the performance start timing and performance end of each output sound defined in a musical score track (hereinafter referred to as a melody track) representing the singing melody in the music MIDI data DM acquired in S110. The timing is extracted, and the syllables of the lyrics constituent characters associated with each output sound are specified. Then, in the speech waveform data SV, the speech waveform in the section corresponding to each section from the performance start timing to the performance end timing of each output sound is specified as a syllable waveform. Note that each syllable waveform specified in S140 of the present embodiment is associated with the content of the syllable uttered by the syllable waveform.
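The segmentation in S140 can be illustrated with a minimal sketch. It assumes the melody track has already been parsed into note-on and note-off times in seconds and that each note carries the syllable of its lyrics constituent character; the names `MelodyNote` and `split_into_syllable_waveforms` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass

@dataclass
class MelodyNote:
    note_on: float    # seconds from the start of the performance (assumed unit)
    note_off: float   # seconds from the start of the performance (assumed unit)
    syllable: str     # syllable of the lyric character tied to this note

def split_into_syllable_waveforms(samples, sample_rate, notes):
    """Return one (syllable, samples) pair per melody note."""
    segments = []
    for note in notes:
        # Convert the note-on / note-off times to sample indices and clamp them.
        start = max(0, round(note.note_on * sample_rate))
        end = min(len(samples), round(note.note_off * sample_rate))
        segments.append((note.syllable, samples[start:end]))
    return segments

# Toy example: a 2-second "recording" sampled at 8 Hz.
samples = list(range(16))
notes = [MelodyNote(0.0, 0.5, "ka"), MelodyNote(0.5, 1.0, "ra")]
segments = split_into_syllable_waveforms(samples, 8, notes)
```

Each returned pair keeps the syllable together with its waveform, mirroring the statement that every syllable waveform is associated with the content of the syllable uttered in it.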

さらに、音節波形それぞれから音声パラメータＰＭを導出する（Ｓ１５０）。本実施形態のＳ１５０では、基本周波数、メル周波数ケプストラム（ＭＦＣＣ）、パワー、それらの時間差分を、それぞれ、音声パラメータＰＭとして導出する。これらの基本周波数、ＭＦＣＣ、パワーの導出方法は、周知であるため、ここでの詳しい説明は省略するが、例えば、基本周波数であれば、音節波形の時間軸に沿った自己相関、音節波形の周波数スペクトルの自己相関、またはケプストラム法などの手法を用いて導出すれば良い。また、ＭＦＣＣであれば、音節波形に対して時間分析窓を適用して、時間分析窓ごとに周波数解析（例えば、ＦＦＴ）をした結果について、周波数ごとの大きさを対数化した結果を、さらに、周波数解析することで導出すれば良い。パワーについては、音節波形に対して時間分析窓を適用して振幅の二乗した結果を時間方向に積分することで導出すれば良い。 Further, the voice parameter PM is derived from each syllable waveform (S150). In S150 of the present embodiment, the fundamental frequency, the mel frequency cepstrum (MFCC), the power, and their time differences are each derived as a voice parameter PM. Methods for deriving the fundamental frequency, MFCC, and power are well known, so a detailed description is omitted here. The fundamental frequency, for example, may be derived using the autocorrelation of the syllable waveform along the time axis, the autocorrelation of the frequency spectrum of the syllable waveform, or the cepstrum method. The MFCC may be derived by applying a time analysis window to the syllable waveform, performing frequency analysis (for example, an FFT) for each window, taking the logarithm of the magnitude at each frequency, and performing frequency analysis again on the result. The power may be derived by applying a time analysis window to the syllable waveform and integrating the squared amplitude along the time direction.
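Two of the parameters named in S150 can be sketched in plain Python: the power as the sum of squared amplitudes over one analysis window, and the fundamental frequency via time-domain autocorrelation. This is only an illustration under stated assumptions. The search range of 80 to 800 Hz, the function names, and the unnormalized autocorrelation are choices made for this sketch, not details given in the patent, and the MFCC is omitted.

```python
import math

def frame_power(frame):
    """Sum of squared amplitudes over one analysis window."""
    return sum(x * x for x in frame)

def estimate_f0(frame, sample_rate, f_min=80.0, f_max=800.0):
    """Pick the lag with the largest time-domain autocorrelation.

    f_min and f_max bound the search range; both are assumptions.
    """
    min_lag = max(1, int(sample_rate / f_max))
    max_lag = min(len(frame) - 1, int(sample_rate / f_min))
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None

def deltas(values):
    """Time differences between consecutive per-window values."""
    return [b - a for a, b in zip(values, values[1:])]

# Toy example: a 200 Hz sine sampled at 8 kHz, one 50 ms window.
sr = 8000
frame = [math.sin(2 * math.pi * 200 * n / sr) for n in range(400)]
f0 = estimate_f0(frame, sr)
```

A production extractor would normally normalize the autocorrelation and guard against octave errors; the sketch leaves that out to stay close to the one-line description in the text.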

続いて、Ｓ１４０にて特定した各音節波形についてのメタデータを推定するメタデータ推定処理を実行する（Ｓ１６０）。
ここで、メタデータ推定処理について、図３を用いて詳細に説明する。なお、図３は、メタデータ推定処理の処理手順を示したフローチャートである。 Subsequently, metadata estimation processing for estimating metadata about each syllable waveform specified in S140 is executed (S160).
Here, the metadata estimation process will be described in detail with reference to FIG. FIG. 3 is a flowchart showing the processing procedure of the metadata estimation processing.

このメタデータ推定処理は、起動されると、まず、先のＳ１１０にて取得した楽曲ＭＩＤＩデータに基づいて、対象楽曲において同一の調が継続される各区間である調同一区間を特定する（Ｓ３１０）。具体的に、本実施形態のＳ３１０では、図４に示すように、楽曲ＭＩＤＩデータに含まれる転調フラグに基づき、時間軸に沿って互いに隣接する転調フラグの間の区間を、調同一区間として特定する。 When this metadata estimation process is started, it first identifies, based on the music MIDI data acquired in S110, the same-key sections, i.e., the sections of the target music in which the same key continues (S310). Specifically, in S310 of the present embodiment, as shown in FIG. 4, each section between modulation flags that are adjacent to each other along the time axis is identified as a same-key section, based on the modulation flags included in the music MIDI data.
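As a minimal sketch of S310, the modulation flags can be treated as a list of times that cut the song into sections. Treating the start and end of the song as implicit boundaries is an assumption of this sketch, and `same_key_sections` is a hypothetical name.

```python
def same_key_sections(modulation_times, song_end):
    """Split [0, song_end) at every modulation flag.

    Assumption: the start and end of the song act as section boundaries
    in addition to the modulation flags themselves.
    """
    inner = [t for t in modulation_times if 0.0 < t < song_end]
    bounds = sorted(set([0.0, *inner, song_end]))
    return list(zip(bounds, bounds[1:]))

# Toy example: a 120-second song with modulations at 30 s and 75 s.
sections = same_key_sections([75.0, 30.0], 120.0)
```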

続いて、Ｓ３１０にて特定した調同一区間における主音を特定する（Ｓ３２０）。具体的に、本実施形態のＳ３２０では、図５に示すように、調同一区間において、時間軸に沿って最後の出力音を、当該調同一区間における主音として特定する。本実施形態では、Ｓ３１０にて特定した調同一区間のそれぞれについて、主音を特定する。 Subsequently, the tonic of each same-key section identified in S310 is identified (S320). Specifically, in S320 of the present embodiment, as shown in FIG. 5, the last output sound along the time axis within a same-key section is identified as the tonic of that section. In the present embodiment, a tonic is identified for each of the same-key sections identified in S310.
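The rule in S320 (the last output sound of a section is taken as its tonic, called the main sound in some renderings) can be sketched as follows. Notes are assumed to be `(note_on_seconds, midi_note_number)` pairs, and the pitch name is obtained as the note number modulo 12, with 0 meaning C as in standard MIDI numbering. The helper name is hypothetical.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def tonic_of_section(notes, section):
    """Return the pitch class (0-11) of the last note starting in the section.

    notes: list of (note_on_seconds, midi_note_number) pairs (assumed layout).
    section: (start, end) in seconds.
    """
    start, end = section
    inside = [n for n in notes if start <= n[0] < end]
    if not inside:
        return None
    _, last_number = max(inside, key=lambda n: n[0])
    return last_number % 12

# Toy example: a phrase that ends on the C one octave above middle C.
notes = [(0.0, 60), (1.0, 64), (2.0, 67), (3.0, 72)]
tonic = tonic_of_section(notes, (0.0, 4.0))
tonic_name = NOTE_NAMES[tonic]
```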

そして、Ｓ３２０にて特定した主音の音名を起点とし、当該主音が特定された調同一区間に含まれる出力音それぞれの音名を階級とし、各音名の登場回数を度数としたヒストグラム（以下、登場音名頻度と称す）を導出する（Ｓ３３０）。具体的に、本実施形態のＳ３３０にて導出する登場音名頻度は、図６（Ａ）に示すように、調同一区間に含まれる同一音名の出力音の登場回数（登場頻度）を集計したものである。そして、本実施形態においては、オクターブが異なる出力音であっても、音名が同一であれば、同一音名の出力音として集計する。なお、本実施形態では、各調同一区間について、登場音名頻度を導出する。 Then a histogram (hereinafter referred to as the appearance pitch name frequency) is derived, in which the pitch name of the tonic identified in S320 is the starting point, the pitch names of the output sounds included in the same-key section for which that tonic was identified are the classes, and the number of appearances of each pitch name is the frequency (S330). Specifically, as shown in FIG. 6(A), the appearance pitch name frequency derived in S330 of the present embodiment is a tally of the number of appearances (appearance frequency) of output sounds with the same pitch name within the same-key section. In the present embodiment, output sounds in different octaves are tallied as the same pitch name if their pitch names are identical. The appearance pitch name frequency is derived for each same-key section.
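The histogram of S330 can be sketched as a 12-bin count of pitch classes, rotated so that bin 0 is the tonic. Folding octaves with a modulo-12 operation follows the statement that notes in different octaves count as the same pitch name; the function name and the use of MIDI note numbers as input are assumptions of this sketch.

```python
def pitch_class_histogram(note_numbers, tonic_pc):
    """Count pitch names relative to the tonic, ignoring octaves.

    note_numbers: MIDI note numbers of the notes in one same-key section.
    tonic_pc: pitch class of the tonic (0-11).
    Bin k holds the count of notes k semitones above the tonic.
    """
    hist = [0] * 12
    for number in note_numbers:
        hist[(number - tonic_pc) % 12] += 1
    return hist

# Toy example: notes G4, B4, D5, G5 with G (pitch class 7) as the tonic.
hist = pitch_class_histogram([67, 71, 74, 79], 7)
```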

続いて、Ｓ３３０にて導出した登場音名頻度を、各調にて利用可能な音名の分布を表すテンプレートとして調毎に予め用意した調テンプレートに照合した結果に基づいて、当該調同一区間における調を特定する（Ｓ３４０）。具体的に、本実施形態のＳ３４０では、長調の楽曲にて利用可能な音名の分布を表す長調テンプレート（図６（Ｂ）参照）と、短調の楽曲にて利用可能な音名の分布を表す短調テンプレート（図６（Ｃ）参照）とを予め用意し、それぞれの調テンプレートにＳ３３０にて導出した登場音名頻度を照合する。その結果、最も高い相関を示す調テンプレートに対応する調を、当該調同一区間における調として特定する。なお、本実施形態のＳ３４０では、調同一区間のそれぞれについての調を特定する。 Subsequently, the key of the same-key section is identified based on the result of matching the appearance pitch name frequency derived in S330 against key templates, which are prepared in advance for each key as templates representing the distribution of pitch names usable in that key (S340). Specifically, in S340 of the present embodiment, a major template representing the distribution of pitch names usable in music in a major key (see FIG. 6(B)) and a minor template representing the distribution of pitch names usable in music in a minor key (see FIG. 6(C)) are prepared in advance, and the appearance pitch name frequency derived in S330 is matched against each key template. The key corresponding to the key template showing the highest correlation is then identified as the key of the same-key section. In S340 of the present embodiment, a key is identified for each same-key section.
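The template matching of S340 can be sketched with a Pearson correlation between the tonic-relative histogram and two 12-element templates. The actual templates of FIG. 6(B) and FIG. 6(C) are not reproduced in the text, so this sketch assumes simple binary templates marking the degrees of the major scale and the natural minor scale; the correlation measure and all names are likewise assumptions.

```python
import math

# Assumed binary templates: 1 where the scale uses the pitch, 0 elsewhere.
MAJOR = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1]   # degrees 0, 2, 4, 5, 7, 9, 11
MINOR = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0]   # degrees 0, 2, 3, 5, 7, 8, 10

def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def classify_key(hist):
    """Return 'major' or 'minor', whichever template correlates best."""
    scores = {"major": pearson(hist, MAJOR), "minor": pearson(hist, MINOR)}
    return max(scores, key=scores.get)

# Toy histograms, already rotated so that bin 0 is the tonic.
major_like = [4, 0, 2, 0, 3, 1, 0, 3, 0, 1, 0, 1]
minor_like = [4, 0, 2, 3, 0, 1, 0, 3, 1, 0, 1, 0]
```

Because the histogram is tonic-relative, two templates are enough here; a variant that works on absolute pitch names would need one template per key.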

さらに、Ｓ３４０で特定した調同一区間における楽曲の調に対応する音声の性質を、メタデータとして特定する（Ｓ３５０）。具体的に、本実施形態のＳ３５０では、調同一区間における調が長調であれば、当該調同一区間での歌詞（即ち、発声内容）が「明るい」という感情を表す音声の性質をメタデータとして特定する。また、調同一区間における調が短調であれば、当該調同一区間での歌詞が「暗い」という感情を表す音声の性質をメタデータとして特定する。なお、本実施形態においては、調同一区間に含まれる全ての音節について、当該調同一区間に対応するメタデータを割り当てている。 Furthermore, the nature of the voice corresponding to the key of the music in the same-key section identified in S340 is identified as metadata (S350). Specifically, in S350 of the present embodiment, if the key of a same-key section is major, a voice nature in which the lyrics (that is, the utterance content) in that section express a "bright" emotion is identified as metadata. If the key of a same-key section is minor, a voice nature in which the lyrics in that section express a "dark" emotion is identified as metadata. In the present embodiment, the metadata corresponding to a same-key section is assigned to all syllables included in that section.
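The assignment in S350 amounts to a lookup from key to emotion label followed by tagging every syllable that falls in the section. The sketch below assumes syllables are given as `(onset_seconds, text)` pairs and sections as `(start, end, key)` triples; these layouts and the names are hypothetical.

```python
KEY_TO_METADATA = {"major": "bright", "minor": "dark"}

def label_syllables(syllables, sections):
    """Attach the metadata of the enclosing same-key section to each syllable.

    syllables: list of (onset_seconds, text) pairs (assumed layout).
    sections: list of (start, end, key) triples, key being 'major' or 'minor'.
    """
    labelled = []
    for onset, text in syllables:
        for start, end, key in sections:
            if start <= onset < end:
                labelled.append((text, KEY_TO_METADATA[key]))
                break
    return labelled

# Toy example: the song modulates from major to minor at 30 s.
sections = [(0.0, 30.0, "major"), (30.0, 60.0, "minor")]
syllables = [(1.0, "ka"), (29.5, "ra"), (31.0, "o")]
labelled = label_syllables(syllables, sections)
```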

そして、その後、音声パラメータ登録処理のＳ１７０へと移行する。
その音声パラメータ登録処理のＳ１７０では、Ｓ１５０にて導出した音声パラメータＰＭと、Ｓ１６０にて推定したメタデータとを、対応する音節毎に対応付けてデータ格納サーバ５０に格納する音声パラメータ登録を実行する（Ｓ１７０）。なお、本実施形態のＳ１７０にてデータ格納サーバ５０に格納される音声パラメータＰＭと対応付けられるデータは、メタデータに加えて、発声した音節の内容（種類）や、発声者ＩＤ、発声者特徴情報を含む。 Thereafter, the process proceeds to S170 of the voice parameter registration process.
In S170 of the voice parameter registration process, voice parameter registration is executed: the voice parameter PM derived in S150 and the metadata estimated in S160 are associated with each other for each corresponding syllable and stored in the data storage server 50 (S170). In addition to the metadata, the data associated with the voice parameter PM stored in the data storage server 50 in S170 includes the content (type) of the uttered syllable, the speaker ID, and the speaker feature information.
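The record registered in S170 can be pictured as one entry per syllable that bundles the parameters with their associated data. The following is a sketch only: the field names, the dictionary layout of the parameters, and the in-memory list standing in for the data storage server 50 are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ParameterRecord:
    syllable: str            # content (type) of the uttered syllable
    speaker_id: str          # speaker ID
    metadata: str            # e.g. "bright" or "dark"
    parameters: dict         # e.g. {"f0": 220.0, "power": 0.8} (assumed keys)
    speaker_features: dict = field(default_factory=dict)  # e.g. gender, age

class InMemoryStore:
    """Stand-in for the data storage server; not a real server API."""

    def __init__(self):
        self.records = []

    def register(self, record):
        self.records.append(record)

    def by_metadata(self, metadata):
        return [r for r in self.records if r.metadata == metadata]

store = InMemoryStore()
store.register(ParameterRecord("ka", "s1", "bright", {"f0": 220.0},
                               {"gender": "F", "age": 30}))
store.register(ParameterRecord("ra", "s1", "dark", {"f0": 180.0}))
```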

その後、本音声パラメータ登録処理を終了する。
以上説明したように、図７に示すように、本実施形態の音声パラメータ登録処理では、対象楽曲の演奏期間中に入力された音声波形を処理対象とする。そして、その音声波形に基づく音声波形データＳＶを、当該対象楽曲のメロディラインを構成する各出力音の演奏期間に対応する区間（即ち、発声内容に含まれる各音節）毎に分割して音節波形を生成すると共に、各音節波形から音声パラメータＰＭを導出する。 Thereafter, the voice parameter registration process is terminated.
As described above and as shown in FIG. 7, the voice parameter registration process of the present embodiment takes as its processing target the voice waveform input during the performance of the target music. The voice waveform data SV based on that waveform is divided into sections corresponding to the performance periods of the output sounds constituting the melody line of the target music (that is, into the syllables included in the utterance content) to generate syllable waveforms, and a voice parameter PM is derived from each syllable waveform.

これと共に、音声パラメータ登録処理では、対象楽曲において同一の調が継続する期間（即ち、調同一区間）それぞれを特定し、各調同一区間における調（調性）を特定する。そして、その特定した調からイメージされる感情として予め規定された音声の性質をメタデータとして特定する。 At the same time, in the voice parameter registration process, each period (that is, the same key section) in which the same key continues in the target music is specified, and the key (tonality) in each key same section is specified. And the property of the voice previously defined as the emotion imaged from the specified key is specified as metadata.

その上で、音声パラメータ登録処理では、対応する音節毎に、音声パラメータＰＭと、メタデータとを対応付けて、データ格納サーバ５０に格納する。
〈音声分析処理について〉
次に、情報処理装置３０の制御部４０が実行する音声分析処理について、図８を用いて説明する。 On top of that, the voice parameter registration process associates the voice parameter PM with the metadata for each corresponding syllable and stores them in the data storage server 50.
<About voice analysis processing>
Next, speech analysis processing executed by the control unit 40 of the information processing apparatus 30 will be described with reference to FIG.

この図８に示すように、音声分析処理は、起動されると、まず、同一の内容を表すメタデータ（以下、対象メタデータとする）と対応付けられた全ての音声パラメータ（以下、音声パラメータ群と称す）を、データ格納サーバ５０から取得する（Ｓ４１０）。すなわち、本実施形態のＳ４１０にて取得する音声パラメータ群とは、データ格納サーバ５０に格納された音声パラメータの中で、対象メタデータと対応付けられた全ての音声パラメータＰＭである。さらに、ここでの音声パラメータＰＭには、基本周波数、メル周波数ケプストラム（ＭＦＣＣ）、パワー、それらの時間差分のそれぞれを含む。 As shown in FIG. 8, when the voice analysis process is started, first, all voice parameters associated with metadata representing the same content (hereinafter referred to as target metadata) (hereinafter referred to as voice parameters). (Referred to as a group) is acquired from the data storage server 50 (S410). That is, the speech parameter group acquired in S410 of the present embodiment is all speech parameters PM associated with the target metadata among the speech parameters stored in the data storage server 50. Furthermore, the audio parameter PM here includes each of a fundamental frequency, a mel frequency cepstrum (MFCC), power, and their time differences.

続いて、Ｓ４１０にて取得した音声パラメータ群に基づいて表情テーブルＴＤを生成する（Ｓ４２０）。
具体的に、本実施形態のＳ４２０では、Ｓ４１０にて取得した音声パラメータ群に含まれる各音声パラメータ（即ち、基本周波数、メル周波数ケプストラム（ＭＦＣＣ）、パワー、それらの時間差分のそれぞれ）について平均値を算出する。そして、その算出した平均値と、Ｓ４１０にて取得した音声パラメータ群に含まれる各音声パラメータＰＭとの差分であるパラメータ差分を、当該音声パラメータＰＭと対応付けられている発声者ＩＤごと、かつ当該音声パラメータＰＭと対応付けられている音節ごとに導出する。 Subsequently, the facial expression table TD is generated based on the voice parameter group acquired in S410 (S420).
Specifically, in S420 of the present embodiment, an average value is calculated for each kind of voice parameter (that is, fundamental frequency, mel frequency cepstrum (MFCC), power, and their time differences) included in the voice parameter group acquired in S410. Then a parameter difference, which is the difference between the calculated average value and each voice parameter PM in the group, is derived for each speaker ID and each syllable associated with that voice parameter PM.

さらに、本実施形態のＳ４２０では、導出したパラメータ差分を、当該パラメータ差分に対応するメタデータ、発声者ＩＤ、及び音節と対応付けることで、表情テーブルＴＤを生成する。すなわち、Ｓ４２０にて生成される表情テーブルＴＤは、図９に示すように、発声者ＩＤごとに、メタデータの内容が分類された上で、音節の内容と、当該音節に対応するパラメータ差分とが対応付けられたものである。 Furthermore, in S420 of the present embodiment, the facial expression table TD is generated by associating the derived parameter difference with the metadata, the speaker ID, and the syllable corresponding to the parameter difference. That is, as shown in FIG. 9, the expression table TD generated in S420 is obtained by classifying the content of metadata for each speaker ID, and the syllable content and the parameter difference corresponding to the syllable. Are associated with each other.
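The table construction of S420 can be sketched as: compute the mean of each parameter over all records that share the target metadata, then store each record's deviation from that mean under its speaker ID and syllable. The text does not fix the sign of the difference or say what happens when one speaker has several records for the same syllable; this sketch assumes `value - mean` and averages duplicates. Record layout and names are hypothetical.

```python
from collections import defaultdict

def build_expression_table(records, metadata):
    """Map (speaker_id, syllable) to per-parameter differences from the mean.

    records: dicts with keys 'speaker_id', 'syllable', 'metadata', 'parameters'.
    """
    group = [r for r in records if r["metadata"] == metadata]
    if not group:
        return {}
    names = list(group[0]["parameters"])
    # Mean of each parameter over the whole group (all speakers, all syllables).
    mean = {k: sum(r["parameters"][k] for r in group) / len(group) for k in names}
    buckets = defaultdict(list)
    for r in group:
        diff = {k: r["parameters"][k] - mean[k] for k in names}  # assumed sign
        buckets[(r["speaker_id"], r["syllable"])].append(diff)
    # Assumption: duplicates for one speaker and syllable are averaged.
    return {key: {k: sum(d[k] for d in diffs) / len(diffs) for k in names}
            for key, diffs in buckets.items()}

records = [
    {"speaker_id": "A", "syllable": "ka", "metadata": "bright", "parameters": {"f0": 200.0}},
    {"speaker_id": "B", "syllable": "ka", "metadata": "bright", "parameters": {"f0": 240.0}},
    {"speaker_id": "A", "syllable": "ra", "metadata": "dark", "parameters": {"f0": 100.0}},
]
table = build_expression_table(records, "bright")
```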

そして、Ｓ４２０にて生成した表情テーブルＴＤを、データ格納サーバ５０に格納する（Ｓ４３０）。
その後、本音声分析処理を終了する。
〈音声出力端末の構成について〉
次に、音声出力端末について説明する（図１参照）。 Then, the facial expression table TD generated in S420 is stored in the data storage server 50 (S430).
Thereafter, the voice analysis process is terminated.
<Configuration of audio output terminal>
Next, the audio output terminal will be described (see FIG. 1).

この音声出力端末６０は、情報受付部６１と、表示部６２と、音出力部６３と、通信部６４と、記憶部６５と、制御部６７とを備えている。本実施形態における音声出力端末６０として、例えば、周知の携帯端末（携帯電話や携帯情報端末）や、周知の情報処理装置（いわゆるパーソナルコンピュータ）を想定しても良い。 The voice output terminal 60 includes an information receiving unit 61, a display unit 62, a sound output unit 63, a communication unit 64, a storage unit 65, and a control unit 67. As the audio output terminal 60 in the present embodiment, for example, a known portable terminal (a mobile phone or a portable information terminal) or a known information processing apparatus (a so-called personal computer) may be assumed.

このうち、情報受付部６１は、入力装置（図示せず）を介して入力された情報を受け付ける。表示部６２は、制御部６７からの指令に基づいて画像を表示する。音出力部６３は、音を出力する周知の装置であり、例えば、ＰＣＭ音源と、スピーカとを備えている。 Among these, the information reception part 61 receives the information input via the input device (not shown). The display unit 62 displays an image based on a command from the control unit 67. The sound output unit 63 is a known device that outputs sound, and includes, for example, a PCM sound source and a speaker.

通信部６４は、通信網（例えば、公衆無線通信網やネットワーク回線）を介して音声出力端末６０が外部との間で情報通信を行うものである。記憶部６５は、記憶内容を読み書き可能に構成された不揮発性の記憶装置（例えば、ハードディスク装置や、フラッシュメモリ）であり、各種処理プログラムや各種データが記憶される。 The communication unit 64 is for the voice output terminal 60 to perform information communication with the outside via a communication network (for example, a public wireless communication network or a network line). The storage unit 65 is a non-volatile storage device (for example, a hard disk device or a flash memory) configured to be able to read and write stored contents, and stores various processing programs and various data.

また、制御部６７は、ＲＯＭ、ＲＡＭ、ＣＰＵを少なくとも有した周知のコンピュータを中心に構成されている。
〈音声合成処理について〉
次に、音声出力端末６０の制御部６７が実行する音声合成処理について説明する。 The control unit 67 is mainly configured by a known computer having at least a ROM, a RAM, and a CPU.
<About voice synthesis processing>
Next, speech synthesis processing executed by the control unit 67 of the speech output terminal 60 will be described.

ここで、図１０は、音声合成処理の処理手順を示すフローチャートである。
この音声合成処理は、音声出力端末６０の情報受付部６１を介して起動指令が入力されると起動される。 Here, FIG. 10 is a flowchart showing the procedure of the speech synthesis process.
This voice synthesis process is started when a start command is input via the information receiving unit 61 of the voice output terminal 60.

この図１０に示すように、音声合成処理は、起動されると、まず、情報受付部６１を介して入力された情報（以下、入力情報と称す）を取得する（Ｓ５１０）。このＳ５１０にて取得する入力情報とは、例えば、合成音として出力する音声の内容（文言）を表す出力文言や、合成音として出力する音の性質を表す出力性質情報を含むものである。なお、ここで言う音の性質（即ち、出力性質情報）とは、発声者の性別、発声者の年齢といった、発声者の声の特徴に加えて、発声者が発声したときの感情などメタデータとして規定される情報を含むものである。 As shown in FIG. 10, when the speech synthesis process is started, it first acquires the information input via the information receiving unit 61 (hereinafter referred to as input information) (S510). The input information acquired in S510 includes, for example, an output wording representing the content (wording) of the voice to be output as synthesized sound, and output property information representing the nature of the sound to be output as synthesized sound. The nature of the sound here (that is, the output property information) includes, in addition to features of the speaker's voice such as the speaker's gender and age, information defined as metadata, such as the emotion of the speaker at the time of utterance.

続いて、Ｓ５１０にて取得した出力性質情報のうちのメタデータとして規定されるべき感情等の情報に最も類似する情報を含む表情テーブルＴＤを、データ格納サーバ５０から抽出する（Ｓ５２０）。さらに、Ｓ５１０にて取得した出力文言を合成音として出力するために必要な音節それぞれに対応し、かつＳ５１０にて取得した出力性質情報のうちの声の特徴に最も類似する情報を有した音声パラメータＰＭを、データ格納サーバ５０から抽出する（Ｓ５３０）。 Subsequently, the facial expression table TD including information most similar to information such as emotion to be defined as metadata in the output property information acquired in S510 is extracted from the data storage server 50 (S520). Furthermore, the speech parameter corresponding to each syllable necessary for outputting the output word acquired in S510 as a synthesized sound and having information most similar to the voice feature in the output property information acquired in S510 PM is extracted from the data storage server 50 (S530).
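The selection in S530 picks, for each syllable of the output wording, the stored parameters whose speaker features are most similar to the requested ones. The text does not define the similarity measure, so the distance below (a fixed penalty for a gender mismatch plus the absolute age gap) is purely an assumption for illustration, as are the record layout and function names.

```python
def feature_distance(wanted, candidate):
    """Assumed distance: gender mismatch penalty plus absolute age gap."""
    gender_penalty = 0 if wanted["gender"] == candidate["gender"] else 100
    return gender_penalty + abs(wanted["age"] - candidate["age"])

def select_parameters(records, syllables, wanted):
    """Pick one stored record per syllable, closest to the wanted voice."""
    chosen = []
    for syllable in syllables:
        candidates = [r for r in records if r["syllable"] == syllable]
        if not candidates:
            raise KeyError(f"no stored parameters for syllable {syllable!r}")
        chosen.append(min(
            candidates,
            key=lambda r: feature_distance(wanted, r["speaker_features"]),
        ))
    return chosen

records = [
    {"speaker_id": "s1", "syllable": "ka", "speaker_features": {"gender": "M", "age": 30}},
    {"speaker_id": "s2", "syllable": "ka", "speaker_features": {"gender": "F", "age": 45}},
    {"speaker_id": "s3", "syllable": "ra", "speaker_features": {"gender": "F", "age": 28}},
]
chosen = select_parameters(records, ["ka", "ra"], {"gender": "F", "age": 30})
```

The same nearest-match idea would apply to S520, where the expression table whose metadata is closest to the requested emotion is chosen.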

そして、Ｓ５１０にて取得した出力文言の内容にて合成音が出力されるように、Ｓ５３０にて取得した音声パラメータＰＭを、Ｓ５２０にて取得した表情テーブルＴＤに従って設定する（Ｓ５４０）。続いて、Ｓ５４０にて設定された音声パラメータＰＭに基づいて、音声合成する（Ｓ５５０）。このＳ５５０における音声合成は、特許文献1の他にもフォルマント合成による周知の音声合成の手法を用いれば良い。 Then, the voice parameter PM acquired in S530 is set according to the facial expression table TD acquired in S520 so that the synthesized sound is output with the content of the output word acquired in S510 (S540). Subsequently, speech synthesis is performed based on the speech parameter PM set in S540 (S550). For the speech synthesis in S550, in addition to Patent Document 1, a known speech synthesis method using formant synthesis may be used.
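How S540 sets the voice parameters "according to" the expression table is not spelled out in the text. One plausible reading, shown here strictly as an assumption, is to add the stored parameter difference for the matching speaker and syllable to the base parameter values. The function name and the dictionary layout are hypothetical.

```python
def apply_expression(base_parameters, expression_table, speaker_id, syllable):
    """Shift base parameters by the table's difference (assumed additive rule).

    expression_table maps (speaker_id, syllable) to {parameter: difference}.
    Parameters without an entry in the table are left unchanged.
    """
    diff = expression_table.get((speaker_id, syllable), {})
    return {name: value + diff.get(name, 0.0)
            for name, value in base_parameters.items()}

table = {("s1", "ka"): {"f0": 12.5}}
adjusted = apply_expression({"f0": 200.0, "power": 1.0}, table, "s1", "ka")
untouched = apply_expression({"f0": 200.0}, table, "s1", "ra")
```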

さらに、Ｓ５５０にて音声合成することによって生成された合成音を音出力部６３から出力する（Ｓ５６０）。
その後、本音声合成処理を終了する。
［第一実施形態の効果］
以上説明したように、本実施形態において音声入力装置１０は、カラオケ装置によって構成されている。このため、音声入力装置１０の利用者（即ち、発声者）が歌唱（発声）した結果（音声波形）を音声波形データＳＶとして収集することができ、多くの利用者に歌唱させることで、多数の発声者による多様な音声波形データＳＶを収集できる。 Furthermore, the synthesized sound generated by the voice synthesis in S550 is output from the sound output unit 63 (S560).
Thereafter, the speech synthesis process ends.
[Effect of the first embodiment]
As described above, in the present embodiment, the voice input device 10 is constituted by a karaoke device. The result (voice waveform) of singing (utterance) by a user of the voice input device 10 (that is, a speaker) can therefore be collected as voice waveform data SV, and by having many users sing, diverse voice waveform data SV from many speakers can be collected.

そして、本実施形態における情報処理装置３０では、多くの発声者による多様な音声波形データＳＶから音声パラメータＰＭを導出することで、多様な人物が発声した多様な音声パラメータＰＭを導出できる。すなわち、本実施形態の情報処理装置３０によれば、音声パラメータＰＭを多様化できる。 The information processing apparatus 30 according to the present embodiment can derive various voice parameters PM uttered by various persons by deriving the voice parameters PM from the various voice waveform data SV of many speakers. That is, according to the information processing apparatus 30 of the present embodiment, the voice parameter PM can be diversified.

また、本実施形態における音声パラメータ登録処理のメタデータ推定処理では、音楽データＭＤに基づいてメタデータを自動的に推定できる。このため、本実施形態の音声パラメータ登録処理によれば、特許文献１に記載の音声合成装置とは異なり、対象楽曲を歌唱するときに、メタデータとしての音声の性質を発声者らに入力させる必要を無くすことができる。 Further, in the metadata estimation process of the voice parameter registration process in the present embodiment, the metadata can be estimated automatically from the music data MD. Consequently, unlike the speech synthesizer described in Patent Document 1, the voice parameter registration process of the present embodiment eliminates the need to have the speakers input the nature of the voice as metadata when they sing the target music.

そして、本実施形態における音声出力端末６０では、音声パラメータ登録処理にて登録された多様な発声者の音声パラメータＰＭ、及び音声分析処理にて生成された表情テーブルＴＤの中から、入力された出力性質情報に合致する表情テーブルＴＤ、音声パラメータＰＭを抽出して、入力された出力文言が実現されるように音声合成している。 Then, in the voice output terminal 60 according to the present embodiment, the output inputted from the voice parameters PM of various speakers registered in the voice parameter registration process and the facial expression table TD generated in the voice analysis process. A facial expression table TD and a speech parameter PM that match the property information are extracted, and speech synthesis is performed so that the input output text is realized.

Therefore, the speech synthesis system 1 of the present embodiment can diversify the characteristics of the speakers who can be regarded as having uttered the synthesized speech.
In the metadata estimation process of the present embodiment, the metadata is the singer's emotion that is likely to be expressed by the key of each same-key section of the target music. That is, the metadata estimation process of the present embodiment can use, as metadata, the speaker's emotion when uttering the lyrics corresponding to each same-key section, and can also identify the key of each same-key section reliably.
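As an illustration only, the key-based estimation can be sketched in Python following the steps recited in the claims: split the score into same-key sections at the modulation flags, take the last output sound of each section as the main sound, count pitch names starting from the main sound, and compare the counts with key templates. The note representation (start time, MIDI note number), the use of Pearson correlation, the natural-minor template, and all function names are assumptions of this sketch, not details taken from the embodiment.

```python
import math
from bisect import bisect_right

# Templates relative to the main sound (index 0 = main sound).
# Assumption: the natural minor scale is used for the minor template.
MAJOR_TEMPLATE = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
MINOR_TEMPLATE = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0]

def split_same_key_sections(notes, modulation_times):
    """Group (start_time, midi_pitch) notes into sections bounded by modulation flags."""
    boundaries = sorted(modulation_times)
    sections = [[] for _ in range(len(boundaries) + 1)]
    for start, pitch in sorted(notes):
        sections[bisect_right(boundaries, start)].append((start, pitch))
    return [section for section in sections if section]

def pitch_name_frequency(section):
    """Count pitch classes, rotated so that index 0 is the last note's pitch class."""
    main_sound = section[-1][1] % 12
    histogram = [0] * 12
    for _, pitch in section:
        histogram[(pitch - main_sound) % 12] += 1
    return histogram

def correlation(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def estimate_voice_properties(notes, modulation_times):
    """Return "bright" (major) or "dark" (minor) for each same-key section."""
    labels = []
    for section in split_same_key_sections(notes, modulation_times):
        histogram = pitch_name_frequency(section)
        major = correlation(histogram, MAJOR_TEMPLATE)
        minor = correlation(histogram, MINOR_TEMPLATE)
        labels.append("bright" if major >= minor else "dark")
    return labels

# A C major phrase ending on C, then an A minor phrase ending on A.
c_major = [60, 62, 64, 65, 67, 69, 71, 72, 60]
a_minor = [57, 59, 60, 62, 64, 65, 67, 69, 57]
notes = [(float(i), p) for i, p in enumerate(c_major + a_minor)]
print(estimate_voice_properties(notes, [9.0]))  # ['bright', 'dark']
```

Because the histogram is rotated to start at the main sound, two templates (major and minor) stand in for the per-key templates in this sketch.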
[Second Embodiment]
Next, a second embodiment of the present invention will be described.

The speech synthesis system of the second embodiment differs from the speech synthesis system 1 of the first embodiment mainly in the content of the metadata estimation process. In the present embodiment, therefore, configurations and processes identical to those of the first embodiment are given the same reference signs and their description is omitted; the description focuses on the metadata estimation process, which differs from that of the first embodiment.
<About metadata estimation processing>
Here, FIG. 11 is a flowchart showing a processing procedure of the metadata estimation processing of the present embodiment.

As shown in FIG. 11, when the metadata estimation process is started in S160 of the speech parameter registration process, it first performs morphological analysis on the lyrics represented by the lyrics telop data DT contained in the lyrics data group DL acquired in S120 (S710). That is, in S710 of the present embodiment, morphological analysis divides the character string constituting the lyrics into word characters, each being the character string that constitutes one word in the lyrics. The morphological analysis performed in S710 is a well-known process, so a detailed description is omitted here.

Next, word property information is acquired, for each word obtained by the morphological analysis in S710, from a word metadata database (DB in the figure) 100 that stores a word property table prepared in advance (S720). The word property table referred to here is a table that associates word property information, which represents the properties of each word, with identification information for that word. The properties of a word referred to here include the meaning of the word and the emotion expressed by the word.

The word property information acquired in S720 is then assigned, as metadata, to the section in which the corresponding word was uttered (S730).
The metadata estimation process then ends, and processing returns to the speech parameter registration process.
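The flow of S710 to S730 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: whitespace splitting stands in for morphological analysis (Japanese lyrics would require a real morphological analyzer), and the table contents, the per-word (start, end) sections, and the function names are hypothetical.

```python
# Hypothetical word property table: word -> property (here, an emotion label).
WORD_PROPERTY_TABLE = {"love": "joy", "tears": "sadness", "goodbye": "sadness"}

def split_into_words(lyrics):
    """Stand-in for morphological analysis (S710): lowercase and split on whitespace."""
    return lyrics.lower().split()

def assign_word_metadata(lyrics, word_sections, table=WORD_PROPERTY_TABLE):
    """Look up each word's property (S720) and attach it to the word's section (S730).

    `word_sections` holds one (start_seconds, end_seconds) pair per word,
    in the order the words appear in the lyrics.
    """
    words = split_into_words(lyrics)
    if len(words) != len(word_sections):
        raise ValueError("one (start, end) section is required per word")
    result = []
    for word, section in zip(words, word_sections):
        # Words missing from the table get no metadata (None).
        result.append({"section": section, "word": word, "metadata": table.get(word)})
    return result

assigned = assign_word_metadata(
    "Love brings tears",
    [(0.0, 0.5), (0.5, 1.2), (1.2, 2.0)],
)
```

In this sketch a word absent from the table simply receives no metadata; the source does not state how such words are handled.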

As described above, in the metadata estimation process of the present embodiment, as shown in FIG. 12, morphological analysis is performed on the lyrics of the target music, dividing the lyrics into word characters, each being the character string that constitutes one word. Then, from the word property information contained in the word property table stored in the word metadata database 100 prepared in advance, the word property information corresponding to each word is acquired, and each item of word property information is used as metadata for the speech parameters PM of the corresponding syllables.
[Effects of the Second Embodiment]
As described above, according to the metadata estimation process of the present embodiment, the meaning of the word uttered by the speaker and the emotion represented by the word can be used as metadata.
[Other Embodiments]
Embodiments of the present invention have been described above, but the present invention is not limited to these embodiments and can be implemented in various forms without departing from the gist of the invention.

For example, in the speech synthesis processing of the above embodiment, the speech waveform data SV is generated from the voice input while the voice input device 10 executes the karaoke performance process and plays the target music, but the speech waveform data SV of the present invention is not limited to this.

That is, in the present invention, the voice input device 10 may generate the speech waveform data SV using the after-recording function that is well known in karaoke devices and the like. A voice input device (karaoke device) with an after-recording function holds, as data on the dialogue to be uttered, dialogue telop data representing the characters that constitute the dialogue (hereinafter, dialogue constituent characters), which is data similar to the lyrics telop data, and dialogue output data defining the timing at which the dialogue constituent characters are displayed on the display unit 13, which is data similar to the lyrics output data.

Accordingly, when the speech waveform data SV is acquired using the after-recording function, the voice input device 10 may display the dialogue based on the dialogue telop data on the display unit 13 and store, in the MIDI storage server 25 as speech waveform data SV, the speech waveform input via the voice input unit 14 while that dialogue is displayed on the display unit 13.

In this case, the information processing apparatus 30 may treat the speech waveform data SV generated using the after-recording function as the target of the speech parameter registration process. That is, in the speech parameter registration process, S110 is omitted, the dialogue telop data and the dialogue output data are acquired in S120, and the syllable waveforms are specified in S140 based on the acquired dialogue telop data and dialogue output data. One way to specify the syllable waveforms in S140 is to assume that each dialogue constituent character is uttered in the speech waveform data SV at the timing, defined by the dialogue output data, at which that character is displayed on the display unit 13.
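The timing-based specification of syllable waveforms in S140 can be sketched as below. The sketch assumes that the waveform is a plain list of samples, that display timings are given in seconds from the start of the recording, and that each syllable lasts until the next character is displayed (the last one runs to the end of the recording). The end-of-segment rule and all names are assumptions of this sketch; the source only states that each character is taken to be uttered at its display timing.

```python
def extract_syllable_waveforms(samples, sample_rate, display_times):
    """Cut one waveform segment per character, starting at its display timing.

    `display_times` is a list of (character, start_seconds) pairs sorted by time.
    Assumption: a segment ends where the next character's display begins.
    """
    total_seconds = len(samples) / sample_rate
    segments = []
    for index, (character, start) in enumerate(display_times):
        if index + 1 < len(display_times):
            end = display_times[index + 1][1]
        else:
            end = total_seconds
        first = int(round(start * sample_rate))
        last = int(round(end * sample_rate))
        segments.append((character, samples[first:last]))
    return segments

# Toy example: 10 samples at 10 samples per second, three displayed characters.
samples = list(range(10))
segments = extract_syllable_waveforms(samples, 10, [("a", 0.0), ("b", 0.3), ("c", 0.7)])
print(segments)  # [('a', [0, 1, 2]), ('b', [3, 4, 5, 6]), ('c', [7, 8, 9])]
```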

In the above embodiment, a karaoke device is assumed as the voice input device 10, but the voice input device 10 is not limited to a karaoke device. For example, it may be a well-known portable terminal (a mobile phone or a portable information terminal) or a well-known information processing apparatus (a so-called personal computer).

The speech synthesis system of the above embodiment includes the MIDI storage server 25, but the speech synthesis system of the present invention need not include it. In that case, the music data MD and the speech waveform data SV may be stored in the storage unit 17 of the voice input device 10, in the data storage server 50, or in the storage unit 34 of the information processing apparatus 30.

Similarly, the speech synthesis system of the above embodiment includes the data storage server 50, but the speech synthesis system of the present invention need not include it. In that case, the speech parameters PM and the expression tables TD may be stored in the storage unit 34 of the information processing apparatus 30, in the storage unit 17 of the voice input device 10, or in the MIDI storage server 25.

The content of the speech synthesis process executed by the voice output terminal 60 is not limited to that described in the above embodiment. For example, the speech synthesis process executed by the voice output terminal 60 may consist of only two steps, S310 and S360. In that case, however, steps S320 to S350 must be executed by the information processing apparatus 30 or the like. That is, the voice output terminal 60 acquires the input information, transmits it to the information processing apparatus 30, and outputs the result of the speech synthesis performed by the information processing apparatus 30 (that is, the synthesized sound), while the information processing apparatus 30 may acquire the expression table TD and the speech parameters PM and perform speech synthesis so as to match the input information.
[Correspondence between Embodiment and Claims]
Finally, the relationship between the description of the above embodiment and the description of the scope of claims will be described.

S110 in the speech parameter registration process of the above embodiment corresponds to the timing information acquisition means recited in the claims, and in particular to the score data acquisition means; S120 corresponds to the content information acquisition means; and S130 corresponds to the waveform acquisition means. Further, in the speech parameter registration process, S140 corresponds to the syllable waveform extracting means recited in the claims, S150 to the parameter deriving means, S160 to the metadata generation means, and S170 to the parameter registration means.

In the metadata estimation process of the first embodiment, S310 corresponds to the section specifying means recited in the claims, S320 to the main sound specifying means, S330 to the pitch name frequency deriving means, and S640 and S650 to the key estimation means. In the metadata estimation process of the second embodiment, S710 corresponds to the word dividing means recited in the claims, and S720 to the metadata extracting means.

In the speech synthesis process of the above embodiment, S510 corresponds to the output property information acquisition means and the word acquisition means recited in the claims, S520 to the table acquisition means, S540 and S550 to the speech synthesis means, and S560 to the output means.

DESCRIPTION OF SYMBOLS: 1 ... speech synthesis system, 10 ... voice input device, 11 ... communication unit, 12 ... input reception unit, 13 ... display unit, 14 ... voice input unit, 15 ... voice output unit, 16 ... sound source module, 17 ... storage unit, 20 ... control unit, 21, 41 ... ROM, 22, 42 ... RAM, 23, 43 ... CPU, 25 ... MIDI storage server, 30 ... information processing apparatus, 31 ... communication unit, 32 ... input reception unit, 33 ... display unit, 34 ... storage unit, 40 ... control unit, 50 ... data storage server, 60 ... voice output terminal, 61 ... information reception unit, 62 ... display unit, 63 ... sound output unit, 64 ... communication unit, 65 ... storage unit, 67 ... control unit, 100 ... word metadata database

Claims

A parameter extraction device comprising:
content information acquisition means for acquiring utterance content information representing a character string of content to be uttered;
timing information acquisition means for acquiring utterance timing information that specifies the utterance start timing of at least one character in the character string represented by specific content information, the specific content information being the utterance content information acquired by the content information acquisition means;
waveform acquisition means for acquiring a target waveform, which is a speech waveform uttered for the character string represented by the specific content information;
syllable waveform extracting means for extracting, from the target waveform acquired by the waveform acquisition means and based on at least the utterance timing information acquired by the timing information acquisition means, syllable waveforms, each being the speech waveform uttered for one of the syllables forming the character string represented by the specific content information;
parameter deriving means for deriving, from each syllable waveform extracted by the syllable waveform extracting means, a speech parameter that is at least one predefined feature quantity;
metadata generation means for estimating, based on the specific content information and the utterance timing information acquired by the timing information acquisition means, a property of the voice represented by the syllable waveform, and generating the estimation result as metadata;
parameter registration means for associating the speech parameter derived by the parameter deriving means with the metadata generated by the metadata generation means for each corresponding syllable and storing them in a first storage device; and
score data acquisition means for acquiring score data that represents the score of a target music piece, which is one of the music pieces, and in which at least a pitch and a performance start timing are defined for each output sound output from a sound source module,
wherein the content information acquisition means acquires, as the utterance content information, a character string of lyrics constituent characters constituting the lyrics of the target music piece,
the timing information acquisition means acquires, as the utterance timing information, a lyrics output timing that is the output timing of at least one of the lyrics constituent characters and is associated with the performance start timing of the output sound corresponding to that lyrics constituent character,
the waveform acquisition means acquires, as the target waveform, a waveform representing how the voice input during the performance of the target music piece based on the score data changes along the time axis,
the syllable waveform extracting means extracts, as the syllable waveform, the speech waveform in the section of the target waveform corresponding to each output sound,
the score data includes, when the key changes within the target music piece, a modulation flag indicating the position on the time axis, with the performance start time of the target music piece as the origin, at which the key changes,
the metadata generation means includes:
section specifying means for specifying, based on the score data acquired by the score data acquisition means, same-key sections, each being a section of the target music piece in which the same key continues;
main sound specifying means for specifying, as the main sound of each same-key section specified by the section specifying means, the last output sound along the time axis in that same-key section;
pitch name frequency deriving means for deriving, for each same-key section and starting from the pitch name of the main sound specified by the main sound specifying means, an appearance pitch name frequency representing the frequency of output sounds of the same pitch name included in the same-key section specified by the section specifying means; and
key estimation means for comparing each appearance pitch name frequency derived by the pitch name frequency deriving means with key templates prepared in advance for each key, each key template representing the distribution of pitch names that can be used in that key, and using, as the metadata, the voice property corresponding to the key with the highest correlation, and
the key estimation means uses the voice property "bright" as the metadata if the key with the highest correlation is a major key, and uses the voice property "dark" as the metadata if the key with the highest correlation is a minor key.

The parameter extraction device according to claim 1, wherein
the content information acquisition means acquires, as the utterance content information, sentence constituent characters, which are a character string constituting at least one sentence,
the timing information acquisition means acquires, as the utterance timing information, information representing the output timing at which at least one of the sentence constituent characters is output to the outside, and
the waveform acquisition means acquires, as the target waveform, a waveform representing how the voice input during the output of the character string constituting the sentence constituent characters based on the utterance content information changes along the time axis.

A speech synthesis system comprising:
a parameter extraction device including:
content information acquisition means for acquiring utterance content information representing a character string of content to be uttered;
timing information acquisition means for acquiring utterance timing information that specifies the utterance start timing of at least one character in the character string represented by specific content information, the specific content information being the utterance content information acquired by the content information acquisition means;
waveform acquisition means for acquiring a target waveform, which is a speech waveform uttered for the character string represented by the specific content information;
syllable waveform extracting means for extracting, from the target waveform acquired by the waveform acquisition means and based on at least the utterance timing information acquired by the timing information acquisition means, syllable waveforms, each being the speech waveform uttered for one of the syllables forming the character string represented by the specific content information;
parameter deriving means for deriving, from each syllable waveform extracted by the syllable waveform extracting means, a speech parameter that is at least one predefined feature quantity;
metadata generation means for estimating, based on the specific content information and the utterance timing information acquired by the timing information acquisition means, a property of the voice represented by the syllable waveform, and generating the estimation result as metadata;
parameter registration means for associating the speech parameter derived by the parameter deriving means with the metadata generated by the metadata generation means for each corresponding syllable and storing them in a first storage device; and
parameter analysis means for analyzing the speech parameters stored in the first storage device for each item of metadata associated with them, generating a metadata correspondence table representing the range of each speech parameter corresponding to that metadata, and storing the table in a second storage device; and
a synthesized sound output device including:
output property information acquisition means for acquiring output property information that is input from the outside and represents a voice property;
word acquisition means for acquiring output text representing wording input from the outside;
table acquisition means for acquiring, from the second storage device, the metadata correspondence table containing the metadata corresponding to the output property information acquired by the output property information acquisition means, and acquiring, from the first storage device, speech parameters having information corresponding to the output property information;
speech synthesis means for performing speech synthesis on the speech parameters acquired by the table acquisition means, in accordance with the metadata correspondence table, so as to produce the output text acquired by the word acquisition means; and
output means for outputting the synthesized sound generated by the speech synthesis performed by the speech synthesis means,
wherein the parameter extraction device further includes score data acquisition means for acquiring score data that represents the score of a target music piece, which is one of the music pieces, and in which at least a pitch and a performance start timing are defined for each output sound output from a sound source module,
the content information acquisition means acquires, as the utterance content information, a character string of lyrics constituent characters constituting the lyrics of the target music piece,
the timing information acquisition means acquires, as the utterance timing information, a lyrics output timing that is the output timing of at least one of the lyrics constituent characters and is associated with the performance start timing of the output sound corresponding to that lyrics constituent character,
the waveform acquisition means acquires, as the target waveform, a waveform representing how the voice input during the performance of the target music piece based on the score data changes along the time axis,
the syllable waveform extracting means extracts, as the syllable waveform, the speech waveform in the section of the target waveform corresponding to each output sound,
the score data includes, when the key changes within the target music piece, a modulation flag indicating the position on the time axis, with the performance start time of the target music piece as the origin, at which the key changes,
the metadata generation means includes:
section specifying means for specifying, based on the score data acquired by the score data acquisition means, same-key sections, each being a section of the target music piece in which the same key continues;
main sound specifying means for specifying, as the main sound of each same-key section specified by the section specifying means, the last output sound along the time axis in that same-key section;
pitch name frequency deriving means for deriving, for each same-key section and starting from the pitch name of the main sound specified by the main sound specifying means, an appearance pitch name frequency representing the frequency of output sounds of the same pitch name included in the same-key section specified by the section specifying means; and
key estimation means for comparing each appearance pitch name frequency derived by the pitch name frequency deriving means with key templates prepared in advance for each key, each key template representing the distribution of pitch names that can be used in that key, and using, as the metadata, the voice property corresponding to the key with the highest correlation, and
the key estimation means uses the voice property "bright" as the metadata if the key with the highest correlation is a major key, and uses the voice property "dark" as the metadata if the key with the highest correlation is a minor key.