JP2016051036A

JP2016051036A - Voice synthesis system and voice synthesis device

Info

Publication number: JP2016051036A
Application number: JP2014175831A
Authority: JP
Inventors: Noriaki Asemi (阿瀬見 典昭)
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2014-08-29
Filing date: 2014-08-29
Publication date: 2016-04-11
Anticipated expiration: 2034-08-29
Also published as: JP6260499B2

Abstract

PROBLEM TO BE SOLVED: To provide a technique that simplifies generation of a virtual vocal sound.

SOLUTION: In the sound source data generation process, voice waveform data WD is grouped by similarity of voice quality, and a representative value of the voice parameter P of the note vocals Vo included in each group is derived for each common note property p. A classification key Xq, which is the representative value of the voice quality feature amount M in the group, is then associated with the representative values of the voice parameter P derived for each note property p, thereby generating sound source data SD. In the speech synthesis process, an input voice received via a microphone is analyzed to derive an input voice quality Yk (S530). From the sound source data SD stored in a storage unit, the sound source data SD whose classification key Xq is closest to the input voice quality Yk is identified (S540), and synthesized speech generated according to the identified sound source data SD is output (S570, S580).

SELECTED DRAWING: Figure 4

Description

The present invention relates to a technique for generating synthesized speech.

Conventionally, there is known a karaoke apparatus that plays a piece of music and generates and outputs, as synthesized speech, a voice singing at least one of a plurality of singing melodies (see Patent Document 1). With this karaoke apparatus, the user sings at least one of the singing melodies, and the synthesized speech takes the other singing melodies.

Patent Document 1: JP 2011-154290 A

Meanwhile, for karaoke apparatuses, it has been considered to generate and output synthesized speech that sounds as if the user had sung it (hereinafter referred to as a "virtual vocal sound"), without the user of the karaoke apparatus actually singing.

Sound source data is generated by analyzing speech actually uttered by a person. Consequently, generating a virtual vocal sound solely from sound source data obtained from the utterances of a single user of the karaoke apparatus requires an enormous amount of sound source data. Generating that much sound source data in turn requires collecting an enormous amount of voice data uttered by that one user.

In general, it is difficult to collect an enormous amount of voice data from a single user, so generating a virtual vocal sound is difficult with the conventional technique.
In short, the conventional technique has the problem that a virtual vocal sound is difficult to generate.

Accordingly, an object of the present invention is to provide a technique that simplifies the generation of a virtual vocal sound.

A speech synthesis system of the present invention, made to achieve the above object, includes voice data acquisition means, analysis means, classification means, storage control means, input reception means, voice quality analysis means, search means, and synthesis means.

Of these, the voice data acquisition means acquires at least two pieces of voice data that differ in at least one of the person who uttered them and the lyrics. The voice data referred to here represents a voice waveform of an utterance of lyrics assigned to at least some of a plurality of notes, each note being a combination of a pitch and a note value.

The analysis means derives a voice quality feature amount representing the voice quality in the voice data acquired by the voice data acquisition means. The classification means classifies the voice data into at least two groups based on the distribution of the voice quality feature amounts derived by the analysis means.

For each group of voice data classified by the classification means, the storage control means derives, from the voice data included in that group, a representative value of a voice feature amount, which represents a feature of the voice for each combination of lyric syllable and note type. It then stores in a storage device sound source data that associates the derived representative value of the voice feature amount with the voice quality feature amount corresponding to that representative value.

The input reception means receives voice input. The voice quality analysis means analyzes the input voice received by the input reception means and derives an input voice quality, which is the voice quality feature amount of the input voice. The search means identifies, among the sound source data stored in the storage device, the sound source data whose voice quality feature amount is most similar to the input voice quality derived by the voice quality analysis means. The synthesis means outputs synthesized speech generated according to the sound source data identified by the search means.
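The search performed by the search means can be pictured as a nearest-neighbour lookup. The following is a minimal sketch only: the text does not specify how similarity is measured, so the Euclidean distance, the function name `find_closest_sound_source`, and the dictionary layout with `group` and `quality` fields are all assumptions made for illustration.

```python
import math


def find_closest_sound_source(input_quality, sound_sources):
    """Return the stored entry whose voice quality vector is nearest to the input.

    Assumption: "most similar" is taken to mean the smallest Euclidean distance;
    the document itself does not name a similarity measure.
    """
    if not sound_sources:
        raise ValueError("no sound source data stored")
    return min(sound_sources, key=lambda s: math.dist(s["quality"], input_quality))


# Hypothetical stored sound source data, one entry per group.
stored = [
    {"group": 1, "quality": [0.1, 0.9, 0.3]},
    {"group": 2, "quality": [0.8, 0.2, 0.5]},
]

best = find_closest_sound_source([0.7, 0.3, 0.4], stored)
print(best["group"])
```

Any other distance or similarity measure could be substituted by changing the `key` function; only the "pick the closest stored entry" structure comes from the text.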

The speech synthesis system of the present invention generates synthesized speech using voice parameters (sound source data) whose voice quality resembles the user's own voice. In the present invention, such sound source data is derived from the voice data of multiple people, not just that of a single user.

That is, the speech synthesis system of the present invention uses, as the voice data needed to generate sound source data, not only voice data uttered by the user of the karaoke apparatus but also voice data uttered by persons other than that user.

Therefore, the speech synthesis system of the present invention makes it easy to collect the voice data needed for speech synthesis and reduces the difficulty of generating a virtual vocal sound. As a result, the present invention simplifies the generation of a virtual vocal sound.

Furthermore, in the speech synthesis system of the present invention, music data acquisition means may acquire music data representing a designated piece of music, the music data including score data in which a plurality of notes are arranged along a time axis and lyrics data representing lyrics assigned to at least some of those notes.

In this case, the synthesis means may output, based on the music data acquired by the music data acquisition means, synthesized speech singing the lyrics included in that music data.
With such a speech synthesis system, a piece of music can be sung in synthesized speech whose voice is close to the user's voice quality.

The present invention may also be implemented as a speech synthesis device. The speech synthesis device of the present invention includes input reception means, voice quality analysis means, search means, and synthesis means.
The input reception means receives voice input. The voice quality analysis means analyzes the input voice received by the input reception means and derives an input voice quality, which is a voice quality feature amount representing the voice quality of the input voice.

The search means identifies, among the sound source data stored in a storage device, the sound source data whose voice quality feature amount is most similar to the input voice quality derived by the voice quality analysis means. The sound source data is stored in the storage device by executing a voice data acquisition step, an analysis step, a classification step, and a storage control step.

In the voice data acquisition step, at least two pieces of voice data that differ in at least one of the person who uttered them and the lyrics are acquired. In the analysis step, a voice quality feature amount representing the voice quality in the voice data acquired in the voice data acquisition step is derived. In the classification step, the voice data is classified into at least two groups based on the distribution of the voice quality feature amounts derived in the analysis step. In the storage control step, for each group of voice data classified in the classification step, a representative value of a voice feature amount, which represents a feature of the voice for each combination of syllable and note type, is derived from the voice data included in that group, and sound source data that associates the derived representative value with the voice quality feature amount corresponding to it is stored in the storage device.

The synthesis means of the speech synthesis device then outputs synthesized speech generated according to the sound source data identified by the search means.
Such a speech synthesis device provides the same effects as the invention according to claim 1.

FIG. 1 is a block diagram showing the overall configuration of a speech synthesis system to which the present invention is applied.
FIG. 2 is a flowchart showing the procedure of the sound source data generation process.
FIG. 3 is an explanatory diagram illustrating sound source data.
FIG. 4 is a flowchart showing the procedure of the speech synthesis process.

Embodiments of the present invention will be described below with reference to the drawings.
<Speech synthesis system>
The speech synthesis system 1 shown in FIG. 1 generates and outputs synthesized speech that sings a piece of music designated by the user (hereinafter referred to as the designated music) in a voice similar to the user's.

To realize this, the speech synthesis system 1 includes an information processing device 2, an information processing server 10, and a karaoke apparatus 30.
The information processing device 2 generates the sound source data SD needed to generate synthesized speech (that is, for speech synthesis), based on voice waveform data WD containing speech uttered by a person and MIDI music MD representing the content of that utterance. As described in detail later, the sound source data SD is data containing the voice parameters used for so-called formant synthesis. These voice parameters include, for example, the fundamental frequency (f0), mel-frequency cepstrum (MFCC), and power of each syllable in the uttered speech. The voice parameters are an example of the voice feature amount recited in the claims.

The information processing server 10 stores at least the sound source data SD generated by the information processing device 2 and the MIDI music MD.
The karaoke apparatus 30 plays the MIDI music MD stored in the information processing server 10 and, according to the sound source data SD, generates and outputs synthesized speech singing the music corresponding to that MIDI music MD. The speech synthesis system 1 includes a plurality of karaoke apparatuses 30.
<Voice waveform data>
The voice waveform data WD is audio data representing the performance sound of a piece of music, and is associated with music management information describing information about that music. The music management information includes music identification information (hereinafter referred to as a music ID) that identifies the music.

The voice waveform data WD of the present embodiment contains, as performance sound, an accompaniment sound produced by playing at least one musical instrument and at least a singing sound sung by a person. Each piece of voice waveform data WD differs from the others in the person who sang or in the music (lyrics).

The voice waveform data WD may consist of an audio file in an uncompressed audio file format or an audio file in a compressed audio format. It may be generated by recording the voice while a user sings a piece of music, or by some other method.

The voice waveform data WD in the present embodiment is an example of the voice data recited in the claims.
<MIDI music>
The MIDI music MD is prepared in advance for each piece of music and contains music data and lyrics data.

Of these, the music data represents the score of one piece of music in accordance with the well-known MIDI (Musical Instrument Digital Interface) standard. The music data contains at least a music ID and score tracks, each representing the score for one of the instruments used in the music.

For each performance sound output from a MIDI sound source, a score track specifies at least the pitch (the so-called note number) and the period during which the MIDI sound source outputs the performance sound (hereinafter referred to as the note length). The note length in a score track is defined by a performance start timing (the so-called note-on timing), which is the time from the start of the performance of the music until output of the performance sound begins, and a performance end timing (the so-called note-off timing), which is the time from the start of the performance of the music until output of the performance sound ends.

That is, in a score track, one note NO is defined by a note number and a note length given by the note-on timing and the note-off timing. A score track functions as one score by having its notes NO arranged in performance order. Score tracks are prepared for each instrument, such as keyboard, string, percussion, and wind instruments. In the present embodiment, a specific instrument (for example, the vibraphone) is designated as the instrument that carries the singing melody of the music.
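As an illustration of how one note NO and a score track relate, the following minimal sketch models a note by its note number, note-on timing, and note-off timing, and derives the note length from the two timings. The class name `Note`, the helper `build_track`, and the use of seconds as the time unit are assumptions for illustration and do not appear in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    """One note NO: a pitch plus the two timings that define its note length."""
    note_number: int  # pitch as a MIDI note number
    note_on: float    # seconds from the start of the performance (assumed unit)
    note_off: float   # seconds from the start of the performance (assumed unit)

    @property
    def length(self) -> float:
        # The note length is the span between note-on and note-off.
        return self.note_off - self.note_on


def build_track(notes):
    """Arrange notes in performance order so that they act as one score."""
    return sorted(notes, key=lambda n: n.note_on)


track = build_track([Note(62, 1.0, 1.5), Note(60, 0.0, 0.5)])
for note in track:
    print(note.note_number, note.length)
```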

The lyrics data, on the other hand, is data about the lyrics of the music and includes lyrics telop data and lyrics output data. The lyrics telop data represents the characters that make up the lyrics of the music (hereinafter referred to as lyrics constituent characters). The lyrics output data defines a timing correspondence that associates the lyrics output timing, which is the timing at which the lyrics constituent characters are output, with the performance of the performance data.

Specifically, in the timing correspondence of the present embodiment, the timing at which output of the lyrics telop data starts is associated with the timing at which performance of the performance data starts. In addition, the lyrics output timing of each lyrics constituent character along the time axis of the music is defined by the elapsed time from the start of the performance of the performance data. In this way, each performance sound defined in the score track (that is, each note NO) is associated with a lyrics constituent character.
<Information processing device>
The information processing device 2 is a well-known information processing device (for example, a personal computer) that includes an input reception unit 3, an external output unit 4, a storage unit 5, and a control unit 6.

The input reception unit 3 is an input device that receives information and commands from outside. Examples of such input devices include keys and switches, a reading drive that reads data stored on a portable storage medium (for example, a CD, DVD, or flash memory), and a communication port that acquires information via a communication network. The external output unit 4 is an output device that outputs information to the outside. Examples of such output devices include a writing drive that writes data to a portable storage medium and a communication port that outputs information to a communication network.

The storage unit 5 is a well-known storage device whose contents can be read and written. The storage unit 5 stores at least two pieces of voice waveform data WD, each associated with the MIDI music MD representing the content uttered in that voice waveform data WD. The symbol "l" in FIG. 1 is an identifier for the voice waveform data WD and is assigned per user and per piece of music sung by that user; it is a natural number of 2 or more. The symbol "o" in FIG. 1 is an identifier for the MIDI music MD and is assigned per piece of music; it is a natural number of 2 or more.

The control unit 6 is a well-known control device built around a well-known microcomputer that includes a ROM 7, a RAM 8, and a CPU 9. The ROM 7 stores processing programs and data whose contents must be retained even when the power is turned off. The RAM 8 temporarily stores processing programs and data. The CPU 9 executes each process according to the processing programs stored in the ROM 7 or the RAM 8.

The ROM 7 of the present embodiment stores a processing program with which the control unit 6 executes the sound source data generation process, which generates the sound source data SD based on the voice waveform data WD and the MIDI music MD stored in the storage unit 5.
<Information processing server>
The information processing server 10 includes a communication unit 12, a storage unit 14, and a control unit 16.

Of these, the communication unit 12 allows the information processing server 10 to communicate with external devices via a communication network. That is, the information processing server 10 is connected to the karaoke apparatus 30 via the communication network. The communication network may be wired or wireless.

The storage unit 14 is a well-known storage device whose contents can be read and written. The storage unit 14 stores at least a plurality of pieces of MIDI music MD. The symbol "n" shown in FIG. 1 is an identifier for the MIDI music MD stored in the storage unit 14 of the information processing server 10 and is assigned per piece of music; it is a natural number of 1 or more. The storage unit 14 also stores the sound source data SD generated when the information processing device 2 executes the sound source data generation process. The symbol "m" shown in FIG. 1 is an identifier for the sound source data SD stored in the storage unit 14 and is assigned per group, as described in detail later; it is a natural number of 2 or more.

The control unit 16 is a well-known control device built around a well-known microcomputer that includes a ROM 18, a RAM 20, and a CPU 22. The ROM 18 stores processing programs and data whose contents must be retained even when the power is turned off. The RAM 20 temporarily stores processing programs and data. The CPU 22 executes each process according to the processing programs stored in the ROM 18 or the RAM 20.
<Karaoke apparatus>
The karaoke apparatus 30 includes a communication unit 32, an input reception unit 34, a music playback unit 36, a storage unit 38, an audio control unit 40, a video control unit 46, and a control unit 50.

The communication unit 32 allows the karaoke apparatus 30 to communicate with external devices via a communication network. The input reception unit 34 is an input device that receives information and commands in response to external operations. Examples of such input devices include keys, switches, and a receiver for a remote control.

The music playback unit 36 plays music based on the MIDI music MD downloaded from the information processing server 10. The music playback unit 36 is, for example, a MIDI sound source. The audio control unit 40 is a device that controls audio input and output, and includes an output unit 42 and a microphone input unit 44.

A microphone 62 is connected to the microphone input unit 44, so the microphone input unit 44 acquires the voice input through the microphone 62. A speaker 60 is connected to the output unit 42. The output unit 42 outputs to the speaker 60 the sound source signal of the music played by the music playback unit 36 and the sound source signal of the singing sound from the microphone input unit 44. The speaker 60 converts the sound source signal output from the output unit 42 into sound and outputs it.

The video control unit 46 outputs video or images based on the video data sent from the control unit 50. A display unit 64 that displays video or images is connected to the video control unit 46.

The control unit 50 is built around a well-known computer that has at least a ROM 52, a RAM 54, and a CPU 56. The ROM 52 stores processing programs and data whose contents must be retained even when the power is turned off. The RAM 54 temporarily stores processing programs and data. The CPU 56 executes each process according to the processing programs stored in the ROM 52 or the RAM 54.

The ROM 52 of the present embodiment stores a processing program with which the control unit 50 executes the speech synthesis process. The speech synthesis process generates and outputs synthesized speech that sings the music designated by the user in a voice whose quality is similar to the user's voice.
<Sound source data generation process>
The sound source data generation process executed by the control unit 6 of the information processing device 2 will now be described.

As shown in FIG. 2, when the sound source data generation process is started, the control unit 6 acquires the MIDI music MD containing the music ID designated via the input reception unit 3 (S110). The control unit 6 then acquires, from all the voice waveform data WD stored in the storage unit 5, one piece of voice waveform data WD associated with the music ID acquired in S110 (S120).

In the sound source data generation process, the control unit 6 suppresses the accompaniment sound contained in the voice waveform data WD acquired in S120 (S130). Any well-known method may be used to suppress the accompaniment sound in the present embodiment. It may be a method that emphasizes the singing sound contained in the voice waveform data WD, or a method that removes from the voice waveform data WD the instrument performance sound represented by the MIDI music MD.

Further, in the sound source data generation process, the control unit 6 identifies the note vocals Vo(a, i) based on the voice waveform data WD whose accompaniment sound was suppressed in S130 and the MIDI music MD acquired in S110 (S140). A note vocal Vo(a, i) is the section of the voice waveform data WD that corresponds to a note NO(a, i) that forms part of the singing melody and has lyrics assigned to it. In S140, the control unit 6 identifies the note vocals Vo(a, i) by matching the performance start timing nnt(a, i) and the performance end timing nft(a, i) in the MIDI music MD against the voice waveform data WD acquired in S120.

In the present embodiment, the symbol a identifies the music and the symbol i identifies a note NO of the singing melody in that music. Each note vocal Vo(a, i) identified in S140 is stored in the storage unit 5 in association with the vowel of the lyrics assigned to the corresponding note NO(a, i).
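The identification of note vocals in S140 can be sketched as cutting the waveform at each note's start and end timings and tagging each section with its vowel. This is only an illustrative sketch: the function name `extract_note_vocals`, the tuple layout of `notes`, timings in seconds, and the rounding of timings to sample indices are all assumptions not stated in the text.

```python
def extract_note_vocals(samples, sample_rate, notes):
    """Cut one waveform section per note and keep the note's vowel with it.

    samples     -- sequence of waveform samples (accompaniment already suppressed)
    sample_rate -- samples per second
    notes       -- iterable of (start_sec, end_sec, vowel), a hypothetical layout
                   standing in for nnt(a, i), nft(a, i) and the assigned vowel
    """
    vocals = []
    for start_sec, end_sec, vowel in notes:
        # Convert the timings to sample indices (rounding is an assumption).
        start = int(round(start_sec * sample_rate))
        end = int(round(end_sec * sample_rate))
        vocals.append({"vowel": vowel, "samples": samples[start:end]})
    return vocals


waveform = list(range(10))  # toy waveform with 10 samples at 10 Hz
for vocal in extract_note_vocals(waveform, 10, [(0.0, 0.3, "a"), (0.5, 0.9, "i")]):
    print(vocal["vowel"], vocal["samples"])
```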

Further, in the sound source data generation process, the control unit 6 sets a plurality of analysis windows in each note vocal Vo(a, i) (S150). In S150, the control unit 6 sets the analysis windows so that they are adjacent to one another along the time axis. Each analysis window is a section shorter than the duration of the note NO(a, i).
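One way to picture the window setting in S150 is as back-to-back index ranges within a note vocal. The sketch below assumes that "adjacent" means non-overlapping windows of equal length and that trailing samples too short to fill a window are dropped; neither point is stated in the text, and the function name `set_analysis_windows` is hypothetical.

```python
def set_analysis_windows(segment_len, window_len):
    """Return (start, end) sample index pairs of back-to-back analysis windows.

    Assumptions: equal-length, non-overlapping windows; a trailing remainder
    shorter than one window is ignored.
    """
    if not 0 < window_len < segment_len:
        # Each window must be shorter than the note it analyzes.
        raise ValueError("window_len must be positive and shorter than the segment")
    return [
        (start, start + window_len)
        for start in range(0, segment_len - window_len + 1, window_len)
    ]


print(set_analysis_windows(10, 4))
```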

Subsequently, in the sound source data generation process, the control unit 6 calculates the voice quality feature amount M(a, i) of each note vocal Vo(a, i) (S160). The voice quality feature amount M is a feature amount representing the voice quality of the person who produced the utterance. In S160, the control unit 6 first performs frequency analysis (for example, a DFT) on each of the analysis windows set for the note vocal Vo(a, i) in S150. The control unit 6 then performs cepstrum analysis on the result of the frequency analysis (the frequency spectrum) and thereby calculates the mel-frequency cepstral coefficients (MFCC) of each analysis window as the voice quality feature amount M(a, i).
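The S160 computation can be sketched in plain Python as a DFT, a triangular mel filterbank, a logarithm, and a DCT. This is only one common way to obtain MFCC. The filter count, the number of coefficients, the Hz-to-mel formula, and all function names are assumptions for illustration, and the naive DFT is meant for short windows only.

```python
import cmath
import math
from typing import List, Sequence


def hz_to_mel(hz: float) -> float:
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(window: Sequence[float]) -> List[float]:
    """Naive DFT power spectrum for bins 0 .. N/2."""
    n = len(window)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(window))
        spec.append(abs(acc) ** 2)
    return spec


def mel_filterbank(n_filters: int, n_fft: int, sample_rate: int) -> List[List[float]]:
    """Triangular filters whose edges are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    # Filter edges expressed as (fractional) DFT bin positions.
    points = [mel_to_hz(top * m / (n_filters + 1)) * n_fft / sample_rate
              for m in range(n_filters + 2)]
    bank = []
    for m in range(1, n_filters + 1):
        left, center, right = points[m - 1], points[m], points[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if left < k <= center:
                row.append((k - left) / (center - left))
            elif center < k < right:
                row.append((right - k) / (right - center))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(window: Sequence[float], sample_rate: int,
         n_filters: int = 20, n_coeffs: int = 12) -> List[float]:
    spec = power_spectrum(window)
    bank = mel_filterbank(n_filters, len(window), sample_rate)
    # Small constant avoids log(0) for filters that cover no DFT bin.
    log_energy = [math.log(sum(w * s for w, s in zip(row, spec)) + 1e-10) for row in bank]
    # DCT-II of the log filterbank energies gives the cepstral coefficients.
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / n_filters)
                for m, e in enumerate(log_energy))
            for c in range(n_coeffs)]
```

Calling `mfcc` on every analysis window of a note vocal yields one feature vector per window, which corresponds to the role of M(a, i) in the text.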

The control unit 6 then assigns a classification key Xq to the speech waveform data WD acquired in S120 and, by extension, to each of the note vocals Vo(a, i) identified in S140 (S170). The classification key Xq is a vector made up of the representative values of the voice quality feature amount M, one for each vowel of the lyrics.

Specifically, in S170, the control unit 6 first extracts all voice quality feature amounts M that share the same lyric vowel and calculates their arithmetic mean. The control unit 6 takes this arithmetic mean as the representative value of the voice quality feature amount M. This calculation of the representative value is performed for each vowel of the lyrics. The control unit 6 then generates, as the classification key Xq, vector data in which the representative values of the voice quality feature amount M are arranged in the order of the lyric vowels. The control unit 6 assigns the generated classification key Xq to the speech waveform data WD acquired in S120 and, by extension, to each of the note vocals Vo(a, i) identified in S140.
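The construction of the classification key in S170 amounts to a per-vowel mean followed by concatenation. The sketch below assumes a fixed vowel order of a, i, u, e, o and zero-fills any vowel that does not occur; both choices, like the names used, are assumptions for illustration.

```python
from collections import defaultdict
from typing import List, Sequence, Tuple

VOWEL_ORDER = ("a", "i", "u", "e", "o")  # assumed ordering of the lyric vowels


def classification_key(features: Sequence[Tuple[str, Sequence[float]]],
                       dim: int) -> List[float]:
    """Build Xq from (vowel, feature vector M) pairs of one speech waveform."""
    by_vowel = defaultdict(list)
    for vowel, vec in features:
        by_vowel[vowel].append(vec)
    key: List[float] = []
    for vowel in VOWEL_ORDER:
        vecs = by_vowel.get(vowel)
        if vecs:
            # Arithmetic mean of all feature vectors sharing this vowel.
            mean = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
        else:
            mean = [0.0] * dim  # assumption: missing vowels are zero-filled
        key.extend(mean)
    return key
```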

Subsequently, in the sound source data generation process, the control unit 6 determines whether the processes from S120 to S170 have been executed for all of the speech waveform data WD associated with the MIDI music piece MD acquired in S110 (S180). If the determination in S180 is that the processes have not been executed for all of the speech waveform data WD (S180: NO), the control unit 6 returns the sound source data generation process to S120. In S120, the control unit 6 acquires one piece of speech waveform data WD from among the speech waveform data WD that is associated with the MIDI music piece MD acquired in S110 and has not yet undergone the processes from S120 to S180. The control unit 6 then executes the steps from S130 to S180.

On the other hand, if the determination in S180 is that the processes have been executed for all of the speech waveform data WD (S180: YES), the control unit 6 determines whether all of the MIDI music pieces MD stored in the storage unit 5 have been acquired (S190). If the determination in S190 is that the processes have not been executed for all of the MIDI music pieces MD (S190: NO), the control unit 6 returns the sound source data generation process to S110. In S110, the control unit 6 acquires one MIDI music piece MD from among those that have not yet been processed. The sound source data generation process then repeats S120 to S190.

If the determination in S190 is that the processes have been executed for all of the MIDI music pieces MD (S190: YES), the control unit 6 advances the sound source data generation process to S200.

In S200, the control unit 6 groups the speech waveform data WD stored in the storage unit 5 and, by extension, the note vocals Vo, according to the classification key Xq. This grouping may be performed with a known clustering technique (for example, the k-means method) that takes the classification keys Xq as its data. The grouping in S200 therefore forms groups (clusters) in which the speech waveform data WD (and, by extension, the note vocals Vo) are classified so that speech waveform data WD with similar voice quality feature amounts M fall into the same group.
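The text names k-means as one possible clustering technique for S200. A minimal Lloyd-style sketch over classification keys might look as follows; the random initialization, the iteration count, and the function names are assumptions, and a production system would more likely rely on an established library.

```python
import random
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(keys: Sequence[Sequence[float]], n_groups: int,
            n_iter: int = 50, seed: int = 0) -> Tuple[List[int], List[List[float]]]:
    """Group classification keys Xq; returns (group label per key, group centroids)."""
    rng = random.Random(seed)
    centroids = [list(k) for k in rng.sample(list(keys), n_groups)]
    labels = [0] * len(keys)
    for _ in range(n_iter):
        # Assignment step: each key joins the group with the nearest centroid.
        labels = [min(range(n_groups), key=lambda g: squared_distance(k, centroids[g]))
                  for k in keys]
        # Update step: each centroid becomes the mean of its members.
        for g in range(n_groups):
            members = [k for k, lab in zip(keys, labels) if lab == g]
            if members:
                centroids[g] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Under this sketch, the centroid of each group could serve as the classification key associated with that group, although the embodiment does not spell out how that key is chosen.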

Subsequently, in the sound source data generation process, the control unit 6 selects one of the groups generated in S200 (S210). The control unit 6 then acquires one note vocal Vo(a, i) included in the group selected in S210 (S220).

Further, in the sound source data generation process, the control unit 6 calculates the speech parameter P(a, i) of the note vocal Vo(a, i) acquired in S220 (S230). The speech parameter P derived in S230 of the present embodiment includes at least the fundamental frequency (f0), the mel-frequency cepstral coefficients (MFCC), the power, and the time difference of each of these. Because methods for deriving the fundamental frequency, the MFCC, and the power are well known, a detailed description is omitted here. For example, the fundamental frequency may be derived using the autocorrelation of the note vocal Vo along the time axis, the autocorrelation of the frequency spectrum of the note vocal Vo, or the cepstrum method. The MFCC may be derived by applying time analysis windows to the note vocal Vo, performing frequency analysis (for example, an FFT) on each window, taking the logarithm of the magnitude at each frequency, and then performing frequency analysis again on the result. The power may be derived by applying time analysis windows to the note vocal Vo, squaring the amplitude, and integrating the result in the time direction.
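Two of the quantities in S230 can be illustrated briefly: the power as a sum of squared amplitudes, and the fundamental frequency by time-axis autocorrelation, which is one of the options the text lists. The search range of 80 to 1000 Hz and the function names are assumptions, and no normalization or voicing decision is attempted.

```python
from typing import Optional, Sequence


def power(window: Sequence[float]) -> float:
    """Sum of squared amplitudes over one analysis window."""
    return sum(x * x for x in window)


def fundamental_frequency(window: Sequence[float], sample_rate: int,
                          f_min: float = 80.0, f_max: float = 1000.0) -> Optional[float]:
    """Estimate f0 as the lag with the largest time-axis autocorrelation.

    Returns None when no lag in the search range has positive correlation.
    """
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(len(window) - 1, int(sample_rate / f_min))
    best_lag, best_score = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        score = sum(window[t] * window[t + lag] for t in range(len(window) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None
```

The time differences mentioned in the text could then be formed by subtracting the values of consecutive windows.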

In the sound source data generation process, the control unit 6 also identifies the note property p(a, i) of the note NO(a, i) corresponding to the note vocal Vo(a, i) acquired in S220 (S240). Specifically, in S240 of the present embodiment, the control unit 6 extracts from the MIDI music piece MD the information on each note NO(a, i) defined in that MIDI music piece MD and identifies it as the note property p(a, i).

The note property p(a, i) includes a target note attribute, a preceding note attribute, and a following note attribute. The target note attribute is information representing the attributes of the note NO(a, i). It includes the scale (pitch), the note length, and the lyric syllable of the note NO(a, i). The preceding note attribute is information representing the attributes of the note NO(a, i-1) that immediately precedes the note NO(a, i) along the time axis (hereinafter, the preceding note). It includes the scale (pitch), the note length, and the lyric syllable of the preceding note NO(a, i-1), as well as the time length between the preceding note NO(a, i-1) and the note NO(a, i).

The following note attribute is information representing the attributes of the note NO(a, i+1) that immediately follows the target note NO(a, i) along the time axis (hereinafter, the following note). It includes the scale (pitch), the note length, and the lyric syllable, as well as the time length between the note NO(a, i) and the following note NO(a, i+1). The note lengths and the time lengths between notes in the note property p(a, i) may be quantized into predefined classes.
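One way to hold a note property p(a, i) in memory is an immutable record, so that equal properties can later be used as grouping keys. The sketch below is illustrative only: the class and field names, the use of an integer pitch, and the boundary-based quantizer are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class NoteAttr:
    pitch: int                       # scale (pitch); an integer code is assumed here
    length_class: int                # note length, quantized into a class
    syllable: str                    # lyric syllable
    gap_class: Optional[int] = None  # quantized time to/from the target note; None for the target


@dataclass(frozen=True)
class NoteProperty:
    target: NoteAttr
    preceding: Optional[NoteAttr]    # None for the first note of a melody
    following: Optional[NoteAttr]    # None for the last note of a melody


def quantize(duration_sec: float, boundaries: Sequence[float]) -> int:
    """Map a duration to the index of a predefined class.

    boundaries must be sorted in ascending order; class k covers durations
    from boundaries[k-1] (inclusive) up to boundaries[k] (exclusive).
    """
    return sum(1 for b in boundaries if duration_sec >= b)
```

Because the dataclasses are frozen, two notes with identical attributes compare equal and hash identically, which is what a "common note property" lookup needs.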

Subsequently, the control unit 6 associates the note property p(a, i) identified in S240 with the speech parameter P(a, i) calculated in S230 (S250). In S250, the control unit 6 stores the speech parameter P(a, i), associated with the note property p(a, i), in the storage unit 5.

Further, in the sound source data generation process, the control unit 6 determines whether the processes from S230 to S250 have been executed for all of the note vocals Vo included in the group selected in S210 (S260). If the determination in S260 is that there is a note vocal Vo for which the processes from S230 to S250 have not yet been executed (S260: NO), the control unit 6 returns the sound source data generation process to S220. In S220, the control unit 6 acquires one note vocal Vo(a, i) from among the note vocals Vo for which the processes from S230 to S250 have not yet been executed.

On the other hand, if the processes from S230 to S250 have been executed for all of the note vocals Vo included in the group selected in S210 (S260: YES), the control unit 6 calculates the representative values of the speech parameter P (S270). In S270, for the group selected in S210, the control unit 6 takes the arithmetic mean of the speech parameters P for each common note property p. The control unit 6 uses each resulting arithmetic mean as a representative value of the speech parameter P.
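The averaging in S270 is a group-by on the note property followed by an element-wise mean. A minimal sketch, assuming each speech parameter P is a fixed-length list of numbers and each note property is a hashable value (both assumptions, as is the function name):

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Sequence, Tuple


def representative_parameters(
        pairs: Iterable[Tuple[Hashable, Sequence[float]]]) -> Dict[Hashable, List[float]]:
    """Average the speech parameters P of one group per common note property p."""
    grouped = defaultdict(list)
    for prop, vec in pairs:
        grouped[prop].append(vec)
    # zip(*vecs) iterates over the parameter dimensions.
    return {prop: [sum(col) / len(vecs) for col in zip(*vecs)]
            for prop, vecs in grouped.items()}
```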

Further, in the sound source data generation process, the control unit 6 generates the sound source data SD and stores it in the storage unit 5 (S280). In S280, the control unit 6 generates sound source data SD such as that shown in FIG. 3 by associating the classification key Xq associated with the group selected in S210 with the representative values of the speech parameter P calculated in S270.

Subsequently, in the sound source data generation process, the control unit 6 determines whether the processes from S220 to S280 have been executed for all of the groups generated in S200 (S290). If the determination in S290 is that there is a group for which the processes from S220 to S250 have not yet been executed (S290: NO), the control unit 6 returns the sound source data generation process to S210. In S210, the control unit 6 selects one group from among the groups for which the processes from S220 to S280 have not yet been executed.

On the other hand, if the determination in S290 is that the processes from S220 to S250 have been executed for all of the groups (S290: YES), the control unit 6 ends the sound source data generation process.

As described above, the sound source data generation process groups the note vocals Vo by similarity of voice quality, according to the result of analyzing speech waveform data WD prepared in advance. For each group of similar voice quality, the process then derives the representative values of the speech parameter P of the note vocals Vo in that group, one for each common note property p. In addition, the process generates the sound source data SD by associating the classification key Xq, which is the representative value of the voice quality feature amount M in the group, with the representative values of the speech parameter P derived for each note property p.

The sound source data SD generated when the control unit 6 of the information processing device 2 executes the sound source data generation process may be stored in the storage unit 14 of the information processing server 10 by way of a portable storage medium. When the information processing device 2 and the information processing server 10 are connected via a communication network, the sound source data SD stored in the storage unit 5 of the information processing device 2 may instead be transferred over the communication network and stored in the storage unit 14 of the information processing server 10.
<Speech synthesis processing>
Next, the speech synthesis process executed by the control unit 50 of the karaoke apparatus 30 will be described.

As shown in FIG. 4, when the speech synthesis process is started, the control unit 50 acquires the music ID corresponding to the music piece designated via the input reception unit 34 (the designated music piece) (S510).
Next, in the speech synthesis process, the control unit 50 acquires uttered speech data representing the waveform of the speech input via the microphone 62 connected to the microphone input unit 44 (S520). The control unit 50 then analyzes the voice quality of the uttered speech data acquired in S520 and calculates the input voice quality Yk (S530). Like the classification key Xq described above, the input voice quality Yk is a voice quality feature amount in which the mel-frequency cepstral coefficients (MFCC) for each vowel are reduced to representative values. The method of calculating the input voice quality Yk is the same as S150 to S170 of the sound source data generation process, except that "note vocal Vo" is read as "uttered speech data", so a detailed description is omitted here.

Further, in the speech synthesis process, the control unit 50 identifies a similar group, that is, a group whose voice quality approximates the voice of the user of the karaoke apparatus 30, based on the input voice quality Yk calculated in S530 (S540). In S540, the control unit 50 calculates the approximate value k_opt according to equation (1) below and identifies the group for which the approximate value k_opt is largest as the similar group.

In equation (1), the symbol "k" is an identifier that identifies a group. The notation arg min over k in equation (1) refers to the minimum of a function of k, that is, the value of k at which that function takes its minimum value.
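Equation (1) itself is not reproduced in this text, so the following is only a sketch of an arg-min search over groups. It assumes that the quantity being minimized is the squared Euclidean distance between the input voice quality Yk and each group's classification key Xq; that distance measure, like the function name, is an assumption.

```python
from typing import Dict, Hashable, Sequence


def find_similar_group(input_quality: Sequence[float],
                       group_keys: Dict[Hashable, Sequence[float]]) -> Hashable:
    """Return the group identifier k whose key Xq is nearest to the input voice quality Yk."""
    def distance(k: Hashable) -> float:
        # Assumed measure: squared Euclidean distance between Xq and Yk.
        return sum((x - y) ** 2 for x, y in zip(group_keys[k], input_quality))
    return min(group_keys, key=distance)
```

For example, with keys `{0: [0.0, 0.0], 1: [1.0, 1.2], 2: [5.0, 5.0]}` and an input of `[1.0, 1.0]`, the function selects group 1.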

Further, in the speech synthesis process, the control unit 50 acquires the MIDI music piece MD corresponding to the music ID acquired in S510 from the information processing server 10 (S550).
The control unit 50 then acquires the sound source data SD that is associated with the note properties p of the notes NO included in the MIDI music piece MD acquired in S550 and that corresponds to the similar group identified in S540 (S560). Based on the sound source data SD acquired in S560, the control unit 50 performs speech synthesis so that the singing sound of the music piece corresponding to the MIDI music piece MD acquired in S550 is output (S570). Specifically, from among the sound source data SD corresponding to the similar group identified in S540, the control unit 50 acquires the sound source data SD (speech parameters P) corresponding to the lyric syllables of the music piece corresponding to the MIDI music piece MD acquired in S550 (S560). The control unit 50 then generates synthesized speech by adjusting the acquired speech parameters P and performing formant synthesis so that the lyrics of the music piece corresponding to the MIDI music piece MD acquired in S550 are sung (S570).
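The lookup in S560 can be pictured as indexing the sound source data first by group and then by note property. The nested-dictionary layout and the function name below are assumptions for illustration; the embodiment does not specify a storage format, nor what happens when a note property has no stored entry (this sketch returns `None` in that case).

```python
from typing import Dict, Hashable, List, Optional, Sequence


def select_parameters(
        sound_source: Dict[Hashable, Dict[Hashable, List[float]]],
        similar_group: Hashable,
        note_properties: Sequence[Hashable]) -> List[Optional[List[float]]]:
    """For each note of the designated piece, fetch the representative speech parameter P.

    sound_source maps group id -> {note property p -> representative P}.
    """
    table = sound_source[similar_group]
    return [table.get(prop) for prop in note_properties]
```

The resulting sequence, one entry per note, is what a synthesizer stage such as the formant synthesis of S570 would consume.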

Further, the control unit 50 outputs the synthesized speech generated by the speech synthesis in S560 to the output unit 42 (S580). The output unit 42 emits the synthesized speech from the speaker 60.

Thereafter, the control unit 50 ends the speech synthesis process.
In other words, the speech synthesis process analyzes the input speech received via the microphone 62 and derives the input voice quality Yk, which is the voice quality feature amount of that input speech. The process then identifies, from among the sound source data SD stored in the storage unit 14 of the information processing server 10, the sound source data SD whose classification key Xq is most similar to the input voice quality Yk. Finally, the process generates by speech synthesis, and outputs, synthesized speech (that is, a virtual vocal sound) in which the designated music piece is sung in accordance with the identified sound source data SD.
[Effect of the embodiment]
As described above, the speech synthesis system 1 generates synthesized speech using speech parameters P (sound source data SD) whose voice quality resembles the user's own voice. In the speech synthesis system 1, such sound source data SD is generated from the speech waveform data WD of a plurality of people, not just from the speech waveform data WD of a single user.

That is, the speech synthesis system 1 uses, as the speech waveform data WD needed to generate the sound source data SD, not only speech waveform data WD uttered by the user of the karaoke apparatus 30 but also speech waveform data WD uttered by persons other than that user.

Therefore, the speech synthesis system 1 makes it easier to collect the speech waveform data WD needed for speech synthesis and reduces the difficulty of generating the speech parameters P. As a result, the speech synthesis system 1 reduces the difficulty of generating synthesized speech (and, by extension, a virtual vocal sound) and simplifies the generation of synthesized speech.
[Other Embodiments]
Although an embodiment of the present invention has been described above, the present invention is not limited to that embodiment and can be implemented in various forms without departing from the gist of the present invention.

For example, the speech waveform data WD in the above embodiment contained, as performance sounds, the accompaniment sound of at least one instrument being played and the singing sound of at least one person singing. The speech waveform data WD in the present invention may, however, contain only the singing sound.

In S270 of the sound source data generation process in the above embodiment, the representative value of the speech parameter P was the arithmetic mean of the speech parameters P associated with a common note property p within the same group. The representative value of the speech parameter P is not limited to the arithmetic mean, however. It may instead be the mode of the speech parameters P associated with a common note property p within the same group, or the median of those speech parameters P.
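The three alternatives named here (arithmetic mean, mode, median) can be swapped behind one function. The sketch below applies the chosen statistic separately to each dimension of the parameter vectors, which is an assumption; the embodiment does not say how a mode or median of a multi-dimensional speech parameter P would be taken.

```python
import statistics
from typing import List, Sequence


def representative(vectors: Sequence[Sequence[float]], method: str = "mean") -> List[float]:
    """Reduce speech parameters P sharing a note property to one representative vector."""
    reducers = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    reducer = reducers[method]
    # Apply the statistic dimension by dimension (assumed interpretation).
    return [reducer(col) for col in zip(*vectors)]
```

A mode is only meaningful when values repeat, so in practice it would probably be applied to quantized parameters.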

A form in which part of the configuration of the above embodiment is omitted is also an embodiment of the present invention. A form obtained by appropriately combining the above embodiment with a modification is also an embodiment of the present invention. Furthermore, every form conceivable without departing from the essence of the invention specified by the wording of the claims is also an embodiment of the present invention.

In addition to the speech synthesis system described above, the present invention can be realized in various forms, such as a program executed by a computer to perform speech synthesis and a speech synthesis method.
[Correspondence between Embodiment and Claims]
Finally, the relationship between the description of the embodiment and the description of the claims will be explained.

The function obtained by executing S120 of the sound source data generation process in the embodiment is an example of the voice data acquisition means recited in the claims, and the function obtained by executing S130 to S160 is an example of the analysis means. The function obtained by executing S170 to S200 is an example of the classification means, and the function obtained by executing S210 to S290 is an example of the storage control means.

The function obtained by executing S520 of the speech synthesis process is an example of the input receiving means recited in the claims, and the function obtained by executing S530 is an example of the voice quality analysis means. The function obtained by executing S540 is an example of the search means, and the function obtained by executing S570 and S580 is an example of the synthesis means. The function obtained by executing S550 is an example of the music data acquisition means.

DESCRIPTION OF SYMBOLS: 1: speech synthesis system; 2: information processing device; 3: input reception unit; 4: external output unit; 5, 14, 38: storage unit; 6, 16, 50: control unit; 7, 18, 52: ROM; 8, 20, 54: RAM; 9, 22, 56: CPU; 10: information processing server; 12, 32: communication unit; 30: karaoke apparatus; 34: input reception unit; 36: music playback unit; 40: audio control unit; 42: output unit; 44: microphone input unit; 46: video control unit; 60: speaker; 62: microphone; 64: display unit

Claims

A speech synthesis system comprising:
voice data acquisition means for acquiring at least two pieces of voice data, each representing a waveform of speech in which lyrics assigned to at least some of a plurality of notes, each note being a combination of a pitch and a note value, are uttered, the pieces of voice data differing from one another in at least one of the person who uttered the speech and the lyrics;
analysis means for deriving a voice quality feature amount representing the voice quality of the voice data acquired by the voice data acquisition means;
classification means for classifying the voice data into at least two groups based on the distribution of the voice quality feature amounts derived by the analysis means;
storage control means for deriving, for each group of voice data classified by the classification means and from the voice data included in that group, a representative value of a speech feature amount representing a feature amount of the speech for each syllable of the lyrics and each note type, and for storing in a storage device sound source data in which the derived representative value of the speech feature amount is associated with the voice quality feature amount corresponding to that representative value;
input receiving means for receiving input of speech;
voice quality analysis means for analyzing the input speech received by the input receiving means and deriving an input voice quality that is a voice quality feature amount of the input speech;
search means for identifying, from among the sound source data stored in the storage device, the sound source data having the voice quality feature amount most similar to the input voice quality derived by the voice quality analysis means; and
synthesis means for outputting synthesized speech produced by speech synthesis in accordance with the sound source data identified by the search means.

The speech synthesis system according to claim 1, further comprising music data acquisition means for acquiring music data representing a designated music piece, the music data including musical score data in which a plurality of notes are arranged along a time axis and lyrics data representing lyrics assigned to at least some of the plurality of notes,
wherein the synthesis means outputs, based on the music data acquired by the music data acquisition means, synthesized speech in which the lyrics included in the music data are sung.

A speech synthesizer comprising:
input receiving means for receiving input of speech;
voice quality analysis means for analyzing the input speech received by the input receiving means and deriving an input voice quality that is a voice quality feature amount representing the voice quality of the input speech;
search means for identifying, from among sound source data stored in a storage device, the sound source data having the voice quality feature amount most similar to the input voice quality derived by the voice quality analysis means, the sound source data having been stored in the storage device by executing: a voice data acquisition step of acquiring at least two pieces of voice data, each representing a waveform of speech in which lyrics assigned to at least some of a plurality of notes, each note being a combination of a pitch and a note value, are uttered, the pieces of voice data differing from one another in at least one of the person who uttered the speech and the lyrics; an analysis step of deriving a voice quality feature amount representing the voice quality of the voice data acquired in the voice data acquisition step; a classification step of classifying the voice data into at least two groups based on the distribution of the voice quality feature amounts derived in the analysis step; and a storage control step of deriving, for each group of voice data classified in the classification step and from the voice data included in that group, a representative value of a speech feature amount representing a feature amount of the speech for each syllable of the lyrics and each note type, and storing in the storage device sound source data in which the derived representative value of the speech feature amount is associated with the voice quality feature amount corresponding to that representative value; and
synthesis means for outputting synthesized speech produced by speech synthesis in accordance with the sound source data identified by the search means.