JP6252420B2

JP6252420B2 - Speech synthesis apparatus and speech synthesis system

Info

Publication number: JP6252420B2
Application number: JP2014201116A
Authority: JP
Inventors: 成田　健
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2014-09-30
Filing date: 2014-09-30
Publication date: 2017-12-27
Anticipated expiration: 2034-09-30
Also published as: JP2016071187A

Description

The present invention relates to a technique for generating synthesized speech.

Conventionally, a portable terminal having a speech synthesis function that performs speech synthesis based on voice data prepared in advance is known (see Patent Document 1). The portable terminal described in Patent Document 1 performs speech synthesis so that, in accordance with a received voice signal, the output resembles the voice of a specific person.

Patent Document 1: JP 2010-166324 A

In a karaoke apparatus, there is a demand to generate by speech synthesis, and output, a model vocal that properly sings the vocal melody of a designated song. This model vocal preferably reproduces the characteristics of the user's way of singing in a voice similar in quality to that of the user of the karaoke apparatus.

Characteristics of a person's way of singing are specific to singing and therefore rarely appear in conversation. For this reason, even if the technique described in Patent Document 1, which synthesizes speech for ordinary conversation, is used to generate a model vocal, it is difficult to generate a model vocal that resembles the user's voice quality and also imitates the characteristics of the user's way of singing.

That is, the conventional technique has the problem that, for singing voices, it is difficult to perform speech synthesis that resembles the user's voice quality and imitates the characteristics of the user's way of singing.
Therefore, an object of the present invention is to provide a technique for synthesizing a singing voice that resembles the voice quality of a user and imitates the characteristics of the user's singing method.

The present invention, made to achieve the above object, relates to a speech synthesizer comprising voice data acquisition means, analysis means, search means, and synthesis means.
The voice data acquisition means acquires singing voice data obtained by singing a song in which lyrics are assigned to at least some of a plurality of notes, each note being a combination of a pitch and a note value. The analysis means derives an input voice quality, which represents a feature amount of the voice quality of the singing voice data acquired by the voice data acquisition means, and an input singing style, which represents the transition, within the section of each note constituting the song, of at least one of the amplitude and the fundamental frequency of the singing voice data (this transition is the singer's phrasing, hereinafter referred to as the singing style).

The search means then specifies singer identification information contained in voice-quality/singing-style data, that is, data stored in a first storage device that indicates voice quality information and singing-style information. Specifically, it specifies the singer identification information included in voice-quality/singing-style data whose voice quality feature amount and singing-style feature amount satisfy a specified condition, the condition including that their similarity to the input voice quality and the input singing style derived by the analysis means is equal to or greater than a predetermined reference value. The voice-quality/singing-style data stored in the first storage device associates three items: a voice quality feature amount, which is a feature amount of the voice quality in sound source voice data; a singing-style feature amount, which represents the transition of at least one of the amplitude and the fundamental frequency of the sound source voice data within each note-corresponding section, that is, the section of the sound source voice data corresponding to a note; and singer identification information identifying the person who produced the voice. The sound source voice data referred to here are data, each from a different speaker, representing the voice waveform of uttered lyrics assigned to at least some of a plurality of notes, each note being a combination of a pitch and a note value.
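The search condition described above can be sketched as follows. This is only an illustrative sketch: the description does not fix a similarity measure at this point, so cosine similarity is assumed, and the names `StyleRecord`, `cosine_similarity`, and `find_similar_singers` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class StyleRecord:
    """Hypothetical entry of voice-quality/singing-style data in the first storage device."""
    singer_id: str        # singer identification information
    voice_quality: list   # voice quality feature amount
    singing_style: list   # singing-style feature amount


def cosine_similarity(a, b):
    """Cosine similarity between two vectors (an assumed measure, not from the source)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_similar_singers(records, input_quality, input_style, reference_value):
    """Return singer IDs whose voice quality and singing style both reach the reference value."""
    hits = []
    for rec in records:
        quality_sim = cosine_similarity(rec.voice_quality, input_quality)
        style_sim = cosine_similarity(rec.singing_style, input_style)
        if quality_sim >= reference_value and style_sim >= reference_value:
            hits.append(rec.singer_id)
    return hits
```

Requiring both similarities to reach the same reference value is one possible reading of the condition; a combined score would be an equally plausible variant.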

The synthesis means acquires, from the sound source data stored in a second storage device, the sound source data that includes the specific identification information, that is, the singer identification information specified by the search means. In accordance with the sound source voice data included in the acquired sound source data and the singing voice data acquired by the voice data acquisition means, it generates by speech synthesis, and outputs, a singing voice that sings a designated song. The sound source data stored in the second storage device is data in which sound source voice data is associated with each piece of singer identification information.

The singing voice data is a voice singing only a part of a song, and on its own it does not provide a sufficient amount of sound source data for executing speech synthesis.

Therefore, the speech synthesizer of the present invention analyzes the singing voice data and identifies the voice quality and singing characteristics of the person who sang to produce that data (that is, the user). It then identifies sound source data that was generated from the voice of another person, different from the user, and that resembles the identified voice quality and singing characteristics of the user, and uses the identified sound source data as at least part of the sound source for the speech synthesis that generates the singing voice.

According to such a speech synthesizer, a singing voice that sings the designated song can be synthesized using the user's own voice together with the voice of another person whose characteristics resemble those of the user's voice.

As a result, according to the speech synthesizer of the present invention, it is possible to synthesize a singing voice that resembles the voice quality of the user and imitates the characteristics of the user's way of singing.
The present invention may be implemented as a speech synthesis system including speech data acquisition means, analysis means, search means, and synthesis means.

According to such a speech synthesis system, an effect similar to that of the speech synthesizer according to claim 1 can be obtained.
Furthermore, the search means may specify the specific identification information by treating the specified condition as satisfied for entries ranked from the highest similarity down to a predetermined specified number.
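The "highest similarity down to a specified number" condition amounts to a top-N selection. A minimal sketch, assuming similarities have already been computed per singer and using the hypothetical name `top_n_singers`:

```python
def top_n_singers(similarities, specified_number):
    """Return the singer IDs with the highest similarity, limited to the specified number.

    similarities: dict mapping singer ID -> similarity score (assumed precomputed).
    """
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return [singer_id for singer_id, _ in ranked[:specified_number]]
```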

According to such a speech synthesis system, a specified number of other persons whose voice quality and singing characteristics are highly similar to the user's can be identified.
In the present invention, among the notes that constitute the designated song and have lyrics assigned, the notes sung in the singing voice data acquired by the voice data acquisition means may be treated as singing notes, and the remaining notes may be treated as non-singing notes.

The synthesis means may then generate the singing voice for the lyrics assigned to the singing notes by performing speech synthesis based on the singing voice data acquired by the voice data acquisition means, and generate the singing voice for the lyrics assigned to the non-singing notes by performing speech synthesis based on the sound source voice data associated with the specific identification information specified by the search means.
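The per-note choice of sound source can be illustrated as follows. This is a sketch only; the function name `choose_sources` and the labels "user" and "similar_singer" are hypothetical and stand in for the singing voice data and the sound source voice data of the specified singer, respectively.

```python
def choose_sources(lyric_note_ids, sung_note_ids):
    """Map each lyric-bearing note of the designated song to the source used to synthesize it.

    Notes covered by the user's singing voice data (singing notes) use the user's voice;
    all other lyric-bearing notes (non-singing notes) use the similar singer's data.
    """
    sung = set(sung_note_ids)
    return {
        note_id: "user" if note_id in sung else "similar_singer"
        for note_id in lyric_note_ids
    }
```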

According to such a speech synthesis system, the sound source needed for speech synthesis can be specified for each note that constitutes the designated song and has lyrics assigned, and speech can be synthesized using the sound source specified for each note. As a result, the speech synthesis system of the present invention can generate by speech synthesis a singing voice that resembles the voice quality of the user and more accurately imitates the characteristics of the user's way of singing.

Note that the speech synthesis system of the present invention may include acquisition means, extraction means, specifying means, first derivation means, second derivation means, generation means, and storage control means.
The acquisition means acquires music data that includes at least the voice waveform of the performance sound of a song containing a vocal sound and, as singer identification information, identification information representing the person who uttered the vocal sound. The extraction means extracts the vocal sound contained in the music data acquired by the acquisition means as sound source voice data.

Further, the specifying means specifies note vocals, each of which is the section of the sound source voice data extracted by the extraction means that corresponds to one note-corresponding section. The first derivation means derives, as the singing-style feature amount, the transition within the note-corresponding section of at least one of the amplitude and the fundamental frequency of each note vocal specified by the specifying means. The second derivation means derives a voice quality feature amount for each note vocal specified by the specifying means, and derives a representative value of those feature amounts as the voice quality feature amount.

The generation means generates voice-quality/singing-style data by associating the singing-style feature amount derived by the first derivation means, the voice quality feature amount derived by the second derivation means, and the singer identification information. The storage control means stores the voice-quality/singing-style data generated by the generation means in the first storage device.
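The association performed by the generation means can be sketched as building one record per singer. The text only says that a "representative value" of the per-note voice quality feature amounts is used; taking the per-dimension mean is an assumption here, and `build_style_record` and its dictionary keys are hypothetical.

```python
from statistics import fmean


def build_style_record(singer_id, per_note_quality, per_note_style):
    """Associate singer ID, representative voice quality, and per-note singing-style features.

    per_note_quality: list of equal-length feature vectors, one per note vocal.
    per_note_style:   list of singing-style feature vectors, one per note vocal.
    """
    # Representative value: mean of each feature dimension over all note vocals (assumed).
    representative = [fmean(dimension) for dimension in zip(*per_note_quality)]
    return {
        "singer_id": singer_id,
        "voice_quality": representative,
        "singing_style": list(per_note_style),
    }
```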

According to such a speech synthesis system, voice-quality/singing-style data can be generated and stored in the first storage device.

FIG. 1 is a block diagram showing a schematic configuration of a karaoke system as a speech synthesis system to which the present invention is applied.
FIG. 2 is a flowchart showing the procedure of the sound source data generation process.
FIG. 3 is an explanatory diagram outlining the singing-style feature amount, in which (A) outlines the singing amplitude vector and (B) outlines the singing pitch vector.
FIG. 4 is a diagram outlining the voice-quality/singing-style data.
FIG. 5 is a flowchart showing the procedure of the speech synthesis process.

Embodiments of the present invention will be described below with reference to the drawings.
<Speech synthesis system>
A speech synthesis system 1 shown in FIG. 1 is a system that generates and outputs a synthesized speech singing a song designated by a user (hereinafter referred to as a designated song) with a voice similar to the user.

To realize this, the speech synthesis system 1 includes an information processing device 2, an information processing server 10, and a karaoke apparatus 30.
The information processing device 2 generates the sound source data SD needed to generate synthesized speech (that is, for speech synthesis), based on audio waveform data WD containing a voice uttered by a person and MIDI music MD representing the uttered content.

The information processing server 10 stores at least the sound source data SD generated by the information processing device 2 and the MIDI music MD.
The karaoke apparatus 30 plays the MIDI music MD stored in the information processing server 10, and generates and outputs a synthesized voice in which the music corresponding to the MIDI music MD is sung according to the sound source data SD. Note that the speech synthesis system 1 includes a plurality of karaoke apparatuses 30.
<Audio waveform data>
The audio waveform data WD is audio data representing a performance sound of playing a music, and is associated with music management information in which information related to the music is described. The music management information includes music identification information (hereinafter referred to as music ID) for identifying music.

The audio waveform data WD of the present embodiment contains, as performance sounds, an accompaniment sound produced by playing at least one musical instrument and at least a singing sound sung by a person. Each item of audio waveform data WD differs from the others in the person singing or in the song (lyrics).

The audio waveform data WD may consist of an audio file in an uncompressed audio file format or of an audio file in a compressed audio format. It may be generated by recording the voice of a user singing a song, or by some other method.

The audio waveform data WD in the present embodiment is an example of the sound source voice data recited in the claims.
<MIDI music>
The MIDI music MD is prepared in advance for each music and has music data and lyrics data.

Of these, the music data represents the score of one song in accordance with the well-known MIDI (Musical Instrument Digital Interface) standard. The music data has at least a music ID and score tracks, each representing the score for one instrument used in the song.

Each score track defines, for each performance sound output from a MIDI sound source, at least a pitch (the so-called note number) and the period during which the MIDI sound source outputs the performance sound (hereinafter referred to as the note length). The note length in a score track is defined by a performance start timing (the so-called note-on timing), which is the time from the start of the song's performance until output of the performance sound begins, and a performance end timing (the so-called note-off timing), which is the time from the start of the song's performance until output of the performance sound ends.

That is, in a score track, one note NO is defined by a note number and a note length given by the note-on timing and the note-off timing. A score track functions as a single score because its notes NO are arranged in performance order. Score tracks are prepared for individual instruments such as keyboard, string, percussion, and wind instruments. In the present embodiment, a specific instrument (for example, the vibraphone) is designated as the instrument that carries the vocal melody of the song.
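The way a note NO is defined by a note number plus note-on and note-off timings can be modeled with a small data structure. This is an illustrative sketch, not the MIDI file format itself; `Note` and `build_track` are hypothetical names, and timings are assumed to be seconds from the start of the performance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    """One note NO of a score track."""
    note_number: int   # pitch (MIDI note number)
    note_on: float     # performance start timing, seconds from the start of the song
    note_off: float    # performance end timing, seconds from the start of the song

    @property
    def length(self):
        """Note length: the period during which the sound is output."""
        return self.note_off - self.note_on


def build_track(notes):
    """Arrange notes in performance order so the list functions as one score."""
    return sorted(notes, key=lambda note: note.note_on)
```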

The lyrics data, on the other hand, is data relating to the lyrics of the song and includes lyrics telop data and lyrics output data. The lyrics telop data represents the characters constituting the lyrics of the song (hereinafter referred to as lyric constituent characters). The lyrics output data defines a timing correspondence that associates the lyrics output timing, that is, the timing at which each lyric constituent character is output, with the performance of the music data.

Specifically, in the timing correspondence of the present embodiment, the timing at which output of the lyrics telop data starts is associated with the timing at which performance of the music data starts. In addition, the lyrics output timing of each lyric constituent character along the time axis of the song is defined as the elapsed time from the start of performance of the music data. In this way, each performance sound defined in the score track (that is, each note NO) is associated with the corresponding lyric constituent character.
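Because both the notes and the lyric constituent characters are timed from the start of the performance, they can be matched on a common time axis. The sketch below assumes one simple matching rule, namely that a character belongs to the note whose note-on/note-off interval contains its output timing; the rule and the name `align_lyrics` are assumptions for illustration.

```python
def align_lyrics(notes, lyric_events):
    """Associate lyric characters with notes by elapsed time from the start of performance.

    notes:        list of (note_on, note_off) tuples in performance order.
    lyric_events: list of (character, output_time) tuples.
    Returns a dict mapping note index -> list of characters output during that note.
    """
    aligned = {}
    for character, output_time in lyric_events:
        for index, (note_on, note_off) in enumerate(notes):
            if note_on <= output_time < note_off:
                aligned.setdefault(index, []).append(character)
                break
    return aligned
```

Notes that receive no character under this rule would correspond to notes without assigned lyrics.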
<Information processing device>
The information processing apparatus 2 is a known information processing apparatus (for example, a personal computer) including an input receiving unit 3, an external output unit 4, a storage unit 5, and a control unit 6.

The input receiving unit 3 is an input device that accepts information and commands from outside. Examples of such input devices include keys and switches, a reading drive that reads data stored on a portable storage medium (for example, a CD, DVD, or flash memory), and a communication port that acquires information via a communication network. The external output unit 4 is an output device that outputs information to the outside. Examples of such output devices include a writing drive that writes data to a portable storage medium and a communication port that outputs information to a communication network.

The storage unit 5 is a well-known storage device whose stored contents can be read and written. The storage unit 5 stores at least two items of audio waveform data WD, each associated with the MIDI music MD representing the content uttered in that audio waveform data WD. The symbol "l" in FIG. 1 is an identifier for the audio waveform data WD and is assigned per user and per song sung by that user; it is a natural number of 2 or more. The symbol "o" in FIG. 1 is an identifier for the MIDI music MD and is assigned per song; it is also a natural number of 2 or more.

The control unit 6 is a well-known control device built around a well-known microcomputer including a ROM 7, a RAM 8, and a CPU 9. The ROM 7 stores processing programs and data whose contents must be retained even when the power is turned off. The RAM 8 temporarily stores processing programs and data. The CPU 9 executes each process according to the processing programs stored in the ROM 7 and the RAM 8.

The ROM 7 of the present embodiment stores a processing program that causes the control unit 6 to execute the sound source data generation process, which generates sound source data SD based on the audio waveform data WD and the MIDI music MD stored in the storage unit 5.
<Information processing server>
The information processing server 10 includes a communication unit 12, a storage unit 14, and a control unit 16.

Of these, the communication unit 12 is the unit through which the information processing server 10 communicates with external devices via a communication network. That is, the information processing server 10 is connected to the karaoke apparatus 30 via the communication network. The communication network may be wired or wireless.

The storage unit 14 is a well-known storage device whose stored contents can be read and written. It stores at least a plurality of items of MIDI music MD. The symbol "n" shown in FIG. 1 is an identifier for the MIDI music MD stored in the storage unit 14 of the information processing server 10 and is assigned per song; it is a natural number of 1 or more. The storage unit 14 also stores the sound source data SD generated when the information processing device 2 executes the sound source data generation process. The symbol "m" shown in FIG. 1 is an identifier for the sound source data SD stored in the storage unit 14 and is assigned per group, as described in detail later; it is a natural number of 2 or more.

The control unit 16 is a well-known control device built around a well-known microcomputer including a ROM 18, a RAM 20, and a CPU 22. The ROM 18 stores processing programs and data whose contents must be retained even when the power is turned off. The RAM 20 temporarily stores processing programs and data. The CPU 22 executes each process according to the processing programs stored in the ROM 18 and the RAM 20.
<Karaoke equipment>
The karaoke apparatus 30 includes a communication unit 32, an input reception unit 34, a music playback unit 36, a storage unit 38, an audio control unit 40, a video control unit 46, and a control unit 50.

The communication unit 32 is the unit through which the karaoke apparatus 30 communicates with external devices via a communication network. The input receiving unit 34 is an input device that accepts information and commands in response to external operations. Examples of such input devices include keys, switches, and the receiver for a remote control.

The music playback unit 36 performs a song based on the MIDI music MD downloaded from the information processing server 10. The music playback unit 36 is, for example, a MIDI sound source. The audio control unit 40 is a device that controls audio input and output, and includes an output unit 42 and a microphone input unit 44.

A microphone 62 is connected to the microphone input unit 44, which thereby acquires the voice input through the microphone 62. A speaker 60 is connected to the output unit 42. The output unit 42 outputs to the speaker 60 the sound source signal of the song played by the music playback unit 36 and the sound source signal of the singing sound from the microphone input unit 44. The speaker 60 converts the sound source signal output from the output unit 42 into sound and outputs it.

The video control unit 46 outputs video or images based on the video data sent from the control unit 50. A display unit 64 that displays video or images is connected to the video control unit 46.

The control unit 50 is built around a well-known computer having at least a ROM 52, a RAM 54, and a CPU 56. The ROM 52 stores processing programs and data whose contents must be retained even when the power is turned off. The RAM 54 temporarily stores processing programs and data. The CPU 56 executes each process according to the processing programs stored in the ROM 52 and the RAM 54.

The ROM 52 of the present embodiment stores a processing program that causes the control unit 50 to execute the speech synthesis process. The speech synthesis process generates and outputs a synthesized voice singing a song designated by the user, using the user's own voice and a voice whose quality is similar to the user's.
<Sound source data generation processing>
A sound source data generation process executed by the control unit 6 of the information processing apparatus 2 will be described.

As shown in FIG. 2, when the sound source data generation process is started, the control unit 6 acquires the MIDI music MD that contains the music ID designated via the input receiving unit 3 (S110). The control unit 6 then acquires, from all the audio waveform data WD stored in the storage unit 5, one item of audio waveform data WD associated with the music ID acquired in S110 (S120).

In the sound source data generation process, the control unit 6 suppresses the accompaniment sound contained in the audio waveform data WD acquired in S120 (S130). Any well-known method may be used to suppress the accompaniment sound. The method may be one that emphasizes the singing sound contained in the audio waveform data WD, or one that removes from the audio waveform data WD the instrument performance sounds represented by the MIDI music MD.

Further, in the sound source data generation process, the control unit 6 specifies the note vocals Vo(a, i) based on the audio waveform data WD whose accompaniment sound was suppressed in S130 and the MIDI music MD acquired in S110 (S140). A note vocal Vo(a, i) is the section of the audio waveform data WD corresponding to a note NO(a, i) that forms part of the vocal melody and has lyrics assigned. In S140, the control unit 6 specifies the note vocals Vo(a, i) by matching the performance start timing nnt(a, i) and the performance end timing nft(a, i) in the MIDI music MD against the audio waveform data WD acquired in S120.
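Step S140 can be pictured as slicing the waveform at the note-on and note-off timings. The following is a minimal sketch that assumes the waveform is a list of samples at a known sampling rate and that timings are given in seconds; `extract_note_vocals` and its parameters are hypothetical names.

```python
def extract_note_vocals(samples, sample_rate, note_times):
    """Cut the waveform into note vocals Vo(a, i) using nnt(a, i) and nft(a, i).

    samples:     sequence of waveform samples (accompaniment already suppressed).
    sample_rate: samples per second (assumed known).
    note_times:  list of (nnt, nft) tuples in seconds from the start of the song.
    """
    note_vocals = []
    for nnt, nft in note_times:
        start = int(round(nnt * sample_rate))
        end = int(round(nft * sample_rate))
        note_vocals.append(samples[start:end])
    return note_vocals
```

A real recording would likely need some tolerance for timing drift between the performance and the MIDI data, which this sketch does not attempt.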

In the present embodiment, the symbol "a" identifies a music piece, and the symbol "i" identifies a note NO of the singing melody in that music piece.
Further, in the sound source data generation process, the control unit 6 sets a plurality of analysis windows in each note vocal Vo(a, i) (S150). In S150, the control unit 6 sets the analysis windows so that they are adjacent to one another along the time axis. Each analysis window is a section whose time length is shorter than that of the note NO(a, i).
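As an illustration of S150, adjacent analysis windows can be produced by cutting a note vocal into consecutive, non-overlapping chunks of equal length. This is only a sketch: the function name `split_into_windows` is hypothetical, the fixed window length is an assumption, and dropping a trailing partial window is a simplification that the patent does not mention.

```python
from typing import List, Sequence


def split_into_windows(note_vocal: Sequence[float], window_len: int) -> List[List[float]]:
    """Split a note vocal into adjacent, non-overlapping analysis windows."""
    if window_len <= 0:
        raise ValueError("window_len must be positive")
    # Step by window_len so that each window starts where the previous one ends.
    # A trailing remainder shorter than window_len is dropped (assumption).
    return [list(note_vocal[start:start + window_len])
            for start in range(0, len(note_vocal) - window_len + 1, window_len)]
```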

Next, in the sound source data generation process, the control unit 6 calculates the singing amplitude vector A(a, i), which represents how the amplitude of the note vocal Vo(a, i) changes within the section corresponding to the note NO(a, i) (S160). In S160, the control unit 6 first calculates, as shown in FIG. 3(A), the amplitude value of the note vocal Vo(a, i) in each of the analysis windows set in S150. The control unit 6 then arranges the amplitude values calculated for the individual analysis windows along the time axis to form an array of amplitude values, and takes that array as the singing amplitude vector A(a, i). The amplitude for each analysis window calculated in S160 may be, for example, the arithmetic mean of the discrete values of the note vocal Vo(a, i) within that analysis window.
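The per-window amplitude calculation of S160 might look like the sketch below. The function name `singing_amplitude_vector` is hypothetical. The text speaks of the arithmetic mean of the discrete values in each window; this sketch averages the absolute values instead, which is an assumption made here because the plain mean of a zero-centered waveform would be close to zero.

```python
from typing import List, Sequence


def singing_amplitude_vector(note_vocal: Sequence[float], window_len: int) -> List[float]:
    """Return one amplitude value per adjacent analysis window, in time order."""
    vector = []
    for start in range(0, len(note_vocal) - window_len + 1, window_len):
        window = note_vocal[start:start + window_len]
        # Mean absolute sample value of the window (assumed amplitude measure).
        vector.append(sum(abs(x) for x in window) / window_len)
    return vector
```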

Further, in the sound source data generation process, the control unit 6 calculates the singing pitch vector F(a, i), which represents how the fundamental frequency of the note vocal Vo(a, i) changes within the section corresponding to the note NO(a, i) (S170). In S170, the control unit 6 first calculates, as shown in FIG. 3(B), the fundamental frequency f0 of the note vocal Vo(a, i) in each of the analysis windows set in S150. The control unit 6 then arranges the fundamental frequencies f0 calculated for the individual analysis windows along the time axis to form an array of fundamental frequencies f0, and takes that array as the singing pitch vector F(a, i). Various well-known techniques may be used to calculate the fundamental frequency f0 in the present embodiment. As one example, in S170 the control unit 6 may perform frequency analysis (for example, a DFT) on each analysis window set in the note vocal Vo(a, i) and take, as the fundamental frequency f0, the frequency component found to be strongest as a result of autocorrelation.
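One common way to estimate f0 per window is to look for the lag that maximizes the autocorrelation. The sketch below does this directly in the time domain, which is a simplification of the DFT-plus-autocorrelation example given in the text. The function name `estimate_f0` and the search range of 80 Hz to 500 Hz are assumptions, not values from the patent.

```python
import math
from typing import Sequence


def estimate_f0(window: Sequence[float], sample_rate: int,
                f_min: float = 80.0, f_max: float = 500.0) -> float:
    """Estimate the fundamental frequency of one analysis window in Hz.

    Returns 0.0 when no lag in the search range has positive autocorrelation.
    """
    min_lag = max(1, int(sample_rate / f_max))
    max_lag = min(len(window) - 1, int(sample_rate / f_min))
    best_lag, best_value = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the window at this lag.
        value = sum(window[n] * window[n + lag] for n in range(len(window) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag if best_lag else 0.0
```

Applying this function to every analysis window of a note vocal and collecting the results in time order would give an array corresponding to the singing pitch vector F(a, i).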

At least one of the singing amplitude vector A calculated in S160 and the singing pitch vector F calculated in S170 is an example of the singing-style feature amount recited in the claims.

Next, in the sound source data generation process, the control unit 6 calculates the voice quality feature amount M(a, i) of each note vocal Vo(a, i) (S180). The voice quality feature amount M is a feature amount representing the voice quality of the person who uttered the sound represented by the audio waveform data WD acquired in S120. In S180, the control unit 6 first performs frequency analysis (for example, a DFT) on each analysis window of the note vocal Vo(a, i) set in S150. The control unit 6 then performs cepstrum analysis on the result of the frequency analysis (the frequency spectrum) and calculates the mel-frequency cepstral coefficients (MFCC) of each analysis window as the voice quality feature amount M(a, i).
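For orientation, the DFT followed by cepstrum analysis in S180 can be sketched as a textbook MFCC computation: power spectrum, triangular mel filter bank, logarithm, then a DCT-II. Everything below is an illustrative assumption rather than the patent's implementation: the function names, the 20 filters, the 12 coefficients, the omission of pre-emphasis and tapering, and the naive O(N^2) DFT used to keep the example dependency-free.

```python
import math
from typing import List, Sequence


def hz_to_mel(hz: float) -> float:
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(window: Sequence[float]) -> List[float]:
    """Naive DFT power spectrum for bins 0..N/2."""
    n = len(window)
    spectrum = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(window))
        im = -sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(window))
        spectrum.append((re * re + im * im) / n)
    return spectrum


def mfcc(window: Sequence[float], sample_rate: int,
         n_filters: int = 20, n_coeffs: int = 12) -> List[float]:
    """Return n_coeffs mel-frequency cepstral coefficients (c1..cN) for one window."""
    spec = power_spectrum(window)
    bin_hz = (sample_rate / 2) / (len(spec) - 1)
    # Filter edge frequencies, equally spaced on the mel scale from 0 to Nyquist.
    mel_max = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(mel_max * i / (n_filters + 1)) for i in range(n_filters + 2)]
    log_energies = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        energy = 0.0
        for k, power in enumerate(spec):
            f = k * bin_hz
            if lo < f <= mid:        # rising slope of the triangular filter
                energy += power * (f - lo) / (mid - lo)
            elif mid < f < hi:       # falling slope of the triangular filter
                energy += power * (hi - f) / (hi - mid)
        log_energies.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    # DCT-II of the log filter-bank energies, skipping the 0th coefficient.
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / n_filters)
                for m, e in enumerate(log_energies))
            for c in range(1, n_coeffs + 1)]
```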

In the sound source data generation process, the control unit 6 also identifies the note property p(a, i) of the note NO(a, i) corresponding to the note vocal Vo(a, i) acquired in S120 (S190). Specifically, in S190 of the present embodiment, the control unit 6 extracts from the MIDI music piece MD the information on each note NO(a, i) defined in that MIDI music piece MD and identifies it as the note property p(a, i).

The note property p(a, i) includes a target note attribute, a preceding note attribute, and a following note attribute. The target note attribute is information representing the attributes of the note NO(a, i). It includes the pitch (scale degree), the note length, and the lyric syllable of the note NO(a, i). The preceding note attribute is information representing the attributes of the note NO(a, i-1) that immediately precedes the note NO(a, i) along the time axis (hereinafter, the preceding note). It includes the pitch (scale degree), the note length, and the lyric syllable of the preceding note NO(a, i-1), as well as the time length between the preceding note NO(a, i-1) and the note NO(a, i).

The following note attribute is information representing the attributes of the note NO(a, i+1) that immediately follows the target note NO(a, i) along the time axis (hereinafter, the following note). It includes the pitch (scale degree), the note length, and the lyric syllable of the following note, as well as the time length between the note NO(a, i) and the following note NO(a, i+1). The note lengths and the time lengths between notes in the note property p(a, i) may be quantized into predetermined classes.
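The structure of the note property p(a, i) described above maps naturally onto a small immutable record. The sketch below is one possible representation; the class names, field names, the use of MIDI note numbers for pitch, and the 0.125-second quantization step are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NoteAttr:
    """Attributes of a single note (target, preceding, or following)."""
    pitch: int                       # e.g. a MIDI note number (assumed encoding)
    length_class: int                # quantized note length
    syllable: str                    # lyric syllable assigned to the note
    gap_class: Optional[int] = None  # quantized gap to the target note; None for the target


@dataclass(frozen=True)
class NoteProperty:
    """Note property p(a, i): target note plus its neighbours."""
    target: NoteAttr
    previous: Optional[NoteAttr]     # None for the first note of the melody
    following: Optional[NoteAttr]    # None for the last note of the melody


def quantize(duration_sec: float, step_sec: float = 0.125) -> int:
    """Map a duration in seconds to a class index (hypothetical step size)."""
    return int(round(duration_sec / step_sec))
```

Because the dataclasses are frozen, equal note properties compare equal and hash identically, so they can serve directly as dictionary keys when data is later grouped by note property.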

In the sound source data generation process, the control unit 6 generates provisional singing-style data T(a, i), which associates the singing amplitude vector A(a, i) calculated in S160, the singing pitch vector F(a, i) calculated in S170, the voice quality feature amount M(a, i) calculated in S180, and the note property p(a, i) with one another (S200).

Next, in the sound source data generation process, the control unit 6 determines whether the processing from S120 to S200 has been executed for all the audio waveform data WD associated with the MIDI music piece MD acquired in S110 (S210). If the determination in S210 shows that the processing has not been executed for all the audio waveform data WD (S210: NO), the control unit 6 returns the sound source data generation process to S120. In that S120, the control unit 6 acquires one piece of audio waveform data WD from among the audio waveform data WD that is associated with the MIDI music piece MD acquired in S110 and for which the processing from S120 to S200 has not yet been executed. The control unit 6 then executes the steps from S130 to S200.

On the other hand, if the determination in S210 shows that the processing has been executed for all the audio waveform data WD (S210: YES), the control unit 6 advances the sound source data generation process to S220. In S220, the control unit 6 calculates representative values of the singing amplitude vector A, the singing pitch vector F, and the voice quality feature amount M for each group sharing a common note property p.

Specifically, in S220 of the present embodiment, the control unit 6 acquires, from all the provisional singing-style data T, the provisional singing-style data T that share a common note property p. It then calculates a representative value for each of the singing amplitude vector A, the singing pitch vector F, and the voice quality feature amount M contained in the acquired provisional singing-style data T. The representative value may be the arithmetic mean, the median, or the mode.
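The grouping and averaging in S220 can be sketched as follows, using the arithmetic mean as the representative value. The function name `representative_vectors` and the input layout (pairs of a note-property key and a feature vector) are hypothetical. Truncating to the shortest vector in a group is an assumption for the case where vectors differ in length, which the text does not address.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Sequence, Tuple


def representative_vectors(
    provisional: Iterable[Tuple[Hashable, Sequence[float]]]
) -> Dict[Hashable, List[float]]:
    """Average feature vectors element-wise within each note-property group."""
    groups = defaultdict(list)
    for note_property, vector in provisional:
        groups[note_property].append(vector)
    result = {}
    for note_property, vectors in groups.items():
        # Use the shortest vector length in the group (assumption).
        length = min(len(v) for v in vectors)
        result[note_property] = [
            sum(v[i] for v in vectors) / len(vectors) for i in range(length)
        ]
    return result
```

The same function could be applied separately to the singing amplitude vectors A, the singing pitch vectors F, and the voice quality feature amounts M.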

Then, in the sound source data generation process, the control unit 6 generates voice quality and singing-style data V and stores it in the storage unit 5 (S230). As shown in FIG. 4, the voice quality and singing-style data V generated in S230 associates, for each note property p, that note property p, the representative value of the singing amplitude vector A calculated in S220, the representative value of the singing pitch vector F, the representative value of the voice quality feature amount M, and singer identification information that identifies the singer (hereinafter, the "singer ID").

Further, in the sound source data generation process, the control unit 6 generates sound source data SD (S240). In S240, the control unit 6 generates the sound source data SD by associating each note vocal Vo with the note property p corresponding to the sound represented by that note vocal Vo and with the singer ID.

Next, in the sound source data generation process, the control unit 6 determines whether all the MIDI music pieces MD stored in the storage unit 5 have been acquired (S250). If the determination in S250 shows that the steps from S110 to S240 have not been executed for all the MIDI music pieces MD (S250: NO), the control unit 6 returns the sound source data generation process to S110. In that S110, the control unit 6 acquires one MIDI music piece MD from among the MIDI music pieces MD for which the steps from S110 to S240 have not yet been executed. The sound source data generation process then repeats S120 to S240.

If the determination in S250 shows that the steps from S110 to S240 have been executed for all the MIDI music pieces MD (S250: YES), the control unit 6 ends the sound source data generation process and waits until a start command is input.

As described above, the sound source data generation process generates the voice quality and singing-style data V by associating, according to the result of analyzing audio waveform data WD prepared in advance, the voice quality feature amount M representing the voice quality of the audio waveform data, the singing-style feature amount (that is, the singing amplitude vector A and the singing pitch vector F) representing the transition, within the sections of the notes constituting the music piece, of at least one of the amplitude and the fundamental frequency of the audio waveform data, the note property p, and the singer ID. The sound source data generation process also generates the sound source data SD by associating each note vocal Vo with the note property p corresponding to that note vocal Vo and with the singer ID.

The voice quality and singing-style data V and the sound source data SD generated when the control unit 6 of the information processing device 2 executes the sound source data generation process may be stored in the storage unit 14 of the information processing server 10 by way of a portable storage medium. When the information processing device 2 and the information processing server 10 are connected via a communication network, the voice quality and singing-style data V and the sound source data SD stored in the storage unit 5 of the information processing device 2 may instead be transferred over the communication network and stored in the storage unit 14 of the information processing server 10.
<Speech synthesis processing>
Next, the speech synthesis process executed by the control unit 50 of the karaoke apparatus 30 will be described.

As shown in FIG. 5, when the speech synthesis process is started, the control unit 50 acquires the music ID corresponding to the music piece designated via the input receiving unit 34 (the designated music piece) (S510).
Next, in the speech synthesis process, the control unit 50 acquires singing voice data representing the waveform of the voice input via the microphone 62 connected to the microphone input unit 44 (S520). The singing voice data acquired in S520 is the voice of the user of the karaoke apparatus singing a partial section of the designated music piece.

The control unit 50 then analyzes the voice quality of the singing voice data acquired in S520 and calculates the input voice quality Yk, which represents the voice quality feature amount of the singing voice data (S530).
The input voice quality Yk is a voice quality feature amount representing the mel-frequency cepstral coefficients (MFCC) for each vowel. It is calculated in the same way as in S150 and S180 of the sound source data generation process, with "note vocal Vo" read as "singing voice data", so a detailed description is omitted here.

Next, in the speech synthesis process, the control unit 50 derives the input singing style, which represents the transition, within the sections of the notes constituting the music piece, of at least one of the amplitude and the fundamental frequency of the singing voice data (S540). The input singing style consists of the singing amplitude vector A and the singing pitch vector F of the singing voice data. It is calculated in the same way as in S150 to S170 of the sound source data generation process, with "note vocal Vo" read as "singing voice data", so a detailed description is omitted here.

Further, in the speech synthesis process, the control unit 50 identifies the singer IDs contained in the voice quality and singing-style data whose voice quality feature amount and singing-style feature amount satisfy a prescribed condition, the condition including that their similarity to the input voice quality Yk calculated in S530 and to the input singing style calculated in S540 is equal to or higher than a predetermined reference value (S550). In S550, the control unit 50 calculates the correlation coefficient between the input singing style and the singing-style feature amount as the singing-style similarity. It also calculates the correlation coefficient between the input voice quality Yk and the voice quality feature amount M as the voice quality similarity. The control unit 50 then identifies the singer IDs contained in the voice quality and singing-style data for which both the singing-style similarity and the voice quality similarity are equal to or higher than the reference value. In S550 of the present embodiment, the singer IDs are identified in descending order of similarity, for a predetermined number (an integer of 1 or more) of pieces of voice quality and singing-style data.
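The search in S550 can be sketched with a Pearson correlation coefficient as the similarity measure. The record layout (a dict with `singer_id`, `voice`, and `style` keys), the default threshold of 0.7, the default count of 3, and ranking by the sum of the two similarities are all assumptions introduced for this illustration. The text only states that both similarities must reach the reference value and that singers are taken in descending order of similarity.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 if undefined."""
    n = min(len(x), len(y))
    if n < 2:
        return 0.0
    x, y = x[:n], y[:n]
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def find_similar_singers(input_voice: Sequence[float], input_style: Sequence[float],
                         records: Sequence[Dict], threshold: float = 0.7,
                         max_count: int = 3) -> List[str]:
    """Return singer IDs whose two similarities both reach the threshold, best first."""
    scored = []
    for rec in records:
        voice_sim = pearson(input_voice, rec["voice"])
        style_sim = pearson(input_style, rec["style"])
        if voice_sim >= threshold and style_sim >= threshold:
            scored.append((voice_sim + style_sim, rec["singer_id"]))
    scored.sort(key=lambda item: item[0], reverse=True)
    ranked: List[str] = []
    for _, singer_id in scored:
        if singer_id not in ranked:  # keep each singer once, at the best rank
            ranked.append(singer_id)
    return ranked[:max_count]
```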

Further, in the speech synthesis process, the control unit 50 acquires from the information processing server 10 the MIDI music piece MD corresponding to the music ID acquired in S510 (S560). The control unit 50 then analyzes the MIDI music piece MD acquired in S560 (S570). In the analysis of the MIDI music piece MD in S570, the control unit 50 identifies, as the synthesis target information, the note properties p of the melody notes NO that constitute the singing melody of the designated music piece, following the order in which the melody notes NO are arranged.

Then, in the speech synthesis process, the control unit 50 generates and outputs a synthesized voice that sings the singing melody in accordance with the synthesis target information identified in S570 (S580).
Specifically, in S580 of the present embodiment, if a predetermined setting condition is satisfied, the control unit 50 acquires the sound source data SD that contains the singer ID identified in S550. In accordance with the note vocal Vo contained in the acquired sound source data SD, the control unit 50 then generates, by speech synthesis, a singing voice that sings the lyrics assigned to the note for which a synthesized voice is to be generated at the present time. The setting condition is that the note property p of the note for which a synthesized voice is to be generated at the present time does not match the note property p of any note uttered in the voice represented by the singing voice data acquired in S520.

When the setting condition is satisfied, the control unit 50 in S580 searches, in descending order of similarity, for sound source data SD that contains a singer ID identified in S550 and is associated with the note property p of the note for which a synthesized voice is to be generated at the present time. If this search finds no sound source data SD that contains the singer ID with the highest similarity and is assigned to the note for which a synthesized voice is to be generated at the present time, the control unit 50 searches for sound source data SD that contains the singer ID with the next highest similarity and is assigned to that note.

On the other hand, if the setting condition is not satisfied, the control unit 50 in S580 of the present embodiment generates, by speech synthesis in accordance with the singing voice data acquired in S520, a singing voice that sings the lyrics assigned to the note for which a synthesized voice is to be generated at the present time.

In other words, suppose that, among the notes that constitute the designated music piece and have lyrics assigned to them, the notes sung in the singing voice data acquired in S520 are called singing notes, and the remaining notes are called non-singing notes.

In this case, in S580 of the present embodiment, the control unit 50 generates the singing voice of the lyrics assigned to the singing notes by speech synthesis based on the singing voice data acquired in S520. The control unit 50 also generates the singing voice of the lyrics assigned to the non-singing notes by speech synthesis based on the sound source data SD that contains the singer ID acquired in S550.
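The per-note choice of sound source in S580, including the fallback to the next most similar singer, can be summarized in a short selection function. This is a sketch only: the function name `choose_source`, the representation of the sound source data SD as a set of (singer ID, note property) pairs, and the ("none", None) result for notes with no usable source are assumptions. The text does not say what happens when no singer has matching sound source data.

```python
from typing import Hashable, Optional, Sequence, Set, Tuple


def choose_source(note_property: Hashable,
                  sung_properties: Set[Hashable],
                  ranked_singer_ids: Sequence[str],
                  sound_sources: Set[Tuple[str, Hashable]]) -> Tuple[str, Optional[str]]:
    """Decide which voice to synthesize one note from.

    Returns ("user", None) for a singing note, ("singer", singer_id) for a
    non-singing note covered by a similar singer, or ("none", None) otherwise.
    """
    # Setting condition not satisfied: the user's own voice covers this note.
    if note_property in sung_properties:
        return ("user", None)
    # Setting condition satisfied: try singers from most to least similar.
    for singer_id in ranked_singer_ids:
        if (singer_id, note_property) in sound_sources:
            return ("singer", singer_id)
    return ("none", None)
```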

The speech synthesis in the present embodiment may be realized by so-called formant synthesis. That is, in S580 of the present embodiment, the control unit 50 may calculate the fundamental frequency (f0), the mel-frequency cepstral coefficients (MFCC), and the power for each syllable from the note vocals Vo contained in the sound source data SD and from the singing voice data, and use them for the speech synthesis (formant synthesis).

Next, the control unit 50 outputs the synthesized voice generated by the speech synthesis in S580 to the output unit 42 (S590). The output unit 42 emits the synthesized voice from the speaker 60.

Thereafter, the control unit 50 ends the speech synthesis process.
[Effects of the embodiment]
As described above, the speech synthesis process of the present embodiment analyzes the singing voice data and identifies the voice quality and singing characteristics of the person (that is, the user) who performed the singing from which the singing voice data was generated. It then identifies sound source data that was generated from the voice of another person, different from the user, and that is similar to the identified voice quality and singing characteristics of the user, and uses the identified sound source data as at least part of the sound source for the speech synthesis that generates the singing voice.

According to such a speech synthesis process, a singing voice that sings the designated music piece can be synthesized using both the user's own voice and the voice of another person whose characteristics are similar to those of the user's voice.

Furthermore, in the speech synthesis process of the present embodiment, if the singing voice data acquired in S520 covers the note to be synthesized, the singing voice of the lyrics assigned to that singing note is generated by speech synthesis based on the singing voice data. If the singing voice data acquired in S520 does not cover the note to be synthesized, the singing voice of the lyrics assigned to that non-singing note is generated by speech synthesis based on the sound source voice data associated with the singer ID acquired in S550.

According to such a speech synthesis process, the sound source needed for speech synthesis can be identified for each note that constitutes the designated music piece and has lyrics assigned to it, and speech can be synthesized using the sound source identified for each note. As a result, the speech synthesis process of the present invention can generate, by speech synthesis, a singing voice that resembles the user's voice quality and more accurately imitates the characteristics of the user's way of singing.
[Other Embodiments]
Although an embodiment of the present invention has been described above, the present invention is not limited to the above embodiment and can be implemented in various forms without departing from the gist of the invention.

For example, the audio waveform data WD in the above embodiment contains, as performance sound, an accompaniment sound produced by playing at least one musical instrument and a singing sound sung by at least one person, but the audio waveform data WD in the present invention may contain only the singing sound.

A form in which part of the configuration of the above embodiment is omitted is also an embodiment of the present invention. A form obtained by appropriately combining the above embodiment with a modification is also an embodiment of the present invention. Any form conceivable without departing from the essence of the invention specified by the wording of the claims is likewise an embodiment of the present invention.

In addition to the speech synthesis apparatus and speech synthesis system described above, the present invention can be realized in various forms, such as a program executed by a computer to output a singing voice by speech synthesis, or a speech synthesis method for outputting a singing voice by speech synthesis.
[Correspondence between Embodiment and Claims]
Finally, the relationship between the description of the above embodiment and the description of the claims will be explained.

The function obtained by executing S520 in the speech synthesis process of the above embodiment is an example of the voice data acquisition means recited in the claims, and the function obtained by executing S530 and S540 is an example of the analysis means recited in the claims. The function obtained by executing S550 in the speech synthesis process is an example of the search means recited in the claims, and the function obtained by executing S580 and S590 is an example of the synthesis means recited in the claims.

Furthermore, the function obtained by executing S120 in the sound source data generation process of the above embodiment is an example of the acquisition means recited in the claims, and the function obtained by executing S130 is an example of the extraction means recited in the claims. The function obtained by executing S140 in the sound source data generation process is an example of the specifying means recited in the claims, and the function obtained by executing S150 to S170 is an example of the first derivation means recited in the claims. The function obtained by executing S180 in the sound source data generation process is an example of the second derivation means recited in the claims, the function obtained by executing S200 and S220 is an example of the generation means recited in the claims, and the function obtained by executing S230 is an example of the storage control means recited in the claims.

Description of symbols: 1: speech synthesis system; 2: information processing device; 3: input receiving unit; 4: external output unit; 5, 14, 38: storage unit; 6, 50, 16: control unit; 7, 18, 52: ROM; 8, 20, 54: RAM; 9, 22, 56: CPU; 10: information processing server; 12, 32: communication unit; 30: karaoke apparatus; 34: input receiving unit; 36: music playback unit; 40: audio control unit; 42: output unit; 44: microphone input unit; 46: video control unit; 60: speaker; 62: microphone; 64: display unit

Claims

Voice data acquisition means for acquiring singing voice data obtained by singing a music piece in which lyrics are assigned to at least some of a plurality of notes, each note being a combination of a pitch and a note value;
The input voice quality representing the voice quality feature quantity of the singing voice data acquired by the voice data acquisition means, and the transition in the interval of the notes constituting the music of at least one of the amplitude and the fundamental frequency of the singing voice data. An analysis means for deriving an input song time to represent,
The sound source sound data represents a sound waveform obtained by uttering lyrics assigned to at least a part of a plurality of notes composed of a combination of pitch and sound value, and the person who uttered is derived for each different sound source sound data. Voice quality feature quantity that is a voice quality feature quantity and a song feature quantity representing a transition in a note-corresponding section corresponding to a note in the sound source voice data of at least one of the amplitude and fundamental frequency of the sound source voice data And voice quality song data associated with singer identification information for identifying the person who uttered the voice quality song data stored in the first storage device, and the input voice quality derived by the analysis means and Singers included in the voice quality feature quantity satisfying a prescribed condition including that the similarity to the input song time is equal to or higher than a predetermined reference value and the voice quality song time data including the song feature quantity Search means for identifying the different information,
Specific identification information that is the singer identification information specified by the search means from among the sound source data in which the sound source audio data is associated with each singer identification information and stored in the second storage device And synthesizes the singing voice singing the designated music as the designated music according to the sound source data included in the obtained sound source data and the singing voice data obtained by the voice data obtaining means. And a synthesizing unit that generates and outputs the voice.

A speech synthesis system comprising:
voice data acquisition means for acquiring singing voice data obtained by singing a piece of music in which lyrics are assigned to at least some of a plurality of notes, each note being a combination of a pitch and a note value;
analysis means for deriving an input voice quality, which represents a voice quality feature quantity of the singing voice data acquired by the voice data acquisition means, and an input singing style, which represents a transition, within the section of each note constituting the music, of at least one of the amplitude and the fundamental frequency of the singing voice data;
search means for identifying, from among voice quality song data stored in a first storage device, the singer identification information included in voice quality song data that satisfies a prescribed condition, the prescribed condition including that the similarity of the voice quality feature quantity and the song feature quantity in that data to the input voice quality and the input singing style derived by the analysis means is equal to or higher than a predetermined reference value, wherein sound source voice data represents a voice waveform obtained by uttering the lyrics assigned to at least some of a plurality of notes, each being a combination of a pitch and a note value, and the voice quality song data associates with one another (i) a voice quality feature quantity, which is a feature quantity of voice quality derived for each piece of sound source voice data uttered by a different person, (ii) a song feature quantity representing a transition, within a note-corresponding section that corresponds to a note in the sound source voice data, of at least one of the amplitude and the fundamental frequency of the sound source voice data, and (iii) singer identification information identifying the person who uttered the voice; and
synthesis means for acquiring, from among sound source data that is stored in a second storage device and in which the sound source voice data is associated with each piece of singer identification information, the sound source data that includes specific identification information, the specific identification information being the singer identification information identified by the search means, and for generating by speech synthesis, and outputting, a singing voice that sings a designated piece of music in accordance with the sound source voice data included in the acquired sound source data and the singing voice data acquired by the voice data acquisition means.
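
The search recited above can be pictured with a minimal sketch. The claim does not fix a similarity measure, so the cosine similarity, the equal weighting of the two features, and every name used here (`VoiceQualitySongData`, `search_singers`, `reference_value`) are assumptions made purely for illustration, not the method of the embodiment.

```python
import math
from dataclasses import dataclass


@dataclass
class VoiceQualitySongData:
    """One hypothetical record held in the first storage device."""
    singer_id: str        # singer identification information
    voice_quality: list   # voice quality feature quantity
    song_feature: list    # song feature quantity


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_singers(input_voice_quality, input_song_feature, stored, reference_value):
    """Return (singer_id, similarity) for records at or above the reference value.

    ASSUMPTION: the overall similarity is the plain average of the voice
    quality similarity and the song feature similarity.
    """
    hits = []
    for entry in stored:
        similarity = 0.5 * (
            cosine_similarity(input_voice_quality, entry.voice_quality)
            + cosine_similarity(input_song_feature, entry.song_feature)
        )
        if similarity >= reference_value:
            hits.append((entry.singer_id, similarity))
    return hits
```

Each returned singer identifier plays the role of the specific identification information that the synthesis means would use to look up sound source data in the second storage device.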

The speech synthesis system according to claim 2, wherein the search means identifies the specific identification information using, as the prescribed condition to be satisfied, that the similarity ranks within a predetermined specified number counted from the highest similarity.
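
As a rough illustration of the ranking condition in this claim, the following sketch keeps only the singers whose similarity falls within the top `specified_number`. The function name, the dictionary input, and the sample values are hypothetical; how the similarities are computed is left open here.

```python
def top_n_singers(similarities, specified_number):
    """Return singer IDs ordered from highest similarity, limited to specified_number.

    similarities: dict mapping singer identification information to a similarity
    score (ASSUMPTION: higher means more similar).
    """
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    return [singer_id for singer_id, _ in ranked[:specified_number]]


# Hypothetical usage with made-up scores.
selected = top_n_singers({"A": 0.70, "B": 0.95, "C": 0.80}, 2)
```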

The speech synthesis system according to claim 3, wherein the synthesis means:
treats, among the notes that constitute the designated music and to which lyrics are assigned, the notes sung in the singing voice data acquired by the voice data acquisition means as singing notes, and the notes that constitute the designated music and to which lyrics are assigned, other than the singing notes, as non-singing notes; and
generates the singing voice of the lyrics assigned to the singing notes by speech synthesis based on the singing voice data acquired by the voice data acquisition means, and generates the singing voice of the lyrics assigned to the non-singing notes by speech synthesis based on the sound source voice data associated with the specific identification information identified by the search means.
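
The division into singing notes and non-singing notes can be sketched as below. This is only an illustrative outline: the `Note` type, the `plan_synthesis` function, and the two source labels are hypothetical, and the actual waveform synthesis is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    index: int       # position of the note in the designated music
    pitch: int       # pitch (ASSUMPTION: a MIDI-style note number)
    duration: float  # note value, in beats
    lyric: str       # assigned lyric; empty string if none is assigned


def plan_synthesis(notes, sung_indices):
    """Decide, for every note with a lyric, which voice data drives synthesis.

    sung_indices: indices of notes covered by the user's singing voice data.
    Notes without lyrics are skipped because the claim concerns only notes
    to which lyrics are assigned.
    """
    plan = []
    for note in notes:
        if not note.lyric:
            continue
        if note.index in sung_indices:
            source = "user_singing_voice"   # singing note
        else:
            source = "sound_source_voice"   # non-singing note
        plan.append((note.index, note.lyric, source))
    return plan
```

With such a plan, passages the user has already sung keep the user's own voice, while the remaining passages are filled in from the sound source voice data of the similar singer.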

The speech synthesis system according to any one of claims 2 to 4, further comprising:
acquisition means for acquiring music data that includes at least a voice waveform of the performance sound of a piece of music containing a vocal sound, and identification information representing the person who uttered the vocal sound as the singer identification information;
extraction means for extracting, as the sound source voice data, the vocal sound included in the music data acquired by the acquisition means;
specifying means for identifying note vocals, each note vocal being a section of the sound source voice data extracted by the extraction means that corresponds to one of the note-corresponding sections;
first derivation means for deriving, as the song feature quantity, a transition within the note-corresponding section of at least one of the amplitude and the fundamental frequency of each note vocal identified by the specifying means;
second derivation means for deriving a feature quantity of voice quality for each note vocal identified by the specifying means, and deriving a representative value of those feature quantities as the voice quality feature quantity;
generation means for generating the voice quality song data by associating the song feature quantity derived by the first derivation means, the voice quality feature quantity derived by the second derivation means, and the singer identification information with one another; and
storage control means for storing the voice quality song data generated by the generation means in the first storage device.
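
The derivation and generation steps of this claim can be outlined with a short sketch. It assumes that the fundamental frequency of each note vocal has already been estimated frame by frame, that the transition is expressed as frame-to-frame differences, and that the representative value is an element-wise mean; the claim itself does not prescribe any of these choices, and all names are hypothetical.

```python
from statistics import fmean


def song_feature(f0_frames):
    """Transition of the fundamental frequency within one note-corresponding section.

    ASSUMPTION: the transition is the list of frame-to-frame differences.
    """
    return [later - earlier for earlier, later in zip(f0_frames, f0_frames[1:])]


def representative_voice_quality(per_note_vectors):
    """Representative value over all note vocals (ASSUMPTION: element-wise mean)."""
    return [fmean(column) for column in zip(*per_note_vectors)]


def build_voice_quality_song_data(singer_id, note_vocals):
    """Associate the derived features with the singer identification information.

    note_vocals: list of dicts with keys "f0" (per-frame fundamental frequency)
    and "voice_quality" (per-note voice quality vector).
    """
    return {
        "singer_id": singer_id,
        "song_features": [song_feature(note["f0"]) for note in note_vocals],
        "voice_quality": representative_voice_quality(
            [note["voice_quality"] for note in note_vocals]
        ),
    }
```

The returned dictionary stands in for one record of voice quality song data that the storage control means would write to the first storage device.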