JP6150276B2 - Speech evaluation apparatus, speech evaluation method, and program

Info

Publication number: JP6150276B2
Application number: JP2012287839A
Authority: JP
Inventors: 徹弓場; 雄一野呂; 典昭阿瀬見; 彦治青木
Original assignee: Brother Industries Ltd; Mie University NUC
Current assignee: Brother Industries Ltd; Mie University NUC
Priority date: 2012-12-28
Filing date: 2012-12-28
Publication date: 2017-06-21
Anticipated expiration: 2032-12-28
Also published as: JP2014130227A

Description

The present invention relates to an utterance evaluation apparatus, an utterance evaluation method, and a program, and more particularly to an improvement for evaluating a vocal technique in which chest voice (表声) is mixed into falsetto (裏声).

Falsetto is a singing technique in vocal music and corresponds to uragoe (裏声) in Japanese music. In recent years, a growing number of popular music singers (artists) use falsetto to sing high passages, so mastering it is desirable for aspiring singers and for people who want to improve at karaoke. Accordingly, techniques for discriminating between chest voice and falsetto in a voice have been proposed (see, for example, Non-Patent Document 1).

Non-Patent Document 1: 小島俊 (Kojima) and four others, "Examination of acoustic features for discriminating falsetto and natural voice in singing" (「歌声における裏声と地声を識別するための音響特徴量の検討」), IEICE Technical Report, The Institute of Electronics, Information and Communication Engineers, October 27, 2012, 112(266), pp. 67-72.

There is, however, a way of singing in which, rather than simply producing falsetto, the singer mixes chest voice into the falsetto so that the listener does not perceive a seam with the passages sung in natural voice. Once a singer can freely produce both chest voice and falsetto, the next step is therefore to learn the singing method of mixing chest voice into falsetto. A technique that sharply discriminates falsetto from chest voice, such as the conventional technique above, classifies a voice produced with this mixed technique as either falsetto or chest voice. The conventional technique therefore could not evaluate, for example, whether a singer who switches between falsetto and chest voice while changing pitch performs the switch well.

The present invention has been made against this background, and its object is to provide an utterance evaluation apparatus, an utterance evaluation method, and a program that objectively evaluate a vocal technique in which chest voice is mixed into falsetto.

To achieve this object, the gist of the first invention is an utterance evaluation apparatus that evaluates utterance based on voice information input from a voice input device, comprising: an extraction unit that extracts a fundamental wave and a plurality of harmonics from the voice information input from the voice input device; and an evaluation unit that undergoes learning in which learning data, based on the fundamental wave and the plurality of harmonics extracted by the extraction unit from each of first voice information corresponding to chest voice and second voice information corresponding to falsetto, acquired in advance from at least one of the subject and a speaker other than the subject, is input to a learning model, and the learning model is modified so that a first fusion rate output from the learning model takes a predetermined value. The evaluation unit calculates the first fusion rate, as an index relating to switching between chest voice and falsetto in the voice information, by inputting evaluation data based on the fundamental wave and the plurality of harmonics extracted by the extraction unit into the trained learning model.

Thus, according to the first invention, the apparatus comprises the extraction unit, which extracts a fundamental wave and a plurality of harmonics from the voice information input from the voice input device, and the evaluation unit, which undergoes learning in which learning data based on the fundamental wave and the plurality of harmonics extracted from each of the first voice information (chest voice) and the second voice information (falsetto), acquired in advance from at least one of the subject and a speaker other than the subject, is input to the learning model and the learning model is modified so that the first fusion rate it outputs takes a predetermined value. The evaluation unit then calculates the first fusion rate, as an index relating to switching between chest voice and falsetto in the voice information, by inputting evaluation data based on the extracted fundamental wave and harmonics into the trained learning model. Consequently, users can easily grasp, through practical processing, the proportion in which chest voice and falsetto are mixed in the voice they produce. That is, an utterance evaluation apparatus that objectively evaluates a vocal technique in which chest voice is mixed into falsetto can be provided.
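The training step described above can be illustrated with a minimal sketch. It assumes a single-neuron logistic model as the simplest stand-in for the learning model, target fusion rates of 1.0 for chest voice (表声) and 0.0 for falsetto (裏声), and made-up feature values; none of these choices is stated in the source, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_fusion_model(features, targets, epochs=2000, lr=0.5):
    """Adjust weights so the model output approaches the preset target per sample."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(features, targets):
            y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = y - t  # cross-entropy gradient with respect to the pre-activation
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def fusion_rate(model, x):
    """Evaluate the trained model on one feature vector; result lies in (0, 1)."""
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical learning data: harmonic intensities relative to the fundamental.
chest = [[0.9, 0.7, 0.5], [0.8, 0.6, 0.4]]        # assumed target 1.0
falsetto = [[0.2, 0.1, 0.05], [0.3, 0.1, 0.1]]    # assumed target 0.0
model = train_fusion_model(chest + falsetto, [1.0, 1.0, 0.0, 0.0])
print(fusion_rate(model, [0.55, 0.4, 0.27]))      # an in-between voice
```

A model trained only on clearly separated examples still returns intermediate values for in-between inputs, which is what allows its output to be read as a mixing index rather than a binary label.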

The gist of the second invention, which depends on the first invention, is that the learning model is a neural network or a support vector machine.
The gist of the third invention, which depends on the first or second invention, is that the evaluation unit evaluates the switching between chest voice and falsetto in the voice information input from the voice input device based on the smoothness of the change in the first fusion rate, calculated for that voice information, in response to the change in the frequency of the fundamental wave extracted by the extraction unit. This makes it possible to suitably evaluate, for example, whether a singer who switches between falsetto and chest voice while changing the pitch of the singing voice performs the switch well.
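The source does not define how smoothness is measured at this point. As one possible reading, the sketch below scores a pitch sweep by the largest change in fusion rate between consecutive analysis frames; the metric, the function name, and the sample values are all assumptions made for illustration.

```python
def largest_fusion_jump(track):
    """track: list of (f0_hz, fusion_rate) pairs ordered by time.

    Returns the largest absolute change in fusion rate between
    consecutive frames; a smaller value means a smoother transition.
    """
    if len(track) < 2:
        raise ValueError("need at least two frames")
    rates = [rate for _, rate in track]
    return max(abs(b - a) for a, b in zip(rates, rates[1:]))


# Hypothetical ascending scales: an abrupt register switch and a blended one.
abrupt = [(220.0, 1.0), (247.0, 1.0), (262.0, 0.0), (294.0, 0.0)]
blended = [(220.0, 1.0), (247.0, 0.7), (262.0, 0.35), (294.0, 0.0)]

print(largest_fusion_jump(abrupt), largest_fusion_jump(blended))
```

A derivative-based or curve-fit measure would serve equally well; the point is only that an abrupt switch shows up as a large step in the fusion rate while a blended transition does not.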

The gist of the fourth invention, which depends on the third invention, is as follows. The apparatus comprises a second evaluation unit that, for a first singing method in which chest voice and falsetto are sung separately, evaluates how well chest voice and falsetto are sung apart in the voice information, based on a second fusion rate determined from the fundamental wave and the plurality of harmonics in the voice information input from the voice input device. The evaluation unit is trained with learning data that includes learning data based on the fundamental wave and the plurality of harmonics in each of the first voice information corresponding to chest voice and the second voice information corresponding to falsetto in the first singing method as evaluated by the second evaluation unit. For a second singing method in which the voice is varied continuously between chest voice and falsetto, the evaluation unit evaluates the switching between chest voice and falsetto in the voice information input from the voice input device based on the smoothness of the change in the first fusion rate, calculated for that voice information, in response to the change in the frequency of the fundamental wave extracted by the extraction unit. In this way, by using the user's own chest voice and falsetto as learning data for the model, the proportion in which chest voice and falsetto are mixed in the user's voice can be evaluated even more suitably.

The gist of the fifth invention, which depends on any one of the first to fourth inventions, is that the learning data are values obtained by dividing the intensities of the plurality of harmonics in each of the first voice information and the second voice information by the intensity of the fundamental wave, and the evaluation data are values obtained by dividing the intensities of the plurality of harmonics extracted by the extraction unit from the voice information input from the voice input device by the intensity of the fundamental wave. In this way, users can easily grasp, using a practical model, the proportion in which chest voice and falsetto are mixed in the voice they produce.
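The feature described here is a simple normalization, sketched below. The function name and the example intensities are hypothetical; only the division of each harmonic intensity by the fundamental intensity comes from the text.

```python
def harmonic_ratios(fundamental_intensity, harmonic_intensities):
    """Divide each harmonic intensity by the intensity of the fundamental wave."""
    if fundamental_intensity <= 0:
        raise ValueError("fundamental intensity must be positive")
    return [h / fundamental_intensity for h in harmonic_intensities]


# Hypothetical intensities for one analysis frame.
features = harmonic_ratios(2.0, [1.0, 0.5])
print(features)
```

Normalizing by the fundamental makes the features independent of overall loudness, so the same vector can be used for both the learning data and the evaluation data.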

To achieve the above object, the gist of the sixth invention is an utterance evaluation method for evaluating utterance based on voice information input from a voice input device, comprising: an extraction process of extracting a fundamental wave and a plurality of harmonics from the voice information input from the voice input device; an evaluation process in which learning is performed by inputting, to a learning model, learning data based on the fundamental wave and the plurality of harmonics extracted in the extraction process from each of first voice information corresponding to chest voice and second voice information corresponding to falsetto, acquired in advance from at least one of the subject and a speaker other than the subject, and modifying the learning model so that a first fusion rate output from the learning model takes a predetermined value; and a calculation process of calculating the first fusion rate, as an index relating to switching between chest voice and falsetto in the voice information, by inputting evaluation data based on the fundamental wave and the plurality of harmonics extracted in the extraction process into the trained learning model. In this way, users can easily grasp, through practical processing, the proportion in which chest voice and falsetto are mixed in the voice they produce. That is, an utterance evaluation method that objectively evaluates a vocal technique in which chest voice is mixed into falsetto can be provided.

To achieve the above object, the gist of the seventh invention is a program that causes a control device, provided in an utterance evaluation apparatus that evaluates utterance based on voice information input from a voice input device, to execute: an extraction step of extracting a fundamental wave and a plurality of harmonics from the voice information input from the voice input device; an evaluation step in which learning is performed by inputting, to a learning model, learning data based on the fundamental wave and the plurality of harmonics extracted in the extraction step from each of first voice information corresponding to chest voice and second voice information corresponding to falsetto, acquired in advance from at least one of the subject and a speaker other than the subject, and modifying the learning model so that a first fusion rate output from the learning model takes a predetermined value; and a calculation step of calculating the first fusion rate, as an index relating to switching between chest voice and falsetto in the voice information, by inputting evaluation data based on the fundamental wave and the plurality of harmonics extracted in the extraction step into the trained learning model. In this way, users can easily grasp, through practical processing, the proportion in which chest voice and falsetto are mixed in the voice they produce. That is, a program that objectively evaluates a vocal technique in which chest voice is mixed into falsetto can be provided.

FIG. 1 is a block diagram illustrating the configuration of an utterance evaluation apparatus that is an embodiment of the present invention.
FIG. 2 is a diagram explaining the utterance training method (procedure) on which the utterance evaluation of this embodiment is premised.
FIG. 3 is a functional block diagram explaining the main part of the control functions provided in the CPU of the utterance evaluation apparatus of FIG. 1.
FIG. 4 is a plot of pitch over time when ascending and descending scales are vocalized on the vowel "あ" (a).
FIG. 5 is a plot over time, for the same voice as in FIG. 4, of the spectral centroid of the plurality of harmonic components divided by the fundamental frequency.
FIG. 6 is a diagram showing an example of the relationship between the frequency of the fundamental wave and the first fusion rate for an ascending scale vocalized on "え" (e) by a male, as an evaluation result of the utterance evaluation apparatus of FIG. 1.
FIG. 7 is a diagram showing an example of the relationship between the frequency of the fundamental wave and the first fusion rate for an ascending scale vocalized on "え" (e) by a male, as an evaluation result of the utterance evaluation apparatus of FIG. 1.
FIG. 8 is a diagram showing an example of the relationship between the frequency of the fundamental wave and the first fusion rate for an ascending scale vocalized on "え" (e) by a male, as an evaluation result of the utterance evaluation apparatus of FIG. 1.
FIG. 9 is a diagram showing an example of the relationship between the frequency of the fundamental wave and the first fusion rate for an ascending scale vocalized on "え" (e) by a female, as an evaluation result of the utterance evaluation apparatus of FIG. 1.
FIG. 10 is a diagram showing an example of the relationship between the frequency of the fundamental wave and the first fusion rate for an ascending scale vocalized on "え" (e) by a female, as an evaluation result of the utterance evaluation apparatus of FIG. 1.
FIG. 11 is a diagram showing an example of the relationship between the frequency of the fundamental wave and the first fusion rate for an ascending scale vocalized on "え" (e) by a female, as an evaluation result of the utterance evaluation apparatus of FIG. 1.
FIG. 12 is a diagram showing the relationship between the frequency of the fundamental wave and the second fusion rate for the voice shown in FIG. 6.
FIG. 13 is a diagram showing the relationship between the frequency of the fundamental wave and the second fusion rate for the voice shown in FIG. 7.
FIG. 14 is a diagram showing the relationship between the frequency of the fundamental wave and the second fusion rate for the voice shown in FIG. 8.
FIG. 15 is a diagram showing the relationship between the frequency of the fundamental wave and the second fusion rate for the voice shown in FIG. 9.
FIG. 16 is a diagram showing the relationship between the frequency of the fundamental wave and the second fusion rate for the voice shown in FIG. 10.
FIG. 17 is a diagram showing the relationship between the frequency of the fundamental wave and the second fusion rate for the voice shown in FIG. 11.
FIG. 18 is a diagram showing an evaluation result screen, an example of the evaluation results displayed on the touch panel display of the utterance evaluation apparatus of FIG. 1.
FIG. 19 is a flowchart explaining the main part of an example of utterance evaluation control by the CPU of the utterance evaluation apparatus of FIG. 1.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a block diagram illustrating the configuration of an utterance evaluation apparatus 10 that is an embodiment of the present invention. As shown in FIG. 1, the utterance evaluation apparatus 10 of this embodiment includes a CPU 12, a ROM 14, and a RAM 16. The CPU 12 is the central processing unit of the utterance evaluation apparatus 10. The ROM 14 is a read-only memory. The RAM 16 is a random access memory. The utterance evaluation apparatus 10 includes a storage unit 18, such as a well-known memory card. The utterance evaluation apparatus 10 includes a communication unit 20 that communicates by radio waves with a communication line such as the Internet. The utterance evaluation apparatus 10 includes a speaker 22, a microphone 24, and an A/D converter 26. The microphone 24 is a voice input device. The A/D converter 26 converts the voice input from the microphone 24 into a digital signal. The utterance evaluation apparatus 10 includes a touch panel display 28. The touch panel display 28 includes a display device 30 that displays images (video) and a touch panel 32 that accepts input to the utterance evaluation apparatus 10 in response to contact by the user's finger, an attached pen, or the like.

The utterance evaluation apparatus 10 of this embodiment is preferably a mobile phone, a so-called smartphone, that has the touch panel display 28 and various functions, including Internet access in response to operations on the touch panel display 28, and on which a program (application software, i.e., a mobile app) according to an embodiment of the present invention is installed. Alternatively, the program may be installed on a general personal computer, tablet terminal, portable music player, consumer game console, or the like that has a configuration equivalent to the CPU 12, ROM 14, RAM 16, and so on. The present invention may also be applied to a karaoke apparatus that has a MIDI (Musical Instrument Digital Interface) sound source, an amplifier mixer, and the like, outputs a piece selected from a large number of pieces, and amplifies and outputs the voice input through a microphone. A communication function such as the communication unit 20 is not essential. In an utterance evaluation apparatus that performs utterance evaluation but does not output audio, such as recorded singing data of the user, the speaker 22 need not be provided.

The microphone 24 is a device (electroacoustic transducer) that converts sound (vibration), such as a voice uttered by a person, into an electrical signal. The microphone 24 is preferably a general microphone of the kind provided in a mobile phone, which picks up air-conducted sound, for example. Alternatively, a microphone that picks up body-conducted sound, for example with a patch-type vibration pickup placed in contact with the throat, may be used. The A/D converter 26 converts the voice information input from the microphone 24 as an analog signal into a digital signal and supplies it to the CPU 12 and other components.

In the utterance evaluation apparatus 10, the CPU 12 performs signal processing according to a program stored in advance in the ROM 14 while using the temporary storage function of the RAM 16. That is, the utterance evaluation apparatus 10 includes a so-called microcomputer (electronic control unit). With this configuration, the CPU 12 executes the various controls involved in the utterance evaluation control of this embodiment, described later.

FIG. 2 is a diagram explaining the utterance training method (procedure) on which the utterance evaluation of this embodiment is premised. As shown in FIG. 2, this embodiment presupposes an utterance training method that trains a different way of producing the voice in each of several stages, from the first stage "Stage.1" to the sixth stage "Stage.6". In the first stage "Stage.1", the trainee practices producing falsetto and chest voice as clearly separate voices. The pitch is fixed at a predetermined note. In the second stage "Stage.2", the trainee practices producing notes of various pitches while keeping falsetto and chest voice clearly separate. In the third stage "Stage.3", the trainee practices singing a song with a simple melody in each voice, keeping falsetto and chest voice clearly separate. In the fourth stage "Stage.4", the trainee practices moving back and forth between falsetto and chest voice while switching between them. In the fifth stage "Stage.5", the trainee practices mixing falsetto and chest voice so as to make the register break (換声点), that is, the abrupt change of vocal register that occurs at the transition between chest voice and falsetto, inconspicuous. In the sixth stage "Stage.6", the trainee practices mixing falsetto and chest voice and strengthening both to produce a single integrated voice.

As shown in FIG. 2, the fourth stage "Stage.4" is further divided into several sub-stages, for example "Stage.4a", "Stage.4b", and "Stage.4c". "Stage.4a" corresponds to a vocalization in which the trainee can switch between falsetto and chest voice, but the switch is distinct and clearly audible. "Stage.4b" corresponds to a vocalization in which falsetto and chest voice are mixed to some extent at the switch. "Stage.4c" corresponds to a vocalization in which falsetto and chest voice are mixed considerably at the switch. In this embodiment, the evaluation unit 52 shown in FIG. 3 and described in detail below evaluates the switching between chest voice and falsetto in the fourth stage "Stage.4". The second evaluation unit 54 evaluates, in "Stage.3" and similar stages, how well chest voice and falsetto are sung apart, that is, whether falsetto and chest voice are produced as clearly separate voices.

FIG. 3 is a functional block diagram explaining the main part of the control functions provided in the CPU 12 of the utterance evaluation apparatus 10. The recording unit 50, evaluation unit 52, second evaluation unit 54, and display control unit 56 shown in FIG. 3 are preferably all provided functionally in the CPU 12. That is, by executing the program of this embodiment stored in the ROM 14 or the like, the CPU 12 is made to function as the recording unit 50, evaluation unit 52, second evaluation unit 54, and display control unit 56 shown in FIG. 3. Alternatively, the recording unit 50, evaluation unit 52, second evaluation unit 54, and display control unit 56 may each be configured as a separate control unit and carry out the functions described in detail below by exchanging information with one another. The extraction unit 60 included in the evaluation unit 52 may likewise be configured as a control unit separate from the evaluation unit 52 and carry out the functions described below by exchanging information with it. The second evaluation unit 54 need not be provided in the utterance evaluation apparatus 10 of this embodiment.

The recording unit 50 records the voice information input through the microphone 24. For example, it stores (accumulates) the voice information, input through the microphone 24 and converted into a digital signal by the A/D converter 26, in a voice database 38 formed in the storage unit 18 or the like. The voice information is stored as audio files of a predetermined format, such as AVI (Audio Video Interleave) files or WAVE sound files, each tagged with its own identification information. The recording unit 50 may store a new audio file each time the voice input from the microphone 24 is interrupted (each time no voice is input for a predetermined time or longer), or it may store the voice information corresponding to the voice input from the microphone 24 during a predetermined fixed time as one audio file. Preferably, the voice information input through the microphone 24 is stored in the voice database 38 in association with the identification information of the user who produced it (the user to be evaluated by the utterance evaluation apparatus 10).
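The first storage mode, starting a new file whenever the input stays silent for a set time, can be sketched as follows. The amplitude threshold, the silence duration, and the function name are assumptions chosen for illustration; the source specifies only that a new file begins after a predetermined period without voice input.

```python
def split_on_silence(samples, sample_rate, threshold=0.01, min_silence_s=0.5):
    """Split a sample stream into segments separated by long silences.

    A segment is closed once at least min_silence_s seconds of samples
    with |amplitude| < threshold have followed it; trailing silence is
    dropped from each segment.
    """
    min_gap = int(min_silence_s * sample_rate)
    segments, current, quiet_run = [], [], 0
    for s in samples:
        if abs(s) < threshold:
            quiet_run += 1
            if current:
                current.append(s)
                if quiet_run >= min_gap:
                    segments.append(current[:len(current) - quiet_run])
                    current = []
        else:
            quiet_run = 0
            current.append(s)
    if current:
        segments.append(current[:len(current) - quiet_run] if quiet_run else current)
    return segments


# Hypothetical stream at 4 samples per second with a 1-second gap in the middle.
stream = [0.5] * 3 + [0.0] * 4 + [0.4] * 2 + [0.0]
print(split_on_silence(stream, sample_rate=4, min_silence_s=1.0))
```

Each returned segment would then be written out as its own audio file with its own identifier.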

For utterance evaluation by the utterance evaluation apparatus 10, the voice of the user (speaker) recorded by the recording unit 50 is preferably a vocalization in which chest voice and falsetto are switched at least once (preferably with each repeated one or more times). This switching includes not only cases in which the two are switched distinctly but also vocalizations in which chest voice and falsetto are mixed at the switching point, as in the stage "Stage.4c" shown in FIG. 2. The user's voice recorded by the recording unit 50 is the user's voice input from the microphone 24. That is, it is the vocalization corresponding to the voice information that is extracted by the extraction unit 60 and, in turn, evaluated by the evaluation unit 52 or the second evaluation unit 54. The vocalization to be recorded is preferably one in which chest voice and falsetto are produced separately on a single vowel, for example by sustaining "あ" (a) or "え" (e) or vocalizing it as an ascending or descending scale. It may also include consonants, for example when singing a predetermined assigned song (a simple melody).

The evaluation unit 52 of this embodiment includes the extraction unit 60. The extraction unit 60 extracts the fundamental wave and a plurality of harmonics from the audio information input through the microphone 24. Preferably, it reads data for a predetermined section (for example, a section corresponding to a predetermined fixed time) from the audio information stored in the audio database 38. It then performs speech analysis on the read data to extract the fundamental wave (the component corresponding to the lowest frequency obtained from the Fourier transform) and the plurality of harmonics (the components at integer multiples of the fundamental frequency). To perform this control, the extraction unit 60 includes a speech analysis unit 62, a vocal cord signal spectrum estimation unit 64, and a feature extraction unit 66. The processing by each of these control units is described below. The extraction unit 60 may instead be provided as a control unit separate from the evaluation unit 52.

The speech analysis unit 62 performs speech analysis based on a predetermined algorithm on the audio information input through the microphone 24. For example, it analyzes the target audio information by LPC (Linear Predictive Coding) analysis or cepstrum analysis, both of which are commonly used speech analysis techniques. Preferably, it reads data for a predetermined section from the audio information stored in the audio database 38 and applies LPC analysis, cepstrum analysis, or the like to that data, thereby estimating the frequency of the fundamental wave (hereinafter, the fundamental frequency) and the vocal tract characteristics of the target audio information. The vocal tract characteristics take values specific to each speaker and are called formants (groups of overtones corresponding to the air in the vocal tract and its resonance frequencies). That is, the speech analysis unit 62 preferably estimates (extracts) various kinds of information about the target audio information, beginning with the fundamental frequency: for example, the formant frequencies corresponding to the vocal tract characteristics, the fundamental frequency (pitch information), the signal strength corresponding to loudness (voice volume), and information for discriminating voiced from unvoiced sound.

In addition, removing the speaker-specific vocal tract characteristics (formants) from the speech and analyzing the vocal cord source wave that remains makes it possible to discriminate suitably between chest voice and falsetto in the target speech.

The vocal cord signal spectrum estimation unit 64 estimates the spectrum of the signal obtained by removing the vocal tract characteristics estimated by the speech analysis unit 62 from the audio information recorded by the recording unit 50. Preferably, for the data of the predetermined section read from the audio database 38, it removes the vocal tract characteristics estimated by the speech analysis unit 62 from that data and estimates the spectrum of the resulting signal. In LPC analysis, for example, an inverse filter of the vocal tract characteristics (frequency transfer characteristics) estimated by the speech analysis unit 62 is constructed. The target audio information is passed through this inverse filter to obtain a residual signal, and the spectrum is then estimated by the Fourier transform of that residual signal. In cepstrum analysis, the target audio information is processed with a low-pass lifter, that is, the low-quefrency part of the audio information is removed by the lifter, to obtain a residual signal, and the spectrum is then estimated by the Fourier transform of that residual signal.
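The LPC branch described above (estimate the vocal tract filter, apply its inverse, take the spectrum of the residual) can be sketched in plain Python. This is a minimal illustration under stated assumptions: the autocorrelation method with Levinson-Durbin recursion is one common way to obtain LPC coefficients, the description does not say which variant is used, and the function names and the toy one-pole signal are hypothetical.

```python
import math
import random

def lpc(x, order):
    """Autocorrelation-method LPC via Levinson-Durbin.

    Returns a = [1, a1, ..., ap], the coefficients of the inverse filter
    A(z) = 1 + a1*z^-1 + ... + ap*z^-p.
    """
    n = len(x)
    r = [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a

def inverse_filter(x, a):
    """Apply A(z) to x; the output is the prediction residual."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]

def magnitude_spectrum(x):
    """Naive DFT magnitude for bins 0..N/2 (adequate for short frames)."""
    n = len(x)
    out = []
    for k in range(n // 2 + 1):
        re = sum(x[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = -sum(x[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        out.append(math.hypot(re, im))
    return out

# Toy input (not speech): white noise shaped by a one-pole "vocal tract".
random.seed(0)
signal = []
prev = 0.0
for _ in range(2000):
    prev = 0.9 * prev + random.gauss(0.0, 1.0)
    signal.append(prev)

coeffs = lpc(signal, order=1)              # estimated vocal tract filter
residual = inverse_filter(signal, coeffs)  # vocal tract removed
residual_spectrum = magnitude_spectrum(residual[:256])
```

For real speech a higher LPC order and a windowed frame would normally be used; order 1 is chosen here only so the toy filter can be recovered.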

The feature extraction unit 66 extracts the fundamental wave and the plurality of harmonics in the audio information input through the microphone 24, based on the fundamental frequency and other values obtained by the speech analysis unit 62. For example, from the spectrum estimated by the vocal cord signal spectrum estimation unit 64, it extracts the plurality of harmonics other than the fundamental wave, using the fundamental frequency estimated by the speech analysis unit 62. Preferably, it extracts the relative levels of the fundamental wave and each harmonic. That is, it extracts (calculates) the relative ratio between the sound pressure level of the fundamental wave and the sound pressure level of each harmonic in the target audio information.
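As a rough illustration of extracting relative levels, the sketch below measures the amplitude at each integer multiple of a known fundamental frequency and divides it by the amplitude of the fundamental. The single-bin DFT, the use of linear amplitude as the "level", and the names `amplitude_at` and `relative_levels` are assumptions made for this example.

```python
import math

def amplitude_at(x, sample_rate, freq):
    """Amplitude of the sinusoidal component of x at freq (single-bin DFT)."""
    n = len(x)
    re = sum(x[t] * math.cos(2 * math.pi * freq * t / sample_rate) for t in range(n))
    im = sum(x[t] * math.sin(2 * math.pi * freq * t / sample_rate) for t in range(n))
    return 2.0 * math.hypot(re, im) / n

def relative_levels(x, sample_rate, f0, n_harmonics):
    """Return the level of harmonic k over the fundamental, for k = 2..n."""
    fundamental = amplitude_at(x, sample_rate, f0)
    return [amplitude_at(x, sample_rate, k * f0) / fundamental
            for k in range(2, n_harmonics + 1)]

# Synthetic tone: 200 Hz fundamental plus two harmonics (amplitudes made up).
SAMPLE_RATE = 8000
tone = [1.00 * math.sin(2 * math.pi * 200 * t / SAMPLE_RATE)
        + 0.50 * math.sin(2 * math.pi * 400 * t / SAMPLE_RATE)
        + 0.25 * math.sin(2 * math.pi * 600 * t / SAMPLE_RATE)
        for t in range(800)]
levels = relative_levels(tone, SAMPLE_RATE, 200.0, 3)
```

The frame length is chosen so that every component completes a whole number of cycles, which keeps the single-bin estimate exact; on real recordings a window and peak search around each multiple of the fundamental would be more robust.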

The second evaluation unit 54 evaluates how well chest voice and falsetto are sung separately (produced selectively) in the audio information. For example, it performs this evaluation for the first singing method, in which chest voice and falsetto are each sung separately, at "Stage.3" shown in FIG. 2. Preferably, it evaluates the second fusion rate, described in detail below, as an index of the separate singing of chest voice and falsetto in the target audio information. Here, chest voice (omote-goe), also called natural voice (jigoe), refers to a series of sounds (a register) in which the energy of the harmonics relative to the fundamental wave in the vocal cord source wave tends to be larger than in falsetto. In terms of the vocal mechanism, chest voice is thought to arise when the closer muscle group surrounding the vocal cords dominates the cricothyroid muscle and the whole of the vocal cords vibrates. Falsetto (uragoe) refers to a series of sounds (a register) in which the energy of the harmonics relative to the fundamental wave in the vocal cord source wave tends to be smaller than in chest voice. In terms of the vocal mechanism, falsetto is thought to arise when the cricothyroid muscle surrounding the vocal cords dominates the closer muscle group and the edges of the vocal cords vibrate. That is, in this embodiment, chest voice and falsetto are regarded as belonging to different registers, and the sound of each register is regarded as being produced by a single mechanism (for example, a way of using the vocal cords). To evaluate the fusion rate of chest voice and falsetto in the target audio information, the second evaluation unit 54 includes a second fusion rate calculation unit 74 and a chest voice/falsetto vocalization ability evaluation unit 76. The processing by each of these control units is described below.

The second fusion rate calculation unit 74 calculates the fusion state (fusion rate) of chest voice and falsetto, the two registers in the target audio information. Preferably, it calculates the second fusion rate as a single scale value that indexes this fusion state. Preferably, the second fusion rate is the spectral centroid fg of the plurality of harmonics in the target audio information divided by the frequency of the fundamental wave (the fundamental frequency) f0, that is, fg/f0. In other words, the spectral centroid is calculated from the relative levels (relative values of the sound pressure level) at the fundamental frequency and at the frequency of each harmonic extracted from the target audio information. The ratio of the calculated spectral centroid to the fundamental frequency (pitch frequency) is then obtained and used as the second fusion rate.
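A minimal sketch of this calculation follows. It assumes the centroid is an amplitude-weighted mean of the component frequencies with the fundamental included and linear weights; the description does not fix the weighting, and the function name and the example levels are illustrative only.

```python
def second_fusion_rate(f0, amplitudes):
    """Return fg / f0 for component levels [A1, A2, ..., An].

    amplitudes[k-1] is the level of the component at k * f0, so A1 is the
    fundamental. Linear-amplitude weighting is an assumption.
    """
    total = sum(amplitudes)
    if total <= 0:
        raise ValueError("amplitudes must have a positive sum")
    centroid = sum(k * f0 * a for k, a in enumerate(amplitudes, start=1)) / total
    return centroid / f0

# Made-up levels: strong harmonics (chest-like) vs. weak harmonics (falsetto-like).
chest_like = second_fusion_rate(220.0, [1.0, 0.9, 0.8, 0.7, 0.6])
falsetto_like = second_fusion_rate(220.0, [1.0, 0.3, 0.1, 0.05, 0.02])
```

With these invented inputs the chest-like spectrum yields the larger ratio, matching the tendency the description attributes to chest voice.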

In this embodiment, the second evaluation unit 54 preferably evaluates chest voice and falsetto based on as many harmonic components as possible. That is, the spectral centroid of as many harmonic components as possible is obtained and used to evaluate chest voice and falsetto. This spectral centroid also depends on the frequency of the fundamental wave (pitch frequency). Therefore, to eliminate the influence of the fundamental frequency, this embodiment divides the spectral centroid by the fundamental frequency as described above, which indexes (normalizes) the harmonic structure of the vocal cord source.
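Why the division removes the pitch dependence can be seen from a short worked form. The notation is an assumption for illustration: A_k denotes the level of the component at k·f0 (k = 1 is the fundamental), and the centroid is taken as a level-weighted mean of the component frequencies, which the description does not spell out.

```latex
f_g = \frac{\sum_{k=1}^{n} (k f_0)\, A_k}{\sum_{k=1}^{n} A_k}
\quad\Longrightarrow\quad
\frac{f_g}{f_0} = \frac{\sum_{k=1}^{n} k\, A_k}{\sum_{k=1}^{n} A_k}
```

Under this assumption f0 cancels, so the ratio depends only on how the levels A_k are distributed over the harmonic numbers, that is, on the harmonic structure rather than on the pitch being sung.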

The chest voice/falsetto vocalization ability evaluation unit 76 evaluates the fusion rate of chest voice and falsetto based on the fundamental wave and the plurality of harmonics in the target audio information. Specifically, it performs this evaluation based on the second fusion rate fg/f0 calculated by the second fusion rate calculation unit 74. Preferably, the higher the second fusion rate fg/f0, the higher it judges the proportion of chest voice and the lower the proportion of falsetto. That is, the relationship that serves as the evaluation criterion of the chest voice/falsetto vocalization ability evaluation unit 76 is predetermined so that a larger second fusion rate fg/f0 indicates a higher proportion of chest voice and a lower proportion of falsetto in the target audio information.

FIG. 4 plots, over time, the pitch (fundamental frequency) when ascending and descending scales are voiced on the vowel "a". FIG. 5 plots, over time, the value fg/f0 obtained for the same speech by dividing the spectral centroid fg of the plurality of harmonic components by the fundamental frequency f0. In FIG. 5, hatched areas mark the sections that a person skilled in falsetto vocalization, on actually listening to the utterance, judged to be falsetto. In other words, the remaining sections not hatched in FIG. 5 are those judged to be chest voice. In the example of FIG. 5, fg/f0 changes continuously between the two registers, chest voice and falsetto, and it is relatively low in the sections corresponding to falsetto and relatively high in the sections corresponding to chest voice. This shows that fg/f0, the spectral centroid fg of the plurality of harmonic components divided by the fundamental frequency f0, serves as an index value for evaluating the fusion rate in the separate singing of chest voice and falsetto. For example, by setting a criterion under which the utterance is chest voice when this index value is at or above the reference value shown by the broken line in FIG. 5 (fg/f0 of about 2.8) and falsetto when it is below that reference value, chest voice and falsetto can be evaluated objectively and unambiguously.
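The criterion can be written as a one-line comparison. The sketch below uses the approximate reference value of 2.8 mentioned for FIG. 5; the function names and the idea of labeling frame by frame are assumptions for illustration.

```python
REFERENCE_VALUE = 2.8  # approximate broken-line value described for FIG. 5

def classify_register(fusion_rate, reference=REFERENCE_VALUE):
    """Label one frame: at or above the reference counts as chest voice."""
    return "chest" if fusion_rate >= reference else "falsetto"

def label_frames(fusion_rates, reference=REFERENCE_VALUE):
    """Label a time series of fg/f0 values frame by frame."""
    return [classify_register(rate, reference) for rate in fusion_rates]
```

A suitable reference value presumably varies with the speaker and the analysis settings, so in practice it would be passed in through the `reference` argument rather than hard-coded.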

The evaluation unit 52 evaluates the switching between chest voice and falsetto in the audio information from a predetermined relationship, based on the fundamental wave and the plurality of harmonics extracted by the extraction unit 60. For example, it performs this evaluation for the second singing method, in which the voice is varied continuously between chest voice and falsetto, at "Stage.4" shown in FIG. 2. Preferably, it evaluates the first fusion rate, described in detail below, as an index of the switching between chest voice and falsetto in the target audio information. To evaluate this switching, the evaluation unit 52 includes a calculation formula 70 and calculation constant information 72. The control performed by the evaluation unit 52 is described in detail below.

The calculation formula 70 is preferably a learning model (mathematical model), and more preferably a nonlinear learning model that learns a nonlinear gain. For example, for input audio information it outputs a continuous value from -1 to 1 inclusive, where falsetto is "-1" and chest voice is "1". A support vector machine (SVM) is suitably used as the calculation formula 70. Alternatively, a neural network is suitably used. As detailed below, the calculation formula 70 is preferably trained with learning data (learning samples) based on the fundamental wave and the plurality of harmonics in each of first audio information corresponding to chest voice and second audio information corresponding to falsetto, both acquired in advance.

In training the evaluation unit 52, audio information for learning is input to the extraction unit 60, and learning data based on the fundamental wave and the plurality of harmonics extracted by the extraction unit 60 is input to the calculation formula 70. At this point, the constants in the calculation formula 70 are set from the values stored in the calculation constant information 72. If, with those constants, the ratio output from the calculation formula 70 for given audio information (the first fusion rate described later) is inappropriate in light of what the learning audio information represents, the constants set in the calculation formula 70 need to be corrected (learned). In this correction, the calculation formula 70 is first corrected so that the ratio it outputs for the input audio information no longer contradicts the learning audio information. For example, the calculation formula 70 is corrected so that it outputs "-1" when audio information corresponding to correct falsetto is input as learning audio information, and so that it outputs "1" when audio information corresponding to correct chest voice is input. Next, the learning audio information is changed, and the calculation formula 70 is corrected so that the ratio it outputs for that input is consistent with the learning audio information. Next, the set of harmonics to be extracted by the extraction unit 60 is changed. The constants of the calculation formula 70 corrected in this way are stored in the calculation constant information 72, which completes the training of the calculation formula 70.
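The correction loop can be illustrated with a deliberately small stand-in model. The sketch below is not the SVM or neural network of the embodiment: it is a single tanh unit trained by gradient descent, chosen only because its output naturally lies between -1 and 1. The feature values, learning rate, epoch count, and function names are all invented for the example.

```python
import math

def predict(weights, bias, features):
    """Output in (-1, 1) computed from relative harmonic levels."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return math.tanh(z)

def train(samples, targets, epochs=3000, lr=0.2):
    """Adjust the constants so chest voice maps toward +1, falsetto toward -1."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, target in zip(samples, targets):
            y = predict(weights, bias, features)
            # Gradient of 0.5 * (y - target)**2 taken through tanh.
            delta = (y - target) * (1.0 - y * y)
            weights = [w - lr * delta * x for w, x in zip(weights, features)]
            bias -= lr * delta
    return weights, bias

# Made-up relative levels [L02, L03]; these are not measured data.
samples = [[0.9, 0.8], [1.0, 0.7], [0.8, 0.9],    # labeled chest voice
           [0.2, 0.1], [0.3, 0.05], [0.1, 0.15]]  # labeled falsetto
targets = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]
weights, bias = train(samples, targets)
```

The learned `weights` and `bias` play the role of the constants stored in the calculation constant information 72; an input between the two clusters yields an intermediate output, which is the behavior a continuous fusion rate needs.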

The learning data used to train the evaluation unit 52 is preferably the intensity of each of the plurality of harmonics divided by the intensity of the fundamental wave, computed for each of the first audio information corresponding to chest voice and the second audio information corresponding to falsetto, both acquired in advance. For example, for the second-order through n-th-order harmonics extracted from each of the first and second audio information, the data consists of the relative levels L02, L03, ..., L0n, where L0k is the intensity of the k-th-order harmonic divided by the intensity of the fundamental wave. As described later, the first and second audio information used for training may correspond to chest voice and falsetto uttered by the user to be evaluated by the evaluation unit 52, that is, the subject, or to chest voice and falsetto uttered by a speaker other than the subject, for example a trainer skilled in chest voice and falsetto vocalization. Training may use first and second audio information from a single user's utterances, or learning data corresponding to chest voice and falsetto uttered by each of a plurality of users. Those users may or may not include the user to be evaluated by the evaluation unit 52. For example, audio information corresponding to chest voice and falsetto uttered by a plurality of trainers, together with audio information corresponding to chest voice and falsetto uttered by the subject, may be used as learning data. In short, the first and second audio information used for training corresponds to chest voice and falsetto, acquired in advance, of at least one of the subject and a speaker other than the subject. The calculation formula 70 may be trained (finalized) in advance on a computer separate from the utterance evaluation apparatus 10 and then stored in the storage unit 18 or the like. That is, the calculation formula 70 need not be trained by the utterance evaluation apparatus 10 of this embodiment.

Training of the evaluation unit 52 is preferably based on the first audio information corresponding to chest voice and the second audio information corresponding to falsetto in the first singing method, as evaluated by the second evaluation unit 54. Specifically, training uses learning data that includes, for each of the first audio information judged to be chest voice and the second audio information judged to be falsetto by the second evaluation unit 54, the intensities of the plurality of harmonics divided by the intensity of the fundamental wave. In addition to this learning data, training may also use learning data based on the fundamental wave and the plurality of harmonics in first audio information corresponding to chest voice and second audio information corresponding to falsetto that were not evaluated by the second evaluation unit 54. When the audio information evaluated by the evaluation unit 52 and by the second evaluation unit 54 comes from the same speaker, the evaluation unit 52 trains the calculation formula 70 on the chest voice and falsetto of the very user being evaluated.

The evaluation unit 52 inputs evaluation data based on the fundamental wave and the plurality of harmonics extracted by the extraction unit 60 into the calculation formula 70, and takes the value output from the calculation formula 70 as the first fusion rate of chest voice and falsetto in the target audio information. The first fusion rate is, for example, a continuous value from -1 to 1 inclusive, where falsetto is "-1" and chest voice is "1". The evaluation unit 52 preferably calculates the evaluation data from the fundamental wave and the plurality of harmonics extracted by the extraction unit 60. This evaluation data is preferably the intensity of each of the plurality of harmonics extracted by the extraction unit 60 from the audio information input through the microphone 24, divided by the intensity of the fundamental wave. For example, for the second-order through n-th-order harmonics extracted by the extraction unit 60, it calculates the relative levels L02, L03, ..., L0n, where L0k is the intensity of the k-th-order harmonic divided by the intensity of the fundamental wave. It then inputs the values L02, L03, ..., L0n into the calculation formula 70 and takes the value output from the calculation formula 70 as the first fusion rate.

The evaluation unit 52 preferably evaluates the switching between chest voice and falsetto (the transition from one to the other) in the target audio information. That is, it evaluates the register-transition shock at the point where chest voice and falsetto change over. Register-transition shock refers to the abrupt change of register that occurs at the boundary between chest voice and falsetto during vocalization, and to the fluctuations in volume and pitch that accompany it. This shock is thought to occur because the balance of force changes abruptly between the cricothyroid muscle, which dominates when producing falsetto, and the closer muscle group, which dominates when producing chest voice. The complex movement of the joints between the cricoid cartilage and the arytenoid cartilages located at its upper rear on the left and right is also thought to be a cause. In the example of FIG. 5, the reference value shown by the broken line (fg/f0 of about 2.8) corresponds to the register transition point, and the locations where the second fusion rate crosses that reference value correspond to register transition points.
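Locating the places where the second fusion rate crosses the reference value is a simple scan over the time series. The sketch below is illustrative: the function name is hypothetical, and 2.8 is the approximate reference value mentioned for FIG. 5.

```python
def register_transition_points(fusion_rates, reference=2.8):
    """Indices i where the series crosses the reference between i-1 and i."""
    points = []
    for i in range(1, len(fusion_rates)):
        was_chest = fusion_rates[i - 1] >= reference
        is_chest = fusion_rates[i] >= reference
        if was_chest != is_chest:
            points.append(i)
    return points
```

The returned indices mark frames around which changes in volume and pitch could then be inspected for register-transition shock.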

The evaluation unit 52 preferably evaluates the switching between chest voice and falsetto in the target audio information based on how smoothly the first fusion rate output from the calculation formula 70 changes as the frequency of the fundamental wave extracted by the extraction unit 60 changes. In this embodiment, a smooth change in the first fusion rate means that, when the first fusion rate is considered as a function of the frequency of the fundamental wave extracted by the extraction unit 60, that function has a continuous derivative. When the audio information to be evaluated is voiced, for example, as an ascending and descending scale on a single vowel, the frequency of the fundamental wave extracted by the extraction unit 60 changes along the time series. In this case, the evaluation unit 52 preferably evaluates the switching between chest voice and falsetto based on the smoothness of the change of the first fusion rate along the time series. When the audio information to be evaluated is the singing of a predetermined assigned song (a simple melody), the evaluation unit 52 preferably reorders the audio information according to the change in the frequency of the fundamental wave extracted by the extraction unit 60. It then evaluates the switching between chest voice and falsetto based on the smoothness of the change in the first fusion rate output from the calculation formula 70 across the reordered fundamental frequencies. Specific examples of this evaluation are described below with reference to FIGS. 6 to 11.
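One way to approximate this smoothness check on sampled data is to reorder the frames by fundamental frequency and look for jumps in the first fusion rate between neighboring frames. This is a discrete proxy chosen for illustration, not the criterion of the embodiment; the function names and the `max_jump` tolerance are hypothetical.

```python
def largest_gap(f0_values, fusion_rates):
    """Largest jump in first fusion rate between pitch-adjacent frames."""
    pairs = sorted(zip(f0_values, fusion_rates))  # reorder frames by pitch
    return max((abs(b[1] - a[1]) for a, b in zip(pairs, pairs[1:])),
               default=0.0)

def is_smooth(f0_values, fusion_rates, max_jump=0.5):
    """Hypothetical criterion: no pitch-adjacent jump exceeds max_jump."""
    return largest_gap(f0_values, fusion_rates) <= max_jump

# Invented frames given out of pitch order, as they might be for a melody.
f0_hz = [300, 100, 250, 150, 350, 200]
abrupt = [-1.0, 1.0, 1.0, 1.0, -1.0, 1.0]    # sudden switch in register
gradual = [-0.6, 1.0, -0.2, 0.6, -1.0, 0.2]  # steady decrease with pitch
```

Sorting by pitch covers the assigned-song case, where the melody does not visit the fundamental frequencies in order; a fuller check could also compare slopes on either side of each frame to approximate continuity of the derivative.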

FIGS. 6 to 8 each show an example of the relationship between the frequency (pitch) of the fundamental wave extracted by the extraction unit 60 and the first fusion rate, for an ascending scale voiced on "e" by a male speaker. The horizontal axis is the frequency of the fundamental wave and the vertical axis is the first fusion rate (the same applies in the description below). FIG. 6 corresponds to an utterance in which falsetto and chest voice are switched, but so distinctly that the switch is clearly audible; it corresponds to stage "Stage.4a" in FIG. 2. FIG. 7 corresponds to an utterance in which falsetto and chest voice are mixed to some degree at the switch; it corresponds to stage "Stage.4b" in FIG. 2. FIG. 8 corresponds to an utterance in which falsetto and chest voice are mixed considerably at the switch; it corresponds to stage "Stage.4c" in FIG. 2.

Comparing FIGS. 6 to 8: in FIG. 6, the first fusion rate stays at approximately "1" from around 100 Hz to around 250 Hz and then changes rapidly to approximately "-1" by around 300 Hz. That is, the first fusion rate as a function of the frequency of the fundamental wave extracted by the extraction unit 60 is not smooth, and there is a gap between around 250 Hz and around 300 Hz where the change is discontinuous. In FIG. 7, a gap with a discontinuous change also exists between around 250 Hz and around 300 Hz, but the first fusion rate decreases gradually from around 100 Hz to around 250 Hz, so the function is smoother than in the example of FIG. 6. In FIG. 8, the first fusion rate decreases gradually from "1" to "-1" between around 100 Hz and around 500 Hz, and no gap with a discontinuous change is observed. The relationship shown in FIG. 8 is therefore the smoothest of the three.

FIGS. 9 to 11 show examples of the relationship between the frequency (pitch) of the fundamental wave extracted by the extraction unit 60 and the first fusion rate, for an ascending scale sung on the vowel "e" by a female voice. FIG. 9 corresponds to an utterance in which the singer does switch between falsetto and chest voice, but the switch is abrupt and plainly noticeable; it corresponds to "Stage.4a" in FIG. 2. FIG. 10 corresponds to an utterance in which falsetto and chest voice are blended to some extent at the switch, and corresponds to "Stage.4b" in FIG. 2. FIG. 11 corresponds to an utterance in which falsetto and chest voice are blended considerably at the switch, and corresponds to "Stage.4c" in FIG. 2.

Comparing FIGS. 9 to 11: in FIG. 9, the first fusion rate stays at approximately 1 from about 100 Hz to about 220 Hz, then drops almost vertically and reaches approximately -1 by about 300 Hz. In addition, a relatively large gap where the function is discontinuous exists near 250 Hz. In other words, the function relating the frequency of the fundamental wave extracted by the extraction unit 60 to the first fusion rate is not smooth. In FIG. 10, a gap where the function is discontinuous also exists near 250 Hz, but it is smaller than the gap in FIG. 9. The function in FIG. 10 is therefore smoother than the function in FIG. 9. In FIG. 11, the first fusion rate decreases gradually from 1 to -1 between about 100 Hz and about 500 Hz, and no discontinuous gap is observed. The relationship shown in FIG. 11 is therefore the smoothest of the three.

As FIGS. 6 to 11 show, the smoothness of the relationship between the frequency of the fundamental wave extracted by the extraction unit 60 and the first fusion rate clearly reflects whether the singer blends chest voice and falsetto well when switching between them. The transition between chest voice and falsetto in the voice information can therefore be evaluated objectively, using the smoothness of this function as an index. As its evaluation, the evaluation unit 52 may display a graph such as those in FIGS. 6 to 11, that is, a graph showing how smoothly the first fusion rate changes as the frequency of the fundamental wave changes, or it may perform a more detailed evaluation, for example by also displaying an analysis of that graph. For example, it may compute the derivative of the function over the interval where the first fusion rate lies between -1 and 1 and evaluate the continuity of that derivative, thereby assessing how skillfully the singer switches between chest voice and falsetto. Alternatively, it may use a predetermined relationship to determine, from the smoothness of the function, which of the stages "Stage.4a", "Stage.4b", and "Stage.4c" shown in FIG. 2 the utterance corresponds to.
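The smoothness-based evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the evaluation unit 52: the function names `slopes`, `largest_jump`, and `classify_stage` and the thresholds `big_jump` and `small_jump` are hypothetical, and the sample curves only loosely imitate the shapes described for FIGS. 6 to 8.

```python
def slopes(freqs, rates):
    """Finite-difference derivative of the fusion rate with respect to pitch (per Hz)."""
    return [(r1 - r0) / (f1 - f0)
            for f0, f1, r0, r1 in zip(freqs, freqs[1:], rates, rates[1:])]


def largest_jump(rates):
    """Largest absolute change in fusion rate between adjacent pitch samples."""
    return max(abs(r1 - r0) for r0, r1 in zip(rates, rates[1:]))


def classify_stage(rates, big_jump=1.0, small_jump=0.3):
    """Map the size of the largest jump to a stage label.

    The thresholds are assumptions made for this sketch; the description
    only refers to "a predetermined relationship".
    """
    jump = largest_jump(rates)
    if jump >= big_jump:
        return "Stage.4a"   # abrupt switch, clear gap
    if jump >= small_jump:
        return "Stage.4b"   # some blending, smaller gap
    return "Stage.4c"       # gradual change, no visible gap


# Illustrative (made-up) fusion-rate curves sampled every 50 Hz.
freqs = [100, 150, 200, 250, 300, 350, 400, 450, 500]
abrupt = [1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0, -1.0]
partial = [1.0, 0.9, 0.8, 0.6, -0.2, -0.6, -0.8, -0.9, -1.0]
gradual = [1.0 - 0.25 * i for i in range(9)]

for name, curve in (("abrupt", abrupt), ("partial", partial), ("gradual", gradual)):
    print(name, classify_stage(curve), max(abs(s) for s in slopes(freqs, curve)))
```

A jump threshold is only one possible proxy for the continuity of the derivative; a variant could instead look at how much adjacent slopes differ.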

FIGS. 12 to 14 show examples of the relationship between the frequency (pitch) of the fundamental wave extracted by the extraction unit 60 and the second fusion rate fg/f0 calculated by the second fusion rate calculation unit 74, for the same voice information as the utterances evaluated in FIGS. 6 to 8. The horizontal axis is the frequency of the fundamental wave and the vertical axis is the second fusion rate (the same applies in the following description). That is, FIG. 12 corresponds to the male utterance at "Stage.4a" in FIG. 2, FIG. 13 to the male utterance at "Stage.4b", and FIG. 14 to the male utterance at "Stage.4c". FIGS. 15 to 17 show examples of the relationship between the frequency of the fundamental wave extracted by the extraction unit 60 and the second fusion rate fg/f0, for the same voice information as the utterances evaluated in FIGS. 9 to 11. That is, FIG. 15 corresponds to the female utterance at "Stage.4a" in FIG. 2, FIG. 16 to the female utterance at "Stage.4b", and FIG. 17 to the female utterance at "Stage.4c". As FIGS. 12 to 17 show, the relationship between the frequency of the fundamental wave and the second fusion rate fg/f0 does not differ markedly according to whether the singer blends chest voice and falsetto well when switching between them.

As described above, the second fusion rate fg/f0 calculated by the second fusion rate calculation unit 74 is well suited to evaluating whether a singer can produce chest voice and falsetto as clearly distinct voices, but it is not an effective index of how skillfully the singer moves through the transition between them. Accordingly, the second evaluation unit 54 evaluates the first singing method, in which chest voice and falsetto are sung separately ("Stage.3" in FIG. 2), based on the second fusion rate fg/f0, and the evaluation unit 52 evaluates the second singing method, in which the voice changes continuously between chest voice and falsetto ("Stage.4" in FIG. 2), based on the first fusion rate. In this way the vocal skill in each singing method can be suitably evaluated.

The display control unit 56 displays the evaluation result of the evaluation unit 52 on, for example, the display device 30 of the touch panel display 28. FIG. 18 shows an evaluation result screen 80, an example of an evaluation result of the evaluation unit 52 displayed on the touch panel display 28. As shown in FIG. 18, the display control unit 56 preferably displays, as the evaluation result, a graph such as those in FIGS. 6 to 11, that is, a graph showing how smoothly the first fusion rate changes as the frequency of the fundamental wave changes. Preferably, it also displays an analysis of the graph. For example, it displays a brief comment corresponding to the result of evaluating the continuity of the derivative of the function over the interval where the first fusion rate lies between -1 and 1. Preferably, it also displays which of the stages "Stage.4a", "Stage.4b", and "Stage.4c" shown in FIG. 2 the utterance corresponds to. Displaying the evaluation result in this visually clear form lets the user check the skill of his or her own singing easily and objectively.

FIG. 19 is a flowchart explaining the main part of the utterance evaluation control performed by the CPU 12 of the utterance evaluation apparatus 10. The routine is executed repeatedly at a predetermined cycle.

First, in S1, a practice stage and a model are selected via the touch panel 32 or the like, following the screen displayed on the display device 30. Next, in S2, it is determined whether the third stage "Stage.3" or the fourth stage "Stage.4" has been selected. If "Stage.4" was selected, the processing from S13 onward is executed. If "Stage.3" was selected, then in S3 the practice menu screen for "Stage.3" is displayed on the display device 30 and a guide sound is output from the speaker 22. Next, in S4, the voice information (singing voice) to be evaluated is acquired. That is, voice information is input from the microphone 24 via the A/D converter 26, and the input voice information is stored (accumulated) in the voice database 38 formed in the storage unit 18 or the like, in association with the identification information of the user who produced it.

Next, in S5, first voice information corresponding to chest voice (chest-voice singing) is extracted, for all vowels, from the voice information acquired in S4. In parallel with S5, in S6, second voice information corresponding to falsetto (falsetto singing) is extracted, for all vowels, from the voice information acquired in S4. Next, in S7, the singing voice extracted in S5 and S6 undergoes spectrum analysis by a technique such as cepstrum analysis. Next, in S8, primary data, namely the frequency (pitch) of the fundamental wave, the volume, and the like, are extracted from the result of the spectrum analysis in S7. Next, in S9, secondary data, namely the relative levels of the fundamental wave and of each harmonic (the harmonic level pattern), are extracted. Next, in S10, the second fusion rate (register fusion ratio) is calculated from the primary data extracted in S8 and the secondary data extracted in S9; it is the spectral centroid fg of the plurality of harmonics in the voice information divided by the frequency f0 of the fundamental wave (= fg/f0). In parallel with S10, in S11, the intensities of the plurality of harmonics in each of the first voice information and the second voice information extracted in S5 and S6 are divided by the intensity of the fundamental wave, and the resulting values serve as learning data for the calculation formula 70 in the evaluation unit 52. Next, in S12, the first singing method, in which chest voice and falsetto are sung separately, is evaluated based on the second fusion rate calculated in S10 (the evaluation for the third stage "Stage.3"), the evaluation result is displayed on the display device 30, and the routine ends.
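The S10 calculation can be sketched as follows. This is a minimal illustration, not the apparatus's actual code: the description only states that the second fusion rate is the spectral centroid fg of the plurality of harmonics divided by f0, so this sketch assumes an amplitude-weighted centroid over harmonics at integer multiples of f0, starting from the second harmonic. The function name `second_fusion_rate` and the input layout are hypothetical.

```python
def second_fusion_rate(f0, harmonic_levels):
    """Return fg / f0 for one analysis frame.

    f0              -- fundamental frequency in Hz
    harmonic_levels -- linear amplitudes of the harmonics at 2*f0, 3*f0, ...
                       (assumed layout; the fundamental itself is excluded)
    """
    total = sum(harmonic_levels)
    if f0 <= 0 or total <= 0:
        raise ValueError("need a positive f0 and at least one non-zero harmonic")
    # Amplitude-weighted mean frequency of the harmonics (spectral centroid fg).
    fg = sum((k + 2) * f0 * level
             for k, level in enumerate(harmonic_levels)) / total
    return fg / f0


# Two equally strong harmonics at 400 Hz and 600 Hz give fg = 500 Hz.
print(second_fusion_rate(200.0, [1.0, 1.0]))
```

Under this assumption the ratio grows as energy moves into higher harmonics and stays close to 2 when only the second harmonic is significant.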

In S13, the practice menu screen for "Stage.4" is displayed on the display device 30 and a guide sound is output from the speaker 22. Next, in S14, the voice information (singing voice) to be evaluated is acquired. That is, voice information is input from the microphone 24 via the A/D converter 26, and the input voice information is stored (accumulated) in the voice database 38 formed in the storage unit 18 or the like, in association with the identification information of the user who produced it. Next, in S15, the singing voice acquired in S14 undergoes spectrum analysis by a technique such as cepstrum analysis. Next, in S16, primary data, namely the frequency (pitch) of the fundamental wave, the volume, and the like, are extracted from the result of the spectrum analysis in S15. Next, in S17, secondary data, namely the relative levels of the fundamental wave and of each harmonic (the harmonic level pattern), are extracted. Next, in S18, the relative levels of the fundamental wave and of each harmonic extracted in S17 are input to the calculation formula 70, and the value output by the calculation formula 70 is taken as the first fusion rate of chest voice and falsetto in the voice information. Next, in S19, the second singing method, in which the voice changes continuously between chest voice and falsetto, is evaluated based on the first fusion rate calculated in S18 (the evaluation for the fourth stage "Stage.4"), the evaluation result is displayed on the display device 30, and the routine ends.
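The S11 and S18 steps, in which harmonic levels normalized by the fundamental are fed to the calculation formula 70, can be sketched as follows. The claims name a neural network or a support vector machine as the learning model; the sketch below substitutes a single tanh unit trained by gradient descent, with targets of 1 for chest voice and -1 for falsetto, purely for illustration. The function names, the learning rate, the epoch count, and the sample harmonic levels are all assumptions.

```python
import math


def normalize(fundamental, harmonics):
    """Divide each harmonic intensity by the fundamental intensity (as in S11)."""
    return [h / fundamental for h in harmonics]


def predict(weights, bias, features):
    """Stand-in for calculation formula 70: one tanh unit, output in (-1, 1)."""
    return math.tanh(sum(w * x for w, x in zip(weights, features)) + bias)


def train(samples, targets, epochs=2000, lr=0.1):
    """Adjust the model so its output approaches the target for each sample."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            y = predict(weights, bias, x)
            # Gradient of 0.5 * (y - t)**2 with respect to the pre-activation.
            g = (y - t) * (1.0 - y * y)
            weights = [w - lr * g * xi for w, xi in zip(weights, x)]
            bias -= lr * g
    return weights, bias


# Made-up training frames: chest voice with strong harmonics, falsetto with weak ones.
chest = normalize(1.0, [0.9, 0.8, 0.6])
falsetto = normalize(1.0, [0.2, 0.1, 0.05])
weights, bias = train([chest, falsetto], [1.0, -1.0])

# A frame halfway between the two training frames scores between them.
mixed = [(c + f) / 2 for c, f in zip(chest, falsetto)]
print(predict(weights, bias, chest),
      predict(weights, bias, mixed),
      predict(weights, bias, falsetto))
```

Evaluating `predict` frame by frame over an ascending scale would yield the kind of pitch-versus-fusion-rate curve that the S19 evaluation examines.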

In the above control, S4 and S14 correspond to the processing of the recording unit 50; S15 to S18 to the processing of the evaluation unit 52 (the calculation process, or calculation step); S5 to S11 to the processing of the second evaluation unit 54; S16 and S17 to the processing of the extraction unit 60 (the extraction process, or extraction step); and S12 and S19 to the processing of the display control unit 56. The processing for the third stage "Stage.3", that is, S3 to S12, need not necessarily be executed.

Thus, according to this embodiment, users can easily determine, through practical processing, the proportion in which chest voice and falsetto are mixed in their own voice. That is, the embodiment provides an utterance evaluation apparatus 10 that objectively evaluates a vocal technique in which chest voice is mixed into falsetto.

In addition, when a singer switches between falsetto and chest voice while changing the pitch of the singing voice, the apparatus can suitably evaluate, among other things, whether the switch is performed well.

In addition, using the user's own chest voice and falsetto as learning data for the model allows an even better evaluation of the proportion in which chest voice and falsetto are mixed in the user's voice.

In addition, users can easily determine, using a practical model, the proportion in which chest voice and falsetto are mixed in their own voice.

A preferred embodiment of the present invention has been described in detail above with reference to the drawings, but the present invention is not limited to this embodiment and may be implemented in other forms.

For example, in the embodiment described above, the voice information recorded by the recording unit 50 is stored in the voice database 38, and the extraction unit 60 reads the stored voice information from the voice database 38 and performs voice analysis and other processing. Alternatively, the extraction unit 60, the evaluation unit 52, and the like may process the voice information recorded by the recording unit 50 in real time. In that form, users can check in real time the fusion rate of chest voice and falsetto in what they are singing at that moment, which makes training even more efficient.

The utterance evaluation apparatus (utterance evaluation method) of the present invention can evaluate the switching between chest voice and falsetto for both male and female voices. In forms that use a neural network, for example, flexibility may also be increased by adapting to the gender, age, and other attributes of the speaker. The utterance evaluation of the present invention can be used widely in training menus of various forms, such as a correspondence vocal training course delivered over the Internet, a vocal training mode (vocal training game) provided as a smartphone application, or vocal training software installed and used on a personal computer.

Although further examples are not given individually, the present invention may be implemented with various modifications within a range that does not depart from its gist.

10: utterance evaluation apparatus, 52: evaluation unit, 54: second evaluation unit, 60: extraction unit, 70: calculation formula

Claims

An utterance evaluation apparatus that evaluates an utterance based on voice information input from a voice input device, comprising:
an extraction unit that extracts a fundamental wave and a plurality of harmonics from the voice information input from the voice input device; and
an evaluation unit in which learning data, based on the fundamental wave and the plurality of harmonics extracted by the extraction unit from each of previously acquired first voice information corresponding to chest voice and second voice information corresponding to falsetto of at least one of the subject and a speaker other than the subject, is input to a learning model, and learning is performed that corrects the learning model so that a first fusion rate output from the learning model becomes a predetermined value,
wherein the evaluation unit calculates the first fusion rate, as an index relating to switching between chest voice and falsetto in the voice information, by inputting evaluation data based on the fundamental wave and the plurality of harmonics extracted by the extraction unit into the learning model on which the learning has been performed.

The utterance evaluation apparatus according to claim 1, wherein the learning model is a neural network or a support vector machine.

The utterance evaluation apparatus according to claim 1 or 2, wherein the evaluation unit evaluates the switching between chest voice and falsetto in the voice information based on the smoothness of the change in the first fusion rate calculated for that voice information, as the frequency of the fundamental wave extracted by the extraction unit from the voice information input from the voice input device changes.

The utterance evaluation apparatus according to claim 3, further comprising a second evaluation unit that, for a first singing method in which chest voice and falsetto are sung separately, evaluates how well chest voice and falsetto are distinguished in the voice information, based on a second fusion rate determined from the fundamental wave and the plurality of harmonics in the voice information input from the voice input device,
wherein the evaluation unit is trained with learning data that includes learning data based on the fundamental wave and the plurality of harmonics in each of the first voice information corresponding to chest voice and the second voice information corresponding to falsetto in the first singing method evaluated by the second evaluation unit, and
wherein, for a second singing method in which the voice changes continuously between chest voice and falsetto, the evaluation unit evaluates the switching between chest voice and falsetto in the voice information based on the smoothness of the change in the first fusion rate calculated for that voice information, as the frequency of the fundamental wave extracted by the extraction unit from the voice information input from the voice input device changes.

The utterance evaluation apparatus according to any one of claims 1 to 4, wherein the learning data is a value obtained by dividing the intensities of the plurality of harmonics in each of the first voice information and the second voice information by the intensity of the fundamental wave, and
the evaluation data is a value obtained by dividing the intensities of the plurality of harmonics extracted by the extraction unit from the voice information input from the voice input device by the intensity of the fundamental wave.

An utterance evaluation method for evaluating an utterance based on voice information input from a voice input device, comprising:
an extraction process of extracting a fundamental wave and a plurality of harmonics from the voice information input from the voice input device;
an evaluation process in which learning data, based on the fundamental wave and the plurality of harmonics extracted in the extraction process from each of previously acquired first voice information corresponding to chest voice and second voice information corresponding to falsetto of at least one of the subject and a speaker other than the subject, is input to a learning model, and learning is performed that corrects the learning model so that a first fusion rate output from the learning model becomes a predetermined value; and
a calculation process of calculating the first fusion rate by inputting evaluation data based on the fundamental wave and the plurality of harmonics extracted in the extraction process into the learning model on which the learning has been performed.

A program that causes a control device provided in an utterance evaluation apparatus, which evaluates an utterance based on voice information input from a voice input device, to execute:
an extraction step of extracting a fundamental wave and a plurality of harmonics from the voice information input from the voice input device;
an evaluation step in which learning data, based on the fundamental wave and the plurality of harmonics extracted in the extraction step from each of previously acquired first voice information corresponding to chest voice and second voice information corresponding to falsetto of at least one of the subject and a speaker other than the subject, is input to a learning model, and learning is performed that corrects the learning model so that a first fusion rate output from the learning model becomes a predetermined value; and
a calculation step of calculating the first fusion rate by inputting evaluation data based on the fundamental wave and the plurality of harmonics extracted in the extraction step into the learning model on which the learning has been performed.