JP4595934B2

JP4595934B2 - Voice evaluation apparatus and voice evaluation method

Info

Publication number: JP4595934B2
Application number: JP2006335807A
Authority: JP
Inventors: 達也入山
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2006-12-13
Filing date: 2006-12-13
Publication date: 2010-12-08
Anticipated expiration: 2026-12-13
Also published as: JP2008145940A

Description

The present invention relates to a technique for evaluating speech.

Various karaoke apparatuses with a singing evaluation function that rates the skill of a user's singing are available. As one such apparatus, Patent Document 1 proposes a device that detects the user's singing position, compares each musical element of the singing voice with the corresponding musical element of the reference frame of a reference voice at that singing position, and evaluates the singing based on the comparison result, thereby evaluating the pitch, volume, voice quality, and so on of the singing voice.
Patent Document 1: JP 2001-117568 A

Skill in singing is judged not only by the pitch and volume of the singing voice; a listener's judgment may also depend on the intelligibility of the voice. For example, some listeners rate a performance more highly the clearer the voice is. It would therefore be desirable for the user to be able to grasp the intelligibility of his or her voice.
The present invention has been made against this background, and its object is to provide a technique that allows a user to grasp the intelligibility of speech.

A speech evaluation apparatus according to a preferred aspect of the present invention comprises: level calculation means for analyzing a speech signal representing speech for each of a plurality of predetermined frequency components and calculating a frequency spectrum; level specifying means for specifying, from the frequency spectrum calculated by the level calculation means, the level of one predetermined n-th formant as a first level, and the level at the valley position between the n-th formant and the (n+1)-th formant as a second level; intelligibility calculation means for calculating the intelligibility of the speech based on the first level and the second level specified by the level specifying means, such that the intelligibility becomes higher as the ratio of the first level to the second level becomes larger; and intelligibility information output means for outputting intelligibility information indicating the intelligibility calculated by the intelligibility calculation means.

A speech evaluation apparatus according to another preferred aspect of the present invention comprises: level calculation means for analyzing a speech signal representing speech for each of a plurality of predetermined frequency components and calculating a level for each frequency component; level specifying means for specifying, when the levels of the frequency components calculated by the level calculation means are arranged in descending order of level, the level at a predetermined first rank in the arranged sequence as a first level, and the level at a predetermined second rank, lower than the first rank, as a second level; intelligibility calculation means for calculating the intelligibility of the speech based on the first level and the second level specified by the level specifying means, such that the intelligibility becomes higher as the ratio of the first level to the second level becomes larger; and intelligibility information output means for outputting intelligibility information indicating the intelligibility calculated by the intelligibility calculation means.

In the aspects described above, the apparatus may include intelligibility notification means for notifying the intelligibility indicated by the intelligibility information, based on the intelligibility information output by the intelligibility information output means.
The apparatus may also include: second level calculation means for calculating the level of the speech signal; third level calculation means for calculating the level of a predetermined frequency band of the speech signal; voice quality evaluation means for evaluating the voice quality of the speech based on the level calculated by the second level calculation means and the level calculated by the third level calculation means, such that the evaluation becomes higher as the ratio of the level calculated by the third level calculation means to the level calculated by the second level calculation means becomes larger; and voice quality evaluation information output means for outputting voice quality evaluation information indicating the evaluation result of the voice quality evaluation means.
The apparatus may further include voice quality evaluation result notification means for notifying the voice quality evaluation result indicated by the voice quality evaluation information, based on the voice quality evaluation information output by the voice quality evaluation information output means.

In the aspects described above, the apparatus may also include voice quality evaluation correction means for correcting the evaluation result of the voice quality evaluation means according to the intelligibility calculated by the intelligibility calculation means, and the voice quality evaluation information output means may output evaluation result information indicating the evaluation result corrected by the voice quality evaluation correction means.

According to the present invention, a user can grasp the intelligibility of speech.

Next, the best mode for carrying out the present invention will be described.
<A: Configuration>
FIG. 1 is a block diagram showing the hardware configuration of a karaoke apparatus 1 according to an embodiment of the present invention. In the figure, the control unit 11 includes a CPU (Central Processing Unit), a ROM (Read Only Memory), and a RAM (Random Access Memory), and controls each unit of the karaoke apparatus 1 via the bus BUS by reading and executing a computer program stored in the ROM or the storage unit 12. The storage unit 12 is storage means for storing the programs executed by the control unit 11 and the data used during their execution, and is, for example, a hard disk device. The display unit 13 includes a liquid crystal panel or the like and, under the control of the control unit 11, displays various screens such as a menu screen for operating the karaoke apparatus 1 and a karaoke screen in which lyrics captions (telop) are superimposed on a background image. The operation unit 14 outputs a signal corresponding to the user's operation to the control unit 11. The microphone 15 picks up the voice uttered by the user and outputs an audio signal (analog signal) representing the picked-up voice. The audio processing unit 16 converts the audio signal (analog signal) output from the microphone 15 into digital data. The audio processing unit 16 also converts digital data into an analog signal and outputs it to the speaker 17. The speaker 17 is sound emitting means that emits sound with an intensity corresponding to the audio signal converted from digital data into an analog signal and output by the audio processing unit 16.

This embodiment describes the case where the microphone 15 and the speaker 17 are included in the karaoke apparatus 1. Alternatively, the audio processing unit 16 may be provided with an input terminal and an output terminal, with an external microphone connected to the input terminal via an audio cable and, likewise, an external speaker connected to the output terminal via an audio cable. This embodiment also describes the case where the audio signal input from the microphone 15 to the audio processing unit 16 and the audio signal output from the audio processing unit 16 to the speaker 17 are analog audio signals, but digital audio data may be input and output instead. In that case, the audio processing unit 16 does not need to perform A/D or D/A conversion.

As illustrated, the storage unit 12 has an accompaniment data storage area 121, a background image data storage area 122, a lyrics data storage area 123, and a model voice data storage area 124. The accompaniment data storage area 121 stores, for each piece of music, accompaniment data that constitutes the accompaniment sound of that piece, in a data format such as MIDI (Musical Instruments Digital Interface). The background image data storage area 122 stores background image data representing the background image displayed during karaoke accompaniment. The lyrics data storage area 123 stores lyrics data representing the lyrics of the music, which are displayed as lyrics captions during karaoke accompaniment. The model voice data storage area 124 stores audio data, for example in WAVE format, representing a voice that serves as a model for the music (hereinafter, "model voice").

<B: Operation>
Next, the flow of processing performed by the karaoke apparatus 1 will be described with reference to the flowchart shown in FIG. 2. First, the user operates the operation unit 14 of the karaoke apparatus 1 to select the music he or she wants to sing. The operation unit 14 outputs a signal corresponding to the operation to the control unit 11. The control unit 11 selects the music according to the operation signal output from the operation unit 14 (step S1).
The control unit 11 causes the display unit 13 to display the background image and lyrics captions of the selected music, and starts the karaoke accompaniment (step S2). That is, the control unit 11 reads the accompaniment data from the accompaniment data storage area 121 and supplies it to the audio processing unit 16, and the audio processing unit 16 converts the accompaniment data into an analog signal and supplies it to the speaker 17. The speaker 17 emits the accompaniment sound according to the supplied analog signal. The control unit 11 also reads the lyrics data from the lyrics data storage area 123 and the background image data from the background image data storage area 122, and causes the display unit 13 to display the lyrics captions and the background image.

The user (the person practicing) sings along with the accompaniment emitted from the speaker 17. The user's voice is picked up by the microphone 15, converted into an audio signal, and output to the audio processing unit 16 (step S3). The audio processing unit 16 converts the audio signal output from the microphone 15 into digital data (hereinafter simply "audio signal").

The control unit 11 analyzes the audio signal in frames of a predetermined length (for example, 3 msec) and detects the level and the spectrum of the voice for each frame (step S4). That is, for each frame the control unit 11 detects the level of the audio signal, separates the audio signal into a plurality of frequency components, and calculates the level of each frequency component (detects the spectrum). In this embodiment, the control unit 11 detects the spectrum of the voice using an FFT (Fast Fourier Transform).
FIG. 3 is a diagram showing a spectrum detection result. In FIG. 3, the horizontal axis indicates frequency and the vertical axis indicates level. FIG. 3 shows the spectra of two audio signals, audio signal S1 and audio signal S2.
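The frame-wise analysis of step S4 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, the frame level is assumed to be the root-mean-square of the samples (the text does not say how the level is measured), and a naive DFT stands in for the FFT so that the example needs only the standard library.

```python
import cmath
import math

def frame_level(frame):
    """Overall level of one frame, assumed here to be the RMS of its samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def frame_spectrum(frame):
    """Magnitude of each frequency bin 0..N/2, using a naive DFT as an FFT stand-in."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        mags.append(abs(acc) / n)
    return mags

def analyze(samples, frame_len):
    """Split samples into non-overlapping frames; return (level, spectrum) per frame."""
    results = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        results.append((frame_level(frame), frame_spectrum(frame)))
    return results
```

In practice an FFT routine would replace `frame_spectrum`, and the frame length would be derived from the sampling rate and the chosen frame duration.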

Next, in the detected spectrum, the control unit 11 identifies a position where the level change appears as a peak and a position where the level change appears as a valley, specifies the level at the identified peak position as the formant level (first level), and specifies the level at the identified valley position as the valley level (second level) (step S5).
The level specifying process of step S5 as performed by the control unit 11 in this embodiment is as follows. First, the control unit 11 arranges the levels of the frequency components calculated in step S4 in descending order of level. FIG. 4 is a diagram showing the spectrum detection results of FIG. 3 sorted in descending order of level. In FIG. 4, the horizontal axis indicates the number of elements and the vertical axis indicates level. In this embodiment, the level located one quarter of the way from the head of the sorted sequence is regarded as the peak level, and the level located three quarters of the way from the head is regarded as the valley level. That is, the control unit 11 specifies the level at the 1/4 position from the head of the sorted sequence as the formant level, and the level at the 3/4 position from the head as the valley level. In the example shown in FIG. 4, level L1 is specified as the formant level and level L2 as the valley level. Specifying the levels at the 1/4 and 3/4 positions relative to the total number of elements in this way yields values close to the formant level and the valley level.
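The sort-based selection described above can be sketched as follows. The function name and the index convention (zero-based index, fraction rounded down) are assumptions; the text only says "the 1/4 position" and "the 3/4 position" from the head of the sorted sequence.

```python
def formant_and_valley_levels(levels, peak_frac=0.25, valley_frac=0.75):
    """Pick approximate formant and valley levels from per-component levels.

    The levels are sorted in descending order, then the entries at the
    peak_frac and valley_frac positions are returned as (formant, valley).
    """
    if not levels:
        raise ValueError("levels must not be empty")
    ordered = sorted(levels, reverse=True)
    n = len(ordered)
    # Assumed convention: zero-based index, rounded down, clamped to the last entry.
    peak_idx = min(n - 1, int(n * peak_frac))
    valley_idx = min(n - 1, int(n * valley_frac))
    return ordered[peak_idx], ordered[valley_idx]
```

Because only a sort and two lookups are needed, this avoids explicit peak picking, which is presumably why the embodiment treats these two order statistics as stand-ins for the formant and valley levels.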

Next, the control unit 11 calculates the intelligibility of the voice based on the formant level and the valley level (step S6). In doing so, the control unit 11 calculates the intelligibility so that it becomes higher as the ratio of the formant level to the valley level becomes larger. In this embodiment, the control unit 11 calculates the intelligibility C (dB) from the formant level A and the valley level B using the following formula.
C = 20 * log(A / B)   (Formula 1)
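Formula 1 can be written directly in code. The sketch below assumes that "log" is the base-10 logarithm, which is the usual convention for a value expressed in dB but is not stated in the text, and that A and B are positive linear amplitudes; the function name is hypothetical.

```python
import math

def intelligibility_db(formant_level, valley_level):
    """Formula 1: C = 20 * log10(A / B), with A the formant level and B the valley level."""
    if formant_level <= 0 or valley_level <= 0:
        raise ValueError("levels must be positive")
    return 20.0 * math.log10(formant_level / valley_level)
```

For example, a formant level ten times the valley level gives C = 20 dB, and equal levels give C = 0 dB.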

Thus, the higher the ratio of the formant level to the valley level, the higher the intelligibility. Specifically, in the example shown in FIG. 4, the ratio of the formant level to the valley level is higher for audio signal S1 than for audio signal S2, so the intelligibility of audio signal S1 is higher than that of audio signal S2.

Next, the control unit 11 evaluates the voice quality. First, the control unit 11 calculates the level of the high frequency band (hereinafter, "high-band level") from the audio signal (step S7). In this embodiment, the control unit 11 calculates the level of the frequency band at or above 1 kHz as the high-band level. The control unit 11 then evaluates the voice quality according to the ratio between the full-band level calculated in step S4 and the high-band level calculated in step S7 (step S8). In doing so, the control unit 11 evaluates the voice quality so that the evaluation becomes higher as the ratio of the level calculated in step S7 to the level calculated in step S4 becomes larger. In this embodiment, the control unit 11 calculates the voice quality value D from the high-band level E calculated in step S7 and the full-band level F calculated in step S4 using the following formula.
D = 20 * log(E / F)   (Formula 2)
Thus, in this embodiment, the control unit 11 rates the voice quality more highly the more high-frequency content the voice contains.
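Steps S7 and S8 can be sketched as follows. Several details are assumptions made for illustration: both levels are computed from the magnitude spectrum as the square root of the summed squared magnitudes (the text takes the full-band level from step S4 and does not say how band levels are combined), "log" is taken as base 10, and the function and parameter names are hypothetical. Only the 1 kHz cutoff comes from the text.

```python
import math

def band_level(mags, sample_rate, n_fft, f_lo=0.0):
    """Level of all bins at or above f_lo, as sqrt of summed squared magnitudes."""
    bin_hz = sample_rate / n_fft
    energy = sum(m * m for k, m in enumerate(mags) if k * bin_hz >= f_lo)
    return math.sqrt(energy)

def voice_quality_db(mags, sample_rate, n_fft, cutoff_hz=1000.0):
    """Formula 2: D = 20 * log10(E / F), E = high-band level, F = full-band level."""
    high = band_level(mags, sample_rate, n_fft, cutoff_hz)
    full = band_level(mags, sample_rate, n_fft)
    if high <= 0 or full <= 0:
        raise ValueError("band levels must be positive")
    return 20.0 * math.log10(high / full)
```

With these definitions D is never above 0 dB, since the high band is a subset of the full band; a value closer to 0 means a larger share of high-frequency content.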

FIG. 5 is a diagram for explaining an example of the voice quality evaluation of step S8. FIG. 5 shows spectra of audio signals; the horizontal axis indicates frequency and the vertical axis indicates level. FIG. 5 shows the spectra of two different audio signals, audio signal S3 and audio signal S4. In FIG. 5, the ratio of the high-band level to the overall level is larger for audio signal S3 than for audio signal S4, so when the voice quality values of the two signals are calculated, the voice quality value of audio signal S3 is higher than that of audio signal S4.

Next, the control unit 11 outputs to the display unit 13 the intelligibility information indicating the intelligibility calculated in step S6 and the voice quality evaluation information indicating the voice quality evaluation result calculated in step S8. Based on the intelligibility information supplied from the control unit 11, the display unit 13 notifies the user by displaying the intelligibility indicated by that information (step S9).

FIG. 6 is a diagram showing an example of a screen displayed on the display unit 13. As illustrated, the display unit 13 displays the lyrics caption A1, a bar graph A2 indicating the degree of intelligibility, and a bar graph A3 indicating the degree of voice quality. The length of bar A21 of bar graph A2 represents how high or low the intelligibility is, and the length of bar A31 of bar graph A3 represents the voice quality evaluation result. The control unit 11 causes the display unit 13 to display bar A21 with a length based on the calculated intelligibility and bar A31 with a length based on the calculated voice quality evaluation result.
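The mapping from a computed value to a bar length is not specified in the text. One simple possibility, shown here purely as an assumption, is a linear mapping clamped to a display range; the function names, the 0 to 40 dB range, and the text rendering are all hypothetical.

```python
def bar_length(value_db, lo_db, hi_db, max_len):
    """Map value_db linearly from [lo_db, hi_db] to [0, max_len], clamping outside."""
    if hi_db <= lo_db:
        raise ValueError("hi_db must exceed lo_db")
    frac = (value_db - lo_db) / (hi_db - lo_db)
    frac = max(0.0, min(1.0, frac))
    return round(frac * max_len)

def text_bar(label, value_db, lo_db=0.0, hi_db=40.0, width=20):
    """Render a bar such as A21 or A31 as text, for illustration only."""
    filled = bar_length(value_db, lo_db, hi_db, width)
    return f"{label} [{'#' * filled}{'.' * (width - filled)}]"
```

Redrawing the bars once per analyzed frame, or once per several frames with smoothing, would give the time-varying display that the repeated processing produces.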

By viewing bar graph A2 displayed on the display unit 13, the user can grasp the degree of intelligibility of his or her own voice. Likewise, by viewing bar graph A3 displayed on the display unit 13, the user can grasp the voice quality evaluation result.

The control unit 11 determines whether or not to end the processing (step S10). This determination may be made, for example, by determining whether playback of the accompaniment has finished or whether the user has performed an operation to interrupt playback of the accompaniment. If the control unit 11 determines that the processing is to end (step S10; YES), it ends the processing. If it determines that the processing is to continue (step S10; NO), it returns to step S3 and repeats the processing of picking up the voice and notifying the intelligibility and voice quality based on the picked-up voice (steps S3 to S9). In this embodiment the control unit 11 calculates the voice quality value after calculating the intelligibility, but the order is not limited to this; the voice quality value may be calculated first, and the processing order is arbitrary.

As the processing of steps S3 to S9 is repeated, the displayed lengths of bar A21 and bar A31 change over time according to the intelligibility and voice quality of the user's voice. That is, in this embodiment, the intelligibility and voice quality of the singing voice are reported to the user in real time while he or she is singing.

Thus, in this embodiment, when the user sings, the karaoke apparatus 1 notifies the user of the intelligibility and voice quality of the singing voice. This allows the user to grasp the intelligibility and voice quality of his or her own singing voice.
In this embodiment, the karaoke apparatus 1 also reports the voice quality evaluation result for the user's singing voice, so the user can grasp what his or her voice quality is like. Furthermore, because the screen shows immediately, while the user is singing, whether he or she sang with a good voice (voice quality, intelligibility, and so on), the user can make a conscious effort to sing with a better voice.

<C: Modifications>
An embodiment of the present invention has been described above, but the present invention is not limited to that embodiment and can be implemented in various other forms. Some examples follow.
(1) In the embodiment described above, the audio signal is analyzed for each frequency component by applying an FFT to the audio signal and detecting the spectrum. The method of analyzing the audio signal is not limited to this; for example, the analysis may use the outputs of a plurality of band-pass filters. In that case, for example, the level of a predetermined frequency band may be calculated from the outputs of the band-pass filters, and the voice quality may be evaluated according to the ratio between the calculated level of that band and the overall level.

(2) In the embodiment described above, the control unit 11 detects the spectrum and identifies the positions where the level change appears as a peak and as a valley when the components are arranged in ascending order of frequency, as shown in FIG. 3. The arrangement is not limited to ascending order and may be descending order; any arrangement will do as long as the positions where the level change appears as a peak and as a valley can be identified.
Also, in the embodiment described above, the control unit 11 arranges the levels of the frequency components in descending order of level, as shown in FIG. 4, but they may instead be arranged in ascending order.

(3) In the embodiment described above, the frequency components are sorted in descending order of level, and the formant level and the valley level are specified from the sorted result. The method of specifying the formant level and the valley level is not limited to this; any method may be used as long as it can specify the formant level and the valley level.
As one example, a calculation method that uses formants is as follows. First, the control unit 11 applies an FFT to the audio signal to calculate the spectrum. Next, from the frequency spectrum obtained by that analysis, the control unit 11 extracts the formant levels corresponding to the first, second, and third formants, respectively. A formant is a dominant frequency component in the spectrum of speech; formants are called the first formant, second formant, third formant, fourth formant, and so on, in order of increasing frequency. After extracting the formant levels, the control unit 11 specifies the level at the valley position between the first formant and the second formant as the valley level.

In the example above, the formant level of the first formant and the level at the valley position between the first formant and the second formant are specified. The method is not limited to this; it suffices to specify the formant level of the n-th formant (n is a natural number) and the level at the valley position between the n-th formant and the (n+1)-th formant, and the value of n is not limited to 1. For example, the formant level of the second formant and the level at the valley position between the second formant and the third formant may be specified.
Thus, the peak to be identified may be the first peak to appear (the first formant), the second peak to appear (the second formant), or in general the n-th peak to appear (the n-th formant, where n is a natural number). Likewise, the valley to be identified may be the n-th valley to appear (n is a natural number).
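The peak-and-valley variant can be sketched as follows. This sketch simply treats local maxima of a spectral envelope as formants, which is a simplification introduced here: the text does not say how formants are extracted, and raw FFT magnitudes would normally need smoothing (for example, a spectral envelope estimate) before local maxima correspond to formants. The function names are hypothetical.

```python
def local_peaks(envelope):
    """Indices of local maxima, in ascending frequency order (plateaus count once)."""
    return [i for i in range(1, len(envelope) - 1)
            if envelope[i - 1] < envelope[i] >= envelope[i + 1]]

def formant_and_valley(envelope, n=1):
    """Return (level of the n-th peak, lowest level between the n-th and (n+1)-th peaks)."""
    peaks = local_peaks(envelope)
    if n < 1 or len(peaks) < n + 1:
        raise ValueError("need at least n + 1 peaks")
    left, right = peaks[n - 1], peaks[n]
    valley = min(envelope[left:right + 1])
    return envelope[left], valley
```

The returned pair can be passed to the same ratio-based intelligibility calculation as the sort-based levels, with n = 1 corresponding to the first formant and the valley between the first and second formants.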

(4) In the embodiment described above, the intelligibility is calculated using (Formula 1), but the method of calculating the intelligibility is not limited to this. For example, the spectrum may be calculated, a position where the level change appears as a peak and a position where it appears as a valley may be identified in the calculated spectrum, the difference between the peak level and the valley level may be calculated, and the intelligibility may be calculated so that it becomes higher as that difference becomes larger. In short, any calculation method may be used as long as the intelligibility becomes higher as the ratio of the peak level to the valley level becomes larger.

In the above-described embodiment, the voice quality value is calculated using (Equation 2), but the method of calculating the voice quality value is not limited to this. In short, any voice quality evaluation method may be used as long as the evaluation becomes higher as the ratio of the level of the high frequency band to the level of the entire frequency band becomes larger.

(5) In the above-described embodiment, the degree of clarity is reported by displaying a bar graph indicating that degree. The manner of reporting is not limited to this. For example, as shown in FIG. 7, the clarity may be reported by changing the background image displayed on the display unit 13 according to the degree of clarity. In this case, for example, the number of images displayed may be increased as the clarity becomes higher, as shown in FIG. 7(b), and decreased as the clarity becomes lower, as shown in FIG. 7(a).
The clarity may also be reported, for example, by changing the background color according to the clarity. In this case, for example, the background color may be made brighter as the clarity becomes higher, as shown in FIG. 8(b), and darker as the clarity becomes lower, as shown in FIG. 8(a). Further, for example, as shown in FIG. 9, a meter A5 indicating the degree of clarity may be displayed on the display unit 13. In FIG. 9, the clarity is reported by the position of the needle A51 of the meter A5. The manner of reporting the clarity is not limited to display; for example, the clarity may be reported by emitting a voice message that expresses it. In short, any manner may be used as long as the clarity calculated by the control unit 11 is reported.

The manner of reporting the voice quality evaluation is the same as that described above for the clarity. It is not limited to displaying a bar graph: the voice quality evaluation result may be reported by changing the background color, by emitting a voice message, or in any other manner, as long as the result of the voice quality evaluation by the control unit 11 is reported.

Further, in the above-described embodiment, the control unit 11 outputs clarity information indicating the clarity and voice quality evaluation information indicating the voice quality evaluation result to the display unit 13. The output destination of this information is not limited to the display unit 13. For example, the information may be output by transmitting it over a communication network, or it may be output to storage means such as a hard disk and stored there. In short, it suffices that the control unit 11 outputs the clarity information and the voice quality evaluation information.

(6) In the above-described embodiment, the clarity and the voice quality are calculated for the user's singing voice picked up by the microphone 15. In addition, the clarity and the voice quality may also be calculated for the model voice data stored in the model voice data storage area 124 and reported to the user by displaying them on the display unit 13. Examples of the screen displayed in this case are shown in FIGS. 10 and 11. In FIG. 10, the length of the bar A21 of the bar graph A2 represents the user's clarity, while the length of the bar A41 of the bar graph A4 represents the clarity of the model voice. In FIG. 11, the position of the needle A61 of the meter A6 represents the user's clarity, while the position of the needle A62 represents the clarity of the model voice. In this case, the user's evaluation result and the model's evaluation result are reported together, so the user can compare his or her own singing with the model singing.

(7) In the above-described embodiment, the user's voice is evaluated, but the voice to be evaluated may be the model voice alone. In this case, only the evaluation result of the model voice may be displayed. Further, for example, a plurality of users may sing at the same time, and each user's voice may be evaluated and the evaluation results displayed side by side. Thus, a single voice may be evaluated and reported, or a plurality of voices may be evaluated in parallel. The voice to be evaluated may be voice picked up by the microphone 15 or voice data stored in advance in the storage unit 12. Further, for example, a communication unit may be provided in the karaoke apparatus 1 and voice data received via the communication unit may be evaluated. The object of evaluation may be anything that represents a voice.

(8) In the above-described embodiment, the frequency components are sorted in descending order of level, and the level of the frequency component at the 1/4 position from the head of the sorted sequence and the level of the frequency component at the 3/4 position are specified. The positions to be specified are not limited to these; for example, the level of the frequency component at the 1/5 position and the level of the frequency component at the 4/5 position may be specified. In short, any method may be used as long as it yields a value close to the difference between the peaks and valleys of the level: it suffices to calculate the difference between the level of the frequency component at a predetermined first position and the level of the frequency component at a predetermined second position.
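The rank-based selection described in (8) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of the embodiment: the levels are assumed to be in a logarithmic unit such as dB (so that a difference of levels corresponds to a ratio), the positions are given as fractions of the sorted sequence, and the function name `rank_level_difference` is hypothetical.

```python
from typing import Sequence


def rank_level_difference(
    levels: Sequence[float],
    first_pos: float = 0.25,
    second_pos: float = 0.75,
) -> float:
    """Difference between the levels at two fixed positions of the sorted list.

    levels     -- one level per frequency component (assumed to be in dB)
    first_pos  -- fractional position of the first (higher) level, e.g. 1/4
    second_pos -- fractional position of the second (lower) level, e.g. 3/4
    """
    if not levels:
        raise ValueError("levels must not be empty")
    if not 0.0 <= first_pos < second_pos <= 1.0:
        raise ValueError("require 0 <= first_pos < second_pos <= 1")

    # Sort in descending order so that loud components come first.
    ordered = sorted(levels, reverse=True)
    last = len(ordered) - 1

    first_level = ordered[int(first_pos * last)]
    second_level = ordered[int(second_pos * last)]
    return first_level - second_level
```

Passing `first_pos=0.2, second_pos=0.8` corresponds to the 1/5 and 4/5 positions mentioned as an alternative. Because the two positions are fixed in advance, no peak or valley detection is needed, which is presumably why this serves as an approximation of the peak-to-valley difference.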

(9) In the above-described embodiment, the bar graph indicating the clarity and the bar graph indicating the voice quality evaluation result are displayed together. However, the present invention is not limited to this: only the clarity may be reported, or only the voice quality evaluation result may be reported.

(10) In the above-described embodiment, the degree of clarity may be taken into account in the voice quality evaluation. In this case, the control unit 11 may correct the voice quality evaluation result according to the calculated clarity and output evaluation result information indicating the corrected result to the display unit 13. Specifically, for example, the control unit 11 may correct the voice quality evaluation result so that it becomes lower as the clarity becomes lower. Alternatively, for example, when the calculated clarity is equal to or less than a predetermined threshold, the voice quality evaluation result may be set to a predetermined low value (or the voice quality may not be evaluated).
For example, suppose that the user sings frivolously, such as by drawing out consonants. With a conventional apparatus, the voice quality could be rated highly even when the user sings frivolously in this way. In this embodiment, reflecting the clarity in the evaluation allows frivolous singing to be rated low and prevents it from being rated highly.
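One possible form of the correction in (10) is sketched below. The text only requires that the result decrease with the clarity and that a low fixed value may be used at or below a threshold. The multiplicative scaling, the normalization of the clarity to the range 0 to 1, the default threshold and floor values, and the function name `corrected_voice_quality` are all assumptions made for illustration.

```python
def corrected_voice_quality(
    quality: float,
    clarity: float,
    threshold: float = 0.2,
    floor: float = 0.0,
) -> float:
    """Correct a voice quality score according to the clarity.

    quality   -- uncorrected voice quality evaluation result
    clarity   -- clarity, assumed here to be normalized to 0.0 .. 1.0
    threshold -- at or below this clarity the result is forced to `floor`
    floor     -- predetermined low value used for very unclear voices
    """
    if clarity <= threshold:
        # Very unclear voice: use the predetermined low value.
        return floor
    # Otherwise scale the score so that lower clarity gives a lower result.
    return quality * clarity
```

With these defaults, a voice quality score of 80 becomes 40 at a clarity of 0.5 and is replaced by the floor value at a clarity of 0.1.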

(11) In the above-described embodiment, the voice quality evaluation result may be converted into calories consumed and reported. In this case, the control unit 11 may calculate the calories consumed H from the voice quality value G using the following equation, where k is a coefficient.
H = k × G … (Equation 3)

(12) In the above-described embodiment, the voice quality is evaluated based on the ratio between the level of the voice signal and the level of the frequency band of 1 kHz or higher in that signal. The manner of evaluating the voice quality is not limited to this. For example, the voice quality may be evaluated according to the ratio between the level of the voice signal and the level of the frequency band of 2 kHz or higher, or of 4 kHz or higher, in that signal. In short, any manner may be used as long as the voice quality is evaluated according to the ratio between the overall level of the voice signal and the level of a predetermined frequency band in that signal.
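The band-ratio evaluation in (12) can be illustrated with the sketch below. It does not reproduce the equation used in the embodiment. It assumes the spectrum is supplied as (frequency in Hz, power) pairs and takes the "level" of a band to be the sum of the powers in it; the function name `high_band_ratio` and the `cutoff_hz` parameter are hypothetical.

```python
from typing import Iterable, Tuple


def high_band_ratio(
    spectrum: Iterable[Tuple[float, float]],
    cutoff_hz: float = 1000.0,
) -> float:
    """Ratio of the level at or above cutoff_hz to the level of the whole signal.

    spectrum  -- (frequency_hz, power) pairs, one per frequency component
    cutoff_hz -- lower edge of the predetermined band (1 kHz, 2 kHz, 4 kHz, ...)
    """
    bins = list(spectrum)
    total = sum(power for _, power in bins)
    if total <= 0.0:
        # Silent input: no meaningful ratio, so report the lowest value.
        return 0.0
    high = sum(power for freq, power in bins if freq >= cutoff_hz)
    return high / total
```

Changing `cutoff_hz` to 2000.0 or 4000.0 gives the alternative bands mentioned above; a larger returned ratio would map to a higher voice quality evaluation.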

(13) In the above-described embodiment, the karaoke apparatus 1 is used as the voice evaluation apparatus according to the present invention. However, the voice evaluation apparatus is not limited to a karaoke apparatus; various apparatuses, such as a server apparatus, a personal computer, or a mobile communication terminal, can serve as the voice evaluation apparatus according to the present invention.

(14) The program executed by the control unit 11 of the karaoke apparatus 1 described above can be provided recorded on a recording medium such as a magnetic tape, a magnetic disk, a flexible disk, an optical recording medium, a magneto-optical recording medium, a RAM, or a ROM. It can also be downloaded to the karaoke apparatus 1 via a network such as the Internet.

FIG. 1 is a block diagram showing an example of the configuration of the karaoke apparatus 1.
FIG. 2 is a flowchart showing the flow of processing of the karaoke apparatus 1.
FIG. 3 is a diagram showing an example of a voice spectrum.
FIG. 4 is a diagram showing the result of sorting the levels of the frequency components of a voice.
FIG. 5 is a diagram showing an example of a voice spectrum.
FIG. 6 is a diagram showing an example of a screen displayed on the display unit 13.
FIG. 7 is a diagram showing an example of a screen displayed on the display unit 13.
FIG. 8 is a diagram showing an example of a screen displayed on the display unit 13.
FIG. 9 is a diagram showing an example of a screen displayed on the display unit 13.
FIG. 10 is a diagram showing an example of a screen displayed on the display unit 13.
FIG. 11 is a diagram showing an example of a screen displayed on the display unit 13.

Explanation of symbols

1: karaoke apparatus, 11: control unit, 12: storage unit, 13: display unit, 14: operation unit, 15: microphone, 16: audio processing unit, 17: speaker, 121: accompaniment data storage area, 122: background image data storage area, 123: lyrics data storage area, 124: model voice data storage area.

Claims

A voice evaluation apparatus comprising:
level calculation means for analyzing a voice signal representing a voice for each of a plurality of predetermined frequency components and calculating a level for each frequency component;
level specifying means for, when the levels for the respective frequency components calculated by the level calculation means are arranged in descending order, specifying the level at a predetermined first rank in the arranged sequence of levels as a first level, and specifying the level at a predetermined second rank, lower than the first rank, in the arranged sequence as a second level;
clarity calculation means for calculating the clarity of the voice based on the first level and the second level specified by the level specifying means, such that the clarity increases as the ratio of the first level to the second level increases; and
clarity information output means for outputting clarity information indicating the clarity calculated by the clarity calculation means.

A voice evaluation apparatus comprising:
level calculation means for analyzing a voice signal representing a voice for each of a plurality of predetermined frequency components and calculating a frequency spectrum;
level specifying means for specifying, from the frequency spectrum calculated by the level calculation means, the level of a predetermined nth formant as a first level, and specifying the level at the valley position between the nth formant and the (n + 1)th formant as a second level;
clarity calculation means for calculating the clarity of the voice based on the first level and the second level specified by the level specifying means, such that the clarity increases as the ratio of the first level to the second level increases; and
clarity information output means for outputting clarity information indicating the clarity calculated by the clarity calculation means.

The voice evaluation apparatus according to claim 1, further comprising clarity notifying means for reporting the clarity indicated by the clarity information, based on the clarity information output by the clarity information output means.

The voice evaluation apparatus according to any one of claims 1 to 3, further comprising:
second level calculation means for calculating the level of the voice signal;
third level calculation means for calculating the level of a predetermined frequency band in the voice signal;
voice quality evaluation means for evaluating the voice quality of the voice based on the level calculated by the second level calculation means and the level calculated by the third level calculation means, such that the evaluation becomes higher as the ratio of the level calculated by the third level calculation means to the level calculated by the second level calculation means increases; and
voice quality evaluation information output means for outputting voice quality evaluation information indicating an evaluation result of the voice quality evaluation means.

The voice evaluation apparatus according to claim 4, further comprising voice quality evaluation result notifying means for notifying a voice quality evaluation result indicated by the voice quality evaluation information based on the voice quality evaluation information output by the voice quality evaluation information output means.

The voice evaluation apparatus according to claim 5, further comprising voice quality evaluation correction means for correcting the evaluation result of the voice quality evaluation means according to the clarity calculated by the clarity calculation means,
wherein the voice quality evaluation information output means outputs evaluation result information indicating the evaluation result corrected by the voice quality evaluation correction means.

A voice evaluation method for a voice evaluation apparatus provided with control means, the method comprising:
a first step in which the control means analyzes a voice signal representing a voice for each of a plurality of predetermined frequency components and calculates a level for each frequency component;
a second step in which the control means, when the calculated levels for the respective frequency components are arranged in descending order, specifies the level at a predetermined first rank in the arranged sequence of levels as a first level, and specifies the level at a predetermined second rank, lower than the first rank, in the arranged sequence as a second level;
a third step in which the control means calculates the clarity of the voice based on the first level and the second level, such that the clarity increases as the ratio of the first level to the second level increases; and
a fourth step in which the control means outputs clarity information indicating the calculated clarity.

A voice evaluation method for a voice evaluation apparatus provided with control means, the method comprising:
a first step in which the control means analyzes a voice signal representing a voice for each of a plurality of predetermined frequency components and calculates a frequency spectrum;
a second step in which the control means specifies, from the calculated frequency spectrum, the level of a predetermined nth formant as a first level, and specifies the level at the valley position between the nth formant and the (n + 1)th formant as a second level;
a third step in which the control means calculates the clarity of the voice based on the first level and the second level, such that the clarity increases as the ratio of the first level to the second level increases; and
a fourth step in which the control means outputs clarity information indicating the calculated clarity.