JP2016156943A

JP2016156943A - Display controller, display control method and program

Info

Publication number: JP2016156943A
Application number: JP2015034384A
Authority: JP
Inventors: 誠司黒川; Seiji Kurokawa
Original assignee: Brother Industries Ltd
Current assignee: Brother Industries Ltd
Priority date: 2015-02-24
Filing date: 2015-02-24
Publication date: 2016-09-01
Anticipated expiration: 2035-02-24
Also published as: JP6256379B2

Abstract

PROBLEM TO BE SOLVED: To provide a display controller, a display control method and a program that enable a speaker to grasp, in an easier-to-understand manner, the difference in manner of utterance between a model voice and the speaker's voice based upon a pitch, a sound pressure and a pronunciation feature quantity.

SOLUTION: A voice information display device is configured: to determine a position, a width and a color of a model voice line, which represents the change of a model voice for reading a character string aloud, based upon a pitch, a sound pressure and a pronunciation feature quantity specified from model voice waveform data; to determine a position, a width and a color of a speaker voice line, which represents the change of the voice that a speaker utters in reading the character string aloud, based upon a pitch, a sound pressure and a pronunciation feature quantity specified from speaker voice waveform data; to display the model voice line and the speaker voice line on a screen so that they can be compared with each other; and to display the character string on the screen along a time-axis direction.

SELECTED DRAWING: Figure 3

Description

The present invention relates to the technical field of systems and the like capable of visually representing the voice uttered when a speaker reads a character string aloud.

In recent years, techniques have become known that visually represent the voice uttered when a speaker reads a character string aloud, for the purpose of supporting language learning, vocalization and speech training, and the like. For example, Patent Document 1 discloses a system that displays figures corresponding to a reference voice and representing the timing of utterance, the utterance length, the pitch, and geminate consonants, and that indicates the uttered portion by changing the color of the figure. Patent Document 2, on the other hand, discloses a system in which data on a model way of singing can be set arbitrarily and the difference between that way of singing and a practitioner's way of singing can be grasped visually. In this system, the transition of the pitch of the model singing is displayed as a polyline according to pitch data, and the color of the area between the polyline and the horizontal axis changes according to the value of the volume data of the model singing.

Patent Document 1: JP 2003-186379 A
Patent Document 2: JP 2006-276693 A

However, although techniques have existed for displaying lines and graphs using pitch, volume, and the like as parameters, no system has been known that lets a speaker grasp, at a glance and in an easier-to-understand manner, the difference in manner of utterance between a model voice and the speaker's voice.

The present invention has been made in view of the above, and provides a display control device, a display control method, and a program that allow a speaker to grasp, at a glance and in an easier-to-understand manner, the difference in manner of utterance between a model voice and the speaker's voice based on a pitch, a sound pressure, and a pronunciation feature quantity.

To solve the above problem, the invention according to claim 1 is characterized by comprising: storage means for storing text data of a character string composed of a plurality of characters, and a pitch, a sound pressure, and a pronunciation feature quantity specified for each predetermined unit based on first voice waveform data indicating the waveform of a voice serving as a model for reading the character string aloud; input means for inputting second voice waveform data indicating the waveform of a voice uttered when a speaker reads the character string aloud; specifying means for specifying a pitch, a sound pressure, and a pronunciation feature quantity for each predetermined unit based on the first voice waveform data; first control means for displaying on a screen, based on the pitch, the sound pressure, and the pronunciation feature quantity stored in the storage means, a first line that extends in a time axis direction and changes in time series in an axis direction orthogonal to the time axis direction; second control means for displaying, based on the text data, each character constituting the character string on the screen along the time axis direction; and third control means for displaying on the screen, based on the pitch, the sound pressure, and the pronunciation feature quantity specified by the specifying means, a second line that extends in the time axis direction and changes in time series in the axis direction orthogonal to the time axis direction, such that the second line can be compared with the first line, wherein the first control means and the third control means each comprise: a first determination unit that determines, based on the pitch, the position of the line in the axis direction orthogonal to the time axis direction for each predetermined unit; a second determination unit that determines, based on the sound pressure, the width of the line for each predetermined unit; and a third determination unit that determines, based on the pronunciation feature quantity, the color of the line for each predetermined unit.

The invention according to claim 2 is the display control device according to claim 1, characterized in that the second control means comprises: a fourth determination unit that determines, based on the pitch, the position of the character in the axis direction orthogonal to the time axis direction for each predetermined unit; and a fifth determination unit that determines, based on the sound pressure, the size of the character for each predetermined unit.

The invention according to claim 3 is the display control device according to claim 1 or 2, characterized in that the pronunciation feature quantity is a feature quantity distinguishable by the spectrum of the voice waveform data, and the third determination unit determines the color of the line according to the type of the pronunciation feature quantity.

The invention according to claim 4 is the display control device according to claim 3, characterized in that: the predetermined unit is a character unit; the storage means stores a vowel as the pronunciation feature quantity specified for each character unit based on the first voice waveform data; the specifying means specifies a vowel as the pronunciation feature quantity for each character unit based on the second voice waveform data; and the third determination unit determines, as the color of the line, a color corresponding to the type of the specified vowel.

The invention according to claim 5 is the display control device according to claim 4, characterized in that: the storage means stores a vowel and a sound component other than the vowel as the pronunciation feature quantity specified for each character unit based on the first voice waveform data; the specifying means specifies a vowel and a sound component other than the vowel as the pronunciation feature quantity for each character unit based on the second voice waveform data; and the third determination unit determines, as the color of the line, a mixture of a color corresponding to the type of the specified vowel and a color corresponding to the sound component other than the vowel.

The invention according to claim 6 is the display control device according to claim 3, characterized in that: the predetermined unit is a predetermined time unit; the storage means stores at least a first formant frequency and a second formant frequency as the pronunciation feature quantity specified for each predetermined time unit based on the first voice waveform data; the specifying means specifies at least a first formant frequency and a second formant frequency as the pronunciation feature quantity for each predetermined time unit based on the second voice waveform data; and the third determination unit determines, as the color of the line, a color corresponding to the combination of the specified first formant frequency value and the specified second formant frequency value.

The invention according to claim 7 is the display control device according to claim 6, characterized in that a predetermined reference color is set for the combination of a predetermined first formant frequency value and a predetermined second formant frequency value, and the color corresponding to the combination of the specified first formant frequency value and the specified second formant frequency value departs further from the reference color as the sum of squares of (i) the difference between the predetermined first formant frequency value and the specified first formant frequency value and (ii) the difference between the predetermined second formant frequency value and the specified second formant frequency value becomes larger.
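The relationship recited in claim 7 can be illustrated with a minimal sketch. It assumes a linear blend from the reference color toward a second color; the function name `color_from_formants`, the `far_color` and `max_sq_dist` parameters, and the RGB representation are hypothetical and are not specified in this document. Only the monotonic dependence on the sum of squares comes from the claim.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def color_from_formants(f1: float, f2: float,
                        ref_f1: float, ref_f2: float,
                        ref_color: RGB, far_color: RGB,
                        max_sq_dist: float) -> RGB:
    """Blend away from ref_color as (f1, f2) departs from (ref_f1, ref_f2).

    The blend weight grows with the sum of squared formant differences
    and saturates at 1.0 once that sum reaches max_sq_dist (hypothetical).
    """
    sq_dist = (ref_f1 - f1) ** 2 + (ref_f2 - f2) ** 2
    weight = min(1.0, sq_dist / max_sq_dist)
    # Linear interpolation per channel: weight 0 gives ref_color,
    # weight 1 gives far_color.
    r, g, b = (round(rc + (fc - rc) * weight)
               for rc, fc in zip(ref_color, far_color))
    return (r, g, b)
```

With a reference color of red and a far color of blue, identical formant values return pure red, and the result shifts toward blue as either formant difference grows.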

The invention according to claim 8 is the display control device according to claim 6, characterized in that: the storage means stores at least a first formant frequency, a second formant frequency, and a noise component as the pronunciation feature quantity specified for each predetermined time unit based on the first voice waveform data; the specifying means specifies at least a first formant frequency, a second formant frequency, and a noise component as the pronunciation feature quantity for each predetermined time unit based on the second voice waveform data; and the third determination unit determines, as the color of the line, a mixture of a color corresponding to the combination of the specified first formant frequency value and the specified second formant frequency value and a color corresponding to the specified noise component value.

The invention according to claim 9 is the display control device according to claim 3, characterized in that: the predetermined unit is a predetermined time unit; the storage means stores at least a first formant frequency and a second formant frequency as the pronunciation feature quantity specified for each predetermined time unit based on the first voice waveform data; the specifying means specifies at least a first formant frequency and a second formant frequency as the pronunciation feature quantity for each predetermined time unit based on the second voice waveform data; and the third determination unit determines, as the color of the line, a mixture of a color corresponding to the specified first formant frequency value and a color corresponding to the specified second formant frequency value.

The invention according to claim 10 is the display control device according to claim 9, characterized in that: the storage means stores at least a first formant frequency, a second formant frequency, and a noise component as the pronunciation feature quantity specified for each predetermined time unit based on the first voice waveform data; the specifying means specifies at least a first formant frequency, a second formant frequency, and a noise component as the pronunciation feature quantity for each predetermined time unit based on the second voice waveform data; and the third determination unit determines, as the color of the line, a mixture of (i) the mixed color of the color corresponding to the specified first formant frequency value and the color corresponding to the specified second formant frequency value and (ii) a color corresponding to the noise component value.

The invention according to claim 11 is a display control method executed by one or more computers, characterized by comprising: a storage step of storing, in storage means, text data of a character string composed of a plurality of characters, and a pitch, a sound pressure, and a pronunciation feature quantity specified for each predetermined unit based on first voice waveform data indicating the waveform of a voice serving as a model for reading the character string aloud; an input step of inputting second voice waveform data indicating the waveform of a voice uttered when a speaker reads the character string aloud; a specifying step of specifying a pitch, a sound pressure, and a pronunciation feature quantity for each predetermined unit based on the second voice waveform data; a first control step of displaying on a screen, based on the pitch, the sound pressure, and the pronunciation feature quantity stored in the storage step, a first line that extends in a time axis direction and changes in time series in an axis direction orthogonal to the time axis direction; a second control step of displaying, based on the text data, each character constituting the character string on the screen along the time axis direction; and a third control step of displaying on the screen, based on the pitch, the sound pressure, and the pronunciation feature quantity specified in the specifying step, a second line that extends in the time axis direction and changes in time series in the axis direction orthogonal to the time axis direction, such that the second line can be compared with the first line, wherein the first control step and the third control step each include: a step of determining, based on the pitch, the position of the line in the axis direction orthogonal to the time axis direction for each predetermined unit; a step of determining, based on the sound pressure, the width of the line for each predetermined unit; and a step of determining, based on the pronunciation feature quantity, the color of the line for each predetermined unit.

The invention according to claim 12 is characterized by causing a computer to execute: a storage step of storing, in storage means, text data of a character string composed of a plurality of characters, and a pitch, a sound pressure, and a pronunciation feature quantity specified for each predetermined unit based on first voice waveform data indicating the waveform of a voice serving as a model for reading the character string aloud; an input step of inputting second voice waveform data indicating the waveform of a voice uttered when a speaker reads the character string aloud; a specifying step of specifying a pitch, a sound pressure, and a pronunciation feature quantity for each predetermined unit based on the second voice waveform data; a first control step of displaying on a screen, based on the pitch, the sound pressure, and the pronunciation feature quantity stored in the storage step, a first line that extends in a time axis direction and changes in time series in an axis direction orthogonal to the time axis direction; a second control step of displaying, based on the text data, each character constituting the character string on the screen along the time axis direction; and a third control step of displaying on the screen, based on the pitch, the sound pressure, and the pronunciation feature quantity specified in the specifying step, a second line that extends in the time axis direction and changes in time series in the axis direction orthogonal to the time axis direction, such that the second line can be compared with the first line, wherein the first control step and the third control step each include: a step of determining, based on the pitch, the position of the line in the axis direction orthogonal to the time axis direction for each predetermined unit; a step of determining, based on the sound pressure, the width of the line for each predetermined unit; and a step of determining, based on the pronunciation feature quantity, the color of the line for each predetermined unit.

According to the inventions of claims 1, 3, 11, and 12, the position, width, and color of the first line, which represents the change in the model voice for reading a character string aloud, are determined based on the pitch, sound pressure, and pronunciation feature quantity specified from the first voice waveform data; the position, width, and color of the second line, which represents the change in the voice uttered when the speaker reads the character string aloud, are determined based on the pitch, sound pressure, and pronunciation feature quantity specified from the second voice waveform data; the first line and the second line are displayed on the screen so that they can be compared; and the character string is displayed on the screen along the time axis direction. The speaker can therefore grasp, at a glance and in an easier-to-understand manner, the difference in manner of utterance between the model voice and the speaker's own voice.
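As an illustration of how one display unit of such a line might be derived, the following minimal sketch maps pitch to vertical position, sound pressure to line width, and a vowel to a color. The value ranges, the linear scaling, the color table, and the names `LineSegment` and `segment_for_unit` are all assumptions made for this example; the document does not prescribe them here.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical color table: one color per vowel, gray for anything else.
VOWEL_COLORS = {
    "a": (255, 0, 0),
    "i": (0, 160, 0),
    "u": (0, 0, 255),
    "e": (255, 160, 0),
    "o": (128, 0, 128),
}
DEFAULT_COLOR = (128, 128, 128)


@dataclass
class LineSegment:
    y: float                      # position on the axis orthogonal to time
    width: float                  # line width in pixels
    color: Tuple[int, int, int]   # RGB


def _scale(value: float, lo: float, hi: float,
           out_lo: float, out_hi: float) -> float:
    """Clamp value to [lo, hi] and map it linearly to [out_lo, out_hi]."""
    value = min(max(value, lo), hi)
    return out_lo + (value - lo) * (out_hi - out_lo) / (hi - lo)


def segment_for_unit(pitch_hz: float, level_db: float, vowel: str,
                     screen_height: float = 400.0) -> LineSegment:
    """Derive position, width, and color for one unit (assumed ranges)."""
    # Higher pitch is drawn nearer the top of the screen (smaller y).
    y = _scale(pitch_hz, 80.0, 400.0, screen_height, 0.0)
    # Louder sound gives a thicker line.
    width = _scale(level_db, 40.0, 90.0, 1.0, 20.0)
    color = VOWEL_COLORS.get(vowel, DEFAULT_COLOR)
    return LineSegment(y=y, width=width, color=color)
```

Calling `segment_for_unit` once per unit for the model voice and once per unit for the speaker's voice would yield two sequences of segments that can be drawn against the same time axis for comparison.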

According to the invention of claim 2, the position and size of each displayed character are determined based on the pitch and the sound pressure, so the speaker can grasp the difference in manner of utterance between the model voice and the speaker's voice even more easily.

According to the invention of claim 4, a color corresponding to the type of vowel specified for each character unit is determined as the color of the line, so the speaker can clearly grasp, character by character, the difference in manner of utterance between the model voice and the speaker's voice.

According to the invention of claim 5, the variety of line colors can be increased, which improves the visual effect given to the speaker.

According to the invention of claim 6, a color corresponding to the combination of the first formant frequency value and the second formant frequency value specified for each predetermined time is determined as the color of the line. The color of the line can therefore change smoothly over time, and the speaker can grasp the difference in manner of utterance between the model voice and the speaker's voice even more easily.

According to the inventions of claims 7 and 8, the variety of line colors can be increased, which improves the visual effect given to the speaker.

According to the invention of claim 9, a mixture of a color corresponding to the first formant frequency value and a color corresponding to the second formant frequency value, each specified for each predetermined time, is determined as the color of the line. The color of the line can therefore change smoothly over time, and the speaker can grasp the difference in manner of utterance between the model voice and the speaker's voice even more easily.

According to the invention of claim 10, the variety of line colors can be increased, which improves the visual effect given to the speaker.

FIG. 1 is a diagram showing a schematic configuration example of the voice information display device S according to the present embodiment.
FIG. 2 is a diagram showing an example of a formant distribution map.
FIG. 3 is a diagram showing an example of the model voice line and the speaker voice line displayed on the screen while the speaker is reading aloud.
FIG. 4 is a diagram showing the flow of processing in the voice information display device S and the data used in the processing.
FIG. 5 is a diagram showing an example of the voice drawing data generation processing shown in FIG. 4.
FIG. 6 is a diagram showing an example of the screen drawing processing shown in FIG. 4.

Embodiments of the present invention are described below with reference to the drawings. The embodiment described below is one in which the present invention is applied to a voice information display device.

[1. Configuration and functions of the voice information display device S]
First, the configuration and functions of the voice information display device S according to the embodiment of the present invention are described with reference to FIG. 1. FIG. 1 is a diagram showing a schematic configuration example of the voice information display device S according to the present embodiment. Examples of the voice information display device include a personal computer and a portable information terminal (such as a smartphone). As shown in FIG. 1, the voice information display device S includes a communication unit 1, a storage unit 2, a control unit 3, an operation unit 4, an interface (IF) unit 5, and the like, and these components are connected to a bus 6. The operation unit 4 receives operation instructions from the user and outputs a signal corresponding to the received operation to the control unit 3. A microphone M, a display D, and the like are connected to the interface unit 5. The microphone M picks up the voice uttered when a speaker engaged in language learning, vocalization and speech training, or the like reads aloud a character string composed of a plurality of characters (for example, a character string to be announced). The display D displays, on its screen, the voice information to be provided to the speaker in accordance with drawing commands from the control unit 3. The voice information is information that includes a voice line representing the change in a voice and the above character string. The microphone M and the display D may be integrated with the voice information display device S or may be separate from it.

The communication unit 1 connects to a network (not shown) by wire or wirelessly and communicates with a server or the like. The storage unit 2 consists of, for example, a hard disk drive, and stores an OS (operating system), a display control processing program (an example of the program of the present invention), and the like. The display control processing program causes the control unit 3, acting as a computer, to execute the display control processing described later. The display control processing program may be downloaded from a predetermined server as an application, or may be provided stored on a recording medium such as a CD or DVD. The storage unit 2 also stores text data of a character string composed of a plurality of characters, and first voice waveform data (hereinafter, "model voice waveform data") indicating the waveform of a voice serving as a model for reading the character string aloud. The text data includes, for example, the pronunciation timing of each character (for example, the elapsed time from the start of pronunciation), associated with that character. Examples of character strings to be read aloud include character strings used in language learning or announcement training, and character strings used in singing. The model voice waveform data is stored in a predetermined audio file format.

The control unit 3 consists of a CPU (Central Processing Unit) serving as a computer, a ROM (Read Only Memory), a RAM (Random Access Memory), and the like. Under the display control processing program, the control unit 3 functions as a voice processing unit 31 and a drawing control unit 32. The voice processing unit 31 is an example of the input means and the specifying means of the present invention. The drawing control unit 32 is an example of the first control means, the second control means, and the third control means of the present invention. The storage unit 2, or the RAM in the control unit 3, is an example of the storage means of the present invention.

The voice processing unit 31 receives, from the storage unit 2, the model voice waveform data stored in the predetermined audio file format. The voice processing unit 31 also receives second voice waveform data (hereinafter, "speaker voice waveform data") indicating the waveform of the voice that the speaker uttered when reading the character string aloud and that was picked up by the microphone M. The model voice waveform data and the speaker voice waveform data are collectively referred to as voice waveform data. Each set of voice waveform data is discretized time-series sound pressure waveform data, for example monaural waveform data with a sampling rate of 44.1 kHz and 16-bit quantization. The voice processing unit 31 specifies a pitch, a sound pressure, and a pronunciation feature quantity for each predetermined unit based on the model voice waveform data. Likewise, the voice processing unit 31 specifies a pitch, a sound pressure, and a pronunciation feature quantity for each predetermined unit based on the speaker voice waveform data. The predetermined unit may be a time unit (for example, 10 ms to 100 ms) or a character unit.
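When the predetermined unit is a time unit, the waveform must first be cut into consecutive pieces of that length. The following is a minimal sketch of that step for 44.1 kHz, 16-bit monaural samples; the function name `split_into_units`, the 50 ms default, and the choice to drop an incomplete trailing unit are assumptions for illustration only.

```python
from typing import List, Sequence

SAMPLE_RATE = 44_100  # Hz, matching the example format above


def split_into_units(samples: Sequence[int],
                     unit_ms: float = 50.0,
                     sample_rate: int = SAMPLE_RATE) -> List[List[float]]:
    """Cut 16-bit PCM samples into consecutive fixed-length time units.

    Each sample is scaled to the range [-1.0, 1.0). Trailing samples
    that do not fill a whole unit are dropped.
    """
    unit_len = int(round(sample_rate * unit_ms / 1000.0))
    if unit_len <= 0:
        raise ValueError("unit_ms is too short for this sample rate")
    n_units = len(samples) // unit_len
    return [
        [s / 32768.0 for s in samples[i * unit_len:(i + 1) * unit_len]]
        for i in range(n_units)
    ]
```

One second of audio split into 50 ms units yields 20 units of 2205 samples each; pitch, sound pressure, and the pronunciation feature quantity would then be computed per unit.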

Here, the pitch refers to how high or low a sound is. The voice processing unit 31, for example, calculates a fundamental frequency (Hz) from data cut out of the voice waveform data at predetermined time intervals, and specifies (determines) the calculated fundamental frequency (Hz) as the pitch for each interval. The pitch data indicating the pitch specified in this way is stored for each predetermined time interval. A known method such as the zero-cross method or vector autocorrelation can be applied to calculate the pitch.
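As a rough illustration of specifying a pitch per frame, the following sketch estimates a fundamental frequency with plain time-domain autocorrelation. This is a stand-in chosen for brevity, not the zero-cross or vector autocorrelation method named in the text, and the function name `estimate_pitch_hz` and the search range of 80 Hz to 400 Hz are assumptions made for this example.

```python
import math

def estimate_pitch_hz(frame, sample_rate, fmin=80.0, fmax=400.0):
    """Estimate the fundamental frequency of one frame by autocorrelation.

    fmin/fmax bound the search range; both defaults are illustrative
    assumptions, not values taken from the patent text.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself delayed by `lag` samples.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # No positive correlation peak means no clear pitch (e.g. silence).
    return sample_rate / best_lag if best_lag else None

# Usage: a 100 ms frame of a 200 Hz sine sampled at 8 kHz.
rate = 8000
frame = [math.sin(2 * math.pi * 200 * n / rate) for n in range(800)]
pitch = estimate_pitch_hz(frame, rate)
```

Real speech would normally need windowing and a voicing check before such an estimate is trusted; both are omitted here.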

Next, the sound pressure refers to the change (Pa) in air pressure caused by a sound wave. In this embodiment, the sound pressure level (dB) is used as the sound pressure. The sound pressure level expresses the magnitude of the effective sound pressure (Pa), which is the root mean square (RMS) of the instantaneous sound pressure (Pa), as a numerical value that is easy to handle in calculation. In a broad sense, the sound pressure level (dB) is also called volume. The voice processing unit 31, for example, calculates the sound pressure level (dB) from data cut out of the voice waveform data at predetermined time intervals, and specifies the calculated sound pressure level (dB) as the sound pressure for each interval. The sound pressure data indicating the sound pressure specified in this way is stored for each predetermined time interval.
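The calculation described above can be sketched as follows. The sketch takes the RMS of one frame and converts it to decibels relative to a reference amplitude. The function name `sound_pressure_level_db` and the default reference of 1.0 (digital full scale) are assumptions for this example; an acoustic sound pressure level would instead use a calibrated reference pressure.

```python
import math

def sound_pressure_level_db(frame, reference=1.0):
    """Return the RMS level of one frame in dB relative to `reference`.

    `reference` = 1.0 treats the samples as normalized digital amplitudes,
    which is an assumption of this sketch.
    """
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    if rms == 0.0:
        return float("-inf")  # digital silence has no finite level
    return 20.0 * math.log10(rms / reference)

# Usage: a full-scale sine has an RMS of 1/sqrt(2), about -3.01 dB.
sine = [math.sin(2 * math.pi * n / 100) for n in range(1000)]
level = sound_pressure_level_db(sine)
```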

Next, the pronunciation feature quantity is a feature quantity that can be distinguished by the frequency spectrum of the voice waveform data. The frequency spectrum is, for example, the spectrum obtained when frequency is taken on the horizontal axis (X axis) and the power of the sound at that frequency (for example, the square of the sound pressure level) is taken on the vertical axis (Y axis). One example of a pronunciation feature quantity is the phoneme. One phoneme basically corresponds to one character. There are three kinds of phonemes: a vowel only, a consonant only, and a combination of a consonant and a vowel. There are five vowels: a (あ), i (い), u (う), e (え), and o (お). Consonants are sound components other than vowels (for example, k, s, t, n, h, m, y, r, w, and so on). For example, the Japanese phoneme 「か」 is written "ka" in Roman letters, so it is a combination of a consonant and a vowel. The Japanese phoneme 「しゃ」 (recognized as one character) is written "sya" in Roman letters, so it is also a combination of a consonant and a vowel. The Japanese phoneme 「で」 is written "de" in Roman letters, so it too is a combination of a consonant and a vowel. The Japanese phoneme 「ん」 is written "n" or "m" in Roman letters, so it is a consonant.

When the boundaries between characters (in other words, between phonemes) can be clearly separated in the voice waveform indicated by the voice waveform data, the vowel or the phoneme can be specified (determined) as an example of the pronunciation feature quantity. In this case, the voice processing unit 31, for example, calculates a frequency spectrum by performing Fourier analysis (FFT) on data cut out of the voice waveform data for each character (in other words, for each phoneme). The voice processing unit 31 then compares the calculated frequency spectrum with templates prepared in advance that indicate the frequency spectrum of each phoneme (template matching), and thereby specifies the vowel or phoneme as the pronunciation feature quantity for each character. The pronunciation feature quantity indicating the vowel or phoneme specified in this way is stored for each character. That is, which vowel or which phoneme was uttered is stored for each character. In Japanese, the character 「ん」 consists only of a consonant, so the vowel categories are 「あ」, 「い」, 「う」, 「え」, 「お」, and 「ん」. When specifying phonemes, it is desirable to separate the voice waveform cut out for each character into a consonant part with relatively small amplitude and a vowel part with relatively large amplitude, and to perform Fourier analysis on each part.

Other examples of the pronunciation feature quantity are the formant frequency and the noise frequency. A formant frequency is a frequency at which the spectral envelope specified from the frequency spectrum forms a peak. The peaks are called, from the lowest frequency upward, the first formant frequency, the second formant frequency, the third formant frequency, and so on. In general, the vowel can be determined to some extent from the formant distribution (a two-dimensional coordinate plane) in which the first formant frequency is taken on the horizontal axis and the second formant frequency on the vertical axis. FIG. 2 shows an example of a formant distribution diagram. The formant distribution varies with gender, language, and the like. In the two-dimensional plane of the formant distribution shown in FIG. 2, region R1 corresponds to the vowel "a", region R2 to the vowel "i", region R3 to the vowel "u", region R4 to the vowel "e", and region R5 to the vowel "o". For example, the combination of a first formant frequency of 1000 (Hz) and a second formant frequency of 1200 (Hz) lies in the region corresponding to the vowel "a". A combination of first and second formant frequency values may lie in more than one region. The noise frequency refers to the frequency of a noise component, that is, a component other than the fundamental frequency and its harmonics. At a time when there is no clear fundamental frequency, the spectrum of the noise component is smoothed along the frequency axis, and the apex of the largest peak in the resulting spectral envelope is called the noise center frequency. Consonants contain many noise frequencies.

When the pronunciation feature quantities are the formant frequency and the noise frequency, the voice processing unit 31, for example, performs Fourier analysis on the voice waveform data, separates it into a noise component and a harmonic component for each frequency bin, and applies inverse Fourier analysis to each separated component to generate voice waveform data of the noise component and voice waveform data of the harmonic component. The voice processing unit 31 then calculates the frequency spectrum of the noise component by performing Fourier analysis on data cut out of the noise-component voice waveform data, for example at predetermined time intervals. The voice processing unit 31 further calculates the noise frequency by smoothing the calculated noise-component frequency spectrum along the frequency axis, and specifies the calculated noise frequency as the pronunciation feature quantity for each predetermined time interval. The pronunciation feature quantity indicating the noise frequency specified in this way is stored for each predetermined time interval. Because the spectral envelope of the noise component often forms one large peak, the voice processing unit 31 specifies and stores the apex of that peak as the noise center frequency. Meanwhile, the voice processing unit 31 calculates the frequency spectrum of the harmonic component by performing Fourier analysis on data cut out of the harmonic-component voice waveform data, for example at predetermined time intervals. The voice processing unit 31 then calculates the formant frequencies from the calculated harmonic-component frequency spectrum by the cepstrum method, and specifies the calculated formant frequencies as the pronunciation feature quantity for each predetermined time interval. Alternatively, the voice processing unit 31 calculates the formant frequencies by applying linear predictive coding (LPC) to data cut out of the harmonic-component voice waveform data at predetermined time intervals, and specifies the calculated formant frequencies as the pronunciation feature quantity for each predetermined time interval. The pronunciation feature quantity indicating the formant frequencies specified as described above is stored for each predetermined time interval. The voice processing unit 31 specifies and stores the first and second peaks of the spectral envelope obtained by the cepstrum method or by linear predictive coding as the first formant frequency and the second formant frequency, respectively. In the above, the noise-component voice waveform data and the harmonic-component voice waveform data generated by the inverse Fourier analysis may also be recombined to restore the voice waveform data. In that case, the sound pressure and the like may be specified based on the restored voice waveform data.

The voice processing unit 31 may also calculate a sound pressure level (dB) from data cut out, for example at predetermined time intervals, of the noise-component voice waveform data generated as described above, and specify the calculated sound pressure level (dB) as the noise sound pressure for each interval. The noise sound pressure data indicating the noise sound pressure specified in this way is stored for each predetermined time interval. Similarly, the voice processing unit 31 may calculate a sound pressure level (dB) from data cut out, for example at predetermined time intervals, of the harmonic-component voice waveform data generated as described above, and specify the calculated sound pressure level (dB) as the harmonic sound pressure for each interval. The harmonic sound pressure data indicating the harmonic sound pressure specified in this way is stored for each predetermined time interval.

Next, based on the pitch, sound pressure, and pronunciation feature quantity specified from the model voice waveform data, the drawing control unit 32 displays on the screen of the display D a first voice line (hereinafter referred to as the "model voice line") that extends in the time axis (for example, horizontal axis) direction and changes over time in the direction of the axis orthogonal to the time axis (for example, the vertical axis). When displaying the model voice line, the drawing control unit 32 determines the position (coordinate) of the model voice line in the axis direction orthogonal to the time axis, for example at predetermined time intervals, based on the pitch specified from the model voice waveform data, and determines the width (line width) of the model voice line, for example at predetermined time intervals, based on the sound pressure specified from the model voice waveform data. Furthermore, the drawing control unit 32 determines the color (line color) of the model voice line for each predetermined unit (for example, for each character or for each predetermined time interval) based on the pronunciation feature quantity specified from the model voice waveform data.
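The mapping from pitch to line position and from sound pressure to line width can be sketched as a pair of linear scalings. The function name `line_geometry`, the plot height, and all value ranges below are hypothetical choices for this example; the text does not specify how the values are scaled.

```python
def line_geometry(pitch_hz, level_db, plot_height=200,
                  pitch_range=(80.0, 400.0),
                  level_range=(-60.0, 0.0),
                  width_range=(1.0, 12.0)):
    """Map one (pitch, level) sample to a (y position, line width) pair.

    All ranges are illustrative assumptions. y is measured downward from
    the top of the plot, as in most screen coordinate systems, so a higher
    pitch gives a smaller y.
    """
    def normalize(value, low, high):
        # Scale into [0, 1] and clamp out-of-range input.
        return min(1.0, max(0.0, (value - low) / (high - low)))

    y = plot_height * (1.0 - normalize(pitch_hz, *pitch_range))
    min_width, max_width = width_range
    width = min_width + (max_width - min_width) * normalize(level_db, *level_range)
    return y, width

# Usage: a mid-range pitch at a mid-range level.
y, width = line_geometry(240.0, -30.0)
```

Calling this once per predetermined time interval yields the sequence of points and widths from which a voice line could be drawn.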

Specific examples of the method for determining the color of a voice line are described below.

(1) Specific example 1 of the method for determining the voice line color
For example, the drawing control unit 32 determines the line color for each character so that it differs according to the specified vowel (that is, the color set in advance for each vowel is determined as the line color). This allows the speaker to clearly grasp, character by character, the difference in the way of speaking between the model voice and the speaker's own voice. Alternatively, the drawing control unit 32 may determine, for each character, a mixture of the color corresponding to the specified vowel and the color corresponding to the specified consonant as the line color. This increases the variety of voice line colors and so improves the visual effect given to the speaker. In this case, colors other than those set for the vowels are set in advance for the consonants.
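The per-character color choice in specific example 1 can be sketched as a table lookup with optional mixing. The color tables, the function names, and the use of a simple per-channel average as the mixing rule are all assumptions for this example; the text only requires that consonant colors differ from vowel colors.

```python
# Hypothetical color tables (8-bit RGB). The text does not fix these values.
VOWEL_COLORS = {
    "a": (0, 0, 255),
    "i": (255, 0, 0),
    "u": (0, 255, 0),
    "e": (128, 0, 128),
    "o": (255, 255, 0),
}
CONSONANT_COLORS = {
    "k": (255, 128, 0),
    "s": (0, 255, 255),
}

def mix(color_a, color_b):
    """Average two RGB colors channel by channel (an assumed mixing rule)."""
    return tuple((a + b) // 2 for a, b in zip(color_a, color_b))

def line_color(vowel, consonant=None):
    """Return the line color for one character."""
    color = VOWEL_COLORS[vowel]
    if consonant in CONSONANT_COLORS:
        color = mix(color, CONSONANT_COLORS[consonant])
    return color

# Usage: the character "ka" mixes the colors for "k" and "a".
ka_color = line_color("a", "k")
```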

(2) Specific example 2 of the method for determining the voice line color
For example, the drawing control unit 32 determines, at predetermined time intervals, a color corresponding to the combination of the specified first formant frequency value and the specified second formant frequency value as the line color. This allows the color of the voice line to change smoothly over time, so that the speaker can grasp the difference in the way of speaking between the model voice and the speaker's own voice even more easily. The same benefit is obtained even when it is difficult to clearly separate the characters in the voice waveform indicated by the voice waveform data. In this case, for example, a predetermined reference color is set in advance for each of a plurality of reference combinations of a predetermined first formant frequency value and a predetermined second formant frequency value. It is desirable that there be as many reference combinations as there are vowels, with a different reference color for each vowel. For example, in the two-dimensional plane of the formant distribution shown in FIG. 2, "blue" is set in advance as the reference color for the reference combination corresponding to approximately the center of region R1 of the vowel "a". This reference combination is either (x1, y1) = (first formant frequency value, second formant frequency value) or a group of coordinates in region R1 that includes (x1, y1). For the reference combination corresponding to approximately the center of region R2 of the vowel "i", "red" is set in advance as the reference color. This reference combination is either (x2, y2) = (first formant frequency value, second formant frequency value) or a group of coordinates in region R2 that includes (x2, y2). For the reference combination corresponding to approximately the center of region R3 of the vowel "u", "green" is set as the reference color. This reference combination is either (x3, y3) = (first formant frequency value, second formant frequency value) or a group of coordinates in region R3 that includes (x3, y3). For the reference combination corresponding to approximately the center of region R4 of the vowel "e", "purple" is set as the reference color. This reference combination is either (x4, y4) = (first formant frequency value, second formant frequency value) or a group of coordinates in region R4 that includes (x4, y4). For the reference combination corresponding to approximately the center of region R5 of the vowel "o", "yellow" is set as the reference color. This reference combination is either (x5, y5) = (first formant frequency value, second formant frequency value) or a group of coordinates in region R5 that includes (x5, y5).

The drawing control unit 32 then determines, at predetermined time intervals, the line color as the color (called the harmonic color) set in advance for the reference combination that is closest, for example in distance on the two-dimensional plane, to the combination (x0, y0) of the specified first formant frequency value and the specified second formant frequency value (or for the reference combination that contains the combination (x0, y0)). Alternatively, the drawing control unit 32 may determine, at predetermined time intervals, the line color as a mixture of the color (harmonic color) corresponding to the combination of the specified first and second formant frequency values and a color corresponding to the value of the specified noise frequency (for example, the noise center frequency). The latter color is an example of a color corresponding to the value of the noise component and is called the noise color. This increases the variety of voice line colors and so improves the visual effect given to the speaker. In this case, colors other than the reference colors are set in advance for the values of the noise frequency (for example, the noise center frequency).
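Selecting the harmonic color from the closest reference combination amounts to a nearest-neighbor lookup on the formant plane. In the sketch below, the reference coordinates are made-up placeholders for (x1, y1) through (x5, y5), which the text leaves unspecified; only the five color names come from the description. The function name `harmonic_color` is also hypothetical.

```python
# Hypothetical reference combinations: vowel -> ((F1 Hz, F2 Hz), RGB color).
# The coordinates are placeholders, not values from FIG. 2.
REFERENCE_COMBINATIONS = {
    "a": ((850.0, 1300.0), (0, 0, 255)),    # blue
    "i": ((300.0, 2300.0), (255, 0, 0)),    # red
    "u": ((350.0, 1300.0), (0, 255, 0)),    # green
    "e": ((500.0, 1900.0), (128, 0, 128)),  # purple
    "o": ((500.0, 900.0), (255, 255, 0)),   # yellow
}

def harmonic_color(f1_hz, f2_hz):
    """Return (vowel, color) of the reference combination nearest (f1, f2)."""
    def squared_distance(item):
        (x, y), _color = item[1]
        return (x - f1_hz) ** 2 + (y - f2_hz) ** 2

    vowel, (_point, color) = min(REFERENCE_COMBINATIONS.items(),
                                 key=squared_distance)
    return vowel, color

# Usage: the example combination from the text, F1 = 1000 Hz and F2 = 1200 Hz.
vowel, color = harmonic_color(1000.0, 1200.0)
```

Comparing squared distances avoids a square root without changing which reference is nearest.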

Alternatively, the drawing control unit 32 may determine, at predetermined time intervals, the color (harmonic color) corresponding to the combination of the specified first and second formant frequency values so that the degree of change from the reference color set in advance for a reference combination (xα, yβ) increases as the distance from that reference combination on the coordinate plane increases. This increases the variety of voice line colors and so improves the visual effect given to the speaker. Here, (xα, yβ) denotes (x1, y1), (x2, y2), (x3, y3), (x4, y4), or (x5, y5). For example, when (xα, yβ) is (x1, y1), the degree of change from the reference color set in advance for the reference combination (x1, y1) (for example, (R luminance value, G luminance value, B luminance value) = (0, 0, 255)) increases toward the outer edge of region R1 shown in FIG. 2. A large degree of change from the reference color means, for example, a large color difference from the reference color. When colors are expressed in RGB, for example, the color difference is obtained as the Euclidean distance from the coordinates (0, 0, 255) of the reference color in the three-dimensional RGB color space (as one example, with each channel expressed in 8 bits). In this case, the drawing control unit 32 calculates the sum of squares (xα - x0)^2 + (yβ - y0)^2 of the difference between the predetermined first formant frequency value xα and the specified first formant frequency value x0 and the difference between the predetermined second formant frequency value yβ and the specified second formant frequency value y0 (or calculates the square root of this sum). The drawing control unit 32 then determines by calculation the color corresponding to the combination of the specified first and second formant frequency values so that the degree of change from the reference color increases as the calculated sum of squares (or its square root) increases. For example, the color corresponding to the combination is determined by adding a value proportional to the sum of squares (limited to 255 or less) to the R luminance value (0) or the G luminance value (0) of the reference color (0, 0, 255), or by subtracting a value proportional to the sum of squares (limited to 255 or less) from the B luminance value (255).
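The distance-dependent color change can be sketched as below. Of the options the text mentions, this example uses only one: adding a value proportional to the sum of squares to the R luminance value. The function name `shifted_color`, the proportionality divisor `scale`, and the reference point are assumptions for illustration.

```python
def shifted_color(reference_rgb, reference_point, observed_point, scale=100.0):
    """Shift the reference color further as the observed formants move away.

    `scale` is a hypothetical divisor that turns the sum of squared
    frequency differences (Hz^2) into a luminance offset.
    """
    x_ref, y_ref = reference_point
    x0, y0 = observed_point
    sum_of_squares = (x_ref - x0) ** 2 + (y_ref - y0) ** 2
    # The offset is proportional to the sum of squares, capped at 255.
    offset = min(255, int(sum_of_squares / scale))
    r, g, b = reference_rgb
    return (min(255, r + offset), g, b)

# Usage: reference color blue for vowel "a" at an assumed center (850, 1300).
BLUE = (0, 0, 255)
CENTER_A = (850.0, 1300.0)
near = shifted_color(BLUE, CENTER_A, (950.0, 1300.0))
```

Using the square root of the sum instead would make the color change linear in distance rather than quadratic.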

(3) Specific example 3 of the method for determining the voice line color
For example, the drawing control unit 32 determines, at predetermined time intervals, the line color as a mixture of a color (harmonic color) corresponding to the specified first formant frequency value and a color (harmonic color) corresponding to the specified second formant frequency value. This allows the color of the line to change smoothly over time, so that the speaker can grasp the difference in the way of speaking between the model voice and the speaker's own voice even more easily. The same benefit is obtained even when it is difficult to clearly separate the characters in the voice waveform indicated by the voice waveform data. For example, the drawing control unit 32 converts the specified first formant frequency value into a G luminance value of, for example, 255 or less (for example, by the conversion formula luminance value = frequency / k, where k is a coefficient), and determines this G luminance value as the color corresponding to the first formant frequency value. The drawing control unit 32 further converts the specified second formant frequency value into a B luminance value of, for example, 255 or less, and determines this B luminance value as the color corresponding to the second formant frequency value. The drawing control unit 32 then determines the mixed color obtained from the determined G and B luminance values (for example, (R luminance value, G luminance value, B luminance value)) as the line color. The remaining R luminance value may be any value from 0 to 255, for example. However, it is more effective for the drawing control unit 32 to convert the harmonic sound pressure value (sound pressure level (dB)) indicated by the harmonic sound pressure data described above into an R luminance value of, for example, 255 or less, and thereby adjust the depth of the mixed color obtained from the G and B luminance values. The luminance values assigned to the first and second formant frequency values need not be the combination of G and B. They may instead be B and G, R and G, R and B, B and R, or G and R, and in these cases too the remaining luminance value is preferably adjusted according to the harmonic sound pressure value.
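The conversion in specific example 3 can be sketched with the formula luminance value = frequency / k given above. The coefficients `k1` and `k2`, the decibel floor, the linear mapping of sound pressure level to the R channel, and the function name `formant_color` are all assumptions for this example.

```python
def formant_color(f1_hz, f2_hz, harmonic_db, k1=4.0, k2=12.0, db_floor=-60.0):
    """Build an (R, G, B) line color from F1, F2 and the harmonic level.

    G = F1 / k1 and B = F2 / k2 follow the formula in the text; the
    coefficient values and the dB-to-R mapping are illustrative assumptions.
    """
    def clamp(value):
        # Keep each luminance value inside the 8-bit range 0..255.
        return max(0, min(255, int(value)))

    g = clamp(f1_hz / k1)
    b = clamp(f2_hz / k2)
    # Map db_floor..0 dB linearly onto 0..255 for the remaining R channel.
    r = clamp(255 * (harmonic_db - db_floor) / (0.0 - db_floor))
    return (r, g, b)

# Usage: F1 = 800 Hz, F2 = 1200 Hz at a harmonic level of -30 dB.
color = formant_color(800.0, 1200.0, -30.0)
```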

Alternatively, the drawing control unit 32 may determine, at predetermined time intervals, the line color as a mixture of two colors. The first is the mixed color of the color (harmonic color) corresponding to the specified first formant frequency value and the color (harmonic color) corresponding to the specified second formant frequency value (for example, (R luminance value, G luminance value, B luminance value) determined as described above). The second is a color (noise color) corresponding to the value of the specified noise frequency (for example, the noise center frequency). This increases the variety of voice line colors and so improves the visual effect given to the speaker. In this case, colors other than the reference colors are set in advance for the values of the noise frequency (for example, the noise center frequency). It is more effective for the drawing control unit 32 to adjust the depth of the color corresponding to the noise frequency value (for example, the noise center frequency) according to the noise sound pressure value (sound pressure level (dB)) indicated by the noise sound pressure data described above.

The color of the voice line may also be determined based on the pronunciation feature quantity by a method other than the determination methods described above.

The drawing control unit 32 also acquires the text data of the character string from the storage unit 2 and, based on the acquired text data, displays each character of the character string on the screen of the display D along the time axis (for example, horizontal axis) direction so that it corresponds to the model voice line. When displaying the characters, the drawing control unit 32 determines, for each character, the position of the character in the axis direction orthogonal to the time axis based on the pitch specified from the model voice waveform data, and determines, for each character, the size of the character based on the sound pressure specified from the model voice waveform data.

Furthermore, based on the pitch, sound pressure, and pronunciation feature quantity specified from the speaker voice waveform data, the drawing control unit 32 displays on the screen of the display D, so that it can be compared with the model voice line, a second voice line (hereinafter referred to as the "speaker voice line") that extends in the time axis (for example, horizontal axis) direction and changes over time in the direction of the axis orthogonal to the time axis (for example, the vertical axis). When displaying the speaker voice line, the drawing control unit 32 determines the position of the speaker voice line in the axis direction orthogonal to the time axis, for example at predetermined time intervals, based on the pitch specified from the speaker voice waveform data, and determines the width of the speaker voice line, for example at predetermined time intervals, based on the sound pressure specified from the speaker voice waveform data. The drawing control unit 32 further determines the color of the speaker voice line for each predetermined unit (for example, for each character or for each predetermined time interval) based on the pronunciation feature quantity specified from the speaker voice waveform data. The method for determining the color of the speaker voice line is the same as in specific examples 1 to 3 described above for the model voice line.

Based on the acquired text data, the drawing control unit 32 also displays each character of the character string on the screen of the display D along the time axis (for example, the horizontal axis) direction so that it corresponds to the speaker voice line. For example, the drawing control unit 32 determines, for each character, the position of the character in the axis direction orthogonal to the time axis direction based on the pitch specified from the speaker voice waveform data, and determines the size of each character based on the sound pressure specified from the speaker voice waveform data.

FIG. 3 shows an example of the model voice line and the speaker voice line displayed on the screen while the speaker is reading aloud. FIG. 3 is a screen example for the case where phonemes are specified as the pronunciation feature quantity. The screen shown in FIG. 3 includes a model voice line display unit 51, a speaker voice line display unit 52, and a phoneme/line color correspondence display unit 53. The model voice line display unit 51 and the speaker voice line display unit 52 are each a coordinate plane with pitch on the vertical axis (Y axis) and time on the horizontal axis (X axis). The model voice line display unit 51 displays a model voice line 51a and the characters 51b constituting the character string. The speaker voice line display unit 52 displays a speaker voice line 52a and the characters 52b constituting the character string. The time step in the time axis (horizontal axis) direction of the model voice line display unit 51 is configured to match that of the speaker voice line display unit 52, so the speaker voice line 52a and the model voice line 51a can be displayed in a comparable manner.

The phoneme/line color correspondence display unit 53 displays the correspondence between phonemes (in this example, the vowels and "n") and line colors (colors of the voice line). The association between phonemes and line colors is set in advance. In this example, blue, red, green, purple, and yellow are preset for the vowels "a", "i", "u", "e", and "o", respectively, and gray is preset for "n". Gray may also be preset for consonants other than "n", or a different line color may be preset for each consonant. When a phoneme is a combination of a consonant and a vowel, such as "ka", the line color is, for example, a mixture of the line color preset for the consonant and the line color preset for the vowel. A line color need not be preset for consonants. In that case, when a phoneme is a combination of a consonant and a vowel, such as "ka", the line color is, for example, the one set for the vowel. The current utterance position shown in FIG. 3 indicates the position that the speaker reading the character string aloud has currently reached.
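
As a minimal sketch of the phoneme-to-color association described above, the lookup and the consonant/vowel mixing might be written in Python as follows. The RGB values, the channel-averaging rule used for the mixed color, and the names `VOWEL_COLORS`, `mix`, and `line_color` are assumptions for illustration; the specification only names the colors and does not prescribe a mixing formula.

```python
# Vowel colours named in the example of FIG. 3; the RGB triples are assumed.
VOWEL_COLORS = {
    "a": (0, 0, 255),      # blue
    "i": (255, 0, 0),      # red
    "u": (0, 128, 0),      # green
    "e": (128, 0, 128),    # purple
    "o": (255, 255, 0),    # yellow
}
GRAY = (128, 128, 128)     # preset for "n"


def mix(c1, c2):
    """Average two RGB colours channel by channel (one possible mixed colour)."""
    return tuple((a + b) // 2 for a, b in zip(c1, c2))


def line_color(consonant, vowel, consonant_colors=None):
    """Return the line colour for one phoneme unit.

    consonant: romanised consonant such as "k", or None for a bare vowel.
    vowel: "a", "i", "u", "e" or "o", or None for "n".
    consonant_colors: optional dict of preset consonant colours (hypothetical).
    """
    if vowel is None:
        return GRAY
    vowel_color = VOWEL_COLORS[vowel]
    if consonant is None or not consonant_colors or consonant not in consonant_colors:
        # No colour preset for the consonant: use the vowel colour alone.
        return vowel_color
    return mix(consonant_colors[consonant], vowel_color)
```

With no consonant table, `line_color("k", "a")` falls back to the vowel color, which corresponds to the case where no line color is preset for consonants.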

The position (y coordinate) in the vertical axis (Y axis) direction of each point constituting the model voice line 51a displayed in the model voice line display unit 51 is determined every predetermined time (for example, 10 ms) along the horizontal axis (X axis) based on the pitch specified from the model voice waveform data. The higher the pitch, the larger the y coordinate at which the point of the model voice line 51a is placed on the coordinate plane. The width of the model voice line 51a is likewise determined every predetermined time (for example, 10 ms) along the horizontal axis (X axis) based on the sound pressure specified from the model voice waveform data. The higher the sound pressure, the thicker the model voice line 51a. Here, the width of the model voice line 51a is its width (that is, its thickness) in the direction orthogonal to the direction in which the line extends (advances). The model voice line 51a may be configured not to be displayed in sections where the specified sound pressure is at or below a threshold. Section x3 shown in FIG. 3 is a section with no utterance; because the sound pressure there is at or below the threshold, the model voice line 51a is not displayed. The model voice line 51a may thus be interrupted partway. In particular, when the speaker's voice is quiet, the speaker voice line 52a may be broken into pieces. The position and width of the speaker voice line 52a displayed in the speaker voice line display unit 52 are determined in the same way as for the model voice line 51a, based on the pitch and sound pressure specified from the speaker voice waveform data.
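
The mapping from per-frame pitch and sound pressure to line position, line width, and gaps can be sketched as follows. This is only an illustration under assumed values: the threshold, the scale factors `px_per_hz` and `px_per_db`, and the function name `line_segments` do not come from the specification.

```python
def line_segments(pitches_hz, pressures_db, frame_ms=10,
                  threshold_db=30.0, px_per_hz=0.5, px_per_db=0.1):
    """Turn per-frame pitch and sound pressure into drawable runs of points.

    Returns a list of runs; each run is a list of (t_ms, y, width) tuples.
    Frames whose sound pressure is at or below threshold_db are skipped,
    which interrupts the line. All numeric defaults are assumptions.
    """
    runs, current = [], []
    for i, (f0, spl) in enumerate(zip(pitches_hz, pressures_db)):
        if spl <= threshold_db:
            # Quiet frame: close the current run so the line is interrupted.
            if current:
                runs.append(current)
                current = []
            continue
        y = f0 * px_per_hz                        # higher pitch -> larger y
        width = (spl - threshold_db) * px_per_db  # higher pressure -> thicker
        current.append((i * frame_ms, y, width))
    if current:
        runs.append(current)
    return runs
```

Each returned run would be drawn as one continuous stroke, so a quiet section such as x3 appears as a gap between two runs.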

The color (line color) of the model voice line 51a displayed in the model voice line display unit 51 is determined for each character (that is, each phoneme) along the horizontal axis (X axis) based on the phonemes specified from the model voice waveform data. This follows specific example 1 of the voice line color determination method described above. For example, the line color of section x1, which corresponds to "go", a combination of a consonant and a vowel, is a mixture of the color preset for the consonant "g" (an example of a color corresponding to a sound component other than a vowel) and the color preset for the vowel "o" (a color corresponding to the particular vowel). Likewise, the line color of section x2, which corresponds to the vowel "a", is the color preset for the vowel "a" (a color corresponding to the particular vowel). The duration of each character's section is longer than the predetermined time (for example, 10 ms). The durations of the sections are not necessarily equal; as shown in FIG. 3, x1 (for example, 100 ms) and x2 (for example, 300 ms) differ in duration. The color of the speaker voice line 52a displayed in the speaker voice line display unit 52 is determined in the same way as for the model voice line 51a, based on the phonemes specified from the speaker voice waveform data.

Further, the position (y coordinate) in the vertical axis (Y axis) direction of each character 51b displayed in the model voice line display unit 51 is determined for each character along the horizontal axis (X axis) based on the pitch specified from the model voice waveform data. The higher the pitch, the larger the y coordinate at which the character 51b is placed on the coordinate plane. The pitch used here is, for example, the average of the pitches contained in the character's section (for example, x1). The size of each character 51b displayed in the model voice line display unit 51 is also determined for each character along the horizontal axis (X axis). The higher the sound pressure, the larger the character 51b. The sound pressure used here is, for example, the average of the sound pressures contained in the character's section (for example, x1). The position and size of each character 52b displayed in the speaker voice line display unit 52 are determined in the same way as for the characters 51b, based on the pitch and sound pressure specified from the speaker voice waveform data.
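
The per-character averaging described above might look like the following sketch. The section boundaries are assumed to be given in milliseconds, and the scale factors, the base font size, and the name `layout_characters` are hypothetical.

```python
from statistics import mean


def layout_characters(sections, pitches_hz, pressures_db, frame_ms=10,
                      px_per_hz=0.5, base_pt=12.0, pt_per_db=0.2):
    """Place each character using the mean pitch and pressure of its section.

    sections: list of (char, start_ms, end_ms) tuples, one per character.
    pitches_hz, pressures_db: per-frame values, one frame every frame_ms.
    Scale factors and base size are illustrative assumptions.
    """
    placed = []
    for char, start_ms, end_ms in sections:
        lo = start_ms // frame_ms
        hi = max(end_ms // frame_ms, lo + 1)   # use at least one frame
        y = mean(pitches_hz[lo:hi]) * px_per_hz                # higher pitch -> larger y
        size = base_pt + mean(pressures_db[lo:hi]) * pt_per_db  # louder -> larger
        placed.append({"char": char, "x_ms": start_ms, "y": y, "size": size})
    return placed
```

For sections of 100 ms and 300 ms at a 10 ms frame step, the first character is averaged over 10 frames and the second over 30.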

As described above, the model voice line 51a and the characters 51b constituting the character string are displayed so that they can be compared with the speaker voice line 52a and the characters 52b constituting the character string. The speaker can therefore grasp at a glance, and more easily, how the speaker's manner of speaking differs from that of the model voice.

In the example of FIG. 3, the line color is determined based on the phonemes specified as the pronunciation feature quantity, so the color changes sharply at the boundary of each character's section. When formant frequencies or the like are specified as the pronunciation feature quantity, however, the colors of the model voice line 51a and the speaker voice line 52a are each determined every predetermined time (for example, 10 ms) along the horizontal axis (X axis) based on the specified formant frequencies (or the formant frequencies and the noise frequency). The colors of the model voice line 51a and the speaker voice line 52a therefore change more smoothly than shown in FIG. 3 (in other words, they change like a gradation).

[2. Operation example of the voice information display device S]
Next, an example of the operation of the voice information display device S will be described with reference to FIGS. 4 to 6. The operation example described below is for the case where formant frequencies or the like are specified as the pronunciation feature quantity. FIG. 4 shows the flow of processing in the voice information display device S and the data used in that processing. FIG. 5 shows an example of the voice drawing data generation process shown in FIG. 4. FIG. 6 shows an example of the screen drawing process shown in FIG. 4.

In FIG. 4, first, when for example the speaker designates, via the operation unit 4, a desired voice file to serve as the model for reading aloud, the control unit 3 executes a voice file input process and inputs the model voice waveform data stored in the designated voice file (step S1). Next, the voice processing unit 31 of the control unit 3 executes a voice drawing data generation process based on the input model voice waveform data (step S2). As shown in FIG. 5, the voice drawing data generation process comprises a pitch data calculation process (step S21), a sound pressure data calculation process (step S22), and a pronunciation feature quantity specifying process (step S23). These three processes may be executed serially or in parallel. When they are executed serially, they may be executed in any order.

In the pitch data calculation process (step S21), the voice processing unit 31 executes a pitch specifying process (step S211) that specifies the pitch every predetermined time based on the input model voice waveform data. In the pitch specifying process, the voice processing unit 31, for example, calculates the fundamental frequency (Hz) from the data cut out of the model voice waveform data every predetermined time, and specifies the calculated fundamental frequency (Hz) as the pitch for that time. The voice processing unit 31 then calculates, in time series, pitch data indicating the pitch specified every predetermined time.
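
The specification does not say how the fundamental frequency is calculated. One common approach, shown here purely as an assumed example, is to pick the lag at which the autocorrelation of each frame peaks. The search range of 70 to 500 Hz, the 40 ms analysis window, and the names `estimate_f0` and `pitch_track` are illustrative.

```python
def estimate_f0(frame, sample_rate, f_min=70.0, f_max=500.0):
    """Estimate the fundamental frequency (Hz) of one frame.

    Picks the lag with the largest autocorrelation inside the assumed
    pitch range; returns 0.0 when no positive peak is found (e.g. silence).
    """
    n = len(frame)
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), n - 1)
    best_lag, best_val = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        val = sum(frame[i] * frame[i + lag] for i in range(n - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag if best_lag else 0.0


def pitch_track(samples, sample_rate, hop_ms=10, window_ms=40):
    """Return one pitch estimate every hop_ms (the 'predetermined time')."""
    hop = int(sample_rate * hop_ms / 1000)
    win = int(sample_rate * window_ms / 1000)
    return [estimate_f0(samples[s:s + win], sample_rate)
            for s in range(0, len(samples) - win + 1, hop)]
```

The analysis window is longer than the 10 ms step so that each frame contains at least two pitch periods of a low voice.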

In the sound pressure data calculation process (step S22), the voice processing unit 31 executes a sound pressure specifying process (step S221) that specifies the sound pressure every predetermined time based on the input model voice waveform data. In the sound pressure specifying process, the voice processing unit 31, for example, calculates the sound pressure level (dB) from the data cut out of the model voice waveform data every predetermined time, and specifies the calculated sound pressure level (dB) as the sound pressure for that time. The voice processing unit 31 then calculates, in time series, sound pressure data indicating the sound pressure specified every predetermined time.
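
A per-frame level in dB can be sketched as below. Note the assumption: without microphone calibration the result is a level relative to an arbitrary `reference` amplitude rather than a true sound pressure level referenced to 20 µPa. The names `level_db` and `level_track` and the floor value are illustrative.

```python
import math


def level_db(frame, reference=1.0, floor_db=-120.0):
    """RMS level of one frame in dB relative to `reference` (assumed scale)."""
    if not frame:
        return floor_db
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    if rms <= 0.0:
        return floor_db          # avoid log10(0) for digital silence
    return max(floor_db, 20.0 * math.log10(rms / reference))


def level_track(samples, sample_rate, hop_ms=10):
    """Return one level value every hop_ms (the 'predetermined time')."""
    hop = int(sample_rate * hop_ms / 1000)
    return [level_db(samples[s:s + hop])
            for s in range(0, len(samples) - hop + 1, hop)]
```

The same function could serve for the noise sound pressure and the harmonic sound pressure by passing the corresponding separated waveform.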

In the pronunciation feature quantity specifying process (step S23), the voice processing unit 31 executes, for example, a noise/harmonic component separation process (step S231) that separates the noise component and the harmonic component of the input model voice waveform data. In this process, the voice processing unit 31, for example, applies Fourier analysis to the model voice waveform data, separates it into a noise component and a harmonic component frequency bin by frequency bin, and applies an inverse Fourier transform to each separated component to generate voice waveform data of the noise component and voice waveform data of the harmonic component.
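
The specification does not state the rule used to decide whether a frequency bin is noise or harmonic. The sketch below assumes one plausible rule, namely that a bin is harmonic when its frequency lies within `tolerance_hz` of an integer multiple of a known fundamental `f0`. It uses a naive DFT so that it depends only on the standard library; the names `dft`, `idft`, and `separate` are illustrative.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2), fine for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def separate(frame, sample_rate, f0, tolerance_hz=30.0):
    """Split one frame into (harmonic, noise) waveforms bin by bin.

    Assumed rule: a bin is harmonic when its frequency is within
    tolerance_hz of a positive multiple of f0; every other bin is noise.
    """
    n = len(frame)
    spectrum = dft(frame)
    harmonic_bins, noise_bins = [0j] * n, [0j] * n
    for k, value in enumerate(spectrum):
        # Bins k and n - k are mirror images, so classify them identically.
        freq = min(k, n - k) * sample_rate / n
        nearest = round(freq / f0) * f0 if f0 > 0 else 0.0
        if nearest > 0 and abs(freq - nearest) <= tolerance_hz:
            harmonic_bins[k] = value
        else:
            noise_bins[k] = value
    return idft(harmonic_bins), idft(noise_bins)
```

Because every bin is assigned to exactly one of the two components, the two returned waveforms add up to the original frame.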

Next, based on the generated voice waveform data of the noise component, the voice processing unit 31 executes a noise sound pressure specifying process (step S232) that specifies the noise sound pressure every predetermined time and a noise center frequency calculation process (step S233) that calculates the noise center frequency every predetermined time. In the noise sound pressure specifying process, the voice processing unit 31, for example, calculates the sound pressure level (dB) from the data cut out of the noise component voice waveform data every predetermined time, and specifies the calculated sound pressure level (dB) as the noise sound pressure for that time. The voice processing unit 31 then calculates, in time series, noise sound pressure data indicating the noise sound pressure specified every predetermined time. In the noise center frequency calculation process, the voice processing unit 31, for example, calculates the frequency spectrum of the noise component by applying Fourier analysis to the data cut out of the noise component voice waveform data every predetermined time. The voice processing unit 31 then smooths the calculated frequency spectrum of the noise component along the frequency axis to specify the noise frequency every predetermined time, and calculates, in time series, the peak of the resulting spectral envelope as the noise center frequency.
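
The smoothing and peak-picking step can be illustrated with a moving average over neighboring bins. The moving average and its width `smooth_bins` are assumptions, since the specification only says that the spectrum is smoothed along the frequency axis; the name `center_frequency` is illustrative.

```python
def center_frequency(magnitudes, sample_rate, n_fft, smooth_bins=5):
    """Return the frequency (Hz) at the peak of the smoothed spectrum.

    magnitudes: |X[k]| for k = 0 .. n_fft // 2 of one noise-component frame.
    smooth_bins: width of the moving average (an assumed smoothing method).
    """
    half = smooth_bins // 2
    smoothed = []
    for k in range(len(magnitudes)):
        window = magnitudes[max(0, k - half):k + half + 1]
        smoothed.append(sum(window) / len(window))
    # The highest point of the smoothed envelope is taken as the centre.
    peak = max(range(len(smoothed)), key=smoothed.__getitem__)
    return peak * sample_rate / n_fft
```

Smoothing keeps a single narrow spike from being chosen over a broad concentration of noise energy, which is the reason for taking the peak of the envelope rather than of the raw spectrum.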

Based on the generated voice waveform data of the harmonic component, the voice processing unit 31 also executes a harmonic sound pressure specifying process (step S234) that specifies the harmonic sound pressure every predetermined time and a formant calculation process (step S235) that calculates formant frequencies every predetermined time. In the harmonic sound pressure specifying process, the voice processing unit 31, for example, calculates the sound pressure level (dB) from the data cut out of the harmonic component voice waveform data every predetermined time, and specifies the calculated sound pressure level (dB) as the harmonic sound pressure for that time. The voice processing unit 31 then calculates, in time series, harmonic sound pressure data indicating the harmonic sound pressure specified every predetermined time. In the formant calculation process, the voice processing unit 31, for example, applies linear predictive coding (LPC) to the data cut out of the harmonic component voice waveform data every predetermined time and calculates the first formant frequency and the second formant frequency in time series.
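
As one possible reading of the linear predictive coding step, the sketch below fits LPC coefficients with the autocorrelation method (Levinson-Durbin recursion) and takes the lowest peaks of the resulting spectral envelope as formant candidates. The model order, the 10 Hz search grid, peak-picking in place of root-finding, and the names `lpc` and `formants` are all assumptions rather than details from the specification.

```python
import cmath
import math


def lpc(frame, order):
    """LPC coefficients [1, a1, ..., ap] via the Levinson-Durbin recursion."""
    n = len(frame)
    r = [sum(frame[i] * frame[i + k] for i in range(n - k))
         for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break                      # silent or degenerate frame
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                 # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def formants(frame, sample_rate, order=8, step_hz=10.0, count=2):
    """Return up to `count` lowest peak frequencies of the LPC envelope."""
    a = lpc(frame, order)
    steps = int(sample_rate / 2 / step_hz)
    freqs = [i * step_hz for i in range(1, steps)]
    env = []
    for f in freqs:
        w = 2 * math.pi * f / sample_rate
        denom = sum(a[j] * cmath.exp(-1j * w * j) for j in range(order + 1))
        env.append(1.0 / max(abs(denom), 1e-12))
    peaks = [freqs[i] for i in range(1, len(env) - 1)
             if env[i - 1] < env[i] >= env[i + 1]]
    return peaks[:count]
```

With `count=2`, the first and second returned values would play the role of the first and second formant frequencies for that frame.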

Then, as shown in FIG. 4, the voice processing unit 31 stores the pitch data, sound pressure data, noise sound pressure data, noise center frequency, harmonic sound pressure data, first formant frequency, and second formant frequency calculated by the voice drawing data generation process of step S2 as voice drawing data in, for example, a drawing DB (database) provided in the storage unit 2. Next, the control unit 3 executes a voice file input process and inputs the text data of the character string to be read aloud, which is, for example, the text data associated with the voice file designated by the speaker (step S3). As shown in FIG. 4, the control unit 3 stores this text data in the drawing DB as text drawing data.

When the speaker starts reading the character string aloud, the voice uttered during the reading is picked up by the microphone M, and speaker voice waveform data indicating the waveform of the collected voice is output to the voice information display device S via the interface unit 5. The control unit 3 then executes a voice input process to input the speaker voice waveform data from the microphone M (step S4). Next, the voice processing unit 31 of the control unit 3 executes the voice drawing data generation process shown in FIG. 5 based on the input speaker voice waveform data (step S5). In this voice drawing data generation process, the pitch data calculation process (step S21), the sound pressure data calculation process (step S22), and the pronunciation feature quantity specifying process (step S23) are executed in the same way as for the model voice waveform data. The voice processing unit 31 then sequentially outputs the pitch data, sound pressure data, noise sound pressure data, noise center frequency, harmonic sound pressure data, first formant frequency, and second formant frequency calculated by the voice drawing data generation process of step S5 to the drawing control unit 32 as voice drawing data.

Next, the drawing control unit 32 of the control unit 3 executes a screen drawing process for the model, based on the voice drawing data (model) and the text drawing data acquired from the drawing database, and a screen drawing process for the speaker, based on the voice drawing data (speaker) acquired from the voice processing unit 31 and the text drawing data acquired from the drawing database (step S6). The screen drawing process for the model and the screen drawing process for the speaker may be executed serially or in parallel. The screen drawing process for the speaker may also be executed in real time while the speaker is reading aloud. Although FIG. 6 shows the content of a single screen drawing process, this process is executed separately for the model and for the speaker.

In the screen drawing process, as shown in FIG. 6, the drawing control unit 32 executes a text display process (step S61), a line display process (step S62), a line width change process (step S63), a noise color generation process (step S64), a harmonic color generation process (step S65), and a color synthesis display process (step S66). The text display process and the line display process may be executed serially or in parallel. When they are executed serially, either may be executed first.

In the text display process (step S61), the drawing control unit 32 determines, for each character and based on the acquired text drawing data, the display character (that is, the character to be shown on the screen) and its display position in the horizontal axis (X axis) direction. The display position is determined, for example, from the utterance timing included in the text drawing data. For characters read aloud by the speaker, however, the display position is determined by, for example, a labeling process. The labeling process assigns phoneme labels that match the content read aloud (uttered) and identifies the boundary positions between phonemes, based on the text drawing data, the speaker voice waveform data, and the frequency spectrogram of the speaker voice waveform data. Various known techniques can be applied to the labeling process, so a detailed description is omitted. Next, the drawing control unit 32 determines the display position of each display character in the vertical axis (Y axis) direction based on the pitch indicated by the pitch data included in the acquired voice drawing data, and determines the size of the display character based on the sound pressure indicated by the sound pressure data included in the acquired voice drawing data. The drawing control unit 32 then draws the display characters on the screen by issuing a drawing command to the display D in accordance with the determined display characters, display positions, and sizes.

In the line display process (step S62), the drawing control unit 32 determines, every predetermined time and as described above, the display position of the voice line (the model voice line or the speaker voice line) in the vertical axis (Y axis) direction based on the pitch indicated by the pitch data included in the acquired voice drawing data. The drawing control unit 32 also identifies any section in which the sound pressure indicated by the sound pressure data included in the acquired voice drawing data is at or below a threshold. The drawing control unit 32 then draws the voice line, which changes in time series, on the screen by issuing a drawing command to the display D in accordance with the determined display positions. If a section in which the sound pressure is at or below the threshold has been identified, the drawing control unit 32 does not draw the voice line in that section.

In the line width change process (step S63), the drawing control unit 32 determines the width of the voice line every predetermined time, as described above, based on the sound pressure indicated by the sound pressure data included in the acquired voice drawing data. The drawing control unit 32 then changes the width of the voice line by issuing a drawing command to the display D in accordance with the determined width.

In the noise color generation process (step S64), the drawing control unit 32 generates a noise color every predetermined time by, for example, determining a color according to the value of the noise center frequency included in the acquired voice drawing data and determining the depth of that color according to the value of the noise sound pressure indicated by the noise sound pressure data included in the voice drawing data.

In the harmonic color generation process (step S65), the drawing control unit 32 generates a harmonic color every predetermined time by, for example, determining a mixture of the color corresponding to the value of the first formant frequency and the color corresponding to the value of the second formant frequency, both included in the acquired voice drawing data, and determining the depth of the mixed color according to the value of the harmonic sound pressure indicated by the harmonic sound pressure data included in the voice drawing data.

In the color synthesis display process (step S66), the drawing control unit 32 determines the line color every predetermined time by synthesizing (mixing) the noise color generated by the noise color generation process and the harmonic color generated by the harmonic color generation process. The drawing control unit 32 then colors the voice line by issuing a drawing command to the display D in accordance with the determined line color.
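
Steps S64 to S66 can be sketched together as follows. The specification leaves the frequency-to-color mapping, the meaning of color depth, and the mixing rule open, so everything numeric here is an assumption: frequency is mapped linearly onto a hue between red and blue, depth is modeled as a blend from white toward the full color, and mixing is a channel average. The frequency ranges and the names `freq_to_rgb`, `apply_depth`, `mix`, and `line_color` are illustrative.

```python
import colorsys


def freq_to_rgb(freq_hz, f_lo, f_hi):
    """Map a frequency onto a hue from red (f_lo) to blue (f_hi); assumed."""
    ratio = min(max((freq_hz - f_lo) / (f_hi - f_lo), 0.0), 1.0)
    return colorsys.hsv_to_rgb(ratio * 2.0 / 3.0, 1.0, 1.0)


def apply_depth(rgb, level_db, db_lo=30.0, db_hi=90.0):
    """Blend from white (quiet) toward the full colour (loud)."""
    depth = min(max((level_db - db_lo) / (db_hi - db_lo), 0.0), 1.0)
    return tuple(1.0 - depth * (1.0 - c) for c in rgb)


def mix(c1, c2):
    """Average two RGB colours channel by channel."""
    return tuple((a + b) / 2.0 for a, b in zip(c1, c2))


def line_color(noise_center_hz, noise_db, f1_hz, f2_hz, harmonic_db):
    """Line colour for one time step, as RGB floats in the range 0..1."""
    # S64: noise colour from the noise centre frequency and noise pressure.
    noise = apply_depth(freq_to_rgb(noise_center_hz, 1000.0, 8000.0), noise_db)
    # S65: harmonic colour from both formants and the harmonic pressure.
    harmonic = apply_depth(mix(freq_to_rgb(f1_hz, 200.0, 1000.0),
                               freq_to_rgb(f2_hz, 800.0, 3000.0)), harmonic_db)
    # S66: synthesise the two colours into the line colour.
    return mix(noise, harmonic)
```

Because the inputs vary continuously from one time step to the next, colors computed this way change gradually along the line rather than jumping at character boundaries.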

As described above, according to the embodiment, the position, width, and color of the model voice line, which represents the change in the voice serving as a model for reading the character string aloud, are determined based on the pitch, sound pressure, and pronunciation feature quantity specified from the model voice waveform data. The position, width, and color of the speaker voice line, which represents the change in the voice uttered when the speaker reads the character string aloud, are determined based on the pitch, sound pressure, and pronunciation feature quantity specified from the speaker voice waveform data. The model voice line and the speaker voice line are displayed on the screen so that they can be compared, and the character string is displayed on the screen along the time axis direction. The speaker can therefore grasp at a glance, and more easily, how the speaker's manner of speaking differs from that of the model voice.

DESCRIPTION OF SYMBOLS
1 Communication unit
2 Storage unit
3 Control unit
4 Operation unit
5 Interface unit
6 Bus
31 Voice processing unit
32 Drawing control unit
S Voice information display device

Claims

A display control apparatus comprising:
Storage means for storing text data of a character string composed of a plurality of characters, and a pitch, a sound pressure, and a pronunciation feature amount specified for each predetermined unit based on first speech waveform data indicating a waveform of a voice serving as a model when the character string is read aloud;
Input means for inputting second speech waveform data indicating a waveform of a voice uttered when a speaker reads the character string aloud;
Specifying means for specifying a pitch, a sound pressure, and a pronunciation feature amount for each predetermined unit based on the second speech waveform data;
First control means for displaying on a screen a first line that extends in a time axis direction and changes in time series in an axis direction orthogonal to the time axis direction, based on the pitch, the sound pressure, and the pronunciation feature amount stored in the storage means;
Second control means for displaying each character constituting the character string on the screen along the time axis direction based on the text data; and
Third control means for displaying on the screen, so as to be comparable with the first line, a second line that extends in the time axis direction and changes in time series in the axis direction orthogonal to the time axis direction, based on the pitch, the sound pressure, and the pronunciation feature amount specified by the specifying means,
Wherein the first control means and the third control means each include:
A first determining unit that determines the position of the line in the axis direction orthogonal to the time axis direction for each predetermined unit based on the pitch;
A second determining unit that determines the width of the line for each predetermined unit based on the sound pressure; and
A third determining unit that determines the color of the line for each predetermined unit based on the pronunciation feature amount.
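The three determining units recited above can be illustrated with a minimal sketch that maps the features of one predetermined unit to a line position, width, and color. The pitch range, screen height, maximum width, and the vowel-to-color table are assumptions for illustration only, and the names `UnitFeatures`, `LinePoint`, and `line_point` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

RGB = Tuple[int, int, int]


@dataclass
class UnitFeatures:
    pitch_hz: float        # pitch of the unit
    sound_pressure: float  # assumed normalized to the range 0..1
    vowel: str             # pronunciation feature, here a vowel label


@dataclass
class LinePoint:
    y: float      # position on the axis orthogonal to the time axis
    width: float  # line width
    color: RGB    # line color


# Assumed vowel-to-color table (not taken from the source).
VOWEL_COLORS = {"a": (255, 0, 0), "i": (0, 0, 255), "u": (0, 160, 0),
                "e": (255, 160, 0), "o": (128, 0, 128)}


def line_point(u: UnitFeatures, screen_height: float = 200.0,
               pitch_lo: float = 80.0, pitch_hi: float = 400.0,
               max_width: float = 12.0) -> LinePoint:
    # Position from pitch: a higher pitch is drawn nearer the top (y = 0).
    clamped = min(max(u.pitch_hz, pitch_lo), pitch_hi)
    y = screen_height * (1.0 - (clamped - pitch_lo) / (pitch_hi - pitch_lo))
    # Width from sound pressure.
    width = max_width * min(max(u.sound_pressure, 0.0), 1.0)
    # Color from the pronunciation feature; gray for unknown labels.
    color = VOWEL_COLORS.get(u.vowel, (128, 128, 128))
    return LinePoint(y, width, color)


def build_line(units: List[UnitFeatures]) -> List[LinePoint]:
    """One point per predetermined unit, ordered along the time axis."""
    return [line_point(u) for u in units]
```

Applying `build_line` once to the stored model features and once to the features specified from the speaker's waveform would give two point sequences that can be drawn on the same axes for comparison.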

The display control apparatus according to claim 1, wherein the second control means includes:
A fourth determining unit that determines the position of the character in the axis direction orthogonal to the time axis direction for each predetermined unit based on the pitch; and
A fifth determining unit that determines the size of the character for each predetermined unit based on the sound pressure.

The display control apparatus according to claim 1, wherein
The pronunciation feature amount is a feature amount distinguishable by a spectrum of the speech waveform data, and
The third determining unit determines the color of the line according to the pronunciation feature amount.

The display control apparatus according to claim 3, wherein
The predetermined unit is a character unit,
The storage means stores a vowel as the pronunciation feature amount specified for each character unit based on the first speech waveform data,
The specifying means specifies a vowel as the pronunciation feature amount for each character unit based on the second speech waveform data, and
The third determining unit determines, as the color of the line, a color corresponding to the specified vowel.

The storage means stores a vowel and a sound component other than the vowel as the pronunciation feature amount specified for each character unit based on the first speech waveform data;
The specifying means specifies a vowel and a sound component other than the vowel as the pronunciation feature amount for each character unit based on the second speech waveform data;
The third determining unit determines, as the color of the line, a mixed color of a color corresponding to the specified vowel and a color corresponding to the specified sound component other than the vowel. The display control apparatus described.

The display control apparatus according to claim 3, wherein
The predetermined unit is a predetermined time unit,
The storage means stores at least a first formant frequency and a second formant frequency as the pronunciation feature amount specified for each predetermined time unit based on the first speech waveform data,
The specifying means specifies at least a first formant frequency and a second formant frequency as the pronunciation feature amount for each predetermined time unit based on the second speech waveform data, and
The third determining unit determines, as the color of the line, a color corresponding to the combination of the value of the specified first formant frequency and the value of the specified second formant frequency.

A predetermined reference color is set for a combination of a predetermined first formant frequency value and a predetermined second formant frequency value;
The color corresponding to the combination of the value of the specified first formant frequency and the value of the specified second formant frequency is a color whose degree of change from the reference color increases as the sum of the square of the difference between the predetermined first formant frequency value and the specified first formant frequency value and the square of the difference between the predetermined second formant frequency value and the specified second formant frequency value increases. The display control apparatus described.
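The sum-of-squares relationship recited above can be sketched as follows. The reference formant values, the normalization scale, and the two endpoint colors are hypothetical values chosen for illustration; the claim only requires that the degree of change from the reference color grow with the sum of squared differences.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def change_degree(f1: float, f2: float, ref_f1: float, ref_f2: float,
                  scale_hz: float = 500.0) -> float:
    """Degree of change in [0, 1], non-decreasing in the sum of squared
    differences from the reference formant pair."""
    sum_sq = (f1 - ref_f1) ** 2 + (f2 - ref_f2) ** 2
    return min(sum_sq / scale_hz ** 2, 1.0)  # assumed normalization


def formant_pair_color(f1: float, f2: float,
                       ref_f1: float = 700.0, ref_f2: float = 1200.0,
                       ref_color: RGB = (255, 0, 0),
                       far_color: RGB = (128, 128, 128)) -> RGB:
    """Shift the reference color toward far_color as (f1, f2) moves away
    from the reference pair (ref_f1, ref_f2)."""
    t = change_degree(f1, f2, ref_f1, ref_f2)
    return tuple(round(r + (c - r) * t) for r, c in zip(ref_color, far_color))
```

With these assumed values, a formant pair equal to the reference pair returns the reference color unchanged, and pairs farther away return colors progressively closer to `far_color`.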

The display control apparatus according to claim 6, wherein
The storage means stores at least a first formant frequency, a second formant frequency, and a noise component as the pronunciation feature amount specified for each predetermined time unit based on the first speech waveform data,
The specifying means specifies at least a first formant frequency, a second formant frequency, and a noise component as the pronunciation feature amount for each predetermined time unit based on the second speech waveform data, and
The third determining unit determines, as the color of the line, a mixed color of a color corresponding to the combination of the value of the specified first formant frequency and the value of the specified second formant frequency, and a color corresponding to the value of the specified noise component.

The display control apparatus according to claim 3, wherein
The predetermined unit is a predetermined time unit,
The storage means stores at least a first formant frequency and a second formant frequency as the pronunciation feature amount specified for each predetermined time unit based on the first speech waveform data,
The specifying means specifies at least a first formant frequency and a second formant frequency as the pronunciation feature amount for each predetermined time unit based on the second speech waveform data, and
The third determining unit determines, as the color of the line, a mixed color of a color corresponding to the value of the specified first formant frequency and a color corresponding to the value of the specified second formant frequency.

The display control apparatus according to claim 9, wherein
The storage means stores at least a first formant frequency, a second formant frequency, and a noise component as the pronunciation feature amount specified for each predetermined time unit based on the first speech waveform data,
The specifying means specifies at least a first formant frequency, a second formant frequency, and a noise component as the pronunciation feature amount for each predetermined time unit based on the second speech waveform data, and
The third determining unit determines, as the color of the line, a mixed color obtained by further mixing a color corresponding to the value of the specified noise component with the mixed color of the color corresponding to the value of the specified first formant frequency and the color corresponding to the value of the specified second formant frequency.

A display control method executed by one or more computers, comprising:
A storage step of storing, in storage means, text data of a character string composed of a plurality of characters, and a pitch, a sound pressure, and a pronunciation feature amount specified for each predetermined unit based on first speech waveform data indicating a waveform of a voice serving as a model when the character string is read aloud;
An input step of inputting second speech waveform data indicating a waveform of a voice uttered when a speaker reads the character string aloud;
A specifying step of specifying a pitch, a sound pressure, and a pronunciation feature amount for each predetermined unit based on the second speech waveform data;
A first control step of displaying on a screen a first line that extends in a time axis direction and changes in time series in an axis direction orthogonal to the time axis direction, based on the pitch, the sound pressure, and the pronunciation feature amount stored in the storage step;
A second control step of displaying each character constituting the character string on the screen along the time axis direction based on the text data; and
A third control step of displaying on the screen, so as to be comparable with the first line, a second line that extends in the time axis direction and changes in time series in the axis direction orthogonal to the time axis direction, based on the pitch, the sound pressure, and the pronunciation feature amount specified in the specifying step,
Wherein the first control step and the third control step each include:
Determining the position of the line in the axis direction orthogonal to the time axis direction for each predetermined unit based on the pitch;
Determining the width of the line for each predetermined unit based on the sound pressure; and
Determining the color of the line for each predetermined unit based on the pronunciation feature amount.

A program causing a computer to execute:
A storage step of storing, in storage means, text data of a character string composed of a plurality of characters, and a pitch, a sound pressure, and a pronunciation feature amount specified for each predetermined unit based on first speech waveform data indicating a waveform of a voice serving as a model when the character string is read aloud;
An input step of inputting second speech waveform data indicating a waveform of a voice uttered when a speaker reads the character string aloud;
A specifying step of specifying a pitch, a sound pressure, and a pronunciation feature amount for each predetermined unit based on the second speech waveform data;
A first control step of displaying on a screen a first line that extends in a time axis direction and changes in time series in an axis direction orthogonal to the time axis direction, based on the pitch, the sound pressure, and the pronunciation feature amount stored in the storage step;
A second control step of displaying each character constituting the character string on the screen along the time axis direction based on the text data; and
A third control step of displaying on the screen, so as to be comparable with the first line, a second line that extends in the time axis direction and changes in time series in the axis direction orthogonal to the time axis direction, based on the pitch, the sound pressure, and the pronunciation feature amount specified in the specifying step,
Wherein the first control step and the third control step each include:
Determining the position of the line in the axis direction orthogonal to the time axis direction for each predetermined unit based on the pitch;
Determining the width of the line for each predetermined unit based on the sound pressure; and
Determining the color of the line for each predetermined unit based on the pronunciation feature amount.