JPH05165494A

JPH05165494A - Voice recognizing device

Info

Publication number: JPH05165494A
Application number: JP3330807A
Authority: JP
Inventors: Katsuya Oba; 克哉大場; Teru Hirayama; 輝平山; Tadamichi Tagawa; 忠道田川; Masaaki Kato; 正明加藤
Original assignee: Oki Electric Industry Co Ltd; Osaka Gas Co Ltd
Current assignee: Oki Electric Industry Co Ltd; Osaka Gas Co Ltd
Priority date: 1991-12-13
Filing date: 1991-12-13
Publication date: 1993-07-02

Abstract

PURPOSE: To obtain a low-cost system that is free of time constraints and makes study easy for the learner, by providing a correspondence display means that displays a predetermined acoustic feature amount of a voice signal in association with the language symbol corresponding to it. CONSTITUTION: The device comprises an input means 50 for inputting the voice of an unspecified speaker, a voice recognizing means 51 for performing voice recognition on the voice signal to obtain language symbols, and a display means 54 that, for at least two kinds of voices, normalizes and displays in correspondence the association between the predetermined acoustic feature amount of the voice signal and the corresponding language symbol. Specifically, a normalizing and graphic display part 54 receives the learner's voice data output from the voice recognizing means 51 and model voice data from a model data acquisition part 52, normalizes the speaking time of each corresponding word in the two sets of data, and graphically displays the learner's voice data and the model voice data (intonation, stress).

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a voice recognition device applicable to fields such as education.

[0002]

[Prior Art] The following methods are conventionally known for training pronunciation (speaking) in, for example, foreign language education. (1) The learner has a foreign language education specialist or a native speaker listen to his or her pronunciation, receives advice, and has shortcomings pointed out. (2) The learner listens to commercially available foreign language learning tapes, bilingual broadcasts, and the like, and imitates them. The learner also practices using charts in reference books that show mouth shapes, points of articulation, and so on, and records his or her own pronunciation on magnetic tape or the like to compare it by ear with the voice recorded on the learning tape. (3) The learner practices with a pronunciation practice device that uses a computer or the like. That is, CAI (computer assisted instruction) systems for pronunciation practice are known. Such a system lets the learner listen to a native speaker's pronunciation, records and plays back the learner's voice, and graphically displays the native speaker's voice waveform and the learner's voice waveform. It also displays formants and the changes in frequency and sound pressure over time.

[0003]

[Problems to Be Solved by the Invention] However, the conventional pronunciation practice methods described above have the following problems. (1) Training by a foreign language education specialist or a native speaker can be expected to be highly effective, but it is subject to time constraints and is costly. (2) Methods using audio tapes and the like allow practice at any time and are inexpensive, but because the learner's pronunciation cannot be evaluated objectively, it is difficult for the learner to grasp what is wrong with it. (3) In methods that use a computer or the like, the voice waveforms of the native speaker and the learner are indeed displayed, but it is hard to tell what should be read from them, and how, to bring the learner's pronunciation closer to that of the native speaker.

[0004] In view of these problems of the conventional pronunciation training methods, an object of the present invention is to provide a voice recognition device that makes it possible to realize a system that is free of time constraints, is inexpensive, and makes learning easy for the learner.

[0005]

[Means for Solving the Problems] A first aspect of the present invention is a voice recognition device comprising: input means for inputting voice; recognition means for performing voice recognition on the voice signal to obtain language symbols; and correspondence display means for displaying a predetermined acoustic feature amount of the voice signal in association with the corresponding language symbol.

[0006] A second aspect of the present invention is a voice recognition device comprising: input means for inputting the voice of an unspecified speaker; recognition means for performing voice recognition on the voice signal to obtain language symbols; and display means for normalizing and displaying in correspondence, for at least two kinds of voices, the association between a predetermined acoustic feature amount of the voice signal and the corresponding language symbol.

[0007] A third aspect of the present invention is a voice recognition device comprising: input means for inputting the voice of an unspecified speaker; recognition means for performing voice recognition on the voice signal to obtain language symbols; and comparison means for comparing, for at least two kinds of voices, the association between a predetermined acoustic feature amount of the voice signal and the corresponding language symbol.

[0008]

[Operation] In the first aspect of the present invention, the recognition means performs voice recognition on the signal of the voice input through the input means and obtains language symbols. The correspondence display means displays the predetermined acoustic feature amount of the voice signal in association with the corresponding language symbol.

[0009] In the second aspect of the present invention, voice recognition is performed on the voice signal of an unspecified speaker input through the input means, and language symbols are obtained. For at least two kinds of voices, the display means normalizes and displays the association between the predetermined acoustic feature amount of the voice signal and the corresponding language symbol.

[0010] In the third aspect of the present invention, voice recognition is performed on the voice signal of an unspecified speaker input through the input means, and language symbols are obtained. For at least two kinds of voices, the comparison means compares the association between the predetermined acoustic feature amount of the voice signal and the corresponding language symbol.

[0011]

[Embodiments] An embodiment of the present invention will now be described with reference to the drawings.

[0012] FIG. 1 shows a computer-based system for graphically displaying voice data, which is one embodiment of the voice recognition device of the present invention. In the figure, the learner's voice is input through a microphone 15 and analyzed by an unspecified-speaker continuous voice recognition means 14. The program that performs the processing of the present invention and the data it requires are stored on a magnetic disk 13 and are loaded into a memory 12 by a CPU 11. A keyboard 17 and a mouse 18 are used to select a practice sentence, as described later, and intonation, stress, and the like are displayed graphically on a display 16.

[0013] FIG. 2 shows voice data that is an example of the output of the unspecified-speaker continuous voice recognition means used in the present invention. It consists of the number of recognized words 21, an array 22 of the character strings of the words, an array 23 of the number of segments (recognized phonemes) in each word, an array 24 of the average pitch frequency of each segment, an array 25 of the average sound pressure of each segment, and an array 26 of the utterance time of each segment.
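The record layout of FIG. 2 can be sketched as a small Python data class. This is only an illustration: the class and field names are hypothetical, and apart from the values quoted in the worked example of this description (four words, a first word "how" with three segments, and a first segment with utterance time 24 and average pitch 104, units not stated) every value below is invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VoiceData:
    """Hypothetical container mirroring items 21 to 26 of FIG. 2."""
    word_num: int              # 21: number of recognized words
    words: List[str]           # 22: character string of each word
    seg_counts: List[int]      # 23: number of segments in each word
    pitch_avg: List[float]     # 24: average pitch frequency per segment
    pressure_avg: List[float]  # 25: average sound pressure per segment
    durations: List[float]     # 26: utterance time per segment

    def is_consistent(self) -> bool:
        """Check that the per-word and per-segment arrays agree in length."""
        total_segments = sum(self.seg_counts)
        return (
            self.word_num == len(self.words) == len(self.seg_counts)
            and total_segments == len(self.pitch_avg)
            == len(self.pressure_avg) == len(self.durations)
        )


# Only "how", its 3 segments, the duration 24 and the pitch 104 come from the
# text; the remaining words and numbers are invented placeholders.
example = VoiceData(
    word_num=4,
    words=["how", "are", "you", "today"],
    seg_counts=[3, 2, 2, 4],
    pitch_avg=[104, 110, 98, 101, 95, 99, 107, 96, 92, 90, 88],
    pressure_avg=[60, 62, 58, 57, 55, 59, 61, 54, 52, 50, 48],
    durations=[24, 18, 20, 15, 17, 16, 19, 14, 21, 18, 22],
)
```

Keeping the per-segment arrays flat, with only a segment count per word, matches the description; the word a segment belongs to is recovered by accumulating the counts.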

[0014] Example 1 below is an embodiment of the first aspect of the present invention. It describes a device that displays the intonation or stress of the voice signal being recognized in association with the words obtained as the recognition result.

[0015] In the following description, array elements are counted from zero: the first element is the 0th, followed by the 1st, the 2nd, and so on.

[0016] (Example 1) FIG. 3 is a block diagram of Example 1 of the present invention. It consists of an input means 30, such as the microphone 15; a voice recognition processing unit 31, an example of the recognition means, which receives the learner's voice from the input means 30 and outputs the voice data shown in FIG. 2; and a graphic display unit 32, an example of the correspondence display means, which takes the voice data output from the voice recognition processing unit 31 as input and displays the intonation and stress of the input voice on the display in association with the word string of the recognition result.

[0017] FIG. 4 is a flowchart showing the processing of the voice recognition processing unit 31. In step 310 of FIG. 4, analog processing and A/D conversion are performed on the voice signal from the microphone 15. Input: analog continuous voice signal. Output: digital voice signal. Processing outline: the signal is passed through an anti-aliasing filter and converted to a digital signal at a sampling frequency of 16 kHz.

[0018] In step 311, acoustic parameter conversion is performed. Input: digital voice signal. Output: 23 acoustic feature amounts per frame (6.6 milliseconds). Processing outline: the digital signal is divided into 6.6-millisecond frames, and 23 acoustic feature amounts are extracted by spectrum analysis and the like.
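The framing in step 311 can be illustrated as follows. The sketch assumes non-overlapping frames and rounds the frame length to a whole number of samples (6.6 ms at 16 kHz is 105.6 samples); the text specifies neither choice, and the function name is hypothetical. Extraction of the 23 feature amounts is not shown.

```python
from typing import List, Sequence

SAMPLE_RATE_HZ = 16_000   # sampling frequency given for step 310
FRAME_MS = 6.6            # frame length given for step 311


def split_into_frames(samples: Sequence[float]) -> List[Sequence[float]]:
    """Cut a digital signal into consecutive, non-overlapping frames.

    Assumption: the frame length is rounded to the nearest sample and a
    trailing partial frame is dropped.
    """
    frame_len = round(SAMPLE_RATE_HZ * FRAME_MS / 1000)  # 105.6 rounds to 106
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


one_second = [0.0] * SAMPLE_RATE_HZ   # one second of silence as dummy input
frames = split_into_frames(one_second)
```

Under these assumptions one second of signal yields 150 full frames of 106 samples each.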

[0019] In step 312, phoneme code (phoneme piece) conversion is performed. Input: acoustic feature amounts per frame. Output: a phoneme code (phoneme piece) per segment and acoustic feature amounts per segment. Processing outline: a phoneme code (about 1800 types) is assigned to each frame on the basis of that frame's acoustic feature amounts. Adjacent frames with similar features are merged into a single segment.
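Merging adjacent frames into segments, as in step 312, can be sketched like this. The text only says that neighbouring frames with similar features are combined; the sketch makes the simplifying assumption that "similar" means "assigned the same phoneme code", and the code values and function name are invented.

```python
from itertools import groupby
from typing import List, Tuple


def frames_to_segments(frame_codes: List[int]) -> List[Tuple[int, int]]:
    """Collapse runs of identical per-frame codes into (code, n_frames) pairs."""
    return [(code, len(list(run))) for code, run in groupby(frame_codes)]


# Hypothetical per-frame phoneme codes (the text mentions about 1800 types).
codes = [412, 412, 412, 87, 87, 1530, 1530, 1530, 1530]
segments = frames_to_segments(codes)
# segments == [(412, 3), (87, 2), (1530, 4)]
```

With this representation, a segment's utterance time could be derived from its frame count and the 6.6 ms frame length, although the text does not say how the utterance times are actually computed.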

[0020] In step 313, phoneme code (phoneme piece) to phoneme conversion is performed. Input: a phoneme code (phoneme piece) per segment and acoustic feature amounts per segment. Output: phoneme candidates per segment and acoustic feature amounts per segment. External reference data: phoneme codebook 315. Processing outline: with reference to the phoneme codebook, each segment is assigned the likelihood that it corresponds to each phoneme (about 50 types).
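The codebook lookup of step 313 amounts to mapping each segment's phoneme code to a set of phoneme likelihoods. In the sketch below the codebook contents, the phoneme labels, and the function name are all hypothetical; a real codebook would cover about 1800 codes and about 50 phonemes.

```python
from typing import Dict, List, Tuple

# Hypothetical codebook: phoneme code -> {phoneme: likelihood}.
CODEBOOK: Dict[int, Dict[str, float]] = {
    412: {"h": 0.7, "f": 0.2, "s": 0.1},
    87: {"a": 0.6, "o": 0.4},
    1530: {"u": 0.8, "o": 0.2},
}


def phoneme_candidates(code: int) -> List[Tuple[str, float]]:
    """Return the candidate phonemes for one segment, most likely first."""
    table = CODEBOOK.get(code, {})   # unknown codes yield no candidates
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


candidates = [phoneme_candidates(code) for code in (412, 87, 1530)]
```

Returning every candidate with its likelihood, rather than only the best one, leaves the final decision to the word-level matching in the next step.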

[0021] In step 314, phoneme to word string conversion is performed. Input: phoneme candidates per segment and acoustic feature amounts per segment. Output: a word string and acoustic feature amounts per segment. External reference data: word dictionary 316. Processing outline: with reference to the word dictionary, the distance between the input segment sequence and the phoneme string corresponding to each word is calculated, and the closest match is output as the recognition result.
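The text does not say which distance measure step 314 uses. Purely as an illustration, the sketch below uses the Levenshtein edit distance between a recognized phoneme sequence and each dictionary entry and returns the closest word; the dictionary entries and phoneme labels are invented, and the sketch matches a single word rather than a full word string.

```python
from typing import Dict, List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical word dictionary: word -> phoneme string.
WORD_DICT: Dict[str, List[str]] = {
    "how": ["h", "a", "u"],
    "who": ["h", "u"],
    "are": ["a", "r"],
}


def closest_word(phonemes: Sequence[str]) -> str:
    """Pick the dictionary word whose phoneme string is nearest to the input."""
    return min(WORD_DICT, key=lambda w: edit_distance(phonemes, WORD_DICT[w]))


best = closest_word(["h", "a", "o"])
```

A likelihood-weighted distance that uses the per-segment phoneme candidates would be closer in spirit to the description, but the text gives no formula for it.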

[0022] The phoneme codebook 315 is a table that holds, for each phoneme code (phoneme piece), the likelihood that the phoneme code is a given phoneme, and the word dictionary 316 is a dictionary that holds word character strings and their corresponding phoneme strings.

[0023] FIGS. 5 and 6 show an example of the processing flow of the graphic display unit 32 of FIG. 3.

[0024] Step 401 is preprocessing: the X axis and Y axis are drawn on the display and the variables are initialized. Specifically, 0 is assigned to each of the following: the variable L_word_cnt, which indicates the position in the recognized word string of the word being processed; the variable L_seg_cnt, which indicates the position in the recognized segment arrays of the segment being processed; and L_x[0] and L_y[0], the 0th elements of the variables (arrays) that hold the values of the sample points of the line graph to be displayed (see FIG. 15, described later).

[0025] In step 402, the number of words in the voice data of the voice recognition result is assigned to L_word_num.

[0026] In step 403, L_word_cnt is compared with L_word_num. If L_word_cnt is less than L_word_num, words remain to be processed, and the process proceeds to step 404. If L_word_cnt is greater than or equal to L_word_num, graphic display has been completed for all words, and all processing ends.

[0027] In step 404, the flag disp_word_flag, which indicates that the word currently being processed is to be displayed below the X axis when the next line segment is drawn, is set to ON.

[0028] In step 405, the L_word_cnt-th element of the array of per-word segment counts in the voice data is added to the variable L_seg_num, which indicates the segment number up to which the word currently being processed extends.

[0029] In step 406, L_seg_cnt is compared with L_seg_num. If L_seg_cnt is less than L_seg_num, the word currently being processed still has unprocessed segments, and the process proceeds to step 407. If L_seg_cnt is greater than or equal to L_seg_num, all segments of the current word have been processed, and the process proceeds to step 413.

[0030] In step 407, the sample point of the line graph to be displayed is obtained. Specifically, {L_x[L_seg_cnt] + (the L_seg_cnt-th element of the array of segment utterance times in the voice data)} is assigned to L_x[L_seg_cnt + 1], and either (the L_seg_cnt-th element of the array of average segment pitch frequencies in the voice data) or (the L_seg_cnt-th element of the array of average segment sound pressures in the voice data) is assigned to L_y[L_seg_cnt + 1]. The pitch-frequency element is used when displaying an intonation line graph, and the sound-pressure element when displaying a stress line graph.

[0031] In step 408, the line graph is actually drawn. Specifically, the line segment connecting the two points {(L_x[L_seg_cnt], L_y[L_seg_cnt]), (L_x[L_seg_cnt + 1], L_y[L_seg_cnt + 1])} is displayed.

[0032] In step 409, it is determined whether to display the corresponding word character string below the X axis. That is, disp_word_flag is checked; if it is ON, the process proceeds to step 410, and if it is OFF, to step 412.

[0033] In step 410, the character string that is the L_word_cnt-th element of the word character string array 22 of the voice data is displayed directly below (L_x[L_seg_cnt + 1], 0).

[0034] In step 411, disp_word_flag is set to OFF, because the word character string needs to be displayed only once per word.

[0035] In step 412, 1 is added to L_seg_cnt to move on to the next segment, and the process returns to step 406.

[0036] In step 413, 1 is added to L_word_cnt to move on to the next word, and the process returns to step 403.
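Steps 401 to 413 can be condensed into a short routine. The sketch below is an interpretation rather than the patented implementation: it collects the line segments and word labels in lists instead of drawing them, initializes L_seg_num to 0 (as the worked example in this description assumes), and leaves the inner loop for step 413 once a word's segments are exhausted, which is also what the worked example does. The data are invented except for the first word "how", its three segments, and the first segment's utterance time 24 and pitch 104.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def build_graph(words: List[str], seg_counts: List[int],
                durations: List[float], values: List[float]):
    """Follow steps 401-413 and return (line segments, word labels).

    `values` is the pitch array for an intonation graph or the
    sound-pressure array for a stress graph.
    """
    l_word_cnt = l_seg_cnt = l_seg_num = 0             # step 401
    l_x, l_y = [0.0], [0.0]
    lines: List[Tuple[Point, Point]] = []
    labels: List[Tuple[float, str]] = []
    l_word_num = len(words)                            # step 402
    while l_word_cnt < l_word_num:                     # step 403
        disp_word_flag = True                          # step 404
        l_seg_num += seg_counts[l_word_cnt]            # step 405
        while l_seg_cnt < l_seg_num:                   # step 406
            l_x.append(l_x[l_seg_cnt] + durations[l_seg_cnt])   # step 407
            l_y.append(values[l_seg_cnt])
            lines.append(((l_x[l_seg_cnt], l_y[l_seg_cnt]),     # step 408
                          (l_x[l_seg_cnt + 1], l_y[l_seg_cnt + 1])))
            if disp_word_flag:                         # step 409
                # step 410: label goes below (L_x[L_seg_cnt + 1], 0)
                labels.append((l_x[l_seg_cnt + 1], words[l_word_cnt]))
                disp_word_flag = False                 # step 411
            l_seg_cnt += 1                             # step 412
        l_word_cnt += 1                                # step 413
    return lines, labels


lines, labels = build_graph(
    words=["how", "are"], seg_counts=[3, 2],
    durations=[24, 18, 20, 15, 17], values=[104, 110, 98, 101, 95],
)
```

Each word label ends up at the X coordinate of the end of the word's first segment, because the flag is cleared immediately after the first line segment of the word is drawn.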

[0037] Next, the above processing is described more concretely using a specific example of data.

[0038] When the learner inputs voice using the microphone 15, the voice recognition means 31 performs voice recognition processing by techniques such as spectrum analysis. Here it is assumed that the voice data shown in FIG. 2 is obtained as the output of the voice recognition means 31.

[0039] The graphic display unit 32 takes this voice data as input and performs the following processing.

[0040] First, as preprocessing, the X axis and Y axis are drawn, and 0 is assigned to each of L_word_cnt, L_seg_cnt, L_x[0], and L_y[0] (step 401). Then 4, the number of words in the voice data, is assigned to L_word_num (step 402).

[0041] L_word_cnt is now compared with L_word_num (step 403). Since L_word_cnt is 0 and L_word_num is 4, the process proceeds to step 404.

[0042] In step 404, disp_word_flag, which indicates that the character string of the word about to be processed is to be displayed below the X axis when the next line segment is drawn, is set to ON. Next, the L_word_cnt-th element of the array of per-word segment counts in the voice data is added to L_seg_num. Here L_seg_num = 0 and L_word_cnt = 0. Since the 0th element of the array of per-word segment counts is 3, L_seg_num becomes 3 (step 405).

[0043] In step 406, L_seg_cnt is compared with L_seg_num. Since L_seg_cnt = 0 and L_seg_num = 3, the process proceeds to step 407.

[0044] Here the sample point for actually drawing the line graph is obtained. Specifically, the value of the 0th element of the array of per-segment utterance times in the voice data is added to the value of L_x[0], and the result, 24, is assigned to L_x[1]. If, for example, an intonation graph is being displayed, the value of the 0th element of the array of average per-segment pitch frequencies, 104, is assigned to L_y[1] (step 407). The line graph is then actually drawn: the line segment connecting the two points {(L_x[0], L_y[0]), (L_x[1], L_y[1])}, that is, {(0, 0), (24, 104)}, is displayed (step 408).
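The arithmetic of this step is a running sum of segment utterance times. As a quick check, the sketch below reproduces the first sample point (24, 104); only the first utterance time and the first pitch value come from the text, and the remaining numbers are placeholders.

```python
from itertools import accumulate

durations = [24, 18, 20]      # first value from the text, rest hypothetical
pitch_avg = [104, 110, 98]    # first value from the text, rest hypothetical

# L_x[0] = 0, then L_x[n + 1] = L_x[n] + durations[n]
l_x = [0] + list(accumulate(durations))
# L_y[0] = 0, then L_y[n + 1] = pitch_avg[n]
l_y = [0] + pitch_avg

first_line = ((l_x[0], l_y[0]), (l_x[1], l_y[1]))
# first_line == ((0, 0), (24, 104))
```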

[0045] Next, it is checked whether disp_word_flag is ON (step 409). Since it is currently ON, the word character string is displayed below the X axis: "how", the 0th element of the word character string array of the voice data, is displayed directly below (L_x[1], 0) (step 410). This character string does not need to be displayed again while "how" is being processed, so disp_word_flag is set to OFF (step 411).

[0046] Processing of the 0th segment is now complete, so 1 is added to L_seg_cnt, making it 1, and the process returns to step 406. The line graph is drawn for the 1st segment in the same way as above (steps 407 and 408).

[0047] In step 409, disp_word_flag is now OFF, so the word character string is not displayed and the process proceeds to step 412.

[0048] Suppose processing has been completed in this way through the 2nd segment. At this point L_seg_cnt is 3 (step 412). The comparison in step 406 finds that L_seg_cnt is now greater than or equal to L_seg_num, so the process proceeds to step 413, where 1 is added to L_word_cnt to move on to the next word, and then returns to step 403.

[0049] Similarly, suppose processing has been completed for all four words. L_word_cnt is then 4 (step 413), the comparison in step 403 finds that all words have been processed, and the processing of this example ends.

[0050] (Example 2) Example 2 is an embodiment of the voice recognition device of the second aspect of the present invention. In foreign-language pronunciation practice, the device graphically displays the intonation or stress of a model utterance of a given sentence in association with the recognized words, and also displays the intonation or stress of the learner's utterance of the same sentence, matched to the recognized words, normalized, and aligned word by word with the model's graph.

[0051] FIG. 7 is a block diagram of Example 2.

[0052] The voice recognition means 51 receives the learner's voice and outputs voice data as the recognition result; it has the same function as the voice recognition means 31 of Example 1. The word string obtained as a result of voice recognition is assumed to be identical, apart from silent portions, to the practice sentence character string 61 (see FIG. 8) of the practice sentence selected by the learner.

[0053] The practice sentence presentation unit 52 displays a list of practice sentences using the practice sentence numbers 60 and practice sentence character strings 61 in the teaching material file stored in the teaching material data 55, and prompts the learner to choose the sentence to practice and to enter his or her gender. When the learner has selected a practice sentence and entered a gender, the unit outputs the practice sentence number corresponding to the selected sentence together with the gender.

[0054] The model data acquisition unit 53 takes as input the output of the practice sentence presentation unit 52, namely the number of the practice sentence selected by the learner and the learner's gender, reads the teaching material data 55, and obtains the voice data file corresponding to the practice sentence number from the model voice data 56.

[0055] The normalization and graphic display unit 54, an example of the display means of the present invention, takes as input the learner's voice data output from the voice recognition means 51 and the model voice data from the model data acquisition unit 53, normalizes the utterance time of each pair of corresponding words, and graphically displays the learner's voice data and the model voice data (intonation, stress). FIGS. 15 and 16 show examples of the screen of such a graphic display means 54.

[0056] FIG. 8 shows an example of a teaching material file, which consists of a practice sentence number 61, a practice sentence character string 62, a voice data file name 63 for male learners, and a voice data file name 64 for female learners.
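How the model data acquisition unit might pick a voice data file from a teaching material record can be sketched as follows. The record fields follow the description of FIG. 8, but the class, the function, the sentence text, the file names, and the "male"/"female" encoding of the learner's gender are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PracticeEntry:
    number: int        # practice sentence number
    sentence: str      # practice sentence character string
    male_file: str     # voice data file name for male learners
    female_file: str   # voice data file name for female learners


# Invented teaching material data for illustration only.
MATERIALS: List[PracticeEntry] = [
    PracticeEntry(1, "how are you", "m001.dat", "f001.dat"),
    PracticeEntry(2, "nice to meet you", "m002.dat", "f002.dat"),
]


def model_file_for(number: int, gender: str) -> str:
    """Return the model voice data file for a sentence number and gender."""
    for entry in MATERIALS:
        if entry.number == number:
            return entry.male_file if gender == "male" else entry.female_file
    raise KeyError(f"no practice sentence numbered {number}")


chosen = model_file_for(2, "female")
```

Keeping separate male and female model files presumably lets the learner compare against a voice in a similar pitch range, although the text does not state the reason.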

[0057] FIG. 9 shows an example of a model voice data file, which stores model voice data with the same structure as the voice data obtained by recognizing the learner's voice. These model data are obtained in advance in the manner described for the first aspect of the present invention.

[0058] FIG. 10 shows an example of the processing flow of the normalization and graphic display unit 54.

[0059] Step 801 is preprocessing: the X axis and Y axis are drawn, and 0 is assigned to each of the following: L_word_cnt, which indicates the position of the word currently being processed in the learner's voice data; L_seg_cnt, which indicates the position of the segment currently being processed in the learner's voice data; L_x[0] and L_y[0], the 0th elements of the arrays of sample points for the line graph of the learner's intonation (stress); M_word_cnt, which indicates the position of the word currently being processed in the model voice data; M_seg_cnt, which indicates the position of the segment currently being processed in the model voice data; and M_x[0] and M_y[0], the 0th elements of the arrays of sample points for the line graph of the model's intonation (stress).

[0060] In step 8011, the number of words in the voice data of the voice recognition result is assigned to L_word_num.

[0061] In step 802, L_word_cnt is compared with L_word_num to check whether all words have been processed. If L_word_cnt is less than L_word_num, words remain to be processed and the process proceeds to step 803; otherwise all processing ends.

[0062] In step 803, the line graph and the corresponding word character string are displayed for the M_word_cnt-th word of the model voice data. An example of the detailed flow of this step is shown in FIGS. 11 and 12.

[0063] In step 805, the magnification for displaying the intonation or stress of the learner's L_word_cnt-th word in alignment with the model's intonation or stress is calculated. An example of the detailed flow of this processing is shown in FIG. 13.

【００６４】８０６では、L_word_cnt番目の学習者の単
語に対応するイントネーションまたはストレスを折れ線
グラフ表示する。この時８０５で求めた倍率により手本
の対応する単語と同じ位置（Ｘ座標）に表示する。At 806, the intonation or stress corresponding to the L_word_cnt th learner's word is displayed as a line graph. At this time, it is displayed at the same position (X coordinate) as the corresponding word in the model according to the magnification obtained in 805.

【００６５】８０７では、学習者、手本共に次の単語へ
処理を進めるため、L_word_cnt，M_word_cntの値にそれ
ぞれ１を加える。At 807, 1 is added to the values of L_word_cnt and M_word_cnt so that the learner and the example can proceed to the next word.
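The outer loop of steps 801 to 807 can be sketched as follows. This is a minimal illustration in Python, not the implementation described in the patent: the function name `display_all_words` and the three callables standing in for steps 803, 805, and 806 are hypothetical.

```python
def display_all_words(learner_word_count, draw_model_word, compute_rate, draw_learner_word):
    """Hypothetical sketch of the outer loop of FIG. 10 (steps 801 to 807)."""
    l_word_cnt = 0                      # step 801: word counters start at 0
    m_word_cnt = 0
    l_word_num = learner_word_count     # step 8011: number of recognized words
    while l_word_cnt < l_word_num:      # step 802: any words left?
        draw_model_word(m_word_cnt)             # step 803: model graph and word string
        rate = compute_rate(l_word_cnt)         # step 805: scale factor for this word
        draw_learner_word(l_word_cnt, rate)     # step 806: learner graph, rescaled
        l_word_cnt += 1                 # step 807: advance both counters
        m_word_cnt += 1
    return l_word_cnt
```

Passing the sub-steps in as callables keeps the sketch self-contained; in the flowcharts they are fixed subroutines that share the counters and arrays initialized at step 801.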

[0066] FIGS. 11 and 12 show an example of the detailed flow of step 803 in FIG. 10.

[0067] At step 901, disp_word_flag, which indicates that the word string is to be displayed below the X axis when the next line segment is drawn, is set to ON. In addition, 0 is assigned to the variable M_word_dur, which holds the time taken to utter this word.

[0068] At step 902, the M_word_cnt-th element of the array holding the number of segments of each word of the voice data is added to the variable M_seg_num, which indicates the segment index up to which the word currently being processed extends.

[0069] At step 903, M_seg_cnt is compared with M_seg_num. If M_seg_cnt is smaller than M_seg_num, the word currently being processed still has unprocessed segments, and the process proceeds to step 904. If M_seg_cnt is greater than or equal to M_seg_num, all segments of the word currently being processed have been handled, and the process ends.

[0070] At step 904, the sample point of the line graph to be displayed is obtained. That is, {M_x[M_seg_cnt] + (the M_seg_cnt-th element of the array of segment utterance times of the voice data)} is assigned to M_x[M_seg_cnt + 1], and either (the M_seg_cnt-th element of the array of mean segment pitch frequencies of the voice data) or (the M_seg_cnt-th element of the array of mean segment sound pressures of the voice data) is assigned to M_y[M_seg_cnt + 1]. The mean pitch frequency is used when the intonation line graph is displayed, and the mean sound pressure is used when the stress line graph is displayed.

[0071] At the same time, the M_seg_cnt-th element of the array of segment utterance times of the voice data is added to M_word_dur.

[0072] At step 905, the line graph is actually drawn. That is, the line segment connecting the two points {(M_x[M_seg_cnt], M_y[M_seg_cnt]), (M_x[M_seg_cnt + 1], M_y[M_seg_cnt + 1])} is displayed.

[0073] At step 906, it is determined whether to display the corresponding word string below the X axis. That is, disp_word_flag is checked; if it is ON, the process proceeds to step 907, and if it is OFF, to step 909.

[0074] At step 907, the character string that is the M_word_cnt-th element of the word string array 22 of the recognition result is displayed immediately below (M_x[M_seg_cnt], 0).

[0075] At step 908, disp_word_flag is set to OFF, because the word string needs to be displayed only once per word.

[0076] At step 909, 1 is added to M_seg_cnt to move on to the next segment, and the process returns to step 903.
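Steps 901 to 909 can be sketched as a single function. This is only an illustration under stated assumptions: the dictionary `state` (holding the M_x and M_y arrays and the counters M_seg_cnt and M_seg_num), the function name, and the choice to return the line segments and label instead of drawing them are all hypothetical.

```python
def draw_model_word(state, word_index, seg_counts, seg_durations, seg_values, words):
    """Hypothetical sketch of FIGS. 11 and 12 (steps 901 to 909) for one model word.

    state: dict with "x", "y" (sample-point lists starting at [0]),
           "seg_cnt" and "seg_num" (both starting at 0).
    Returns (line_segments, label, word_duration).
    """
    line_segments = []
    label = None
    show_word = True                               # step 901: disp_word_flag = ON
    word_dur = 0                                   # step 901: M_word_dur = 0
    state["seg_num"] += seg_counts[word_index]     # step 902
    while state["seg_cnt"] < state["seg_num"]:     # step 903
        k = state["seg_cnt"]
        # step 904: next sample point; x accumulates segment durations
        state["x"].append(state["x"][k] + seg_durations[k])
        state["y"].append(seg_values[k])           # mean pitch or mean sound pressure
        word_dur += seg_durations[k]
        # step 905: line segment between consecutive sample points
        line_segments.append(((state["x"][k], state["y"][k]),
                              (state["x"][k + 1], state["y"][k + 1])))
        if show_word:                              # step 906
            label = (state["x"][k], words[word_index])   # step 907: X position and text
            show_word = False                      # step 908: once per word
        state["seg_cnt"] += 1                      # step 909
    return line_segments, label, word_dur
```

Because `seg_cnt` and `seg_num` persist in `state` across calls, the second word continues from the segment where the first word stopped, which is how the flowchart's running counters behave.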

[0077] FIG. 13 shows an example of the detailed flow of step 805 in FIG. 10.

[0078] At step 1001, the L_word_cnt-th value of the array holding the number of segments of each word of the voice data is assigned to the variable L_word_seg_num, which indicates the number of segments of the learner's word currently being processed. In addition, the variable L_word_dur, which holds the time taken to utter this word, and the loop variable i are each initialized to 0.

[0079] At step 1002, i is compared with L_word_seg_num. If i is smaller than L_word_seg_num, the process proceeds to step 1003; if i is greater than or equal to L_word_seg_num, it proceeds to step 1005.

[0080] At step 1003, the duration of the i-th segment of the word currently being processed is added to L_word_dur. That is, the (L_seg_cnt + i)-th value of the array of segment utterance times of the learner's voice data is added to L_word_dur.

[0081] At step 1004, 1 is added to i in order to obtain the utterance time of the next segment.

[0082] At step 1005, the scale factor for displaying the learner's word currently being processed is obtained, and the process ends. That is, (M_word_dur / L_word_dur) is assigned to the variable rate, which holds the scale factor.

[0083] FIG. 14 shows an example of the detailed flow of step 806 in FIG. 10.

[0084] At step 1101, the value of the L_word_cnt-th element of the array holding the number of segments of each word of the voice data is added to the variable L_seg_num, which indicates the overall segment index up to which the word currently being processed extends.

[0085] At step 1102, it is checked whether all segments of the word currently being processed have been handled. That is, L_seg_cnt is compared with L_seg_num; if L_seg_cnt is smaller than L_seg_num, the process proceeds to step 1103, and otherwise the process ends.

[0086] At step 1103, the sample point of the line graph to be displayed is obtained. That is, {L_x[L_seg_cnt] + (the L_seg_cnt-th element of the array of segment utterance times of the voice data) * rate} is assigned to L_x[L_seg_cnt + 1], and either (the L_seg_cnt-th element of the array of mean segment pitch frequencies of the voice data) or (the L_seg_cnt-th element of the array of mean segment sound pressures of the voice data) is assigned to L_y[L_seg_cnt + 1]. The mean pitch frequency is used when the intonation line graph is displayed, and the mean sound pressure is used when the stress line graph is displayed.

[0087] At step 1104, the line graph is actually drawn. That is, the line segment connecting the two points {(L_x[L_seg_cnt], L_y[L_seg_cnt]), (L_x[L_seg_cnt + 1], L_y[L_seg_cnt + 1])} is displayed.

[0088] At step 1105, 1 is added to L_seg_cnt to move on to the next segment, and the process returns to step 1102.
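Steps 1101 to 1105 can be sketched as follows. As with the model side, this is an illustration only: the `state` dictionary and the function name are hypothetical, and the line segments are returned instead of drawn.

```python
def draw_learner_word(state, word_index, seg_counts, seg_durations, seg_values, rate):
    """Hypothetical sketch of FIG. 14 (steps 1101 to 1105) for one learner word.

    state: dict with "x", "y" (sample-point lists starting at [0]),
           "seg_cnt" and "seg_num" (both starting at 0).
    rate: scale factor M_word_dur / L_word_dur for this word.
    """
    line_segments = []
    state["seg_num"] += seg_counts[word_index]     # step 1101
    while state["seg_cnt"] < state["seg_num"]:     # step 1102
        k = state["seg_cnt"]
        # step 1103: each segment duration is rescaled before accumulating
        state["x"].append(state["x"][k] + seg_durations[k] * rate)
        state["y"].append(seg_values[k])           # mean pitch or mean sound pressure
        # step 1104: line segment between consecutive sample points
        line_segments.append(((state["x"][k], state["y"][k]),
                              (state["x"][k + 1], state["y"][k + 1])))
        state["seg_cnt"] += 1                      # step 1105
    return line_segments
```

Since the rescaled durations of a word sum to L_word_dur * rate = M_word_dur, the learner's word ends at the same X coordinate as the model's word, whatever the learner's segment count.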

[0089] Next, the above processing is described using a concrete example of data.

[0090] Assume that the practice sentence presenting unit 52 has presented a list of practice sentences to the learner based on the teaching material file of FIG. 8.

[0091] Assume further that the learner has selected "how do you do" as the practice sentence and that the learner is male.

[0092] The model data acquisition unit 53 refers to the teaching material file 55 and obtains "M_m1.dat", the file name of the male model voice data for "how do you do". The contents of "M_m1.dat" are assumed to be identical to the example model data file shown in FIG. 9.

[0093] Assume also that the output of the voice recognition means 51 for the learner's voice input is as shown in FIG. 2.

[0094] The normalization and graphic display unit 54 takes the learner voice data and the model voice data as input and performs the following processing.

[0095] First, the X and Y axes are displayed and the variables are initialized (step 801).

[0096] Next, the number of words in the learner's voice data is assigned to L_word_num. Here, 4 is assigned (step 8011).

[0097] L_word_cnt and L_word_num are then compared (step 802). Since L_word_cnt is now 0 and L_word_num is 4, the process proceeds to step 803.

[0098] At step 803, the intonation (stress) corresponding to the L_word_cnt-th word of the model voice data is displayed.

[0099] That is, at step 901, the variable M_word_dur, which holds the utterance time of that word in the model, is initialized to 0, and disp_word_flag, which indicates that the string of the word about to be processed is to be displayed below the X axis when the next line segment is drawn, is set to ON.

[0100] Next, the L_word_cnt-th element of the array holding the number of segments of each word of the voice data is added to M_seg_num. Here, M_seg_num = 0 and L_word_cnt = 0. Since the 0th element of the array is 2, M_seg_num becomes 2 (step 902).

[0101] At step 903, M_seg_cnt is compared with M_seg_num. Since M_seg_cnt = 0 and M_seg_num = 2, the process proceeds to step 904.

[0102] Here, the sample point for actually drawing the line graph is obtained. That is, the sum of the value of M_x[0] and the value of the (M_seg_cnt = 0)-th element of the array of segment utterance times of the voice data, namely 14, is assigned to M_x[1]. If, for example, the intonation graph is being displayed, 114, the value of the 0th element of the array of mean segment pitch frequencies of the voice data, is assigned to M_y[1]. Furthermore, 14, the (M_seg_cnt = 0)-th value of the array of segment utterance times of the model voice data, is added to the current M_word_dur, making M_word_dur 14 (step 904).

[0103] The line graph is then actually drawn. That is, the line segment connecting the two points {(0, 0), (14, 114)}, which are the values of {(M_x[0], M_y[0]), (M_x[1], M_y[1])}, is displayed (step 905).

[0104] Next, it is checked whether disp_word_flag is ON (step 906). Since it is ON, processing for displaying the corresponding word string below the X axis is performed. That is, "how", the 0th value of the word string array of the voice data, is displayed immediately below (x[1], 0) (step 907). Since this string need not be displayed again while "how" is being processed, disp_word_flag is set to OFF (step 908).

[0105] Since processing of the 0th segment is complete, 1 is added to M_seg_cnt, making it 1 (step 909), and the process returns to step 903. The line graph is drawn for the 1st segment in the same way as above (steps 905 and 906).

[0106] At step 906, disp_word_flag is OFF, so the word string is not displayed and the process proceeds to step 909.

[0107] At this point, M_seg_cnt becomes 2 (step 909). As a result of the comparison at step 903, M_seg_cnt is now greater than or equal to M_seg_num, so processing of the 0th word of the model ends and the process proceeds to step 805. At this point, M_word_dur = 27.

[0108] Next, the scale factor for displaying the data of the segments corresponding to the learner's L_word_cnt-th word in alignment with the model is calculated (step 805).

[0109] That is, 3, the (L_word_cnt = 0)-th value of the array holding the number of segments of each word of the voice data, is assigned to the variable L_word_seg_num, which indicates the number of segments of the learner's L_word_cnt-th word, and 0 is assigned to each of L_word_dur and i.

[0110] Next, i is compared with L_word_seg_num to check whether all segments of the (L_word_cnt = 0)-th word have been processed. Since i = 0 and L_word_seg_num = 3, the process proceeds to step 1003.

[0111] Here, 24, the sum of the current L_word_dur = 0 and 24, the (L_seg_cnt + i) = 0th value of the array of segment utterance times of the voice data, is assigned to L_word_dur (step 1003).

[0112] Then 1 is added to i to examine the utterance time of the next segment (step 1004).

[0113] Assume that processing has been completed in this way through the 2nd segment. At this point, i = 3 and L_word_dur = 46, and the comparison at step 1002 sends the process to step 1005.

[0114] At step 1005, (M_word_dur = 27) / (L_word_dur = 46) is assigned to rate. That is, approximately 0.59 is assigned to rate.

[0115] Now that the scale factor for displaying the learner's voice data has been obtained (step 805), it is used to graphically display the intonation (stress) corresponding to the learner's (L_word_cnt = 0)-th word (step 806).

[0116] At step 1101, the L_word_cnt-th element of the array holding the number of segments of each word of the voice data is added to L_seg_num. Here, L_seg_num = 0 and L_word_cnt = 0. Since the 0th element of the array is 3, L_seg_num becomes 3.

[0117] At step 1102, L_seg_cnt is compared with L_seg_num. Since L_seg_cnt = 0 and L_seg_num = 3, the process proceeds to step 1103.

[0118] Here, the sample point for actually drawing the line graph is obtained. That is, the sum of the value of L_x[0] and the product of the 0th element of the array of segment utterance times of the voice data and the scale factor rate is assigned to L_x[1]. Here, 0 + 24 * 0.59 = 14.16 is assigned. If, for example, the intonation graph is being displayed, 104, the value of the 0th element of the array of mean segment pitch frequencies of the voice data, is assigned to L_y[1] (step 1103). The line graph is then actually drawn. That is, the line segment connecting the two points {(0, 0), (14.16, 104)}, which are the values of {(L_x[0], L_y[0]), (L_x[1], L_y[1])}, is displayed (step 1104).

[0119] Since display of the data for the learner's (L_seg_cnt = 0)-th segment is complete, 1 is added to L_seg_cnt to advance to the next segment (step 1105).

[0120] Processing continues in this way, and when processing through the learner's 2nd segment is complete, L_seg_cnt = 3. As a result of the comparison at step 1102, processing of the learner's (L_word_cnt = 0)-th word ends, and the process proceeds to step 807.

[0121] At step 807, 1 is added to L_word_cnt, making it 1, to advance to the next word, and the process returns to step 802.

[0122] Processing continues in this way, and when processing through the 3rd word is complete, L_word_cnt = 4, and the comparison at step 802 ends all processing of this embodiment.

[0123] As the third aspect of the present invention, instead of, or in addition to, normalizing and displaying the correspondence between the acoustic feature amounts and the language symbols of the model speaker and that of the learner as in the above embodiment, comparison means may compare the two correspondences and, for the portions that do not match, display an instruction such as to lower the intonation of a certain word. Reference numeral 130 in FIG. 16 is an example of such an instruction.
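One way such a comparison could produce instructions is sketched below. The patent does not specify the comparison rule, so everything here is an assumption made for illustration: the function name `intonation_hints`, the use of one mean pitch value per word, the `tolerance` threshold, and the "raise" instruction (the text only mentions lowering).

```python
def intonation_hints(words, model_pitch, learner_pitch, tolerance=10.0):
    """Hypothetical comparison: one hint per word whose mean pitch
    differs from the model's by more than `tolerance`."""
    hints = []
    for word, model, learner in zip(words, model_pitch, learner_pitch):
        if learner - model > tolerance:
            hints.append(f'Lower the intonation of "{word}"')
        elif model - learner > tolerance:
            hints.append(f'Raise the intonation of "{word}"')
    return hints


# Illustrative values only; they are not taken from FIG. 2 or FIG. 9.
print(intonation_hints(["how", "do"], [114.0, 100.0], [130.0, 98.0]))
```

A per-word threshold keeps small, inaudible differences from generating instructions; a real system would likely need speaker normalization of pitch before any such comparison.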

[0124] The model voice data and the like need not be stored in the storage means in advance; the device may instead allow a native speaker or the like to input voice at any time so that the correspondence between the predetermined acoustic feature amount of the voice signal and the corresponding language symbols is obtained.

[0125] The acoustic feature amount of the present invention is not limited to intonation and stress, and may be another acoustic feature amount such as formants.

[0126] The language symbol of the present invention may be any symbol relating to language, such as a word, a phonetic symbol, a phoneme code, or a phoneme segment.

[0127] Each means of the present invention may be implemented in software using a computer, or by dedicated hardware circuits having those functions.

[0128]

[Effects of the Invention] As is apparent from the above description, the first aspect of the present invention includes correspondence display means for displaying a predetermined acoustic feature amount of a voice signal in association with the corresponding language symbols, and therefore has the advantage that the correspondence between the acoustic feature amount and the language symbols is easy to understand.

[0129] The second aspect of the present invention includes correspondence means for associating a predetermined acoustic feature amount of a voice signal with the corresponding language symbols, and comparison means for comparing, for at least two kinds of voice, the correspondence obtained by the correspondence means. Therefore, if, for example, one kind of voice is that of a native speaker and the other is that of a learner, comparing them makes it possible to point out accurately where the learner falls short. The third aspect of the present invention includes correspondence means for associating a predetermined acoustic feature amount of a voice signal with the corresponding language symbols, and display means for normalizing and displaying, for at least two kinds of voice, the correspondence obtained by the correspondence means. Therefore, if, for example, one kind of voice is that of a native speaker and the other is that of a learner, their correspondences are normalized and displayed, so that the learner's weak points can be understood accurately by viewing the display.

[Brief description of drawings]

FIG. 1 is a block diagram showing an embodiment of the voice recognition device of the present invention.

FIG. 2 shows voice data of the voice recognition means of the present invention.

FIG. 3 is a block diagram showing an embodiment of the voice recognition device of the first aspect of the present invention.

FIG. 4 is a flowchart showing the operation of the voice recognition means of the present invention.

FIG. 5 is a flowchart for explaining the operation of the correspondence display means of the first aspect of the present invention.

FIG. 6 is a flowchart for explaining the operation of the correspondence display means of the first aspect of the present invention.

FIG. 7 is a block diagram showing an embodiment of the voice recognition device of the second aspect of the present invention.

FIG. 8 is a configuration diagram showing an example of the teaching material file of the same embodiment.

FIG. 9 is a configuration diagram showing an example of the model voice data file of the same embodiment.

FIG. 10 is a flowchart for explaining the operation of the same embodiment.

FIG. 11 is a detailed flowchart of step 803 in FIG. 10.

FIG. 12 is a detailed flowchart of step 803 in FIG. 10.

FIG. 13 is a detailed flowchart of step 805 in FIG. 10.

FIG. 14 is a detailed flowchart of step 806 in FIG. 10.

FIG. 15 is a diagram showing a display screen of the voice recognition device of the second aspect of the present invention.

FIG. 16 is a diagram showing a display screen of the voice recognition device of the second aspect of the present invention.

[Explanation of Reference Signs] 30: input means; 31: voice recognition means; 32: graphic display unit; 50: input means; 51: voice recognition means; 52: practice sentence presenting unit; 53: model data acquisition unit; 54: normalization and graphic display unit; 55: teaching material data; 56: model voice data

Continued from the front page: (72) Inventor Tadamichi Tagawa, c/o Oki Electric Industry Co., Ltd., 1-7-12 Toranomon, Minato-ku, Tokyo. (72) Inventor Masaaki Kato, c/o Oki Techno Systems Laboratory Co., Ltd., 3-8-10 Uchiyama, Chikusa-ku, Nagoya, Aichi Prefecture.

Claims

1. A voice recognition device comprising: input means for inputting voice; recognition means for performing voice recognition on the voice signal to obtain language symbols; and correspondence display means for displaying a predetermined acoustic feature amount of the voice signal in association with the corresponding language symbols.

2. A voice recognition device comprising: input means for inputting the voice of an unspecified speaker; recognition means for performing voice recognition on the voice signal to obtain language symbols; and display means for normalizing and displaying, for at least two kinds of voice, the content in which a predetermined acoustic feature amount of the voice signal is associated with the corresponding language symbols.

3. A voice recognition device comprising: input means for inputting the voice of an unspecified speaker; recognition means for performing voice recognition on the voice signal to obtain language symbols; and comparison means for comparing, for at least two kinds of voice, the content in which a predetermined acoustic feature amount of the voice signal is associated with the corresponding language symbols.

4. The voice recognition device according to claim 2, wherein the two kinds of voice are a model voice and a learner's voice, and the device is used for education.

5. The voice recognition device according to claim 4, wherein the correspondence content of the model voice is acquired in advance and stored in storage means.