JP4811993B2 - Audio processing apparatus and program

Info

Publication number: JP4811993B2
Application number: JP2005241264A
Authority: JP
Inventors: 玲子山田; 秀行渡辺; 博章田川; 宏明加藤
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2005-08-23
Filing date: 2005-08-23
Publication date: 2011-11-09
Anticipated expiration: 2025-08-23
Also published as: JP2007057692A

Abstract

PROBLEM TO BE SOLVED: A conventional speech processing apparatus cannot perform speech processing adapted to the speaker characteristics of the evaluation subject, who is the speaker of the speech, and as a result cannot process speech with high accuracy.
SOLUTION: The speech processing apparatus comprises: a sampling unit that acquires first speech data by sampling received speech at a stored first sampling frequency; a vocal tract length normalization processing unit that stores the formant frequency of the evaluation subject, who is the speaker of the speech received by a speech reception unit, and that acquires second speech data by performing sampling processing on the received speech at a second sampling frequency, namely "first sampling frequency / (formant frequency of teacher data / formant frequency of the evaluation subject)"; and a speech processing unit that processes the second speech data. Speech processing adapted to the speaker characteristics is thereby performed.
COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a speech processing apparatus and the like that evaluates input speech or recognizes input speech.

As a conventional technique, there is the following speech processing apparatus (see Patent Document 1). This apparatus is a language learning apparatus that compares the learner's pronunciation of a selected role with reference data, scores it according to the degree of agreement, displays the score, and automatically displays an appropriate next screen depending on the score, thereby improving learning efficiency. In this conventional apparatus, the input speech signal is analyzed by speech recognition technology, after which the spectrum and intonation of the learner's pronunciation appear in a learner pronunciation display box. The standard sound data is then compared with the spectrum and intonation of the learner's pronunciation, and a score is displayed.

As another conventional technique, there is the following speech processing apparatus (see Patent Document 2). This apparatus is a singing voice evaluation apparatus comprising: extraction means that extracts the frequency components of a singing voice; specific frequency component extraction means that extracts a fundamental frequency component and harmonic frequency components from the extracted frequency components; and evaluation means that calculates an evaluation value for the singing voice according to the ratio of the harmonic frequency components to the fundamental frequency component extracted by the specific frequency component extraction means. The apparatus judges voice quality on the basis of the frequency components of the singing voice and reflects that judgment in the singing score, with the aim of bringing the scoring of singing voices closer to human perception.

As a further conventional technique, there is the following speech processing apparatus (see Patent Document 3). This apparatus is a speech recognition apparatus that matches an input speech pattern against standard patterns using the DP method and takes the standard pattern with the smallest matching distance as the recognition result. It is characterized in that it divides the input pattern into phonemes using the matching result, calculates the variance of the deviation between the duration of each phoneme and the standard duration, and corrects the distance by adding this variance to the matching distance. A division unit divides the input into phonemes using the matching result, a duration deviation calculation unit calculates the variance of the deviation from the standard duration, and a distance correction unit corrects the matching distance. The apparatus also has a phoneme selection unit that selects the phonemes for which the duration deviation is calculated and a word selection unit that selects the words whose distance is corrected, and is said to improve word recognition performance.
Patent Document 1: JP 2003-228279 A (page 1, FIG. 1, etc.)
Patent Document 2: JP 2005-107088 A (page 1, FIG. 1, etc.)
Patent Document 3: JP H06-4096 A (page 1, FIG. 1, etc.)

However, the conventional techniques of Patent Documents 1 and 2 cannot perform speech processing adapted to the speaker characteristics of the evaluation subject, who is the speaker of the speech (including singing voice), and as a result cannot achieve highly accurate speech processing. Specifically, the spectral envelope stretches toward higher frequencies or contracts toward lower frequencies depending on the evaluation subject's vocal tract length, and in conventional speech processing apparatuses such as pronunciation rating apparatuses and singing voice evaluation apparatuses this stretching changes the evaluation result. In other words, even for equally skilled pronunciation or singing, the evaluation result differs with the evaluation subject's vocal tract length, so highly accurate evaluation is not possible.

In addition, because the speech processing apparatus of Patent Document 1 displays a score by comparing the standard sound data with the spectrum and intonation of the learner's pronunciation, the accuracy of the similarity rating is low, and displaying the score quickly in real time requires a CPU with very high processing power and a large amount of memory.

Moreover, in the speech processing apparatus of Patent Document 1, a silent section is thought to lower the rated similarity, so the accuracy of the evaluation is low. The apparatus also cannot detect the occurrence of special events such as phoneme substitution, insertion, or omission.

Furthermore, in a speech processing apparatus that performs speech recognition, such as that of Patent Document 3, the spectral envelope stretches or contracts depending on the evaluation subject's vocal tract length, but the recognition processing is not adapted to these speaker characteristics, so highly accurate speech recognition is not possible.

The speech processing apparatus of the first invention comprises: a teacher data storage unit that stores one or more items of teacher data, which is data on the speech to be compared against and is data for each of one or more phonemes; a speech reception unit that receives speech; a first sampling frequency storage unit that stores a first sampling frequency; a sampling unit that samples the speech received by the speech reception unit at the first sampling frequency to acquire first speech data; a teacher data formant frequency storage unit that stores a teacher data formant frequency, which is the formant frequency of the teacher data; an evaluation subject formant frequency storage unit that stores an evaluation subject formant frequency, which is the formant frequency of the evaluation subject who is the speaker of the speech received by the speech reception unit; a vocal tract length normalization processing unit that performs sampling processing on the speech received by the speech reception unit at a second sampling frequency, "first sampling frequency / (teacher data formant frequency / evaluation subject formant frequency)", to obtain second speech data; and a speech processing unit that processes the second speech data.
With this configuration, highly accurate speech processing adapted to the speaker characteristics of the evaluation subject can be performed.
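As a reading aid, the second sampling frequency of the first invention can be written as a formula. The symbols f_{s1} (first sampling frequency), f_{s2} (second sampling frequency), F_teacher, and F_subject are notation introduced here for illustration and do not appear in the original text.

```latex
f_{s2} \;=\; \frac{f_{s1}}{F_{\mathrm{teacher}} / F_{\mathrm{subject}}}
       \;=\; f_{s1}\,\frac{F_{\mathrm{subject}}}{F_{\mathrm{teacher}}}
```

With hypothetical values f_{s1} = 16000 Hz, F_teacher = 800 Hz, and F_subject = 1000 Hz, the ratio in the denominator is 0.8 and f_{s2} = 16000 / 0.8 = 20000 Hz.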

In the speech processing apparatus of the second invention, relative to the first invention, the speech processing unit comprises: frame division means that divides the second speech data into frames; frame speech data acquisition means that obtains one or more items of frame speech data, which is the speech data of each divided frame; rating means that rates the speech received by the speech reception unit on the basis of the teacher data and the one or more items of frame speech data; and output means that outputs the rating result of the rating means.
With this configuration, speech can be rated with high accuracy in a manner adapted to the speaker characteristics of the evaluation subject.
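The frame division step of the second invention could look roughly like the following. This is a minimal sketch, not the patented implementation: the function name `split_into_frames`, the fixed frame length, and the hop size are all assumptions, since the text here does not specify frame parameters.

```python
from typing import List


def split_into_frames(samples: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split speech samples into fixed-length frames.

    frame_len and hop are hypothetical parameters (in samples); consecutive
    frames overlap when hop < frame_len. A trailing partial frame is dropped.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames
```

Each returned frame corresponds to one item of "frame speech data" that a rating step would then compare against the teacher data.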

In the speech processing apparatus of the third invention, relative to the second invention, the rating means comprises: optimal state determination means that determines the optimal state for at least one of the one or more items of frame speech data; optimal state probability value acquisition means that acquires the probability value in the optimal state determined by the optimal state determination means; and rating value calculation means that calculates a rating value for the speech using the probability value acquired by the optimal state probability value acquisition means as a parameter.
With this configuration, speech can be rated with high accuracy in a manner adapted to the speaker characteristics of the evaluation subject.

In the speech processing apparatus of the fourth invention, relative to the second invention, the rating means comprises: optimal state determination means that determines the optimal states of the one or more items of frame speech data; pronunciation section frame phoneme probability value acquisition means that acquires, for each pronunciation section, one or more probability values over all states of the phoneme that contains the optimal state of each frame determined by the optimal state determination means; and rating value calculation means that calculates a rating value for the speech using, as parameters, the one or more probability values for each of the one or more pronunciation sections acquired by the pronunciation section frame phoneme probability value acquisition means.
With this configuration, speech can be rated with even higher accuracy in a manner adapted to the speaker characteristics of the evaluation subject.

In the speech processing apparatus of the fifth invention, relative to the second invention, the speech processing unit further comprises special speech detection means that detects, on the basis of the input speech data of each frame, that special speech has been input, and the rating means rates the speech received by the speech reception unit on the basis of the teacher data, the input speech data, and the detection result of the special speech detection means.
With this configuration, speech can be rated with high accuracy in a manner adapted to the speaker characteristics of the evaluation subject, and special speech can be detected and rated accordingly.

In the speech processing apparatus of the sixth invention, relative to the fifth invention, the special speech detection means comprises: silence data storage means that stores silence data, which is HMM-based data representing silence; and silent section detection means that detects a silent section on the basis of the input speech data and the silence data.
With this configuration, speech can be rated with high accuracy in a manner adapted to the speaker characteristics of the evaluation subject, and silent sections can be detected and the speech rated accordingly.

In the speech processing apparatus of the seventh invention, relative to the fifth invention, the special speech detection means detects that the rating values of the second half of one phoneme and the first half of the following phoneme satisfy a predetermined condition, and the rating means, when the special speech detection means detects that the predetermined condition is satisfied, produces a rating result indicating at least that a phoneme has been inserted.
With this configuration, speech can be rated with high accuracy in a manner adapted to the speaker characteristics of the evaluation subject, and phoneme insertion can be detected and the speech rated accordingly.

In the speech processing apparatus of the eighth invention, relative to the seventh invention, the special speech detection means detects that the rating value of one phoneme satisfies a predetermined condition, and the rating means, when the special speech detection means detects that the predetermined condition is satisfied, produces a rating result indicating at least that a phoneme has been substituted or omitted.
With this configuration, speech can be rated with high accuracy in a manner adapted to the speaker characteristics of the evaluation subject, and phoneme substitution or omission can be detected and the speech rated accordingly.

In the speech processing apparatus of the ninth invention, relative to any of the second to eighth inventions, the speech processing apparatus is a karaoke evaluation apparatus, the speech reception unit receives input of the evaluation subject's singing voice, and the speech processing unit evaluates the singing voice.
With this configuration, the apparatus can be used as a karaoke evaluation apparatus.

In the speech processing apparatus of the tenth invention, relative to the ninth invention, the frame division means divides the speech into frames and also divides the second speech data into frames; the frame speech data acquisition means obtains one or more items of first frame speech data, which is the speech data of each frame into which the speech was divided, and one or more items of second frame speech data, which is the speech data of each frame into which the second speech data was divided; and the rating means comprises first rating means that rates the speech received by the speech reception unit on the basis of the teacher data and the one or more items of first frame speech data, second rating means that rates the speech received by the speech reception unit on the basis of the teacher data and the one or more items of second frame speech data, and rating result acquisition means that obtains a final rating result on the basis of the rating result of the first rating means and the rating result of the second rating means.
With this configuration, the apparatus can be used as an excellent karaoke evaluation apparatus.

In the speech processing apparatus of the eleventh invention, relative to the ninth or tenth invention, the speech reception unit receives the speech of a predetermined vowel and then receives input of the evaluation subject's singing voice; the sampling unit also samples the vowel speech at the first sampling frequency; the apparatus further comprises an evaluation subject formant frequency acquisition unit that acquires, on the basis of the sampled vowel speech, the evaluation subject formant frequency, which is the formant frequency of the evaluation subject; and the evaluation subject formant frequency in the evaluation subject formant frequency storage unit is the evaluation subject formant frequency acquired by the evaluation subject formant frequency acquisition unit.
With this configuration, speech can be rated with high accuracy in a manner adapted to the speaker characteristics of the evaluation subject.
In the speech processing apparatus of the twelfth invention, relative to the first invention, the speech processing unit performs speech recognition processing on the basis of the second speech data.
With this configuration, highly accurate speech recognition adapted to the speaker characteristics of the evaluation subject can be performed.

The speech processing apparatus according to the present invention can perform highly accurate speech processing adapted to the speaker characteristics of the evaluation subject.

Hereinafter, embodiments of the speech processing apparatus and the like will be described with reference to the drawings. Components given the same reference numerals in the embodiments operate in the same way, so repeated description may be omitted.
(Embodiment 1)

This embodiment describes a speech processing apparatus that can rate the similarity between the speech to be compared against and the input speech with high accuracy and at high speed. The apparatus is a pronunciation rating apparatus that evaluates speech (including singing). In particular, because the apparatus uses dynamic programming to compute the posterior probability of the optimal state for each frame of the input speech, this posterior probability is called DAP (Dynamic A Posteriori Probability), and the DAP-based similarity calculation method and pronunciation rating apparatus are called DAPS.
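The idea of finding the optimal state per frame by dynamic programming and then taking that state's posterior probability can be sketched as below. This is only an illustration under stated assumptions: the passage above does not give the DAP formula, so the sketch uses a standard Viterbi search and assumes equal state priors when turning emission likelihoods into a per-frame posterior. The function names and the log-probability table layout are hypothetical.

```python
import math
from typing import List


def viterbi_path(log_emit: List[List[float]],
                 log_trans: List[List[float]],
                 log_init: List[float]) -> List[int]:
    """Most likely state sequence for log_emit[t][s], found by dynamic programming."""
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at frame t
    for t in range(1, len(log_emit)):
        new_score, ptr = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best] + log_trans[best][s] + log_emit[t][s])
            ptr.append(best)
        score = new_score
        back.append(ptr)
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):  # trace the best path backwards
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def optimal_state_posteriors(log_emit: List[List[float]], path: List[int]) -> List[float]:
    """Per-frame posterior of the optimal state, ASSUMING equal state priors."""
    posteriors = []
    for t, s in enumerate(path):
        m = max(log_emit[t])  # subtract the maximum for numerical stability
        denom = sum(math.exp(v - m) for v in log_emit[t])
        posteriors.append(math.exp(log_emit[t][s] - m) / denom)
    return posteriors
```

A rating value could then be derived from these per-frame posteriors, for example by averaging them; that aggregation is likewise an assumption rather than something stated in this passage.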

The speech processing apparatus of this embodiment can be used, for example, for language learning, impersonation practice, and karaoke rating. FIG. 1 is a block diagram of the speech processing apparatus of this embodiment. The apparatus comprises an input reception unit 101, a teacher data storage unit 102, a speech reception unit 103, a teacher data formant frequency storage unit 104, a first sampling frequency storage unit 105, a sampling unit 106, an evaluation subject formant frequency acquisition unit 107, an evaluation subject formant frequency storage unit 108, a vocal tract length normalization processing unit 109, and a speech processing unit 110.
The speech processing unit 110 comprises frame division means 1101, frame speech data acquisition means 1102, rating means 1103, and output means 1104.
The rating means 1103 comprises optimal state determination means 11031, optimal state probability value acquisition means 11032, and rating value calculation means 11033.

The speech processing apparatus accepts input from input means such as a keyboard 342 and a mouse 343. It also accepts speech input from speech input means such as a microphone 345, and it outputs information to an output device such as a display 344.

The input reception unit 101 accepts inputs such as an operation start instruction that starts the operation of the speech processing apparatus, an output mode change instruction that changes how the rating result of the input speech is output, and an end instruction that ends processing. Any input means may be used for these instructions, such as a numeric keypad, a keyboard, a mouse, or a menu screen. The input reception unit 101 can be realized by a device driver for input means such as a numeric keypad or keyboard, by control software for a menu screen, or the like.

The teacher data storage unit 102 stores, as teacher data, one or more items of data on the speech to be compared against, namely data based on a hidden Markov model (HMM) for each phoneme. The teacher data is preferably data based on an HMM formed by concatenating the per-phoneme HMMs, and more preferably data based on an HMM in which the HMMs corresponding to the phonemes of the input speech are concatenated in input order. However, the teacher data need not be based on concatenated per-phoneme HMMs; it may simply be the set of HMMs for all phonemes. Nor does the teacher data have to be HMM-based: it may be based on another model, such as a single Gaussian distribution model, a probability model (GMM: Gaussian mixture model), or a statistical model. HMM-based data has, for example, a state identifier and transition probability information for each frame. The HMM-based data may also be, for example, a model learned (estimated) from two or more items of data uttered by a plurality of foreign speakers whose native language is the language being learned. The teacher data storage unit 102 is preferably a non-volatile recording medium such as a hard disk or ROM, but can also be realized by a volatile recording medium such as RAM.
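Concatenating per-phoneme HMMs in input order can be pictured with the following sketch. It models only the resulting state sequence, not emission or transition parameters; the function name, the `(phoneme, state_index)` labels, and the three-states-per-phoneme figure used in the usage note are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def concatenate_phoneme_hmms(states_per_phoneme: Dict[str, int],
                             phonemes: List[str]) -> List[Tuple[str, int]]:
    """Return the left-to-right state chain for a phoneme sequence.

    states_per_phoneme maps a phoneme label to the number of states in its
    HMM (hypothetical data); the chain lists (phoneme, state_index) pairs in
    the order the phonemes are expected to be spoken.
    """
    chain = []
    for phoneme in phonemes:
        for k in range(states_per_phoneme[phoneme]):
            chain.append((phoneme, k))
    return chain
```

For instance, with three states per phoneme, the sequence `["r", "ai", "t"]` yields a nine-state chain whose fourth state is `("ai", 0)`.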

The speech reception unit 103 receives speech. It can be realized, for example, by driver software for the microphone 345, or it may be regarded as being realized by the microphone 345 together with its driver. The speech may be input from the microphone 345 or read from a recording medium such as a magnetic tape or CD-ROM.

The teacher data formant frequency storage unit 104 stores the teacher data formant frequency, which is the formant frequency of the teacher data. The teacher data formant frequency may be the first formant frequency (F1), the second formant frequency (F2), the third formant frequency (F3), or the like. It may be stored in advance or acquired dynamically from the teacher data at evaluation time. Techniques for obtaining a formant frequency from speech data are well known, so their description is omitted. The teacher data formant frequency storage unit 104 is preferably a non-volatile recording medium, but can also be realized by a volatile recording medium.

The first sampling frequency storage unit 105 stores the first sampling frequency, which is the sampling frequency used when the evaluation subject's speech is first sampled. The first sampling frequency storage unit 105 is preferably a non-volatile recording medium, but can also be realized by a volatile recording medium.

The sampling unit 106 samples the speech received by the speech reception unit 103 at the first sampling frequency held in the first sampling frequency storage unit 105 and acquires first speech data. Techniques for sampling received speech are well known, so a detailed description is omitted. The sampling unit 106 can normally be realized by an MPU, memory, and the like. Its processing procedure is normally implemented in software recorded on a recording medium such as a ROM, but it may also be implemented in hardware (a dedicated circuit).

The evaluation subject formant frequency acquisition unit 107 acquires the evaluation subject formant frequency, which is the formant frequency of the evaluation subject, from the first speech data acquired by the sampling unit 106. The evaluation subject formant frequency may likewise be the first formant frequency (F1), the second formant frequency (F2), or the third formant frequency (F3), but it must be the same type of formant frequency as the teacher data formant frequency. Techniques for obtaining a formant frequency from sampled speech data are well known, so a detailed description is omitted. The evaluation subject formant frequency acquisition unit 107 can normally be realized by an MPU, memory, and the like. Its processing procedure is normally implemented in software recorded on a recording medium such as a ROM, but it may also be implemented in hardware (a dedicated circuit).

The evaluation subject formant frequency storage unit 108 stores, at least temporarily, the evaluation subject formant frequency, which is the formant frequency of the evaluation subject who is the speaker of the voice received by the voice receiving unit 103. The stored value is usually the formant frequency acquired by the evaluation subject formant frequency acquisition unit 107, but the evaluation subject formant frequency may instead be stored in advance. When it is stored in advance, the speech processing apparatus does not need the evaluation subject formant frequency acquisition unit 107. The evaluation subject formant frequency storage unit 108 may be a non-volatile recording medium or a volatile recording medium.

The vocal tract length normalization processing unit 109 performs sampling processing on the voice received by the voice receiving unit 103 at the second sampling frequency, and obtains second voice data. The second sampling frequency is the sampling frequency calculated as "first sampling frequency / (teacher data formant frequency / evaluation subject formant frequency)". Preferably, the vocal tract length normalization processing unit 109 obtains the second voice data by resampling the first voice data, which was obtained by sampling the voice received by the voice receiving unit 103. Alternatively, it may sample the received voice and obtain the second voice data directly. Obtaining the second voice data directly requires, for example, that the sampling hardware be able to sample at a variable sampling frequency. The vocal tract length normalization processing unit 109 usually performs the operation "teacher data formant frequency / evaluation subject formant frequency" to obtain a frequency scale (denoted "r").
It then performs the operation "Fs / r" on the first sampling frequency (Fs) held in the first sampling frequency storage unit 105 and "r" to obtain a new sampling frequency (Fs / r). This new sampling frequency (Fs / r) is the second sampling frequency. Next, the vocal tract length normalization processing unit 109 resamples the first voice data at the second sampling frequency (Fs / r) and obtains the second voice data. The vocal tract length normalization processing unit 109 can usually be realized by an MPU, a memory, or the like. Its processing procedure is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may also be realized by hardware (a dedicated circuit).
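
The two operations described above (r = teacher data formant frequency / evaluation subject formant frequency, then Fs / r) can be sketched as follows. This is only an illustrative sketch: the function name and its argument names are hypothetical and do not appear in this description, and the numeric values are those used in the worked example given later in this description.

```python
def second_sampling_frequency(fs_hz, teacher_formant_hz, subject_formant_hz):
    """Return (r, fs2), where r is the frequency scale and fs2 = Fs / r.

    All names here are hypothetical; only the two formulas come from the text.
    """
    if fs_hz <= 0 or teacher_formant_hz <= 0 or subject_formant_hz <= 0:
        raise ValueError("all frequencies must be positive")
    r = teacher_formant_hz / subject_formant_hz  # frequency scale "r"
    return r, fs_hz / r                          # second sampling frequency


# Values from the worked example: Fs = 22.05 kHz, teacher F2 = 1184 Hz,
# evaluation subject F2 = 1725 Hz.
r, fs2 = second_sampling_frequency(22050.0, 1184.0, 1725.0)
```

When the evaluation subject's formant is higher than the teacher's, r is below 1 and the second sampling frequency is higher than the first.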

The voice processing unit 110 processes the second voice data. Here, the voice processing unit 110 performs rating processing. However, it may perform other voice processing such as speech recognition or voice output. Voice output is simply processing that outputs the resampled voice. In the present embodiment, the voice processing unit 110 is described as performing rating processing. The voice processing unit 110 can usually be realized by an MPU, a memory, or the like. Its processing procedure is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may also be realized by hardware (a dedicated circuit).

The frame segmentation means 1101, which is part of the voice processing unit 110, divides the second voice data into frames. The frame segmentation means 1101 can usually be realized by an MPU, a memory, or the like. Its processing procedure is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may also be realized by hardware (a dedicated circuit).

The frame voice data acquisition means 1102, which is part of the voice processing unit 110, obtains one or more pieces of frame voice data, that is, the voice data of each divided frame. The frame voice data acquisition means 1102 can usually be realized by an MPU, a memory, or the like. Its processing procedure is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may also be realized by hardware (a dedicated circuit).

The rating means 1103, which is part of the voice processing unit 110, rates the voice received by the voice receiving unit 103 based on the teacher data in the teacher data storage unit 102 and the one or more pieces of frame voice data. A specific example of the rating method is described later. Needless to say, "rating the voice received by the voice receiving unit 103" also covers rating the second voice data. The rating means 1103 can usually be realized by an MPU, a memory, or the like. Its processing procedure is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may also be realized by hardware (a dedicated circuit).

The optimum state determination means 11031, which is part of the rating means 1103, determines the optimum state for at least one of the one or more pieces of frame voice data. For example, it acquires, from the HMMs of all phonemes, the HMMs corresponding to the one or more phonemes that make up the voice to be compared against (the learning target), such as a word or a sentence, and from the acquired HMMs constructs data concatenated in phoneme order (data on the comparison target voice, based on an HMM formed by concatenating the hidden Markov models of the individual phonemes). Then, based on the constructed data and on each feature vector o_t of the acquired feature vector series, it determines the optimum state of a given frame (the optimum state for the feature vector o_t). The algorithm for determining the optimum state is, for example, the Viterbi algorithm. The teacher data may be regarded either as the above data on the comparison target voice, based on the HMM formed by concatenating the per-phoneme hidden Markov models, or as the data before concatenation, that is, the data of the HMMs of all phonemes.
The optimum state probability value acquisition means 11032, which is part of the rating means 1103, acquires the probability value of the optimum state determined by the optimum state determination means 11031.

The rating value calculation means 11033, which is part of the rating means 1103, calculates the rating value of the voice using the probability value acquired by the optimum state probability value acquisition means 11032 as a parameter. How the rating value calculation means 11033 uses this probability value to calculate the rating value does not matter. For example, it calculates the rating value using as parameters the probability value acquired by the optimum state probability value acquisition means 11032 and the sum of the probability values over all states of the frame corresponding to that probability value. Here, the rating value calculation means 11033 usually calculates a rating value for each frame.
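
As one possible reading of the example above, the per-frame rating value could be the optimum state's probability value divided by the sum of the probability values over all states of that frame. The description leaves the actual function open, so the ratio below is an assumption made for illustration, and the function and argument names are hypothetical.

```python
def frame_rating(state_probs, optimal_state):
    """Rate one frame as p(optimal state) / sum of p over all states.

    state_probs: probability values of every state for this frame.
    optimal_state: index of the state chosen as optimal for this frame.
    The ratio is an assumed example; the description does not fix the function.
    """
    total = sum(state_probs)
    if total <= 0:
        raise ValueError("probability values must sum to a positive number")
    return state_probs[optimal_state] / total


# Made-up probability values for a single frame with three states.
score = frame_rating([0.1, 0.3, 0.6], optimal_state=2)
```

A ratio of this kind stays between 0 and 1 regardless of the absolute scale of the probability values, which makes per-frame values comparable.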

The optimum state determination means 11031, the optimum state probability value acquisition means 11032, and the rating value calculation means 11033 can usually be realized by an MPU, a memory, or the like. Their processing procedures are usually realized by software, and the software is recorded on a recording medium such as a ROM. However, they may also be realized by hardware (dedicated circuits).

The output means 1104 outputs the rating result of the rating means 1103. Various output forms are conceivable. Output is a concept that includes display on a display, printing on a printer, sound output, transmission to an external device, storage on a recording medium, and the like. For example, the output means 1104 displays the rating result visually, per frame and/or per phoneme or word and/or for the entire utterance. The output means 1104 may or may not be regarded as including an output device such as the display 344 or a speaker. The output means 1104 can be realized by driver software for an output device, or by driver software for an output device together with the output device.
Next, the operation of the speech processing apparatus is described with reference to the flowcharts of FIGS. 2 and 3.

(Step S201) The input receiving unit 101 determines whether an operation start instruction, which instructs the speech processing apparatus to start operating, has been received. If it has been received, the process goes to step S202; otherwise the process jumps to step S217.
(Step S202) The voice receiving unit 103 determines whether voice has been received. If voice has been received, the process goes to step S203; otherwise the process jumps to step S216.

(Step S203) The sampling unit 106 reads the first sampling frequency stored in the first sampling frequency storage unit 105, samples the voice received by the voice receiving unit 103 at that frequency, and obtains the first voice data.

(Step S204) The vocal tract length normalization processing unit 109 obtains the second voice data from the voice received by the voice receiving unit 103. This vocal tract length normalization processing is described in detail with reference to the flowchart of FIG. 3. Vocal tract length normalization is preprocessing for a rating that absorbs individual differences.
(Step S205) The frame segmentation means 1101 temporarily stores the second voice data obtained in step S204 in a buffer (not shown).

(Step S206) The frame segmentation means 1101 divides the second voice data temporarily stored in the buffer into frames. At this stage, the frame voice data, which is the voice data of each divided frame, has been formed. The frame division performed by the frame segmentation means 1101 is, for example, preprocessing for when the frame voice data acquisition means 1102 extracts frame voice data, and it does not necessarily divide the input voice data into all frames at once.
(Step S207) The frame voice data acquisition means 1102 assigns 1 to the counter i.

(Step S208) The frame voice data acquisition means 1102 determines whether the i-th frame exists. If it exists, the process goes to step S209; otherwise the process goes to step S211.

(Step S209) The frame voice data acquisition means 1102 acquires the i-th frame voice data. Acquiring frame voice data means, for example, analyzing the divided voice data and extracting feature vector data. The frame voice data is, for example, data obtained by dividing the input voice data into frames. The frame voice data also has, for example, feature vector data extracted from the divided voice data by speech analysis. The feature vector data is, for example, MFCCs obtained by applying a discrete cosine transform to the outputs of a 24-channel filter bank using triangular filters, and has 12 dimensions each of static, delta, and delta-delta parameters, plus normalized power, delta power, and delta-delta power (39 dimensions in total).
(Step S210) The frame voice data acquisition means 1102 increments the counter i by 1. The process returns to step S208.
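
The frame loop of steps S206 to S210 can be sketched as a generator that hands out one frame at a time, which matches the remark that the data need not be divided into all frames at once. The function name, the parameter names, and the frame length and shift used below are assumptions for illustration; the actual analysis conditions are those of FIG. 6, which is not reproduced here.

```python
def iter_frames(samples, frame_len, frame_shift):
    """Yield successive frames of frame_len samples, advancing by frame_shift.

    Frames are produced lazily, one per request, instead of all at once.
    A trailing partial frame is dropped in this sketch (an assumption).
    """
    if frame_len <= 0 or frame_shift <= 0:
        raise ValueError("frame_len and frame_shift must be positive")
    start = 0
    while start + frame_len <= len(samples):  # "does the i-th frame exist?"
        yield samples[start:start + frame_len]
        start += frame_shift                  # move on to frame i + 1


# Ten dummy samples, 4-sample frames, 2-sample shift (made-up sizes).
frames = list(iter_frames(list(range(10)), frame_len=4, frame_shift=2))
```

In a real system each yielded frame would then go through speech analysis to produce its feature vector.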

(Step S211) The optimum state determination means 11031 determines the optimum states of all frames. The algorithm it uses is, for example, the Viterbi algorithm. Since the Viterbi algorithm is a known algorithm, a detailed description is omitted.
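
For readers unfamiliar with it, a textbook Viterbi pass over a small discrete-state HMM is sketched below in the log domain. It is a generic illustration, not the apparatus's implementation: the function name, the data layout, and the toy model are all assumptions.

```python
import math

NEG_INF = float("-inf")


def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state sequence.

    log_init[j]: log initial probability of state j.
    log_trans[i][j]: log transition probability from state i to state j.
    log_emit[t][j]: log-likelihood of frame t's feature vector in state j.
    """
    n_states = len(log_init)
    score = [log_init[j] + log_emit[0][j] for j in range(n_states)]
    backptrs = []
    for t in range(1, len(log_emit)):
        new_score, ptr = [], []
        for j in range(n_states):
            # Best predecessor of state j at time t.
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + log_trans[best_i][j] + log_emit[t][j])
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda j: score[j])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


# Toy 2-state left-to-right model; all numbers are made up for illustration.
log = math.log
example_path = viterbi(
    log_init=[0.0, NEG_INF],
    log_trans=[[log(0.5), log(0.5)], [NEG_INF, 0.0]],
    log_emit=[[log(0.9), log(0.1)], [log(0.8), log(0.2)], [log(0.1), log(0.9)]],
)
```

Working in the log domain avoids numerical underflow on long utterances.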

(Step S212) The optimum state probability value acquisition means 11032 calculates the forward likelihoods and backward likelihoods of all states of all frames. For example, it calculates them with the forward-backward algorithm using all the HMMs.
(Step S213) The optimum state probability value acquisition means 11032 calculates all the probability values of the optimum states (optimum state probability values) using the forward and backward likelihoods obtained in step S212.
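
Steps S212 and S213 can be illustrated with the standard forward-backward recursion, in which the probability of being in state j at frame t is the product of the forward and backward likelihoods, normalized over all states. The sketch below is generic and unscaled; the function name, the data layout, and the toy model are assumptions, and a practical implementation would use scaling or log arithmetic.

```python
def state_posteriors(init, trans, emit):
    """Return gamma[t][j] = P(state j at frame t | all frames).

    init[j]: initial probability, trans[i][j]: transition probability,
    emit[t][j]: likelihood of frame t's feature vector in state j.
    """
    n_frames, n_states = len(emit), len(init)
    alpha = [[0.0] * n_states for _ in range(n_frames)]  # forward likelihoods
    beta = [[0.0] * n_states for _ in range(n_frames)]   # backward likelihoods
    for j in range(n_states):
        alpha[0][j] = init[j] * emit[0][j]
        beta[n_frames - 1][j] = 1.0
    for t in range(1, n_frames):
        for j in range(n_states):
            alpha[t][j] = emit[t][j] * sum(
                alpha[t - 1][i] * trans[i][j] for i in range(n_states))
    for t in range(n_frames - 2, -1, -1):
        for i in range(n_states):
            beta[t][i] = sum(
                trans[i][j] * emit[t + 1][j] * beta[t + 1][j] for j in range(n_states))
    gamma = []
    for t in range(n_frames):
        norm = sum(alpha[t][j] * beta[t][j] for j in range(n_states))
        if norm <= 0.0:
            raise ValueError("observation sequence has zero likelihood")
        gamma.append([alpha[t][j] * beta[t][j] / norm for j in range(n_states)])
    return gamma


# Toy 2-state left-to-right model; all numbers are made up for illustration.
gamma = state_posteriors(
    init=[1.0, 0.0],
    trans=[[0.5, 0.5], [0.0, 1.0]],
    emit=[[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]],
)
```

Under this reading, the optimum state probability value of frame t would be the entry of `gamma[t]` at the state chosen as optimal for that frame.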

(Step S214) The rating value calculation means 11033 calculates the rating values of the voice of one or more frames from the one or more optimum state probability values calculated in step S213. The function used to calculate the rating value does not matter. For example, the rating value calculation means 11033 calculates the rating value using as parameters the acquired optimum state probability value and the sum of the probability values over all states of the frame corresponding to that optimum state probability value. Details are described later.

(Step S215) The output means 1104 outputs the rating result of step S214 (here, the rating value of the voice) according to the output mode that has been set. The process returns to step S202. The output mode may be anything: for example, a mode that displays the rating value on the screen as a number, a mode that displays the transition of the rating value on the screen as a graph, a mode that outputs the rating value as voice, or a mode that displays warning information when the rating value is lower than a predetermined value. The output mode here is the mode set in step S218.

(Step S216) The voice receiving unit 103 determines whether a timeout has occurred, that is, whether no voice input has been received for a predetermined time or longer. If a timeout has occurred, the process returns to step S201; otherwise the process returns to step S202.

(Step S217) The input receiving unit 101 determines whether an output mode change instruction has been received. If it has been received, the process goes to step S218; otherwise the process jumps to step S219. The output mode change instruction is information that contains the output mode described above.
(Step S218) The output means 1104 writes the information indicating the output mode contained in the output mode change instruction received in step S217, and sets the output mode. The process returns to step S201.
(Step S219) The input receiving unit 101 determines whether an end instruction has been received. If it has been received, the process ends; otherwise the process returns to step S201.
In the flowchart of FIG. 2, the pronunciation rating apparatus need not have the output mode setting function.
Next, the vocal tract length normalization processing of step S204 is described in detail with reference to the flowchart of FIG. 3.

(Step S301) The evaluation subject formant frequency acquisition unit 107 acquires the evaluation subject formant frequency (Fi) from the first voice data obtained by the sampling processing of the sampling unit 106, and temporarily stores it in the evaluation subject formant frequency storage unit 108. The evaluation subject formant frequency is, for example, the second formant frequency (F2).
(Step S302) The vocal tract length normalization processing unit 109 reads the first sampling frequency (Fs) from the first sampling frequency storage unit 105.
(Step S303) The vocal tract length normalization processing unit 109 reads the teacher data formant frequency from the teacher data formant frequency storage unit 104.

(Step S304) The vocal tract length normalization processing unit 109 calculates the frequency scale from the evaluation subject formant frequency acquired in step S301 and the teacher data formant frequency read in step S303. Specifically, it performs the operation "teacher data formant frequency / evaluation subject formant frequency" and obtains the frequency scale (r).

(Step S305) The vocal tract length normalization processing unit 109 performs the operation "Fs / r" on the first sampling frequency (Fs) read in step S302 and the frequency scale (r), and obtains the second sampling frequency (Fs / r).

(Step S306) The vocal tract length normalization processing unit 109 resamples the first voice data, which the sampling unit 106 obtained by sampling, at the second sampling frequency (Fs / r), and obtains the second voice data. Since resampling is a known technique, a detailed description is omitted. The process then returns to the calling process.
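
The description treats resampling as known art and does not name a method. Purely for illustration, the sketch below resamples by linear interpolation, which is an assumption and not necessarily what the apparatus uses; the function and parameter names are hypothetical. A production resampler would normally apply band-limited interpolation and, when lowering the rate, a low-pass filter.

```python
def resample_linear(samples, fs_in, fs_out):
    """Resample a sequence from rate fs_in to rate fs_out by linear interpolation."""
    if fs_in <= 0 or fs_out <= 0:
        raise ValueError("sampling frequencies must be positive")
    if not samples:
        return []
    last = len(samples) - 1
    n_out = int(len(samples) * fs_out / fs_in)  # output length for the same duration
    step = fs_in / fs_out                       # input samples per output sample
    out = []
    for n in range(n_out):
        pos = n * step
        i = min(int(pos), last)                 # left neighbour index
        frac = pos - i
        right = samples[min(i + 1, last)]       # clamp at the final sample
        out.append((1.0 - frac) * samples[i] + frac * right)
    return out


# Doubling the rate of a short ramp (made-up data).
doubled = resample_linear([0.0, 1.0, 2.0, 3.0], fs_in=1.0, fs_out=2.0)
```

Since the later speech analysis is stated to use the first sampling frequency, data resampled at Fs / r and then analyzed as if sampled at Fs has its frequency axis scaled by r, which is what moves the evaluation subject's formant toward the teacher's.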

In the flowcharts of FIGS. 2 and 3, the voice used for the vocal tract length normalization processing may differ from the voice to be evaluated. For example, the voice receiving unit 103 receives the voice of one or more predetermined vowels (for example, "u" (う)) and then receives the voice of the evaluation subject; the evaluation subject formant frequency acquisition unit 107 acquires the evaluation subject formant frequency based on the voice of those vowels; and the vocal tract length normalization processing unit 109 performs the vocal tract length normalization processing using that evaluation subject formant frequency as a parameter. The voice processing unit 110 may then process and evaluate the voice received after the voice of the predetermined vowel.
A specific operation of the speech processing apparatus according to the present embodiment is described below. In this specific example, the speech processing apparatus is used for language learning.

First, in the speech processing apparatus, phoneme HMMs of native pronunciation are trained from a speech database of native pronunciation by means not shown. Let L be the number of phoneme types, and let λ_l be the HMM for the l-th phoneme. Since this training is a known technique, a detailed description is omitted. An example of the HMM specifications is shown in FIG. 4. The same HMM specifications apply to the specific examples in the other embodiments. Needless to say, other HMM specifications may be used.

Then, by means not shown, the HMMs corresponding to the one or more phonemes that make up the voice of the learning target, such as a word or a sentence, are acquired from the L trained phoneme HMMs, and teacher data is constructed by concatenating the acquired HMMs in phoneme order. The teacher data is held in the teacher data storage unit 102. Here, for example, the voice to be compared against is the voice of the word "right". It is also assumed here that the person who uttered the teacher data (the teacher) is an adult.

Next, the learner (the evaluation subject) inputs an operation start instruction, which is an instruction to start language learning. This instruction is given, for example, by clicking a predetermined button with a mouse. Here, the learner is assumed to be, for example, a child (5 to 11 years old).

First, suppose the learner pronounces the vowel "u" (う). In this case, it is preferable that the speech processing apparatus prompt the learner to utter "u". To do so, the apparatus may, for example, display the message 'Please say "u".' on the screen, or output the same message as voice. The vowel "u" is well suited to acquiring the learner's evaluation subject formant frequency. The speech processing apparatus is assumed to hold 22.05 kHz as the first sampling frequency.
Next, the sampling unit 106 samples the voice "u" received by the voice receiving unit 103 and obtains the first voice data of "u".

Next, the evaluation subject formant frequency acquisition unit 107 acquires the second formant frequency from the first voice data that the sampling unit 106 obtained by sampling the voice "u". This second formant frequency is taken as the evaluation subject formant frequency (Fi). Suppose that Fi is 1725 Hz. The evaluation subject formant frequency acquisition unit 107 then temporarily stores Fi (1725 Hz) in the evaluation subject formant frequency storage unit 108.

Next, the vocal tract length normalization processing unit 109 reads the teacher data formant frequency from the teacher data formant frequency storage unit 104. The stored teacher data formant frequency is the second formant frequency of an adult and is assumed here to be 1184 Hz. The teacher data formant frequency is, for example, the second formant frequency acquired, when the teacher data was built, by having the teacher utter "u" and sampling that voice.

FIG. 5 shows the measured first formant frequency (F1) and second formant frequency (F2) of "u" by age group and sex. FIG. 5 shows that the values of F1 and F2 differ greatly with age and sex.

Next, the vocal tract length normalization processing unit 109 performs the operation "teacher data formant frequency / evaluation subject formant frequency" on the evaluation subject formant frequency of 1725 Hz and the teacher data formant frequency of 1184 Hz, and obtains the frequency scale (r). Specifically, it computes 1184 / 1725 and obtains a frequency scale of 0.686.

Next, the vocal tract length normalization processing unit 109 performs the operation "Fs / r" on the first sampling frequency (Fs) and r, and obtains the second sampling frequency (Fs / r). Here, 22.05 / 0.686 gives a second sampling frequency of 32.1 kHz. The vocal tract length normalization processing unit 109 temporarily stores the second sampling frequency of 32.1 kHz.

Next, the learner pronounces the learning target voice "right", and the voice receiving unit 103 receives the voice pronounced by the learner. It is preferable that the speech processing apparatus prompt the learner to utter "right", for example by displaying 'Please pronounce "right".' or outputting it as voice.

Next, the sampling unit 106 samples the received speech “right” at the sampling frequency “22.05 kHz”, and obtains the first voice data of the speech “right”.
Next, the vocal tract length normalization processing unit 109 resamples the first voice data of “right” at the second sampling frequency “32.1 kHz”, and obtains the second voice data.
Next, the voice processing unit 110 processes the second voice data as follows.
First, the frame classification unit 1101 divides the second voice data into short-time frames. The frame interval is assumed to be determined in advance.
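
The resampling and frame division steps can be sketched as below. This is a minimal sketch under stated assumptions, not the method of the specification: the text does not say which resampling technique is used, so plain linear interpolation (without any filtering) is assumed here, and the names `resample_linear`, `split_frames`, `frame_len`, and `frame_shift` are hypothetical. On this reading, because the later analysis treats the second voice data as if it were sampled at the first sampling frequency, every frequency is effectively multiplied by r, which moves the speaker's 1725 Hz formant to about 1184 Hz.

```python
def resample_linear(samples, fs_in, fs_out):
    """Resample a list of samples from fs_in to fs_out by linear interpolation."""
    if not samples:
        return []
    n_out = int(len(samples) * fs_out / fs_in)
    out = []
    for i in range(n_out):
        pos = i * fs_in / fs_out            # fractional position in the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)  # clamp at the last input sample
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def split_frames(samples, frame_len, frame_shift):
    """Divide samples into overlapping frames of frame_len, stepping by frame_shift."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, frame_shift)]


first_voice_data = [0.0, 1.0, 2.0, 3.0]                  # toy data, not real speech
second_voice_data = resample_linear(first_voice_data, 22.05, 32.1)
frames = split_frames(second_voice_data, frame_len=4, frame_shift=2)
```

A practical implementation would normally use a band-limited resampler rather than linear interpolation.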

Then, the frame audio data acquisition unit 1102 performs spectral analysis on the voice data divided by the frame classification unit 1101 and calculates the feature vector sequence “O = o_1, o_2, ..., o_T”, where T is the sequence length. The feature vector sequence is the set of the feature vectors of the individual frames. Each feature vector is, for example, built from MFCCs obtained by applying a discrete cosine transform to the outputs of a 24-channel filter bank with triangular filters: it has 12 dimensions each of static, delta, and delta-delta parameters, plus normalized power, delta power, and delta-delta power, for 39 dimensions in total. It is preferable to apply cepstral mean subtraction in the spectral analysis. An example of the speech analysis conditions is shown in the table of FIG. 6. The same speech analysis conditions apply to the specific examples described in the other embodiments. Needless to say, other speech analysis conditions may be used. The sampling frequency for the speech analysis is the first sampling frequency “22.05 kHz”.
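
Cepstral mean subtraction, mentioned above as a preferable step, can be sketched as follows. This assumes the common definition (subtracting, per dimension, the mean over all frames of the utterance); the specification does not spell out the procedure, and the function name `cepstral_mean_subtraction` is hypothetical.

```python
def cepstral_mean_subtraction(features):
    """Subtract the per-dimension mean over all frames from each feature vector.

    features: list of T feature vectors, each a list of the same length.
    """
    if not features:
        return []
    num_frames = len(features)
    dim = len(features[0])
    # Mean of each cepstral dimension over the whole utterance.
    mean = [sum(f[d] for f in features) / num_frames for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in features]


normalized = cepstral_mean_subtraction([[1.0, 2.0], [3.0, 6.0]])
print(normalized)  # [[-1.0, -2.0], [1.0, 2.0]]
```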

Next, the optimum state determination unit 11031 determines the optimum state “q_t^*” of a given frame (the optimum state for the feature vector o_t) based on each feature vector o_t constituting the acquired feature vector sequence. The optimum state determination unit 11031 determines the optimum state “q_t^*” by, for example, the Viterbi algorithm. In that case, the optimum state determination unit 11031 determines the optimum state using the HMMs concatenated as described above. The optimum state determination unit 11031 thus obtains an optimum state sequence, that is, the optimum states of two or more frames.
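
For reference, a generic log-domain Viterbi search over a small HMM can be sketched as follows. This is a textbook form of the algorithm, not the apparatus's actual implementation; the function names and the toy model (two states, three frames, emission likelihoods given directly instead of being computed from feature vectors) are assumptions made for illustration.

```python
import math


def _log(p):
    """Logarithm that maps probability 0 to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(init, trans, emit):
    """Return the most likely state sequence q_1^*, ..., q_T^*.

    init[j]: initial probability of state j
    trans[i][j]: transition probability from state i to state j
    emit[t][j]: likelihood of feature vector o_t in state j
    """
    num_states = len(init)
    delta = [_log(init[j]) + _log(emit[0][j]) for j in range(num_states)]
    backpointers = []
    for t in range(1, len(emit)):
        prev, delta, ptr = delta, [], []
        for j in range(num_states):
            best_i = max(range(num_states), key=lambda i: prev[i] + _log(trans[i][j]))
            delta.append(prev[best_i] + _log(trans[best_i][j]) + _log(emit[t][j]))
            ptr.append(best_i)
        backpointers.append(ptr)
    # Trace back from the best final state.
    state = max(range(num_states), key=lambda j: delta[j])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


init = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]            # left-to-right toy model
emit = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
print(viterbi(init, trans, emit))  # [0, 0, 1]
```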

Next, the optimum state probability value acquisition unit 11032 calculates the optimum state probability value γ_t(q_t^*) in the optimum state according to Equation 1 below. Here, γ_t(q_t^*) is the value obtained by substituting q_t^* for j in the posterior probability function γ_t(j) of state j, and γ_t(j) is calculated using Equation 2. The probability value γ_t(j) is the posterior probability that the t-th feature vector o_t was generated from state j, and is calculated using dynamic programming. Note that j is a state identifier that identifies a state.

In Equation 1, q_t denotes the state identifier for o_t. The probability value γ_t(j) corresponds to the occupation count that appears in the Baum-Welch algorithm for maximum likelihood estimation of HMMs.

In Equation 2, “α_t(j)” and “β_t(j)” are calculated by the forward-backward algorithm using all the HMMs. “α_t(j)” is the forward likelihood and “β_t(j)” is the backward likelihood. The Baum-Welch algorithm and the forward-backward algorithm are known algorithms, so a detailed description is omitted.
Also, in Equation 2, N denotes the total number of states across all the HMMs.
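
The images of Equations 1 and 2 are not reproduced in this text. Assuming they take the standard forward-backward form of the state posterior, which is consistent with the symbols defined here (q_t, q_t^*, α_t(j), β_t(j), N), they would read roughly as follows. This is a reconstruction, and the exact form in the original drawings may differ.

```latex
\begin{align}
\gamma_t(q_t^*) &= P(q_t = q_t^* \mid O) \tag{1} \\
\gamma_t(j) &= \frac{\alpha_t(j)\,\beta_t(j)}{\sum_{i=1}^{N} \alpha_t(i)\,\beta_t(i)} \tag{2}
\end{align}
```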

Note that the rating unit 1103 may first obtain the optimum state and then obtain the probability value of that optimum state (the probability value is between 0 and 1 inclusive). Alternatively, the rating unit 1103 may first obtain the probability values of all states, then obtain the optimum state for each feature vector of the feature vector sequence, and then take the probability value corresponding to that optimum state.

Next, the rating value calculation unit 11033 calculates the rating value of the speech using as parameters, for example, the optimum state probability value acquired above and the sum of the probability values over all states of the frame corresponding to that optimum state probability value. In this case, if the learner's utterance in the t-th frame is close to the pronunciation indicated by the teacher data (for example, correct native pronunciation), the numerator of Equation 2 becomes large compared with the values for all states of all other possible phonemes, and as a result the probability value of the optimum state (the rating value) becomes large. Conversely, if that segment is not close to the pronunciation indicated by the teacher data, the rating value becomes small. If the utterance is not close to any native pronunciation, the rating value is approximately equal to 1/N. Since N is the number of all states in all phoneme HMMs, it is usually large, so this rating value is sufficiently small. Here, the rating value is defined as the ratio of the probability value in the optimum state to the probability values over all possible states. Therefore, even if the spectrum varies somewhat because of differences in the sound collection environment or the like, the variation cancels out and the rating value stays high as long as the learner pronounces correctly. It is therefore highly preferable that the rating value calculation unit 11033 calculate the rating value of the speech using as parameters the probability value acquired by the optimum state probability value acquisition unit 11032 and the sum of the probability values over all states of the frame corresponding to that probability value.
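
The ratio described above can be sketched with a plain forward-backward pass. This is an illustrative sketch under assumptions: it uses the textbook state posterior (forward likelihood times backward likelihood, normalized over all N states), works in the probability domain without scaling (so it is only suitable for short toy sequences), and takes emission likelihoods as given. The names `state_posteriors` and `dap_scores` are hypothetical.

```python
def state_posteriors(init, trans, emit):
    """Return gamma, where gamma[t][j] is the posterior of state j at frame t.

    init[j]: initial probability, trans[i][j]: transition probability,
    emit[t][j]: likelihood of feature vector o_t in state j.
    """
    num_frames, num_states = len(emit), len(init)
    alpha = [[0.0] * num_states for _ in range(num_frames)]
    beta = [[1.0] * num_states for _ in range(num_frames)]
    for j in range(num_states):
        alpha[0][j] = init[j] * emit[0][j]
    for t in range(1, num_frames):          # forward likelihoods
        for j in range(num_states):
            alpha[t][j] = emit[t][j] * sum(
                alpha[t - 1][i] * trans[i][j] for i in range(num_states))
    for t in range(num_frames - 2, -1, -1):  # backward likelihoods
        for i in range(num_states):
            beta[t][i] = sum(
                trans[i][j] * emit[t + 1][j] * beta[t + 1][j] for j in range(num_states))
    gamma = []
    for t in range(num_frames):
        weights = [alpha[t][j] * beta[t][j] for j in range(num_states)]
        total = sum(weights)                 # sum over all N states
        gamma.append([w / total for w in weights])
    return gamma


def dap_scores(gamma, best_path):
    """Per-frame rating value: posterior of the optimum state of each frame."""
    return [gamma[t][q] for t, q in enumerate(best_path)]


init = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
emit = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
gamma = state_posteriors(init, trans, emit)
scores = dap_scores(gamma, [0, 0, 1])        # optimum state sequence assumed given
```

Because each score is a ratio normalized over all states, a factor that scales every state's likelihood in a frame equally cancels out, which matches the robustness argument in the text.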

FIGS. 7 and 8 show rating values (also called “DAP scores”) calculated by the rating value calculation unit 11033. In FIGS. 7 and 8, the horizontal axis is the analysis frame number and the vertical axis is the score in percent. Thick broken lines indicate phoneme boundaries and thin dotted lines indicate state boundaries (both obtained by the Viterbi algorithm), and the phoneme names are shown at the top of each figure. FIG. 7 shows the DAP score for the English word “right” pronounced by an American male. The horizontal and vertical axes are the same in the rating value graphs described later.

FIG. 8 shows the DAP score for the English word “right” pronounced by a Japanese male. The American speaker's pronunciation basically scores higher than the Japanese speaker's. Note that in FIG. 7 the score drops in places at the state boundaries.

Then, the output unit 1104 outputs the rating result of the rating unit 1103. Specifically, for example, the output unit 1104 outputs the rating result in the manner shown in FIG. 9. That is, the output unit 1104 displays the rating value of each frame as a score (score graph) representing how good the pronunciation is in each frame. In addition, the output unit 1104 may display the word to be learned (word display), the phoneme elements (phoneme display), the waveform of the teacher data (teacher waveform), and the waveform of the pronunciation input by the learner (user waveform). In FIG. 9, pressing the “Record” button inputs an operation start instruction, and pressing the “Stop” button inputs an end instruction. Techniques for displaying phoneme elements and waveforms are known, so a detailed description is omitted. It is assumed that the speech processing apparatus stores in advance the words to be learned (such as “word1” in FIG. 9), the phonemes (such as “p1” in FIG. 9), and the data for outputting the teacher waveform.

In FIG. 9, rating results may also be displayed per phoneme, per word, or for the whole utterance, in addition to per frame. Since the above processing calculates rating values per frame, obtaining a rating result per word or for the whole utterance requires calculating it with one or more per-frame rating values as parameters. Any calculation formula may be used; for example, the average of the one or more per-frame rating values of the frames constituting a word may be taken as the rating value of that word.
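
The averaging example mentioned above can be sketched as follows. The function name `segment_score` and the half-open frame range are assumptions made for illustration; the text leaves the calculation formula open.

```python
def segment_score(frame_scores, start, end):
    """Average the per-frame rating values of frames start..end-1 (one word or phoneme)."""
    segment = frame_scores[start:end]
    if not segment:
        raise ValueError("the segment must contain at least one frame")
    return sum(segment) / len(segment)


frame_scores = [0.2, 0.4, 0.6, 0.8]          # toy per-frame rating values
word_score = segment_score(frame_scores, 1, 3)
utterance_score = segment_score(frame_scores, 0, len(frame_scores))
```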

In FIG. 9, when the speech processing apparatus receives a click on a waveform display (teacher waveform or user waveform), it may display a playback menu and then play, within a phoneme section, that phoneme, the word to which the section belongs, or the entire waveform, and, outside any word section (a silent part), only the entire waveform.
The output unit 1104 may also display results in the manner shown in FIG. 10. In FIG. 10, the score for each phoneme, the word score, and the total score are displayed as numbers.
Needless to say, the output unit 1104 may also display results as in FIGS. 7 and 8.

As described above, according to the present embodiment, a similarity (rating value) indicating how closely the pronunciation input by the user resembles the teacher data can be calculated and output. Moreover, according to the present embodiment, this rating is highly accurate and unaffected by individual differences, in particular differences in vocal tract length.

Also, according to the present embodiment, the optimum state is obtained using a concatenated HMM, in which HMMs are concatenated, and the rating value is calculated from it, so the rating value can be obtained at high speed. Therefore, as described in the specific example above, rating values for each frame, each phoneme, and each word can be output in real time. Further, because the posterior probability based on dynamic programming is calculated as the probability value, the rating value can be obtained even faster. In addition, because the probability value is calculated for each frame, rating results can be output, as described above, not only per frame but also per phoneme or word and/or for the whole utterance, which gives a high degree of freedom in the output form.

Although the present embodiment has mainly been described for use in language learning, the speech processing apparatus can also be used for impersonation practice, karaoke rating, singing rating, and the like. That is, the speech processing apparatus can rate and output, accurately and at high speed, the similarity to data on the speech being compared against, whatever the application. For example, the speech processing apparatus may be a karaoke evaluation apparatus in which the voice receiving unit receives the singing voice of the evaluation subject and the voice processing unit evaluates the singing voice. The same applies to the other embodiments.

In the present embodiment, after receiving speech input or after the stop button is operated, the apparatus may ask the user whether to execute scoring processing, and output the phoneme scores, word score, and total score as shown in FIG. 10 only when it receives an instruction to perform scoring processing.

In the present embodiment, the teacher data has mainly been described as data on the speech being compared against that is based on a hidden Markov model (HMM) for each phoneme, but it need not be HMM-based. The teacher data may be based on other models, such as a single Gaussian distribution model, a probability model (GMM: Gaussian mixture model), or a statistical model. The same applies to the other embodiments.

In the specific example of the present embodiment, the learner pronounced the vowel “u”, and the speech processing apparatus obtained the second sampling frequency from that speech. However, the learner may pronounce one or more vowels, for example “a”, “i”, “e”, and “o”, and the speech processing apparatus may obtain the second sampling frequency from the speech of those vowels. That is, the sound the learner pronounces to obtain the second sampling frequency is not limited to “u”.

In the present embodiment, the following processing performed by the speech processing apparatus may be performed by a single DSP (digital signal processor). That is, this DSP is a digital signal processor comprising: a first sampling frequency storage unit that stores the first sampling frequency; a sampling unit that samples received speech at the first sampling frequency and acquires first voice data; a teacher data formant frequency storage unit that stores the teacher data formant frequency, which is the formant frequency of the teacher data; an evaluation subject formant frequency storage unit that stores the evaluation subject formant frequency, which is the formant frequency of the evaluation subject who is the speaker of the speech; and a vocal tract length normalization processing unit that performs sampling processing on the received speech at the second sampling frequency “first sampling frequency / (teacher data formant frequency / evaluation subject formant frequency)” and obtains second voice data. The same applies to the other embodiments.

Furthermore, the processing in the present embodiment may be implemented in software. This software may be distributed by software download or the like, or recorded on a recording medium such as a CD-ROM and distributed. This also applies to the other embodiments in this specification. The software that implements the speech processing apparatus of the present embodiment is the following program. That is, this program causes a computer to execute: a sampling step of sampling received speech at the first sampling frequency and acquiring first voice data; a vocal tract length normalization processing step of performing sampling processing, at the second sampling frequency “first sampling frequency / (teacher data formant frequency / evaluation subject formant frequency)”, on the speech received in the voice receiving step and obtaining second voice data; and a voice processing step of processing the second voice data.

In the above program, it is preferable that the voice processing step comprise: a frame classification step of dividing the second voice data into frames; a frame audio data acquisition step of obtaining one or more pieces of frame voice data, which are the voice data of the individual divided frames; a rating step of rating the received speech based on the teacher data and the one or more pieces of frame voice data; and an output step of outputting the rating result of the rating step.

Furthermore, in the above program, it is preferable that the rating step comprise: an optimum state determination step of determining the optimum state for at least one of the one or more pieces of frame voice data; an optimum state probability value acquisition step of acquiring the probability value in the optimum state determined in the optimum state determination step; and a rating value calculation step of calculating the rating value of the speech using as a parameter the probability value acquired in the optimum state probability value acquisition step.
(Embodiment 2)

The speech processing apparatus of the present embodiment differs from that of Embodiment 1 in the rating algorithm of the rating unit. In the present embodiment, the rating value is calculated by evaluating, over each pronunciation interval, the probability values of all states in the phoneme that contains the optimum state. The posterior probability calculated by the speech processing apparatus of the present embodiment is called t-p-DAP, in contrast to the DAP of Embodiment 1.

FIG. 11 is a block diagram of the speech processing apparatus of the present embodiment. The speech processing apparatus includes an input receiving unit 101, a teacher data storage unit 102, a voice receiving unit 103, a teacher data formant frequency storage unit 104, a first sampling frequency storage unit 105, a sampling unit 106, an evaluation subject formant frequency acquisition unit 107, an evaluation subject formant frequency storage unit 108, a vocal tract length normalization processing unit 109, a voice processing unit 1110, and an utterance prompting unit 1109.
The voice processing unit 1110 includes the frame classification unit 1101, the frame audio data acquisition unit 1102, a rating unit 11103, and the output unit 1104.
The rating unit 11103 includes the optimum state determination unit 11031, a pronunciation interval frame phoneme probability value acquisition unit 111032, and a rating value calculation unit 111033.
The pronunciation interval frame phoneme probability value acquisition unit 111032 acquires, for each pronunciation interval, one or more probability values for the states of the whole phoneme that contains the optimum state of each frame determined by the optimum state determination unit 11031.

The rating value calculation unit 111033 calculates the rating value of the speech using as parameters the one or more probability values, for each of the one or more pronunciation intervals, acquired by the pronunciation interval frame phoneme probability value acquisition unit 111032. For example, the rating value calculation unit 111033 obtains, for each frame, the sum of the one or more probability values for the states of the whole phoneme that contains the optimum state of that frame determined by the optimum state determination unit 11031; based on these per-frame sums, it obtains one or more time averages of the sum over each pronunciation interval; and it calculates the rating value of the speech using the one or more time averages as parameters.
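
The example calculation above can be sketched as follows, assuming the per-frame state posteriors and the optimum state sequence are already available. The names `tp_dap`, `gamma`, `best_path`, and `state_to_phoneme` are hypothetical, as is the representation of a pronunciation interval as a half-open frame range; the values are toy data.

```python
def tp_dap(gamma, best_path, state_to_phoneme, start, end):
    """Time average, over frames start..end-1, of the summed posteriors of all
    states belonging to the phoneme that contains each frame's optimum state.

    gamma[t][j]: posterior of state j at frame t
    best_path[t]: optimum state of frame t
    state_to_phoneme[j]: phoneme label of state j
    """
    if end <= start:
        raise ValueError("the pronunciation interval must contain at least one frame")
    frame_sums = []
    for t in range(start, end):
        phoneme = state_to_phoneme[best_path[t]]
        frame_sums.append(sum(
            g for j, g in enumerate(gamma[t]) if state_to_phoneme[j] == phoneme))
    return sum(frame_sums) / len(frame_sums)


gamma = [[0.5, 0.3, 0.2], [0.1, 0.2, 0.7]]
best_path = [0, 2]
state_to_phoneme = ["r", "r", "ai"]          # states 0 and 1 belong to phoneme "r"
score = tp_dap(gamma, best_path, state_to_phoneme, 0, 2)
```

Summing over the states of one phoneme means a frame is not penalized merely for being assigned to a neighboring state within the same phoneme.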

The pronunciation interval frame phoneme probability value acquisition unit 111032 and the rating value calculation unit 111033 can usually be realized by an MPU, memory, and the like. Their processing procedures are usually implemented in software, and the software is recorded on a recording medium such as a ROM. However, they may also be implemented in hardware (dedicated circuits).

When the input receiving unit 101 receives an operation start instruction, the utterance prompting unit 1109 prompts the evaluation subject to speak, either so that the second sampling frequency can be calculated or so that the evaluation subject's pronunciation can be rated. Prompting the evaluation subject to speak means, for example, displaying “Please pronounce ~.” on a display or outputting “Please pronounce ~.” as sound from a speaker. The utterance prompting unit 1109 may or may not be considered to include an output device such as the display 344 or a speaker. The utterance prompting unit 1109 can be realized by driver software for an output device, or by driver software for an output device together with the output device.
Next, the operation of the speech processing apparatus will be described with reference to the flowcharts of FIGS. 12 to 14. For the flowchart of FIG. 12 and the others, only the steps that differ from the flowcharts of FIGS. 2 and 3 are described.
(Step S1201) The utterance prompting unit 1109 prompts the utterance used for calculating the second sampling frequency by displaying on the display, for example, “Please utter the vowel "u".”
(Step S1202) The voice receiving unit 103 determines whether speech has been received. If speech has been received, the process goes to step S1203; if not, the process goes to step S213.

(Step S1203) The sampling unit 106 reads the first sampling frequency stored in the first sampling frequency storage unit 105, samples the speech received in step S1202 at the first sampling frequency, and obtains first voice data.

(Step S1204) The vocal tract length normalization processing unit 109 obtains the second sampling frequency from the first voice data obtained in step S1203. This second sampling frequency calculation process is described with reference to the flowchart of FIG. 13.
(Step S1205) The utterance prompting unit 1109 prompts the utterance to be rated by displaying on the display, for example, “Please utter "right".”
(Step S1206) The voice receiving unit 103 determines whether speech has been received. If speech has been received, the process goes to step S1207; if not, the process goes to step S213.

(Step S1207) The sampling unit 106 reads the first sampling frequency stored in the first sampling frequency storage unit 105, samples the speech received in step S1206 at the first sampling frequency, and obtains first voice data.
(Step S1208) The vocal tract length normalization processing unit 109 resamples the first voice data obtained in step S1207 at the second sampling frequency obtained in step S1204, and obtains second voice data.
(Step S1209) The voice processing unit 1110 performs the rating process on the second voice data obtained in step S1208. The details of the rating process are described with reference to the flowchart of FIG. 14. The process then returns to step S1202.
In the flowchart of FIG. 12, the speech used to calculate the second sampling frequency and the speech to be rated may be the same, or one may be contained in the other.

The second sampling frequency calculation process of step S1204 will be described with reference to the flowchart of FIG. 13. In the flowchart of FIG. 13, the processing of steps S301 to S305 in the flowchart of FIG. 3 is performed.
The rating process of step S1209 will be described with reference to the flowchart of FIG. 14.

(Step S1401) The pronunciation interval frame phoneme probability value acquisition unit 111032 calculates the forward likelihoods and backward likelihoods of all states of all frames, and obtains the probability values of all states of all frames. Specifically, the pronunciation interval frame phoneme probability value acquisition unit 111032 calculates, for example, the posterior probability that each feature vector was generated from the state in question. This posterior probability corresponds to the occupation count that appears in the Baum-Welch algorithm for maximum likelihood estimation of HMMs. The Baum-Welch algorithm is a known algorithm, so its description is omitted.
(Step S1402) The pronunciation interval frame phoneme probability value acquisition unit 111032 calculates the optimum state probability values of all frames.
(Step S1403) The pronunciation interval frame phoneme probability value acquisition unit 111032 assigns 1 to j.

(Step S1404) The pronunciation interval frame phoneme probability value acquisition unit 111032 determines whether the j-th pronunciation interval, which is the next pronunciation interval to be rated, exists. If the j-th pronunciation interval exists, the process goes to step S1405 (the original text reads “S1403” here, which would reset j and is taken to be a clerical error); if it does not exist, the process returns to the calling function.
(Step S1405) The pronunciation interval frame phoneme probability value acquisition unit 111032 assigns 1 to the counter k.

(Step S1406) The pronunciation interval frame phoneme probability value acquisition unit 111032 determines whether a k-th frame exists in the j-th pronunciation interval. If the k-th frame exists, the process goes to step S1407; if it does not exist, the process jumps to step S1410.
(Step S1407) The pronunciation interval frame phoneme probability value acquisition unit 111032 acquires all the probability values of the phoneme that contains the optimum state of the k-th frame.
(Step S1408) The rating value calculation unit 111033 calculates the rating value of the speech of one frame using as parameters the one or more probability values acquired in step S1407.
(Step S1409) The pronunciation interval frame phoneme probability value acquisition unit 111032 increments k by 1. The process returns to step S1406.

(Step S1410) The rating value calculation means 111033 calculates the rating value of the j-th pronunciation interval. For example, the rating value calculation means 111033 obtains, for each frame, the sum of the one or more probability values over all states of the phoneme that has the optimum state of that frame as determined by the optimum state determination means 11031. Based on these per-frame sums, it calculates the time average of the sums over the pronunciation interval as the rating value of the speech in that pronunciation interval.
(Step S1411) The output means 1104 outputs the rating value calculated in step S1410.
(Step S1412) The pronunciation interval frame phoneme probability value acquisition means 111032 increments j by 1. The process returns to step S1404.
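The loop of steps S1404 to S1412 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name, the argument layout (a per-frame table of state posteriors, a per-frame optimum state, and a mapping from each state to the states of its phoneme HMM), and the reading of the flow as an outer loop over j with an inner loop over k are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple


def t_p_dap_scores(
    posteriors: Sequence[Sequence[float]],
    optimal_states: Sequence[int],
    phoneme_states: Dict[int, List[int]],
    intervals: Sequence[Tuple[int, int]],
) -> List[float]:
    """Return one time-averaged rating value per pronunciation interval.

    posteriors[t][s]  : posterior probability of state s at frame t (hypothetical layout)
    optimal_states[t] : optimum state determined for frame t
    phoneme_states[s] : all states of the phoneme HMM that contains state s
    intervals         : non-empty (start, end) frame ranges, end exclusive
    """
    scores = []
    for start, end in intervals:              # loop over j (S1404, S1412)
        frame_sums = []
        for t in range(start, end):           # loop over k (S1406, S1409)
            states = phoneme_states[optimal_states[t]]                # S1407
            frame_sums.append(sum(posteriors[t][s] for s in states))  # S1408
        scores.append(sum(frame_sums) / len(frame_sums))              # S1410
    return scores
```

Each returned value corresponds to the rating value that step S1411 would output for one pronunciation interval.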
The specific operation of the speech processing apparatus according to the present embodiment is described below. Because the rating value calculation algorithm of the present embodiment differs from that of Embodiment 1, the description focuses on that operation.

First, the learner (the person to be evaluated) inputs an operation start instruction, which is an instruction to start language learning. The speech processing apparatus accepts the operation start instruction, and the utterance prompting unit 1109 then displays on the screen, for example, 'Please say "u".'

Here too, the learner is assumed to be, for example, a child, as in Embodiment 1, and the teacher who spoke to create the teacher data is assumed to be an adult native speaker. The same applies to the specific examples described in the other embodiments.
The person to be evaluated then utters "u", and the speech processing apparatus obtains the second sampling frequency "32.1 kHz" from that utterance. This processing is the same as the processing described in Embodiment 1.

Next, the utterance prompting unit 1109 displays on the screen, for example, 'Please say "right".' The learner then pronounces the speech to be learned, "right", and the speech receiving unit 103 receives the input of the speech pronounced by the learner.
Next, the sampling unit 106 samples the received speech "right" at the sampling frequency "22.05 kHz" and obtains the first speech data of "right".

Next, the vocal tract length normalization processing unit 109 resamples the first speech data of "right" at the second sampling frequency "32.1 kHz" and obtains the second speech data. The speech processing unit 1110 then processes the second speech data as follows.
First, the frame segmentation means 1101 divides the second speech data of "right" into short-time frames.
Then, the frame speech data acquisition means 1102 performs spectral analysis on the speech data divided by the frame segmentation means 1101 and calculates the feature vector sequence "O = o_1, o_2, ..., o_T".
Next, the pronunciation interval frame phoneme probability value acquisition means 111032 calculates the posterior probability (probability value) of each state of each frame. The probability values can be calculated using Equations 1 and 2 described above.

Next, the optimum state determination means 11031 determines the optimum state of each frame (the optimum state for feature vector o_t) based on each feature vector o_t of the acquired feature vector sequence. That is, the optimum state determination means 11031 obtains the optimum state sequence. The process of calculating the posterior probability (probability value) of each state of each frame and the process of determining the optimum state may be performed in either order.

Next, the pronunciation interval frame phoneme probability value acquisition means 111032 acquires, for each pronunciation interval, all the probability values of the phoneme that contains the optimum state of each frame included in that pronunciation interval. The rating value calculation means 111033 then calculates, for each frame, the sum of all the probability values of the phoneme that contains the optimum state of that frame, and averages these per-frame sums over time within each pronunciation interval to obtain the rating value of each pronunciation interval. Specifically, the rating value calculation means 111033 calculates the rating value using Equation 3. In Equation 3, p-DAP(τ) is a rating value calculated so as to represent, for each frame, the posterior probability (probability value) of the optimal phoneme among all phonemes, and it can be calculated using Equation 4. The t-p-DAP of Equation 3 is a rating value calculated by evaluating, over the pronunciation interval, the probability values of all states in the phoneme that contains the optimum state. In Equation 3, T(q_t^*) is the pronunciation interval to be rated that contains the HMM including the state q_t^*, and |T(q_t^*)| is the length of T(q_t^*). In Equation 4, P(q_t^*) is the set of all state identifiers of the HMM that includes the state q_t^*.
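Equations 3 and 4 themselves are not reproduced in this text. Based only on the verbal definitions above, they presumably take a form like the following. The symbol γ_τ(j), standing for the posterior probability of state j at frame τ obtained from Equations 1 and 2, is an assumed notation introduced for this sketch.

```latex
% Presumed form of Equation 3: time average of p-DAP over the pronunciation interval
\text{t-p-DAP} = \frac{1}{\lvert T(q_t^{*}) \rvert} \sum_{\tau \in T(q_t^{*})} \text{p-DAP}(\tau)

% Presumed form of Equation 4: sum of the posteriors of all states in the phoneme HMM
% that contains the optimum state of frame \tau
\text{p-DAP}(\tau) = \sum_{j \in P(q_\tau^{*})} \gamma_\tau(j)
```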

The table of FIG. 15 shows rating values (also called "t-p-DAP scores") calculated by the rating value calculation means 111033. FIG. 15 shows the rating results for an American man and a Japanese man. "Phoneme" and "Word" indicate the range over which the time average in t-p-DAP is taken. Here, the time average of p-DAP is used instead of DAP. In FIG. 15, the rating values for the American man's pronunciation are higher than those for the Japanese man's pronunciation, which is a good rating result.
The output means 1104 then sequentially outputs the calculated rating value of each pronunciation interval (here, of each phoneme). FIG. 16 shows an example of such output.

According to the present embodiment, the optimum state is obtained using a concatenated HMM, that is, an HMM formed by concatenating HMMs, and the rating value is calculated from it, so the rating value can be obtained at high speed. Consequently, as described in the specific example above, the rating value of each pronunciation interval can be output in real time. In addition, because the posterior probability based on dynamic programming is calculated as the probability value, the rating value can be obtained even faster.

Also, according to the present embodiment, the rating value can be calculated in units of pronunciation intervals. Compared with the state-level DAP of Embodiment 1, the similarity that is actually to be measured (the similarity of a pronunciation interval) can therefore be obtained accurately and stably.

Furthermore, the software that implements the speech processing apparatus of the present embodiment is the following program. That is, this program causes a computer to execute: a sampling step of sampling received speech at a first sampling frequency to acquire first speech data; a vocal tract length normalization processing step of performing sampling processing on the speech received in the speech receiving step at a second sampling frequency, "first sampling frequency / (teacher data formant frequency / formant frequency of the person to be evaluated)", to obtain second speech data; and a speech processing step of processing the second speech data.
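The second sampling frequency defined in the program description above can be illustrated with a short Python sketch. The function name and the formant values used in the example are hypothetical; only the formula itself and the 22.05 kHz first sampling frequency come from the text.

```python
def second_sampling_frequency(
    first_fs_hz: float,
    teacher_formant_hz: float,
    subject_formant_hz: float,
) -> float:
    """first sampling frequency / (teacher formant frequency / subject formant frequency)."""
    if teacher_formant_hz <= 0 or subject_formant_hz <= 0:
        raise ValueError("formant frequencies must be positive")
    return first_fs_hz / (teacher_formant_hz / subject_formant_hz)


# Hypothetical formants: adult teacher 1000 Hz, child learner 1456 Hz.
# With a first sampling frequency of 22.05 kHz this gives roughly 32.1 kHz.
fs2 = second_sampling_frequency(22050.0, 1000.0, 1456.0)
```

A learner whose formant frequency is higher than the teacher's therefore gets a second sampling frequency above the first one, as in the "22.05 kHz" to "32.1 kHz" example of this embodiment.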

In the above program, it is preferable that the rating step comprises: an optimum state determination step of determining the optimum state of the one or more pieces of frame speech data; a pronunciation interval frame phoneme probability value acquisition step of acquiring, for each pronunciation interval, one or more probability values over all states of the phoneme that has the optimum state of each frame determined in the optimum state determination step; and a rating value calculation step of calculating a speech rating value using, as parameters, the one or more probability values acquired for each of the one or more pronunciation intervals in the pronunciation interval frame phoneme probability value acquisition step.
(Embodiment 3)

The present embodiment describes a speech processing apparatus that can rate the similarity between comparison-target speech and input speech with high accuracy. In particular, this speech processing apparatus detects silent intervals and can rate similarity with the silent intervals taken into account.

FIG. 17 is a block diagram of the speech processing apparatus according to the present embodiment. The speech processing apparatus comprises an input receiving unit 101, a teacher data storage unit 102, a speech receiving unit 103, a teacher data formant frequency storage unit 104, a first sampling frequency storage unit 105, a sampling unit 106, an evaluation subject formant frequency acquisition unit 107, an evaluation subject formant frequency storage unit 108, a vocal tract length normalization processing unit 109, a speech processing unit 1710, and an utterance prompting unit 1109.
The speech processing unit 1710 comprises frame segmentation means 1101, frame speech data acquisition means 1102, special speech detection means 17101, rating means 17103, and output means 1104.
The special speech detection means 17101 comprises silence data storage means 171011 and silent interval detection means 171012.
The rating means 17103 comprises optimum state determination means 11031, optimum state probability value acquisition means 11032, and rating value calculation means 171033.

The special speech detection means 17101 detects that special speech has been input, based on the input speech data of each frame. Special speech here includes silence. For example, the special speech detection means 17101 acquires the probability values of the optimum states of the frames in a phoneme interval and, when the sum of the one or more probability values in that phoneme interval is lower than a predetermined value (that is, when it can be judged that the speech is not the expected phoneme), detects that special speech has been input in that phoneme interval. An example of a specific detection algorithm is described later. The special speech detection means 17101 can usually be implemented with an MPU, memory, and the like. Its processing procedure is usually implemented in software, and the software is recorded on a recording medium such as a ROM. However, it may also be implemented in hardware (a dedicated circuit).

The silence data storage means 171011 stores silence data, which is data representing silence and is based on an HMM. The silence data storage means 171011 is preferably a non-volatile recording medium, but it can also be implemented with a volatile recording medium.

The silent interval detection means 171012 detects a silent interval based on the frame speech data acquired by the frame speech data acquisition means 1102 and the silence data in the silence data storage means 171011. The silent interval detection means 171012 may judge that frame speech data belongs to a silent interval when the similarity between the frame speech data acquired by the frame speech data acquisition means 1102 and the silence data is equal to or greater than a predetermined value. Alternatively, it may judge that frame speech data belongs to a silent interval when the probability value acquired by the optimum state probability value acquisition means 11032, described below, is equal to or less than a predetermined value and the similarity between the frame speech data and the silence data is equal to or greater than a predetermined value. The silent interval detection means 171012 can usually be implemented with an MPU, memory, and the like. Its processing procedure is usually implemented in software, and the software is recorded on a recording medium such as a ROM. However, it may also be implemented in hardware (a dedicated circuit).

The rating means 17103 rates the speech received by the speech receiving unit 103 based on the teacher data, the input speech data, and the detection result of the special speech detection means 17101. "Based on the detection result of the special speech detection means 17101" means, for example, ignoring a silent interval when the special speech detection means 17101 has detected silence. It also means, for example, reducing the rating value by a predetermined amount when the special speech detection means 17101 has detected a substitution, an omission, or the like. The rating means 17103 can usually be implemented with an MPU, memory, and the like. Its processing procedure is usually implemented in software, and the software is recorded on a recording medium such as a ROM. However, it may also be implemented in hardware (a dedicated circuit).

The rating value calculation means 171033 calculates the speech rating value using the probability values acquired by the optimum state probability value acquisition means 11032 as parameters, while excluding the silent intervals detected by the silent interval detection means 171012. How the rating value calculation means 171033 uses these probability values to calculate the rating value does not matter. For example, it calculates the speech rating value using as parameters the probability value acquired by the optimum state probability value acquisition means 11032 and the sum of the probability values over all states of the frame corresponding to that probability value. Here, the rating value calculation means 21023 usually calculates a rating value for each frame, excluding the silent intervals detected by the silent interval detection means 171012. The rating value calculation means 171033 need not necessarily exclude silent intervals when calculating the rating value; it may instead calculate the rating value so that the influence of the silent intervals is reduced. The rating value calculation means 171033 can usually be implemented with an MPU, memory, and the like. Its processing procedure is usually implemented in software, and the software is recorded on a recording medium such as a ROM. However, it may also be implemented in hardware (a dedicated circuit).

Next, the operation of the speech processing apparatus is described with reference to the flowcharts of FIGS. 18 and 19. The flowchart of FIG. 18 differs from the flowchart of FIG. 12 only in the rating process of step S1801, so its description is omitted. The details of the rating process of step S1801 are described with reference to the flowchart of FIG. 19.

(Step S1901) The rating means 17103 calculates the DAP rating value. The algorithm for calculating the DAP rating value has already been described in Embodiment 1, so a detailed description is omitted. The DAP rating value is calculated by the processing of steps S211 to S214 in the flowchart of FIG. 2.

(Step S1902) The special speech detection means 17101 determines whether the value calculated in step S1901 is lower than a predetermined value. If it is lower than the predetermined value, the process goes to step S1903; otherwise, the process jumps to step S1906.
(Step S1903) The silent interval detection means 171012 acquires the probability values of the silence data and of all the teacher data.

(Step S1904) The silent interval detection means 171012 determines whether the probability value of the silence data is the highest among the probability values acquired in step S1903. If the probability value of the silence data is the highest (in which case the interval is judged to be a silent interval), the process goes to step S1905; if it is not the highest, the process goes to step S1906.
(Step S1905) The silent interval detection means 171012 increments the counter i by 1. The process returns to step S208.
(Step S1906) The output means 1104 outputs the rating value calculated in step S1901.
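The per-frame decision of steps S1901 to S1906 can be sketched in Python as follows. This is an illustration under assumptions: the DAP scores of step S1901 are taken as given, the probability values of step S1903 are represented as one dictionary per frame keyed by model label, and the label "sil" for the silence data is a hypothetical name.

```python
from typing import Dict, List, Optional, Sequence

SILENCE = "sil"  # hypothetical label for the silence data


def rate_frames(
    dap_scores: Sequence[float],
    model_probs: Sequence[Dict[str, float]],
    threshold: float,
) -> List[Optional[float]]:
    """Return the rating value of each frame; None marks a frame skipped as silence."""
    ratings: List[Optional[float]] = []
    for score, probs in zip(dap_scores, model_probs):
        if score < threshold:                    # S1902: score lower than predetermined value
            best = max(probs, key=probs.get)     # S1903, S1904: which model is most probable
            if best == SILENCE:
                ratings.append(None)             # S1905: silent frame, nothing is output
                continue
        ratings.append(score)                    # S1906: output the rating value
    return ratings
```

A frame with a low score that is not best explained by the silence data keeps its low rating value, which matches the case of a simply poor pronunciation.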

In the flowchart of FIG. 19, the output means 1104 does not output the rating value of an interval judged to be a silent interval (the silent interval is ignored). However, it may instead produce output that explicitly indicates that the interval in which special speech was detected is a silent interval, or that a silent interval exists. When calculating a score for a pronunciation interval or a larger unit, the rating value calculation means 171033 preferably ignores the rating values of silent intervals, but it may instead, for example, reduce their influence to 1/10 when calculating the score of the pronunciation interval or of the whole utterance. It suffices that the rating means 17103 rates the speech received by the speech receiving unit 103 based on the teacher data, the input speech data, and the detection result of the special speech detection means 17101.
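One way to read "ignore the silent interval" versus "reduce its influence to 1/10" is as a weighted average over frames. The following Python sketch assumes that reading; the text does not specify how the reduced influence is applied, and the function and parameter names are hypothetical.

```python
from typing import Sequence


def interval_score(
    frame_scores: Sequence[float],
    is_silent: Sequence[bool],
    silence_weight: float = 0.0,
) -> float:
    """Weighted mean of frame scores; silent frames get silence_weight, others 1.0.

    silence_weight=0.0 ignores silent frames entirely,
    silence_weight=0.1 reduces their influence to 1/10 (assumed interpretation).
    """
    weights = [silence_weight if silent else 1.0 for silent in is_silent]
    total = sum(weights)
    if total == 0:
        raise ValueError("no frame contributes to the score")
    return sum(w * s for w, s in zip(weights, frame_scores)) / total
```

With the default weight of 0.0 the score is simply the mean over the non-silent frames.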

In the flowchart of FIG. 19, the special speech detection means 17101 detects special speech based on the DAP score of the i-th frame speech data, but it may instead detect special speech based on, for example, the t-p-DAP score described in Embodiment 2.

The specific operation of the speech processing apparatus according to the present embodiment is described below. In the present embodiment, the rating value is calculated with silent intervals taken into account, so the rating value calculation algorithm differs from those of Embodiments 1 and 2. The description therefore focuses on the processing that differs.
First, the learner (the person to be evaluated) inputs an operation start instruction, which is an instruction to start language learning. The speech processing apparatus accepts the operation start instruction and then displays on the screen, for example, 'Please say "u".'
The person to be evaluated then utters "u", and the speech processing apparatus obtains the second sampling frequency "32.1 kHz" from that utterance. This processing is the same as the processing described in Embodiment 1 and elsewhere.

Next, the vocal tract length normalization processing unit 109 resamples the first speech data of "right" at the second sampling frequency "32.1 kHz" and obtains the second speech data. The speech processing unit 1710 then processes the second speech data as follows.
First, the frame segmentation means 1101 divides the second speech data of "right" into short-time frames.
Then, the frame speech data acquisition means 1102 performs spectral analysis on the speech data divided by the frame segmentation means 1101 and calculates the feature vector sequence "O = o_1, o_2, ..., o_T".
Next, the optimum state determination means 11031 determines the optimum state of a given frame (the optimum state for feature vector o_t) based on each feature vector o_t of the acquired feature vector sequence.
Next, the optimum state probability value acquisition means 11032 calculates the probability value of the optimum state using Equations 1 and 2 described above.

Next, the rating value calculation means 171033 acquires, for example, one or more probability values over all states of the phoneme that has the optimum state determined by the optimum state determination means 11031, and calculates the speech rating value using the sum of those probability values as a parameter. That is, the rating value calculation means 171033 calculates, for example, a DAP score for each frame.

The special speech detection means 17101 then uses the calculated rating value (DAP score) of each frame to determine whether special speech has been input. Specifically, the special speech detection means 17101 judges that special speech has been input if, for example, the rating value calculated for the frame under evaluation is lower than a predetermined value. The special speech detection means 17101 need not conclude immediately that special speech has been input merely because the rating value of a single frame is small. That is, when a predetermined number or more of frames with small rating values occur consecutively, it may judge that the interval corresponding to that run of frames is an interval in which special speech was input.
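The run-based variant described above, in which only a sufficiently long run of low-scoring frames counts as special speech, can be sketched in Python as follows. The function name and the parameters threshold and min_run are hypothetical; the text only speaks of "a predetermined value" and "a predetermined number".

```python
from typing import List, Sequence, Tuple


def low_score_runs(
    scores: Sequence[float],
    threshold: float,
    min_run: int,
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, where at least
    min_run consecutive frames have a score below threshold."""
    runs: List[Tuple[int, int]] = []
    start = None
    for i, score in enumerate(scores):
        if score < threshold:
            if start is None:
                start = i                      # a low-score run begins here
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))        # run is long enough to count
            start = None
    # Handle a run that continues up to the last frame.
    if start is not None and len(scores) - start >= min_run:
        runs.append((start, len(scores)))
    return runs
```

Each returned range is a candidate special speech interval that can then be compared with the silence data.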

FIG. 20 illustrates the case where the special speech detection means 17101 detects special speech. In FIG. 20(a), the vertical axis is the DAP score and the horizontal axis is the frame, and (V) denotes the Viterbi alignment. In FIG. 20(a), the DAP scores in the shaded group of frames are lower than the predetermined value, so that group is judged to be a special speech interval.

Next, when it is judged that special speech has been input, the silent interval detection means 171012 acquires the silence data from the silence data storage means 171011 and calculates the similarity between the frame group and the silence data. If the similarity is equal to or greater than a predetermined value, it judges that the speech data corresponding to the frame group is silence data. FIG. 20(b) shows that, as a result of the comparison with the silence data, the posterior probability value ("DAP score") indicating the similarity to the silence data is high. As a result, the silent interval detection means 171012 judges that the special speech interval is a silent interval. Conversely, when the DAP scores of the shaded frame group in FIG. 20(a) are lower than the predetermined value, so that the group is judged to be a special speech interval, and the comparison with the silence data also yields a low DAP score, the interval is judged not to be a silent interval. In such an interval, for example, the pronunciation is simply poor, and a low rating value is output. As shown in FIG. 20(a), in a silent interval the scores are usually low in the latter half of the final phoneme of the first word ("word1") and in the first half of the first phoneme of the second word ("word2") that follows it.
The output means 1104 then ignores the rating values of the silence data interval so that they are not reflected in the rating values to be output.
The output means 1104 then outputs the rating value corresponding to each frame. In this case, for example, the rating values of the silence data interval are not output.
Examples of how such rating values are output are shown in FIGS. 9 and 10.
The output produced by the output means 1104 may also be an output that merely indicates the existence of a silent interval.

As described above, according to the present embodiment, a similarity (rating value) indicating how closely the pronunciation input by the user resembles the teacher data can be calculated and output. In this case, according to the present embodiment, a highly accurate rating can be obtained that is not affected by individual differences, in particular differences in vocal tract length. Furthermore, since the speech processing apparatus rates the similarity while taking silent sections into account, a very accurate rating result is obtained.

Note that it is preferable to calculate the rating result while ignoring the data of the silent section. However, in the present embodiment, the rating value may also be output by taking the data of the silent section into account in some way other than ignoring it, for example by reducing the influence of the silent section on the rating relative to other sections.
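As one possible reading of "reducing the influence of the silent section", the frame scores could be combined with a weighted mean in which silent frames receive a smaller weight. The function name and the weight value below are hypothetical; the description does not prescribe a particular weighting.

```python
def weighted_rating(scores, is_silent, silence_weight=0.2):
    """Weighted mean of frame scores; silent frames count with a reduced weight.

    silence_weight=0.0 ignores silent frames entirely,
    silence_weight=1.0 treats them like any other frame.
    """
    weights = [silence_weight if silent else 1.0 for silent in is_silent]
    total = sum(weights)
    if total == 0:
        raise ValueError("no frames with non-zero weight")
    return sum(w * x for w, x in zip(weights, scores)) / total
```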

In the specific example of the present embodiment, the rating value is calculated using the DAP score. However, it suffices that the rating value is calculated in consideration of silent sections; it may be calculated by the other algorithms described above (such as t-p-DAP) or by algorithms not described in this specification. In other words, according to the present embodiment, it suffices that the speech received by the speech reception unit is rated on the basis of the teacher data, the input speech data, and the detection result of the special speech detection means, and in particular that the rating value is calculated in consideration of the silence data.
In the present embodiment, a section with a low DAP score is detected first, and the silent section is then detected. However, a silent section may be detected by comparison with the silence data without first detecting a section with a low DAP score.

In the above program, it is preferable that the speech processing step comprises: a frame division step of dividing the second speech data into frames; a frame speech data acquisition step of obtaining one or more pieces of frame speech data, each being the speech data of one of the divided frames; a special speech detection step of detecting that special speech has been input, on the basis of the input speech data for each frame; a rating step of rating the received speech on the basis of the teacher data, the input speech data, and the detection result of the special speech detection step; and an output step of outputting the rating result of the rating step.
In the above program, it is also preferable that the special speech detection step detects a silent section on the basis of the input speech data and silence data, which is data based on an HMM representing silence.
(Embodiment 4)

In the present embodiment, a speech processing apparatus is described that detects special speech in the input speech and can accurately rate the similarity between the speech to be compared and the input speech. In particular, this speech processing apparatus can detect the insertion of a phoneme.

FIG. 21 is a block diagram of the speech processing apparatus according to the present embodiment. The speech processing apparatus includes an input reception unit 101, a teacher data storage unit 102, a speech reception unit 103, a teacher data formant frequency storage unit 104, a first sampling frequency storage unit 105, a sampling unit 106, an evaluation target person formant frequency acquisition unit 107, an evaluation target person formant frequency storage unit 108, a vocal tract length normalization processing unit 109, a speech processing unit 2110, and an utterance prompting unit 1109.

The speech processing unit 2110 includes frame division means 1101, frame speech data acquisition means 1102, special speech detection means 21101, rating means 21103, and output means 21104. The rating means 21103 includes optimal state determination means 11031 and optimal state probability value acquisition means 11032.

The special speech detection means 21101 detects that the rating values of the latter half of one phoneme and of the first half of the phoneme following it satisfy a predetermined condition. The lengths of the latter half and the first half are not restricted. The special speech detection means 21101 can usually be realized by an MPU, a memory, and the like. Its processing procedure is usually implemented in software, and the software is recorded on a recording medium such as a ROM. However, it may also be implemented in hardware (a dedicated circuit).

When the special speech detection means 21101 detects that the predetermined condition is satisfied, the rating means 21103 constructs a rating result indicating at least that a phoneme has been inserted. Using the algorithm described in Embodiment 3, the rating means 21103 may determine whether silence has been inserted in the section in which the special speech detection means 21101 detected that the predetermined condition is satisfied and, if silence has not been inserted, detect that another phoneme has been inserted. When silence has not been inserted, the rating means 21103 may also calculate probability values for the other phoneme HMMs and obtain a rating result stating that the phoneme whose probability value exceeds a predetermined value has been inserted. The detection of a silent section described in Embodiment 3 can also be regarded as the detection of the insertion of a silence phoneme. The rating means 21103 can usually be realized by an MPU, a memory, and the like. Its processing procedure is usually implemented in software, and the software is recorded on a recording medium such as a ROM. However, it may also be implemented in hardware (a dedicated circuit).

The output means 21104 outputs the rating result of the rating means 21103. The rating result here includes a rating result indicating that a phoneme has been inserted. The rating result may instead consist only of a rating value (score) that has been reduced by a predetermined amount because a phoneme was inserted, or it may include both the indication that a phoneme has been inserted and the rating value (score). When the insertion of a phoneme not anticipated in the teacher data is detected, the rating value is usually low. Here, output is a concept that includes display on a display, printing by a printer, sound output, transmission to an external device, storage on a recording medium, and the like. The output means 21104 may or may not be considered to include an output device such as a display or a speaker. The output means 21104 can be realized by driver software for an output device, or by driver software for an output device together with the output device.

Next, the operation of the speech processing apparatus is described with reference to the flowcharts of FIGS. 22 and 23. The flowchart of FIG. 22 differs from the flowchart of FIG. 12 only in the rating process of step S2201, so its description is omitted. The rating process of step S2201 is described in detail with reference to the flowchart of FIG. 23. In the flowchart of FIG. 23, description of processing that is the same as in the flowcharts of FIGS. 2 and 19 is omitted.

(Step S2301) The special speech detection means 21101 determines whether data is stored in the buffer that temporarily accumulates data corresponding to frames. The stored data is frame speech data that was given a rating value lower than the predetermined value in step S1902, or data obtainable from that frame speech data. If data is stored, the process goes to step S2307; if not, it returns to the calling function.

(Step S2302) The special speech detection means 21101 determines whether data is stored in the buffer. If data is stored, the process goes to step S2307; if not, it goes to step S2303.
(Step S2303) The output means 21104 outputs the rating value calculated in step S1901.
(Step S2304) The special speech detection means 21101 increments the counter i by 1. The process returns to step S208.
(Step S2305) The special speech detection means 21101 temporarily stores in the buffer the frame speech data given a rating value lower than the predetermined value, or data obtainable from that frame speech data.
(Step S2306) The special speech detection means 21101 increments the counter i by 1. The process returns to step S208.
(Step S2307) The special speech detection means 21101 assigns 1 to the counter j.

(Step S2308) The special speech detection means 21101 determines whether the j-th data exists in the buffer. If the j-th data exists, the process goes to step S2309; if not, it jumps to step S2315.
(Step S2309) The special speech detection means 21101 acquires the phoneme of the optimal state corresponding to the j-th data.
(Step S2310) The special speech detection means 21101 calculates the probability values of all the teacher data for the j-th data and acquires the phoneme with the maximum probability value.

(Step S2311) The special speech detection means 21101 determines whether the phoneme acquired in step S2309 and the phoneme acquired in step S2310 are different phonemes. If they differ, the process goes to step S2312; if not, it jumps to step S2314.
(Step S2312) The rating means 21103 constructs a rating result indicating that a phoneme has been inserted.
(Step S2313) The special speech detection means 21101 increments the counter j by 1. The process returns to step S2308.
(Step S2314) The output means 21104 outputs all the rating values corresponding to all the data in the buffer. Here, the rating values are, for example, the DAP scores of the individual frames. The process goes to step S2313.

(Step S2315) The output means 21104 determines whether the rating result contains the information indicating an insertion. If it does, the process goes to step S2316; if not, it goes to step S2317.
(Step S2316) The output means 21104 outputs the rating result.
(Step S2317) The output means 21104 clears the buffer. The process returns to step S208.
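The buffer loop of steps S2307 to S2317 can be sketched roughly as follows. This is a simplified illustration rather than the flowchart itself: the layout of a buffer entry (the keys `expected`, `score`, and `posteriors`) is a hypothetical data structure, and the per-frame outputs of step S2314 are collapsed into a single returned list.

```python
from typing import Dict, List


def detect_insertion(buffer: List[dict]) -> dict:
    """Scan buffered low-score frames and report whether a phoneme was inserted.

    Each buffer entry is a dict with the (assumed) keys:
      'expected'   : phoneme of the optimal state for the frame (cf. S2309)
      'score'      : rating value of the frame, e.g. its DAP score
      'posteriors' : mapping phoneme -> probability value for the frame
    """
    inserted = False
    for frame in buffer:                                  # S2308: walk the buffer
        expected = frame["expected"]                      # S2309
        posteriors: Dict[str, float] = frame["posteriors"]
        best = max(posteriors, key=posteriors.get)        # S2310: most probable phoneme
        if best != expected:                              # S2311: phonemes differ
            inserted = True                               # S2312: record the insertion
    result = {"scores": [f["score"] for f in buffer], "insertion": inserted}
    buffer.clear()                                        # S2317: clear the buffer
    return result
```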

Note that, in the flowchart of FIG. 23, a phoneme is determined to have been inserted if frames with low rating values extend across two phonemes. That is, a phoneme is determined to have been inserted when the rating values of the latter half of one phoneme (at least its final frame) and of the first frame of the next phoneme are lower than a predetermined value. Alternatively, a phoneme may be determined to have been inserted when the rating values are all lower than the predetermined value over a latter part of one phoneme of at least a predetermined length and over a first part of the next phoneme of at least a predetermined length.
A specific operation of the speech processing apparatus according to the present embodiment is described below. The present embodiment differs from Embodiment 3 and the like in the processing that detects the insertion of a phoneme, so the description focuses on that processing.
First, the learner (the person to be evaluated) inputs an operation start instruction, which is an instruction to start language learning. The speech processing apparatus receives the operation start instruction and then displays on the screen, for example, 'Please say "a".'
The learner then utters "a", and the speech processing apparatus obtains the second sampling frequency "32.1 kHz" from the utterance. This processing is the same as that described in Embodiment 1 and the like.

Next, the vocal tract length normalization processing unit 109 resamples the first speech data of "right" at the second sampling frequency "32.1 kHz" and thereby obtains second speech data. Next, the speech processing unit 2110 processes the second speech data as follows.
First, the frame division means 1101 divides the second speech data of "right" into short-time frames.
Then, the frame speech data acquisition means 1102 performs spectral analysis on the speech data divided by the frame division means 1101 and calculates a feature vector sequence "O = o_1, o_2, ..., o_T".
Next, the optimal state determination means 11031 of the rating means 21103 determines the optimal state of a given frame (the optimal state for the feature vector o_t) on the basis of each feature vector o_t constituting the acquired feature vector sequence.
Next, the optimal state probability value acquisition means 11032 calculates the probability value in the optimal state using Formulas 1 and 2 described above.
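For orientation, the relation stated in the abstract (second sampling frequency = first sampling frequency / (teacher formant frequency / formant frequency of the person to be evaluated)) and the division into short-time frames can be sketched as follows. The function names, the frame length, and the hop size are illustrative assumptions; the formant values that yield "32.1 kHz" are not given in this passage, so none are assumed here.

```python
def second_sampling_frequency(first_fs_hz, teacher_formant_hz, learner_formant_hz):
    """Second sampling frequency per the relation in the abstract."""
    if teacher_formant_hz <= 0 or learner_formant_hz <= 0:
        raise ValueError("formant frequencies must be positive")
    return first_fs_hz / (teacher_formant_hz / learner_formant_hz)


def split_into_frames(samples, frame_len, hop):
    """Divide a sample sequence into overlapping short-time frames.

    frame_len and hop are counted in samples (hypothetical parameters);
    a trailing partial frame is dropped.
    """
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]
```

With equal formant frequencies the ratio is 1 and the sampling frequency is unchanged; a learner formant higher than the teacher formant raises the second sampling frequency.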

Next, the rating means 21103 acquires, for example, the one or more probability values of all the states of the phoneme that contains the optimal state determined by the optimal state determination means 11031, and calculates the rating value of the speech using the sum of those probability values as a parameter. That is, the rating means 21103 calculates, for example, a DAP score for each frame. The score calculated here may instead be the t-p-DAP score described above, or the like.
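A per-frame score of this kind can be sketched as the summed posterior of all states belonging to the phoneme that contains the optimal state. Formulas 1 and 2 are not reproduced in this passage, so the sketch below is an assumption-laden illustration, not the patent's formula: it normalizes state likelihoods with equal state priors, and the names `state_likelihoods` and `state_to_phoneme` are hypothetical.

```python
def frame_dap(state_likelihoods, state_to_phoneme, optimal_state):
    """Summed posterior of the states of the phoneme containing optimal_state.

    state_likelihoods : mapping state -> p(o_t | state) for one frame
    state_to_phoneme  : mapping state -> phoneme the state belongs to
    Equal state priors are assumed, so posteriors are normalized likelihoods.
    """
    total = sum(state_likelihoods.values())
    if total <= 0:
        raise ValueError("likelihoods must sum to a positive value")
    target = state_to_phoneme[optimal_state]
    numerator = sum(p for s, p in state_likelihoods.items()
                    if state_to_phoneme[s] == target)
    return numerator / total
```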

Then, the special speech detection means 21101 uses the calculated rating values of the frames to determine whether special speech has been input. That is, it determines whether there is a section in which the rating value (for example, the DAP score) is lower than a predetermined value.

Next, as shown in FIG. 24, the special speech detection means 21101 determines whether the section in which the rating value (for example, the DAP score) is lower than the predetermined value straddles two phonemes and, if it does, determines that a phoneme has been inserted in that section. An example of the detailed algorithm for this case was described with reference to FIG. 23. In FIG. 24, the hatched portion is the section in which the unexpected phoneme has been inserted.
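The straddling test can be illustrated with a short sketch. It assumes that each frame carries the index of the aligned phoneme segment it belongs to (`segment_ids`, a hypothetical input, for example derived from a Viterbi alignment) and checks whether any run of consecutive low-score frames covers two different segments.

```python
def straddles_two_phonemes(scores, segment_ids, thr):
    """True if a run of consecutive frames with score < thr spans two phoneme segments.

    scores      : per-frame rating values (e.g. DAP scores)
    segment_ids : per-frame index of the aligned phoneme segment (assumed input)
    thr         : the predetermined value below which a frame counts as low
    """
    run_segments = set()
    for score, seg in zip(scores, segment_ids):
        if score < thr:
            run_segments.add(seg)
            if len(run_segments) >= 2:
                return True
        else:
            run_segments = set()  # the low-score run ended; start over
    return False
```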

Next, the rating means 21103 constructs a rating result indicating that a phoneme has been inserted (for example, "An unexpected phoneme has been inserted."). When an unexpected phoneme has been inserted, it is preferable that the rating means 21103 calculates the rating value reduced by, for example, a predetermined amount. The output means 21104 then outputs the constructed rating result (which may include the rating value). FIG. 25 shows an example of the output of the rating result. For normal input speech, the output means 21104 preferably outputs the rating value as described above.

As described above, according to the present embodiment, a similarity (rating value) indicating how closely the pronunciation input by the user resembles the teacher data can be calculated and output. In this case, according to the present embodiment, a highly accurate rating can be obtained that is not affected by individual differences, in particular differences in vocal tract length. Furthermore, since the speech processing apparatus can detect special speech, in particular the unexpected insertion of a phoneme, a highly accurate rating result is obtained.

In the present embodiment, it suffices that the insertion of a phoneme can be detected; the algorithm for calculating the rating value is not restricted. It may be one of the algorithms described above (DAP, t-p-DAP) or another algorithm not described in this specification.

In the above program, it is preferable that the speech processing step comprises: a frame division step of dividing the second speech data into frames; a frame speech data acquisition step of obtaining one or more pieces of frame speech data, each being the speech data of one of the divided frames; a special speech detection step of detecting that special speech has been input, on the basis of the input speech data for each frame; a rating step of rating the received speech on the basis of the teacher data, the input speech data, and the detection result of the special speech detection step; and an output step of outputting the rating result of the rating step.
In the above program, it is also preferable that the special speech detection step detects that the rating values of the latter half of one phoneme and of the first half of the phoneme following it satisfy a predetermined condition.
(Embodiment 5)

In the present embodiment, a speech processing apparatus is described that detects special speech in the input speech and can accurately rate the similarity between the speech to be compared and the input speech. In particular, this speech processing apparatus can detect the substitution of a phoneme.

FIG. 26 is a block diagram of the speech processing apparatus according to the present embodiment. The speech processing apparatus includes an input reception unit 101, a teacher data storage unit 102, a speech reception unit 103, a teacher data formant frequency storage unit 104, a first sampling frequency storage unit 105, a sampling unit 106, an evaluation target person formant frequency acquisition unit 107, an evaluation target person formant frequency storage unit 108, a vocal tract length normalization processing unit 109, a speech processing unit 2610, and an utterance prompting unit 1109.

The speech processing unit 2610 includes frame division means 1101, frame speech data acquisition means 1102, special speech detection means 26101, rating means 26103, and output means 21104. The rating means 26103 includes optimal state determination means 11031 and optimal state probability value acquisition means 11032.

The speech processing unit 2610 processes the second speech data. The speech processing unit 2610 includes the special speech detection means 26101, which detects that special speech has been input on the basis of the input speech data for each frame. The speech processing unit 2610 can usually be realized by an MPU, a memory, and the like. Its processing procedure is usually implemented in software, and the software is recorded on a recording medium such as a ROM. However, it may also be implemented in hardware (a dedicated circuit).

The special speech detection means 26101 detects that the rating value of one phoneme is lower than a predetermined value. The special speech detection means 26101 may also detect that the rating value of one phoneme is lower than the predetermined value while the rating values of the phoneme immediately before it and the phoneme immediately after it are higher than the predetermined value. It may also detect that the rating value of one phoneme is lower than the predetermined value while a rating value calculated on the basis of the HMM of a phoneme that is not expected is higher than the predetermined value. In other words, it suffices that the special speech detection means 26101 can detect a phoneme substitution by some predetermined algorithm, and various algorithms are conceivable. The special speech detection means 26101 can usually be realized by an MPU, a memory, and the like. Its processing procedure is usually implemented in software, and the software is recorded on a recording medium such as a ROM. However, it may also be implemented in hardware (a dedicated circuit).

When the special speech detection means 26101 detects that the predetermined condition is satisfied, the rating means 26103 constructs a rating result indicating at least that a phoneme has been substituted. When a phoneme has been substituted, the rating means 26103 may calculate a rating value (score) reduced by a predetermined amount. The rating means 26103 can usually be realized by an MPU, a memory, and the like. Its processing procedure is usually implemented in software, and the software is recorded on a recording medium such as a ROM. However, it may also be implemented in hardware (a dedicated circuit).

Next, the operation of the speech processing apparatus is described with reference to the flowcharts of FIGS. 27 and 28. The flowchart of FIG. 27 differs from the flowchart of FIG. 12 only in the rating process of step S2701, so its description is omitted. The rating process of step S2701 is described in detail with reference to the flowchart of FIG. 28. In the flowchart of FIG. 28, description of processing that is the same as in the flowcharts of FIGS. 2, 19, and 23 is omitted.

(Step S2801) The special speech detection means 26101 determines whether the group of frame speech data corresponding to the data accumulated in the buffer corresponds to a single phoneme. If it does, the process goes to step S2802; if not, it goes to step S2810.

(Step S2802) The special speech detection means 26101 calculates the rating value of the phoneme immediately preceding the phoneme of the frame speech data group corresponding to the data accumulated in the buffer. This rating value is, for example, the DAP score described above. The immediately preceding phoneme is the phoneme just before the phoneme currently being rated. Phoneme boundaries can be calculated by the Viterbi algorithm.

(Step S2803) The special speech detection means 26101 determines whether the rating value calculated in step S2802 is equal to or greater than a predetermined value. If it is, the process goes to step S2804; if it is smaller, the process goes to step S2810.
(Step S2804) The special speech detection means 26101 calculates the rating value of the immediately following phoneme. This rating value is, for example, the DAP score described above. The immediately following phoneme is the phoneme just after the phoneme currently being rated.

(Step S2805) The special speech detection means 26101 determines whether the rating value calculated in step S2804 is equal to or greater than a predetermined value. If it is, the process goes to step S2806; if it is smaller, the process goes to step S2810.

（ステップＳ２８０６）特殊音声検知手段２６１０１は、予め格納されている音韻ＨＭＭ（予期する音韻のＨＭＭは除く）の中で、所定の値以上の評定値が得られる音韻ＨＭＭが一つ存在するか否かを判断する。所定の値以上の評定値が得られる音韻ＨＭＭが存在すればステップＳ２８０７に行き、所定の値以上の評定値が得られる音韻ＨＭＭが存在しなければステップＳ２８１０に行く。なお、予め格納されている音韻ＨＭＭは、通常、すべての音韻に対する多数の音韻ＨＭＭである。なお、本ステップにおいて、予め格納されている音韻ＨＭＭの確率値を算出し、最大の確率値を持つ音素を取得し、当該音素と最適状態の音素が異なるか否かを判断し、異なる場合に音素の置換があったと判断しても良い。
（ステップＳ２８０７）評定手段２６１０３は、音素の置換があった旨を示す評定結果を構成する。
（ステップＳ２８０８）出力手段２１１０４は、ステップＳ２８０７で構成した評定結果を出力する。
（ステップＳ２８０９）出力手段２１１０４は、バッファをクリアする。ステップＳ２０８に戻る。
（ステップＳ２８１０）出力手段２１１０４は、バッファ中の全データに対応する全評定値を出力する。ステップＳ２８０９に行く。
以下、本実施の形態における音声処理装置の具体的な動作について説明する。本実施の形態において、音素の置換の検知を行う処理が実施の形態４等とは異なる。そこで、その異なる処理を中心に説明する。 (Step S2806) The special speech detection means 26101 determines whether or not there is one phoneme HMM that can obtain a rating value equal to or higher than a predetermined value among the phoneme HMMs stored in advance (except for the expected phoneme HMM). Determine whether. If there is a phoneme HMM that obtains a rating value greater than or equal to a predetermined value, the process proceeds to step S2807, and if there is no phoneme HMM that yields a rating value greater than or equal to the predetermined value, the process proceeds to step S2810. The phoneme HMM stored in advance is usually a large number of phoneme HMMs for all phonemes. In this step, the probability value of the phoneme HMM stored in advance is calculated, the phoneme having the maximum probability value is obtained, it is determined whether or not the phoneme in the optimum state is different from the phoneme in the optimum state. It may be determined that the phoneme has been replaced.
(Step S2807) The rating means 26103 constructs a rating result indicating that a phoneme has been replaced.
(Step S2808) The output means 21104 outputs the rating result constructed in step S2807.
(Step S2809) The output means 21104 clears the buffer. The process returns to step S208.
(Step S2810) The output means 21104 outputs all rating values corresponding to all data in the buffer. The process proceeds to step S2809.
Hereinafter, a specific operation of the speech processing apparatus according to the present embodiment will be described. In the present embodiment, the processing for detecting phoneme replacement is different from that in the fourth embodiment. Therefore, the different processing will be mainly described.

まず、学習者（評価対象者）が、語学学習の開始の指示である動作開始指示を入力する。そして、音声処理装置は、当該動作開始指示を受け付け、次に、例えば、「"う"と発声してください。」と画面出力する。
そして、評価対象者は、"う"と発声し、音声処理装置は、当該発声から、第二サンプリング周波数「３２．１ＫＨｚ」を得る。かかる処理は、実施の形態１等において説明した処理と同様である。 First, a learner (the evaluation target person) inputs an operation start instruction, which is an instruction to start language learning. The speech processing apparatus accepts the operation start instruction and then outputs to the screen, for example, “Please say ‘u’.”
Then, the evaluation target person utters “u”, and the speech processing apparatus obtains the second sampling frequency “32.1 KHz” from that utterance. This processing is the same as the processing described in Embodiment 1 and elsewhere.
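The abstract defines the second sampling frequency as the first sampling frequency divided by the ratio (teacher-data formant frequency / evaluation target person's formant frequency). The following is a minimal sketch of that arithmetic only. The function name and the numeric inputs are hypothetical illustration values, not figures from this description, and they are not the inputs that yield 32.1 KHz.

```python
def second_sampling_frequency(first_fs_hz, teacher_formant_hz, evaluee_formant_hz):
    """Return first_fs / (teacher_formant / evaluee_formant), in Hz."""
    if first_fs_hz <= 0 or teacher_formant_hz <= 0 or evaluee_formant_hz <= 0:
        raise ValueError("all frequencies must be positive")
    # Algebraically identical to first_fs / (teacher / evaluee); multiplying
    # before dividing avoids an intermediate rounding step.
    return first_fs_hz * evaluee_formant_hz / teacher_formant_hz


# Hypothetical values: 16 kHz first sampling frequency, teacher formant
# 1000 Hz, evaluation target person's formant 1250 Hz.
print(second_sampling_frequency(16000, 1000, 1250))  # 20000.0
```

When the two formant frequencies are equal, the function returns the first sampling frequency unchanged, which corresponds to a speaker whose vocal tract matches the teacher data.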

次に、声道長正規化処理部１０９は、「ｒｉｇｈｔ」の第一音声データを第二サンプリング周波数「３２．１ＫＨｚ」でリサンプリング処理する。そして、声道長正規化処理部１０９は、第二音声データを得る。次に、音声処理部２６１０は、第二音声データを、以下のように処理する。
まず、フレーム区分手段１１０１は、「ｒｉｇｈｔ」の第二音声データを、短時間フレームに区分する。
そして、フレーム音声データ取得手段１１０２は、フレーム区分手段１１０１が区分した音声データを、スペクトル分析し、特徴ベクトル系列「Ｏ＝ｏ_１，ｏ_２，・・・，ｏ_Ｔ」を算出する。
次に、評定手段２６１０３の最適状態決定手段１１０３１は、取得した特徴ベクトル系列を構成する各特徴ベクトルｏ_ｔに基づいて、所定のフレームの最適状態（特徴ベクトルｏ_ｔに対する最適状態）を決定する。
次に、最適状態確率値取得手段１１０３２は、上述した数式１、２により、最適状態における確率値を算出する。 Next, the vocal tract length normalization processing unit 109 resamples the first voice data of “right” at the second sampling frequency “32.1 KHz”. Then, the vocal tract length normalization processing unit 109 obtains second voice data. Next, the audio processing unit 2610 processes the second audio data as follows.
First, the frame segmentation unit 1101 segments the second audio data “right” into short frames.
Then, the frame audio data acquisition unit 1102 performs spectrum analysis on the audio data segmented by the frame segmentation unit 1101 and calculates the feature vector sequence “O = o_1, o_2, ..., o_T”.
Next, the optimal state determination unit 11031 of the rating unit 26103 determines the optimal state of a given frame (the optimal state for the feature vector o_t) based on each feature vector o_t constituting the obtained feature vector sequence.
Next, the optimum state probability value acquisition unit 11032 calculates the probability value in the optimum state according to the above-described formulas 1 and 2.
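As a rough illustration of the frame segmentation step, the sketch below splits a sample sequence into fixed-length, possibly overlapping frames. The frame length and hop size are assumptions, since this passage does not state them, and the spectrum analysis that turns each frame into a feature vector o_t is not shown.

```python
def split_into_frames(samples, frame_len, hop_len):
    """Split a sequence of samples into short frames.

    frame_len and hop_len are in samples (hypothetical parameters).
    Trailing samples that do not fill a whole frame are dropped.
    """
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame_len and hop_len must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += hop_len
    return frames


# Ten dummy samples, frames of 4 samples advanced by 2 samples.
frames = split_into_frames(range(10), frame_len=4, hop_len=2)
print(len(frames))   # 4
print(frames[-1])    # [6, 7, 8, 9]
```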

次に、評定手段２６１０３は、例えば、最適状態決定手段１１０３１が決定した最適状態を有する音韻全体の状態における１以上の確率値を取得し、当該１以上の確率値の総和をパラメータとして音声の評定値を算出する。つまり、評定手段２６１０３は、例えば、ＤＡＰスコアをフレーム毎に算出する。ここで、算出するスコアは、上述したｔ−ｐ−ＤＡＰスコア等でも良い。 Next, the rating unit 26103 acquires, for example, one or more probability values over all states of the phoneme that contains the optimal state determined by the optimal state determination unit 11031, and calculates the rating value of the speech using the sum of those probability values as a parameter. That is, the rating unit 26103 calculates, for example, a DAP score for each frame. The score calculated here may instead be the t-p-DAP score described above, or the like.
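The per-frame computation described here can be sketched as follows. This is not the patent's formula, which is defined elsewhere in the specification. The sketch assumes that the summed probability values of the states belonging to the phoneme with the optimal state are normalized by the sum over all states, and the state names and numbers are hypothetical.

```python
def phoneme_sum_score(frame_likelihoods, state_to_phoneme, optimal_state):
    """Score one frame from per-state probability values.

    frame_likelihoods: dict mapping state id -> probability value for this frame.
    state_to_phoneme:  dict mapping state id -> phoneme label.
    optimal_state:     the state chosen as optimal for this frame.
    Assumption: the sum over the optimal phoneme's states is divided by the
    sum over all states to give a value between 0 and 1.
    """
    target = state_to_phoneme[optimal_state]
    numerator = sum(p for s, p in frame_likelihoods.items()
                    if state_to_phoneme[s] == target)
    total = sum(frame_likelihoods.values())
    if total <= 0:
        raise ValueError("probability values must sum to a positive number")
    return numerator / total


likelihoods = {"r1": 0.25, "r2": 0.25, "l1": 0.375, "l2": 0.125}
phoneme_of = {"r1": "r", "r2": "r", "l1": "l", "l2": "l"}
print(phoneme_sum_score(likelihoods, phoneme_of, optimal_state="r2"))  # 0.5
```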

そして、特殊音声検知手段２６１０１は、算出されたフレームに対応する評定値を用いて、特殊な音声が入力されたか否かを判断する。つまり、評定値(例えば、ＤＡＰスコア)が、所定の値より低い区間が存在するか否かを判断する。 Then, the special voice detection unit 26101 determines whether or not a special voice has been input, using a rating value corresponding to the calculated frame. That is, it is determined whether or not there is a section where the rating value (for example, DAP score) is lower than a predetermined value.
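Checking whether there is a section in which the rating value stays below a predetermined value amounts to finding runs of consecutive low-scoring frames. A minimal sketch follows. The threshold and the scores are hypothetical, and the function name is an illustration rather than anything named in the description.

```python
def low_score_sections(scores, threshold):
    """Return (first_frame, last_frame) index pairs of runs below threshold."""
    sections = []
    start = None
    for i, score in enumerate(scores):
        if score < threshold:
            if start is None:
                start = i          # a low-score run begins here
        elif start is not None:
            sections.append((start, i - 1))
            start = None
    if start is not None:          # run continues to the final frame
        sections.append((start, len(scores) - 1))
    return sections


print(low_score_sections([0.9, 0.2, 0.1, 0.8, 0.3], threshold=0.5))
# [(1, 2), (4, 4)]
```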

次に、特殊音声検知手段２６１０１は、図２９に示すように、評定値(例えば、ＤＡＰスコア)が、所定の値より低い区間が、一つの音素内（ここでは音素２）であるか否かを判断する。そして、一つの音素内で評定値が低ければ、次に、特殊音声検知手段２６１０１は、直前の音素（音素１）および／または直後の音素（音素３）に対する評定値（例えば、ＤＡＰスコア)を算出し、当該評定値が所定の値より高ければ、音素の置換が発生している可能性があると判断する。次に、特殊音声検知手段２６１０１は、予め格納されている音韻ＨＭＭ（予期する音韻のＨＭＭは除く）の中で、所定の値以上の評定値が得られる音韻ＨＭＭが一つ存在すれば、音素の置換が発生していると判断する。なお、図２９において、音素２において、音素の置換が発生した区間である。なお、図２９において縦軸は評定値であり、当該評定値は、ＤＡＰ、ｔ−ｐ−ＤＡＰ等、問わない。 Next, as shown in FIG. 29, the special speech detection means 26101 determines whether the section in which the rating value (for example, the DAP score) is lower than a predetermined value lies within a single phoneme (here, phoneme 2). If the rating value is low within a single phoneme, the special speech detection means 26101 then calculates the rating value (for example, the DAP score) for the immediately preceding phoneme (phoneme 1) and/or the immediately following phoneme (phoneme 3), and if that rating value is higher than a predetermined value, determines that a phoneme replacement may have occurred. Next, if, among the phoneme HMMs stored in advance (excluding the HMM of the expected phoneme), there is a phoneme HMM that yields a rating value equal to or greater than a predetermined value, the special speech detection means 26101 determines that a phoneme replacement has occurred. In FIG. 29, phoneme 2 is the section in which the phoneme replacement occurred. The vertical axis in FIG. 29 is the rating value, which may be DAP, t-p-DAP, or any other score.
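The decision sequence in this paragraph can be sketched as below. It assumes that per-phoneme rating values are already available, that a single threshold serves all three checks, and that "one phoneme HMM exists" means "at least one". The function name, labels, and scores are hypothetical.

```python
def detect_replacement(phoneme_scores, index, alt_scores, expected, threshold):
    """Return the label of the substituted phoneme, or None.

    phoneme_scores: rating values of the expected phoneme sequence.
    index:          position of the phoneme under test (e.g. phoneme 2).
    alt_scores:     dict of phoneme label -> rating value obtained when that
                    phoneme's HMM is scored on the same section.
    expected:       label of the expected phoneme, excluded from the search.
    """
    if phoneme_scores[index] >= threshold:
        return None                      # the phoneme itself scored well
    neighbours = phoneme_scores[max(index - 1, 0):index] \
        + phoneme_scores[index + 1:index + 2]
    if not any(s > threshold for s in neighbours):
        return None                      # neighbours are also poor
    candidates = {p: s for p, s in alt_scores.items()
                  if p != expected and s >= threshold}
    if not candidates:
        return None                      # no other phoneme HMM fits
    return max(candidates, key=candidates.get)


scores = [0.9, 0.2, 0.8]                 # phoneme 1, phoneme 2, phoneme 3
alts = {"r": 0.2, "l": 0.85, "w": 0.3}
print(detect_replacement(scores, 1, alts, expected="r", threshold=0.5))  # l
```

Returning the best-scoring alternative is a design choice of this sketch; the description only requires reporting that a replacement occurred.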

次に、評定手段２６１０３は、音素の置換があった旨を示す評定結果（例えば、「音素の置換が発生しました。」）を構成する。そして、出力手段２１１０４は、構成した評定結果を出力する。なお、出力手段２１１０４は、通常の入力音声に対しては、上述したように評定値を出力することが好適である。 Next, the rating means 26103 constitutes a rating result (for example, “phoneme replacement has occurred”) indicating that there has been a phoneme replacement. Then, the output means 21104 outputs the configured evaluation result. Note that the output means 21104 preferably outputs the rating value as described above for normal input speech.

以上、本実施の形態によれば、ユーザが入力した発音を、教師データに対して、如何に似ているかを示す類似度（評定値）を算出し、出力できる。また、かかる場合、本実施の形態によれば、個人差、特に声道長の違いに影響を受けない、精度の高い評定ができる。さらに、本音声処理装置は、特殊音声、特に、音素の置換を検知できるので、極めて精度の高い評定結果が得られる。 As described above, according to the present embodiment, it is possible to calculate and output the similarity (rating value) indicating how the pronunciation input by the user is similar to the teacher data. In this case, according to the present embodiment, highly accurate evaluation can be performed without being affected by individual differences, particularly differences in vocal tract length. Furthermore, since this speech processing apparatus can detect special speech, particularly phoneme substitution, it is possible to obtain a highly accurate evaluation result.

なお、本実施の形態において、音素の置換を検知できれば良く、評定値の算出アルゴリズムは問わない。評定値の算出アルゴリズムは、上述したアルゴリズム（ＤＡＰ、ｔ−ｐ−ＤＡＰ）でも良く、または、本明細書では述べていない他のアルゴリズムでも良い。 In the present embodiment, it is only necessary to be able to detect phoneme replacement, and the rating value calculation algorithm is not limited. The algorithm for calculating the rating value may be the above-described algorithm (DAP, tp-DAP), or another algorithm not described in this specification.

また、本実施の形態において、音素の置換の検知アルゴリズムは、他のアルゴリズムでも良い。例えば、音素の置換の検知において、所定以上の長さの区間を有することを置換区間の検知で必須としても良い。その他、置換の検知アルゴリズムの詳細は種々考えられる。 In the present embodiment, the phoneme replacement detection algorithm may be another algorithm. For example, in the detection of the replacement of phonemes, it may be essential to detect the replacement section to have a section longer than a predetermined length. In addition, various details of the replacement detection algorithm can be considered.

また、上記プログラムにおいて、音声処理ステップは、前記第二音声データを、フレームに区分するフレーム区分ステップと、前記区分されたフレーム毎の音声データであるフレーム音声データを1以上得るフレーム音声データ取得ステップと、前記フレーム毎の入力音声データに基づいて、特殊な音声が入力されたことを検知する特殊音声検知ステップと、教師データと前記入力音声データと前記特殊音声検知ステップにおける検知結果に基づいて、前記受け付けた音声の評定を行う評定ステップと、前記評定ステップにおける評定結果を出力する出力ステップを具備する、ことは好適である。 In the above program, it is preferable that the audio processing step includes: a frame dividing step of dividing the second audio data into frames; a frame audio data acquiring step of obtaining one or more pieces of frame audio data, which are the audio data of each divided frame; a special voice detection step of detecting, based on the input audio data of each frame, that a special voice has been input; a rating step of rating the received voice based on the teacher data, the input audio data, and the detection result of the special voice detection step; and an output step of outputting the rating result of the rating step.

また、上記プログラムにおいて、特殊音声検知ステップは、一の音素の評定値が所定の条件を満たすことを検知し、特殊音声検知ステップで前記所定の条件を満たすことを検知した場合に、少なくとも音素の置換があった旨を示す評定結果を構成する、ことは好適である。
（実施の形態６） In the above program, it is preferable that the special speech detection step detects that the rating value of one phoneme satisfies a predetermined condition and that, when the special speech detection step detects that the predetermined condition is satisfied, a rating result indicating at least that a phoneme replacement has occurred is constructed.
(Embodiment 6)

本実施の形態において、入力音声において、特殊音声を検知し、比較対象の音声と入力音声の類似度を精度高く評定できる音声処理装置について説明する。特に、本音声処理装置は、音韻の欠落を検知できる音声処理装置である。 In the present embodiment, a speech processing apparatus that detects special speech in input speech and can accurately evaluate the similarity between the speech to be compared and the input speech will be described. In particular, the speech processing apparatus is a speech processing apparatus that can detect missing phonemes.

図３０は、本実施の形態における音声処理装置のブロック図である。本音声処理装置は、入力受付部１０１、教師データ格納部１０２、音声受付部１０３、教師データフォルマント周波数格納部１０４、第一サンプリング周波数格納部１０５、サンプリング部１０６、評価対象者フォルマント周波数取得部１０７、評価対象者フォルマント周波数格納部１０８、声道長正規化処理部１０９、音声処理部３０１０、発声催促部１１０９を具備する。 FIG. 30 is a block diagram of the speech processing apparatus according to this embodiment. The speech processing apparatus includes an input reception unit 101, a teacher data storage unit 102, a speech reception unit 103, a teacher data formant frequency storage unit 104, a first sampling frequency storage unit 105, a sampling unit 106, an evaluation target person formant frequency acquisition unit 107, an evaluation target person formant frequency storage unit 108, a vocal tract length normalization processing unit 109, a voice processing unit 3010, and an utterance prompting unit 1109.

音声処理部３０１０は、フレーム区分手段１１０１、フレーム音声データ取得手段１１０２、特殊音声検知手段３０１０１、評定手段３０１０３、出力手段２１１０４を具備する。なお、評定手段３０１０３は、最適状態決定手段１１０３１、最適状態確率値取得手段１１０３２を具備する。 The audio processing unit 3010 includes a frame classification unit 1101, a frame audio data acquisition unit 1102, a special audio detection unit 30101, a rating unit 30103, and an output unit 21104. The rating unit 30103 includes an optimum state determination unit 11031 and an optimum state probability value acquisition unit 11032.

特殊音声検知手段３０１０１は、一の音素の評定値が所定の値より低く、かつ当該音素の直前の音素または当該音素の直後の音素の評定値が所定の値より高いことを検知する。また、特殊音声検知手段３０１０１は、一の音素の評定値が所定の値より低く、かつ当該音素の直前の音素または当該音素の直後の音素の評定値が所定の値より高く、かつ当該音素の区間長が所定の長さよりも短いことを検知しても良い。また、特殊音声検知手段３０１０１は、直前の音素に対応する確率値、または直後の音素に対応する確率値が、当該一の音素の確率値より高いことを検知しても良い。かかる場合に、特殊音声検知手段３０１０１は、音韻の欠落を検知することは好適である。さらに、音素の区間長が所定の長さよりも短いことを欠落の条件に含めることにより、音韻の欠落の検知の精度は向上する。特殊音声検知手段３０１０１は、通常、ＭＰＵやメモリ等から実現され得る。特殊音声検知手段３０１０１の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The special voice detection unit 30101 detects that the rating value of one phoneme is lower than a predetermined value and that the rating value of the phoneme immediately before or immediately after that phoneme is higher than a predetermined value. Alternatively, the special voice detection unit 30101 may detect that the rating value of one phoneme is lower than a predetermined value, that the rating value of the phoneme immediately before or immediately after it is higher than a predetermined value, and that the section length of that phoneme is shorter than a predetermined length. The special voice detection unit 30101 may also detect that the probability value corresponding to the immediately preceding phoneme, or the probability value corresponding to the immediately following phoneme, is higher than the probability value of the one phoneme. In such cases, it is preferable that the special voice detection unit 30101 detects a missing phoneme. Including the condition that the section length of the phoneme is shorter than a predetermined length improves the accuracy of missing-phoneme detection. The special voice detection unit 30101 can usually be realized by an MPU, a memory, or the like. Its processing procedure is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (a dedicated circuit).

評定手段３０１０３は、特殊音声検知手段３０１０１が所定の条件を満たすことを検知した場合に、少なくとも音素の欠落があった旨を示す評定結果を構成する。評定手段３０１０３は、通常、ＭＰＵやメモリ等から実現され得る。評定手段３０１０３の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The rating unit 30103 constitutes a rating result indicating that at least a phoneme is missing when the special voice detecting unit 30101 detects that a predetermined condition is satisfied. The rating means 30103 can usually be realized by an MPU, a memory, or the like. The processing procedure of the rating means 30103 is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).

次に、音声処理装置の動作について、図３１、図３２のフローチャートを用いて説明する。なお、図３１のフローチャートは、図１２のフローチャートと比較して、ステップS３１０１の評定処理のみが異なるので、図３１のフローチャートの説明は省略する。ステップS３１０１の評定処理の詳細について、図３２のフローチャートを用いて説明する。図３２のフローチャートにおいて、図２、図１９、図２３、図２８のフローチャートの処理と同様の処理については、その説明を省略する。 Next, the operation of the speech processing apparatus will be described using the flowcharts of FIGS. 31 and 32. The flowchart of FIG. 31 differs from the flowchart of FIG. 12 only in the rating process of step S3101, so a description of the flowchart of FIG. 31 is omitted. Details of the rating process in step S3101 will be described with reference to the flowchart of FIG. 32. In the flowchart of FIG. 32, descriptions of processes that are the same as those in the flowcharts of FIGS. 2, 19, 23, and 28 are omitted.

（ステップＳ３２０１）特殊音声検知手段３０１０１は、バッファに蓄積されているデータに対して、直前の音素に対応する教師データの確率値または、直後の音素に対応する教師データの確率値が、予定されている音素に対応する教師データの確率値より高いか否かを判断する。高ければステップＳ３２０２に行き、高くなければステップＳ２８１０に行く。なお、ステップＳ３２０２に行くための条件として、バッファに蓄積されているデータに対応するフレーム音声データ群の区間長が所定の長さ以下であることを付加しても良い。
（ステップＳ３２０２）評定手段３０１０３は、音素の欠落があった旨を示す評定結果を構成する。ステップＳ２８０８に行く。
なお、図３２のフローチャートにおいて、評定対象の音素（欠落したであろう音素）の区間長が、所定の長さ（例えば、３フレーム）よりも短いことを条件としても良いし、かかる条件は無くても良い。
以下、本実施の形態における音声処理装置の具体的な動作について説明する。本実施の形態において、音素の欠落の検知を行う処理が実施の形態５等とは異なる。そこで、その異なる処理を中心に説明する。 (Step S3201) For the data stored in the buffer, the special voice detection means 30101 determines whether the probability value of the teacher data corresponding to the immediately preceding phoneme, or the probability value of the teacher data corresponding to the immediately following phoneme, is higher than the probability value of the teacher data corresponding to the expected phoneme. If it is higher, the process proceeds to step S3202; otherwise, the process proceeds to step S2810. As an additional condition for proceeding to step S3202, it may be required that the section length of the frame audio data group corresponding to the data stored in the buffer be equal to or less than a predetermined length.
(Step S3202) The rating means 30103 constructs a rating result indicating that a phoneme is missing. The process proceeds to step S2808.
In the flowchart of FIG. 32, it may be made a condition that the section length of the phoneme being evaluated (the phoneme presumed to be missing) is shorter than a predetermined length (for example, 3 frames), or this condition may be omitted.
Hereinafter, a specific operation of the speech processing apparatus according to the present embodiment will be described. In the present embodiment, the processing for detecting missing phonemes is different from that in the fifth embodiment. Therefore, the different processing will be mainly described.
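The branch taken in step S3201 can be sketched as a single predicate. It assumes the three probability values have already been computed for the buffered data; the function name, argument names, and values are hypothetical, and the optional length condition is applied only when a limit is supplied.

```python
def neighbour_more_likely(p_expected, p_prev=None, p_next=None,
                          n_frames=None, max_frames=None):
    """True when the buffered data should be routed to step S3202.

    p_expected: probability value for the expected phoneme's teacher data.
    p_prev, p_next: probability values for the immediately preceding and
                    following phonemes (None when there is no such phoneme).
    n_frames, max_frames: optional section-length condition.
    """
    neighbours = [p for p in (p_prev, p_next) if p is not None]
    higher = any(p > p_expected for p in neighbours)
    if max_frames is not None:
        if n_frames is None:
            raise ValueError("n_frames is required when max_frames is set")
        return higher and n_frames <= max_frames
    return higher


print(neighbour_more_likely(0.1, p_prev=0.6))                            # True
print(neighbour_more_likely(0.1, p_next=0.6, n_frames=5, max_frames=3))  # False
```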

次に、声道長正規化処理部１０９は、「ｒｉｇｈｔ」の第一音声データを第二サンプリング周波数「３２．１ＫＨｚ」でリサンプリング処理する。そして、声道長正規化処理部１０９は、第二音声データを得る。次に、音声処理部３０１０は、第二音声データを、以下のように処理する。
まず、フレーム区分手段１１０１は、「ｒｉｇｈｔ」の第二音声データを、短時間フレームに区分する。
そして、フレーム音声データ取得手段１１０２は、フレーム区分手段１１０１が区分した音声データを、スペクトル分析し、特徴ベクトル系列「Ｏ＝ｏ_１，ｏ_２，・・・，ｏ_Ｔ」を算出する。
次に、最適状態決定手段１１０３１は、取得した特徴ベクトル系列を構成する各特徴ベクトルｏ_ｔに基づいて、所定のフレームの最適状態（特徴ベクトルｏ_ｔに対する最適状態）を決定する。
次に、最適状態確率値取得手段１１０３２は、上述した数式１、２により、最適状態における確率値を算出する。

Next, the vocal tract length normalization processing unit 109 resamples the first voice data of “right” at the second sampling frequency “32.1 KHz”. Then, the vocal tract length normalization processing unit 109 obtains second voice data. Next, the audio processing unit 3010 processes the second audio data as follows.
First, the frame segmentation unit 1101 segments the second audio data “right” into short frames.
Then, the frame audio data acquisition unit 1102 performs spectrum analysis on the audio data segmented by the frame segmentation unit 1101 and calculates the feature vector sequence “O = o_1, o_2, ..., o_T”.
Next, the optimal state determination unit 11031 determines the optimal state of a given frame (the optimal state for the feature vector o_t) based on each feature vector o_t constituting the obtained feature vector sequence.
Next, the optimum state probability value acquisition unit 11032 calculates the probability value in the optimum state according to the above-described formulas 1 and 2.

次に、評定手段３０１０３は、例えば、最適状態決定手段１１０３１が決定した最適状態を有する音韻全体の状態における１以上の確率値を取得し、当該１以上の確率値の総和をパラメータとして音声の評定値を算出する。つまり、評定手段３０１０３は、例えば、ＤＡＰスコアをフレーム毎に算出する。ここで、算出するスコアは、上述したｔ−ｐ−ＤＡＰスコア等でも良い。 Next, the rating unit 30103 acquires, for example, one or more probability values over all states of the phoneme that contains the optimal state determined by the optimal state determination unit 11031, and calculates the rating value of the speech using the sum of those probability values as a parameter. That is, the rating unit 30103 calculates, for example, a DAP score for each frame. The score calculated here may instead be the t-p-DAP score described above, or the like.

そして、特殊音声検知手段３０１０１は、算出されたフレームに対応する評定値を用いて、特殊な音声が入力されたか否かを判断する。つまり、評定値(例えば、ＤＡＰスコア)が、所定の値より低い区間が存在するか否かを判断する。 Then, the special voice detection unit 30101 determines whether or not a special voice has been input using the rating value corresponding to the calculated frame. That is, it is determined whether or not there is a section where the rating value (for example, DAP score) is lower than a predetermined value.

次に、特殊音声検知手段３０１０１は、図３３に示すように、評定値(例えば、ＤＡＰスコア)が、所定の値より低い区間が、一つの音素内（ここでは音素２）であるか否かを判断する。そして、一つの音素内で評定値が低ければ、特殊音声検知手段３０１０１は、直前の音素（音素１）または直後の音素（音素３）に対する評定値（例えば、ＤＡＰスコア)を算出し、当該評定値が所定の値より高ければ、音素の欠落が発生している可能性があると判断する。そして、当該区間長が、例えば、３フレーム以下の長さであれば、かかる音素は欠落したと判断する。なお、図３３において、音素２の欠落が発生したことを示す。なお、図３３において縦軸は評定値であり、当該評定値は、ＤＡＰ、ｔ−ｐ−ＤＡＰ等、問わない。また、上記区間長の所定値は、「３フレーム以下」ではなく、「５フレーム以下」でも、「６フレーム以下」でも良い。 Next, as shown in FIG. 33, the special voice detection unit 30101 determines whether or not a section where the rating value (for example, DAP score) is lower than a predetermined value is within one phoneme (here, phoneme 2). Judging. If the rating value is low in one phoneme, the special speech detection unit 30101 calculates a rating value (for example, DAP score) for the immediately preceding phoneme (phoneme 1) or the immediately following phoneme (phoneme 3). If the value is higher than a predetermined value, it is determined that there is a possibility that a phoneme is missing. If the section length is, for example, 3 frames or less, it is determined that such phonemes are missing. FIG. 33 shows that the phoneme 2 is missing. In FIG. 33, the vertical axis represents the rating value, and the rating value may be DAP, tp-DAP, or the like. Further, the predetermined value of the section length is not “3 frames or less” but may be “5 frames or less” or “6 frames or less”.
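A minimal sketch of the missing-phoneme decision in this paragraph follows, assuming per-phoneme rating values and per-phoneme frame counts are already available. The single shared threshold, the function name, and the sample numbers are assumptions; the default of 3 frames follows the example in the text, which also allows 5 or 6.

```python
def is_phoneme_missing(scores, frame_counts, index, threshold, max_frames=3):
    """Decide whether the phoneme at `index` is missing.

    scores:       rating values of the expected phonemes, in order.
    frame_counts: number of frames aligned to each phoneme.
    """
    if scores[index] >= threshold:
        return False                     # the phoneme itself scored well
    neighbours = scores[max(index - 1, 0):index] + scores[index + 1:index + 2]
    if not any(s > threshold for s in neighbours):
        return False                     # no well-scored neighbour
    return frame_counts[index] <= max_frames


# Phoneme 2 scores low, its neighbours score high, and it spans 2 frames.
print(is_phoneme_missing([0.9, 0.1, 0.8], [12, 2, 10], index=1, threshold=0.5))
# True
```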

次に、評定手段３０１０３は、音素の欠落があった旨を示す評定結果（例えば、「音素の欠落が発生しました。」）を構成する。そして、出力手段２１１０４は、構成した評定結果を出力する。なお、出力手段２１１０４は、通常の入力音声に対しては、上述したように評定値を出力することが好適である。 Next, the rating means 30103 configures a rating result (for example, “phoneme missing has occurred”) indicating that a phoneme is missing. Then, the output means 21104 outputs the configured evaluation result. Note that the output means 21104 preferably outputs the rating value as described above for normal input speech.

以上、本実施の形態によれば、ユーザが入力した発音を、教師データに対して、如何に似ているかを示す類似度（評定値）を算出し、出力できる。また、かかる場合、本実施の形態によれば、個人差、特に声道長の違いに影響を受けない、精度の高い評定ができる。さらに、本音声処理装置は、特殊音声、特に、音素の欠落を検知できるので、極めて精度の高い評定結果が得られる。 As described above, according to the present embodiment, it is possible to calculate and output the similarity (rating value) indicating how the pronunciation input by the user is similar to the teacher data. In this case, according to the present embodiment, highly accurate evaluation can be performed without being affected by individual differences, particularly differences in vocal tract length. Furthermore, since this speech processing apparatus can detect special speech, particularly missing phonemes, an extremely accurate rating result can be obtained.

なお、本実施の形態において、音素の欠落を検知できれば良く、評定値の算出アルゴリズムは問わない。評定値の算出アルゴリズムは、上述したアルゴリズム（ＤＡＰ、ｔ−ｐ−ＤＡＰ）でも良く、または、本明細書では述べていない他のアルゴリズムでも良い。 In the present embodiment, it is only necessary to detect missing phonemes, and the algorithm for calculating the rating value is not limited. The algorithm for calculating the rating value may be the above-described algorithm (DAP, tp-DAP), or another algorithm not described in this specification.

また、本実施の形態において、音素の欠落の検知アルゴリズムは、他のアルゴリズムでも良い。例えば、音素の欠落の検知において、所定長さ未満の区間であることを欠落区間の検知で必須としても良いし、区間長を考慮しなくても良い。 In this embodiment, another algorithm may be used as the phoneme loss detection algorithm. For example, in the detection of missing phonemes, a section having a length less than a predetermined length may be essential in detecting the missing section, or the section length may not be considered.

また、上記プログラムにおいて、特殊音声検知ステップは、一の音素の評定値が所定の条件を満たすことを検知し、特殊音声検知ステップで前記所定の条件を満たすことを検知した場合に、少なくとも音素の欠落があった旨を示す評定結果を構成する、ことは好適である。
（実施の形態７） In the above program, it is preferable that the special speech detection step detects that the rating value of one phoneme satisfies a predetermined condition and that, when the special speech detection step detects that the predetermined condition is satisfied, a rating result indicating at least that a phoneme is missing is constructed.
(Embodiment 7)

本実施の形態における音声処理装置は、サンプリング周波数を変更し、リサンプリングを行わずに評定した場合の評定値と、リサンプリングを行って評定した場合の評定値とを取得し、２つの評定値に基づいて、最終的な評定値を算出する音声処理装置である。例えば、本音声処理装置は、２つの評定値の平均値を最終的な評定値としても良いし、２つの評定値の最大値を最終的な評定値としても良い。また、本音声処理装置は、例えば、カラオケ評定装置である。 The speech processing apparatus according to the present embodiment obtains a rating value from an evaluation performed without resampling and a rating value from an evaluation performed after resampling at a changed sampling frequency, and calculates a final rating value based on the two rating values. For example, the apparatus may use the average of the two rating values as the final rating value, or may use the larger of the two as the final rating value. The speech processing apparatus is, for example, a karaoke rating apparatus.

図３４は、本実施の形態における音声処理装置のブロック図である。本音声処理装置は、入力受付部１０１、教師データ格納部１０２、音声受付部１０３、教師データフォルマント周波数格納部１０４、第一サンプリング周波数格納部１０５、サンプリング部１０６、評価対象者フォルマント周波数取得部１０７、評価対象者フォルマント周波数格納部１０８、声道長正規化処理部１０９、音声処理部３４１０、発声催促部１１０９を具備する。
音声処理部３４１０は、フレーム区分手段３４１０１、フレーム音声データ取得手段３４１０２、評定手段３４１０３、出力手段１１０４を具備する。
評定手段３４１０３は、第一評定手段３４１０３１、第二評定手段３４１０３２、評定結果取得手段３４１０３３を具備する。
フレーム区分手段３４１０１は、音声をフレームに区分し、かつ、前記第二音声データをフレームに区分する。 FIG. 34 is a block diagram of the speech processing apparatus according to this embodiment. The speech processing apparatus includes an input reception unit 101, a teacher data storage unit 102, a speech reception unit 103, a teacher data formant frequency storage unit 104, a first sampling frequency storage unit 105, a sampling unit 106, an evaluation target person formant frequency acquisition unit 107, an evaluation target person formant frequency storage unit 108, a vocal tract length normalization processing unit 109, a voice processing unit 3410, and an utterance prompting unit 1109.
The audio processing unit 3410 includes a frame classification unit 34101, a frame audio data acquisition unit 34102, a rating unit 34103, and an output unit 1104.
The rating unit 34103 includes a first rating unit 341031, a second rating unit 341032, and a rating result acquisition unit 341033.
The frame classification unit 34101 divides the voice into frames and also divides the second voice data into frames.

フレーム音声データ取得手段３４１０２は、音声が区分されたフレーム毎の音声データである第一フレーム音声データを1以上得て、かつ前記第二音声データが区分されたフレーム毎の音声データである第二フレーム音声データを1以上得る。 The frame audio data acquisition means 34102 obtains one or more first frame audio data, which is audio data for each frame into which audio is divided, and second audio data for each frame into which the second audio data is divided. Get one or more frame audio data.

評定手段３４１０３は、教師データと1以上のフレーム音声データに基づいて、音声受付部１０３が受け付けた音声の評定を行う。評定手段３４１０３は、以下の第一評定手段３４１０３１の評定結果と、第二評定手段３４１０３２の評定結果に基づいて、最終的な評定結果を得る。
第一評定手段３４１０３１は、教師データと1以上の第一フレーム音声データに基づいて、音声受付部が受け付けた音声の評定を行う。
第二評定手段３４１０３２は、教師データと1以上の第二フレーム音声データに基づいて、音声受付部が受け付けた音声の評定を行う。 The rating unit 34103 evaluates the voice received by the voice receiving unit 103 based on the teacher data and one or more frame voice data. The rating unit 34103 obtains a final rating result based on the following rating result of the first rating unit 341031 and the rating result of the second rating unit 341032.
The first rating means 341031 evaluates the voice received by the voice receiving unit based on the teacher data and the one or more first frame voice data.
The second rating means 341032 evaluates the voice received by the voice receiving unit based on the teacher data and one or more second frame voice data.

評定結果取得手段３４１０３３は、第一評定手段３４１０３１における評定結果（以下、適宜「第一評定結果」という。）と第二評定手段３４１０３２における評定結果（以下、適宜「第二評定結果」という。）に基づいて、最終的な評定結果を得る。評定結果取得手段３４１０３３は、例えば、第一評定結果と第二評定結果の平均値を、最終的な評定結果としても良いし、第一評定結果と第二評定結果の大きい方の値を最終的な評定結果としても良いし、第一評定結果と第二評定結果の小さい方の値を最終的な評定結果としても良い。 The rating result acquisition unit 341033 obtains a final rating result based on the rating result of the first rating unit 341031 (hereinafter referred to as the “first rating result” as appropriate) and the rating result of the second rating unit 341032 (hereinafter referred to as the “second rating result” as appropriate). For example, the rating result acquisition unit 341033 may use the average of the first and second rating results as the final rating result, may use the larger of the two, or may use the smaller of the two.
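The three ways of combining the first and second rating results named here (average, larger value, smaller value) can be sketched as one small function. The function name, the `mode` argument, and the sample scores are hypothetical.

```python
def combine_ratings(first, second, mode="max"):
    """Combine the non-resampled (first) and resampled (second) ratings."""
    if mode == "mean":
        return (first + second) / 2
    if mode == "max":
        return max(first, second)
    if mode == "min":
        return min(first, second)
    raise ValueError(f"unknown mode: {mode!r}")


print(combine_ratings(60, 80, "mean"))  # 70.0
print(combine_ratings(60, 80, "max"))   # 80
print(combine_ratings(60, 80, "min"))   # 60
```

Taking the larger value favors the speaker, which suits the karaoke rating use mentioned for this embodiment; taking the smaller value gives a stricter result.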

フレーム区分手段３４１０１、フレーム音声データ取得手段３４１０２、第一評定手段３４１０３１、第二評定手段３４１０３２、評定結果取得手段３４１０３３は、通常、ＭＰＵやメモリ等から実現され得る。フレーム区分手段３４１０１等の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。
次に、音声処理装置の動作について図３５のフローチャートを用いて説明する。図３５のフローチャートにおいて、図２、図１２のフローチャートと異なるステップについてのみ説明する。 The frame classification unit 34101, the frame audio data acquisition unit 34102, the first rating unit 341031, the second rating unit 341032, and the rating result acquisition unit 341033 can be usually realized by an MPU, a memory, or the like. The processing procedure of the frame sorting unit 34101 and the like is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).
Next, the operation of the speech processing apparatus will be described using the flowchart of FIG. 35. In the flowchart of FIG. 35, only the steps that differ from the flowcharts of FIGS. 2 and 12 will be described.

（ステップＳ３５０１）第一評定手段３４１０３１は、第一評定処理を行う。第一評定処理とは、教師データと1以上の第一フレーム音声データに基づいて、音声受付部１０３が受け付けた音声の評定を行う処理である。第一評定処理は、リサンプリングしない第一音声データを評定する処理である。第一評定処理における評定のアルゴリズムは、上記の実施の形態１から実施の形態６で述べたいずれのアルゴリズム（ＤＡＰ、ｔ−ｐ−ＤＡＰ、無音区間考慮、挿入を考慮、置換を考慮、欠落を考慮など）または、それらを組み合わせたアルゴリズムでも良い。 (Step S3501) The first rating means 341031 performs a first rating process. The first rating process rates the voice received by the voice receiving unit 103 based on the teacher data and one or more pieces of first frame voice data. It rates the first audio data, which has not been resampled. The rating algorithm in the first rating process may be any of the algorithms described in Embodiments 1 to 6 (DAP, t-p-DAP, consideration of silent sections, consideration of insertion, consideration of replacement, consideration of omission, and so on) or a combination of them.

（ステップＳ３５０２）第二評定手段３４１０３２は、第二評定処理を行う。第二評定処理とは、教師データと1以上の第二フレーム音声データに基づいて、音声受付部１０３が受け付けた音声の評定を行う処理である。第二評定処理は、リサンプリングした第二音声データを評定する処理である。第二評定処理における評定のアルゴリズムは、上記の実施の形態１から実施の形態６で述べたいずれのアルゴリズム（ＤＡＰ、ｔ−ｐ−ＤＡＰ、無音区間考慮、挿入を考慮、置換を考慮、欠落を考慮など）または、それらを組み合わせたアルゴリズムでも良い。なお、第一評定処理と第二評定処理のアルゴリズムは、同一であることが好適である。 (Step S3502) The second rating means 341032 performs a second rating process. The second rating process rates the voice received by the voice receiving unit 103 based on the teacher data and one or more pieces of second frame voice data. It rates the second audio data, which has been resampled. The rating algorithm in the second rating process may be any of the algorithms described in Embodiments 1 to 6 (DAP, t-p-DAP, consideration of silent sections, consideration of insertion, consideration of replacement, consideration of omission, and so on) or a combination of them. It is preferable that the first rating process and the second rating process use the same algorithm.

（ステップＳ３５０３）評定結果取得手段３４１０３３は、第一評定手段３４１０３１における評定結果（第一評定結果）と第二評定手段３４１０３２における評定結果（第二評定結果）に基づいて、最終的な評定結果を得る。評定結果取得手段３４１０３３は、例えば、第一評定結果と第二評定結果の評定値のうち、高得点の方の評定値を最終的な評定結果とする。 (Step S3503) The rating result acquisition means 341033 obtains the final rating result based on the rating result (first rating result) in the first rating means 341031 and the rating result (second rating result) in the second rating means 341032. obtain. For example, the rating result acquisition unit 341033 uses the rating value of the higher score among the rating values of the first rating result and the second rating result as the final rating result.

以上、本実施の形態によれば、ユーザが入力した発音を、教師データに対して、如何に似ているかを示す類似度（評定値）を算出し、出力できる。また、かかる場合、本実施の形態によれば、個人差を考慮した精度の高い評定ができる。さらに、本音声処理装置は、個人差を考慮した評定と、個人差を考慮しない評定の両方を利用した評定が行える。つまり、本実施の形態によれば、例えば、第一評定結果と第二評定結果の評定値のうち、高得点の方の評定値を最終的な評定結果とすることができ、カラオケ評定装置等として有効である。
（実施の形態８）
本実施の形態における音声処理装置の音声処理は、音声認識である。
図３６は、本実施の形態における音声処理装置のブロック図である。 As described above, according to the present embodiment, a similarity (rating value) indicating how closely the pronunciation input by the user resembles the teacher data can be calculated and output. In this case, according to the present embodiment, highly accurate rating that takes individual differences into account is possible. Furthermore, this speech processing apparatus can perform a rating that uses both a rating that considers individual differences and a rating that does not. That is, according to the present embodiment, for example, the higher of the rating values of the first rating result and the second rating result can be used as the final rating result, which is effective for a karaoke rating apparatus and the like.
(Embodiment 8)
The voice processing of the voice processing apparatus in the present embodiment is voice recognition.
FIG. 36 is a block diagram of the speech processing apparatus according to this embodiment.

本音声処理装置は、入力受付部１０１、教師データ格納部１０２、音声受付部１０３、教師データフォルマント周波数格納部１０４、第一サンプリング周波数格納部１０５、サンプリング部１０６、評価対象者フォルマント周波数取得部１０７、評価対象者フォルマント周波数格納部１０８、声道長正規化処理部１０９、音声処理部３６１０、発声催促部１１０９を具備する。
音声処理部３６１０は、音声認識手段３６１０１、出力手段３６１０２を具備する。 The speech processing apparatus includes an input reception unit 101, a teacher data storage unit 102, a speech reception unit 103, a teacher data formant frequency storage unit 104, a first sampling frequency storage unit 105, a sampling unit 106, an evaluation target person formant frequency acquisition unit 107, an evaluation target person formant frequency storage unit 108, a vocal tract length normalization processing unit 109, a voice processing unit 3610, and an utterance prompting unit 1109.
The voice processing unit 3610 includes voice recognition means 36101 and output means 36102.

The speech recognition means 36101 of the speech processing unit 3610 performs speech recognition processing based on the second voice data. The speech recognition algorithm is not limited; any known algorithm may be used. In the present embodiment, performing speech recognition on the resampled second voice data enables highly accurate speech recognition. The speech processing unit 3610 can usually be realized by an MPU, a memory, and the like. The processing procedure of the speech processing unit 3610 is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may also be realized by hardware (a dedicated circuit).

The output means 36102 outputs the speech recognition result. Here, output is a concept that includes display on a display, printing on a printer, sound output, transmission to an external device, storage on a recording medium, and the like. The output means 36102 may or may not be considered to include an output device such as a display or a speaker. The output means 36102 can be realized by driver software for an output device, or by driver software for an output device together with the output device.
Next, the operation of the speech processing apparatus will be described using the flowchart of FIG. 37. In the flowchart of FIG. 37, description of processing that is the same as in the flowcharts of FIG. 2 and FIG. 12 is omitted.

(Step S3701) The speech recognition means 36101 performs speech recognition processing based on the second voice data obtained by the resampling in step S1208. The speech recognition means 36101 obtains a recognition result by matching the input against the teacher data and recognizing the input as a sound close to the teacher data.
(Step S3702) The output means 36102 outputs the speech recognition result of step S3701. The process returns to step S1206.
As described above, according to the present embodiment, speech can be recognized with high accuracy.
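The matching in step S3701 can be pictured with a very small sketch. The source leaves the recognition algorithm open, so the following is only an illustrative simplification: it assumes each teacher datum is reduced to a fixed-length feature vector and picks the nearest one by Euclidean distance. The function name `recognize`, the dictionary layout, and the numeric values are hypothetical and do not come from the source, which works with HMM-based teacher data.

```python
import math


def recognize(features, teacher_templates):
    """Return the label of the teacher template closest to `features`.

    Hypothetical simplification: nearest-template matching by Euclidean
    distance stands in for whatever known recognizer is actually used.
    """
    if not teacher_templates:
        raise ValueError("no teacher data available")
    best_label, best_dist = None, math.inf
    for label, template in teacher_templates.items():
        if len(template) != len(features):
            raise ValueError("feature length mismatch")
        dist = math.dist(features, template)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Illustrative values only (not measurements from the source).
teacher_templates = {"a": [700.0, 1200.0], "i": [300.0, 2300.0]}
result = recognize([320.0, 2250.0], teacher_templates)
print(result)
```

In this sketch the features would be computed from the second voice data, that is, after vocal tract length normalization, so that the comparison with the teacher data is less affected by speaker differences.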

The software that implements the speech processing apparatus according to the present embodiment is the following program. That is, this program causes a computer to execute: a sampling step of sampling received voice at a first sampling frequency to obtain first voice data; a vocal tract length normalization processing step of performing sampling processing on the voice received in the voice receiving step at a second sampling frequency "first sampling frequency / (teacher data formant frequency / evaluation target person formant frequency)" to obtain second voice data; and a speech processing step of performing speech recognition processing based on the second voice data.
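The second sampling frequency in the program description above is a simple ratio, and a short sketch may make the arithmetic concrete. The function names, the example frequencies, and the use of linear interpolation for resampling are assumptions made for illustration; the source does not state which resampling method is used.

```python
def second_sampling_frequency(fs1, teacher_formant_hz, subject_formant_hz):
    """fs2 = fs1 / (teacher data formant frequency / evaluation target formant frequency)."""
    if fs1 <= 0 or teacher_formant_hz <= 0 or subject_formant_hz <= 0:
        raise ValueError("frequencies must be positive")
    return fs1 / (teacher_formant_hz / subject_formant_hz)


def resample_linear(first_voice_data, fs1, fs2):
    """Resample data sampled at fs1 onto a grid at fs2.

    Linear interpolation is an illustrative stand-in (an assumption),
    not the method specified by the source.
    """
    n_in = len(first_voice_data)
    if n_in == 0:
        return []
    n_out = int(round(n_in * fs2 / fs1))
    out = []
    for k in range(n_out):
        pos = k * fs1 / fs2  # position of output sample k in input-sample units
        i = int(pos)
        if i >= n_in - 1:
            out.append(float(first_voice_data[-1]))  # clamp at the end
        else:
            frac = pos - i
            out.append((1.0 - frac) * first_voice_data[i]
                       + frac * first_voice_data[i + 1])
    return out


# Hypothetical example values.
fs1 = 16000.0
fs2 = second_sampling_frequency(fs1, teacher_formant_hz=2000.0,
                                subject_formant_hz=1600.0)
first_voice_data = [float(n % 8) for n in range(160)]
second_voice_data = resample_linear(first_voice_data, fs1, fs2)
print(fs2, len(second_voice_data))
```

With these example values the ratio of the formant frequencies is 1.25, so the second sampling frequency is lower than the first and the second voice data contains proportionally fewer samples.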

The special speech detected in the above embodiments was silence, insertion, substitution, and omission. Needless to say, the speech processing apparatus may detect all of these kinds of special speech. In addition, the speech processing apparatus detected special speech mainly by using the rating value calculation algorithms described in Embodiment 1 and Embodiment 2, but other rating value calculation algorithms may be used.

Special speech is not limited to silence, insertion, substitution, and omission. For example, special speech may be garbage (miscellaneous phonemes such as noise). When garbage is mixed into the received voice, it is often desirable to exclude that section from the similarity calculation. For example, in pronunciation rating, a learner's utterance usually contains many breaths, unvoiced sections, and the like, and it is preferable to remove the corresponding utterance sections from the rating target. Silence is generally regarded as a kind of garbage.

Therefore, a miscellaneous phoneme that does not belong to any phoneme (a garbage phoneme) is defined, and a garbage HMM is stored in advance. In a section where the score drops, if the rating value (γ_t(j)) for the garbage HMM is larger than a predetermined value, it is preferable to determine that the section is a garbage section. In particular, in pronunciation rating, when a garbage section spans two words, it is extremely preferable to regard it as a breath or the like and exclude it from the rating value calculation.
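The garbage-section rule above can be sketched as a per-frame decision followed by a rating that skips the flagged frames. This is a minimal sketch under stated assumptions: the per-frame scores and the garbage HMM values are taken as given lists, the two thresholds are hypothetical parameters, and the final rating is assumed to be a plain average of the remaining frames, which the source does not specify here.

```python
def find_garbage_frames(frame_scores, garbage_values,
                        score_threshold, garbage_threshold):
    """Flag frames that lie in a score drop AND match the garbage HMM.

    frame_scores:   per-frame rating values against the expected phonemes
    garbage_values: per-frame rating values against the garbage HMM
    Both thresholds are hypothetical tuning parameters.
    """
    if len(frame_scores) != len(garbage_values):
        raise ValueError("inputs must have the same number of frames")
    return [s < score_threshold and g > garbage_threshold
            for s, g in zip(frame_scores, garbage_values)]


def rating_without_garbage(frame_scores, is_garbage):
    """Average the scores of frames not flagged as garbage (assumed rating)."""
    kept = [s for s, flagged in zip(frame_scores, is_garbage) if not flagged]
    if not kept:
        return None  # nothing left to rate
    return sum(kept) / len(kept)


# Illustrative values only.
frame_scores = [0.75, 0.5, 0.125, 0.25, 1.0]
garbage_values = [0.0, 0.125, 0.875, 0.75, 0.0]
flags = find_garbage_frames(frame_scores, garbage_values,
                            score_threshold=0.4, garbage_threshold=0.5)
rating = rating_without_garbage(frame_scores, flags)
print(flags, rating)
```

Requiring both conditions mirrors the text: a frame is excluded only when the score has dropped and the garbage HMM explains it well, so a poorly pronounced phoneme that does not resemble garbage still counts against the rating.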

FIG. 38 shows the external appearance of a computer that executes the programs described in this specification to realize the speech processing apparatuses of the various embodiments described above. The above-described embodiments can be realized by computer hardware and a computer program executed on it. FIG. 38 is an overview of this computer system 340, and FIG. 39 is a block diagram of the computer system 340.

In FIG. 38, the computer system 340 includes a computer 341 that includes an FD (Flexible Disk) drive and a CD-ROM (Compact Disk Read Only Memory) drive, a keyboard 342, a mouse 343, a monitor 344, and a microphone 345.

In FIG. 39, the computer 341 includes, in addition to the FD drive 3411 and the CD-ROM drive 3412, a CPU (Central Processing Unit) 3413; a bus 3414 connected to the CPU 3413, the CD-ROM drive 3412, and the FD drive 3411; a ROM (Read-Only Memory) 3415 for storing programs such as a boot-up program; a RAM (Random Access Memory) 3416 that is connected to the CPU 3413, temporarily stores instructions of application programs, and provides temporary storage space; and a hard disk 3417 for storing application programs, system programs, and data. Although not shown here, the computer 341 may further include a network card that provides a connection to a LAN.

The program that causes the computer system 340 to execute the functions of the speech processing apparatus of the above-described embodiments may be stored on a CD-ROM 3501 or an FD 3502, inserted into the CD-ROM drive 3412 or the FD drive 3411, and then transferred to the hard disk 3417. Alternatively, the program may be transmitted to the computer 341 via a network (not shown) and stored on the hard disk 3417. The program is loaded into the RAM 3416 when executed. The program may also be loaded directly from the CD-ROM 3501, the FD 3502, or the network.

The program need not necessarily include an operating system (OS), a third-party program, or the like that causes the computer 341 to execute the functions of the speech processing apparatus of the above-described embodiments. The program need only include those instructions that call the appropriate functions (modules) in a controlled manner so that the desired results are obtained. How the computer system 340 operates is well known, and a detailed description is omitted.

In each of the above embodiments, each process (each function) may be realized by centralized processing in a single device (system), or by distributed processing across a plurality of devices.
The program may be executed by a single computer or by a plurality of computers. That is, either centralized processing or distributed processing may be performed.
The present invention is not limited to the above embodiments; various modifications are possible, and it goes without saying that these are also included within the scope of the present invention.

As described above, the speech processing apparatus according to the present invention has the effect of enabling highly accurate speech processing that matches the speaker characteristics of the evaluation target person, and is useful as a pronunciation rating device, a karaoke rating device, a speech recognition device, and the like.

FIG. 1 is a block diagram of the speech processing apparatus according to Embodiment 1.
FIG. 2 is a flowchart explaining the operation of the speech processing apparatus.
FIG. 3 is a flowchart explaining the vocal tract length normalization process.
FIG. 4 shows an example of the HMM specifications.
FIG. 5 shows the measurement results for F1 and F2.
FIG. 6 shows the speech analysis conditions.
FIG. 7 shows an example of the calculated rating values plotted as a graph.
FIG. 8 shows an example of the calculated rating values plotted as a graph.
FIG. 9 shows an output example.
FIG. 10 shows an output example.
FIG. 11 is a block diagram of the speech processing apparatus according to Embodiment 2.
FIG. 12 is a flowchart explaining the operation of the speech processing apparatus.
FIG. 13 is a flowchart explaining the second sampling frequency calculation process.
FIG. 14 is a flowchart explaining the rating process.
FIG. 15 shows the rating results (t-p-DAP scores).
FIG. 16 shows an output example.
FIG. 17 is a block diagram of the speech processing apparatus according to Embodiment 3.
FIG. 18 is a flowchart explaining the operation of the speech processing apparatus.
FIG. 19 is a flowchart explaining the rating process.
FIG. 20 is a diagram explaining the detection of silence data.
FIG. 21 is a block diagram of the speech processing apparatus according to Embodiment 4.
FIG. 22 is a flowchart explaining the operation of the speech processing apparatus.
FIG. 23 is a flowchart explaining the rating process.
FIG. 24 is a diagram explaining the detection of phoneme insertion.
FIG. 25 shows an output example.
FIG. 26 is a block diagram of the speech processing apparatus according to Embodiment 5.
FIG. 27 is a flowchart explaining the operation of the speech processing apparatus.
FIG. 28 is a flowchart explaining the rating process.
FIG. 29 is a diagram explaining the detection of phoneme substitution.
FIG. 30 is a block diagram of the speech processing apparatus according to Embodiment 6.
FIG. 31 is a flowchart explaining the operation of the speech processing apparatus.
FIG. 32 is a flowchart explaining the rating process.
FIG. 33 is a diagram explaining the detection of missing phonemes.
FIG. 34 is a block diagram of the speech processing apparatus according to Embodiment 7.
FIG. 35 is a flowchart explaining the operation of the speech processing apparatus.
FIG. 36 is a block diagram of the speech processing apparatus according to Embodiment 8.
FIG. 37 is a flowchart explaining the operation of the speech processing apparatus.
FIG. 38 is an overview of the computer system constituting the speech processing apparatus.
FIG. 39 is a block diagram of the computer constituting the speech processing apparatus.

Explanation of symbols

101 Input reception unit
102 Teacher data storage unit
103 Voice reception unit
104 Teacher data formant frequency storage unit
105 First sampling frequency storage unit
106 Sampling unit
107 Evaluation target person formant frequency acquisition unit
108 Evaluation target person formant frequency storage unit
109 Vocal tract length normalization processing unit
110, 1110, 1710, 2110, 2610, 3010, 3410, 3610 Speech processing unit
1101, 34101 Frame dividing means
1102, 34102 Frame voice data acquisition means
1103, 11103, 17103, 21103, 26103, 30103, 34103 Rating means
1104, 21104, 36102 Output means
1109 Utterance prompting unit
11031 Optimal state determination means
11032 Optimal state probability value acquisition means
11033, 21023, 111033, 171033 Rating value calculation means
17101, 21101, 26101, 30101 Special speech detection means
36101 Speech recognition means
111032 Pronunciation section frame phoneme probability value acquisition means
171011 Silence data storage means
171012 Silent section detection means
341031 First rating means
341032 Second rating means
341033 Rating result acquisition means

Claims

A teacher data storage unit that stores one or more pieces of teacher data, each being data for one or more phonemes to be compared and having two or more state identifiers for identifying states and information on transition probabilities between states;
A voice reception unit for receiving voice;
A first sampling frequency storage unit storing a first sampling frequency;
A teacher data formant frequency storage unit storing a teacher data formant frequency, which is a formant frequency of the teacher data;
An evaluation target formant frequency storage unit that stores an evaluation target formant frequency that is a formant frequency of an evaluation target person who is a speaker of the voice received by the voice reception unit;
A vocal tract length normalization processing unit that obtains second voice data by performing a sampling process on the voice received by the voice reception unit at a second sampling frequency “first sampling frequency / (teacher data formant frequency / evaluation target formant frequency)”;
An audio processing unit for processing the second audio data;
The voice processing unit
Frame dividing means for dividing the second audio data into frames;
Frame audio data acquisition means for obtaining one or more frame audio data which is audio data for each of the divided frames;
Based on the teacher data and the one or more frame voice data, a rating means for rating the voice received by the voice receiving unit;
Comprising output means for outputting a rating result in the rating means,
The rating means is
Optimal state determination means for determining an optimal state for at least one frame audio data of the one or more frame audio data;
Using the ratio between the probability value in the optimum state determined by the optimum state determination means and the sum of the probability values in all states of the frame corresponding to the probability value, an optimum state probability value that is a probability value indicating the posterior probability is obtained. An optimal state probability value acquisition means to acquire;
A speech processing apparatus comprising rating value calculation means for calculating a voice rating value using the optimum state probability value acquired by the optimum state probability value acquisition means as a parameter.

A teacher data storage unit that stores one or more pieces of teacher data, each being data for one or more phonemes to be compared and having two or more state identifiers for identifying states and information on transition probabilities between states;
A voice reception unit for receiving voice;
A first sampling frequency storage unit storing a first sampling frequency;
A teacher data formant frequency storage unit storing a teacher data formant frequency which is a formant frequency of the teacher data;
An evaluation target formant frequency storage unit that stores an evaluation target formant frequency that is a formant frequency of an evaluation target person who is a speaker of the voice received by the voice reception unit;
A vocal tract length normalization processing unit that obtains second voice data by performing a sampling process on the voice received by the voice reception unit at a second sampling frequency “first sampling frequency / (teacher data formant frequency / evaluation target formant frequency)”;
An audio processing unit for processing the second audio data;
The voice processing unit
Frame dividing means for dividing the second audio data into frames;
Frame audio data acquisition means for obtaining one or more frame audio data which is audio data for each of the divided frames;
Based on the teacher data and the one or more frame voice data, a rating means for rating the voice received by the voice receiving unit;
Comprising output means for outputting a rating result in the rating means,
The rating means is
Optimal state determining means for determining an optimal state of the one or more frame audio data;
A pronunciation interval frame phoneme probability value acquisition means for acquiring a probability value indicating two or more posterior probabilities in the overall state of the phoneme having the optimal state of each frame determined by the optimal state determination means;
The sum of two or more probability values acquired by the pronunciation interval frame phoneme probability value acquisition means is calculated for each frame, and based on the sum of the probability values for each frame, the time average of the sum of the probability values for each pronunciation interval A speech processing apparatus comprising a rating value calculating means for calculating one or more values and calculating a rating value of speech using the one or more time average values.

Further comprising a sampling unit that samples the voice received by the voice reception unit at the first sampling frequency to obtain first voice data,
The vocal tract length normalization processing unit
The speech processing apparatus according to claim 1 or claim 2, wherein the second voice data is obtained by performing resampling processing on the first voice data at the second sampling frequency “first sampling frequency / (teacher data formant frequency / evaluation target formant frequency)”.

The voice processing unit
Based on the input voice data for each frame, further comprising special voice detection means for detecting that special voice is input,
The rating means is
The speech processing apparatus according to any one of claims 1 to 3, wherein the speech received by the voice reception unit is rated based on the teacher data, the input voice data, and a detection result of the special voice detecting means.

The special voice detecting means is
Silence data storage means for storing silence data that is data based on the HMM indicating silence;
Comprising silent section detection means for detecting a silent section based on the input voice data and the silence data,
The rating means is
The speech processing apparatus according to claim 4, wherein a speech rating value is calculated without using the data of the silent section.

The special voice detecting means is
When the rating values of the latter half of one phoneme, including at least its last frame, and of the first half of the next phoneme, including at least its first frame, are lower than a predetermined value; or when those rating values are lower than the predetermined value and it is determined that no silence has been inserted; or when the probability values of the latter half of one phoneme, including at least its last frame, and of the first half of the next phoneme, including at least its first frame, are lower than a predetermined value and, on obtaining probability values for the HMMs of other phonemes, a phoneme having a probability value higher than the predetermined value is detected; or when phonemes whose rating values are lower than the predetermined value are consecutive, it is determined that a phoneme has been inserted,
The rating means is
The speech processing apparatus according to claim 4, wherein when the special speech detection means determines that a phoneme has been inserted, the rating means constitutes a rating result indicating at least that a phoneme has been inserted.

The special voice detecting means is
The rating value of one phoneme is lower than a predetermined value, and the rating value of the phoneme immediately before the phoneme and the phoneme immediately after the phoneme is higher than the predetermined value, or the rating value of one phoneme is lower than the predetermined value And when the rating value calculated based on the HMM of the phoneme that is not assumed is higher than a predetermined value, it is determined that the phoneme has been replaced,
Or
The rating value of one phoneme is lower than a predetermined value, and the rating value of the phoneme immediately before the phoneme and the phoneme immediately after the phoneme is higher than the predetermined value, or the rating value of one phoneme is lower than the predetermined value And if the rating value of the phoneme immediately before the phoneme and the phoneme immediately after the phoneme is higher than a predetermined value and the section length of the phoneme is shorter than the predetermined length, or the probability value corresponding to the immediately preceding phoneme, or immediately after If the probability value corresponding to the phoneme is higher than the probability value of the one phoneme, it is determined that the phoneme is missing,
The rating means is
When the special speech detection means determines that a phoneme has been substituted, it constitutes a rating result indicating at least that a phoneme has been substituted,
Or
The speech processing apparatus according to claim 4, wherein when the special speech detecting means determines that a phoneme is missing, the rating means constitutes a rating result indicating at least that a phoneme is missing.

The voice processing device is a karaoke evaluation device,
The voice reception unit
Accepts singing voice of the person being evaluated,
The voice processing unit
The voice processing device according to claim 1, wherein the singing voice is evaluated.

A first sampling frequency storage unit storing a first sampling frequency;
A teacher data formant frequency storage unit storing a teacher data formant frequency, which is a formant frequency of the teacher data;
An evaluation subject formant frequency storage unit that stores an evaluation subject formant frequency that is a formant frequency of the evaluation subject who is the speaker of the voice;
A vocal tract length normalization processing unit that obtains second voice data by sampling the received voice at a second sampling frequency “first sampling frequency / (teacher data formant frequency / evaluation target formant frequency)”,
The voice processing unit
Frame dividing means for dividing the second audio data into frames;
Frame audio data acquisition means for obtaining one or more frame audio data which is audio data for each of the divided frames;
Based on the teacher data and the one or more frame voice data, a rating means for rating the voice received by the voice receiving unit;
Comprising output means for outputting a rating result in the rating means,
The rating means is:
Optimal state determination means for determining an optimal state for at least one frame audio data of the one or more frame audio data;
Using the ratio between the probability value in the optimum state determined by the optimum state determination means and the sum of the probability values in all states of the frame corresponding to the probability value, an optimum state probability value that is a probability value indicating the posterior probability is obtained. An optimal state probability value acquisition means to acquire;
A digital signal processor comprising rating value calculating means for calculating a voice rating value using the optimum state probability value acquired by the optimum state probability value acquiring means as a parameter.

A first sampling frequency storage unit storing a first sampling frequency;
A teacher data formant frequency storage unit storing a teacher data formant frequency which is a formant frequency of the teacher data;
An evaluation subject formant frequency storage unit that stores an evaluation subject formant frequency that is a formant frequency of the evaluation subject who is the speaker of the voice;
A vocal tract length normalization processing unit that obtains second voice data by sampling the received voice at a second sampling frequency “first sampling frequency / (teacher data formant frequency / evaluation target formant frequency)”,
The voice processing unit
Frame dividing means for dividing the second audio data into frames;
Frame audio data acquisition means for obtaining one or more frame audio data which is audio data for each of the divided frames;
Based on the teacher data and the one or more frame voice data, a rating means for rating the voice received by the voice receiving unit;
Comprising output means for outputting a rating result in the rating means,
The rating means is
Optimal state determining means for determining an optimal state of the one or more frame audio data;
A pronunciation interval frame phoneme probability value acquisition means for acquiring a probability value indicating two or more posterior probabilities in the overall state of the phoneme having the optimal state of each frame determined by the optimal state determination means;
The sum of two or more probability values acquired by the pronunciation interval frame phoneme probability value acquisition means is calculated for each frame, and based on the sum of the probability values for each frame, the time average of the sum of the probability values for each pronunciation interval A digital signal processor comprising a rating value calculating means for calculating one or more values and calculating a rating value of speech using the one or more time average values.