JP5028599B2

JP5028599B2 - Audio processing apparatus and program

Info

Publication number: JP5028599B2
Application number: JP2006161918A
Authority: JP
Inventors: 秀行渡辺; 玲子山田; 博章田川; 隆弘足立
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2005-12-26
Filing date: 2006-06-12
Publication date: 2012-09-19
Anticipated expiration: 2026-06-12
Also published as: JP2007199654A

Abstract

PROBLEM TO BE SOLVED: Conventional speech processing apparatuses cannot perform speech processing (singing voice evaluation and the like) suited to the speaker characteristics of the evaluation subject, who is the speaker, and as a result cannot achieve highly accurate speech processing.

SOLUTION: Speech processing suited to the speaker characteristics is made possible by a speech processing apparatus including: a speech receiving unit that receives speech; a first sampling frequency storage unit that stores a first sampling frequency; a vocal tract length normalization parameter storage unit that stores a vocal tract length normalization parameter, which is information on the conversion rate of the sampling frequency; a vocal tract length normalization processing unit that obtains second speech data by sampling the speech received by the speech receiving unit at a second sampling frequency calculated using the vocal tract length normalization parameter and the first sampling frequency as parameters; and a speech processing unit that processes the second speech data.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a speech processing apparatus and the like that evaluates input speech or recognizes input speech.

One conventional speech processing apparatus is the following (see Patent Document 1). This apparatus is a language learning device. It compares the learner's pronunciation of the role the learner selected with reference data, converts the degree of agreement into a score, displays the score, and automatically displays an appropriate next screen according to the score, thereby improving learning efficiency. In this conventional apparatus, the input speech signal is analyzed by speech recognition technology, after which the spectrum and intonation of the learner's pronunciation appear in a learner pronunciation display box. The standard sound data is then compared with the spectrum and intonation of the learner's pronunciation, and a score is displayed.

Another conventional speech processing apparatus is the following (see Patent Document 2). This apparatus is a singing voice evaluation device. It includes extraction means that extracts the frequency components of a singing voice, specific frequency component extraction means that extracts a fundamental frequency component and harmonic frequency components from the extracted frequency components, and evaluation means that calculates an evaluation value for the singing voice according to the ratio of the harmonic frequency components to the fundamental frequency component extracted by the specific frequency component extraction means. The device aims to bring the scoring of a singing voice closer to human perception by properly evaluating voice quality from the frequency components of the singing voice and reflecting that evaluation in the score.

A further conventional speech processing apparatus is the following (see Patent Document 3). This apparatus is a speech recognition device that matches an input speech pattern against standard patterns using the DP method and takes the standard pattern with the smallest matching distance as the recognition result. It is characterized in that it divides the input pattern into phonemes using the matching result, calculates the variance of the deviation between the duration of each phoneme and the standard duration, and corrects the distance by adding this variance to the matching distance. A dividing unit divides the input into phonemes using the matching result, a duration deviation calculation unit calculates the variance of the deviation from the standard duration, and a distance correction unit corrects the matching distance. The device also has a phoneme selection unit that selects the phonemes for which the duration deviation is calculated and a word selection unit that selects the words whose distance is corrected, and is said to improve word recognition performance.
JP 2003-228279 A (page 1, FIG. 1, etc.)
JP 2005-107088 A (page 1, FIG. 1, etc.)
JP H06-4096 A (page 1, FIG. 1, etc.)

However, the conventional techniques of Patent Document 1 and Patent Document 2 cannot perform speech processing suited to the speaker characteristics of the evaluation subject, who is the speaker of the speech (including singing), and as a result cannot achieve highly accurate speech processing. Specifically, differences in the vocal tract length of evaluation subjects stretch or compress the spectral envelope toward higher or lower frequencies, and in conventional speech processing apparatuses such as pronunciation rating devices and singing voice evaluation devices this stretching or compression changes the evaluation result. In other words, with the conventional techniques, pronunciation or singing of the same skill level receives different evaluation results depending on the vocal tract length of the evaluation subject, so highly accurate evaluation is not possible.

In addition, because the speech processing apparatus of Patent Document 1 displays a score by comparing the standard sound data with the spectrum and intonation of the learner's pronunciation, the accuracy with which it rates the similarity between the two is low, and displaying the score quickly in real time requires a CPU with very high processing capability and a large amount of memory.

Furthermore, in the speech processing apparatus of Patent Document 1, the presence of a silent section is thought to lower the similarity rating, so the evaluation accuracy is low. The apparatus also cannot detect the occurrence of special events such as phoneme substitution, insertion, or omission.

Moreover, in a speech processing apparatus that performs speech recognition, such as the one shown in Patent Document 3, differences in the vocal tract length of evaluation subjects stretch or compress the spectral envelope, yet the apparatus does not perform speech recognition suited to these speaker characteristics, so highly accurate speech recognition is not possible.

The speech processing apparatus of the first invention comprises: a speech receiving unit that receives speech; a first sampling frequency storage unit that stores a first sampling frequency; a vocal tract length normalization parameter storage unit that stores a vocal tract length normalization parameter, which is information on the conversion rate of the sampling frequency; a vocal tract length normalization processing unit that samples the speech received by the speech receiving unit at a second sampling frequency, calculated using the vocal tract length normalization parameter and the first sampling frequency as parameters, to obtain second speech data; and a speech processing unit that processes the second speech data.

With this configuration, speech processing suited to the speaker characteristics of the evaluation subject can be performed.

The speech processing apparatus of the second invention is the apparatus of the first invention, further comprising: a teacher data storage unit that stores one or more items of teacher data, which is data on the speech to be compared against and is data for each of one or more phonemes; a teacher data formant frequency storage unit that stores a teacher data formant frequency, which is the formant frequency of the teacher data; and an evaluation subject formant frequency storage unit that stores an evaluation subject formant frequency, which is the formant frequency of the evaluation subject, that is, the speaker of the speech received by the speech receiving unit. The vocal tract length normalization parameter is the value calculated as "evaluation subject formant frequency / teacher data formant frequency". The vocal tract length normalization processing unit calculates the second sampling frequency by the expression "first sampling frequency × vocal tract length normalization parameter" and samples the speech received by the speech receiving unit at that second sampling frequency to obtain the second speech data.

With this configuration, highly accurate speech processing suited to the speaker characteristics of the evaluation subject can be performed.
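The two expressions of the second invention can be traced with a small numerical example. The following Python sketch is illustrative only: the function names, the formant values (880 Hz and 800 Hz), and the 16 kHz first sampling frequency are assumptions chosen for the example and do not come from the specification.

```python
def vtln_parameter(subject_formant_hz, teacher_formant_hz):
    """Vocal tract length normalization parameter:
    evaluation subject formant frequency / teacher data formant frequency."""
    if subject_formant_hz <= 0 or teacher_formant_hz <= 0:
        raise ValueError("formant frequencies must be positive")
    return subject_formant_hz / teacher_formant_hz


def second_sampling_frequency(first_fs_hz, parameter):
    """Second sampling frequency:
    first sampling frequency x vocal tract length normalization parameter."""
    return first_fs_hz * parameter


# Hypothetical values: the subject's formant is 10% higher than the teacher's.
parameter = vtln_parameter(880.0, 800.0)
second_fs_hz = second_sampling_frequency(16000.0, parameter)
print(parameter, second_fs_hz)
```

With these assumed inputs the parameter is about 1.1 and the second sampling frequency is about 17.6 kHz.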

The speech processing apparatus of the third invention is the apparatus of the second invention, further comprising a sampling unit that samples the speech received by the speech receiving unit at the first sampling frequency to obtain first speech data. The vocal tract length normalization processing unit calculates the second sampling frequency by the expression "first sampling frequency × vocal tract length normalization parameter" and performs sampling processing on the first speech data at that second sampling frequency to obtain the second speech data.
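The re-sampling of the first speech data at the second sampling frequency might look like the following Python sketch. This section does not specify the resampling method, so linear interpolation is used here purely as a stand-in, and `resample_linear`, the test tone, and the 16 kHz / 17.6 kHz rates are hypothetical. The sketch also assumes that later processing reads the second speech data back as if it had been sampled at the first sampling frequency, which is what scales the frequency axis.

```python
import math


def resample_linear(samples, first_fs_hz, second_fs_hz):
    """Re-sample data taken at first_fs_hz at second_fs_hz (linear interpolation)."""
    if not samples:
        return []
    duration_s = (len(samples) - 1) / first_fs_hz
    n_out = int(math.floor(duration_s * second_fs_hz)) + 1
    out = []
    for k in range(n_out):
        pos = k * first_fs_hz / second_fs_hz  # position in input-sample units
        i = int(pos)
        frac = pos - i
        if i + 1 < len(samples):
            out.append(samples[i] * (1.0 - frac) + samples[i + 1] * frac)
        else:
            out.append(samples[-1])  # at (or just past) the last input sample
    return out


# Hypothetical first speech data: a 1100 Hz tone sampled at 16 kHz for 0.1 s.
first_fs_hz = 16000.0
second_fs_hz = 17600.0  # 16000 x 1.1, an assumed normalization parameter of 1.1
tone = [math.sin(2 * math.pi * 1100.0 * n / first_fs_hz) for n in range(1600)]
second_speech_data = resample_linear(tone, first_fs_hz, second_fs_hz)
print(len(tone), len(second_speech_data))
```

After re-sampling, one cycle of the tone spans 17600 / 1100 = 16 samples. Read back at 16 kHz, 16 samples per cycle corresponds to 1000 Hz, so under the stated assumption the spectrum is compressed by the factor 1.1.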

The speech processing apparatus of the fourth invention is the apparatus of the first invention, further comprising: a teacher data storage unit that stores one or more items of teacher data, which is data on the speech to be compared against and is data for each of one or more phonemes; and a vocal tract length normalization parameter calculation unit that calculates the vocal tract length normalization parameter based on the teacher data and the speech received by the speech receiving unit. The vocal tract length normalization parameter held in the vocal tract length normalization parameter storage unit is the one calculated by the vocal tract length normalization parameter calculation unit.

With this configuration, the vocal tract length normalization parameter can be obtained dynamically for each evaluation subject, and highly accurate speech processing suited to the speaker characteristics can be performed.

The speech processing apparatus of the fifth invention is the apparatus of the fourth invention, wherein the vocal tract length normalization parameter calculation unit comprises: learning acoustic data storage means that stores learning acoustic data, which is a concatenated HMM formed by concatenating phoneme HMMs according to the specified utterance content; second cepstrum vector sequence calculation means that calculates a second cepstrum vector sequence by short-interval analysis of the speech received by the speech receiving unit; cepstrum transformation means that transforms the second cepstrum vector sequence using a matrix whose elements are cepstrum transformation parameters, and calculates a frequency-warped third cepstrum vector sequence; occupancy calculation means that calculates occupancy counts, which are posterior probabilities of the received speech according to the specified utterance content; cepstrum transformation parameter calculation means that calculates the cepstrum transformation parameters based on the learning acoustic data, the third cepstrum vector sequence, and the occupancy counts; final cepstrum transformation parameter acquisition means that, based on a predetermined rule, causes the processing in the cepstrum transformation means, the occupancy calculation means, and the cepstrum transformation parameter calculation means to be repeated, and obtains the cepstrum transformation parameters; and vocal tract length normalization parameter calculation means that calculates the vocal tract length normalization parameter based on the cepstrum transformation parameters obtained by the final cepstrum transformation parameter acquisition means.

The speech processing apparatus of the sixth invention is the apparatus of the fourth invention, wherein the vocal tract length normalization parameter calculation unit comprises: learning acoustic data storage means that stores learning acoustic data, which is a phoneme average cepstrum vector sequence formed by arranging phoneme average cepstrum vectors according to the specified utterance content; second cepstrum vector sequence calculation means that calculates a second cepstrum vector sequence by short-interval analysis of the speech received by the speech receiving unit; cepstrum transformation means that transforms the second cepstrum vector sequence using a matrix whose elements are cepstrum transformation parameters, and calculates a frequency-warped third cepstrum vector sequence; optimum phoneme sequence acquisition means that acquires a phoneme sequence, which is a sequence of information identifying the phoneme corresponding to each frame of the speech uttered by the user; cepstrum transformation parameter calculation means that calculates the cepstrum transformation parameters using the learning acoustic data, the third cepstrum vector sequence, and the phoneme sequence; final cepstrum transformation parameter acquisition means that, based on a predetermined rule, causes the processing in the cepstrum transformation means, the optimum phoneme sequence acquisition means, and the cepstrum transformation parameter calculation means to be repeated, and obtains the cepstrum transformation parameters; and vocal tract length normalization parameter calculation means that calculates the vocal tract length normalization parameter based on the cepstrum transformation parameters obtained by the final cepstrum transformation parameter acquisition means.

The speech processing apparatus of the seventh invention is the apparatus of the fifth or sixth invention, wherein the vocal tract length normalization parameter calculation means calculates a linear frequency scaling ratio from the cepstrum transformation parameters obtained by the final cepstrum transformation parameter acquisition means, and calculates the vocal tract length normalization parameter from that linear frequency scaling ratio.

With this configuration, the vocal tract length normalization parameter can be obtained dynamically for each evaluation subject, and highly accurate speech processing suited to the speaker characteristics can be performed.

The speech processing apparatus of the eighth invention is the apparatus of the fifth or sixth invention, wherein the vocal tract length normalization parameter calculation unit further comprises frequency range designation information storage means that stores frequency range designation information, which is information designating a frequency range, and the cepstrum transformation parameter calculation means calculates the cepstrum transformation parameters using the frequency range designation information.

With this configuration, the vocal tract length normalization parameter can be obtained dynamically for each evaluation subject, and even more accurate speech processing suited to the speaker characteristics can be performed.

The speech processing apparatus of the ninth invention is the apparatus of any of the first to eighth inventions, wherein the speech processing unit comprises: frame dividing means that divides the second speech data into frames; frame speech data acquisition means that obtains one or more items of frame speech data, which is the speech data of each divided frame; rating means that rates the speech received by the speech receiving unit based on the teacher data and the one or more items of frame speech data; and output means that outputs the rating result of the rating means.

With this configuration, speech can be rated with even higher accuracy in a manner suited to the speaker characteristics.
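The frame division performed on the second speech data can be pictured with the short Python sketch below. The frame length and frame shift are not given in this section, so the values used here, as well as the function name `split_into_frames`, are hypothetical; discarding a trailing partial frame is likewise an assumption of the sketch.

```python
def split_into_frames(samples, frame_length, frame_shift):
    """Divide speech data into (possibly overlapping) fixed-length frames.

    A trailing partial frame is discarded (an assumption of this sketch).
    """
    if frame_length <= 0 or frame_shift <= 0:
        raise ValueError("frame_length and frame_shift must be positive")
    frames = []
    start = 0
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += frame_shift
    return frames


# Toy data: 10 samples, frames of 4 samples shifted by 2 samples.
frames = split_into_frames(list(range(10)), frame_length=4, frame_shift=2)
for index, frame in enumerate(frames):
    print(index, frame)
```

Each element of `frames` plays the role of one item of frame speech data handed to the rating means.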

The speech processing apparatus of the tenth invention is the apparatus of the ninth invention, wherein the rating means comprises: optimum state determination means that determines the optimum state for at least one of the one or more items of frame speech data; optimum state probability value acquisition means that acquires the probability value in the optimum state determined by the optimum state determination means; and rating value calculation means that calculates a rating value of the speech using the probability value acquired by the optimum state probability value acquisition means as a parameter.

With this configuration, speech can be rated with high accuracy in a manner suited to the speaker characteristics of the evaluation subject.

The speech processing apparatus of the eleventh invention is the apparatus of the ninth invention, wherein the rating means comprises: optimum state determination means that determines the optimum state of each of the one or more items of frame speech data; pronunciation interval frame phoneme probability value acquisition means that acquires, for each pronunciation interval, one or more probability values taken over all states of the phoneme containing the optimum state of each frame determined by the optimum state determination means; and rating value calculation means that calculates a rating value of the speech using, as parameters, the one or more probability values for each of the one or more pronunciation intervals acquired by the pronunciation interval frame phoneme probability value acquisition means.

With this configuration, speech can be rated with even higher accuracy in a manner suited to the speaker characteristics of the evaluation subject.

The speech processing apparatus of the twelfth invention is the apparatus of the ninth invention, wherein the speech processing unit further comprises special speech detection means that detects, based on the input speech data of each frame, that special speech has been input, and the rating means rates the speech received by the speech receiving unit based on the teacher data, the input speech data, and the detection result of the special speech detection means.

With this configuration, speech can be rated with high accuracy in a manner suited to the speaker characteristics of the evaluation subject, special speech can be detected, and the rating can take that special speech into account.

The speech processing apparatus of the thirteenth invention is the apparatus of the twelfth invention, wherein the special speech detection means comprises: silence data storage means that stores silence data, which is data based on an HMM representing silence; and silent section detection means that detects silent sections based on the input speech data and the silence data.

With this configuration, speech can be rated with high accuracy in a manner suited to the speaker characteristics of the evaluation subject, silent sections can be detected, and the rating can take those silent sections into account.
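One way to picture silent section detection is the Python sketch below. The detection rule is not given in this section, so the rule used here is an assumption: a frame counts as silent when its log-likelihood under the silence model exceeds its best log-likelihood under the speech models. The function name `silent_sections`, the `min_frames` parameter, and the toy scores are all hypothetical.

```python
def silent_sections(silence_log_lik, speech_log_lik, min_frames=1):
    """Return (first_frame, last_frame) pairs for runs of frames judged silent.

    Assumed rule: frame t is silent when silence_log_lik[t] > speech_log_lik[t].
    Runs shorter than min_frames are ignored.
    """
    n_frames = min(len(silence_log_lik), len(speech_log_lik))
    sections = []
    start = None
    for t in range(n_frames):
        if silence_log_lik[t] > speech_log_lik[t]:
            if start is None:
                start = t  # a silent run begins
        elif start is not None:
            if t - start >= min_frames:
                sections.append((start, t - 1))
            start = None
    if start is not None and n_frames - start >= min_frames:
        sections.append((start, n_frames - 1))  # run reaches the last frame
    return sections


# Toy per-frame scores for five frames (hypothetical values).
silence_scores = [-1.0, -1.0, -5.0, -5.0, -1.0]
speech_scores = [-3.0, -3.0, -2.0, -2.0, -4.0]
sections = silent_sections(silence_scores, speech_scores)
print(sections)
```

The detected sections could then be excluded from, or handled separately in, the rating, in line with the effect described above.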

The speech processing apparatus of the fourteenth invention is the apparatus of the twelfth invention, wherein the special speech detection means detects that the rating values of the second half of one phoneme and the first half of the following phoneme satisfy a predetermined condition, and the rating means, when the special speech detection means detects that the predetermined condition is satisfied, composes a rating result indicating at least that a phoneme has been inserted.

With this configuration, speech can be rated with high accuracy in a manner suited to the speaker characteristics of the evaluation subject, phoneme insertion can be detected, and the rating can take that insertion into account.

The speech processing apparatus of the fifteenth invention is the apparatus of the twelfth invention, wherein the special speech detection means detects that the rating value of one phoneme satisfies a predetermined condition, and the rating means, when the special speech detection means detects that the predetermined condition is satisfied, composes a rating result indicating at least that a phoneme has been substituted or omitted.

With this configuration, speech can be rated with high accuracy in a manner suited to the speaker characteristics of the evaluation subject, phoneme substitution or omission can be detected, and the rating can take that substitution or omission into account.

The speech processing apparatus of the sixteenth invention is the apparatus of any of the first to fifteenth inventions, wherein the speech processing apparatus is a karaoke evaluation apparatus, the speech receiving unit receives input of the evaluation subject's singing voice, and the speech processing unit evaluates the singing voice.

With this configuration, the apparatus can be used as an excellent karaoke evaluation apparatus.

The speech processing apparatus of the seventeenth invention is the apparatus of the sixteenth invention, wherein the speech receiving unit receives the speech of a predetermined vowel and then receives input of the evaluation subject's singing voice, and the sampling unit also samples the vowel speech at the first sampling frequency. The apparatus further comprises an evaluation subject formant frequency acquisition unit that acquires the evaluation subject formant frequency, which is the formant frequency of the evaluation subject, based on the sampled vowel speech. The evaluation subject formant frequency held in the evaluation subject formant frequency storage unit is the one acquired by the evaluation subject formant frequency acquisition unit.

The speech processing apparatus of the eighteenth invention is the apparatus of any of the first to eighth inventions, wherein the speech processing unit performs speech recognition processing based on the second speech data.

With this configuration, highly accurate speech recognition suited to the speaker characteristics of the evaluation subject can be performed.

The nineteenth invention is a digital signal processor comprising: a first sampling frequency storage unit that stores a first sampling frequency; a vocal tract length normalization parameter storage unit that stores a vocal tract length normalization parameter, which is information on the conversion rate of the sampling frequency; and a vocal tract length normalization processing unit that samples the speech received by the speech receiving unit at a second sampling frequency, calculated using the vocal tract length normalization parameter and the first sampling frequency as parameters, to obtain second speech data.

With this configuration, a DSP capable of highly accurate speech processing suited to the speaker characteristics of the evaluation subject can be provided.

According to the speech processing apparatus of the present invention, highly accurate speech processing suited to the speaker characteristics of the evaluation subject can be performed.

Embodiments of a speech processing apparatus and the like are described below with reference to the drawings. Components given the same reference numerals in the embodiments operate in the same way, so repeated description may be omitted.
(Embodiment 1)

This embodiment describes a speech processing apparatus that can rate the similarity between the speech to be compared against and the input speech both accurately and quickly. The apparatus is a pronunciation rating device that evaluates speech (including singing). Because the apparatus uses dynamic programming to calculate the posterior probability of the optimum state for each frame of the input speech, this posterior probability is called DAP (Dynamic A Posteriori Probability), and the similarity calculation method and pronunciation rating device based on DAP are called DAPS.
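The idea of rating speech by the posterior probability of the optimum state can be sketched in Python as follows. This is not the DAP formula of the specification, which does not appear in this passage. The sketch assumes a left-to-right state sequence with uniform state priors and transitions, finds the optimum state per frame by dynamic programming (Viterbi), and averages the per-frame posteriors into a 0 to 100 score. The function names, the scoring scale, and the toy likelihoods are all hypothetical.

```python
import math


def best_state_path(log_lik):
    """Viterbi over a left-to-right model: start in state 0, end in the last
    state, and at each frame either stay or advance by one state.
    log_lik[t][s] is the log-likelihood of frame t under state s."""
    n_frames, n_states = len(log_lik), len(log_lik[0])
    neg_inf = float("-inf")
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = log_lik[0][0]
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s]
            advance = score[t - 1][s - 1] if s > 0 else neg_inf
            best = max(stay, advance)
            if best > neg_inf:
                score[t][s] = best + log_lik[t][s]
                back[t][s] = s - 1 if advance > stay else s
    if score[-1][-1] == neg_inf:
        raise ValueError("too few frames to reach the final state")
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def rating_value(log_lik):
    """Mean posterior probability of the optimum state, scaled to 0-100."""
    path = best_state_path(log_lik)
    posteriors = []
    for frame, state in zip(log_lik, path):
        total = sum(math.exp(value) for value in frame)
        posteriors.append(math.exp(frame[state]) / total)
    return 100.0 * sum(posteriors) / len(posteriors)


# Toy likelihoods for four frames and two states (hypothetical values).
toy = [[math.log(0.9), math.log(0.1)],
       [math.log(0.8), math.log(0.2)],
       [math.log(0.3), math.log(0.7)],
       [math.log(0.1), math.log(0.9)]]
path = best_state_path(toy)
score = rating_value(toy)
print(path, score)
```

In this toy case the optimum path stays in the first state for two frames and then moves to the second, and the score is the average of the posteriors 0.9, 0.8, 0.7, and 0.9.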

The speech processing apparatus according to this embodiment can be used, for example, for language learning, impersonation practice, and karaoke scoring. FIG. 1 is a block diagram of the speech processing apparatus according to this embodiment. The apparatus includes an input reception unit 101, a teacher data storage unit 102, a speech reception unit 103, a teacher data formant frequency storage unit 104, a first sampling frequency storage unit 105, a sampling unit 106, an evaluation subject formant frequency acquisition unit 107, an evaluation subject formant frequency storage unit 108, a vocal tract length normalization processing unit 109, and a speech processing unit 110.

The speech processing unit 110 includes a frame segmentation unit 1101, a frame audio data acquisition unit 1102, a rating unit 1103, and an output unit 1104.

評定手段１１０３は、最適状態決定手段１１０３１、最適状態確率値取得手段１１０３２、評定値算出手段１１０３３を具備する。 The rating unit 1103 includes an optimum state determination unit 11031, an optimum state probability value acquisition unit 11032, and a rating value calculation unit 11033.

なお、音声処理装置は、キーボード３４２、マウス３４３などの入力手段からの入力を受け付ける。また、音声処理装置は、マイク３４５などの音声入力手段から音声入力を受け付ける。さらに、音声処理装置は、ディスプレイ３４４などの出力デバイスに情報を出力する。 Note that the voice processing apparatus accepts input from input means such as a keyboard 342 and a mouse 343. The voice processing apparatus accepts voice input from voice input means such as a microphone 345. Further, the sound processing apparatus outputs information to an output device such as the display 344.

The input reception unit 101 accepts inputs such as an operation start instruction, which instructs the speech processing apparatus to start operating; an output mode change instruction, which instructs a change in how the rating result of the input speech is output; and an end instruction, which ends processing. Any input means may be used for such instructions, including a numeric keypad, keyboard, mouse, or menu screen. The input reception unit 101 can be realized by a device driver for input means such as a numeric keypad or keyboard, by control software for a menu screen, or the like.

The teacher data storage unit 102 stores one or more pieces of teacher data, which is data on the speech to be compared against and is based on a hidden Markov model (HMM) for each phoneme. The teacher data is preferably based on an HMM formed by concatenating the per-phoneme HMMs. It is also preferable that the teacher data be based on an HMM in which the HMMs corresponding to the phonemes of the input speech are concatenated in input order. However, the teacher data need not be based on concatenated per-phoneme HMMs; it may simply be the set of HMMs for all phonemes. Nor does the teacher data need to be HMM-based; it may be based on another model, such as a single Gaussian distribution model, a probabilistic model (GMM: Gaussian mixture model), or a statistical model. HMM-based data has, for example, a state identifier and transition probability information for each frame. HMM-based data may also be, for example, a model trained (estimated) from two or more utterances by multiple foreign speakers whose native language is the language being learned. The teacher data storage unit 102 is preferably a non-volatile recording medium such as a hard disk or ROM, but can also be realized with a volatile recording medium such as RAM.

音声受付部１０３は、音声を受け付ける。音声受付部１０３は、例えば、マイク３４５のドライバーソフトで実現され得る。また、なお、音声受付部１０３は、マイク３４５とそのドライバーから実現されると考えても良い。音声は、マイク３４５から入力されても良いし、磁気テープやＣＤ−ＲＯＭなどの記録媒体から読み出すことにより入力されても良い。 The voice reception unit 103 receives voice. The voice reception unit 103 can be realized by, for example, driver software for the microphone 345. In addition, it may be considered that the voice reception unit 103 is realized by the microphone 345 and its driver. The sound may be input from the microphone 345 or may be input by reading from a recording medium such as a magnetic tape or a CD-ROM.

教師データフォルマント周波数格納部１０４は、教師データのフォルマント周波数である教師データフォルマント周波数を格納している。教師データフォルマント周波数は、第一フォルマント周波数（Ｆ１）でも、第二フォルマント周波数（Ｆ２）でも、第三フォルマント周波数（Ｆ３）等でも良い。教師データフォルマント周波数格納部１０４の教師データフォルマント周波数は、予め格納されていても良いし、評価時に、動的に、教師データから取得しても良い。音声データからフォルマント周波数を取得する技術は、公知技術であるので説明を省略する。教師データフォルマント周波数格納部１０４は、不揮発性の記録媒体が好適であるが、揮発性の記録媒体でも実現可能である。 The teacher data formant frequency storage unit 104 stores a teacher data formant frequency that is a formant frequency of teacher data. The teacher data formant frequency may be the first formant frequency (F1), the second formant frequency (F2), the third formant frequency (F3), or the like. The teacher data formant frequency in the teacher data formant frequency storage unit 104 may be stored in advance, or may be dynamically acquired from the teacher data at the time of evaluation. Since a technique for obtaining a formant frequency from audio data is a known technique, a description thereof will be omitted. The teacher data formant frequency storage unit 104 is preferably a nonvolatile recording medium, but can also be realized by a volatile recording medium.

第一サンプリング周波数格納部１０５は、第一のサンプリング周波数である第一サンプリング周波数を格納している。第一サンプリング周波数は、評価対象者の音声を、最初にサンプリングする場合のサンプリング周波数である。第一サンプリング周波数格納部１０５は、不揮発性の記録媒体が好適であるが、揮発性の記録媒体でも実現可能である。 The first sampling frequency storage unit 105 stores a first sampling frequency that is a first sampling frequency. The first sampling frequency is a sampling frequency when the voice of the evaluation subject is first sampled. The first sampling frequency storage unit 105 is preferably a non-volatile recording medium, but can also be realized by a volatile recording medium.

サンプリング部１０６は、第一サンプリング周波数格納部１０５の第一サンプリング周波数で、音声受付部１０３が受け付けた音声をサンプリングし、第一音声データを取得する。なお、受け付けた音声をサンプリングする技術は公知技術であるので、詳細な説明を省略する。サンプリング部１０６は、通常、ＭＰＵやメモリ等から実現され得る。サンプリング部１０６の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The sampling unit 106 samples the sound received by the sound receiving unit 103 at the first sampling frequency of the first sampling frequency storage unit 105 and acquires first sound data. Since the technique for sampling the received voice is a known technique, detailed description thereof is omitted. The sampling unit 106 can be usually realized by an MPU, a memory, or the like. The processing procedure of the sampling unit 106 is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).

評価対象者フォルマント周波数取得部１０７は、サンプリング部１０６が取得した第一音声データから、評価対象者のフォルマント周波数である評価対象者フォルマント周波数を取得する。評価対象者フォルマント周波数も、第一フォルマント周波数（Ｆ１）でも、第二フォルマント周波数（Ｆ２）でも、第三フォルマント周波数（Ｆ３）でも良い。ただし、評価対象者フォルマント周波数と教師データフォルマント周波数は同一種のフォルマント周波数である。サンプリングして取得した第一音声データから、フォルマント周波数を取得する技術は公知技術であるので、詳細な説明を省略する。評価対象者フォルマント周波数取得部１０７は、通常、ＭＰＵやメモリ等から実現され得る。評価対象者フォルマント周波数取得部１０７の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The evaluation target person formant frequency acquisition unit 107 acquires the evaluation target person formant frequency, which is the formant frequency of the evaluation target person, from the first sound data acquired by the sampling unit 106. The formant frequency to be evaluated may also be the first formant frequency (F1), the second formant frequency (F2), or the third formant frequency (F3). However, the evaluation subject formant frequency and the teacher data formant frequency are the same type of formant frequency. Since the technique for acquiring the formant frequency from the first audio data obtained by sampling is a known technique, detailed description thereof is omitted. The evaluation subject formant frequency acquisition unit 107 can be usually realized by an MPU, a memory, or the like. The processing procedure of the evaluation subject formant frequency acquisition unit 107 is usually realized by software, and the software is recorded in a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).

評価対象者フォルマント周波数格納部１０８は、音声受付部１０３が受け付けた音声の話者である評価対象者のフォルマント周波数である評価対象者フォルマント周波数を、少なくとも一時的に格納している。評価対象者フォルマント周波数格納部１０８の評価対象者フォルマント周波数は、通常、評価対象者フォルマント周波数取得部１０７が取得したフォルマント周波数であるが、予め評価対象者フォルマント周波数を格納していても良い。評価対象者フォルマント周波数格納部１０８に、予め評価対象者フォルマント周波数が格納されている場合、本音声処理装置において、評価対象者フォルマント周波数取得部１０７は不要である。評価対象者フォルマント周波数格納部１０８は、不揮発性の記録媒体でも、揮発性の記録媒体でも良い。 The evaluation subject formant frequency storage unit 108 stores at least temporarily the evaluation subject formant frequency, which is the formant frequency of the evaluation subject who is the speaker of the voice received by the voice receiving unit 103. The evaluation subject formant frequency in the evaluation subject formant frequency storage unit 108 is usually the formant frequency acquired by the evaluation subject formant frequency acquisition unit 107, but the evaluation subject formant frequency may be stored in advance. When the evaluation subject formant frequency is stored in advance in the evaluation subject formant frequency storage unit 108, the evaluation subject formant frequency acquisition unit 107 is not required in the speech processing apparatus. The evaluation subject formant frequency storage unit 108 may be a non-volatile recording medium or a volatile recording medium.

The vocal tract length normalization processing unit 109 performs sampling processing on the speech received by the speech reception unit 103 at the second sampling frequency to obtain second audio data. The second sampling frequency is calculated as "first sampling frequency / (teacher data formant frequency / evaluation subject formant frequency)". The vocal tract length normalization processing unit 109 preferably obtains the second audio data by resampling the first audio data, which was obtained by sampling the received speech, but it may instead sample the received speech and obtain the second audio data directly. Obtaining the second audio data directly requires, for example, sampling hardware that can operate at a variable sampling frequency. The vocal tract length normalization processing unit 109 normally computes "teacher data formant frequency / evaluation subject formant frequency" to obtain the frequency axis conversion rate (denoted "r"). It then computes "Fs / r" from the first sampling frequency (Fs) in the first sampling frequency storage unit 105 and "r" to obtain a new sampling frequency (Fs / r), which is the second sampling frequency. Next, it resamples the first audio data at the second sampling frequency (Fs / r) to obtain the second audio data. The vocal tract length normalization processing unit 109 can usually be realized by an MPU, a memory, or the like. Its processing procedure is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (a dedicated circuit). Note that the reciprocal "1 / r" of the frequency axis conversion rate "r" can be regarded as the vocal tract length parameter described later. Since the vocal tract length parameter is a parameter for normalizing the vocal tract length, the frequency axis conversion rate "r" itself may also be regarded as the vocal tract length parameter.
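As a worked illustration of this arithmetic, with values assumed purely for the example (none of them appear in the source): a first sampling frequency of 16 kHz, a teacher data formant frequency of 1200 Hz, and an evaluation subject formant frequency of 1000 Hz would give:

```latex
r = \frac{F_{\mathrm{teacher}}}{F_{\mathrm{subject}}} = \frac{1200}{1000} = 1.2,
\qquad
F_{s,2} = \frac{F_s}{r} = \frac{16000}{1.2} \approx 13333\ \mathrm{Hz}
```

Here F_teacher, F_subject, and F_{s,2} are shorthand introduced for this example only; the description itself names just Fs and r.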

The speech processing unit 110 processes the second audio data. Here, the speech processing unit 110 performs rating processing. However, it may perform other speech processing such as speech recognition or speech output. Speech output simply outputs the resampled speech. In this embodiment, the speech processing unit 110 is described as performing rating processing. The speech processing unit 110 can usually be realized by an MPU, a memory, or the like. Its processing procedure is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (a dedicated circuit).

The frame segmentation unit 1101 of the speech processing unit 110 divides the second audio data into frames. The frame segmentation unit 1101 can usually be realized by an MPU, a memory, or the like. Its processing procedure is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (a dedicated circuit).

音声処理部１１０を構成しているフレーム音声データ取得手段１１０２は、区分されたフレーム毎の音声データであるフレーム音声データを１以上得る。フレーム音声データ取得手段１１０２は、通常、ＭＰＵやメモリ等から実現され得る。フレーム音声データ取得手段１１０２の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The frame sound data acquisition means 1102 constituting the sound processing unit 110 obtains one or more frame sound data which is sound data for each divided frame. The frame audio data acquisition unit 1102 can be usually realized by an MPU, a memory, or the like. The processing procedure of the frame audio data acquisition unit 1102 is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).

音声処理部１１０を構成している評定手段１１０３は、教師データ格納部１０２の教師データと１以上のフレーム音声データに基づいて、音声受付部１０３が受け付けた音声の評定を行う。評定方法の具体例は、後述する。「音声受付部１０３が受け付けた音声を評定する」の概念には、第二音声データを評定することも含まれることは言うまでもない。評定手段１１０３は、通常、ＭＰＵやメモリ等から実現され得る。評定手段１１０３の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The rating means 1103 constituting the voice processing unit 110 evaluates the voice received by the voice receiving unit 103 based on the teacher data in the teacher data storage unit 102 and one or more frame voice data. A specific example of the rating method will be described later. Needless to say, the concept of “rating the voice received by the voice receiving unit 103” includes rating the second voice data. The rating means 1103 can usually be realized by an MPU, a memory, or the like. The processing procedure of the rating means 1103 is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).

The optimum state determination unit 11031 of the rating unit 1103 determines the optimum state for at least one of the one or more pieces of frame audio data. For example, the optimum state determination unit 11031 acquires, from the HMMs for all phonemes, the HMMs corresponding to the one or more phonemes that make up the speech of the word or sentence to be compared (the learning target), and concatenates the acquired HMMs in phoneme order to construct data (data on the speech to be compared, based on an HMM formed by concatenating the per-phoneme hidden Markov models). Then, based on the constructed data and each feature vector o_t of the acquired feature vector sequence, it determines the optimum state for a given frame (the optimum state for feature vector o_t). The algorithm for determining the optimum state is, for example, the Viterbi algorithm. The teacher data may be regarded either as the above data on the speech to be compared, based on the concatenated per-phoneme HMMs, or as the data before concatenation, that is, the HMMs for all phonemes.

評定手段１１０３を構成している最適状態確率値取得手段１１０３２は、最適状態決定手段１１０３１が決定した最適状態における確率値を取得する。 The optimum state probability value acquisition unit 11032 constituting the rating unit 1103 acquires the probability value in the optimum state determined by the optimum state determination unit 11031.

The rating value calculation unit 11033 of the rating unit 1103 calculates a speech rating value using the probability value acquired by the optimum state probability value acquisition unit 11032 as a parameter. How the rating value calculation unit 11033 uses this probability value to calculate the rating value is not restricted. For example, it calculates the speech rating value using as parameters the probability value acquired by the optimum state probability value acquisition unit 11032 and the sum of the probability values over all states of the frame corresponding to that probability value. Here, the rating value calculation unit 11033 normally calculates a rating value for each frame.
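The following is a minimal sketch of one way the two parameters named above could be combined into a per-frame rating value. The description leaves the function open, so the ratio used here (optimum-state probability divided by the sum over all states) is an assumption, and the name `frame_rating` is hypothetical.

```python
def frame_rating(state_probs, optimal_state):
    """Return a per-frame rating value (assumed form, not fixed by the source).

    state_probs   : probability values for every state of one frame
    optimal_state : index of the optimum state for that frame
    """
    total = sum(state_probs)
    if total <= 0.0:
        raise ValueError("state probabilities must sum to a positive value")
    # Normalise the optimum-state probability by the sum over all states.
    return state_probs[optimal_state] / total


# Usage: a frame with two states whose unnormalised probabilities are 1 and 3.
score = frame_rating([1.0, 3.0], optimal_state=1)
```

With this choice the rating value always lies between 0 and 1, which makes it straightforward to display per frame or to average over a phoneme, word, or utterance.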

最適状態決定手段１１０３１、最適状態確率値取得手段１１０３２、評定値算出手段１１０３３は、通常、ＭＰＵやメモリ等から実現され得る。最適状態決定手段１１０３１等の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The optimum state determination unit 11031, the optimum state probability value acquisition unit 11032, and the rating value calculation unit 11033 can be usually realized by an MPU, a memory, or the like. The processing procedure of the optimum state determining unit 11031 and the like is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).

The output unit 1104 outputs the rating result from the rating unit 1103. Various output modes are conceivable. Output is a concept that includes display on a display, printing on a printer, sound output, transmission to an external device, storage on a recording medium, and the like. For example, the output unit 1104 displays the rating result visually, per frame, and/or per phoneme or word, and/or for the entire utterance. The output unit 1104 may or may not be regarded as including an output device such as the display 344 or a speaker. The output unit 1104 can be realized by driver software for an output device, or by the driver software together with the output device.

Next, the operation of the speech processing apparatus will be described with reference to the flowcharts of FIGS. 2 and 3.

（ステップＳ２０１）入力受付部１０１は、音声処理装置の動作開始を指示する動作開始指示を受け付けたか否かを判断する。動作開始指示を受け付ければステップＳ２０２に行き、動作開始指示を受け付けなければステップＳ２１７に飛ぶ。 (Step S201) The input reception unit 101 determines whether an operation start instruction for instructing the operation start of the speech processing apparatus has been received. If the operation start instruction is accepted, the process goes to step S202, and if the operation start instruction is not accepted, the process jumps to step S217.

（ステップＳ２０２）音声受付部１０３は、音声を受け付けたか否かを判断する。音声を受け付ければステップＳ２０３に行き、音声を受け付けなければステップＳ２１６に飛ぶ。 (Step S202) The voice receiving unit 103 determines whether a voice is received. If a voice is accepted, the process goes to step S203, and if no voice is accepted, the process jumps to step S216.

(Step S203) The sampling unit 106 reads the first sampling frequency stored in the first sampling frequency storage unit 105, samples the speech received by the speech reception unit 103 at that first sampling frequency, and obtains the first audio data.

（ステップＳ２０４）声道長正規化処理部１０９は、音声受付部１０３が受け付けた音声から、第二音声データを得る。かかる第二音声データを得る処理である声道長正規化処理の詳細については、図３のフローチャートを用いて、詳細に説明する。なお、声道長正規化処理は、個人差を吸収する評定のための前処理である。 (Step S204) The vocal tract length normalization processing unit 109 obtains second voice data from the voice received by the voice receiving unit 103. Details of the vocal tract length normalization process, which is a process for obtaining such second audio data, will be described in detail with reference to the flowchart of FIG. The vocal tract length normalization process is a pre-process for rating that absorbs individual differences.

(Step S205) The frame segmentation unit 1101 temporarily stores the second audio data obtained in step S204 in a buffer (not shown).

(Step S206) The frame segmentation unit 1101 divides the second audio data temporarily stored in the buffer into frames. At this stage, the frame audio data, which is the audio data for each segmented frame, has been formed. The frame division performed by the frame segmentation unit 1101 is, for example, preprocessing for when the frame audio data acquisition unit 1102 extracts frame audio data; the input audio data is not necessarily divided into all of its frames at once.
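The frame division in step S206 can be sketched as a generator that yields frames on demand, which matches the remark that the data need not be split into all frames at once. The frame length and frame shift are not given in the description, so the parameters `frame_len` and `frame_shift` and the function name `iter_frames` are assumptions made for illustration.

```python
def iter_frames(samples, frame_len, frame_shift):
    """Lazily yield fixed-length, possibly overlapping frames.

    samples     : sequence of audio samples (the second audio data)
    frame_len   : number of samples per frame (assumed parameter)
    frame_shift : hop between consecutive frame starts (assumed parameter)
    """
    start = 0
    # Stop when a full frame no longer fits; trailing samples are dropped.
    while start + frame_len <= len(samples):
        yield samples[start:start + frame_len]
        start += frame_shift


# Usage: 10 samples, frames of 4 samples with a hop of 2.
frames = list(iter_frames(list(range(10)), frame_len=4, frame_shift=2))
```

Because the generator produces one frame per iteration, the i-th frame loop of steps S207 to S210 can consume frames as they become available.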

（ステップＳ２０７）フレーム音声データ取得手段１１０２は、カウンタｉに１を代入する。 (Step S207) The frame audio data acquisition means 1102 substitutes 1 for the counter i.

（ステップＳ２０８）フレーム音声データ取得手段１１０２は、ｉ番目のフレームが存在するか否かを判断する。ｉ番目のフレームが存在すればステップＳ２０９に行き、ｉ番目のフレームが存在しなければステップＳ２１１に行く。 (Step S208) The frame audio data acquisition unit 1102 determines whether or not the i-th frame exists. If the i-th frame exists, the process goes to step S209. If the i-th frame does not exist, the process goes to step S211.

(Step S209) The frame audio data acquisition unit 1102 acquires the i-th frame audio data. Acquiring frame audio data means, for example, analyzing the segmented audio data and extracting feature vector data. The frame audio data is, for example, data obtained by dividing the input audio data into frames. It also includes, for example, the feature vector data extracted from the segmented audio data by speech analysis. This feature vector data is, for example, MFCCs obtained by applying a discrete cosine transform to the outputs of a 24-channel filter bank of triangular filters, comprising 12 dimensions each of static, delta, and delta-delta parameters plus normalized power, delta power, and delta-delta power (39 dimensions in total).
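To make the 39-dimension count concrete (12 × 3 cepstral terms plus 3 power terms), the sketch below stacks static, delta, and delta-delta values into one vector per frame. It assumes the 12 static MFCCs and the normalized power have already been computed; the central-difference delta, the edge handling, the ordering of components, and the names `deltas` and `stack_features` are all assumptions, since the description does not specify them.

```python
def deltas(seq):
    """Central-difference deltas of a list of equal-length vectors.

    Edges are handled by repeating the first/last vector (an assumption).
    """
    last = len(seq) - 1
    out = []
    for t in range(len(seq)):
        prev_vec = seq[max(t - 1, 0)]
        next_vec = seq[min(t + 1, last)]
        out.append([(n - p) / 2.0 for n, p in zip(next_vec, prev_vec)])
    return out


def stack_features(mfcc, power):
    """Build one 39-dimensional vector per frame.

    mfcc  : list of T vectors, each with 12 static coefficients
    power : list of T normalized power values
    """
    pw = [[p] for p in power]
    d_mfcc, d_pw = deltas(mfcc), deltas(pw)
    dd_mfcc, dd_pw = deltas(d_mfcc), deltas(d_pw)
    return [m + dm + ddm + p + dp + ddp
            for m, dm, ddm, p, dp, ddp
            in zip(mfcc, d_mfcc, dd_mfcc, pw, d_pw, dd_pw)]


# Usage with placeholder values for 5 frames.
mfcc = [[float(t + k) for k in range(12)] for t in range(5)]
power = [0.1 * t for t in range(5)]
features = stack_features(mfcc, power)
```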

（ステップＳ２１０）フレーム音声データ取得手段１１０２は、カウンタｉを１、インクリメントする。ステップＳ２０８に戻る。 (Step S210) The frame audio data acquisition unit 1102 increments the counter i by 1. The process returns to step S208.

（ステップＳ２１１）最適状態決定手段１１０３１は、全フレームの最適状態を決定する。最適状態決定手段１１０３１が最適状態を決定するアルゴリズムは、例えば、Ｖｉｔｅｒｂｉアルゴリズムによる。Ｖｉｔｅｒｂｉアルゴリズムは、公知のアルゴリズムであるので、詳細な説明は省略する。 (Step S211) Optimal state determination means 11031 determines the optimal state of all frames. The algorithm for determining the optimum state by the optimum state determining unit 11031 is, for example, the Viterbi algorithm. Since the Viterbi algorithm is a known algorithm, detailed description thereof is omitted.
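For readers who want a reminder of the Viterbi algorithm mentioned in step S211, here is a minimal sketch for a small HMM. It assumes the per-frame state likelihoods have already been evaluated and passed in as `emit`; the argument layout and the function name `viterbi` are illustrative, and a practical implementation would normally work with log probabilities to avoid underflow.

```python
def viterbi(init, trans, emit):
    """Return the most likely state index for every frame.

    init  : init[i] is the initial probability of state i
    trans : trans[i][j] is the transition probability from state i to j
    emit  : emit[t][i] is the likelihood of frame t in state i
    """
    n_states = len(init)
    delta = [init[i] * emit[0][i] for i in range(n_states)]
    backpointers = []
    for t in range(1, len(emit)):
        prev = delta
        delta, ptr = [], []
        for j in range(n_states):
            # Best predecessor of state j at frame t.
            best_i = max(range(n_states), key=lambda i: prev[i] * trans[i][j])
            ptr.append(best_i)
            delta.append(prev[best_i] * trans[best_i][j] * emit[t][j])
        backpointers.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda i: delta[i])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


# Usage: a two-state left-to-right model observed over three frames.
path = viterbi(
    init=[1.0, 0.0],
    trans=[[0.5, 0.5], [0.0, 1.0]],
    emit=[[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]],
)
```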

（ステップＳ２１２）最適状態確率値取得手段１１０３２は、全フレームの全状態の前向き尤度、および後向き尤度を算出する。最適状態確率値取得手段１１０３２は、例えば、全てのＨＭＭを用いて、フォワード・バックワードアルゴリズムにより、前向き尤度、および後向き尤度を算出する。 (Step S212) The optimum state probability value acquisition unit 11032 calculates the forward likelihood and the backward likelihood of all states of all frames. For example, the optimal state probability value acquisition unit 11032 calculates the forward likelihood and the backward likelihood by the forward / backward algorithm using all the HMMs.

（ステップＳ２１３）最適状態確率値取得手段１１０３２は、ステップＳ２１２で取得した前向き尤度、および後向き尤度を用いて、最適状態の確率値（最適状態確率値）を、すべて算出する。 (Step S213) The optimal state probability value acquisition unit 11032 calculates all of the optimal state probability values (optimal state probability values) using the forward likelihood and the backward likelihood acquired in step S212.
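Steps S212 and S213 can be pictured with the standard forward-backward recursion, in which the posterior probability of state i at frame t is the product of the forward and backward likelihoods divided by its sum over all states. The description does not spell out the formula at this point, so this standard form is an assumption, as are the function names; the sketch also omits the scaling that real implementations use to prevent underflow.

```python
def forward_backward(init, trans, emit):
    """Compute forward (alpha) and backward (beta) likelihoods.

    init[i], trans[i][j], emit[t][i] follow the usual HMM conventions.
    """
    n_frames, n_states = len(emit), len(init)
    alpha = [[0.0] * n_states for _ in range(n_frames)]
    beta = [[1.0] * n_states for _ in range(n_frames)]
    for i in range(n_states):
        alpha[0][i] = init[i] * emit[0][i]
    for t in range(1, n_frames):
        for j in range(n_states):
            alpha[t][j] = emit[t][j] * sum(
                alpha[t - 1][i] * trans[i][j] for i in range(n_states))
    for t in range(n_frames - 2, -1, -1):
        for i in range(n_states):
            beta[t][i] = sum(
                trans[i][j] * emit[t + 1][j] * beta[t + 1][j]
                for j in range(n_states))
    return alpha, beta


def state_posteriors(alpha, beta):
    """Per-frame posterior of each state: alpha*beta normalised over states."""
    posteriors = []
    for a_t, b_t in zip(alpha, beta):
        joint = [a * b for a, b in zip(a_t, b_t)]
        total = sum(joint)
        posteriors.append([x / total for x in joint])
    return posteriors


def optimal_state_probs(posteriors, path):
    """Pick the posterior of the optimum state (e.g. a Viterbi path) per frame."""
    return [posteriors[t][s] for t, s in enumerate(path)]


# Usage with the same kind of toy model as a two-state left-to-right HMM.
init = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
emit = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
alpha, beta = forward_backward(init, trans, emit)
gammas = state_posteriors(alpha, beta)
probs = optimal_state_probs(gammas, [0, 0, 1])
```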

(Step S214) The rating value calculation unit 11033 calculates the speech rating value of one or more frames from the one or more optimum state probability values calculated in step S213. The function used to calculate the rating value is not restricted. For example, the rating value calculation unit 11033 calculates the speech rating value using as parameters the acquired optimum state probability value and the sum of the probability values over all states of the frame corresponding to that optimum state probability value. Details will be described later.

（ステップＳ２１５）出力手段１１０４は、ステップＳ２１４における評定結果（ここでは、音声の評定値）を、設定されている出力モードに従って、出力する。ステップＳ２０２に戻る。出力モードとは、評定値を数値で画面に表示するモード、評定値の遷移をグラフで画面に表示するモード、評定値を音声で出力するモード、評定値が所定の数値より低い場合に警告を示す情報を表示するモードなど、何でも良い。なお、ここでの出力モードは、ステップＳ２１８で設定されるモードである。 (Step S215) The output unit 1104 outputs the rating result (here, the voice rating value) in step S214 according to the set output mode. The process returns to step S202. The output mode is a mode in which the rating value is displayed on the screen as a numerical value, a mode in which the transition of the rating value is displayed on the screen as a graph, a mode in which the rating value is output by voice, and a warning when the rating value is lower than the predetermined value Any mode may be used, such as a mode for displaying information to be shown. The output mode here is a mode set in step S218.

(Step S216) The speech reception unit 103 determines whether a timeout has occurred, that is, whether no speech input has been received for a predetermined time or longer. If a timeout has occurred, the process returns to step S201; otherwise, it returns to step S202.

（ステップＳ２１７）入力受付部１０１は、出力態様変更指示を受け付けたか否かを判断する。出力態様変更指示を受け付ければステップＳ２１８に行き、出力態様変更指示を受け付なければステップＳ２１９に飛ぶ。出力態様変更指示は、上述した出力モードを有する情報である。 (Step S217) The input receiving unit 101 determines whether an output mode change instruction has been received. If an output mode change instruction is accepted, the process proceeds to step S218, and if no output mode change instruction is received, the process jumps to step S219. The output mode change instruction is information having the output mode described above.

（ステップＳ２１８）出力手段１１０４は、ステップＳ２１７で受け付けた出力態様変更指示が有する出力モードを示す情報を書き込み、出力モードを設定する。ステップＳ２０１に戻る。 (Step S218) The output unit 1104 writes information indicating the output mode included in the output mode change instruction received in Step S217, and sets the output mode. The process returns to step S201.

（ステップＳ２１９）入力受付部１０１は、終了指示を受け付けたか否かを判断する。終了指示を受け付ければ処理を終了し、終了指示を受け付なければステップＳ２０１に戻る。 (Step S219) The input receiving unit 101 determines whether an end instruction has been received. If an end instruction is accepted, the process ends. If no end instruction is accepted, the process returns to step S201.
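The control flow of steps S201 to S219 can be summarized as a small two-state loop. The sketch below is only an illustration: the event tuples, the default mode name, and the `rate` callable standing in for steps S203 to S214 are hypothetical and do not come from the description.

```python
def run(events, rate):
    """Replay a list of (kind, payload) events through the FIG. 2 flow.

    events : iterable of tuples such as ("start", None) or ("voice", data)
    rate   : callable that turns received audio into a rating value
    Returns the list of (output_mode, rating) pairs that were output.
    """
    outputs = []
    mode = "numeric"       # assumed default output mode
    listening = False      # False: waiting at S201, True: waiting at S202
    for kind, payload in events:
        if not listening:
            if kind == "start":        # S201: operation start instruction
                listening = True
            elif kind == "mode":       # S217-S218: output mode change
                mode = payload
            elif kind == "end":        # S219: end instruction
                break
        else:
            if kind == "voice":        # S202-S215: rate and output
                outputs.append((mode, rate(payload)))
            elif kind == "timeout":    # S216: back to S201
                listening = False
    return outputs


# Usage: `sum` stands in for the real rating pipeline.
log = run(
    [("mode", "graph"), ("start", None), ("voice", [1, 2]),
     ("timeout", None), ("end", None), ("voice", [3])],
    rate=sum,
)
```

As in the flowchart, mode changes and end instructions are only examined while the apparatus is waiting for a start instruction, so they are ignored in this sketch while speech is being accepted.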

Note that, in the flowchart of FIG. 2, the pronunciation rating device need not have an output mode setting function.

次に、ステップＳ２０４における声道長正規化処理の詳細について、図３のフローチャートを用いて説明する。 Next, details of the vocal tract length normalization process in step S204 will be described using the flowchart of FIG.

（ステップＳ３０１）評価対象者フォルマント周波数取得部１０７は、サンプリング部１０６のサンプリング処理により得られた第一音声データから、評価対象者フォルマント周波数（Ｆｉ）を取得し、評価対象者フォルマント周波数格納部１０８に一時格納する。評価対象者フォルマント周波数は、例えば、第二フォルマント周波数（Ｆ２）である。 (Step S301) The evaluation target person formant frequency acquisition unit 107 acquires the evaluation target person formant frequency (Fi) from the first sound data obtained by the sampling processing of the sampling unit 106, and the evaluation target person formant frequency storage unit 108. Temporarily store in. The evaluation subject formant frequency is, for example, the second formant frequency (F2).

（ステップＳ３０２）声道長正規化処理部１０９は、第一サンプリング周波数格納部１０５の第一サンプリング周波数（Ｆｓ）を読み出す。 (Step S302) The vocal tract length normalization processing unit 109 reads the first sampling frequency (Fs) of the first sampling frequency storage unit 105.

（ステップＳ３０３）声道長正規化処理部１０９は、教師データフォルマント周波数格納部１０４の教師データフォルマント周波数を読み出す。 (Step S303) The vocal tract length normalization processing unit 109 reads the teacher data formant frequency in the teacher data formant frequency storage unit 104.

（ステップＳ３０４）声道長正規化処理部１０９は、ステップＳ３０１で取得した評価対象者フォルマント周波数と、ステップＳ３０３で読み出した教師データフォルマント周波数から周波数軸変換率を算出する。具体的には、声道長正規化処理部１０９は、演算「教師データフォルマント周波数／評価対象者フォルマント周波数」を行い、周波数軸変換率（ｒ）を得る。 (Step S304) The vocal tract length normalization processing unit 109 calculates a frequency axis conversion rate from the evaluation subject formant frequency acquired in step S301 and the teacher data formant frequency read in step S303. Specifically, the vocal tract length normalization processing unit 109 performs an operation “teacher data formant frequency / evaluator formant frequency” to obtain a frequency axis conversion rate (r).

（ステップＳ３０５）声道長正規化処理部１０９は、ステップＳ３０２で読み出した第一サンプリング周波数（Ｆｓ）と周波数軸変換率（ｒ）に基づいて、演算「Ｆｓ／ｒ」を行い、第二サンプリング周波数（Ｆｓ／ｒ）を得る。 (Step S305) The vocal tract length normalization processing unit 109 performs the operation "Fs / r" based on the first sampling frequency (Fs) read in step S302 and the frequency axis conversion rate (r), and obtains the second sampling frequency (Fs / r).

（ステップＳ３０６）声道長正規化処理部１０９は、サンプリング部１０６がサンプリングして得た第一音声データに対して、第二サンプリング周波数（Ｆｓ／ｒ）で、リサンプリング処理を行い、第二音声データを得る。なお、リサンプリング処理は公知技術であるので、詳細な説明を省略する。上位関数にリターンする。 (Step S306) The vocal tract length normalization processing unit 109 resamples the first audio data, obtained through sampling by the sampling unit 106, at the second sampling frequency (Fs / r) to obtain second audio data. Since resampling is a known technique, a detailed description is omitted. Processing then returns to the calling function.
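
Steps S304 to S306 can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the apparatus: the function names are hypothetical, and plain linear interpolation stands in for the resampling method, which the description leaves to known techniques.

```python
def frequency_axis_conversion_rate(teacher_formant_hz, subject_formant_hz):
    """Step S304: r = teacher data formant frequency / evaluation subject formant frequency."""
    return teacher_formant_hz / subject_formant_hz


def second_sampling_frequency(first_fs_hz, r):
    """Step S305: second sampling frequency = Fs / r."""
    return first_fs_hz / r


def resample_linear(samples, first_fs_hz, second_fs_hz):
    """Step S306 with an assumed method: resampling by linear interpolation.

    The result holds roughly len(samples) * second_fs_hz / first_fs_hz samples.
    """
    if len(samples) < 2:
        return list(samples)
    step = first_fs_hz / second_fs_hz  # input samples advanced per output sample
    n_out = int((len(samples) - 1) / step) + 1
    out = []
    for k in range(n_out):
        pos = k * step
        i = int(pos)
        frac = pos - i
        # Guard the right edge so the last input sample is reused, not overrun.
        nxt = samples[i + 1] if i + 1 < len(samples) else samples[i]
        out.append(samples[i] * (1.0 - frac) + nxt * frac)
    return out
```

A production system would more likely use a band-limited (polyphase or sinc) resampler; linear interpolation is chosen here only to keep the sketch free of dependencies.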

なお、図２、図３のフローチャートにおいて、声道長正規化処理を行う対象の音声と、評価対象の音声が異なっても良い。つまり、例えば、音声受付部１０３は、所定の１以上の母音（例えば、「う」）の音声を受け付けた後、評価対象者の音声を受け付け、評価対象者フォルマント周波数取得部１０７は、当該１以上の母音の音声に基づいて、評価対象者フォルマント周波数を取得し、声道長正規化処理部１０９は、当該評価対象者フォルマント周波数をパラメータとして、声道長正規化処理を行う。そして、音声処理部１１０は、所定の母音の音声を受け付けた後に受け付けた音声を処理し、当該音声の評価を行っても良い。 In the flowcharts of FIGS. 2 and 3, the speech used for the vocal tract length normalization process may differ from the speech to be evaluated. That is, for example, the voice receiving unit 103 receives the speech of one or more predetermined vowels (for example, "U") and then receives the speech of the evaluation subject; the evaluation subject formant frequency acquisition unit 107 acquires the evaluation subject formant frequency based on the speech of the one or more vowels; and the vocal tract length normalization processing unit 109 performs the vocal tract length normalization process using that evaluation subject formant frequency as a parameter. The voice processing unit 110 may then process and evaluate the speech received after the speech of the predetermined vowel.

以下、本実施の形態における音声処理装置の具体的な動作について説明する。本具体例において、音声処理装置が語学学習に利用される場合について説明する。 Hereinafter, a specific operation of the speech processing apparatus according to the present embodiment will be described. In this specific example, the case where the speech processing apparatus is used for language learning will be described.

まず、本音声処理装置において、図示しない手段により、ネイティブ発音の音声データベースからネイティブ発音の音韻ＨＭＭを学習しておく。ここで、音韻の種類数をＬとし、ｌ番目の音韻に対するＨＭＭをλ_ｌとする。なお、かかる学習の処理については、公知技術であるので、詳細な説明は省略する。なお、ＨＭＭの仕様の例について、図４に示す。なお、ＨＭＭの仕様は、他の実施の形態における具体例の説明においても同様である。ただし、ＨＭＭの仕様が、他の仕様でも良いことは言うまでもない。 First, in this speech processing apparatus, a phonetic HMM of native pronunciation is learned from a speech database of native pronunciation by means not shown. Here, the number of phoneme types is L, and the HMM for the l-th phoneme is λ _l . Since this learning process is a known technique, a detailed description thereof is omitted. An example of HMM specifications is shown in FIG. The specification of the HMM is the same in the description of specific examples in other embodiments. However, it goes without saying that the specifications of the HMM may be other specifications.

そして、図示しない手段により、学習したＬ種類の音韻ＨＭＭから、学習対象の単語や文章などの音声を構成する１以上の音素に対応するＨＭＭを取得し、当該取得した１以上のＨＭＭを、音素の順序で連結した教師データを構成する。そして、当該教師データを教師データ格納部１０２に保持しておく。ここでは、例えば、比較される対象の音声は、単語「ｒｉｇｈｔ」の音声である。また、ここでは、教師データを発声した者（教師）は、大人である、とする。 Then, by means not shown, the HMMs corresponding to the one or more phonemes that make up the speech of the word or sentence to be learned are taken from the L trained phoneme HMMs, and teacher data is formed by concatenating these HMMs in phoneme order. The teacher data is then held in the teacher data storage unit 102. Here, for example, the speech to be compared against is the speech of the word "right". It is also assumed here that the person who uttered the teacher data (the teacher) is an adult.

次に、学習者（評価対象者）が、語学学習の開始の指示である動作開始指示を入力する。かかる指示は、例えば、マウスで所定のボタンを押下することによりなされる。なお、ここでは、学習者は、例えば、子供（５歳から１１歳）である、とする。 Next, the learner (the evaluation subject) inputs an operation start instruction, which is an instruction to start language learning. Such an instruction is given, for example, by pressing a predetermined button with a mouse. Here, it is assumed that the learner is, for example, a child (5 to 11 years old).

まず、学習者は、母音「う」を発音する、とする。かかる場合、本音声処理装置は、学習に、「う」を発声するように促すことは好適である。「う」を発声するように促すために、音声処理装置は、例えば、「"う"と発声してください。」と画面出力しても良いし、「"う"と発声してください。」と音声出力しても良い。また、母音「う」は、学習者の評価対象者フォルマント周波数を取得するために好適である。また、本音声処理装置は、第一サンプリング周波数として、「２２．０５ＫＨｚ」を保持している、とする。 First, it is assumed that the learner pronounces the vowel "U". In this case, it is preferable that the speech processing apparatus prompt the learner to utter "U". To do so, the speech processing apparatus may, for example, display the message "Please say 'U'." on the screen or output the same message as audio. The vowel "U" is well suited to acquiring the learner's evaluation subject formant frequency. It is also assumed that the speech processing apparatus holds "22.05 kHz" as the first sampling frequency.

そして、次に、サンプリング部１０６は、音声受付部１０３が受け付けた音声「う」をサンプリングし、「う」の第一音声データを得る。 Then, the sampling unit 106 samples the speech "U" received by the voice receiving unit 103 and obtains the first audio data of "U".

次に、評価対象者フォルマント周波数取得部１０７は、サンプリング部１０６が音声「う」をサンプリングして得た第一音声データから、第二フォルマント周波数を取得する。そして、この第二フォルマント周波数を評価対象者フォルマント周波数（Ｆｉ）とする。今、このＦｉが「１７２５Ｈｚ」であった、とする。そして、評価対象者フォルマント周波数取得部１０７は、Ｆｉ（１７２５Ｈｚ）を、評価対象者フォルマント周波数格納部１０８に一時格納する。 Next, the evaluation subject formant frequency acquisition unit 107 acquires the second formant frequency from the first audio data that the sampling unit 106 obtained by sampling the speech "U". This second formant frequency is taken as the evaluation subject formant frequency (Fi). Suppose that Fi is "1725 Hz". The evaluation subject formant frequency acquisition unit 107 then temporarily stores Fi (1725 Hz) in the evaluation subject formant frequency storage unit 108.

次に、声道長正規化処理部１０９は、教師データフォルマント周波数格納部１０４の教師データフォルマント周波数を読み出す。教師データフォルマント周波数格納部１０４に格納されている教師データフォルマント周波数は、大人の第二フォルマント周波数であり、今、「１１８４Ｈｚ」である、とする。また、教師データフォルマント周波数は、例えば、教師データを構築する場合に、教師に、例えば、「う」と発声してもらい、当該音声「う」をサンプリング処理した後、取得した第二フォルマント周波数である。 Next, the vocal tract length normalization processing unit 109 reads the teacher data formant frequency from the teacher data formant frequency storage unit 104. The teacher data formant frequency stored in the teacher data formant frequency storage unit 104 is the second formant frequency of an adult, and is assumed here to be "1184 Hz". The teacher data formant frequency is, for example, the second formant frequency acquired when the teacher data was constructed, by having the teacher utter "U" and sampling that speech.

なお、図５に、年齢層別、性別ごとの、「う」の第一フォルマント周波数（Ｆ１）、第二フォルマント周波数（Ｆ２）の計測結果を示す。図５により、年齢、性別により、第一フォルマント周波数（Ｆ１）、第二フォルマント周波数（Ｆ２）の値が大きく異なることが分る。 FIG. 5 shows the measurement results of the first formant frequency (F1) and the second formant frequency (F2) of “U” for each age group and sex. FIG. 5 shows that the values of the first formant frequency (F1) and the second formant frequency (F2) are greatly different depending on the age and sex.

そして、次に、声道長正規化処理部１０９は、評価対象者フォルマント周波数「１７２５Ｈｚ」と教師データフォルマント周波数「１１８４Ｈｚ」から演算「教師データフォルマント周波数／評価対象者フォルマント周波数」を行い、周波数軸変換率（ｒ）を得る。具体的には、声道長正規化処理部１０９は、「１１８４／１７２５」により、周波数軸変換率「０．６８６」を得る。 Next, the vocal tract length normalization processing unit 109 performs an operation “teacher data formant frequency / evaluation target person formant frequency” from the evaluation subject formant frequency “1725 Hz” and the teacher data formant frequency “1184 Hz”, and the frequency axis The conversion rate (r) is obtained. Specifically, the vocal tract length normalization processing unit 109 obtains the frequency axis conversion rate “0.686” from “1184/1725”.

次に、声道長正規化処理部１０９は、第一サンプリング周波数（Ｆｓ）と「ｒ」に基づいて、演算「Ｆｓ／ｒ」を行い、第二サンプリング周波数（Ｆｓ／ｒ）を得る。ここで、得た第二サンプリング周波数は、「２２．０５／０．６８６」により、「３２．１」である。そして、声道長正規化処理部１０９は、第二サンプリング周波数「３２．１ＫＨｚ」を一時格納する。 Next, the vocal tract length normalization processing unit 109 performs the operation "Fs / r" based on the first sampling frequency (Fs) and "r", and obtains the second sampling frequency (Fs / r). Here, the second sampling frequency obtained from "22.05 / 0.686" is "32.1 kHz". The vocal tract length normalization processing unit 109 then temporarily stores the second sampling frequency "32.1 kHz".
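
The arithmetic of this example, written out with the values given above and rounded as in the text:

$$
r = \frac{1184\ \text{Hz}}{1725\ \text{Hz}} \approx 0.686,
\qquad
\frac{F_s}{r} = \frac{22.05\ \text{kHz}}{0.686} \approx 32.1\ \text{kHz}
$$

Because r is less than 1 for this child learner, the second sampling frequency is higher than the first. On the reading that the resampled data is subsequently analyzed at the first sampling frequency, every frequency component is scaled by r, so the learner's second formant at 1725 Hz would appear at about 1725 × 0.686 ≈ 1184 Hz, the teacher's value.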

次に、学習者は、学習対象の音声「ｒｉｇｈｔ」を発音する。そして、音声受付部１０３は、学習者が発音した音声の入力を受け付ける。なお、音声処理装置は、学習者に「"ｒｉｇｈｔ"を発音してください。」などを表示、または音声出力するなどして、学習者に「ｒｉｇｈｔ」の発声を促すことは好適である。 Next, the learner pronounces the voice “right” to be learned. Then, the voice receiving unit 103 receives an input of a voice pronounced by the learner. It is preferable that the speech processing apparatus prompts the learner to utter “right” by displaying or outputting a voice such as “Please pronounce“ right ”” to the learner.

次に、サンプリング部１０６は、受け付けた音声「ｒｉｇｈｔ」をサンプリング周波数「２２．０５ＫＨｚ」でサンプリング処理する。そして、サンプリング部１０６は、音声「ｒｉｇｈｔ」の第一音声データを得る。 Next, the sampling unit 106 samples the received speech "right" at the sampling frequency "22.05 kHz", and obtains the first audio data of the speech "right".

次に、声道長正規化処理部１０９は、「ｒｉｇｈｔ」の第一音声データを第二サンプリング周波数「３２．１ＫＨｚ」でリサンプリング処理する。そして、声道長正規化処理部１０９は、第二音声データを得る。 Next, the vocal tract length normalization processing unit 109 resamples the first audio data of "right" at the second sampling frequency "32.1 kHz", and obtains the second audio data.

次に、音声処理部１１０は、第二音声データを、以下のように処理する。 Next, the audio processing unit 110 processes the second audio data as follows.

まず、フレーム区分手段１１０１は、第二音声データを、短時間フレームに区分する。なお、フレームの間隔は、予め決められている、とする。 First, the frame classification unit 1101 classifies the second audio data into short-time frames. It is assumed that the frame interval is determined in advance.
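
The division into short-time frames can be sketched as below. The frame length and frame shift are parameters of this sketch; their actual values are part of the analysis conditions of FIG. 6, which are not reproduced here, so any concrete numbers used with this function are assumptions. The function name is hypothetical.

```python
def split_into_frames(samples, frame_length, frame_shift):
    """Split a sample sequence into short-time frames of equal length.

    Frames start every `frame_shift` samples; a trailing remainder that
    is shorter than `frame_length` is dropped.
    """
    if frame_length <= 0 or frame_shift <= 0:
        raise ValueError("frame_length and frame_shift must be positive")
    frames = []
    start = 0
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += frame_shift
    return frames
```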

そして、フレーム音声データ取得手段１１０２は、フレーム区分手段１１０１が区分した音声データを、スペクトル分析し、特徴ベクトル系列「Ｏ＝ｏ_１，ｏ_２，・・・，ｏ_Ｔ」を算出する。なお、Ｔは、系列長である。ここで、特徴ベクトル系列は、各フレームの特徴ベクトルの集合である。また、特徴ベクトルは、例えば、三角型フィルタを用いたチャネル数２４のフィルタバンク出力を離散コサイン変換したＭＦＣＣであり、その静的パラメータ、デルタパラメータおよびデルタデルタパラメータをそれぞれ１２次元、さらに正規化されたパワーとデルタパワーおよびデルタデルタパワー（３９次元）を有する。また、スペクトル分析において、ケプストラム平均除去を施すことは好適である。なお、音声分析条件の例を図６の表に示す。なお、音声分析条件は、他の実施の形態における具体例の説明においても同様である。ただし、音声分析条件が、他の条件でも良いことは言うまでもない。また、音声分析の際のサンプリング周波数は、第一サンプリング周波数「２２．０５ＫＨｚ」である。 Then, the frame audio data acquisition unit 1102 performs spectral analysis on the audio data divided by the frame classification unit 1101 and calculates a feature vector sequence "O = o_1, o_2, ..., o_T", where T is the sequence length. Here, the feature vector sequence is the set of feature vectors of the individual frames. Each feature vector is, for example, an MFCC vector obtained by applying a discrete cosine transform to the output of a 24-channel filter bank of triangular filters; it has 12 dimensions each of static, delta, and delta-delta parameters, plus normalized power, delta power, and delta-delta power (39 dimensions in total). It is preferable to apply cepstral mean removal in the spectral analysis. An example of the speech analysis conditions is shown in the table of FIG. 6. The same speech analysis conditions apply to the specific examples described in the other embodiments, although it goes without saying that other conditions may be used. The sampling frequency used for the speech analysis is the first sampling frequency "22.05 kHz".
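
How the 39 dimensions add up (12 static + 12 delta + 12 delta-delta + power, delta power, delta-delta power) can be illustrated as follows. This sketch assumes the 12 static MFCCs and the normalized power of each frame have already been computed; the simple two-point delta formula, the ordering of the components, and all function names are assumptions, not taken from the description.

```python
def remove_cepstral_mean(cepstra):
    """Subtract the per-dimension mean over all frames (cepstral mean removal)."""
    n_frames = len(cepstra)
    n_dims = len(cepstra[0])
    mean = [sum(frame[d] for frame in cepstra) / n_frames for d in range(n_dims)]
    return [[frame[d] - mean[d] for d in range(n_dims)] for frame in cepstra]


def deltas(frames):
    """Assumed delta: half the difference between the next and previous frame.

    The first and last frames reuse themselves as the missing neighbour.
    """
    last = len(frames) - 1
    result = []
    for t in range(len(frames)):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, last)]
        result.append([(n - p) / 2.0 for p, n in zip(prev_frame, next_frame)])
    return result


def build_feature_vectors(mfcc, power):
    """mfcc: T x 12 static coefficients; power: T normalized powers -> T x 39."""
    static = remove_cepstral_mean(mfcc)
    d_static = deltas(static)
    dd_static = deltas(d_static)
    pw = [[p] for p in power]
    d_pw = deltas(pw)
    dd_pw = deltas(d_pw)
    return [static[t] + d_static[t] + dd_static[t] + pw[t] + d_pw[t] + dd_pw[t]
            for t in range(len(mfcc))]
```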

次に、最適状態決定手段１１０３１は、取得した特徴ベクトル系列を構成する各特徴ベクトルｏ_ｔに基づいて、所定のフレームの最適状態（特徴ベクトルｏ_ｔに対する最適状態）を決定する。最適状態決定手段１１０３１が最適状態を決定するアルゴリズムは、例えば、Ｖｉｔｅｒｂｉアルゴリズムによる。かかる場合、最適状態決定手段１１０３１は、上記で連結したＨＭＭを用いて最適状態を決定する。最適状態決定手段１１０３１は、２以上のフレームの最適状態である最適状態系列を求めることとなる。 Next, the optimum state determination unit 11031 determines the optimum state of a given frame (the optimum state for the feature vector o_t) based on each feature vector o_t of the acquired feature vector sequence. The optimum state determination unit 11031 determines the optimum state using, for example, the Viterbi algorithm. In that case, it determines the optimum state using the HMMs concatenated as described above. The optimum state determination unit 11031 thus obtains an optimum state sequence, that is, the optimum states of two or more frames.
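
A minimal sketch of the Viterbi algorithm in the log domain is given below for orientation. It assumes the concatenated HMM has already been reduced to an initial log probability per state, a log transition matrix, and a per-frame log emission score per state; these inputs and the function name are hypothetical and do not reflect the apparatus's internal data layout.

```python
def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state sequence q_1*, ..., q_T*.

    log_init[j]     : log probability of starting in state j
    log_trans[i][j] : log probability of moving from state i to state j
    log_emit[t][j]  : log likelihood of feature vector o_t in state j
    """
    n_states = len(log_init)
    score = [log_init[j] + log_emit[0][j] for j in range(n_states)]
    backpointers = []
    for t in range(1, len(log_emit)):
        new_score = []
        pointers = []
        for j in range(n_states):
            # Best predecessor of state j at frame t.
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + log_trans[best_i][j] + log_emit[t][j])
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Forbidden transitions (for example, backward jumps in a left-to-right HMM) can be expressed as `float("-inf")` entries in `log_trans`.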

次に、最適状態確率値取得手段１１０３２は、以下の数式１により、最適状態（ｑ_ｔ ^＊）における最適状態確率値（γ_ｔ（ｑ_ｔ ^＊））を算出する。なお、γ_ｔ（ｑ_ｔ ^＊）は、状態ｊの事後確率関数γ_ｔ（ｊ）のｊにｑ_ｔ ^＊を代入した値である。そして、状態ｊの事後確率関数γ_ｔ（ｊ）は、数式２を用いて算出される。この確率値（γ_ｔ（ｊ））は、ｔ番目の特徴ベクトルｏ_ｔが状態ｊから生成された事後確率であり、動的計画法を用いて算出される。なお、ｊは、状態を識別する状態識別子である。
Next, the optimum state probability value acquisition unit 11032 calculates the optimum state probability value γ_t(q_t^*) in the optimum state q_t^* according to Equation 1 below. Here, γ_t(q_t^*) is the value obtained by substituting q_t^* for j in the posterior probability function γ_t(j) of state j, and γ_t(j) is calculated using Equation 2. The probability value γ_t(j) is the posterior probability that the t-th feature vector o_t was generated from state j, and is calculated using dynamic programming. Note that j is a state identifier that identifies a state.
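
The images of Equations 1 and 2 are not reproduced in this text. A reconstruction consistent with the surrounding definitions (state identifier q_t, forward likelihood α_t(j), backward likelihood β_t(j), and total number of states N) is the standard state occupation posterior; the exact layout and numbering of the original equations are assumptions.

$$
\gamma_t(q_t^{*}) = \left.\gamma_t(j)\right|_{j = q_t^{*}}
$$

$$
\gamma_t(j) = P(q_t = j \mid O) = \frac{\alpha_t(j)\,\beta_t(j)}{\sum_{i=1}^{N} \alpha_t(i)\,\beta_t(i)}
$$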

数式２において、ｑ_ｔは、ｏ_ｔに対する状態識別子を表す。この確率値（γ_ｔ（ｊ））は、ＨＭＭの最尤推定におけるＢａｕｍ−Ｗｅｌｃｈアルゴリズムの中で表れる占有度数に対応する。
In Equation 2, q_t represents the state identifier for o_t. The probability value γ_t(j) corresponds to the occupation count that appears in the Baum-Welch algorithm for maximum likelihood estimation of HMMs.

数式２において、「α_ｔ（ｊ）」「β_ｔ（ｊ）」は、全部のＨＭＭを用いて、ｆｏｒｗａｒｄ−ｂａｃｋｗａｒｄアルゴリズムにより算出される。「α_ｔ（ｊ）」は前向き尤度、「β_ｔ（ｊ）」は後向き尤度である。Ｂａｕｍ−Ｗｅｌｃｈアルゴリズム、ｆｏｒｗａｒｄ−ｂａｃｋｗａｒｄアルゴリズムは、公知のアルゴリズムであるので、詳細な説明は省略する。 In Equation 2, "α_t(j)" and "β_t(j)" are calculated by the forward-backward algorithm using all the HMMs. "α_t(j)" is the forward likelihood and "β_t(j)" is the backward likelihood. Since the Baum-Welch algorithm and the forward-backward algorithm are known algorithms, a detailed description is omitted.

また、数式２において、Ｎは、全ＨＭＭに渡る状態の総数を示す。 In Equation 2, N represents the total number of states over all HMMs.
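
The posterior γ_t(j) described above can be sketched with a plain forward-backward pass. The sketch works directly with probabilities, which is adequate only for short sequences; a practical implementation would scale α and β or work in the log domain. The input layout and the function name are assumptions for illustration.

```python
def state_posteriors(init, trans, emit):
    """Return gamma[t][j] = alpha_t(j) * beta_t(j) / sum_i alpha_t(i) * beta_t(i).

    init[j]     : probability of starting in state j
    trans[i][j] : transition probability from state i to state j
    emit[t][j]  : likelihood of feature vector o_t in state j
    """
    n_frames = len(emit)
    n_states = len(init)
    alpha = [[0.0] * n_states for _ in range(n_frames)]
    beta = [[1.0] * n_states for _ in range(n_frames)]
    # Forward likelihoods.
    for j in range(n_states):
        alpha[0][j] = init[j] * emit[0][j]
    for t in range(1, n_frames):
        for j in range(n_states):
            alpha[t][j] = sum(alpha[t - 1][i] * trans[i][j]
                              for i in range(n_states)) * emit[t][j]
    # Backward likelihoods.
    for t in range(n_frames - 2, -1, -1):
        for i in range(n_states):
            beta[t][i] = sum(trans[i][j] * emit[t + 1][j] * beta[t + 1][j]
                             for j in range(n_states))
    # Normalize per frame over all N states.
    gamma = []
    for t in range(n_frames):
        weights = [alpha[t][j] * beta[t][j] for j in range(n_states)]
        total = sum(weights)
        gamma.append([w / total for w in weights])
    return gamma
```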

なお、評定手段１１０３は、まず最適状態を求め、次に、最適状態の確率値（なお、確率値は、０以上、１以下である。）を求めても良いし、評定手段１１０３は、まず、全状態の確率値を求め、その後、特徴ベクトル系列の各特徴ベクトルに対する最適状態を求め、当該最適状態に対応する確率値を求めても良い。 Note that the rating unit 1103 may first obtain an optimum state, and then obtain a probability value of the optimum state (where the probability value is 0 or more and 1 or less). Then, the probability values of all the states may be obtained, then the optimum state for each feature vector of the feature vector series may be obtained, and the probability value corresponding to the optimum state may be obtained.

次に、評定値算出手段１１０３３は、例えば、上記の取得した最適状態確率値と、当該最適状態確率値に対応するフレームの全状態における確率値の総和をパラメータとして音声の評定値を算出する。かかる場合、もし学習者のｔフレーム目に対応する発声が、教師データが示す発音（例えば、正しいネイティブな発音）に近ければ、数式２の（２）式の分子の値が、他の全ての可能な音韻の全ての状態と比較して大きくなり、結果的に最適状態の確率値（評定値）が大きくなる。逆にその区間が、教師データが示す発音に近くなければ、評定値は小さくなる。なお、どのネイティブ発音にも近くないような場合は、評定値はほぼ１／Ｎに等しくなる。Ｎは全ての音韻ＨＭＭにおける全ての状態の数であるから、通常、大きな値となり、この評定値は十分小さくなる。また、ここでは、評定値は最適状態における確率値と全ての可能な状態における確率値との比率で定義されている。したがって、収音環境等の違いにより多少のスペクトルの変動があったとしても、学習者が正しい発音をしていれば、その変動が相殺され評定値が高いスコアを維持する。よって、評定値算出手段１１０３３は、最適状態確率値取得手段１１０３２が取得した確率値と、当該確率値に対応するフレームの全状態における確率値の総和をパラメータとして音声の評定値を算出することは、極めて好適である。 Next, the rating value calculation means 11033 calculates the rating value of the speech using as parameters, for example, the optimum state probability value acquired above and the sum of the probability values over all states in the frame corresponding to that optimum state probability value. In this case, if the learner's utterance corresponding to the t-th frame is close to the pronunciation indicated by the teacher data (for example, correct native pronunciation), the numerator of expression (2) in Equation 2 becomes large compared with the values for all states of all other possible phonemes, and as a result the probability value of the optimum state (the rating value) becomes large. Conversely, if that interval is not close to the pronunciation indicated by the teacher data, the rating value becomes small. If the utterance is not close to any native pronunciation, the rating value is approximately equal to 1/N. Since N is the number of all states in all phoneme HMMs, it is usually large, so this rating value is sufficiently small. Here, the rating value is defined as the ratio of the probability value in the optimum state to the probability values in all possible states. Therefore, even if the spectrum varies somewhat because of differences in the sound collection environment or the like, the variation cancels out as long as the learner pronounces correctly, and the rating value remains high. It is therefore highly preferable that the rating value calculation means 11033 calculate the rating value of the speech using as parameters the probability value acquired by the optimum state probability value acquisition means 11032 and the sum of the probability values over all states of the frame corresponding to that probability value.
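
Given per-frame posteriors over all N states and the optimum state sequence, the per-frame rating value can be read off as sketched below. The inputs are assumed to be precomputed, the function name is hypothetical, and expressing the score as a percentage follows the axis labelling described for the score graphs rather than a formula stated in the text.

```python
def dap_scores(gamma, optimal_states):
    """Per-frame rating value in percent: 100 * gamma_t(q_t*).

    gamma[t][j]       : posterior of state j at frame t (each row sums to 1)
    optimal_states[t] : index of the optimum state q_t* at frame t
    """
    if len(gamma) != len(optimal_states):
        raise ValueError("gamma and optimal_states must cover the same frames")
    return [100.0 * gamma[t][q] for t, q in enumerate(optimal_states)]
```

With a completely uninformative posterior, every state has probability 1/N, so the score falls to 100/N percent, which matches the 1/N floor described above.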

かかる評定値算出手段１１０３３が算出した評定値（「ＤＡＰスコア」とも言う。）の出力例を、図７、図８に示す。図７、図８において、横軸は分析フレーム番号、縦軸はスコアを％で表わしたものである。太い破線は音素境界，細い点線は状態境界（いずれもＶｉｔｅｒｂｉアルゴリズムで求まったもの）を表わしており，図の上部に音素名を表記している。図７は、アメリカ人男性による英語「ｒｉｇｈｔ」の発音のＤＡＰスコアを示す。なお、評定値を示すグラフの横軸、縦軸は、後述するグラフにおいても同様である。 Output examples of the rating value (also called the "DAP score") calculated by the rating value calculation means 11033 are shown in FIGS. 7 and 8. In FIGS. 7 and 8, the horizontal axis is the analysis frame number and the vertical axis is the score in percent. Thick broken lines represent phoneme boundaries and thin dotted lines represent state boundaries (both obtained by the Viterbi algorithm), and the phoneme names are shown at the top of each figure. FIG. 7 shows the DAP score for the pronunciation of the English word "right" by an American male. The horizontal and vertical axes of the graphs showing rating values are the same in the graphs described later.

図８は、日本人男性による英語「ｒｉｇｈｔ」の発音のＤＡＰスコアを示す。アメリカ人の発音は、日本人の発音と比較して、基本的にスコアが高い。なお、図７において、状態の境界において所々スコアが落ち込んでいることがわかる。 FIG. 8 shows the DAP score for the pronunciation of the English word "right" by a Japanese male. The American speaker's pronunciation generally scores higher than the Japanese speaker's. Note that in FIG. 7 the score dips in places at state boundaries.

そして、出力手段１１０４は、評定手段１１０３の評定結果を出力する。具体的には、例えば、出力手段１１０４は、図９に示すような態様で、評定結果を出力する。つまり、出力手段１１０４は、各フレームにおける発音の良さを表すスコア（スコアグラフ）として、各フレームの評定値を表示する。その他、出力手段１１０４は、学習対象の単語の表示（単語表示）、音素要素の表示（音素表示）、教師データの波形の表示（教師波形）、学習者の入力した発音の波形の表示（ユーザ波形）を表示しても良い。なお、図９において、「録音」ボタンを押下すれば、動作開始指示が入力されることとなり、「停止」ボタンを押下すれば、終了指示が入力されることとなる。また、音素要素の表示や波形の表示をする技術は公知技術であるので、その詳細説明を省略する。また、本音声処理装置は、学習対象の単語（図９の「ｗｏｒｄ１」など）や、音素（図９の「ｐ１」など）や、教師波形を出力されるためのデータを予め格納している、とする。 Then, the output unit 1104 outputs the rating result of the rating unit 1103. Specifically, for example, the output unit 1104 outputs the rating result in the manner shown in FIG. 9. That is, the output unit 1104 displays the rating value of each frame as a score (score graph) indicating how good the pronunciation is in each frame. In addition, the output unit 1104 may display the word to be learned (word display), the phoneme elements (phoneme display), the waveform of the teacher data (teacher waveform), and the waveform of the pronunciation input by the learner (user waveform). In FIG. 9, pressing the "Record" button inputs an operation start instruction, and pressing the "Stop" button inputs an end instruction. Since techniques for displaying phoneme elements and waveforms are known, a detailed description is omitted. It is also assumed that the speech processing apparatus stores in advance the data needed to output the words to be learned (such as "word1" in FIG. 9), the phonemes (such as "p1" in FIG. 9), and the teacher waveform.

また、図９において、フレーム単位以外に、音素単位、単語単位、発声全体の評定結果を表示しても良い。上記の処理において、フレーム単位の評定値を算出するので、単語単位、発声全体の評定結果を得るためには、フレーム単位の１以上の評定値をパラメータとして、単語単位、発声全体の評定値を算出する必要がある。かかる算出式は問わないが、例えば、単語を構成するフレーム単位の１以上の評定値の平均値を単語単位の評定値とする、ことが考えられる。 Moreover, in FIG. 9, you may display the evaluation result of a phoneme unit, a word unit, and the whole utterance other than a frame unit. In the above processing, the evaluation value for each frame is calculated. In order to obtain the evaluation result for each word and the whole utterance, the evaluation value for each word and the whole utterance is obtained using one or more evaluation values for each frame as parameters. It is necessary to calculate. Such a calculation formula is not limited. For example, it is conceivable that an average value of one or more rating values for each frame constituting a word is used as a rating value for each word.
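
The averaging mentioned above can be sketched as follows. The text gives the average of frame-level rating values as one possible formula for the word-level value; applying the same averaging to phonemes and to the whole utterance is an assumption of this sketch, as are the function name and the segment representation.

```python
def segment_scores(frame_scores, segments):
    """Average frame-level rating values over labelled frame ranges.

    segments: list of (label, start_frame, end_frame_exclusive), for
    example phoneme or word intervals taken from the Viterbi alignment.
    """
    result = []
    for label, start, end in segments:
        chunk = frame_scores[start:end]
        if not chunk:
            raise ValueError("segment %r contains no frames" % (label,))
        result.append((label, sum(chunk) / len(chunk)))
    return result
```

For example, the word-level value for an utterance of one word is obtained by passing a single segment that spans all of its frames.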

なお、図９において、音声処理装置は、波形表示（教師波形またはユーザ波形）の箇所においてクリックを受け付けると、再生メニューを表示し、音素区間内ではその音素またはその区間が属する単語、波形全体を再生し、単語区間外（無音部）では波形全体のみを再生するようにしても良い。 In FIG. 9, when the speech processing apparatus receives a click on a waveform display (teacher waveform or user waveform), it may display a playback menu that, within a phoneme interval, allows playback of that phoneme, the word to which the interval belongs, or the entire waveform, and that, outside the word intervals (in silent portions), allows playback of the entire waveform only.

また、出力手段１１０４の表示は、図１０に示すような態様でも良い。図１０において、音素ごとのスコア、単語のスコア、総合スコアが、数字で表示されている。 Further, the display of the output means 1104 may be in the form as shown in FIG. In FIG. 10, a score for each phoneme, a word score, and a total score are displayed in numbers.

なお、出力手段１１０４の表示は、図７、図８のような表示でも良いことは言うまでもない。 Needless to say, the display of the output means 1104 may be as shown in FIGS.

以上、本実施の形態によれば、ユーザが入力した発音を、教師データに対して、如何に似ているかを示す類似度（評定値）を算出し、出力できる。また、かかる場合、本実施の形態によれば、個人差、特に声道長の違いに影響を受けない、精度の高い評定ができる。 As described above, according to the present embodiment, it is possible to calculate and output the similarity (rating value) indicating how the pronunciation input by the user is similar to the teacher data. In this case, according to the present embodiment, highly accurate evaluation can be performed without being affected by individual differences, particularly differences in vocal tract length.

また、本実施の形態によれば、連結されたＨＭＭである連結ＨＭＭを用いて最適状態を求め、評定値を算出するので、高速に評定値を求めることができる。したがって、上記の具体例で述べたように、リアルタイムに、フレームごと、音素ごと、単語ごとの評定値を出力できる。また、本実施の形態によれば、動的計画法に基づいた事後確率を確率値として算出するので、さらに高速に評定値を求めることができる。また、本実施の形態によれば、フレームごとに確率値を算出するので、上述したように、フレーム単位だけではなく、または／および音素・単語単位、または／および発声全体の評定結果を出力でき、出力態様の自由度が高い。 Further, according to the present embodiment, the optimum state is obtained and the rating value is calculated using the concatenated HMM formed by joining the HMMs, so the rating value can be obtained at high speed. Therefore, as described in the specific example above, rating values for each frame, each phoneme, and each word can be output in real time. Further, according to the present embodiment, the posterior probability based on dynamic programming is calculated as the probability value, so the rating value can be obtained even faster. Further, according to the present embodiment, a probability value is calculated for each frame, so, as described above, rating results can be output not only per frame but also, or instead, per phoneme or word and for the entire utterance, which gives a high degree of freedom in the form of output.

また、本実施の形態によれば、音声処理装置は、語学学習に利用することを主として説明したが、物真似練習や、カラオケ評定や、歌唱評定などに利用できる。つまり、本音声処理装置は、比較される対象の音声に関するデータとの類似度を精度良く、高速に評定し、出力でき、そのアプリケーションは問わない。つまり、例えば、本音声処理装置は、カラオケ評価装置であって、音声受付部は、評価対象者の歌声の入力を受け付け、音声処理部は、前記歌声を評価する、という構成でも良い。かかることは、他の実施の形態においても同様である。 In the present embodiment, the speech processing apparatus has mainly been described as being used for language learning, but it can also be used for impersonation practice, karaoke rating, singing rating, and the like. That is, the speech processing apparatus can rate and output, accurately and at high speed, the similarity to data on the speech to be compared against, regardless of the application. For example, the speech processing apparatus may be a karaoke evaluation apparatus in which the voice receiving unit receives the singing voice of the evaluation subject and the voice processing unit evaluates that singing voice. The same applies to the other embodiments.

また、本実施の形態において、音声の入力を受け付けた後または停止ボタン操作後に、スコアリング処理を実行するかどうかをユーザに問い合わせ、スコアリング処理を行うとの指示を受け付けた場合のみ、図１０に示すような音素スコア、単語スコア、総合スコアを出力するようにしても良い。かかることも、他の実施の形態においても同様である。 Further, in the present embodiment, after speech input has been received or after the stop button has been operated, the user may be asked whether to execute the scoring process, and the phoneme scores, word score, and total score as shown in FIG. 10 may be output only when an instruction to perform the scoring process is received. The same applies to the other embodiments.

また、本実施の形態において、教師データは、比較される対象の音声に関するデータであり、音韻毎の隠れマルコフモデル（ＨＭＭ）に基づくデータであるとして、主として説明したが、必ずしもＨＭＭに基づくデータである必要はない。教師データは、単一ガウス分布モデルや、確率モデル（ＧＭＭ：ガウシャンミクスチャモデル）や統計モデルなど、他のモデルに基づくデータでも良い。かかることも、他の実施の形態においても同様である。 Further, in the present embodiment, the teacher data has mainly been described as data on the speech to be compared against that is based on a hidden Markov model (HMM) for each phoneme, but it does not necessarily have to be based on HMMs. The teacher data may be based on other models, such as a single Gaussian distribution model, a probabilistic model (GMM: Gaussian mixture model), or another statistical model. The same applies to the other embodiments.

また、本実施の形態の具体例において、学習者は、母音「う」を発音し、音声処理装置は、かかる音声から第二サンプリング周波数を得た。しかし、学習者は、例えば、母音「あいえお」等、１以上の母音を発音し、かかる母音の音声から、音声処理装置は、第二サンプリング周波数を得ても良い。つまり、第二サンプリング周波数を得るために、学習者が発音する音は「う」に限られない。 In the specific example of the present embodiment, the learner pronounces the vowel “U”, and the speech processing apparatus obtains the second sampling frequency from the speech. However, the learner may generate one or more vowels such as the vowel “Aieo”, and the sound processing apparatus may obtain the second sampling frequency from the sound of the vowels. That is, the sound produced by the learner to obtain the second sampling frequency is not limited to “U”.

また、本実施の形態において、音声処理装置は、声道長正規化処理部１０９において、声道長正規化パラメータ「ｒ」を算出した。しかし、別途、声道長正規化パラメータ「１／ｒ」を算出しておいて、かかる声道長正規化パラメータ「１／ｒ」を声道長正規化パラメータ格納部に格納していても良い。かかる場合、音声処理装置は、音声を受け付ける音声受付部と、第一サンプリング周波数を格納している第一サンプリング周波数格納部と、サンプリング周波数の変換率に関する情報である声道長正規化パラメータを格納している声道長正規化パラメータ格納部と、前記声道長正規化パラメータと前記第一サンプリング周波数をパラメータとして算出される第二サンプリング周波数で、前記音声受付部が受け付けた音声に対して、サンプリング処理を行い、第二音声データを得る声道長正規化処理部と、前記第二音声データを処理する音声処理部を具備する音声処理装置、である。かかる場合、音声処理装置は、教師データフォルマント周波数格納部１０４、評価対象者フォルマント周波数取得部１０７、評価対象者フォルマント周波数格納部１０８は、必須ではない。かかることも、他の実施の形態においても同様である。 In the present embodiment, the speech processing apparatus calculates the vocal tract length normalization parameter "r" in the vocal tract length normalization processing unit 109. However, the vocal tract length normalization parameter "1 / r" may instead be calculated separately in advance and stored in a vocal tract length normalization parameter storage unit. In that case, the speech processing apparatus comprises: a voice receiving unit that receives speech; a first sampling frequency storage unit that stores a first sampling frequency; a vocal tract length normalization parameter storage unit that stores a vocal tract length normalization parameter, which is information on the conversion rate of the sampling frequency; a vocal tract length normalization processing unit that performs sampling processing on the speech received by the voice receiving unit at a second sampling frequency calculated using the vocal tract length normalization parameter and the first sampling frequency as parameters, and obtains second audio data; and a voice processing unit that processes the second audio data. In that case, the teacher data formant frequency storage unit 104, the evaluation subject formant frequency acquisition unit 107, and the evaluation subject formant frequency storage unit 108 are not essential to the speech processing apparatus. The same applies to the other embodiments.

また、本実施の形態において、音声処理装置が行う下記の処理を、一のＤＳＰ（デジタルシグナルプロセッサ）で行っても良い。つまり、本ＤＳＰは、第一サンプリング周波数を格納している第一サンプリング周波数格納部と、前記第一サンプリング周波数で、受け付けた音声をサンプリングし、第一音声データを取得するサンプリング部と、前記教師データのフォルマント周波数である教師データフォルマント周波数を格納している教師データフォルマント周波数格納部と、前記音声の話者である評価対象者のフォルマント周波数である評価対象者フォルマント周波数を格納している評価対象者フォルマント周波数格納部と、第二サンプリング周波数「前記第一サンプリング周波数／（教師データフォルマント周波数／評価対象者フォルマント周波数）」で、前記受け付けた音声に対して、サンプリング処理を行い、第二音声データを得る声道長正規化処理部を具備するデジタルシグナルプロセッサ、である。かかることも、他の実施の形態でも同様である。 In the present embodiment, the following processing performed by the speech processing apparatus may be performed by a single DSP (digital signal processor). That is, the DSP is a digital signal processor comprising: a first sampling frequency storage unit that stores a first sampling frequency; a sampling unit that samples received speech at the first sampling frequency and acquires first audio data; a teacher data formant frequency storage unit that stores a teacher data formant frequency, which is a formant frequency of the teacher data; an evaluation subject formant frequency storage unit that stores an evaluation subject formant frequency, which is a formant frequency of the evaluation subject who is the speaker of the speech; and a vocal tract length normalization processing unit that performs sampling processing on the received speech at the second sampling frequency "first sampling frequency / (teacher data formant frequency / evaluation subject formant frequency)" and obtains second audio data. The same applies to the other embodiments.

さらに、本実施の形態における処理は、ソフトウェアで実現しても良い。そして、このソフトウェアをソフトウェアダウンロード等により配布しても良い。また、このソフトウェアをＣＤ−ＲＯＭなどの記録媒体に記録して流布しても良い。なお、このことは、本明細書における他の実施の形態においても該当する。なお、本実施の形態における音声処理装置を実現するソフトウェアは、以下のようなプログラムである。つまり、このプログラムは、コンピュータに、第一サンプリング周波数で、受け付けた音声をサンプリングし、第一音声データを取得するサンプリングステップと、第二サンプリング周波数「第一サンプリング周波数／（教師データフォルマント周波数／評価対象者フォルマント周波数）」で、前記音声受付ステップで受け付けた音声に対して、サンプリング処理を行い、第二音声データを得る声道長正規化処理ステップと、前記第二音声データを処理する音声処理ステップを実行させるためのプログラム、である。 Furthermore, the processing in the present embodiment may be realized by software. This software may be distributed by software download or the like, or may be recorded on a recording medium such as a CD-ROM and distributed. This also applies to the other embodiments in this specification. The software that realizes the speech processing apparatus of the present embodiment is the following program. That is, the program causes a computer to execute: a sampling step of sampling received speech at a first sampling frequency and acquiring first audio data; a vocal tract length normalization processing step of performing sampling processing on the speech received in the speech receiving step at the second sampling frequency "first sampling frequency / (teacher data formant frequency / evaluation subject formant frequency)" and obtaining second audio data; and a voice processing step of processing the second audio data.

また、上記プログラムにおいて、音声処理ステップは、前記第二音声データを、フレームに区分するフレーム区分ステップと、前記区分されたフレーム毎の音声データであるフレーム音声データを１以上得るフレーム音声データ取得ステップと、前記教師データと前記１以上のフレーム音声データに基づいて、前記受け付けた音声の評定を行う評定ステップと、前記評定ステップにおける評定結果を出力する出力ステップを具備する、ことは好適である。 In the above program, the audio processing step includes a frame dividing step for dividing the second audio data into frames, and a frame audio data acquiring step for obtaining one or more frame audio data which are audio data for each of the divided frames. And a rating step for rating the received voice based on the teacher data and the one or more frame voice data, and an output step for outputting a rating result in the rating step.

さらに、上記プログラムにおいて、前記評定ステップは、前記１以上のフレーム音声データのうちの少なくとも一のフレーム音声データに対する最適状態を決定する最適状態決定ステップと、前記最適状態決定ステップで決定した最適状態における確率値を取得する最適状態確率値取得ステップと、前記最適状態確率値取得ステップで取得した確率値をパラメータとして音声の評定値を算出する評定値算出ステップを具備することは好適である。 Further, in the above program, it is preferable that the rating step include an optimum state determination step of determining an optimum state for at least one of the one or more pieces of frame audio data, an optimum state probability value acquisition step of acquiring a probability value in the optimum state determined in the optimum state determination step, and a rating value calculation step of calculating a rating value of the speech using the probability value acquired in the optimum state probability value acquisition step as a parameter.

（実施の形態２） (Embodiment 2)

The speech processing apparatus of the present embodiment differs from that of Embodiment 1 in the rating algorithm of its rating unit. In the present embodiment, the rating value is calculated by evaluating, over a pronunciation interval, the probability values of all states in the phoneme that contains the optimal state. The posterior probability calculated by the speech processing apparatus of the present embodiment is called t-p-DAP, in contrast to the DAP of Embodiment 1.

FIG. 11 is a block diagram of the speech processing apparatus of the present embodiment. The speech processing apparatus comprises an input receiving unit 101, a teacher data storage unit 102, a speech receiving unit 103, a teacher data formant frequency storage unit 104, a first sampling frequency storage unit 105, a sampling unit 106, an evaluation subject formant frequency acquisition unit 107, an evaluation subject formant frequency storage unit 108, a vocal tract length normalization processing unit 109, a speech processing unit 1110, and an utterance prompting unit 1109.

The speech processing unit 1110 comprises frame segmentation means 1101, frame speech data acquisition means 1102, rating means 11103, and output means 1104.

The rating means 11103 comprises optimal state determination means 11031, pronunciation interval frame phoneme probability value acquisition means 111032, and rating value calculation means 111033.

The pronunciation interval frame phoneme probability value acquisition means 111032 acquires, for each pronunciation interval, one or more probability values over all states of the phoneme that contains the optimal state of each frame, as determined by the optimal state determination means 11031.

The rating value calculation means 111033 calculates a rating value of the speech using, as parameters, the one or more probability values for each of the one or more pronunciation intervals acquired by the pronunciation interval frame phoneme probability value acquisition means 111032. For example, the rating value calculation means 111033 obtains, for each frame, the sum of the one or more probability values over all states of the phoneme that contains the optimal state of that frame as determined by the optimal state determination means 11031. Based on these per-frame sums, it obtains one or more time averages of the sums, one for each pronunciation interval, and calculates the rating value of the speech using the one or more time averages as parameters.

The pronunciation interval frame phoneme probability value acquisition means 111032 and the rating value calculation means 111033 can usually be implemented by an MPU, memory, and the like. Their processing procedures are usually implemented in software, and the software is recorded on a recording medium such as a ROM. However, they may also be implemented in hardware (dedicated circuits).

When the input receiving unit 101 receives an operation start instruction, the utterance prompting unit 1109 prompts the evaluation subject to speak, either so that the second sampling frequency can be calculated or so that the subject's pronunciation can be rated. Prompting the evaluation subject to speak means, for example, displaying "Please pronounce ~." on a display or outputting "Please pronounce ~." as sound from a speaker. The utterance prompting unit 1109 may or may not be considered to include an output device such as the display 344 or a speaker. The utterance prompting unit 1109 can be implemented by driver software for an output device, or by such driver software together with the output device.

Next, the operation of the speech processing apparatus is described with reference to the flowcharts of FIGS. 12 to 14. For the flowchart of FIG. 12 and the others, only the steps that differ from the flowcharts of FIGS. 2 and 3 are described.

(Step S1201) The utterance prompting unit 1109 prompts an utterance for calculating the second sampling frequency, for example by displaying "Please say the vowel 'u'." on the display.

(Step S1202) The speech receiving unit 103 determines whether speech has been received. If speech has been received, the process goes to step S1203; otherwise, it goes to step S213.

(Step S1203) The sampling unit 106 reads the first sampling frequency stored in the first sampling frequency storage unit 105, samples the speech received in step S1202 at the first sampling frequency, and obtains first speech data.

(Step S1204) The vocal tract length normalization processing unit 109 obtains the second sampling frequency from the first speech data obtained in step S1203. This second sampling frequency calculation process is described with reference to the flowchart of FIG. 13.

(Step S1205) The utterance prompting unit 1109 prompts an utterance for rating, for example by displaying "Please say 'right'." on the display.

(Step S1206) The speech receiving unit 103 determines whether speech has been received. If speech has been received, the process goes to step S1207; otherwise, it goes to step S213.

(Step S1207) The sampling unit 106 reads the first sampling frequency stored in the first sampling frequency storage unit 105, samples the speech received in step S1206 at the first sampling frequency, and obtains first speech data.

(Step S1208) The vocal tract length normalization processing unit 109 resamples the first speech data obtained in step S1207 at the second sampling frequency obtained in step S1204, and obtains second speech data.

(Step S1209) The speech processing unit 1110 performs rating processing on the second speech data obtained in step S1208. Details of the rating processing are described with reference to the flowchart of FIG. 14. The process then returns to step S1202.

In the flowchart of FIG. 12, the speech used to calculate the second sampling frequency and the speech to be rated may be the same, or one may contain the other.
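The resampling flow of steps S1203 to S1208 can be sketched as follows. This is a minimal Python illustration and not the patent's implementation: the function names, the formant values 687 Hz and 1000 Hz, and the use of linear interpolation (`numpy.interp`) in place of a band-limited resampler are all assumptions made for the example. Only the formula for the second sampling frequency is taken from the text.

```python
import numpy as np


def second_sampling_frequency(f1_hz, teacher_formant_hz, subject_formant_hz):
    """Second sampling frequency = f1 / (teacher formant / subject formant)."""
    return f1_hz / (teacher_formant_hz / subject_formant_hz)


def resample_linear(x, f1_hz, f2_hz):
    """Resample x (sampled at f1_hz) onto a grid at f2_hz.

    Linear interpolation is an assumption chosen for brevity; a real system
    would more likely use a band-limited (polyphase or sinc) resampler.
    """
    n_out = int(round(len(x) * f2_hz / f1_hz))
    t_in = np.arange(len(x)) / f1_hz    # time stamps of the first speech data
    t_out = np.arange(n_out) / f2_hz    # time stamps of the second speech data
    return np.interp(t_out, t_in, x)


# Hypothetical formants: adult teacher 687 Hz, child learner 1000 Hz.
f1 = 22050.0
f2 = second_sampling_frequency(f1, 687.0, 1000.0)

# 0.1 s of a 440 Hz tone stands in for the first speech data.
first = np.sin(2 * np.pi * 440.0 * np.arange(2205) / f1)
second = resample_linear(first, f1, f2)
```

With these hypothetical formants and a first sampling frequency of 22.05 kHz, the second sampling frequency comes out at about 32.1 kHz, which matches the figures used in this embodiment's worked example. When the second speech data is then analyzed as if it had been sampled at the first sampling frequency, its spectrum is scaled by the formant ratio, which is the normalization effect the text relies on.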

The second sampling frequency calculation process of step S1204 is described with reference to the flowchart of FIG. 13. In the flowchart of FIG. 13, the processing of steps S301 to S305 in the flowchart of FIG. 3 is performed.

The rating processing of step S1209 is described with reference to the flowchart of FIG. 14.

(Step S1401) The pronunciation interval frame phoneme probability value acquisition means 111032 calculates the forward likelihood and backward likelihood of every state in every frame, and thereby obtains the probability values of all states in all frames. Specifically, the pronunciation interval frame phoneme probability value acquisition means 111032 calculates, for example, the posterior probability that each feature vector was generated from the state in question. This posterior probability corresponds to the occupancy count that appears in the Baum-Welch algorithm for maximum likelihood estimation of HMMs. The Baum-Welch algorithm is well known, so its description is omitted.

(Step S1402) The pronunciation interval frame phoneme probability value acquisition means 111032 calculates the optimal state probability values of all frames.

(Step S1403) The pronunciation interval frame phoneme probability value acquisition means 111032 assigns 1 to the counter j.

(Step S1404) The pronunciation interval frame phoneme probability value acquisition means 111032 determines whether a j-th pronunciation interval, which is the next pronunciation interval to be rated, exists. If the j-th pronunciation interval exists, the process goes to step S1405; if not, the process returns to the calling function.

(Step S1405) The pronunciation interval frame phoneme probability value acquisition means 111032 assigns 1 to the counter k.

(Step S1406) The pronunciation interval frame phoneme probability value acquisition means 111032 determines whether a k-th frame exists in the j-th pronunciation interval. If the k-th frame exists, the process goes to step S1407; if not, the process jumps to step S1410.

(Step S1407) The pronunciation interval frame phoneme probability value acquisition means 111032 acquires all the probability values of the phoneme that contains the optimal state of the k-th frame.

(Step S1408) The rating value calculation means 111033 calculates the rating value of one frame of speech using the one or more probability values acquired in step S1407 as parameters.

(Step S1409) The pronunciation interval frame phoneme probability value acquisition means 111032 increments k by 1. The process returns to step S1406.

(Step S1410) The rating value calculation means 111033 calculates the rating value of the j-th pronunciation interval. For example, the rating value calculation means 111033 obtains, for each frame, the sum of the one or more probability values over all states of the phoneme that contains the optimal state of that frame as determined by the optimal state determination means 11031. Based on these per-frame sums, it calculates the time average of the sums over the pronunciation interval as the rating value of the speech in that interval.

(Step S1411) The output means 1104 outputs the rating value calculated in step S1410.

(Step S1412) The pronunciation interval frame phoneme probability value acquisition means 111032 increments j by 1. The process returns to step S1404.
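The per-frame state posteriors computed in step S1401 correspond to the state occupancy probabilities of the forward-backward procedure. A minimal sketch in Python follows, assuming an HMM given by an initial distribution `pi`, a transition matrix `A`, and a precomputed table `B[t, j]` of emission likelihoods for each frame and state. These names, and the per-frame scaling used for numerical stability, are illustrative assumptions and are not taken from the patent.

```python
import numpy as np


def state_posteriors(pi, A, B):
    """Return gamma[t, j] = P(state j at frame t | whole observation sequence).

    pi: (N,) initial state probabilities
    A:  (N, N) transition probabilities, A[i, j] = P(j at t+1 | i at t)
    B:  (T, N) emission likelihoods, B[t, j] = b_j(o_t)
    Assumes every frame has non-zero likelihood under the model.
    """
    T, N = B.shape
    alpha = np.zeros((T, N))   # scaled forward likelihoods
    beta = np.zeros((T, N))    # scaled backward likelihoods
    scale = np.zeros(T)

    # Forward pass, renormalizing each frame to avoid underflow.
    alpha[0] = pi * B[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    # Backward pass, using the same scale factors.
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / scale[t + 1]

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)  # guard against rounding drift
    return gamma


# Toy two-state left-to-right model with three frames (hypothetical numbers).
pi = np.array([1.0, 0.0])
A = np.array([[0.5, 0.5],
              [0.0, 1.0]])
B = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.1, 0.9]])
gamma = state_posteriors(pi, A, B)
```

Each row of `gamma` sums to 1, so the values can be read directly as the per-frame, per-state probability values that the later steps select and sum.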

A specific operation of the speech processing apparatus of the present embodiment is described below. Because the rating value calculation algorithm of the present embodiment differs from that of Embodiment 1, the description focuses on that operation.

First, the learner (the evaluation subject) inputs an operation start instruction, which is an instruction to start language learning. The speech processing apparatus receives the operation start instruction, and the utterance prompting unit 1109 then outputs, for example, "Please say 'u'." on the screen.

Here too, the learner is assumed to be, for example, a child, as in Embodiment 1, and the teacher who spoke to create the teacher data is assumed to be an adult native speaker. The same applies to the specific examples described in the other embodiments.

The evaluation subject then says "u", and the speech processing apparatus obtains the second sampling frequency "32.1 kHz" from the utterance. This processing is the same as that described in Embodiment 1.

Next, the utterance prompting unit 1109 outputs, for example, "Please say 'right'." on the screen. The learner pronounces the speech to be learned, "right", and the speech receiving unit 103 receives the input of the speech pronounced by the learner.

Next, the sampling unit 106 samples the received speech "right" at the sampling frequency "22.05 kHz" and obtains the first speech data of "right".

Next, the vocal tract length normalization processing unit 109 resamples the first speech data of "right" at the second sampling frequency "32.1 kHz" and obtains second speech data. The speech processing unit 1110 then processes the second speech data as follows.

First, the frame segmentation means 1101 segments the second speech data of "right" into short-time frames.

The frame speech data acquisition means 1102 then performs spectral analysis on the speech data segmented by the frame segmentation means 1101 and calculates the feature vector sequence "O = o_1, o_2, ..., o_T".

Next, the pronunciation interval frame phoneme probability value acquisition means 111032 calculates the posterior probability (probability value) of each state in each frame. The probability values can be calculated using Equations 1 and 2 given above.

Next, the optimal state determination means 11031 determines the optimal state of each frame (the optimal state for feature vector o_t) based on each feature vector o_t in the acquired feature vector sequence. That is, the optimal state determination means 11031 obtains an optimal state sequence. The order in which the posterior probabilities (probability values) of the states of each frame are calculated and the optimal states are determined does not matter.

Next, the pronunciation interval frame phoneme probability value acquisition means 111032 acquires, for each pronunciation interval, all the probability values of the phoneme that contains the optimal state of each frame in that interval. The rating value calculation means 111033 then calculates, for each frame, the sum of all the probability values of the phoneme that contains the optimal state of that frame. It then time-averages these per-frame sums over each pronunciation interval to calculate a rating value for each pronunciation interval. Specifically, the rating value calculation means 111033 calculates the rating value using Equation 3. In Equation 3, p-DAP(τ) is a rating value calculated so as to represent, for each frame, the posterior probability (probability value) of the optimal phoneme among all phonemes, and it can be calculated using Equation 4. The t-p-DAP of Equation 3 is the rating value calculated by evaluating, over the pronunciation interval, the probability values of all states in the phoneme that contains the optimal state. In Equation 3, T(q_t^*) is the pronunciation interval to be rated that contains the HMM containing state q_t^*, and |T(q_t^*)| is the length of the interval T(q_t^*). In Equation 4, P(q_t^*) is the set of all state identifiers of the HMM that contains state q_t^*.
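Equations 3 and 4 themselves do not appear in this text. Going only by the definitions given for them, one plausible reading is the following, where $\gamma_\tau(j)$ is an assumed symbol for the posterior probability (probability value) of state $j$ at frame $\tau$. This is a reconstruction under those assumptions and may differ in notation from the equations in the original publication.

```latex
% Equation 3 (assumed form): time average of p-DAP over the pronunciation interval
\text{t-p-DAP}(t) \;=\; \frac{1}{\lvert T(q_t^{*}) \rvert}
    \sum_{\tau \in T(q_t^{*})} \text{p-DAP}(\tau)

% Equation 4 (assumed form): sum of the posteriors of all states of the
% phoneme HMM that contains the optimal state of frame \tau
\text{p-DAP}(\tau) \;=\; \sum_{j \in P(q_\tau^{*})} \gamma_\tau(j)
```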

An example of the rating values (also called "t-p-DAP scores") calculated by the rating value calculation means 111033 is shown in the table of FIG. 15. FIG. 15 shows the rating results for an American male and a Japanese male. "Phoneme" and "Word" indicate the range over which the time average in t-p-DAP is taken. Here, the time average of p-DAP is used instead of DAP. In FIG. 15, the rating values for the American male's pronunciation are higher than those for the Japanese male's pronunciation, so a good rating result is obtained.

The output means 1104 then sequentially outputs the rating value calculated for each pronunciation interval (here, for each phoneme). An example of such output is shown in FIG. 16.

According to the present embodiment, the optimal state is obtained and the rating value is calculated using a concatenated HMM, that is, HMMs joined together, so the rating value can be obtained at high speed. Therefore, as described in the specific example above, the rating value for each pronunciation interval can be output in real time. Also, according to the present embodiment, posterior probabilities based on dynamic programming are calculated as the probability values, so the rating value can be obtained even faster.

Also, according to the present embodiment, the rating value can be calculated in units of pronunciation intervals. Compared with the state-level DAP of Embodiment 1, this makes it possible to obtain the similarity that is actually of interest (the similarity of a pronunciation interval) accurately and stably.
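The interval-level rating described in this embodiment (a per-frame sum over the states of the phoneme that contains the optimal state, followed by a time average over each pronunciation interval) can be sketched in Python as follows. The function names, the `state_to_phoneme` mapping, and the representation of intervals as half-open frame ranges are assumptions made for illustration. The posterior table `gamma` and the optimal state sequence are taken as given inputs.

```python
import numpy as np


def p_dap(gamma, best_state, state_to_phoneme):
    """Per-frame score: sum of the posteriors of every state that belongs to
    the same phoneme as the frame's optimal state."""
    state_to_phoneme = np.asarray(state_to_phoneme)
    scores = np.empty(gamma.shape[0])
    for t, q in enumerate(best_state):
        same_phoneme = state_to_phoneme == state_to_phoneme[q]
        scores[t] = gamma[t, same_phoneme].sum()
    return scores


def t_p_dap(gamma, best_state, state_to_phoneme, intervals):
    """Interval score: time average of the per-frame scores over each
    pronunciation interval, given as (start, end) with end exclusive."""
    frame_scores = p_dap(gamma, best_state, state_to_phoneme)
    return [float(frame_scores[s:e].mean()) for s, e in intervals]


# Toy data (hypothetical): 3 frames, 3 states; states 0 and 1 form the first
# phoneme and state 2 forms the second.
gamma = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.1, 0.1, 0.8]])
best_state = [0, 1, 2]
state_to_phoneme = [0, 0, 1]
intervals = [(0, 2), (2, 3)]

frame_scores = p_dap(gamma, best_state, state_to_phoneme)
interval_scores = t_p_dap(gamma, best_state, state_to_phoneme, intervals)
```

Summing over all states of the phoneme makes the per-frame score insensitive to which state inside the phoneme happens to be optimal, which is one way to read the stability claim made for interval-level scores over state-level DAP.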

Furthermore, the software that implements the speech processing apparatus of the present embodiment is the following program. That is, this program causes a computer to execute: a sampling step of sampling received speech at a first sampling frequency to obtain first speech data; a vocal tract length normalization processing step of performing sampling processing on the speech received in the speech receiving step at a second sampling frequency, "first sampling frequency / (teacher data formant frequency / evaluation subject formant frequency)", to obtain second speech data; and a speech processing step of processing the second speech data.

In the above program, it is preferable that the rating step comprises: an optimal state determination step of determining the optimal state of the one or more pieces of frame speech data; a pronunciation interval frame phoneme probability value acquisition step of acquiring, for each pronunciation interval, one or more probability values over all states of the phoneme that contains the optimal state of each frame determined in the optimal state determination step; and a rating value calculation step of calculating a rating value of the speech using, as parameters, the one or more probability values for each of the one or more pronunciation intervals acquired in the pronunciation interval frame phoneme probability value acquisition step.

(Embodiment 3)

The present embodiment describes a speech processing apparatus that can rate the similarity between the reference speech and the input speech with high accuracy. In particular, this speech processing apparatus detects silent intervals and can rate similarity with silent intervals taken into account.

FIG. 17 is a block diagram of the speech processing apparatus of the present embodiment. The speech processing apparatus comprises an input receiving unit 101, a teacher data storage unit 102, a speech receiving unit 103, a teacher data formant frequency storage unit 104, a first sampling frequency storage unit 105, a sampling unit 106, an evaluation subject formant frequency acquisition unit 107, an evaluation subject formant frequency storage unit 108, a vocal tract length normalization processing unit 109, a speech processing unit 1710, and an utterance prompting unit 1109.

The speech processing unit 1710 comprises frame segmentation means 1101, frame speech data acquisition means 1102, special speech detection means 17101, rating means 17103, and output means 1104.

The special speech detection means 17101 comprises silence data storage means 171011 and silent interval detection means 171012.

The rating means 17103 comprises optimal state determination means 11031, optimal state probability value acquisition means 11032, and rating value calculation means 171033.

The special speech detection means 17101 detects, based on the input speech data of each frame, that special speech has been input. Here, special speech includes silence. For example, the special speech detection means 17101 acquires the probability values of the optimal states of the frames in a phoneme interval and, when the sum of the one or more probability values in that phoneme interval is lower than a predetermined value (that is, when the input can be judged not to be the expected phoneme), detects that special speech was input in that phoneme interval. An example of a specific detection algorithm is described later. The special speech detection means 17101 can usually be implemented by an MPU, memory, and the like. Its processing procedure is usually implemented in software, and the software is recorded on a recording medium such as a ROM. However, it may also be implemented in hardware (a dedicated circuit).

The silence data storage means 171011 stores silence data, which is data representing silence and is based on an HMM. The silence data storage means 171011 is preferably a non-volatile recording medium, but it can also be implemented with a volatile recording medium.

The silent interval detection means 171012 detects silent intervals based on the frame speech data acquired by the frame speech data acquisition means 1102 and the silence data in the silence data storage means 171011. The silent interval detection means 171012 may judge that frame speech data belongs to a silent interval when the similarity between that frame speech data and the silence data is equal to or greater than a predetermined value. Alternatively, it may judge that frame speech data belongs to a silent interval when the probability value acquired by the optimal state probability value acquisition means 11032, described below, is equal to or less than a predetermined value and the similarity between the frame speech data and the silence data is equal to or greater than a predetermined value. The silent interval detection means 171012 can usually be implemented by an MPU, memory, and the like. Its processing procedure is usually implemented in software, and the software is recorded on a recording medium such as a ROM. However, it may also be implemented in hardware (a dedicated circuit).

The rating means 17103 rates the speech received by the speech receiving unit 103 based on the teacher data, the input speech data, and the detection result of the special speech detection means 17101. "Based on the detection result of the special speech detection means 17101" means, for example, that when the special speech detection means 17101 detects silence, the silent interval is ignored. It also means, for example, that when the special speech detection means 17101 detects a substitution, an omission, or the like, the rating value is calculated with a predetermined amount deducted because of that detection. The rating means 17103 can usually be implemented by an MPU, memory, and the like. Its processing procedure is usually implemented in software, and the software is recorded on a recording medium such as a ROM. However, it may also be implemented in hardware (a dedicated circuit).

The rating value calculation means 171033 calculates the rating value of the speech using, as parameters, the probability values acquired by the optimal state probability value acquisition means 11032, excluding the silent intervals detected by the silent interval detection means 171012. How the rating value calculation means 171033 uses these probability values to calculate the rating value is not restricted. For example, the rating value calculation means 171033 calculates the rating value of the speech using, as parameters, the probability value acquired by the optimal state probability value acquisition means 11032 and the sum of the probability values over all states of the frame corresponding to that probability value. Here, the rating value calculation means 171033 usually calculates a rating value for each frame, excluding the silent intervals detected by the silent interval detection means 171012. The rating value calculation means 171033 does not necessarily have to exclude silent intervals when calculating the rating value; it may instead calculate the rating value so as to reduce the influence of silent intervals. The rating value calculation means 171033 can usually be implemented by an MPU, memory, and the like. Its processing procedure is usually implemented in software, and the software is recorded on a recording medium such as a ROM. However, it may also be implemented in hardware (a dedicated circuit).

Next, the operation of the speech processing apparatus is described with reference to the flowcharts of FIGS. 18 and 19. The flowchart of FIG. 18 differs from the flowchart of FIG. 12 only in the rating processing of step S1801, so its description is omitted. Details of the rating processing of step S1801 are described with reference to the flowchart of FIG. 19.

（ステップＳ１９０１）評定手段１７１０３は、ＤＡＰの評定値を算出する。ＤＡＰの評定値を算出するアルゴリズムは、実施の形態１で説明済みであるので、詳細な説明は省略する。ＤＡＰの評定値を算出する処理は、図２のフローチャートの、ステップＳ２１１からＳ２１４の処理により行う。 (Step S1901) The rating means 17103 calculates a DAP rating value. Since the algorithm for calculating the DAP rating value has been described in the first embodiment, a detailed description thereof is omitted. The DAP rating value is calculated by the processes of steps S211 to S214 in the flowchart of FIG. 2.

（ステップＳ１９０２）特殊音声検知手段１７１０１は、ステップＳ１９０１で算出した値が、所定の値より低いか否かを判断する。所定の値より低ければステップＳ１９０３に行き、所定の値より低くなければステップＳ１９０６に飛ぶ。 (Step S1902) The special sound detection unit 17101 determines whether or not the value calculated in step S1901 is lower than a predetermined value. If it is lower than the predetermined value, the process goes to step S1903, and if it is not lower than the predetermined value, the process jumps to step S1906.

（ステップＳ１９０３）無音区間検出手段１７１０１２は、無音データと全教師データの確率値を取得する。 (Step S1903) The silent section detecting means 171012 acquires the probability values of the silent data and all the teacher data.

（ステップＳ１９０４）無音区間検出手段１７１０１２は、ステップＳ１９０３で取得した確率値の中で、無音データの確率値が最も高いか否かを判断する。無音データの確率値が最も高ければ（かかる場合、無音の区間であると判断する）ステップＳ１９０５に行き、無音データの確率値が最も高くなければステップＳ１９０６に行く。 (Step S1904) The silent section detection means 171012 determines whether or not the probability value of the silent data is the highest among the probability values acquired in step S1903. If the probability value of the silent data is the highest (in which case the section is determined to be a silent section), the process goes to step S1905. If the probability value of the silent data is not the highest, the process goes to step S1906.

（ステップＳ１９０５）無音区間検出手段１７１０１２は、カウンタｉを１、インクリメントする。ステップＳ２０８に戻る。 (Step S1905) The silent section detecting means 171012 increments the counter i by 1. The process returns to step S208.

（ステップＳ１９０６）出力手段１１０４は、ステップＳ１９０１で算出した評定値を出力する。 (Step S1906) The output unit 1104 outputs the rating value calculated in step S1901.
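
Steps S1901 to S1906 above can be sketched as a single loop over frames. The following is a minimal illustration only: the function name `rate_frames`, the list-based inputs, and the assumption that the per-frame probability values are already available are all hypothetical and are not taken from this specification.

```python
def rate_frames(dap_scores, silence_probs, teacher_probs, threshold):
    """Return (frame index, rating value) pairs for the frames that are output.

    dap_scores:    per-frame DAP rating values (result of S1901).
    silence_probs: per-frame probability value of the silent data.
    teacher_probs: per-frame list of probability values of all teacher data.
    threshold:     the "predetermined value" used in S1902.
    """
    outputs = []
    for i, score in enumerate(dap_scores):
        if score < threshold:                      # S1902: low rating value?
            best_teacher = max(teacher_probs[i])   # S1903: get probability values
            if silence_probs[i] > best_teacher:    # S1904: silence is the highest
                continue                           # S1905: skip the silent frame
        outputs.append((i, score))                 # S1906: output the rating value
    return outputs


if __name__ == "__main__":
    print(rate_frames(
        [0.9, 0.1, 0.2],
        [0.0, 0.8, 0.1],
        [[0.9], [0.1, 0.05], [0.2, 0.3]],
        threshold=0.5,
    ))
```

In this sketch a frame with a low rating value is dropped only when the silent data is the most probable candidate; a low-scoring frame that is not silence is still output, matching the branch from S1904 to S1906.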

なお、図１９のフローチャートにおいて、出力手段１１０４は、無音区間と判定した区間の評定値は出力しなかった（無音区間を無視した）が、特殊音声が検知された区間が無音区間である旨を明示したり、無音区間が存在する旨を明示したりする態様で出力しても良い。また、評定値算出手段１７１０３３は、発音区間や、それ以上の単位のスコアを算出する場合に、無音区間の評定値を無視して、スコアを算出することが好適であるが、無音区間の評定値の影響を、例えば、１／１０にして、発音区間や発音全体のスコアを算出するなどしても良い。評定手段１７１０３は、教師データと入力音声データと特殊音声検知手段１７１０１における検知結果に基づいて、音声受付部１０３が受け付けた音声の評定を行えばよい。 In the flowchart of FIG. 19, the output unit 1104 does not output the rating value of a section determined to be a silent section (the silent section is ignored). However, the output may instead explicitly indicate that the section in which the special voice was detected is a silent section, or that a silent section exists. When calculating a score for a pronunciation section or a larger unit, the rating value calculation means 171033 preferably ignores the rating values of silent sections. Alternatively, it may reduce the influence of the rating values of silent sections to, for example, 1/10 when calculating the score of a pronunciation section or of the entire pronunciation. The rating unit 17103 need only rate the voice received by the voice receiving unit 103 based on the teacher data, the input voice data, and the detection result of the special voice detection unit 17101.
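
The two options for scoring a pronunciation section (ignoring silent frames, or reducing their influence to, for example, 1/10) can be sketched as a weighted average. This is a minimal illustration under the assumption that the section score is a weighted mean of the per-frame rating values; the function name `section_score` and the parameter `silent_weight` are hypothetical.

```python
def section_score(scores, is_silent, silent_weight=0.0):
    """Weighted mean of per-frame rating values over one section.

    scores:        per-frame rating values.
    is_silent:     per-frame flags, True for frames judged to be silence.
    silent_weight: 0.0 ignores silent frames; 0.1 reduces their influence to 1/10.
    Returns None when no frame carries any weight.
    """
    weights = [silent_weight if silent else 1.0 for silent in is_silent]
    total_weight = sum(weights)
    if total_weight == 0.0:
        return None
    return sum(w * s for w, s in zip(weights, scores)) / total_weight


if __name__ == "__main__":
    frames = [0.8, 0.0, 0.6]
    flags = [False, True, False]
    print(section_score(frames, flags))        # silent frame ignored
    print(section_score(frames, flags, 0.1))   # silent frame weighted at 1/10
```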

また、図１９のフローチャートにおいて、特殊音声検知手段１７１０１は、ｉ番目のフレーム音声データのＤＡＰスコアに基づいて特殊音声を検知したが、例えば、実施の形態２で説明したｔ−ｐ−ＤＡＰスコアに基づいて特殊音声を検知しても良い。 In the flowchart of FIG. 19, the special voice detection unit 17101 detects the special voice based on the DAP score of the i-th frame voice data. However, it may instead detect the special voice based on, for example, the t-p-DAP score described in the second embodiment.

以下、本実施の形態における音声処理装置の具体的な動作について説明する。本実施の形態において、無音区間を考慮して評定値を算出するので、評定値の算出アルゴリズムが実施の形態１、実施の形態２とは異なる。そこで、その異なる処理を中心に説明する。 Hereinafter, a specific operation of the speech processing apparatus according to the present embodiment will be described. In the present embodiment, since the rating value is calculated in consideration of the silent section, the rating value calculation algorithm is different from that of the first and second embodiments. Therefore, the different processing will be mainly described.

まず、学習者（評価対象者）が、語学学習の開始の指示である動作開始指示を入力する。そして、音声処理装置は、当該動作開始指示を受け付け、次に、例えば、「"う"と発声してください。」と画面出力する。 First, a learner (the person to be evaluated) inputs an operation start instruction, which is an instruction to start language learning. The speech processing apparatus accepts the operation start instruction and then displays on the screen, for example, "Please say "う" (u)."

そして、評価対象者は、"う"と発声し、音声処理装置は、当該発声から、第二ンプリング周波数「３２．１ＫＨｚ」を得る。かかる処理は、実施の形態１等において説明した処理と同様である。 Then, the evaluation target person utters “U”, and the speech processing apparatus obtains the second sampling frequency “32.1 KHz” from the utterance. Such processing is the same as the processing described in the first embodiment.

次に、声道長正規化処理部１０９は、「ｒｉｇｈｔ」の第一音声データを、第二サンプリング周波数「３２．１ＫＨｚ」でリサンプリング処理する。そして、声道長正規化処理部１０９は、第二音声データを得る。次に、音声処理部１７１０は、第二音声データを、以下のように処理する。 Next, the vocal tract length normalization processing unit 109 performs a resampling process on the first voice data “right” at the second sampling frequency “32.1 KHz”. Then, the vocal tract length normalization processing unit 109 obtains second voice data. Next, the audio processing unit 1710 processes the second audio data as follows.

次に、最適状態決定手段１１０３１は、取得した特徴ベクトル系列を構成する各特徴ベクトルｏ_ｔに基づいて、所定のフレームの最適状態（特徴ベクトルｏ_ｔに対する最適状態）を決定する。 Next, the optimum state determination unit 11031 determines the optimum state of a given frame (the optimum state for the feature vector o_t) based on each feature vector o_t constituting the acquired feature vector series.

次に、最適状態確率値取得手段１１０３２は、上述した数式１、２により、最適状態における確率値を算出する。 Next, the optimum state probability value acquisition unit 11032 calculates the probability value in the optimum state according to the above-described formulas 1 and 2.

次に、評定値算出手段１７１０３３は、例えば、最適状態決定手段１１０３１が決定した最適状態を有する音韻全体の状態における１以上の確率値を取得し、当該１以上の確率値の総和をパラメータとして音声の評定値を算出する。つまり、評定値算出手段１７１０３３は、例えば、ＤＡＰスコアをフレーム毎に算出する。 Next, the rating value calculation unit 171033 acquires, for example, one or more probability values over all the states of the phoneme that has the optimum state determined by the optimum state determination unit 11031, and calculates the rating value of the speech using the sum of those probability values as a parameter. That is, the rating value calculation unit 171033 calculates, for example, a DAP score for each frame.

そして、特殊音声検知手段１７１０１は、算出されたフレームに対応する評定値（ＤＡＰスコア）を用いて、特殊な音声が入力されたか否かを判断する。具体的には、特殊音声検知手段１７１０１は、例えば、評価対象のフレームに対して算出された評定値が、所定の数値より低ければ、特殊な音声が入力された、と判断する。なお、特殊音声検知手段１７１０１は、一のフレームに対応する評定値が小さいからといって、直ちに特殊な音声が入力された、と判断する必要はない。つまり、特殊音声検知手段１７１０１は、フレームに対応する評定値が小さいフレームが所定の数以上、連続する場合に、当該連続するフレーム群に対応する区間が特殊な音声が入力された区間と判断しても良い。 Then, the special voice detection unit 17101 determines whether or not a special voice has been input, using the calculated rating value (DAP score) corresponding to each frame. Specifically, for example, if the rating value calculated for the frame under evaluation is lower than a predetermined value, the special voice detection unit 17101 determines that a special voice has been input. Note that the special voice detection unit 17101 need not determine that a special voice has been input merely because the rating value of a single frame is small. That is, when a predetermined number or more of frames with small rating values occur consecutively, the special voice detection unit 17101 may determine that the section corresponding to that group of consecutive frames is a section in which a special voice was input.
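
The variant in which a special voice is reported only when a predetermined number or more of low-rated frames occur consecutively can be sketched as a run-length scan. This is a minimal illustration; the function name `find_special_sections` and the parameter `min_run` are hypothetical names for the detector and the "predetermined number".

```python
def find_special_sections(scores, threshold, min_run):
    """Return (first frame, last frame) pairs for runs of low-rated frames.

    A run is reported only if it contains at least min_run consecutive
    frames whose rating value is below threshold.
    """
    sections = []
    start = None
    for i, score in enumerate(scores):
        if score < threshold:
            if start is None:
                start = i                      # a new low-rated run begins
        else:
            if start is not None and i - start >= min_run:
                sections.append((start, i - 1))
            start = None                       # the run (if any) has ended
    # Handle a run that continues to the last frame.
    if start is not None and len(scores) - start >= min_run:
        sections.append((start, len(scores) - 1))
    return sections


if __name__ == "__main__":
    print(find_special_sections([0.9, 0.1, 0.8, 0.1, 0.2, 0.1, 0.9], 0.5, 2))
```

With `min_run=2`, the isolated low frame at index 1 is not reported, while the three consecutive low frames at indices 3 to 5 form one special section.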

特殊音声検知手段１７１０１が、特殊音声を検知する場合について説明する図を図２０に示す。図２０（ａ）の縦軸は、ＤＡＰスコアであり、横軸はフレームを示す。図２０（ａ）において、（Ｖ）は、Ｖｉｔｅｒｂｉアライメントを示す。図２０（ａ）において、網掛けのフレーム群のおけるＤＡＰスコアは、所定の値より低く、特殊音声の区間である、と判断される。 FIG. 20 illustrates a case where the special sound detection unit 17101 detects special sound. In FIG. 20A, the vertical axis represents the DAP score, and the horizontal axis represents the frame. In FIG. 20A, (V) shows Viterbi alignment. In FIG. 20A, it is determined that the DAP score in the shaded frame group is lower than a predetermined value and is a special speech section.

次に、特殊な音声が入力された、と判断した場合、無音区間検出手段１７１０１２は、無音データ格納手段１７１０１１から無音データを取得し、当該フレーム群と無音データとの類似度を算定し、類似度が所定値以上であれば当該フレーム群に対応する音声データが、無音データであると判断する。図２０（ｂ）は、無音データとの比較の結果、当該無音データとの類似度を示す事後確率の値（「ＤＡＰスコア」）が高いことを示す。その結果、無音区間検出手段１７１０１２は、当該特殊音声の区間は、無音区間である、と判断する。なお、図２０（ａ）において、網掛けのフレーム群のおけるＤＡＰスコアは、所定の値より低く、特殊音声の区間である、と判断され、かつ、無音データとの比較の結果、ＤＡＰスコアが低い場合には、無音区間ではない、と判断される。そして、かかる区間において、例えば、単に、発音が上手くなく、低い評定値が出力される。なお、図２０（ａ）に示しているように、通常、無音区間は、第一のワード（「ｗｏｒｄ１」）の最終音素の後半部、および第一のワードに続く第二のワード（「ｗｏｒｄ２」）の第一音素の前半部のスコアが低い。 Next, when it is determined that a special voice has been input, the silent section detection unit 171012 acquires the silent data from the silent data storage unit 171011 and calculates the similarity between the frame group and the silent data. If the similarity is equal to or greater than a predetermined value, it determines that the audio data corresponding to the frame group is silent data. FIG. 20(b) shows that, as a result of the comparison with the silent data, the value of the posterior probability indicating the similarity to the silent data (the "DAP score") is high. As a result, the silent section detection unit 171012 determines that the special voice section is a silent section. If, in FIG. 20(a), the DAP score of the shaded frame group is lower than a predetermined value, so that the group is determined to be a special voice section, and the DAP score obtained from the comparison with the silent data is also low, the section is determined not to be a silent section. In such a section, for example, the pronunciation is simply poor, and a low rating value is output. As shown in FIG. 20(a), in a silent section the scores are usually low in the second half of the last phoneme of the first word ("word1") and in the first half of the first phoneme of the second word ("word2") that follows the first word.
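
The decision described above for a special voice section (silence if the similarity to the silent data is at or above a predetermined value, otherwise simply poor pronunciation) can be sketched as follows. This is a minimal illustration that assumes the similarity of a frame group is the mean of per-frame silence posterior values; the specification does not fix how the similarity is computed, and the function name `classify_section` and the returned labels are hypothetical.

```python
def classify_section(silence_scores, similarity_threshold):
    """Classify one special voice section.

    silence_scores:       per-frame posterior values against the silent data.
    similarity_threshold: the "predetermined value" for the similarity.
    Assumption: the section similarity is the mean of the per-frame values.
    """
    if not silence_scores:
        raise ValueError("a section must contain at least one frame")
    similarity = sum(silence_scores) / len(silence_scores)
    if similarity >= similarity_threshold:
        return "silence"
    return "poor_pronunciation"


if __name__ == "__main__":
    print(classify_section([0.9, 0.8, 0.95], 0.7))
    print(classify_section([0.2, 0.1, 0.3], 0.7))
```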

そして、出力手段１１０４は、出力する評定値から、無音データの区間の評定値を考慮しないように、無視する。 Then, the output unit 1104 excludes the rating values of the silent data section from the rating values to be output, so that they are not taken into account.

そして、出力手段１１０４は、各フレームに対応する評定値を出力する。この場合、例えば、無音データの区間の評定値は、出力されない。 Then, the output unit 1104 outputs a rating value corresponding to each frame. In this case, for example, the rating value of the silent data section is not output.

かかる評定値の出力態様例は、例えば、図９、図１０である。 Examples of output modes of such rating values are shown in, for example, FIG. 9 and FIG. 10.

なお、出力手段１１０４が行う出力は、無音区間の存在を示すだけの出力でも良い。 Note that the output performed by the output unit 1104 may be an output that only indicates the presence of a silent section.

以上、本実施の形態によれば、ユーザが入力した発音を、教師データに対して、如何に似ているかを示す類似度（評定値）を算出し、出力できる。また、かかる場合、本実施の形態によれば、個人差、特に声道長の違いに影響を受けない、精度の高い評定ができる。さらに、本音声処理装置は、無音区間を考慮して類似度を評定するので、極めて正確な評定結果が得られる。 As described above, according to the present embodiment, it is possible to calculate and output the similarity (rating value) indicating how the pronunciation input by the user is similar to the teacher data. In this case, according to the present embodiment, highly accurate evaluation can be performed without being affected by individual differences, particularly differences in vocal tract length. Furthermore, since the speech processing apparatus evaluates the similarity in consideration of the silent section, a very accurate evaluation result can be obtained.

なお、無音区間のデータは、無視して評定結果を算出することは好適である。ただし、本実施の形態において、例えば、無音区間の評価の影響を他の区間と比較して少なくするなど、無視する以外の方法で、無音区間のデータを考慮して、評定値を出力しても良い。 It is preferable to ignore the data of silent sections when calculating the rating result. However, in this embodiment, the rating value may also be output by taking the data of silent sections into account in a way other than ignoring it, for example by reducing the influence of the evaluation of silent sections relative to other sections.

また、本実施の形態の具体例によれば、ＤＡＰスコアを用いて、評定値を算出したが、無音の区間を考慮して評定値を算出すれば良く、上述した他のアルゴリズム（ｔ−ｐ−ＤＡＰ等）、または、本明細書では述べていない他のアルゴリズムにより評定値を算出しても良い。つまり、本実施の形態によれば、教師データと入力音声データと特殊音声検知手段における検知結果に基づいて、音声受付部が受け付けた音声の評定を行い、特に、無音データを考慮して、評定値を算出すれば良い。 Further, according to the specific example of the present embodiment, the rating value is calculated using the DAP score. However, the rating value may be calculated in consideration of the silent section, and the other algorithm (tp) described above is used. The rating value may be calculated by DAP or the like) or another algorithm not described in this specification. In other words, according to the present embodiment, the voice received by the voice receiving unit is evaluated based on the teacher data, the input voice data, and the detection result of the special voice detection unit, and in particular, the evaluation is performed in consideration of silent data. What is necessary is just to calculate a value.

また、本実施の形態によれば、まず、ＤＡＰスコアが低い区間を検出してから、無音区間の検出をした。しかし、ＤＡＰスコアが低い区間を検出せずに、無音データとの比較により、無音区間を検出しても良い。 In addition, according to the present embodiment, first, a section having a low DAP score is detected, and then a silent section is detected. However, a silent section may be detected by comparing with silent data without detecting a section having a low DAP score.

また、上記プログラムにおいて、音声処理ステップは、前記第二音声データを、フレームに区分するフレーム区分ステップと、前記区分されたフレーム毎の音声データであるフレーム音声データを１以上得るフレーム音声データ取得ステップと、前記フレーム毎の入力音声データに基づいて、特殊な音声が入力されたことを検知する特殊音声検知ステップと、教師データと前記入力音声データと前記特殊音声検知ステップにおける検知結果に基づいて、前記受け付けた音声の評定を行う評定ステップと、前記評定ステップにおける評定結果を出力する出力ステップを具備する、ことは好適である。 In the above program, it is preferable that the audio processing step includes: a frame dividing step of dividing the second audio data into frames; a frame audio data acquisition step of obtaining one or more pieces of frame audio data, which are the audio data of each of the divided frames; a special voice detection step of detecting that a special voice has been input based on the input voice data of each frame; a rating step of rating the received voice based on the teacher data, the input voice data, and the detection result of the special voice detection step; and an output step of outputting the rating result of the rating step.

また、上記プログラムにおいて、特殊音声検知ステップは、無音を示すＨＭＭに基づくデータである無音データと、前記入力音声データに基づいて、無音の区間を検出する、ことは好適である。 In the above program, it is preferable that the special voice detection step detects a silent section based on the input voice data and on silent data, which is data based on an HMM representing silence.

（実施の形態４） (Embodiment 4)

本実施の形態において、入力音声において、特殊音声を検知し、比較対象の音声と入力音声の類似度を精度高く評定できる音声処理装置について説明する。特に、本音声処理装置は、音韻の挿入を検知できる音声処理装置である。 In the present embodiment, a speech processing apparatus that detects special speech in input speech and can accurately evaluate the similarity between the speech to be compared and the input speech will be described. In particular, the speech processing apparatus is a speech processing apparatus that can detect insertion of phonemes.

図２１は、本実施の形態における音声処理装置のブロック図である。本音声処理装置は、入力受付部１０１、教師データ格納部１０２、音声受付部１０３、教師データフォルマント周波数格納部１０４、第一サンプリング周波数格納部１０５、サンプリング部１０６、評価対象者フォルマント周波数取得部１０７、評価対象者フォルマント周波数格納部１０８、声道長正規化処理部１０９、音声処理部２１１０、発声催促部１１０９を具備する。 FIG. 21 is a block diagram of the speech processing apparatus according to this embodiment. The speech processing apparatus includes an input reception unit 101, a teacher data storage unit 102, a speech reception unit 103, a teacher data formant frequency storage unit 104, a first sampling frequency storage unit 105, a sampling unit 106, and an evaluation target person formant frequency acquisition unit 107. The evaluation subject person formant frequency storage unit 108, the vocal tract length normalization processing unit 109, the voice processing unit 2110, and the utterance prompting unit 1109 are provided.

音声処理部２１１０は、フレーム区分手段１１０１、フレーム音声データ取得手段１１０２、特殊音声検知手段２１１０１、評定手段２１１０３、出力手段２１１０４を具備する。なお、評定手段２１１０３は、最適状態決定手段１１０３１、最適状態確率値取得手段１１０３２を具備する。 The audio processing unit 2110 includes a frame classification unit 1101, a frame audio data acquisition unit 1102, a special audio detection unit 21101, a rating unit 21103, and an output unit 21104. The rating unit 21103 includes an optimum state determination unit 11031 and an optimum state probability value acquisition unit 11032.

特殊音声検知手段２１１０１は、一の音素の後半部および当該音素の次の音素の前半部の評定値が所定の条件を満たすことを検知する。後半部、および前半部の長さは問わない。特殊音声検知手段２１１０１は、通常、ＭＰＵやメモリ等から実現され得る。特殊音声検知手段２１１０１の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The special voice detection unit 21101 detects that the rating values of the second half of one phoneme and the first half of the next phoneme after the phoneme satisfy a predetermined condition. The length of the second half and the first half is not limited. The special sound detection means 21101 can be usually realized by an MPU, a memory, or the like. The processing procedure of the special sound detection means 21101 is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).

評定手段２１１０３は、特殊音声検知手段２１１０１が所定の条件を満たすことを検知した場合に、少なくとも音素の挿入があった旨を示す評定結果を構成する。なお、評定手段２１１０３は、実施の形態３で述べたアルゴリズムにより、特殊音声検知手段２１１０１が所定の条件を満たすことを検知した区間に無音が挿入されたか否かを判断し、無音が挿入されていない場合に、他の音素が挿入されたと検知しても良い。また、評定手段２１１０３は、無音が挿入されていない場合に、他の音韻ＨＭＭに対する確率値を算出し、所定の値より高い確率値を得た音韻が挿入された、との評定結果を得ても良い。なお、実施の形態３で述べた無音区間の検知は、無音音素の挿入の検知である、とも言える。評定手段２１１０３は、通常、ＭＰＵやメモリ等から実現され得る。評定手段２１１０３の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 When the special voice detection means 21101 detects that the predetermined condition is satisfied, the rating means 21103 constructs a rating result indicating at least that a phoneme has been inserted. Using the algorithm described in the third embodiment, the rating means 21103 may determine whether or not silence has been inserted in the section in which the special voice detection means 21101 detected that the predetermined condition is satisfied, and, if silence has not been inserted, may detect that another phoneme has been inserted. In addition, when silence has not been inserted, the rating means 21103 may calculate probability values for other phoneme HMMs and obtain a rating result stating that the phoneme whose probability value is higher than a predetermined value has been inserted. The detection of a silent section described in the third embodiment can also be regarded as the detection of the insertion of a silent phoneme. The rating means 21103 can usually be realized by an MPU, a memory, or the like. The processing procedure of the rating means 21103 is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (a dedicated circuit).

出力手段２１１０４は、評定手段２１１０３における評定結果を出力する。ここでの評定結果は、音素の挿入があった旨を示す評定結果を含む。また、評定結果は、音素の挿入があった場合に、所定数値分、減じられて算出された評定値（スコア）のみでも良い。また、評定結果は、音素の挿入があった旨、および評定値（スコア）の両方であっても良い。なお、教師データにおいて想定されていない音素の挿入を検知した場合、通常、評定値は低くなる。ここで、出力とは、ディスプレイへの表示、プリンタへの印字、音出力、外部の装置への送信、記録媒体への蓄積等を含む概念である。出力手段２１１０４は、ディスプレイやスピーカー等の出力デバイスを含むと考えても含まないと考えても良い。出力手段２１１０４は、出力デバイスのドライバーソフトまたは、出力デバイスのドライバーソフトと出力デバイス等で実現され得る。 The output unit 21104 outputs the rating result in the rating unit 21103. The rating result here includes a rating result indicating that a phoneme has been inserted. The rating result may be only the rating value (score) calculated by subtracting a predetermined numerical value when a phoneme is inserted. In addition, the rating result may be both that the phoneme has been inserted and the rating value (score). In addition, when the insertion of the phoneme which is not assumed in the teacher data is detected, the rating value is usually low. Here, the output is a concept including display on a display, printing on a printer, sound output, transmission to an external device, accumulation in a recording medium, and the like. The output unit 21104 may or may not include an output device such as a display or a speaker. The output means 21104 can be implemented by output device driver software, or output device driver software and an output device.

次に、音声処理装置の動作について、図２２、図２３のフローチャートを用いて説明する。なお、図２２のフローチャートは、図１２のフローチャートと比較して、ステップＳ２２０１の評定処理のみが異なるので、図２２のフローチャートの説明は省略する。ステップＳ２２０１の評定処理の詳細について、図２３のフローチャートを用いて説明する。図２３のフローチャートにおいて、図２、図１９のフローチャートの処理と同様の処理については、その説明を省略する。 Next, the operation of the speech processing apparatus will be described using the flowcharts of FIGS. 22 and 23. Note that the flowchart of FIG. 22 differs from the flowchart of FIG. 12 only in the rating process of step S2201, and thus the description of the flowchart of FIG. 22 is omitted. Details of the rating process in step S2201 will be described with reference to the flowchart of FIG. 23. In the flowchart of FIG. 23, the description of processing that is the same as in the flowcharts of FIGS. 2 and 19 is omitted.

（ステップＳ２３０１）特殊音声検知手段２１１０１は、フレームに対応するデータを一時的に蓄積するバッファにデータが格納されているか否かを判断する。なお、格納されているデータは、ステップＳ１９０２で、所定の値より低い評定値と評価されたフレーム音声データ、または当該フレーム音声データから取得できるデータである。データが格納されていればステップＳ２３０７に行き、データが格納されていなければ上位関数にリターンする。 (Step S2301) The special sound detection unit 21101 determines whether data is stored in a buffer that temporarily accumulates data corresponding to a frame. The stored data is frame audio data evaluated as a rating value lower than a predetermined value in step S1902, or data that can be acquired from the frame audio data. If the data is stored, the process goes to step S2307, and if the data is not stored, the process returns to the upper function.

（ステップＳ２３０２）特殊音声検知手段２１１０１は、バッファにデータが格納されているか否かを判断する。データが格納されていればステップＳ２３０７に行き、データが格納されていなければステップステップＳ２３０３に行く。 (Step S2302) The special sound detection unit 21101 determines whether data is stored in the buffer. If data is stored, go to step S2307, and if data is not stored, go to step S2303.

（ステップＳ２３０３）出力手段２１１０４は、ステップＳ１９０１で算出した評定値を出力する。 (Step S2303) The output means 21104 outputs the rating value calculated in step S1901.

（ステップＳ２３０４）特殊音声検知手段２１１０１は、カウンタｉを１、インクリメントする。ステップＳ２０８に戻る。 (Step S2304) The special sound detection means 21101 increments the counter i by 1. The process returns to step S208.

（ステップＳ２３０５）特殊音声検知手段２１１０１は、バッファに、所定の値より低い評定値と評価されたフレーム音声データ、または当該フレーム音声データから取得できるデータを一時蓄積する。 (Step S2305) The special sound detection means 21101 temporarily stores in the buffer frame audio data evaluated as a rating value lower than a predetermined value or data obtainable from the frame sound data.

（ステップＳ２３０６）特殊音声検知手段２１１０１は、カウンタｉを１、インクリメントする。ステップＳ２０８に戻る。 (Step S2306) The special sound detection means 21101 increments the counter i by 1. The process returns to step S208.

（ステップＳ２３０７）特殊音声検知手段２１１０１は、カウンタｊに１を代入する。 (Step S2307) The special sound detection means 21101 assigns 1 to the counter j.

（ステップＳ２３０８）特殊音声検知手段２１１０１は、ｊ番目のデータが、バッファに存在するか否かを判断する。ｊ番目のデータが存在すればステップＳ２３０９に行き、ｊ番目のデータが存在しなければステップＳ２３１５に飛ぶ。 (Step S2308) The special sound detection unit 21101 determines whether or not the j-th data exists in the buffer. If the jth data exists, the process goes to step S2309, and if the jth data does not exist, the process jumps to step S2315.

（ステップＳ２３０９）特殊音声検知手段２１１０１は、ｊ番目のデータに対応する最適状態の音素を取得する。 (Step S2309) The special voice detecting unit 21101 acquires a phoneme in an optimal state corresponding to the j-th data.

（ステップＳ２３１０）特殊音声検知手段２１１０１は、ｊ番目のデータに対する全教師データの確率値を算出し、最大の確率値を持つ音素を取得する。 (Step S2310) The special speech detection unit 21101 calculates the probability value of all the teacher data for the j-th data, and acquires the phoneme having the maximum probability value.

（ステップＳ２３１１）特殊音声検知手段２１１０１は、ステップＳ２３０９で取得した音素とステップＳ２３１０で取得した音素が異なる音素であるか否かを判断する。異なる音素であればステップＳ２３１２に行き、異なる音素でなければステップＳ２３１４に飛ぶ。 (Step S2311) The special voice detection unit 21101 determines whether the phoneme acquired in step S2309 and the phoneme acquired in step S2310 are different phonemes. If the phoneme is different, the process goes to step S2312, and if not, the process jumps to step S2314.

（ステップＳ２３１２）評定手段２１１０３は、音素の挿入があった旨を示す評定結果を構成する。 (Step S2312) The rating means 21103 constitutes a rating result indicating that a phoneme has been inserted.

（ステップＳ２３１３）特殊音声検知手段２１１０１は、カウンタｊを１、インクリメントする。ステップＳ２３０８に戻る。 (Step S2313) The special sound detection means 21101 increments the counter j by 1. The process returns to step S2308.

（ステップＳ２３１４）出力手段２１１０４は、バッファ中の全データに対応する全評定値を出力する。ここで、全評定値とは、例えば、フレーム毎のＤＡＰスコアである。ステップＳ２３１３に行く。 (Step S2314) The output means 21104 outputs all rating values corresponding to all data in the buffer. Here, the total rating value is, for example, a DAP score for each frame. Go to step S2313.

（ステップＳ２３１５）出力手段２１１０４は、評定結果に「挿入の旨」の情報が入っているか否かを判断する。「挿入の旨」の情報が入っていればステップＳ２３１６に行き、「挿入の旨」の情報が入っていなければステップＳ２３１７に行く。 (Step S2315) The output unit 21104 determines whether or not the information of “insertion” is included in the evaluation result. If “insertion” information is included, the process proceeds to step S2316, and if “insertion” information is not included, the process proceeds to step S2317.

（ステップＳ２３１６）出力手段２１１０４は、評定結果を出力する。 (Step S2316) The output means 21104 outputs a rating result.

（ステップＳ２３１７）出力手段２１１０４は、バッファをクリアする。ステップＳ２０８に戻る。 (Step S2317) The output means 21104 clears the buffer. The process returns to step S208.
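
The buffer evaluation in steps S2307 to S2317 above can be sketched as follows. This is a simplified illustration rather than a faithful transcription of FIG. 23: each buffered frame is assumed to carry the phoneme of its optimum state (S2309), the phoneme with the maximum probability value over all teacher data (S2310), and its rating value. For a frame whose two phonemes match, the sketch records only that frame's rating value, whereas step S2314 outputs the rating values for all data in the buffer. The function name `evaluate_buffer` and the dictionary keys are hypothetical.

```python
def evaluate_buffer(buffer):
    """Evaluate and clear a buffer of low-rated frames.

    buffer: list of dicts with the hypothetical keys
            "aligned" (phoneme of the optimum state),
            "best"    (phoneme with the maximum probability value),
            "score"   (rating value of the frame).
    Returns a dict with an insertion flag and the recorded rating values.
    """
    inserted = False
    scores = []
    for entry in buffer:                        # S2307, S2308, S2313: loop over j
        if entry["aligned"] != entry["best"]:   # S2309 to S2311: phonemes differ?
            inserted = True                     # S2312: note the insertion
        else:
            scores.append(entry["score"])       # simplified stand-in for S2314
    result = {"insertion": inserted, "scores": scores}  # S2315, S2316
    buffer.clear()                              # S2317: clear the buffer
    return result


if __name__ == "__main__":
    frames = [
        {"aligned": "r", "best": "r", "score": 0.3},
        {"aligned": "r", "best": "l", "score": 0.1},
    ]
    print(evaluate_buffer(frames))
    print(frames)  # the buffer is empty after evaluation
```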

なお、図２３のフローチャートにおいて、評定値の低いフレームが２つの音素に渡って存在すれば、音素の挿入があったと判断した。つまり、一の音素の後半部（少なくとも最終フレーム）および当該音素の次の音素の第一フレームの評定値が所定値より低い場合に、音素の挿入があったと判断した。しかし、図２３のフローチャートにおいて、一の音素の所定区間以上の後半部、および当該音素の次の音素の所定区間以上の前半部の評定値が所定値よりすべて低い場合に、音素の挿入があったと判断するようにしても良い。 In the flowchart of FIG. 23, a phoneme is determined to have been inserted if frames with low rating values exist across two phonemes. That is, a phoneme is determined to have been inserted when the rating values of the second half of one phoneme (at least its final frame) and of the first frame of the next phoneme are lower than a predetermined value. However, the flowchart of FIG. 23 may instead be arranged so that a phoneme is determined to have been inserted only when the rating values are all lower than the predetermined value throughout a latter part of one phoneme that spans at least a predetermined section and throughout a first part of the next phoneme that spans at least a predetermined section.
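
The basic criterion above (the final frame of one phoneme and the first frame of the next phoneme both have rating values below a predetermined value) can be sketched as a scan over phoneme boundaries. This is a minimal illustration; the function name `detect_insertions` and the per-frame phoneme index list `phoneme_index` (for example, obtained from a Viterbi alignment) are hypothetical.

```python
def detect_insertions(scores, phoneme_index, threshold):
    """Return the frame indices at which a phoneme insertion is suspected.

    scores:        per-frame rating values.
    phoneme_index: per-frame index of the phoneme each frame is aligned to.
    threshold:     the "predetermined value" for a low rating value.
    A boundary is reported (as the index of the first frame of the next
    phoneme) when the frames on both sides of it are low-rated.
    """
    boundaries = []
    for t in range(len(scores) - 1):
        crosses_boundary = phoneme_index[t] != phoneme_index[t + 1]
        both_low = scores[t] < threshold and scores[t + 1] < threshold
        if crosses_boundary and both_low:
            boundaries.append(t + 1)
    return boundaries


if __name__ == "__main__":
    print(detect_insertions([0.9, 0.2, 0.1, 0.8], [0, 0, 1, 1], 0.5))
    print(detect_insertions([0.9, 0.9, 0.1, 0.8], [0, 0, 1, 1], 0.5))
```

The stricter variant, which requires low rating values over at least a predetermined section on each side, could be obtained by checking several frames before and after the boundary instead of one.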

以下、本実施の形態における音声処理装置の具体的な動作について説明する。本実施の形態において、音素の挿入の検知を行う処理が実施の形態３等とは異なる。そこで、その異なる処理を中心に説明する。 Hereinafter, a specific operation of the speech processing apparatus according to the present embodiment will be described. In the present embodiment, the processing for detecting insertion of phonemes is different from that in the third embodiment. Therefore, the different processing will be mainly described.

まず、学習者（評価対象者）が、語学学習の開始の指示である動作開始指示を入力する。そして、音声処理装置は、当該動作開始指示を受け付け、次に、例えば、「"あ"と発声してください。」と画面出力する。 First, a learner (the person to be evaluated) inputs an operation start instruction, which is an instruction to start language learning. The speech processing apparatus accepts the operation start instruction and then displays on the screen, for example, "Please say "あ" (a)."

そして、学習者は、"あ"と発声し、音声処理装置は、当該発声から、第二ンプリング周波数「３２．１ＫＨｚ」を得る。かかる処理は、実施の形態１等において説明した処理と同様である。 Then, the learner utters “A”, and the speech processing apparatus obtains the second sampling frequency “32.1 KHz” from the utterance. Such processing is the same as the processing described in the first embodiment.

次に、声道長正規化処理部１０９は、「ｒｉｇｈｔ」の第一音声データを第二サンプリング周波数「３２．１ＫＨｚ」でリサンプリング処理する。そして、声道長正規化処理部１０９は、第二音声データを得る。次に、音声処理部２１１０は、第二音声データを、以下のように処理する。 Next, the vocal tract length normalization processing unit 109 resamples the first voice data of “right” at the second sampling frequency “32.1 KHz”. Then, the vocal tract length normalization processing unit 109 obtains second voice data. Next, the audio processing unit 2110 processes the second audio data as follows.

次に、評定手段２１１０３の最適状態決定手段１１０３１は、取得した特徴ベクトル系列を構成する各特徴ベクトルｏ_ｔに基づいて、所定のフレームの最適状態（特徴ベクトルｏ_ｔに対する最適状態）を決定する。 Next, the optimum state determination unit 11031 of the rating unit 21103 determines the optimum state of a given frame (the optimum state for the feature vector o_t) based on each feature vector o_t constituting the acquired feature vector series.

次に、評定手段２１１０３は、例えば、最適状態決定手段１１０３１が決定した最適状態を有する音韻全体の状態における１以上の確率値を取得し、当該１以上の確率値の総和をパラメータとして音声の評定値を算出する。つまり、評定手段２１１０３は、例えば、ＤＡＰスコアをフレーム毎に算出する。ここで、算出するスコアは、上述したｔ−ｐ−ＤＡＰスコア等でも良い。 Next, the rating unit 21103 acquires, for example, one or more probability values in the overall state of the phoneme having the optimal state determined by the optimal state determining unit 11031, and evaluates the speech using the sum of the one or more probability values as a parameter. Calculate the value. That is, the rating unit 21103 calculates a DAP score for each frame, for example. Here, the calculated score may be the above-described tp-DAP score or the like.

そして、特殊音声検知手段２１１０１は、算出されたフレームに対応する評定値を用いて、特殊な音声が入力されたか否かを判断する。つまり、評定値（例えば、ＤＡＰスコア）が、所定の値より低い区間が存在するか否かを判断する。 Then, the special voice detection unit 21101 determines whether or not a special voice has been input using the rating value corresponding to the calculated frame. That is, it is determined whether or not there is a section where the rating value (for example, DAP score) is lower than a predetermined value.

次に、特殊音声検知手段２１１０１は、図２４に示すように、評定値（例えば、ＤＡＰスコア）が、所定の値より低い区間が、２つの音素に跨っているか否かを判断し、２つの音素に跨がっていれば、当該区間に音素が挿入された、と判断する。なお、かかる場合の詳細なアルゴリズムの例は、図２３で説明した。また、図２４において、斜線部が、予期しない音素が挿入された区間である。 Next, as shown in FIG. 24, the special voice detection unit 21101 determines whether or not a section in which the rating value (for example, the DAP score) is lower than a predetermined value straddles two phonemes. If the section straddles two phonemes, it determines that a phoneme has been inserted in that section. An example of a detailed algorithm for this case has been described with reference to FIG. 23. In FIG. 24, the hatched area is the section in which an unexpected phoneme has been inserted.

次に、評定手段２１１０３は、音素の挿入があった旨を示す評定結果（例えば、「予期しない音素が挿入されました。」）を構成する。なお、予期しない音素が挿入された場合、評定手段２１１０３は、例えば、所定数値分、減じて、評定値を算出することは好適である。そして、出力手段２１１０４は、構成した評定結果（評定値を含んでも良い）を出力する。図２５は、評定結果の出力例である。なお、出力手段２１１０４は、通常の入力音声に対しては、上述したように評定値を出力することが好適である。 Next, the rating means 21103 constitutes a rating result (for example, “an unexpected phoneme has been inserted”) indicating that a phoneme has been inserted. When an unexpected phoneme is inserted, it is preferable that the rating unit 21103 calculates the rating value by subtracting a predetermined numerical value, for example. Then, the output unit 21104 outputs a configured rating result (which may include a rating value). FIG. 25 is an output example of the evaluation result. Note that the output means 21104 preferably outputs the rating value as described above for normal input speech.

以上、本実施の形態によれば、ユーザが入力した発音を、教師データに対して、如何に似ているかを示す類似度（評定値）を算出し、出力できる。また、かかる場合、本実施の形態によれば、個人差、特に声道長の違いに影響を受けない、精度の高い評定ができる。さらに、本音声処理装置は、特殊音声、特に、予期せぬ音素の挿入を検知できるので、極めて精度の高い評定結果が得られる。 As described above, according to the present embodiment, it is possible to calculate and output the similarity (rating value) indicating how the pronunciation input by the user is similar to the teacher data. In this case, according to the present embodiment, highly accurate evaluation can be performed without being affected by individual differences, particularly differences in vocal tract length. Furthermore, since the speech processing apparatus can detect special speech, particularly unexpected insertion of phonemes, a highly accurate evaluation result can be obtained.

なお、本実施の形態において、音素の挿入を検知できれば良く、評定値の算出アルゴリズムは問わない。評定値の算出アルゴリズムは、上述したアルゴリズム（ＤＡＰ、ｔ−ｐ−ＤＡＰ）でも良く、または、本明細書では述べていない他のアルゴリズムでも良い。 In the present embodiment, it is only necessary to detect the insertion of phonemes, and the rating value calculation algorithm is not limited. The algorithm for calculating the rating value may be the above-described algorithm (DAP, tp-DAP), or another algorithm not described in this specification.

また、上記プログラムにおいて、特殊音声検知ステップは、一の音素の後半部および当該音素の次の音素の前半部の評定値が所定の条件を満たすことを検知する、ことは好適である。 In the above program, it is preferable that the special voice detection step detects that the rating values of the second half of one phoneme and of the first half of the phoneme that follows it satisfy a predetermined condition.

（実施の形態５） (Embodiment 5)

本実施の形態において、入力音声において、特殊音声を検知し、比較対象の音声と入力音声の類似度を精度高く評定できる音声処理装置について説明する。特に、本音声処理装置は、音韻の置換を検知できる音声処理装置である。 In the present embodiment, a speech processing apparatus that detects special speech in input speech and can accurately evaluate the similarity between the speech to be compared and the input speech will be described. In particular, the speech processing apparatus is a speech processing apparatus that can detect phoneme replacement.

図２６は、本実施の形態における音声処理装置のブロック図である。本音声処理装置は、入力受付部１０１、教師データ格納部１０２、音声受付部１０３、教師データフォルマント周波数格納部１０４、第一サンプリング周波数格納部１０５、サンプリング部１０６、評価対象者フォルマント周波数取得部１０７、評価対象者フォルマント周波数格納部１０８、声道長正規化処理部１０９、音声処理部２６１０、発声催促部１１０９を具備する。 FIG. 26 is a block diagram of the speech processing apparatus according to this embodiment. The speech processing apparatus includes an input reception unit 101, a teacher data storage unit 102, a speech reception unit 103, a teacher data formant frequency storage unit 104, a first sampling frequency storage unit 105, a sampling unit 106, an evaluation subject formant frequency acquisition unit 107, an evaluation subject formant frequency storage unit 108, a vocal tract length normalization processing unit 109, a speech processing unit 2610, and an utterance prompting unit 1109.

音声処理部２６１０は、フレーム区分手段１１０１、フレーム音声データ取得手段１１０２、特殊音声検知手段２６１０１、評定手段２６１０３、出力手段２１１０４を具備する。なお、評定手段２６１０３は、最適状態決定手段１１０３１、最適状態確率値取得手段１１０３２を具備する。 The audio processing unit 2610 includes a frame classification unit 1101, a frame audio data acquisition unit 1102, a special audio detection unit 26101, a rating unit 26103, and an output unit 21104. The rating unit 26103 includes an optimum state determination unit 11031 and an optimum state probability value acquisition unit 11032.

音声処理部２６１０は、第二音声データを処理する。音声処理部２６１０は、フレーム毎の入力音声データに基づいて、特殊な音声が入力されたことを検知する特殊音声検知手段２６１０１を具備する。音声処理部２６１０は、通常、ＭＰＵやメモリ等から実現され得る。音声処理部２６１０の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The audio processing unit 2610 processes the second audio data. The audio processing unit 2610 includes special audio detection means 26101 that detects that special audio is input based on input audio data for each frame. The audio processing unit 2610 can usually be realized by an MPU, a memory, or the like. The processing procedure of the audio processing unit 2610 is usually realized by software, and the software is recorded in a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).

特殊音声検知手段２６１０１は、一の音素の評定値が所定の値より低いことを検知する。また、特殊音声検知手段２６１０１は、一の音素の評定値が所定の値より低く、かつ当該音素の直前の音素および当該音素の直後の音素の評定値が所定の値より高いことをも検知しても良い。また、特殊音声検知手段２６１０１は、一の音素の評定値が所定の値より低く、かつ、想定していない音素のＨＭＭに基づいて算出された評定値が所定の値より高いことを検知しても良い。つまり、特殊音声検知手段２６１０１は、所定のアルゴリズムで、音韻の置換を検知できれば良い。そのアルゴリズムは種々考えられる。特殊音声検知手段２６１０１は、通常、ＭＰＵやメモリ等から実現され得る。特殊音声検知手段２６１０１の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The special speech detection means 26101 detects that the rating value of one phoneme is lower than a predetermined value. The special speech detection means 26101 may also detect that the rating value of one phoneme is lower than a predetermined value and that the rating values of the phonemes immediately before and immediately after that phoneme are higher than a predetermined value. Further, the special speech detection means 26101 may detect that the rating value of one phoneme is lower than a predetermined value and that a rating value calculated from the HMM of an unexpected phoneme is higher than a predetermined value. In other words, the special speech detection means 26101 only needs to be able to detect phoneme replacement by some predetermined algorithm, and various algorithms are conceivable. The special speech detection means 26101 can usually be realized by an MPU, a memory, or the like. Its processing procedure is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (a dedicated circuit).

評定手段２６１０３は、特殊音声検知手段２６１０１が所定の条件を満たすことを検知した場合に、少なくとも音素の置換があった旨を示す評定結果を構成する。評定手段２６１０３は、音素の置換があった場合に、所定数値分、減じられて算出された評定値（スコア）を算出しても良い。評定手段２６１０３は、通常、ＭＰＵやメモリ等から実現され得る。評定手段２６１０３の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The rating unit 26103 constitutes a rating result indicating that at least the phoneme has been replaced when the special voice detecting unit 26101 detects that the predetermined condition is satisfied. The rating means 26103 may calculate a rating value (score) calculated by subtracting a predetermined numerical value when a phoneme is replaced. The rating means 26103 can usually be realized by an MPU, a memory, or the like. The processing procedure of the rating means 26103 is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).

次に、音声処理装置の動作について、図２７、図２８のフローチャートを用いて説明する。なお、図２７のフローチャートは、図１２のフローチャートと比較して、ステップＳ２７０１の評定処理のみが異なるので、図２７のフローチャートの説明は省略する。ステップＳ２７０１の評定処理の詳細について、図２８のフローチャートを用いて説明する。図２８のフローチャートにおいて、図２、図１９、図２３のフローチャートの処理と同様の処理については、その説明を省略する。 Next, the operation of the speech processing apparatus will be described using the flowcharts of FIGS. 27 and 28. The flowchart of FIG. 27 differs from the flowchart of FIG. 12 only in the rating process of step S2701, so the description of the flowchart of FIG. 27 is omitted. Details of the rating process in step S2701 will be described with reference to the flowchart of FIG. 28. In the flowchart of FIG. 28, the description of processes that are the same as those in the flowcharts of FIGS. 2, 19, and 23 is omitted.

（ステップＳ２８０１）特殊音声検知手段２６１０１は、バッファに蓄積されているデータに対応するフレーム音声データ群が一の音素に対応するか否かを判断する。一の音素であればステップＳ２８０２に行き、一の音素でなければステップＳ２８１０に行く。 (Step S2801) The special sound detection means 26101 determines whether or not the frame sound data group corresponding to the data stored in the buffer corresponds to one phoneme. If it is one phoneme, go to step S2802, and if it is not one phoneme, go to step S2810.

（ステップＳ２８０２）特殊音声検知手段２６１０１は、バッファに蓄積されているデータに対応するフレーム音声データ群の音素の直前の音素の評定値を算出する。かかる評定値は、例えば、上述したＤＡＰスコアである。なお、直前の音素とは、現在評定中の音素に対して直前の音素である。音素の区切りは、Ｖｉｔｅｒｂｉアルゴリズムにより算出できる。 (Step S2802) The special sound detection means 26101 calculates the rating value of the phoneme immediately before the phoneme of the frame sound data group corresponding to the data stored in the buffer. Such a rating value is, for example, the DAP score described above. Note that the immediately preceding phoneme is the immediately preceding phoneme with respect to the currently rated phoneme. Phoneme breaks can be calculated by the Viterbi algorithm.

（ステップＳ２８０３）特殊音声検知手段２６１０１は、ステップＳ２８０２で算出した評定値が所定の値以上であるか否かを判断する。所定の値以上であればステップＳ２８０４に行き、所定の値より小さければステップＳ２８１０に行く。 (Step S2803) The special sound detection unit 26101 determines whether or not the rating value calculated in Step S2802 is equal to or greater than a predetermined value. If it is equal to or larger than the predetermined value, the process goes to step S2804, and if it is smaller than the predetermined value, the process goes to step S2810.

（ステップＳ２８０４）特殊音声検知手段２６１０１は、直後の音素の評定値を算出する。かかる評定値は、例えば、上述したＤＡＰスコアである。直後の音素とは、現在評定中の音素に対して直後の音素である。 (Step S2804) The special speech detection means 26101 calculates the rating value of the immediately following phoneme. Such a rating value is, for example, the DAP score described above. The phoneme immediately after is the phoneme immediately after the phoneme currently being evaluated.

（ステップＳ２８０５）特殊音声検知手段２６１０１は、ステップＳ２８０４で算出した評定値が所定の値以上であるか否かを判断する。所定の値以上であればステップＳ２８０６に行き、所定の値より小さければステップＳ２８１０に行く。 (Step S2805) The special sound detection means 26101 determines whether or not the rating value calculated in step S2804 is equal to or greater than a predetermined value. If it is equal to or greater than the predetermined value, the process proceeds to step S2806, and if it is smaller than the predetermined value, the process proceeds to step S2810.

（ステップＳ２８０６）特殊音声検知手段２６１０１は、予め格納されている音韻ＨＭＭ（予期する音韻のＨＭＭは除く）の中で、所定の値以上の評定値が得られる音韻ＨＭＭが一つ存在するか否かを判断する。所定の値以上の評定値が得られる音韻ＨＭＭが存在すればステップＳ２８０７に行き、所定の値以上の評定値が得られる音韻ＨＭＭが存在しなければステップＳ２８１０に行く。なお、予め格納されている音韻ＨＭＭは、通常、すべての音韻に対する多数の音韻ＨＭＭである。なお、本ステップにおいて、予め格納されている音韻ＨＭＭの確率値を算出し、最大の確率値を持つ音素を取得し、当該音素と最適状態の音素が異なるか否かを判断し、異なる場合に音素の置換があったと判断しても良い。 (Step S2806) The special speech detection means 26101 determines whether, among the phoneme HMMs stored in advance (excluding the HMM of the expected phoneme), there is one phoneme HMM that yields a rating value equal to or higher than a predetermined value. If such a phoneme HMM exists, the process goes to step S2807; if not, the process goes to step S2810. The phoneme HMMs stored in advance are usually a large number of phoneme HMMs covering all phonemes. In this step, it is also possible to calculate the probability values of the phoneme HMMs stored in advance, obtain the phoneme with the maximum probability value, and determine whether that phoneme differs from the phoneme of the optimum state; if they differ, it may be determined that a phoneme replacement has occurred.

（ステップＳ２８０７）評定手段２６１０３は、音素の置換があった旨を示す評定結果を構成する。 (Step S2807) The rating means 26103 constitutes a rating result indicating that the phoneme has been replaced.

（ステップＳ２８０８）出力手段２１１０４は、ステップＳ２８０７で構成した評定結果を出力する。 (Step S2808) The output unit 21104 outputs the rating result configured in step S2807.

（ステップＳ２８０９）出力手段２１１０４は、バッファをクリアする。ステップＳ２０８に戻る。 (Step S2809) The output means 21104 clears the buffer. The process returns to step S208.

（ステップＳ２８１０）出力手段２１１０４は、バッファ中の全データに対応する全評定値を出力する。ステップＳ２８０９に行く。 (Step S2810) The output means 21104 outputs all rating values corresponding to all data in the buffer. Go to step S2809.
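
The decision flow of steps S2801 to S2810 can be sketched in a few lines. The following is a minimal illustration only, not the implementation of this embodiment: the class and function names are hypothetical, a single threshold stands in for every "predetermined value" (the specification does not say whether they are equal), and returning the best-scoring non-expected phoneme is an added convenience, since step S2806 only asks whether such a phoneme HMM exists.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class BufferedSection:
    """Scores for the buffered low-score section (all field names are hypothetical)."""
    within_one_phoneme: bool            # result of the S2801 check
    prev_score: float                   # rating value of the preceding phoneme (S2802)
    next_score: float                   # rating value of the following phoneme (S2804)
    other_hmm_scores: Dict[str, float]  # rating values under non-expected phoneme HMMs (S2806)


def detect_replacement(section: BufferedSection, threshold: float) -> Optional[str]:
    """Return the label of a non-expected phoneme that fits the section, or None.

    None corresponds to the branch to S2810 (output the plain rating values).
    A single threshold for all checks is an assumption of this sketch.
    """
    if not section.within_one_phoneme:      # S2801 fails
        return None
    if section.prev_score < threshold:      # S2803 fails
        return None
    if section.next_score < threshold:      # S2805 fails
        return None
    candidates = {p: s for p, s in section.other_hmm_scores.items() if s >= threshold}
    if not candidates:                      # S2806 fails
        return None
    # S2807 would now build the "phoneme replacement" rating result.
    return max(candidates, key=candidates.get)
```

For example, a section inside one phoneme whose neighbours both score 0.9 and whose audio scores 0.7 under the HMM for "l" would be reported as a replacement by "l" at a threshold of 0.5.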

以下、本実施の形態における音声処理装置の具体的な動作について説明する。本実施の形態において、音素の置換の検知を行う処理が実施の形態４等とは異なる。そこで、その異なる処理を中心に説明する。 Hereinafter, a specific operation of the speech processing apparatus according to the present embodiment will be described. In the present embodiment, the processing for detecting phoneme replacement is different from that in the fourth embodiment. Therefore, the different processing will be mainly described.

次に、声道長正規化処理部１０９は、「ｒｉｇｈｔ」の第一音声データを第二サンプリング周波数「３２．１ＫＨｚ」でリサンプリング処理する。そして、声道長正規化処理部１０９は、第二音声データを得る。次に、音声処理部２６１０は、第二音声データを、以下のように処理する。 Next, the vocal tract length normalization processing unit 109 resamples the first voice data of “right” at the second sampling frequency “32.1 KHz”. Then, the vocal tract length normalization processing unit 109 obtains second voice data. Next, the audio processing unit 2610 processes the second audio data as follows.
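
As an illustration of this resampling step, the sketch below computes a second sampling frequency from the expression given in the program description of Embodiment 7, "first sampling frequency / (teacher data formant frequency / evaluation subject formant frequency)", and then resamples a signal. The function names are hypothetical, and linear interpolation is used only to keep the example short; the specification does not state which resampling method the vocal tract length normalization processing unit 109 uses.

```python
from typing import List


def second_sampling_frequency(first_fs: float,
                              teacher_formant: float,
                              subject_formant: float) -> float:
    """Second sampling frequency = first_fs / (teacher formant / subject formant)."""
    return first_fs / (teacher_formant / subject_formant)


def resample_linear(samples: List[float], src_fs: float, dst_fs: float) -> List[float]:
    """Resample by linear interpolation (an assumption of this sketch, not the patent's method)."""
    if not samples:
        return []
    n_out = max(1, int(round(len(samples) * dst_fs / src_fs)))
    out = []
    for i in range(n_out):
        pos = i * src_fs / dst_fs          # position in the source signal
        lo = int(pos)
        if lo >= len(samples) - 1:
            out.append(samples[-1])        # hold the last sample at the end
        else:
            frac = pos - lo
            out.append(samples[lo] * (1.0 - frac) + samples[lo + 1] * frac)
    return out
```

With purely illustrative numbers, a first sampling frequency of 16000 Hz, a teacher formant of 2000 Hz, and a subject formant of 1000 Hz give a second sampling frequency of 8000 Hz.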

次に、評定手段２６１０３の最適状態決定手段１１０３１は、取得した特徴ベクトル系列を構成する各特徴ベクトルｏ_ｔに基づいて、所定のフレームの最適状態（特徴ベクトルｏ_ｔに対する最適状態）を決定する。 Next, the optimum state determination means 11031 of the rating means 26103 determines the optimum state of a given frame (the optimum state for the feature vector o_t) based on each feature vector o_t constituting the acquired feature vector series.

次に、評定手段２６１０３は、例えば、最適状態決定手段１１０３１が決定した最適状態を有する音韻全体の状態における１以上の確率値を取得し、当該１以上の確率値の総和をパラメータとして音声の評定値を算出する。つまり、評定手段２６１０３は、例えば、ＤＡＰスコアをフレーム毎に算出する。ここで、算出するスコアは、上述したｔ−ｐ−ＤＡＰスコア等でも良い。 Next, the rating unit 26103 acquires, for example, one or more probability values in the overall state of the phoneme having the optimal state determined by the optimal state determining unit 11031, and evaluates the speech using the sum of the one or more probability values as a parameter. Calculate the value. That is, the rating unit 26103 calculates a DAP score for each frame, for example. Here, the calculated score may be the above-described tp-DAP score or the like.
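
A per-frame score of this kind can be sketched as follows. This is a hedged illustration, not the DAP definition of this specification, which is given in an earlier embodiment: the sketch assumes that the sum of the probability values over all states of the phoneme containing the optimum state is normalized by the sum over every state, and the function name and the `(phoneme, state_index)` key layout are hypothetical.

```python
from typing import Dict, Tuple

State = Tuple[str, int]  # (phoneme label, state index); hypothetical layout


def phoneme_sum_score(state_probs: Dict[State, float], optimal_state: State) -> float:
    """Sum the probabilities of all states in the optimal state's phoneme.

    Dividing by the total over all states is an assumption of this sketch,
    made so that the score falls between 0 and 1.
    """
    phoneme = optimal_state[0]
    total = sum(state_probs.values())
    if total <= 0.0:
        return 0.0
    in_phoneme = sum(p for (label, _), p in state_probs.items() if label == phoneme)
    return in_phoneme / total
```

For one frame with probabilities 0.2 and 0.3 for the two states of "r" and 0.5 for a state of "l", and an optimum state inside "r", the sketch yields 0.5.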

そして、特殊音声検知手段２６１０１は、算出されたフレームに対応する評定値を用いて、特殊な音声が入力されたか否かを判断する。つまり、評定値（例えば、ＤＡＰスコア）が、所定の値より低い区間が存在するか否かを判断する。 Then, the special voice detection unit 26101 determines whether or not a special voice has been input, using a rating value corresponding to the calculated frame. That is, it is determined whether or not there is a section where the rating value (for example, DAP score) is lower than a predetermined value.
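
Finding the sections in which the per-frame rating value stays below the predetermined value is a simple run-length scan. The sketch below shows one way to do it; the function name and the half-open `(start, end)` convention are assumptions made for illustration.

```python
from typing import List, Tuple


def low_score_sections(scores: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs, end exclusive, where scores stay below threshold."""
    sections: List[Tuple[int, int]] = []
    start = None
    for t, score in enumerate(scores):
        if score < threshold:
            if start is None:
                start = t                   # a low-score run begins here
        elif start is not None:
            sections.append((start, t))     # the run ended at the previous frame
            start = None
    if start is not None:
        sections.append((start, len(scores)))  # run continues to the last frame
    return sections
```

For the frame scores `[0.9, 0.8, 0.2, 0.1, 0.3, 0.9]` and a threshold of 0.5, the only low section covers frames 2 to 4, returned as `(2, 5)`.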

次に、特殊音声検知手段２６１０１は、図２９に示すように、評定値（例えば、ＤＡＰスコア）が、所定の値より低い区間が、一つの音素内（ここでは音素２）であるか否かを判断する。そして、一つの音素内で評定値が低ければ、次に、特殊音声検知手段２６１０１は、直前の音素（音素１）および／または直後の音素（音素３）に対する評定値（例えば、ＤＡＰスコア）を算出し、当該評定値が所定の値より高ければ、音素の置換が発生している可能性があると判断する。次に、特殊音声検知手段２６１０１は、予め格納されている音韻ＨＭＭ（予期する音韻のＨＭＭは除く）の中で、所定の値以上の評定値が得られる音韻ＨＭＭが一つ存在すれば、音素の置換が発生していると判断する。なお、図２９において、音素２において、音素の置換が発生した区間である。なお、図２９において縦軸は評定値であり、当該評定値は、ＤＡＰ、ｔ−ｐ−ＤＡＰ等、問わない。 Next, as shown in FIG. 29, the special speech detection means 26101 determines whether the section in which the rating value (for example, the DAP score) is lower than a predetermined value lies within one phoneme (here, phoneme 2). If the rating value is low within one phoneme, the special speech detection means 26101 then calculates the rating value (for example, the DAP score) for the immediately preceding phoneme (phoneme 1) and/or the immediately following phoneme (phoneme 3), and, if that rating value is higher than a predetermined value, determines that a phoneme replacement may have occurred. Next, if there is one phoneme HMM among those stored in advance (excluding the HMM of the expected phoneme) that yields a rating value equal to or higher than a predetermined value, the special speech detection means 26101 determines that a phoneme replacement has occurred. In FIG. 29, phoneme 2 is the section in which the phoneme replacement occurred. The vertical axis of FIG. 29 is the rating value, which may be DAP, t-p-DAP, or any other score.

次に、評定手段２６１０３は、音素の置換があった旨を示す評定結果（例えば、「音素の置換が発生しました。」）を構成する。そして、出力手段２１１０４は、構成した評定結果を出力する。なお、出力手段２１１０４は、通常の入力音声に対しては、上述したように評定値を出力することが好適である。 Next, the rating means 26103 constitutes a rating result (for example, “phoneme replacement has occurred”) indicating that there has been a phoneme replacement. Then, the output means 21104 outputs the configured evaluation result. Note that the output means 21104 preferably outputs the rating value as described above for normal input speech.

以上、本実施の形態によれば、ユーザが入力した発音を、教師データに対して、如何に似ているかを示す類似度（評定値）を算出し、出力できる。また、かかる場合、本実施の形態によれば、個人差、特に声道長の違いに影響を受けない、精度の高い評定ができる。さらに、本音声処理装置は、特殊音声、特に、音素の置換を検知できるので、極めて精度の高い評定結果が得られる。 As described above, according to the present embodiment, it is possible to calculate and output the similarity (rating value) indicating how the pronunciation input by the user is similar to the teacher data. In this case, according to the present embodiment, highly accurate evaluation can be performed without being affected by individual differences, particularly differences in vocal tract length. Furthermore, since this speech processing apparatus can detect special speech, particularly phoneme substitution, it is possible to obtain a highly accurate evaluation result.

なお、本実施の形態において、音素の置換を検知できれば良く、評定値の算出アルゴリズムは問わない。評定値の算出アルゴリズムは、上述したアルゴリズム（ＤＡＰ、ｔ−ｐ−ＤＡＰ）でも良く、または、本明細書では述べていない他のアルゴリズムでも良い。 In the present embodiment, it is only necessary to be able to detect phoneme replacement, and the rating value calculation algorithm is not limited. The algorithm for calculating the rating value may be the above-described algorithm (DAP, tp-DAP), or another algorithm not described in this specification.

また、本実施の形態において、音素の置換の検知アルゴリズムは、他のアルゴリズムでも良い。例えば、音素の置換の検知において、所定以上の長さの区間を有することを置換区間の検知で必須としても良い。その他、置換の検知アルゴリズムの詳細は種々考えられる。 In the present embodiment, the phoneme replacement detection algorithm may be another algorithm. For example, in the detection of the replacement of phonemes, it may be essential to detect the replacement section to have a section longer than a predetermined length. In addition, various details of the replacement detection algorithm can be considered.

また、上記プログラムにおいて、特殊音声検知ステップは、一の音素の評定値が所定の条件を満たすことを検知し、特殊音声検知ステップで前記所定の条件を満たすことを検知した場合に、少なくとも音素の置換があった旨を示す評定結果を構成する、ことは好適である。 In the above program, it is preferable that the special speech detection step detects that the rating value of one phoneme satisfies a predetermined condition and that, when the special speech detection step detects that the predetermined condition is satisfied, a rating result indicating at least that a phoneme replacement has occurred is constructed.

（実施の形態６） (Embodiment 6)

本実施の形態において、入力音声において、特殊音声を検知し、比較対象の音声と入力音声の類似度を精度高く評定できる音声処理装置について説明する。特に、本音声処理装置は、音韻の欠落を検知できる音声処理装置である。 In the present embodiment, a speech processing apparatus that detects special speech in input speech and can accurately evaluate the similarity between the speech to be compared and the input speech will be described. In particular, the speech processing apparatus is a speech processing apparatus that can detect missing phonemes.

図３０は、本実施の形態における音声処理装置のブロック図である。本音声処理装置は、入力受付部１０１、教師データ格納部１０２、音声受付部１０３、教師データフォルマント周波数格納部１０４、第一サンプリング周波数格納部１０５、サンプリング部１０６、評価対象者フォルマント周波数取得部１０７、評価対象者フォルマント周波数格納部１０８、声道長正規化処理部１０９、音声処理部３０１０、発声催促部１１０９を具備する。 FIG. 30 is a block diagram of the speech processing apparatus according to this embodiment. The speech processing apparatus includes an input reception unit 101, a teacher data storage unit 102, a speech reception unit 103, a teacher data formant frequency storage unit 104, a first sampling frequency storage unit 105, a sampling unit 106, an evaluation subject formant frequency acquisition unit 107, an evaluation subject formant frequency storage unit 108, a vocal tract length normalization processing unit 109, a speech processing unit 3010, and an utterance prompting unit 1109.

音声処理部３０１０は、フレーム区分手段１１０１、フレーム音声データ取得手段１１０２、特殊音声検知手段３０１０１、評定手段３０１０３、出力手段２１１０４を具備する。なお、評定手段３０１０３は、最適状態決定手段１１０３１、最適状態確率値取得手段１１０３２を具備する。 The audio processing unit 3010 includes a frame classification unit 1101, a frame audio data acquisition unit 1102, a special audio detection unit 30101, a rating unit 30103, and an output unit 21104. The rating unit 30103 includes an optimum state determination unit 11031 and an optimum state probability value acquisition unit 11032.

特殊音声検知手段３０１０１は、一の音素の評定値が所定の値より低く、かつ当該音素の直前の音素または当該音素の直後の音素の評定値が所定の値より高いことを検知する。また、特殊音声検知手段３０１０１は、一の音素の評定値が所定の値より低く、かつ当該音素の直前の音素または当該音素の直後の音素の評定値が所定の値より高く、かつ当該音素の区間長が所定の長さよりも短いことを検知しても良い。また、特殊音声検知手段３０１０１は、直前の音素に対応する確率値、または直後の音素に対応する確率値が、当該一の音素の確率値より高いことを検知しても良い。かかる場合に、特殊音声検知手段３０１０１は、音韻の欠落を検知することは好適である。さらに、音素の区間長が所定の長さよりも短いことを欠落の条件に含めることにより、音韻の欠落の検知の精度は向上する。特殊音声検知手段３０１０１は、通常、ＭＰＵやメモリ等から実現され得る。特殊音声検知手段３０１０１の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The special speech detection means 30101 detects that the rating value of one phoneme is lower than a predetermined value and that the rating value of the phoneme immediately before or immediately after that phoneme is higher than a predetermined value. The special speech detection means 30101 may also detect that the rating value of one phoneme is lower than a predetermined value, that the rating value of the phoneme immediately before or immediately after that phoneme is higher than a predetermined value, and that the section length of the phoneme is shorter than a predetermined length. Further, the special speech detection means 30101 may detect that the probability value corresponding to the immediately preceding phoneme, or the probability value corresponding to the immediately following phoneme, is higher than the probability value of the one phoneme. In such cases, it is preferable that the special speech detection means 30101 detects a missing phoneme. Including the condition that the section length of the phoneme is shorter than a predetermined length improves the accuracy of detecting missing phonemes. The special speech detection means 30101 can usually be realized by an MPU, a memory, or the like. Its processing procedure is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (a dedicated circuit).

評定手段３０１０３は、特殊音声検知手段３０１０１が所定の条件を満たすことを検知した場合に、少なくとも音素の欠落があった旨を示す評定結果を構成する。評定手段３０１０３は、通常、ＭＰＵやメモリ等から実現され得る。評定手段３０１０３の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The rating unit 30103 constitutes a rating result indicating that at least a phoneme is missing when the special voice detecting unit 30101 detects that a predetermined condition is satisfied. The rating means 30103 can usually be realized by an MPU, a memory, or the like. The processing procedure of the rating means 30103 is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).

次に、音声処理装置の動作について、図３１、図３２のフローチャートを用いて説明する。なお、図３１のフローチャートは、図１２のフローチャートと比較して、ステップＳ３１０１の評定処理のみが異なるので、図３１のフローチャートの説明は省略する。ステップＳ３１０１の評定処理の詳細について、図３２のフローチャートを用いて説明する。図３２のフローチャートにおいて、図２、図１９、図２３、図２８のフローチャートの処理と同様の処理については、その説明を省略する。 Next, the operation of the speech processing apparatus will be described using the flowcharts of FIGS. 31 and 32. The flowchart of FIG. 31 differs from the flowchart of FIG. 12 only in the rating process of step S3101, so the description of the flowchart of FIG. 31 is omitted. Details of the rating process in step S3101 will be described with reference to the flowchart of FIG. 32. In the flowchart of FIG. 32, the description of processes that are the same as those in the flowcharts of FIGS. 2, 19, 23, and 28 is omitted.

（ステップＳ３２０１）特殊音声検知手段３０１０１は、バッファに蓄積されているデータに対して、直前の音素に対応する教師データの確率値または、直後の音素に対応する教師データの確率値が、予定されている音素に対応する教師データの確率値より高いか否かを判断する。高ければステップＳ３２０２に行き、高くなければステップＳ２８１０に行く。なお、ステップＳ３２０２に行くための条件として、バッファに蓄積されているデータに対応するフレーム音声データ群の区間長が所定の長さ以下であることを付加しても良い。 (Step S3201) For the data stored in the buffer, the special speech detection means 30101 determines whether the probability value of the teacher data corresponding to the immediately preceding phoneme, or the probability value of the teacher data corresponding to the immediately following phoneme, is higher than the probability value of the teacher data corresponding to the expected phoneme. If it is higher, the process goes to step S3202; otherwise, the process goes to step S2810. As an additional condition for going to step S3202, it may be required that the section length of the frame speech data group corresponding to the data stored in the buffer is equal to or less than a predetermined length.

（ステップＳ３２０２）評定手段３０１０３は、音素の欠落があった旨を示す評定結果を構成する。ステップＳ２８０８に行く。 (Step S3202) The rating means 30103 constitutes a rating result indicating that a phoneme is missing. Go to step S2808.

なお、図３２のフローチャートにおいて、評定対象の音素（欠落したであろう音素）の区間長が、所定の長さ（例えば、３フレーム）よりも短いことを条件としても良いし、かかる条件は無くても良い。 In the flowchart of FIG. 32, it may be made a condition that the section length of the phoneme being rated (the phoneme presumed to be missing) is shorter than a predetermined length (for example, 3 frames), or this condition may be omitted.
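
The check of step S3201, together with the optional section-length condition, can be sketched as follows. The function and parameter names are hypothetical. The text states the length condition both as "equal to or less than" and as "shorter than" a predetermined length; this sketch uses "equal to or less than", which is an arbitrary choice between the two.

```python
from typing import Optional


def detect_missing_phoneme(expected_prob: float,
                           prev_prob: float,
                           next_prob: float,
                           section_frames: int,
                           max_frames: Optional[int] = None) -> bool:
    """Return True when the buffered section looks like a missing phoneme (S3201 -> S3202).

    expected_prob: probability value under the expected phoneme's teacher data
    prev_prob / next_prob: probability values under the neighbouring phonemes
    max_frames: optional section-length limit; None disables the length condition
    """
    neighbour_fits_better = prev_prob > expected_prob or next_prob > expected_prob
    if not neighbour_fits_better:
        return False                        # branch to S2810
    if max_frames is not None and section_frames > max_frames:
        return False                        # section too long to count as missing
    return True
```

With a limit of 3 frames, a 2-frame section that the preceding phoneme explains better than the expected phoneme is flagged as missing, while a 5-frame section with the same probabilities is not.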

以下、本実施の形態における音声処理装置の具体的な動作について説明する。本実施の形態において、音素の欠落の検知を行う処理が実施の形態５等とは異なる。そこで、その異なる処理を中心に説明する。 Hereinafter, a specific operation of the speech processing apparatus according to the present embodiment will be described. In the present embodiment, the processing for detecting missing phonemes is different from that in the fifth embodiment. Therefore, the different processing will be mainly described.

次に、声道長正規化処理部１０９は、「ｒｉｇｈｔ」の第一音声データを第二サンプリング周波数「３２．１ＫＨｚ」でリサンプリング処理する。そして、声道長正規化処理部１０９は、第二音声データを得る。次に、音声処理部３０１０は、第二音声データを、以下のように処理する。 Next, the vocal tract length normalization processing unit 109 resamples the first voice data of “right” at the second sampling frequency “32.1 KHz”. Then, the vocal tract length normalization processing unit 109 obtains second voice data. Next, the audio processing unit 3010 processes the second audio data as follows.

次に、評定手段３０１０３は、例えば、最適状態決定手段１１０３１が決定した最適状態を有する音韻全体の状態における１以上の確率値を取得し、当該１以上の確率値の総和をパラメータとして音声の評定値を算出する。つまり、評定手段３０１０３は、例えば、ＤＡＰスコアをフレーム毎に算出する。ここで、算出するスコアは、上述したｔ−ｐ−ＤＡＰスコア等でも良い。 Next, the rating unit 30103 acquires, for example, one or more probability values in the overall state of the phoneme having the optimum state determined by the optimum state determining unit 11031, and evaluates the speech using the sum of the one or more probability values as a parameter. Calculate the value. That is, the rating unit 30103 calculates, for example, a DAP score for each frame. Here, the calculated score may be the above-described tp-DAP score or the like.

そして、特殊音声検知手段３０１０１は、算出されたフレームに対応する評定値を用いて、特殊な音声が入力されたか否かを判断する。つまり、評定値（例えば、ＤＡＰスコア）が、所定の値より低い区間が存在するか否かを判断する。 Then, the special voice detection unit 30101 determines whether or not a special voice has been input using the rating value corresponding to the calculated frame. That is, it is determined whether or not there is a section where the rating value (for example, DAP score) is lower than a predetermined value.

次に、特殊音声検知手段３０１０１は、図３３に示すように、評定値（例えば、ＤＡＰスコア）が、所定の値より低い区間が、一つの音素内（ここでは音素２）であるか否かを判断する。そして、一つの音素内で評定値が低ければ、特殊音声検知手段３０１０１は、直前の音素（音素１）または直後の音素（音素３）に対する評定値（例えば、ＤＡＰスコア）を算出し、当該評定値が所定の値より高ければ、音素の欠落が発生している可能性があると判断する。そして、当該区間長が、例えば、３フレーム以下の長さであれば、かかる音素は欠落したと判断する。なお、図３３において、音素２の欠落が発生したことを示す。なお、図３３において縦軸は評定値であり、当該評定値は、ＤＡＰ、ｔ−ｐ−ＤＡＰ等、問わない。また、上記区間長の所定値は、「３フレーム以下」ではなく、「５フレーム以下」でも、「６フレーム以下」でも良い。 Next, as shown in FIG. 33, the special speech detection means 30101 determines whether or not a section where the rating value (for example, DAP score) is lower than a predetermined value is within one phoneme (here, phoneme 2). Judging. If the rating value is low in one phoneme, the special speech detection unit 30101 calculates a rating value (for example, DAP score) for the immediately preceding phoneme (phoneme 1) or the immediately following phoneme (phoneme 3), and the rating. If the value is higher than a predetermined value, it is determined that there is a possibility that a phoneme is missing. If the section length is, for example, 3 frames or less, it is determined that such phonemes are missing. FIG. 33 shows that the phoneme 2 is missing. In FIG. 33, the vertical axis represents the rating value, and the rating value may be DAP, tp-DAP, or the like. Further, the predetermined value of the section length is not “3 frames or less” but may be “5 frames or less” or “6 frames or less”.

次に、評定手段３０１０３は、音素の欠落があった旨を示す評定結果（例えば、「音素の欠落が発生しました。」）を構成する。そして、出力手段２１１０４は、構成した評定結果を出力する。なお、出力手段２１１０４は、通常の入力音声に対しては、上述したように評定値を出力することが好適である。 Next, the rating means 30103 configures a rating result (for example, “phoneme missing has occurred”) indicating that a phoneme is missing. Then, the output means 21104 outputs the configured evaluation result. Note that the output means 21104 preferably outputs the rating value as described above for normal input speech.

以上、本実施の形態によれば、ユーザが入力した発音を、教師データに対して、如何に似ているかを示す類似度（評定値）を算出し、出力できる。また、かかる場合、本実施の形態によれば、個人差、特に声道長の違いに影響を受けない、精度の高い評定ができる。さらに、本音声処理装置は、特殊音声、特に、音素の欠落を検知できるので、極めて精度の高い評定結果が得られる。 As described above, according to the present embodiment, it is possible to calculate and output the similarity (rating value) indicating how the pronunciation input by the user is similar to the teacher data. In this case, according to the present embodiment, highly accurate evaluation can be performed without being affected by individual differences, particularly differences in vocal tract length. Furthermore, since this speech processing apparatus can detect special speech, particularly missing phonemes, an extremely accurate rating result can be obtained.

なお、本実施の形態において、音素の欠落を検知できれば良く、評定値の算出アルゴリズムは問わない。評定値の算出アルゴリズムは、上述したアルゴリズム（ＤＡＰ、ｔ−ｐ−ＤＡＰ）でも良く、または、本明細書では述べていない他のアルゴリズムでも良い。 In the present embodiment, it is only necessary to detect missing phonemes, and the algorithm for calculating the rating value is not limited. The algorithm for calculating the rating value may be the above-described algorithm (DAP, tp-DAP), or another algorithm not described in this specification.

また、本実施の形態において、音素の欠落の検知アルゴリズムは、他のアルゴリズムでも良い。例えば、音素の欠落の検知において、所定長さ未満の区間であることを欠落区間の検知で必須としても良いし、区間長を考慮しなくても良い。 In this embodiment, another algorithm may be used as the phoneme loss detection algorithm. For example, in the detection of missing phonemes, a section having a length less than a predetermined length may be essential in detecting the missing section, or the section length may not be considered.

また、上記プログラムにおいて、特殊音声検知ステップは、一の音素の評定値が所定の条件を満たすことを検知し、特殊音声検知ステップで前記所定の条件を満たすことを検知した場合に、少なくとも音素の欠落があった旨を示す評定結果を構成する、ことは好適である。 In the above program, it is preferable that the special speech detection step detects that the rating value of one phoneme satisfies a predetermined condition and that, when the special speech detection step detects that the predetermined condition is satisfied, a rating result indicating at least that a phoneme is missing is constructed.

（実施の形態７） (Embodiment 7)

本実施の形態における音声処理装置の音声処理は、音声認識である。 The voice processing of the voice processing apparatus in the present embodiment is voice recognition.

図３４は、本実施の形態における音声処理装置のブロック図である。 FIG. 34 is a block diagram of the speech processing apparatus according to this embodiment.

本音声処理装置は、入力受付部１０１、教師データ格納部１０２、音声受付部１０３、教師データフォルマント周波数格納部１０４、第一サンプリング周波数格納部１０５、サンプリング部１０６、評価対象者フォルマント周波数取得部１０７、評価対象者フォルマント周波数格納部１０８、声道長正規化処理部１０９、音声処理部３４１０、発声催促部１１０９を具備する。 The speech processing apparatus includes an input reception unit 101, a teacher data storage unit 102, a speech reception unit 103, a teacher data formant frequency storage unit 104, a first sampling frequency storage unit 105, a sampling unit 106, an evaluation subject formant frequency acquisition unit 107, an evaluation subject formant frequency storage unit 108, a vocal tract length normalization processing unit 109, a speech processing unit 3410, and an utterance prompting unit 1109.

音声処理部３４１０は、音声認識手段３４１０１、出力手段３４１０２を具備する。 The voice processing unit 3410 includes voice recognition means 34101 and output means 34102.

音声処理部３４１０の音声認識手段３４１０１は、第二音声データに基づいて、音声認識処理を行う。音声認識のアルゴリズムは、問わない。音声認識処理は、公知のアルゴリズムで良い。本実施の形態において、リサンプリングした第二音声データに基づいて音声認識することにより、精度の高い音声認識が可能である。音声処理部３４１０は、通常、ＭＰＵやメモリ等から実現され得る。音声処理部３４１０の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The voice recognition unit 34101 of the voice processing unit 3410 performs voice recognition processing based on the second voice data. The algorithm for speech recognition is not limited. The voice recognition process may be a known algorithm. In the present embodiment, highly accurate speech recognition is possible by performing speech recognition based on the resampled second speech data. The audio processing unit 3410 can be usually realized by an MPU, a memory, or the like. The processing procedure of the audio processing unit 3410 is usually realized by software, and the software is recorded in a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).

出力手段３４１０２は、音声認識結果を出力する。ここで、出力とは、ディスプレイへの表示、プリンタへの印字、音出力、外部の装置への送信、記録媒体への蓄積等を含む概念である。出力手段３４１０２は、ディスプレイやスピーカー等の出力デバイスを含むと考えても含まないと考えても良い。出力手段３４１０２は、出力デバイスのドライバーソフトまたは、出力デバイスのドライバーソフトと出力デバイス等で実現され得る。 The output unit 34102 outputs a voice recognition result. Here, the output is a concept including display on a display, printing on a printer, sound output, transmission to an external device, accumulation in a recording medium, and the like. The output unit 34102 may or may not include an output device such as a display or a speaker. The output unit 34102 can be realized by output device driver software, or output device driver software and an output device.

次に、音声処理装置の動作について図３５のフローチャートを用いて説明する。なお、図３７のフローチャートにおいて、図２、図１２のフローチャートの処理と同様の処理については、その説明を省略する。 Next, the operation of the speech processing apparatus will be described using the flowchart of FIG. 35. In the flowchart of FIG. 37, the description of processes that are the same as those in the flowcharts of FIGS. 2 and 12 is omitted.

（ステップＳ３５０１）音声認識手段３４１０１は、ステップＳ１２０８でリサンプリング処理され、得られた第二音声データに基づいて、音声認識処理を行う。なお、音声認識手段３４１０１は、教師データとのマッチングを取り、教師データに近い音であると認識することにより、認識結果を得る。 (Step S3501) The voice recognition unit 34101 performs voice recognition processing based on the second voice data obtained by resampling in step S1208. Note that the voice recognition unit 34101 obtains a recognition result by matching with the teacher data and recognizing that the sound is close to the teacher data.

（ステップＳ３５０２）出力手段３４１０２は、ステップＳ３５０１における音声認識結果を出力する。ステップＳ１２０６に戻る。 (Step S3502) The output means 34102 outputs the speech recognition result in step S3501. The process returns to step S1206.

以上、本実施の形態によれば、精度高く音声認識できる。 As described above, according to the present embodiment, speech recognition can be performed with high accuracy.

The software that implements the speech processing apparatus of the present embodiment is the following program. That is, the program causes a computer to execute: a sampling step of sampling received speech at a first sampling frequency to acquire first speech data; a vocal tract length normalization processing step of performing sampling processing on the speech received in the speech receiving step at a second sampling frequency, "first sampling frequency / (teacher data formant frequency / evaluation subject formant frequency)", to obtain second speech data; and a speech processing step of performing speech recognition processing based on the second speech data.
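The second sampling frequency quoted in the program description is plain arithmetic, and a short sketch may make the direction of the ratio easier to check. The function and parameter names are hypothetical, and the numbers are made up for illustration.

```python
def second_sampling_frequency(first_fs_hz, teacher_formant_hz, subject_formant_hz):
    """Second sampling frequency as quoted in the text:
    first sampling frequency / (teacher formant / subject formant).
    """
    if min(first_fs_hz, teacher_formant_hz, subject_formant_hz) <= 0:
        raise ValueError("all frequencies must be positive")
    return first_fs_hz / (teacher_formant_hz / subject_formant_hz)


# Illustrative values only: a subject whose formant lies 10% above the
# teacher's formant is resampled at a rate about 10% higher.
print(second_sampling_frequency(16000.0, 500.0, 550.0))
```

When the two formant frequencies are equal, the ratio is 1 and the second sampling frequency equals the first, so no normalization takes place.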
(Embodiment 8)

The present embodiment describes a speech processing apparatus that can rate the similarity between a comparison target speech and an input speech with high accuracy and at high speed. The apparatus is described mainly as a pronunciation rating apparatus that evaluates speech (including singing). The present embodiment further describes a speech processing apparatus capable of pronunciation rating adapted to the speaker characteristics of the evaluation subject with even higher accuracy than the speech processing apparatuses described in the preceding embodiments. Specifically, it describes a speech processing apparatus that uses a simple vocal tract length normalization method based on a least-squares error criterion and is therefore less affected by the speaker characteristics of the evaluation subject.

The speech processing apparatus of the present embodiment can be used, for example, for language learning, impersonation practice, and karaoke scoring.

FIG. 36 is a block diagram of the speech processing apparatus of the present embodiment.

The speech processing apparatus includes an input receiving unit 101, a teacher data storage unit 102, a speech receiving unit 103, a first sampling frequency storage unit 105, a sampling unit 106, a vocal tract length normalization parameter calculation unit 3601, a vocal tract length normalization parameter storage unit 3602, a vocal tract length normalization processing unit 3609, and a speech processing unit 110.

The vocal tract length normalization parameter calculation unit 3601 includes a frequency range designation information storage means 36011, a long-term cepstrum average vector storage means 36012, a second cepstrum vector sequence calculation means 36013, a cepstrum conversion means 36014, a cepstrum conversion parameter calculation means 36015, a final cepstrum conversion parameter acquisition means 36016, and a vocal tract length normalization parameter calculation means 36017.

The vocal tract length normalization parameter calculation unit 3601 calculates the vocal tract length normalization parameter based on the teacher data in the teacher data storage unit 102 and the speech received by the speech receiving unit 103. The vocal tract length normalization parameter is a sampling frequency conversion rate for absorbing the speaker characteristics (differences in vocal tract length) of the evaluation subject. In the present embodiment, the quantity calculated first is a cepstrum conversion (cepstrum warping) parameter. Because this cepstrum conversion parameter does not directly represent a conversion of vocal tract length, the vocal tract length normalization parameter calculation unit 3601 calculates the final vocal tract length normalization parameter (sampling frequency conversion rate) using an approximate conversion formula described later. The vocal tract length normalization parameter calculation unit 3601 can typically be realized by an MPU, a memory, and the like. Its processing procedure is typically implemented in software, and the software is recorded on a recording medium such as a ROM. However, it may instead be implemented in hardware (a dedicated circuit).

The frequency range designation information storage means 36011 stores frequency range designation information (W), which is information designating a frequency range. When the optimal cepstrum conversion parameter (α) described later is calculated, the frequency range designation information (W) limits the frequency range to a range in which the frequency warping produced by a bilinear transformation, such as a first-order all-pass function, can be approximated by linear frequency expansion or contraction. This range preferably extends from zero frequency to about 1/3 to 1/2 of the Nyquist rate. However, because the bilinear transformation is performed in the cepstrum domain rather than in the spectral domain, the frequency range designation information (W) is preferably, for example, the matrix information described below.

The frequency range designation information is calculated, for example, by a frequency range designation information calculation means (not shown) as follows. Let F_s (Hz) be the sampling frequency, F_max (Hz) the highest frequency of the designated frequency range, N a sufficiently large natural number (such as 512 or 1024), and N_m = N × F_max / F_s. The frequency range designation information calculation means calculates the (i, j) component of the frequency range designation matrix W for cepstrum vectors according to Formula 5. Specifically, it reads the sampling frequency (F_s), the highest frequency (F_max), and the predetermined, sufficiently large natural number (N) stored on a recording medium of the computer (not shown). It then reads the arithmetic expression "N_m = N × F_max / F_s" that it holds, substitutes the read F_s, F_max, and N into the expression, and calculates N_m. Next, it reads the stored information of Formula 5 and calculates {W}_i,j by loop processing (a double loop), incrementing i and j by 1 starting from 0. It then stores all of the calculated {W}_i,j in the frequency range designation information storage means 36011, at least temporarily. In Formula 5, "k" is a frequency index with the range k = 0, 1, 2, ..., N/2, and "n" is a discrete-time index with the range n = ..., −2, −1, 0, 1, 2, ....
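The only part of this calculation that is fully spelled out in the text is the bin limit N_m = N × F_max / F_s. The sketch below computes it and checks that F_max lies inside the preferred range of 1/3 to 1/2 of the Nyquist rate. It assumes that "Nyquist rate" here means F_s / 2, and the function names are hypothetical. Formula 5 itself is not reproduced in this text, so the matrix W is not computed.

```python
def frequency_bin_limit(n_points, f_max_hz, fs_hz):
    """Return N_m = N * F_max / F_s as stated in the text."""
    if n_points <= 0 or fs_hz <= 0:
        raise ValueError("N and F_s must be positive")
    if not 0 < f_max_hz <= fs_hz / 2:
        raise ValueError("F_max must lie in (0, F_s / 2]")
    return n_points * f_max_hz / fs_hz


def in_preferred_range(f_max_hz, fs_hz):
    """True if F_max is between 1/3 and 1/2 of F_s / 2 (assumed reading)."""
    nyquist = fs_hz / 2
    return nyquist / 3 <= f_max_hz <= nyquist / 2


# Illustrative values: N = 1024, F_s = 16 kHz, F_max = 4 kHz.
print(frequency_bin_limit(1024, 4000.0, 16000.0))
print(in_preferred_range(4000.0, 16000.0))
```

Whether N_m must be rounded to an integer before it is used as an index in Formula 5 is not stated in the text, so the sketch returns the unrounded ratio.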

The data structure of the frequency range designation information is not limited. The frequency range designation information storage means 36011 is preferably a non-volatile recording medium, but it can also be realized by a volatile recording medium.

The long-term cepstrum average vector storage means 36012 stores a long-term cepstrum average vector (μ). The long-term cepstrum average vector (μ) is the time average of a first cepstrum vector sequence calculated by short-interval analysis from data constituting the teacher data. The first cepstrum vector sequence (x_t (t = 1, 2, ..., T_0)) is usually obtained by short-interval analysis of a single phoneme (for example, /u/) constituting the teacher data. The long-term cepstrum average vector (μ) is then calculated from the vectors (x_t) by Formula 6.

Each cepstrum vector has M+1 dimensions, including the 0th-order coefficient, and the vector (x_t) and the vector (μ) are expressed by Formula 7 and Formula 8, respectively.

In Formulas 7 and 8, (...)^T denotes the transpose of a matrix or vector.

The first cepstrum vector sequence (x_t) may be calculated by a first cepstrum vector sequence calculation means (not shown) by short-interval analysis of data constituting the teacher data (a single phoneme, for example /u/).

A long-term cepstrum average vector acquisition means (not shown) may calculate the time average of the first cepstrum vector sequence (x_t) by Formula 6 to obtain the long-term cepstrum average vector (μ). The long-term cepstrum average vector storage means 36012 is preferably a non-volatile recording medium, but it can also be realized by a volatile recording medium.
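Since μ is described as the time average of the first cepstrum vector sequence x_t (t = 1, ..., T_0), the computation amounts to a per-dimension mean over frames. The sketch below shows that reading. Formula 6 is not reproduced in this text, so the plain arithmetic mean and the function name are assumptions, and the frame values are made up.

```python
def long_term_cepstrum_mean(frames):
    """Per-dimension time average of a sequence of cepstrum vectors.

    `frames` is a list of T_0 vectors, each of length M + 1 (the 0th-order
    coefficient is included, as in the text).
    """
    if not frames:
        raise ValueError("at least one frame is required")
    dim = len(frames[0])
    if any(len(frame) != dim for frame in frames):
        raise ValueError("all frames must have the same dimension")
    count = len(frames)
    return [sum(frame[i] for frame in frames) / count for i in range(dim)]


# Three illustrative frames with M + 1 = 3 coefficients each.
x = [[1.0, 0.5, -0.2], [1.2, 0.4, -0.1], [0.8, 0.6, -0.3]]
print(long_term_cepstrum_mean(x))
```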

The second cepstrum vector sequence calculation means 36013 calculates a second cepstrum vector sequence (C_t) by short-interval analysis from the speech received by the speech receiving unit 103 (usually a single phoneme, for example /u/). The second cepstrum vector sequence calculation means 36013 may calculate the second cepstrum vector sequence (C_t) from the first speech data obtained by sampling in the sampling unit 106; in that case, too, the second cepstrum vector sequence (C_t) can be said to have been calculated by short-interval analysis from the speech received by the speech receiving unit 103. Analyzing a phoneme over short intervals to calculate a cepstrum vector sequence is a known technique, so its description is omitted. The second cepstrum vector sequence calculation means 36013 can typically be realized by an MPU, a memory, and the like. Its processing procedure is typically implemented in software, and the software is recorded on a recording medium such as a ROM. However, it may instead be implemented in hardware (a dedicated circuit). The phoneme (for example, /u/) that the speech receiving unit 103 receives and the second cepstrum vector sequence calculation means 36013 processes is the same phoneme as the speech from which the long-term cepstrum average vector is derived. The second cepstrum vector sequence (C_t) is likewise an (M+1)-dimensional vector including the 0th-order coefficient and is expressed by Formula 9.
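The text treats short-interval cepstrum analysis as known art and does not say which variant is used. As one possible illustration, the sketch below computes an FFT-style real cepstrum (inverse transform of the log magnitude spectrum) frame by frame with a naive DFT. The frame length, hop size, absence of a window function, spectral floor, and all names are assumptions. A practical implementation would normally window each frame and use an FFT.

```python
import cmath
import math


def real_cepstrum(frame, order, floor=1e-12):
    """Real cepstrum coefficients c[0..order] of one frame (naive DFT)."""
    n = len(frame)
    if not 0 <= order < n:
        raise ValueError("order must satisfy 0 <= order < len(frame)")
    log_mag = []
    for k in range(n):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        log_mag.append(math.log(max(abs(s), floor)))  # floor avoids log(0)
    # The log magnitude spectrum of a real frame is even in k, so the
    # inverse DFT reduces to a cosine sum.
    return [
        sum(v * math.cos(2 * math.pi * k * q / n) for k, v in enumerate(log_mag)) / n
        for q in range(order + 1)
    ]


def cepstrum_sequence(samples, frame_len, hop, order):
    """Cepstrum vectors C_t for successive, possibly overlapping frames."""
    return [
        real_cepstrum(samples[start:start + frame_len], order)
        for start in range(0, len(samples) - frame_len + 1, hop)
    ]


# A unit impulse has a flat magnitude spectrum, so its cepstrum is zero.
print(real_cepstrum([1.0] + [0.0] * 7, 3))
```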

For the processing of the cepstrum conversion parameter calculation means 36015 described later, the second cepstrum vector sequence calculation means 36013 preferably applies the processing of Formula 10 to the calculated second cepstrum vector sequence (C_t) to obtain vectors (C_t^-) (t = 1, 2, ..., T).

The cepstrum conversion means 36014 linearly transforms the second cepstrum vector sequence (C_t) using a matrix (F(α)) whose elements are formed from the cepstrum conversion parameter (α), and calculates a frequency-warped third cepstrum vector sequence (O_t). Specifically, the cepstrum conversion means 36014 first sets an initial value (α~) of the cepstrum conversion parameter (α). The initial value (α~) is preferably an approximation of the optimal value of α. Normally α = 0 is used, but if, for example, the optimal value can be expected to satisfy α > 0, a small positive α may be used. The cepstrum conversion means 36014 stores the initial value (α~) in advance in a storage medium, a memory, or the like. Next, taking the given cepstrum conversion parameter (α) as a parameter, the cepstrum conversion means 36014 calculates the frequency-warped third cepstrum vector sequence (O_t) by Formula 11. The first quantity the cepstrum conversion means 36014 calculates is (O_t(α~)).
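Formula 11 and the matrix F(α) are not reproduced in this text. As a stand-in, the sketch below applies a first-order all-pass (bilinear) frequency warping directly to cepstrum coefficients using the well-known recursion attributed to Oppenheim, which is also a linear map of the input vector. Whether it coincides with the F(α) of the embodiment, including the sign convention of α and any truncation, is an assumption. The function names are hypothetical.

```python
def warp_cepstrum(c, alpha, out_order=None):
    """Frequency-warp cepstrum `c` (c[0..m1]) with all-pass parameter alpha.

    Returns out_order + 1 warped coefficients (default: same length as c).
    """
    m1 = len(c) - 1
    m2 = m1 if out_order is None else out_order
    if m1 < 0 or m2 < 0:
        raise ValueError("input and output orders must be non-negative")
    beta = 1.0 - alpha * alpha
    g = [0.0] * (m2 + 1)
    # Feed the input coefficients from the highest index down to c[0].
    for i in range(m1, -1, -1):
        d = g[:]  # state from the previous pass
        g[0] = c[i] + alpha * d[0]
        if m2 >= 1:
            g[1] = beta * d[0] + alpha * d[1]
        for j in range(2, m2 + 1):
            g[j] = d[j - 1] + alpha * (d[j] - g[j - 1])
    return g


def warp_sequence(frames, alpha):
    """Apply the same warping to every vector C_t of a sequence."""
    return [warp_cepstrum(frame, alpha) for frame in frames]


# With alpha = 0 the warping is the identity.
print(warp_cepstrum([1.0, 0.5, -0.2], 0.0))
```

Because the recursion is linear in `c`, it can be written as a matrix that depends only on α, which is the form the text describes for F(α).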

The cepstrum conversion means 36014 can typically be realized by an MPU, a memory, and the like. Its processing procedure is typically implemented in software, and the software is recorded on a recording medium such as a ROM. However, it may instead be implemented in hardware (a dedicated circuit).

The cepstrum conversion parameter calculation means 36015 calculates the cepstrum conversion parameter (α) based on the long-term cepstrum average vector (μ) and the third cepstrum vector sequence (O_t). More preferably, the cepstrum conversion parameter calculation means 36015 calculates the cepstrum conversion parameter based on the long-term cepstrum average vector and the third cepstrum vector sequence (O_t) within the frequency range indicated by the frequency range designation information (W).

Specifically, the cepstrum conversion parameter calculation means 36015 first calculates a vector (u_t(α)) by Formula 12. It then calculates the optimal value (α*) of α by Formula 13. This optimal value (α*) is the optimal value in the current iteration step.

The cepstrum conversion parameter calculation means 36015 can typically be realized by an MPU, a memory, and the like. Its processing procedure is typically implemented in software, and the software is recorded on a recording medium such as a ROM. However, it may instead be implemented in hardware (a dedicated circuit).

The final cepstrum conversion parameter acquisition means 36016 causes the processing in the cepstrum conversion means 36014 and the processing in the cepstrum conversion parameter calculation means 36015 to be repeated in accordance with a predetermined rule, and obtains the final optimal cepstrum conversion parameter. Here, the predetermined rule is, for example, that the processing has been repeated a predetermined number of times. The predetermined rule may also be, for example, that the change in the optimal value (α*) of α has become smaller than a predetermined value (for example, the difference between the previous α* and the current α* is at or below a threshold). Other rules may also be used. The final cepstrum conversion parameter acquisition means 36016 can typically be realized by an MPU, a memory, and the like. Its processing procedure is typically implemented in software, and the software is recorded on a recording medium such as a ROM. However, it may instead be implemented in hardware (a dedicated circuit).

The vocal tract length normalization parameter calculation means 36017 calculates the vocal tract length normalization parameter (γ) based on the cepstrum conversion parameter obtained by the final cepstrum conversion parameter acquisition means 36016. More specifically, the vocal tract length normalization parameter calculation means 36017 first calculates a linear frequency expansion/contraction ratio (ρ) from the cepstrum conversion parameter obtained by the final cepstrum conversion parameter acquisition means 36016, for example according to Formula 14.

Next, the vocal tract length normalization parameter calculation means 36017 calculates the vocal tract length normalization parameter (γ) from the linear frequency expansion/contraction ratio (ρ), for example according to Formula 15.
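Formulas 14 and 15 are not reproduced in this text, so the following is only a sketch of how a linear ratio can be read off a first-order all-pass warping. For the all-pass function (z^-1 − α) / (1 − α z^-1), the warped frequency is ω + 2·atan(α·sin ω / (1 − α·cos ω)), and its slope at ω = 0 is (1 + α) / (1 − α). Using that slope as ρ, and taking γ = 1/ρ, are assumptions made here for illustration. The sign convention of α in the embodiment may differ.

```python
import math


def allpass_warp(omega, alpha):
    """Warped angular frequency for a first-order all-pass with parameter alpha."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega), 1.0 - alpha * math.cos(omega))


def low_frequency_ratio(alpha):
    """Slope of the warping curve at zero frequency (assumed stand-in for rho)."""
    if not -1.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy |alpha| < 1")
    return (1.0 + alpha) / (1.0 - alpha)


def normalization_parameter(alpha):
    """Assumed gamma = 1 / rho."""
    return 1.0 / low_frequency_ratio(alpha)


alpha = 0.05  # illustrative value
print(low_frequency_ratio(alpha), normalization_parameter(alpha))
# Near zero frequency the warping is close to the straight line rho * omega.
print(allpass_warp(0.1, alpha), low_frequency_ratio(alpha) * 0.1)
```

The approximation degrades as ω approaches the Nyquist frequency, where the warping curve bends back to ω = π. This is consistent with the text restricting the fit to the lower part of the band.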

The vocal tract length normalization parameter calculation means 36017 can typically be realized by an MPU, a memory, and the like. Its processing procedure is typically implemented in software, and the software is recorded on a recording medium such as a ROM. However, it may instead be implemented in hardware (a dedicated circuit).

The vocal tract length normalization parameter storage unit 3602 stores the vocal tract length normalization parameter (γ), which is information on the conversion rate of the sampling frequency. The vocal tract length normalization parameter (γ) is the parameter obtained by the vocal tract length normalization parameter calculation unit 3601. The vocal tract length normalization parameter storage unit 3602 is preferably a non-volatile recording medium, but it can also be realized by a volatile recording medium. Here, the vocal tract length normalization parameter calculation unit 3601 also performs the processing of storing the calculated vocal tract length normalization parameter (γ) in the vocal tract length normalization parameter storage unit 3602.

The vocal tract length normalization processing unit 3609 performs sampling processing on the speech received by the speech receiving unit 103 at a second sampling frequency calculated from the vocal tract length normalization parameter (γ) and the first sampling frequency, and obtains second speech data. Processing "the speech received by the speech receiving unit 103" covers two cases: resampling the first speech data obtained by the sampling unit 106 at the second sampling frequency to obtain the second speech data, and sampling the speech received by the speech receiving unit 103 directly at the second sampling frequency, without the sampling unit 106 acquiring first speech data, to obtain the second speech data. The vocal tract length normalization processing unit 3609 can typically be realized by an MPU, a memory, and the like. Its processing procedure is typically implemented in software, and the software is recorded on a recording medium such as a ROM. However, it may instead be implemented in hardware (a dedicated circuit).

Next, the operation of the speech processing apparatus will be described. First, the processing by which the apparatus calculates the vocal tract length normalization parameter is described with reference to the flowchart of FIG. 37. This vocal tract length normalization parameter calculation processing does not necessarily have to be performed by the speech processing apparatus.

(Step S3701) The speech processing apparatus performs initialization processing. The initialization processing consists of, for example, processing that prompts the user (evaluation subject) to utter "/u/" (for example, displaying 'Please pronounce "u".' on the display), and processing that reads the frequency range designation information from the frequency range designation information storage means 36011 and the long-term cepstrum average vector from the long-term cepstrum average vector storage means 36012.

(Step S3702) The speech receiving unit 103 determines whether speech has been received from the evaluation subject. If speech has been received, the process goes to step S3703; if not, the process returns to step S3702.

(Step S3703) The sampling unit 106 samples the speech received in step S3702, obtains first speech data, and stores it in memory, at least temporarily. Sampling speech is a known technique.

(Step S3704) The second cepstrum vector sequence calculation means 36013 acquires the first speech data obtained in step S3703, calculates the second cepstrum vector sequence (C_t) from the first speech data by short-interval analysis, and stores the second cepstrum vector sequence (C_t) in memory, at least temporarily.

(Step S3705) The second cepstrum vector sequence calculation means 36013 acquires the second cepstrum vector sequence (C_t) calculated in step S3704, obtains the vectors (C_t^-) (t = 1, 2, ..., T) from the second cepstrum vector sequence (C_t), and stores the vectors (C_t^-) in memory, at least temporarily.

(Step S3706) The cepstrum conversion means 36014 reads the initial value (α~) of the cepstrum conversion parameter stored in advance and sets the variable (α) to (α~).

(Step S3707) The cepstrum conversion means 36014 reads the second cepstrum vector sequence (C_t), linearly transforms it using the matrix whose elements are formed from the cepstrum conversion parameter (α), calculates the frequency-warped third cepstrum vector sequence (O_t), and stores it in memory, at least temporarily.

(Step S3708) The cepstrum conversion parameter calculation means 36015 reads the third cepstrum vector sequence (O_t) calculated in step S3707 and the vectors (C_t^-) calculated in step S3705, and calculates the vector (u_t(α)) from them (see Formula 12). Needless to say, the stored information of Formula 12 is read and evaluated when the vector (u_t(α)) is calculated.

(Step S3709) The cepstrum conversion parameter calculation means 36015 reads the information of Formula 13 stored in advance, supplies to it the vector (u_t(α)), the long-term cepstrum average vector (μ), the vectors (C_t^-), and the frequency range designation information (W) (including W^T and the like), evaluates Formula 13, and calculates the optimal value (α*) of α. This optimal value (α*) is the optimal value for the current pass of the loop. The cepstrum conversion parameter calculation means 36015 then stores the optimal value (α*) in memory, at least temporarily.

(Step S3710) The final cepstrum conversion parameter acquisition means 36016 determines whether a predetermined rule is satisfied (the rule information is stored in advance). If the rule is satisfied, the process goes to step S3711; if not, the process goes to step S3713. As described above, the rule is, for example, that the loop processing (calculating the optimal value (α*) of α and substituting α* for α) has been repeated a predetermined number of times (this number is stored in advance in a memory or the like).

(Step S3711) The vocal tract length normalization parameter calculation means 36017 acquires the cepstrum conversion parameter (the final α*) obtained by the final cepstrum conversion parameter acquisition means 36016, supplies it to the arithmetic expression of Formula 14, and calculates the linear frequency expansion/contraction ratio (ρ). The information of the arithmetic expression of Formula 14 is stored in advance by the vocal tract length normalization parameter calculation means 36017. The vocal tract length normalization parameter calculation means 36017 stores the calculated linear frequency expansion/contraction ratio (ρ) in memory, at least temporarily.

(Step S3712) The vocal tract length normalization parameter calculation means 36017 reads the linear frequency expansion/contraction ratio (ρ), calculates the vocal tract length normalization parameter (γ) from it, and stores the result in memory, at least temporarily. For example, the vocal tract length normalization parameter calculation means 36017 reads the information of Formula 15 stored in advance, substitutes the linear frequency expansion/contraction ratio (ρ), evaluates the arithmetic expression of Formula 15, and obtains the vocal tract length normalization parameter (γ).

(Step S3713) The final cepstrum conversion parameter acquisition means 36016 substitutes the (α*) calculated in step S3709 for α. The process then returns to step S3707.
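The control flow of steps S3706 through S3713 can be summarized as a fixed-point style loop with the two stopping rules mentioned in the text: a maximum number of repetitions, or a sufficiently small change in α*. In the sketch below, the callable `update` stands in for steps S3707 to S3709 (warping, Formula 12, Formula 13), which are not reproduced here. The function name, default values, and the toy update rule are assumptions.

```python
def refine_alpha(update, alpha_init=0.0, max_iterations=20, tolerance=1e-6):
    """Repeat alpha <- update(alpha) until a stopping rule is met.

    Returns the final alpha and the number of passes performed.
    """
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    alpha = alpha_init                      # S3706: alpha starts at the initial value
    passes = 0
    for passes in range(1, max_iterations + 1):
        alpha_star = update(alpha)          # S3707 to S3709: optimum for this pass
        converged = abs(alpha_star - alpha) <= tolerance
        alpha = alpha_star                  # S3713: substitute alpha* for alpha
        if converged:                       # S3710: change is small enough
            break
    return alpha, passes


# Toy update rule with fixed point 0.2, for illustration only.
final_alpha, passes = refine_alpha(lambda a: 0.5 * a + 0.1)
print(final_alpha, passes)
```

Setting `tolerance=0.0` reduces the rule to a pure repetition count, which corresponds to the first example of the predetermined rule.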

The operation of the speech processing apparatus (speech rating processing) is described with reference to the flowcharts of FIGS. 38 and 39. The flowchart of FIG. 38 differs from the operation of the speech processing apparatus described with reference to FIG. 2 only in the content of the vocal tract length normalization processing of step S204.

Next, the content of the vocal tract length normalization process will be described with reference to the flowchart of FIG. 39.

(Step S3901) The vocal tract length normalization processing unit 3609 reads the first sampling frequency from the first sampling frequency storage unit 105.

(Step S3902) The vocal tract length normalization processing unit 3609 reads the vocal tract length normalization parameter (γ) from the vocal tract length normalization parameter storage unit 3602.

(Step S3903) The vocal tract length normalization processing unit 3609 reads information on a predetermined arithmetic expression for calculating the second sampling frequency (F_2), substitutes the vocal tract length normalization parameter (γ) and the first sampling frequency (F_1) into it as parameters, and evaluates the expression to obtain F_2. The predetermined expression is, for example, F_2 = F_1 × γ.

(Step S3904) The vocal tract length normalization processing unit 3609 resamples the first speech data at the second sampling frequency to obtain second speech data.
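Steps S3903 and S3904 can be illustrated by the following minimal sketch. It assumes the example expression F_2 = F_1 × γ and, purely for brevity, linear interpolation as the resampling method; the description does not specify the interpolation method, and the function names are hypothetical.

```python
def second_sampling_frequency(f1_hz, gamma):
    """Second sampling frequency F_2 = F_1 * gamma (example expression of step S3903)."""
    return f1_hz * gamma


def resample_linear(samples, f1_hz, f2_hz):
    """Resample `samples` (taken at f1_hz) to f2_hz.

    Linear interpolation is an assumption made for this sketch; a practical
    implementation would normally use a band-limited (anti-aliasing) resampler.
    """
    if not samples:
        return []
    # Number of output samples covering the same time span as the input.
    n_out = int((len(samples) - 1) * f2_hz / f1_hz + 1e-9) + 1
    out = []
    for k in range(n_out):
        pos = k * f1_hz / f2_hz          # position in units of input samples
        i = int(pos)
        frac = pos - i
        if i + 1 < len(samples):
            out.append(samples[i] * (1.0 - frac) + samples[i + 1] * frac)
        else:
            out.append(samples[-1])      # last sample: nothing to interpolate towards
    return out
```

For example, with F_1 = 8000 Hz and γ = 0.5, the second sampling frequency is 4000 Hz and every other input sample is retained.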

In the flowchart of FIG. 38, the process ends when the power is turned off or a process-termination interrupt occurs.

The concept of the vocal tract length normalization parameter calculation process in the speech processing apparatus of this embodiment will now be described with reference to the conceptual diagram of FIG. 40. The parameter is determined so as to minimize the squared error between the average cepstrum vector μ of a phoneme (a phoneme in the teacher data) of the speaker used for system design (the reference speaker) and the transformed cepstrum vector O_t of the user's utterance of the same phoneme. However, the parameter obtained in this way (the optimized α) is a cepstrum conversion (cepstrum warping) parameter and does not directly represent a transformation of the vocal tract length, so the final vocal tract length normalization parameter (sampling frequency conversion rate) γ is calculated using the approximate conversion formula (1/ρ). The vocal tract length normalization parameter is calculated using Formulas 5 to 15 above.

As described above, this embodiment can calculate and output a similarity (rating value) indicating how closely the pronunciation input by the user resembles the teacher data. Moreover, the rating is highly accurate because it is not affected by individual differences, in particular differences in vocal tract length.

In this embodiment, the speech processing apparatus performs both the process of calculating the vocal tract length normalization parameter and the vocal tract length normalization process. However, the two processes may be performed by different apparatuses. In that case, the speech processing apparatus comprises: a speech reception unit that receives speech; a first sampling frequency storage unit that stores a first sampling frequency; a vocal tract length normalization parameter storage unit that stores a vocal tract length normalization parameter, which is information on the conversion rate of the sampling frequency; a vocal tract length normalization processing unit that obtains second speech data by performing sampling processing on the speech received by the speech reception unit at a second sampling frequency calculated using the vocal tract length normalization parameter and the first sampling frequency as parameters; and a speech processing unit that processes the second speech data. The apparatus that calculates the vocal tract length normalization parameter comprises: long-time cepstrum average vector storage means that stores a long-time cepstrum average vector, which is the time average of a first cepstrum vector sequence calculated by short-interval analysis from the data constituting the teacher data; second cepstrum vector sequence calculation means that calculates a second cepstrum vector sequence from received speech by short-interval analysis; cepstrum conversion means that linearly transforms the second cepstrum vector sequence using a matrix having the cepstrum conversion parameter as its elements, thereby calculating a frequency-warped third cepstrum vector sequence; cepstrum conversion parameter calculation means that calculates the cepstrum conversion parameter based on the long-time cepstrum average vector and the third cepstrum vector sequence; final cepstrum conversion parameter acquisition means that, based on a predetermined rule, causes the processing in the cepstrum conversion means and in the cepstrum conversion parameter calculation means to be repeated and obtains the cepstrum conversion parameter; and vocal tract length normalization parameter calculation means that calculates the vocal tract length normalization parameter based on the cepstrum conversion parameter obtained by the final cepstrum conversion parameter acquisition means.

In this embodiment, the frequency range is limited when the cepstrum conversion parameter is calculated, but this limitation is not essential. The same applies to Embodiments 9 and 10.

In this embodiment, the speech processing unit performs pronunciation rating based on DAP. However, pronunciation rating may be performed by other algorithms, for example t-p-DAP described in Embodiment 2, the similarity rating that takes silent intervals into account described in Embodiment 3, the similarity rating that takes phoneme insertion into account described in Embodiment 4, the similarity rating that takes phoneme substitution into account described in Embodiment 5, or the similarity rating that takes phoneme omission into account described in Embodiment 6. The same applies to Embodiments 9 and 10.

In this embodiment, the speech processing unit of the speech processing apparatus mainly performs pronunciation rating, but it may instead perform speech recognition based on the second speech data. The same applies to Embodiments 9 and 10.

In this embodiment, the following processing performed by the speech processing apparatus may be performed by a single DSP (digital signal processor). That is, the DSP is a digital signal processor comprising: a first sampling frequency storage unit that stores a first sampling frequency; a vocal tract length normalization parameter storage unit that stores a vocal tract length normalization parameter, which is information on the conversion rate of the sampling frequency; and a vocal tract length normalization processing unit that obtains second speech data by performing sampling processing on the speech received by the speech reception unit at a second sampling frequency calculated using the vocal tract length normalization parameter and the first sampling frequency as parameters. The same applies to Embodiments 9 and 10.

The software that implements the speech processing apparatus of this embodiment is the following program. That is, the program causes a computer to execute: a speech reception step of receiving speech; a vocal tract length normalization processing step of obtaining second speech data by performing sampling processing on the speech received in the speech reception step at a second sampling frequency calculated using a stored vocal tract length normalization parameter and a stored first sampling frequency as parameters; and a speech processing step of processing the second speech data.

In the above program, it is preferable that one or more pieces of teacher data are stored, the teacher data being data on the speech to be compared against and being data for each of one or more phonemes; that the program further causes the computer to execute a vocal tract length normalization parameter calculation step of calculating the vocal tract length normalization parameter based on the teacher data and the speech received in the speech reception step; and that the stored vocal tract length normalization parameter is the one calculated in the vocal tract length normalization parameter calculation step.

(Embodiment 9)

This embodiment describes a speech processing apparatus that can rate the similarity between the speech to be compared against and the input speech accurately and at high speed. The apparatus is described mainly as a pronunciation rating apparatus that evaluates speech (including singing). This embodiment further describes a speech processing apparatus that rates pronunciation in accordance with the speaker characteristics of the person being evaluated with even higher accuracy than the apparatuses of the preceding embodiments. Specifically, it describes a speech processing apparatus that is insensitive to the speaker characteristics of the person being evaluated, using a simple vocal tract length normalization method based on an appearance probability maximization criterion. The speech processing apparatus of this embodiment differs from that of Embodiment 8 in the algorithm used to calculate the vocal tract length normalization parameter.

FIG. 41 is a block diagram of the speech processing apparatus of this embodiment.

This speech processing apparatus differs from the speech processing apparatus of FIG. 36 in its vocal tract length normalization parameter calculation unit 4101.

The vocal tract length normalization parameter calculation unit 4101 comprises frequency range designation information storage means 41011, learning acoustic data storage means 41012, second cepstrum vector sequence calculation means 36013, cepstrum conversion means 36014, occupancy count calculation means 41013, cepstrum conversion parameter calculation means 41015, final cepstrum conversion parameter acquisition means 36016, and vocal tract length normalization parameter calculation means 36017. Here, the final cepstrum conversion parameter acquisition means 36016 causes, based on a predetermined rule, not only the processing in the cepstrum conversion means 36014 and in the cepstrum conversion parameter calculation means 41015 but also the processing in the occupancy count calculation means 41013 to be repeated, and obtains the final, optimal cepstrum conversion parameter.
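The repetition controlled by the final cepstrum conversion parameter acquisition means 36016 can be pictured as the loop sketched below. The three callables stand in for the cepstrum conversion means, the occupancy count calculation means, and the cepstrum conversion parameter calculation means; their names and signatures are hypothetical. The stopping rule (change in α below a tolerance, or a maximum number of passes) is an assumption, since the description only refers to "a predetermined rule".

```python
def estimate_alpha(transform, occupancy, update, alpha0=0.0, tol=1e-4, max_iter=50):
    """Iterate transform -> occupancy -> update until alpha stops changing.

    transform(alpha)            -> warped cepstrum sequence (third sequence)
    occupancy(warped)           -> occupancy counts gamma_t(j, m)
    update(warped, gamma, alpha)-> optimal alpha for the current pass
    All three are caller-supplied placeholders, not the patent's formulas.
    """
    alpha = alpha0
    for iteration in range(1, max_iter + 1):
        warped = transform(alpha)
        gamma = occupancy(warped)
        new_alpha = update(warped, gamma, alpha)
        if abs(new_alpha - alpha) < tol:   # assumed convergence rule
            return new_alpha, iteration
        alpha = new_alpha                  # corresponds to substituting alpha* for alpha
    return alpha, max_iter
```

Passing the three stages in as functions keeps the loop independent of how each stage is computed, which mirrors the way the same acquisition means 36016 is reused across embodiments.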

The frequency range designation information storage means 41011 stores frequency range designation information (W), which designates a frequency range. When the optimal cepstrum conversion parameter (α) described later is calculated, W restricts the frequency range to one in which frequency warping by a bilinear transformation, such as a first-order all-pass function, can be approximated by linear frequency expansion and contraction. This range preferably extends from zero frequency to about 1/3 to 1/2 of the Nyquist rate. Because the bilinear transformation is performed in the cepstrum domain rather than the spectral domain, W is preferably, for example, the matrix described below. W is calculated, for example, by frequency range designation information calculation means (not shown) as follows. Let the sampling frequency be F_s (Hz) and the highest frequency of the designated range be F_max (Hz), and set ω_max = 2πF_max/F_s. The frequency range designation information calculation means calculates the (i, j) component of the frequency range designation matrix W for cepstrum vectors according to Formula 16 below. Specifically, the means reads the sampling frequency (F_s) and the highest frequency (F_max) stored on a recording medium of the computer (not shown). It then reads the expression ω_max = 2πF_max/F_s that it holds and calculates ω_max. Next, it reads the stored information on Formula 16 and calculates {W}_{i,j} by loop processing (a double loop), incrementing i and j by one starting from 0. Finally, it stores all of the calculated {W}_{i,j} at least temporarily in the frequency range designation information storage means 41011.
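The procedure above can be sketched as follows. Formula 16 itself is not reproduced in this text, so the element function is left as a caller-supplied placeholder; only the expression ω_max = 2πF_max/F_s and the double loop over i and j come from the description. The function names, and the choice of an (M+1)×(M+1) matrix indexed from 0, are assumptions.

```python
import math


def omega_max(f_max_hz, f_s_hz):
    """Normalized angular frequency: omega_max = 2 * pi * F_max / F_s."""
    return 2.0 * math.pi * f_max_hz / f_s_hz


def build_range_matrix(order, element):
    """Fill W by a double loop over i, j = 0..order.

    `element(i, j)` is a placeholder for Formula 16, which is not
    reproduced here; the caller must supply it.
    """
    size = order + 1                      # cepstrum vectors include the 0th coefficient
    matrix = []
    for i in range(size):
        row = []
        for j in range(size):
            row.append(element(i, j))
        matrix.append(row)
    return matrix
```

For example, with F_s = 16000 Hz the Nyquist frequency is 8000 Hz, and choosing F_max = 4000 Hz (half of that) gives ω_max = π/2.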

The data structure of the frequency range designation information is not limited. The frequency range designation information storage means 41011 is preferably a non-volatile recording medium, but it can also be realized with a volatile recording medium.

The learning acoustic data storage means 41012 stores learning acoustic data, which is generated as follows. Learning acoustic data generation means (not shown) obtains, by short-interval analysis, a cepstrum vector sequence of the speaker's speech from the data constituting the teacher data (for example, the data in the teacher data storage unit 102), uses it to acquire two or more reference speaker phoneme HMMs, and stores them at least temporarily in memory. The learning acoustic data generation means then concatenates the two or more reference speaker phoneme HMMs according to the utterance content designated for vocal tract length normalization (for example, /aiueo/) to generate a concatenated HMM. The cepstrum vector has M+1 dimensions, including the zeroth-order coefficient. Let μ_{j,m} and Σ_{j,m} denote the mean vector and the covariance matrix, respectively, of the m-th Gaussian component in the j-th state of the concatenated HMM; μ_{j,m} is then expressed by Formula 17 below.

Here, (...)^T denotes the transpose of a matrix or vector.

The learning acoustic data generation means (not shown) then stores the generated learning acoustic data in the learning acoustic data storage means 41012.

The learning acoustic data is preferably a concatenated HMM, but it may be data based on another model, such as a single Gaussian distribution model, a probability model (GMM: Gaussian mixture model), or a statistical model. The learning acoustic data storage means 41012 is preferably a non-volatile recording medium, but it can also be realized with a volatile recording medium.

The occupancy count calculation means 41013 calculates the occupancy count γ_t(j, m), writes it to memory, and holds it at least temporarily. The occupancy count γ_t(j, m) is the posterior probability that the t-th frame was generated by the m-th Gaussian component of the j-th state. The t-th frame is, for example, the t-th frame obtained from speech that the speech reception unit 103 received when the user uttered the content designated for vocal tract length normalization (for example, /aiueo/). The occupancy count calculation means 41013 calculates γ_t(j, m) using the third cepstrum vector sequence (O_t) calculated by the cepstrum conversion means 36014 and the learning acoustic data in the learning acoustic data storage means 41012. More preferably, it also uses the frequency range designation information (W) in the frequency range designation information storage means 41011. The occupancy count is calculated by the forward-backward algorithm, which is a known technique and is therefore not described in detail here.

More specifically, the occupancy count calculation means 41013 calculates the occupancy count γ_t(j, m) by the following processing. Let Λ be the concatenated HMM obtained by concatenating the phoneme HMMs learned from the speech of the teacher data according to the utterance content designated for estimating the vocal tract length normalization parameter (for example, /aiueo/). The occupancy count γ_t(j, m) used in the cepstrum conversion parameter estimation is defined as the posterior probability that the t-th frame was generated by the m-th Gaussian component of the j-th state of Λ, given Λ and the cepstrum vector sequence Wo_1(α), Wo_2(α), ..., Wo_T(α) obtained by restricting the third cepstrum sequence to the frequency range designated by the frequency range designation information W. The occupancy count γ_t(j, m) is therefore defined by Formula 18.

The occupancy count γ_t(j, m) of Formula 18 is calculated as in Formula 19 using the forward likelihood A_t(j) and the backward likelihood B_t(j). That is, the occupancy count calculation means 41013 holds the information on Formula 19 and calculates γ_t(j, m) from the given cepstrum vector sequence and Λ.

The occupancy count calculation means 41013 calculates A_t(j) and B_t(j) of Formula 19 from the cepstrum vector sequence Wo_1(α), ..., Wo_T(α) and Λ using the known forward-backward algorithm.
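As an illustration of how occupancy counts are obtained from forward and backward likelihoods, the following sketch uses the standard textbook formulation for an HMM with Gaussian-mixture outputs. Formula 19 is not reproduced in this text, so the expression in the code is an assumption that it takes this standard form. The per-component likelihoods b_{j,m} are taken as precomputed inputs, no numerical scaling is applied (adequate only for short sequences), and all names are hypothetical.

```python
def occupancy_counts(pi, trans, weights, comp_lik):
    """Return gamma[t][j][m] for an HMM with mixture outputs.

    pi[j]             initial state probabilities
    trans[i][j]       state transition probabilities
    weights[j][m]     mixture weights v_{j,m}
    comp_lik[t][j][m] component likelihoods b_{j,m}(o_t), precomputed
    """
    T, J = len(comp_lik), len(pi)
    # State output likelihood: b_j(o_t) = sum_m v_{j,m} * b_{j,m}(o_t)
    state_lik = [[sum(v * b for v, b in zip(weights[j], comp_lik[t][j]))
                  for j in range(J)] for t in range(T)]
    # Forward likelihoods A_t(j)
    fwd = [[0.0] * J for _ in range(T)]
    for j in range(J):
        fwd[0][j] = pi[j] * state_lik[0][j]
    for t in range(1, T):
        for j in range(J):
            fwd[t][j] = sum(fwd[t - 1][i] * trans[i][j] for i in range(J)) * state_lik[t][j]
    # Backward likelihoods B_t(j)
    bwd = [[1.0] * J for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(J):
            bwd[t][i] = sum(trans[i][j] * state_lik[t + 1][j] * bwd[t + 1][j]
                            for j in range(J))
    # State posterior times within-state component posterior
    gamma = []
    for t in range(T):
        total = sum(fwd[t][j] * bwd[t][j] for j in range(J))
        gamma.append([[fwd[t][j] * bwd[t][j] / total
                       * weights[j][m] * comp_lik[t][j][m] / state_lik[t][j]
                       for m in range(len(weights[j]))] for j in range(J)])
    return gamma
```

Because each γ_t(j, m) is a posterior probability, the values for a given frame t sum to one over all states j and components m, which gives a simple sanity check on an implementation.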

Here, v_{j,m} is the weight coefficient for the m-th Gaussian component of the j-th state. The occupancy count calculation means 41013 calculates the output distribution b_{j,m}(Wo_t(α)) using Formula 20. It holds the information on Formula 20 in advance, reads that information, and performs the computation to obtain b_{j,m}(Wo_t(α)).

In Formula 20, M is the cepstrum order and |A| denotes the determinant of a matrix A.
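For orientation, the sketch below evaluates an ordinary multivariate Gaussian density, in which a determinant appears in the normalizing constant. It is not a reproduction of Formula 20, which is not included in this text: the diagonal covariance is an assumption made to keep the example short (for a diagonal matrix the determinant is the product of the variances), and the function name is hypothetical.

```python
import math


def gaussian_density_diag(x, mean, var):
    """Density of N(mean, diag(var)) at x, with dimension len(x).

    For cepstrum vectors that include the 0th coefficient the dimension
    would be M + 1. Diagonal covariance is an assumption of this sketch.
    """
    dim = len(x)
    det = 1.0          # determinant of the diagonal covariance matrix
    quad = 0.0         # (x - mean)^T Sigma^{-1} (x - mean)
    for xi, mi, vi in zip(x, mean, var):
        det *= vi
        quad += (xi - mi) ** 2 / vi
    norm = math.sqrt((2.0 * math.pi) ** dim * det)
    return math.exp(-0.5 * quad) / norm
```

In practice the logarithm of the density is usually computed instead, to avoid underflow when the dimension is large.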

The occupancy count calculation means 41013 can usually be realized by an MPU, memory, and the like. Its processing procedure is usually implemented in software, and the software is recorded on a recording medium such as a ROM. However, it may also be realized in hardware (a dedicated circuit).

The cepstrum conversion parameter calculation means 41015 reads the learning acoustic data from the learning acoustic data storage means 41012, reads the third cepstrum vector sequence (O_t) temporarily stored in memory, and calculates the cepstrum conversion parameter (α) based on the learning acoustic data and the third cepstrum vector sequence (O_t) that it has read. More preferably, the means calculates the cepstrum conversion parameter based on the learning acoustic data in the frequency range indicated by the frequency range designation information (W), the third cepstrum vector sequence (O_t), and the occupancy count γ_t(j, m). The third cepstrum vector sequence (O_t) is the data calculated by the cepstrum conversion means 36014.

Specifically, the cepstrum conversion parameter calculation means 41015 first calculates the vector u_t(α) by Formula 12 above. The means stores information representing Formula 12; it reads this information, supplies the third cepstrum vector sequence (O_t), the cepstrum conversion parameter (α), and the vector (C_t ⁻) to the formula, and evaluates it. Next, the means calculates the optimal value of α (α*) by Formula 21 below. This α* is the optimal value for the current pass of the iterative (loop) processing. The means stores information representing Formula 21 in advance.

The cepstrum conversion parameter calculation means 41015 can usually be realized by an MPU, memory, and the like. Its processing procedure is usually implemented in software, and the software is recorded on a recording medium such as a ROM. However, it may also be realized in hardware (a dedicated circuit).

Next, the operation of the speech processing apparatus will be described. First, the process of calculating the vocal tract length normalization parameter is described with reference to the flowchart of FIG. 42; only the steps that differ from the flowchart of FIG. 37 are described. This calculation process does not necessarily have to be performed by the speech processing apparatus. The initialization process in FIG. 42 consists of, for example, prompting the user (the person being evaluated) to utter "/aiueo/" and reading the frequency range designation information from the frequency range designation information storage means 41011 and the learning acoustic data from the learning acoustic data storage means 41012.

(Step S4201) The occupancy count calculation means 41013 calculates the occupancy count γ_t(j, m) using the third cepstrum vector sequence (O_t) obtained in step S3707, the learning acoustic data in the learning acoustic data storage means 41012, and the frequency range designation information (W), and stores it temporarily in memory.

(Step S4202) The cepstrum conversion parameter calculation means 41015 calculates the optimal cepstrum conversion parameter (α*) for the current loop based on the learning acoustic data in the frequency range indicated by the frequency range designation information (W), the vector u_t(α), and the occupancy count γ_t(j, m). For example, the means reads the stored information on Formula 21, reads the learning acoustic data, the vector u_t(α), and the occupancy count γ_t(j, m) from memory, substitutes them into Formula 21, and calculates the cepstrum conversion parameter (α*).

Next, the operation of the speech processing apparatus (the speech rating process) will be described. It is the same as the operation in the flowcharts of FIGS. 38 and 39.

The concept of the vocal tract length normalization parameter calculation process in the speech processing apparatus of this embodiment will now be described with reference to the conceptual diagram of FIG. 43. First, phoneme HMMs (the phoneme HMMs of the teacher data) from the speech database for system design are concatenated according to the designated utterance content (for example, /aiueo/) to form a concatenated HMM (the reference speaker HMM in FIG. 43). The parameter is then calculated so as to maximize the appearance probability, with respect to Λ (the concatenated HMM concatenated according to the user's utterance), of the transformed cepstrum vectors O_t of the user's utterance of the same phoneme sequence (this corresponds to the optimization in FIG. 43). However, the parameter obtained in this way (the optimized α) is a cepstrum conversion (cepstrum warping) parameter and does not directly represent a transformation of the vocal tract length, so the final vocal tract length normalization parameter (sampling frequency conversion rate) γ is calculated using the approximate conversion formula (1/ρ). The vocal tract length normalization parameter is calculated using Formulas 16 to 21, Formulas 9 to 12, Formula 14, and Formula 15 above.

According to the present embodiment, the speech processing apparatus performs both the process of calculating the vocal tract length normalization parameter and the vocal tract length normalization process. However, these two processes may be performed by different apparatuses. In that case, the speech processing apparatus comprises: a speech reception unit that receives speech; a first sampling frequency storage unit that stores a first sampling frequency; a vocal tract length normalization parameter storage unit that stores a vocal tract length normalization parameter, which is information on the conversion rate of the sampling frequency; a vocal tract length normalization processing unit that performs sampling processing on the speech received by the speech reception unit at a second sampling frequency, calculated using the vocal tract length normalization parameter and the first sampling frequency as parameters, to obtain second speech data; and a speech processing unit that processes the second speech data. The apparatus that calculates the vocal tract length normalization parameter comprises: learning acoustic data storage means that stores learning acoustic data, which is a concatenated HMM obtained by deriving a cepstrum vector sequence of a speaker's speech by short-time analysis from the data constituting the teacher data, using that sequence to obtain two or more reference speaker phoneme HMMs, and concatenating those HMMs according to the utterance content specified for vocal tract length normalization (for example, /aiueo/); second cepstrum vector sequence calculation means that calculates a second cepstrum vector sequence from the received speech by short-time analysis; cepstrum conversion means that linearly transforms the second cepstrum vector sequence using a matrix whose elements are cepstrum conversion parameters, to calculate a frequency-warped third cepstrum vector sequence; occupancy calculation means that calculates the occupancy, which is the posterior probability of the received speech according to the specified utterance content; cepstrum conversion parameter calculation means that calculates a cepstrum conversion parameter based on the learning acoustic data, the third cepstrum vector sequence, and the occupancy; final cepstrum conversion parameter acquisition means that, based on a predetermined rule, causes the processing in the cepstrum conversion means, the occupancy calculation means, and the cepstrum conversion parameter calculation means to be repeated, and obtains a cepstrum conversion parameter; and vocal tract length normalization parameter calculation means that calculates the vocal tract length normalization parameter based on the cepstrum conversion parameter obtained by the final cepstrum conversion parameter acquisition means.

In the above program, it is preferable that one or more pieces of teacher data, which are data on the speech to be compared against and are data for each of one or more phonemes, are stored; that the program further causes the computer to execute a vocal tract length normalization parameter calculation step of calculating the vocal tract length normalization parameter based on the teacher data and the speech received in the speech reception step; and that the stored vocal tract length normalization parameter is the one calculated in the vocal tract length normalization parameter calculation step.

In the above program, it is also preferable that learning acoustic data, which is a concatenated HMM in which phoneme HMMs are concatenated according to the specified utterance content, is stored, and that the vocal tract length normalization parameter calculation step comprises: a second cepstrum vector sequence calculation step of calculating a second cepstrum vector sequence by short-time analysis from the speech received in the speech reception step; a cepstrum conversion step of linearly transforming the second cepstrum vector sequence using a matrix whose elements are cepstrum conversion parameters, to calculate a frequency-warped third cepstrum vector sequence; an occupancy calculation step of calculating the occupancy, which is the posterior probability of the received speech according to the specified utterance content; a cepstrum conversion parameter calculation step of calculating a cepstrum conversion parameter based on the learning acoustic data, the third cepstrum vector sequence, and the occupancy; a final cepstrum conversion parameter acquisition step of causing, based on a predetermined rule, the processing in the cepstrum conversion step, the occupancy calculation step, and the cepstrum conversion parameter calculation step to be repeated, and obtaining a cepstrum conversion parameter; and a vocal tract length normalization parameter calculation step of calculating the vocal tract length normalization parameter based on the cepstrum conversion parameter obtained in the final cepstrum conversion parameter acquisition step.

In the above program, it is also preferable that, in the vocal tract length normalization parameter calculation step, a linear frequency expansion/contraction ratio is calculated from the cepstrum conversion parameter obtained in the final cepstrum conversion parameter acquisition step, and the vocal tract length normalization parameter is calculated from that linear frequency expansion/contraction ratio.

In the above program, it is also preferable that frequency range designation information, which is information designating a frequency range, is stored, and that, in the cepstrum conversion parameter calculation step, the cepstrum conversion parameter is calculated based on the learning acoustic data, the third cepstrum vector sequence, and the occupancy within the frequency range indicated by the frequency range designation information.

(Embodiment 10)

In the present embodiment, a speech processing apparatus that can rate the similarity between the speech to be compared against and the input speech with high accuracy and at high speed will be described. The speech processing apparatus is described mainly as a pronunciation rating apparatus that evaluates speech (including singing). The present embodiment further describes a speech processing apparatus capable of pronunciation rating that reflects the speaker characteristics of the person being evaluated with even higher accuracy than the speech processing apparatuses described in the above embodiments. Specifically, the present embodiment describes a speech processing apparatus that is less affected by the speaker characteristics of the person being evaluated, based on a concise vocal tract length normalization method that uses a least-squares error criterion over a plurality of phonemes.

More specifically, in the present embodiment, a reference speaker average cepstrum vector is calculated in advance for each phoneme from the speech of a reference speaker (a speaker used for system design). An optimal warping parameter is then obtained so that the squared error between the frequency-warped cepstrum vector sequence of the user's speech and the sequence of reference speaker average cepstrum vectors is minimized. Here, it is preferable that the user's speech is uttered according to the specified utterance content and that the sequence of reference speaker average cepstrum vectors is obtained by arranging the per-phoneme reference speaker average cepstrum vectors according to the same utterance content. However, the parameter calculated here is a cepstrum warping parameter and does not directly represent a vocal tract length conversion, so the final vocal tract length normalization parameter (sampling frequency conversion rate) is calculated using a predetermined approximate conversion formula.

Note that the speech processing apparatus in the present embodiment differs from the speech processing apparatuses of Embodiments 8 and 9 in the algorithm used to calculate the vocal tract length normalization parameter.

FIG. 44 is a block diagram of the speech processing apparatus according to the present embodiment.

This speech processing apparatus differs from the speech processing apparatus of FIG. 41 in its vocal tract length normalization parameter calculation unit 4401.

The vocal tract length normalization parameter calculation unit 4401 comprises frequency range designation information storage means 41011, learning acoustic data storage means 44012, second cepstrum vector sequence calculation means 36013, cepstrum conversion means 36014, optimal phoneme sequence acquisition means 44013, cepstrum conversion parameter calculation means 44015, final cepstrum conversion parameter acquisition means 36016, and vocal tract length normalization parameter calculation means 36017. Here, based on a predetermined rule, the final cepstrum conversion parameter acquisition means 36016 causes not only the processing in the cepstrum conversion means 36014 and the cepstrum conversion parameter calculation means 44015 but also the processing in the optimal phoneme sequence acquisition means 44013 to be repeated, and obtains the final optimal cepstrum conversion parameter.

The learning acoustic data storage means 44012 stores learning acoustic data. The learning acoustic data here is a sequence of phoneme average cepstrum vectors arranged according to the specified utterance content. The learning acoustic data is generated, for example, as follows. A learning acoustic data generation means (not shown) obtains, from the data constituting the teacher data (for example, the data in the teacher data storage unit 102), the speech cepstrum vector sequence of the reference speaker for each phoneme l (l = 1, 2, ..., L) by short-time analysis, and calculates its time average (μ_l). The cepstrum vectors are (M+1)-dimensional, including the zeroth-order coefficient. The time average (μ_l) is given by Formula 22.
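The generation of the learning acoustic data can be illustrated with a minimal Python sketch. It assumes that the time average is the plain arithmetic mean of the frames labeled with each phoneme, which is the natural reading of the text but is not confirmed here because Formula 22 itself is not shown. The function names `phoneme_mean_cepstra` and `arrange_by_utterance` and the toy data are hypothetical.

```python
from collections import defaultdict


def phoneme_mean_cepstra(frames, labels):
    """Average the cepstrum vectors of each phoneme over time.

    frames: list of (M+1)-dimensional cepstrum vectors (lists of floats).
    labels: phoneme label of each frame, same length as frames.
    Assumption: the time average is the arithmetic mean over frames.
    """
    sums = {}
    counts = defaultdict(int)
    for vec, lab in zip(frames, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
        for m, c in enumerate(vec):
            sums[lab][m] += c
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}


def arrange_by_utterance(means, phoneme_order):
    """Arrange the per-phoneme means according to the utterance content."""
    return [means[p] for p in phoneme_order]


# Toy reference-speaker data: two frames of /a/ and one frame of /i/.
frames = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
labels = ["a", "a", "i"]
means = phoneme_mean_cepstra(frames, labels)
learning_acoustic_data = arrange_by_utterance(means, ["a", "i"])
```

In this sketch the resulting list plays the role of the phoneme average cepstrum vector sequence that would be stored in the learning acoustic data storage means.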

The learning acoustic data generation means (not shown) then stores the generated learning acoustic data in the learning acoustic data storage means 44012.

The learning acoustic data storage means 44012 is preferably a non-volatile recording medium, but can also be realized by a volatile recording medium.

The optimal phoneme sequence acquisition means 44013 acquires a phoneme sequence (s^*_t (t = 1, 2, ..., T)), which is information on the phoneme number corresponding to each frame t of the speech uttered by the user. Any algorithm may be used by the optimal phoneme sequence acquisition means 44013 to acquire the phoneme sequence (s^*_t). For example, the optimal phoneme sequence acquisition means 44013 may accept input of the phoneme sequence (s^*_t) from the user (via a keyboard or the like). It may also calculate the phoneme sequence (s^*_t) from the user cepstrum vector sequence (C_t) by auto-segmentation; auto-segmentation is a known technique, so its description is omitted. Alternatively, it may use dynamic programming (DP matching) to obtain the phoneme sequence S_t (t = 1, 2, ..., T) that minimizes J in Formula 23, and take that S_t as the phoneme sequence (s^*_t). The phoneme sequence need not be a sequence of phoneme numbers; any sequence of information that identifies phonemes will do.
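The DP matching option can be sketched as follows. Formula 23 is not shown in this text, so the sketch assumes that J is the sum over frames of the squared distance between each frame and the average cepstrum vector of its assigned phoneme, and that the phonemes must be visited in utterance order without skipping. The function names `sq_dist` and `align_frames` are hypothetical.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def align_frames(frames, means):
    """Assign each frame to a phoneme index by dynamic programming.

    frames: T cepstrum vectors of the user's utterance.
    means:  L per-phoneme average vectors in utterance order.
    Returns (labels, cost) where labels[t] is an index into means.
    Assumption: the path starts at phoneme 0, ends at phoneme L-1,
    and at each frame either stays or advances by one phoneme.
    """
    T, L = len(frames), len(means)
    if T < L:
        raise ValueError("need at least one frame per phoneme")
    inf = float("inf")
    cost = [[inf] * L for _ in range(T)]
    back = [[0] * L for _ in range(T)]
    cost[0][0] = sq_dist(frames[0], means[0])
    for t in range(1, T):
        for l in range(L):
            stay = cost[t - 1][l]
            move = cost[t - 1][l - 1] if l > 0 else inf
            if move < stay:
                best, prev = move, l - 1
            else:
                best, prev = stay, l
            if best < inf:
                cost[t][l] = best + sq_dist(frames[t], means[l])
                back[t][l] = prev
    # Trace the best path back from the last frame and last phoneme.
    labels = [L - 1]
    for t in range(T - 1, 0, -1):
        labels.append(back[t][labels[-1]])
    labels.reverse()
    return labels, cost[T - 1][L - 1]


# Toy example: four one-dimensional frames, two phonemes.
labels, total_cost = align_frames([[0.0], [0.1], [1.0], [0.9]], [[0.0], [1.0]])
```

The returned `labels` list corresponds to a frame-wise phoneme sequence in the sense of s^*_t, under the stated assumptions about J and the allowed transitions.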

The cepstrum conversion parameter calculation means 44015 reads the learning acoustic data from the learning acoustic data storage means 44012, the third cepstrum vector sequence (O_t) temporarily stored in memory, and the phoneme sequence (s^*_t) acquired by the optimal phoneme sequence acquisition means 44013, and uses them to calculate the cepstrum conversion parameter (α). More preferably, the cepstrum conversion parameter calculation means 44015 calculates the cepstrum conversion parameter based on the frequency range designation information (W) together with the read learning acoustic data, third cepstrum vector sequence (O_t), and phoneme sequence (s^*_t).

Specifically, the cepstrum conversion parameter calculation means 36015 first calculates the vector (u_t(α)) by Formula 24. Next, the cepstrum conversion parameter calculation means 44015 calculates the optimal value (α^*) of α by Formula 25. This optimal value (α^*) is the optimal value in the current iteration step.

The cepstrum conversion parameter calculation means 44015 normally stores information on Formulas 24 and 25 above, reads that information, and performs the computation. The cepstrum conversion parameter calculation means 44015 can normally be realized by an MPU, memory, and the like. Its processing procedure is normally implemented in software, and the software is recorded on a recording medium such as a ROM. However, it may also be implemented in hardware (a dedicated circuit).

In Embodiment 9, the vocal tract length normalization parameter calculation means 36017 calculated the linear frequency expansion/contraction ratio (ρ) from the cepstrum conversion parameter obtained by the final cepstrum conversion parameter acquisition means 36016 according to Formula 14. However, the vocal tract length normalization parameter calculation means 36017 may instead calculate the linear frequency expansion/contraction ratio (ρ) from that cepstrum conversion parameter according to Formula 26.

Next, the operation of the speech processing apparatus will be described. First, the process of calculating the vocal tract length normalization parameter in this speech processing apparatus is described with reference to the flowchart of FIG. 45. Only the steps in the flowchart of FIG. 45 that differ from the flowcharts of FIGS. 37 and 42 are described. Note that this vocal tract length normalization parameter calculation process does not necessarily have to be performed by the speech processing apparatus. The initialization process in FIG. 45 consists of, for example, a process of prompting the user (the person being evaluated) to say "/aiueo/" and a process of reading the frequency range designation information from the frequency range designation information storage means 41011 and the learning acoustic data from the learning acoustic data storage means 44012.

(Step S4501) The optimal phoneme sequence acquisition means 44013 acquires the phoneme sequence (s^*_t (t = 1, 2, ..., T)) corresponding to each frame t of the speech uttered by the user.

(Step S4502) The cepstrum conversion parameter calculation means 44015 calculates the optimal cepstrum conversion parameter (α^*) for the current loop based on the learning acoustic data within the frequency range indicated by the frequency range designation information (W), the vector (u_t(α)), and the phoneme sequence (s^*_t) acquired in step S4501. For example, the cepstrum conversion parameter calculation means 44015 reads the stored information on Formula 25, reads the learning acoustic data, the vector (u_t(α)), and the phoneme sequence (s^*_t) from memory, substitutes them into Formula 25, and calculates the cepstrum conversion parameter (α^*).
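The repetition of cepstrum conversion, step S4501, and step S4502 can be sketched as an alternating loop. The sketch below is illustrative only. The "predetermined rule" is assumed to be a stop when the parameter changes by less than a tolerance or an iteration limit is reached. The conversion is replaced by a scalar scaling and the phoneme sequence acquisition by a nearest-mean assignment, as simple stand-ins for the matrix-based warping and for Formulas 24 and 25, which are not shown here. All function names are hypothetical.

```python
def estimate_alpha(convert, acquire_labels, solve_alpha, frames, means,
                   alpha0=1.0, max_iter=20, tol=1e-6):
    """Alternate conversion, labeling, and parameter estimation.

    Assumed stopping rule: the change in alpha falls below tol,
    or max_iter iterations have been run.
    """
    alpha = alpha0
    for _ in range(max_iter):
        converted = convert(frames, alpha)               # cepstrum conversion
        labels = acquire_labels(converted, means)        # cf. step S4501
        new_alpha = solve_alpha(frames, means, labels)   # cf. step S4502
        if abs(new_alpha - alpha) < tol:
            return new_alpha
        alpha = new_alpha
    return alpha


# Toy stand-ins (NOT the patent's formulas).
def scale(frames, alpha):
    """Stand-in conversion: multiply every coefficient by alpha."""
    return [[alpha * c for c in f] for f in frames]


def nearest_mean(converted, means):
    """Stand-in labeling: assign each frame to its closest mean."""
    def dist(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y))
    return [min(range(len(means)), key=lambda l: dist(f, means[l]))
            for f in converted]


def least_squares_scale(frames, means, labels):
    """Closed-form alpha minimizing sum ||alpha * c_t - mu_{s_t}||^2."""
    num = sum(c * m for f, l in zip(frames, labels)
              for c, m in zip(f, means[l]))
    den = sum(c * c for f in frames for c in f)
    return num / den


means = [[1.0], [2.0]]
frames = [[0.8], [0.8], [1.6], [1.6]]  # reference values divided by 1.25
alpha = estimate_alpha(scale, nearest_mean, least_squares_scale, frames, means)
```

With real cepstra, the three callables would be replaced by the matrix-based conversion, the chosen phoneme sequence acquisition method, and the least-squares solution of the actual warping model.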

Hereinafter, the concept of the vocal tract length normalization parameter calculation process in the speech processing apparatus according to the present embodiment will be described using the conceptual diagram of FIG. 46. First, the parameter (α^*) is obtained so that the squared error between a vector sequence, in which the per-phoneme average cepstrum vectors calculated from the reference speaker speech database are arranged according to the phoneme sequence of the specified utterance content (for example, /aiueo/), and the converted cepstrum vectors of the user's utterance of the same phoneme sequence is minimized. However, the obtained parameter (α^*) is a cepstrum conversion (cepstrum warping) parameter and does not by itself directly represent a conversion of the vocal tract length, so the final vocal tract length normalization parameter (sampling frequency conversion rate) γ is calculated using the approximate conversion formula (1/ρ). Note that the vocal tract length normalization parameter is calculated using Formulas 22 to 26, Formulas 9 to 11, Formulas 14 to 16, and so on.
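The final step, from the linear frequency expansion/contraction ratio ρ to resampled speech, can be sketched as below. Two points are assumptions rather than statements of the patent's formulas: γ is taken to be exactly 1/ρ, and the second sampling frequency is taken to be the first sampling frequency multiplied by γ. Linear interpolation is used only as a simple stand-in for the sampling processing, and the function names are hypothetical.

```python
def vtln_gamma(rho):
    """Assumed approximate conversion: gamma = 1 / rho."""
    return 1.0 / rho


def second_sampling_frequency(first_fs, gamma):
    """Assumed relation: second frequency = first frequency * gamma."""
    return first_fs * gamma


def resample_linear(samples, first_fs, second_fs):
    """Resample by linear interpolation (illustrative, no anti-alias filter)."""
    n_out = int(len(samples) * second_fs / first_fs)
    out = []
    for k in range(n_out):
        pos = k * first_fs / second_fs  # position in the input signal
        i = int(pos)
        frac = pos - i
        nxt = samples[i + 1] if i + 1 < len(samples) else samples[i]
        out.append((1.0 - frac) * samples[i] + frac * nxt)
    return out


gamma = vtln_gamma(0.5)
fs2 = second_sampling_frequency(4, gamma)
second_speech = resample_linear([0.0, 1.0, 2.0, 3.0], 4, fs2)
```

A production implementation would normally use a band-limited resampler instead of linear interpolation; the sketch only shows where γ enters the computation.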

According to the present embodiment, the speech processing apparatus performs both the process of calculating the vocal tract length normalization parameter and the vocal tract length normalization process. However, these two processes may be performed by different apparatuses. In that case, the speech processing apparatus comprises: a speech reception unit that receives speech; a first sampling frequency storage unit that stores a first sampling frequency; a vocal tract length normalization parameter storage unit that stores a vocal tract length normalization parameter, which is information on the conversion rate of the sampling frequency; a vocal tract length normalization processing unit that performs sampling processing on the speech received by the speech reception unit at a second sampling frequency, calculated using the vocal tract length normalization parameter and the first sampling frequency as parameters, to obtain second speech data; and a speech processing unit that processes the second speech data. The apparatus that calculates the vocal tract length normalization parameter comprises: learning acoustic data storage means that stores learning acoustic data, which is a sequence of average cepstrum vectors obtained by deriving a cepstrum vector sequence of a speaker's speech by short-time analysis from the data constituting the teacher data, using that sequence to obtain two or more reference speaker phoneme average cepstrum vectors, and arranging those vectors according to the utterance content specified for vocal tract length normalization (for example, /aiueo/); second cepstrum vector sequence calculation means that calculates a second cepstrum vector sequence from the received speech by short-time analysis; cepstrum conversion means that linearly transforms the second cepstrum vector sequence using a matrix whose elements are cepstrum conversion parameters, to calculate a frequency-warped third cepstrum vector sequence; optimal phoneme sequence acquisition means that acquires a phoneme sequence, which is a sequence of information identifying the phoneme corresponding to each frame t of the speech uttered by the user; cepstrum conversion parameter calculation means that calculates a cepstrum conversion parameter based on the learning acoustic data, the third cepstrum vector sequence, and the phoneme sequence; final cepstrum conversion parameter acquisition means that, based on a predetermined rule, causes the processing in the cepstrum conversion means, the optimal phoneme sequence acquisition means, and the cepstrum conversion parameter calculation means to be repeated, and obtains a cepstrum conversion parameter; and vocal tract length normalization parameter calculation means that calculates the vocal tract length normalization parameter based on the cepstrum conversion parameter obtained by the final cepstrum conversion parameter acquisition means.

It is also preferable that the vocal tract length normalization parameter calculation step in the above program comprises: a second cepstrum vector sequence calculation step of calculating a second cepstrum vector sequence by short-time analysis from the speech received in the speech reception step; a cepstrum conversion step of transforming the second cepstrum vector sequence using a matrix whose elements are cepstrum conversion parameters, to calculate a frequency-warped third cepstrum vector sequence; an optimal phoneme sequence acquisition step of acquiring a phoneme sequence, which is a sequence of information identifying the phoneme corresponding to each frame t of the speech uttered by the user; a cepstrum conversion parameter calculation step of calculating a cepstrum conversion parameter using the stored learning acoustic data, the third cepstrum vector sequence, and the phoneme sequence; a final cepstrum conversion parameter acquisition step of causing, based on a predetermined rule, the processing in the cepstrum conversion step, the optimal phoneme sequence acquisition step, and the cepstrum conversion parameter calculation step to be repeated, and obtaining a cepstrum conversion parameter; and a vocal tract length normalization parameter calculation sub-step of calculating the vocal tract length normalization parameter based on the cepstrum conversion parameter obtained in the final cepstrum conversion parameter acquisition step.

It is also preferable that, in the vocal tract length normalization parameter calculation sub-step of the above program, a linear frequency expansion/contraction ratio is calculated from the cepstrum conversion parameter obtained in the final cepstrum conversion parameter acquisition step, and the vocal tract length normalization parameter is calculated from that linear frequency expansion/contraction ratio.

In the above program, it is also preferable that frequency range designation information, which is information designating a frequency range, is stored, and that, in the cepstrum conversion parameter calculation step, the cepstrum conversion parameter is calculated based on the learning acoustic data, the third cepstrum vector sequence, and the phoneme sequence within the frequency range indicated by the frequency range designation information.

また、上記の実施の形態において検出した特殊音声は、無音、挿入、置換、欠落であった。音声処理装置は、かかるすべての特殊音声について検知しても良いことはいうまでもない。また、音声処理装置は、主として、実施の形態１、実施の形態２において述べた評定値の算出アルゴリズムを利用して、特殊音声の検出を行ったが、他の評定値の算出アルゴリズムを利用しても良い。 In addition, the special voice detected in the above embodiment is silence, insertion, replacement, and omission. It goes without saying that the sound processing device may detect all such special sounds. In addition, the speech processing apparatus mainly detects the special speech using the rating value calculation algorithm described in the first embodiment and the second embodiment, but uses other rating value calculation algorithms. May be.

また、特殊音声は、無音、挿入、置換、欠落に限られない。例えば、特殊音声は、ｇａｒｂａｇｅ（雑音などの雑多な音素等）であっても良い。受け付けた音声にｇａｒｂａｇｅが混入している場合、その区間は類似度の計算対象から除外するのがしばしば望ましい。例えば、発音評定においては、学習者の発声には通常、息継ぎや無声区間などが数多く表れ、それらに対応する発声区間を評定対象から取り除くことが好適である。なお、無音は、一般に、ｇａｒｂａｇｅの一種である、と考える。 The special voice is not limited to silence, insertion, replacement, and omission. For example, the special voice may be a garbage (miscellaneous phonemes such as noise). When garbage is mixed in the received voice, it is often desirable to exclude that section from the similarity calculation target. For example, in pronunciation evaluation, a learner's utterance usually has many breathing and unvoiced intervals, and it is preferable to remove the corresponding utterance intervals from the evaluation target. Note that silence is generally considered a type of garbage.

Therefore, a miscellaneous phoneme that belongs to no phoneme (a garbage phoneme) is defined, and an HMM for garbage is stored in advance. In a score-drop section, if the rating value (γ_t(j)) for the garbage HMM is larger than a predetermined value, it is preferable to determine that the section is a garbage section. In pronunciation rating in particular, when a garbage section spans two words, it is highly preferable to treat it as a breath or similar event and exclude it from the rating value calculation.
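The garbage-section determination described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the names `Section`, `is_garbage_section`, `spans_two_words`, and `frames_to_rate` are hypothetical, the per-frame garbage rating values are averaged over the section (the specification only says the rating value is compared with a predetermined value, not how frames are aggregated), and a per-frame word alignment `word_ids` is assumed to be available.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Section:
    """A run of frames, both ends inclusive (hypothetical helper type)."""
    start: int
    end: int


def is_garbage_section(garbage_scores: Sequence[float],
                       section: Section,
                       threshold: float) -> bool:
    """True when the mean garbage-HMM rating value in the section exceeds
    the threshold. Averaging over the section is an assumption."""
    frames = garbage_scores[section.start:section.end + 1]
    return sum(frames) / len(frames) > threshold


def spans_two_words(section: Section, word_ids: Sequence[int]) -> bool:
    """True when the first and last frames of the section belong to
    different words according to the assumed per-frame alignment."""
    return word_ids[section.start] != word_ids[section.end]


def frames_to_rate(garbage_scores: Sequence[float],
                   word_ids: Sequence[int],
                   score_drop_sections: Sequence[Section],
                   threshold: float) -> List[int]:
    """Return the frame indices that remain in the rating target after
    removing garbage sections that span two words."""
    excluded = set()
    for section in score_drop_sections:
        if (is_garbage_section(garbage_scores, section, threshold)
                and spans_two_words(section, word_ids)):
            excluded.update(range(section.start, section.end + 1))
    return [t for t in range(len(garbage_scores)) if t not in excluded]


if __name__ == "__main__":
    scores = [0.1, 0.1, 0.9, 0.8, 0.1, 0.1]   # garbage rating value per frame
    words = [0, 0, 0, 1, 1, 1]                # word index per frame
    print(frames_to_rate(scores, words, [Section(2, 3)], threshold=0.5))
```

In this sketch a garbage section that lies inside a single word is kept in the rating target; whether such a section should also be excluded is a design choice that the text leaves open.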

FIG. 47 shows the external appearance of a computer that executes the programs described in this specification to realize the speech processing apparatuses of the various embodiments described above. The above embodiments can be realized by computer hardware and a computer program executed on it. FIG. 47 is an overview of this computer system 340, and FIG. 48 is a block diagram of the computer system 340.

In FIG. 47, the computer system 340 includes a computer 341 that includes an FD (Flexible Disk) drive and a CD-ROM (Compact Disk Read Only Memory) drive, a keyboard 342, a mouse 343, a monitor 344, and a microphone 345.

In FIG. 48, the computer 341 includes, in addition to the FD drive 3411 and the CD-ROM drive 3412, a CPU (Central Processing Unit) 3413; a bus 3414 connected to the CPU 3413, the CD-ROM drive 3412, and the FD drive 3411; a ROM (Read-Only Memory) 3415 for storing programs such as a boot-up program; a RAM (Random Access Memory) 3416 that is connected to the CPU 3413, temporarily stores the instructions of application programs, and provides temporary storage space; and a hard disk 3417 for storing application programs, system programs, and data. Although not shown, the computer 341 may further include a network card that provides a connection to a LAN.

A program that causes the computer system 340 to execute the functions of the speech processing apparatus of the above embodiments may be stored on a CD-ROM 3501 or an FD 3502, which is inserted into the CD-ROM drive 3412 or the FD drive 3411, and may then be transferred to the hard disk 3417. Alternatively, the program may be transmitted to the computer 341 via a network (not shown) and stored on the hard disk 3417. The program is loaded into the RAM 3416 at execution time. The program may also be loaded directly from the CD-ROM 3501, the FD 3502, or the network.

The program does not necessarily have to include an operating system (OS), a third-party program, or the like that causes the computer 341 to execute the functions of the speech processing apparatus of the above embodiments. The program need only include those instructions that call the appropriate functions (modules) in a controlled manner so that the desired result is obtained. How the computer system 340 operates is well known, and a detailed description is omitted.

In each of the above embodiments, each process (each function) may be realized by centralized processing in a single apparatus (system) or by distributed processing across a plurality of apparatuses.

The program may be executed by a single computer or by a plurality of computers. That is, either centralized processing or distributed processing may be performed.

The present invention is not limited to the above embodiments; various modifications are possible, and needless to say, they are also included within the scope of the present invention.

As described above, the speech processing apparatus according to the present invention has the effect of enabling highly accurate speech processing that matches the speaker characteristics of the person being evaluated, and is useful as a pronunciation rating apparatus, a karaoke rating apparatus, a speech recognition apparatus, and the like.

FIG. 1: Block diagram of the speech processing apparatus in Embodiment 1
FIG. 2: Flowchart explaining the operation of the speech processing apparatus in the same embodiment
FIG. 3: Flowchart explaining the vocal tract length normalization processing in the same embodiment
FIG. 4: Diagram showing an example of the HMM specifications in the same embodiment
FIG. 5: Diagram showing the measurement results of F1 and F2 in the same embodiment
FIG. 6: Diagram showing the speech analysis conditions in the same embodiment
FIG. 7: Diagram showing an example in which the calculated rating values are plotted as a graph in the same embodiment
FIG. 8: Diagram showing an example in which the calculated rating values are plotted as a graph in the same embodiment
FIG. 9: Diagram showing an output example in the same embodiment
FIG. 10: Diagram showing an output example in the same embodiment
FIG. 11: Block diagram of the speech processing apparatus in Embodiment 2
FIG. 12: Flowchart explaining the operation of the speech processing apparatus in the same embodiment
FIG. 13: Flowchart explaining the second sampling frequency calculation processing in the same embodiment
FIG. 14: Flowchart explaining the rating processing in the same embodiment
FIG. 15: Diagram showing the rating results (t-p-DAP score) in the same embodiment
FIG. 16: Diagram showing an output example in the same embodiment
FIG. 17: Block diagram of the speech processing apparatus in Embodiment 3
FIG. 18: Flowchart explaining the operation of the speech processing apparatus in the same embodiment
FIG. 19: Flowchart explaining the rating processing in the same embodiment
FIG. 20: Diagram explaining the detection of silence data in the same embodiment
FIG. 21: Block diagram of the speech processing apparatus in Embodiment 4
FIG. 22: Flowchart explaining the operation of the speech processing apparatus in the same embodiment
FIG. 23: Flowchart explaining the rating processing in the same embodiment
FIG. 24: Diagram explaining the detection of phoneme insertion in the same embodiment
FIG. 25: Diagram showing an output example in the same embodiment
FIG. 26: Block diagram of the speech processing apparatus in Embodiment 5
FIG. 27: Flowchart explaining the operation of the speech processing apparatus in the same embodiment
FIG. 28: Flowchart explaining the rating processing in the same embodiment
FIG. 29: Diagram explaining the detection of phoneme substitution in the same embodiment
FIG. 30: Block diagram of the speech processing apparatus in Embodiment 6
FIG. 31: Flowchart explaining the operation of the speech processing apparatus in the same embodiment
FIG. 32: Flowchart explaining the rating processing in the same embodiment
FIG. 33: Diagram explaining the detection of phoneme omission in the same embodiment
FIG. 34: Block diagram of the speech processing apparatus in Embodiment 7
FIG. 35: Flowchart explaining the operation of the speech processing apparatus in the same embodiment
FIG. 36: Block diagram of the speech processing apparatus in Embodiment 8
FIG. 37: Flowchart explaining the vocal tract length normalization parameter calculation processing in the same embodiment
FIG. 38: Flowchart explaining the operation of the speech processing apparatus in the same embodiment
FIG. 39: Flowchart explaining the vocal tract length normalization processing in the same embodiment
FIG. 40: Diagram explaining the concept of the vocal tract length normalization parameter calculation processing in the same embodiment
FIG. 41: Block diagram of the speech processing apparatus in Embodiment 9
FIG. 42: Flowchart explaining the vocal tract length normalization parameter calculation processing in the same embodiment
FIG. 43: Diagram explaining the concept of the vocal tract length normalization parameter calculation processing in the same embodiment
FIG. 44: Block diagram of the speech processing apparatus in Embodiment 10
FIG. 45: Flowchart explaining the vocal tract length normalization parameter calculation processing in the same embodiment
FIG. 46: Diagram explaining the concept of the vocal tract length normalization parameter calculation processing in the same embodiment
FIG. 47: Overview of the computer system constituting the speech processing apparatus
FIG. 48: Block diagram of the computer constituting the speech processing apparatus

Explanation of symbols

101 Input reception unit
102 Teacher data storage unit
103 Voice reception unit
104 Teacher data formant frequency storage unit
105 First sampling frequency storage unit
106 Sampling unit
107 Evaluation target person formant frequency acquisition unit
108 Evaluation target person formant frequency storage unit
109, 3609 Vocal tract length normalization processing unit
110, 1110, 1710, 2110, 2610, 3010, 3410 Audio processing unit
1101, 34101 Frame dividing means
1102, 34102 Frame audio data acquisition means
1103, 11103, 17103, 21103, 26103, 30103 Rating means
1104, 21104, 34102 Output means
1109 Utterance prompting unit
3601, 4101, 4401 Vocal tract length normalization parameter calculation unit
3602 Vocal tract length normalization parameter storage unit
11031 Optimal state determination means
11032 Optimal state probability value acquisition means
11033, 21023, 111033, 171033 Rating value calculation means
17101, 21101, 26101, 30101 Special speech detection means
34101 Speech recognition means
111032 Pronunciation interval frame phoneme probability value acquisition means
171011 Silence data storage means
171012 Silent section detection means
36011, 41011 Frequency range designation information storage means
36012 Long-time cepstrum average vector storage means
36013 Second cepstrum vector sequence calculation means
36014 Cepstrum conversion means
36015, 41015, 44015 Cepstrum conversion parameter calculation means
36016 Final cepstrum conversion parameter acquisition means
36017 Vocal tract length normalization parameter calculation means
41012, 44012 Learning acoustic data storage means
41013 Occupancy degree calculation means
44013 Optimal phoneme sequence acquisition means

Claims

A speech processing apparatus comprising:
a voice reception unit that receives voice;
a first sampling frequency storage unit that stores a first sampling frequency;
a vocal tract length normalization parameter storage unit that stores a vocal tract length normalization parameter, which is information on the conversion rate of the sampling frequency;
a vocal tract length normalization processing unit that obtains second audio data by sampling the voice received by the voice reception unit at a second sampling frequency calculated using the vocal tract length normalization parameter and the first sampling frequency as parameters;
an audio processing unit that processes the second audio data;
a teacher data storage unit that stores one or more pieces of teacher data, which is data on the speech to be compared and is data for one or more phonemes; and
a vocal tract length normalization parameter calculation unit that calculates the vocal tract length normalization parameter based on the teacher data and the voice received by the voice reception unit,
wherein the vocal tract length normalization parameter in the vocal tract length normalization parameter storage unit is the vocal tract length normalization parameter calculated by the vocal tract length normalization parameter calculation unit, and
the vocal tract length normalization parameter calculation unit comprises:
learning acoustic data storage means for storing learning acoustic data, which is a phoneme average cepstrum vector sequence in which phoneme average cepstrum vectors are arranged according to designated utterance content;
second cepstrum vector sequence calculation means for calculating a second cepstrum vector sequence by short-interval analysis of the voice received by the voice reception unit;
cepstrum conversion means for converting the second cepstrum vector sequence using a matrix whose elements are cepstrum conversion parameters, thereby calculating a frequency-warped third cepstrum vector sequence;
optimal phoneme sequence acquisition means for acquiring a phoneme sequence, which is a sequence of information identifying the phoneme corresponding to each frame of the voice uttered by the user;
cepstrum conversion parameter calculation means for calculating the cepstrum conversion parameter using the learning acoustic data, the third cepstrum vector sequence, and the phoneme sequence;
final cepstrum conversion parameter acquisition means for obtaining a cepstrum conversion parameter by repeating, based on a predetermined rule, the processing of the cepstrum conversion means, the processing of the optimal phoneme sequence acquisition means, and the processing of the cepstrum conversion parameter calculation means; and
vocal tract length normalization parameter calculation means for calculating the vocal tract length normalization parameter based on the cepstrum conversion parameter obtained by the final cepstrum conversion parameter acquisition means.

The speech processing apparatus according to claim 1, wherein, instead of the configuration according to claim 1, the vocal tract length normalization parameter calculation unit comprises:
learning acoustic data storage means for storing learning acoustic data, which is a concatenated HMM in which phoneme HMMs are concatenated according to designated utterance content;
second cepstrum vector sequence calculation means for calculating a second cepstrum vector sequence by short-interval analysis of the voice received by the voice reception unit;
cepstrum conversion means for converting the second cepstrum vector sequence using a matrix whose elements are cepstrum conversion parameters, thereby calculating a frequency-warped third cepstrum vector sequence;
occupancy degree calculation means for calculating an occupancy degree, which is a posterior probability of the voice received according to the designated utterance content;
cepstrum conversion parameter calculation means for calculating the cepstrum conversion parameter using the learning acoustic data, the third cepstrum vector sequence, and the occupancy degree;
final cepstrum conversion parameter acquisition means for obtaining a cepstrum conversion parameter by repeating, based on a predetermined rule, the processing of the cepstrum conversion means, the processing of the occupancy degree calculation means, and the processing of the cepstrum conversion parameter calculation means; and
vocal tract length normalization parameter calculation means for calculating the vocal tract length normalization parameter based on the cepstrum conversion parameter obtained by the final cepstrum conversion parameter acquisition means.

The speech processing apparatus according to claim 1 or 2, wherein the vocal tract length normalization parameter calculation means calculates a linear frequency expansion/contraction ratio from the cepstrum conversion parameter obtained by the final cepstrum conversion parameter acquisition means, and calculates the vocal tract length normalization parameter from the linear frequency expansion/contraction ratio.

The speech processing apparatus according to claim 2, wherein
the vocal tract length normalization parameter calculation unit further comprises frequency range designation information storage means for storing frequency range designation information, which is information designating a frequency range, and
the cepstrum conversion parameter calculation means calculates the cepstrum conversion parameter also using the frequency range designation information.

The speech processing apparatus according to any one of claims 1 to 4, wherein
the teacher data includes two or more state identifiers that identify states and information on transition probabilities between states,
the audio processing unit comprises:
frame dividing means for dividing the second audio data into frames;
frame audio data acquisition means for obtaining one or more pieces of frame audio data, which is audio data for each of the divided frames;
rating means for rating the voice received by the voice reception unit based on the teacher data and the one or more pieces of frame audio data; and
output means for outputting the rating result of the rating means, and
the rating means comprises:
optimal state determination means for determining an optimal state in which the transition probability of at least one of the one or more pieces of frame audio data is the highest;
optimal state probability value acquisition means for acquiring a probability value indicating the posterior probability of the optimal state determined by the optimal state determination means; and
rating value calculation means for calculating a rating value of the voice using, as a parameter, the ratio of the probability value acquired by the optimal state probability value acquisition means to the sum of the probability values in all states of the frame corresponding to that probability value.

The speech processing apparatus according to claim 5, wherein the rating means comprises:
optimal state determination means for determining an optimal state in which the transition probability of the one or more pieces of frame audio data is the highest;
pronunciation interval frame phoneme probability value acquisition means for acquiring, for each pronunciation interval, one or more probability values in all states of the phoneme that has the optimal state of each frame determined by the optimal state determination means; and
rating value calculation means for calculating a time average of two or more probability values for each of the one or more pronunciation intervals acquired by the pronunciation interval frame phoneme probability value acquisition means, and calculating a rating value of the voice using the time average as a parameter.

The speech processing apparatus according to claim 1, wherein
the speech processing apparatus is a karaoke rating apparatus,
the voice reception unit receives the singing voice of the person being evaluated, and
the audio processing unit rates the singing voice.

A digital signal processor comprising:
a first sampling frequency storage unit that stores a first sampling frequency;
a vocal tract length normalization parameter storage unit that stores a vocal tract length normalization parameter, which is information on the conversion rate of the sampling frequency;
a vocal tract length normalization processing unit that obtains second audio data by sampling the voice received by the voice reception unit at a second sampling frequency calculated using the vocal tract length normalization parameter and the first sampling frequency as parameters;
an audio processing unit that processes the second audio data;
a teacher data storage unit that stores one or more pieces of teacher data, which is data on the speech to be compared and is data for one or more phonemes; and
a vocal tract length normalization parameter calculation unit that calculates the vocal tract length normalization parameter based on the teacher data and the voice received by the voice reception unit,
wherein the vocal tract length normalization parameter in the vocal tract length normalization parameter storage unit is the vocal tract length normalization parameter calculated by the vocal tract length normalization parameter calculation unit, and
the vocal tract length normalization parameter calculation unit comprises:
learning acoustic data storage means for storing learning acoustic data, which is a phoneme average cepstrum vector sequence in which phoneme average cepstrum vectors are arranged according to designated utterance content;
second cepstrum vector sequence calculation means for calculating a second cepstrum vector sequence by short-interval analysis of the voice received by the voice reception unit;
cepstrum conversion means for converting the second cepstrum vector sequence using a matrix whose elements are cepstrum conversion parameters, thereby calculating a frequency-warped third cepstrum vector sequence;
optimal phoneme sequence acquisition means for acquiring a phoneme sequence, which is a sequence of information identifying the phoneme corresponding to each frame of the voice uttered by the user;
cepstrum conversion parameter calculation means for calculating the cepstrum conversion parameter using the learning acoustic data, the third cepstrum vector sequence, and the phoneme sequence;
final cepstrum conversion parameter acquisition means for obtaining a cepstrum conversion parameter by repeating, based on a predetermined rule, the processing of the cepstrum conversion means, the processing of the optimal phoneme sequence acquisition means, and the processing of the cepstrum conversion parameter calculation means; and
vocal tract length normalization parameter calculation means for calculating the vocal tract length normalization parameter based on the cepstrum conversion parameter obtained by the final cepstrum conversion parameter acquisition means.

The digital signal processor according to claim 8, wherein, instead of the configuration according to claim 8, the vocal tract length normalization parameter calculation unit comprises:
learning acoustic data storage means for storing learning acoustic data, which is a concatenated HMM in which phoneme HMMs are concatenated according to designated utterance content;
second cepstrum vector sequence calculation means for calculating a second cepstrum vector sequence by short-interval analysis of the voice received by the voice reception unit;
cepstrum conversion means for converting the second cepstrum vector sequence using a matrix whose elements are cepstrum conversion parameters, thereby calculating a frequency-warped third cepstrum vector sequence;
occupancy degree calculation means for calculating an occupancy degree, which is a posterior probability of the voice received according to the designated utterance content;
cepstrum conversion parameter calculation means for calculating the cepstrum conversion parameter using the learning acoustic data, the third cepstrum vector sequence, and the occupancy degree;
final cepstrum conversion parameter acquisition means for obtaining a cepstrum conversion parameter by repeating, based on a predetermined rule, the processing of the cepstrum conversion means, the processing of the occupancy degree calculation means, and the processing of the cepstrum conversion parameter calculation means; and
vocal tract length normalization parameter calculation means for calculating the vocal tract length normalization parameter based on the cepstrum conversion parameter obtained by the final cepstrum conversion parameter acquisition means.

A program for causing a computer to function as:
a voice reception unit that receives voice;
a first sampling frequency storage unit that stores a first sampling frequency;
a vocal tract length normalization parameter storage unit that stores a vocal tract length normalization parameter, which is information on the conversion rate of the sampling frequency;
a vocal tract length normalization processing unit that obtains second audio data by sampling the voice received by the voice reception unit at a second sampling frequency calculated using the vocal tract length normalization parameter and the first sampling frequency as parameters;
an audio processing unit that processes the second audio data;
a teacher data storage unit that stores one or more pieces of teacher data, which is data on the speech to be compared and is data for one or more phonemes; and
a vocal tract length normalization parameter calculation unit that calculates the vocal tract length normalization parameter based on the teacher data and the voice received by the voice reception unit,
wherein the vocal tract length normalization parameter in the vocal tract length normalization parameter storage unit is the vocal tract length normalization parameter calculated by the vocal tract length normalization parameter calculation unit, and
the vocal tract length normalization parameter calculation unit comprises:
learning acoustic data storage means for storing learning acoustic data, which is a phoneme average cepstrum vector sequence in which phoneme average cepstrum vectors are arranged according to designated utterance content;
second cepstrum vector sequence calculation means for calculating a second cepstrum vector sequence by short-interval analysis of the voice received by the voice reception unit;
cepstrum conversion means for converting the second cepstrum vector sequence using a matrix whose elements are cepstrum conversion parameters, thereby calculating a frequency-warped third cepstrum vector sequence;
optimal phoneme sequence acquisition means for acquiring a phoneme sequence, which is a sequence of information identifying the phoneme corresponding to each frame of the voice uttered by the user;
cepstrum conversion parameter calculation means for calculating the cepstrum conversion parameter using the learning acoustic data, the third cepstrum vector sequence, and the phoneme sequence;
final cepstrum conversion parameter acquisition means for obtaining a cepstrum conversion parameter by repeating, based on a predetermined rule, the processing of the cepstrum conversion means, the processing of the optimal phoneme sequence acquisition means, and the processing of the cepstrum conversion parameter calculation means; and
vocal tract length normalization parameter calculation means for calculating the vocal tract length normalization parameter based on the cepstrum conversion parameter obtained by the final cepstrum conversion parameter acquisition means.

The program according to claim 10, wherein, instead of the configuration according to claim 10, the vocal tract length normalization parameter calculation unit comprises:
learning acoustic data storage means for storing learning acoustic data, which is a concatenated HMM in which phoneme HMMs are concatenated according to designated utterance content;
second cepstrum vector sequence calculation means for calculating a second cepstrum vector sequence by short-interval analysis of the voice received by the voice reception unit;
cepstrum conversion means for converting the second cepstrum vector sequence using a matrix whose elements are cepstrum conversion parameters, thereby calculating a frequency-warped third cepstrum vector sequence;
occupancy degree calculation means for calculating an occupancy degree, which is a posterior probability of the voice received according to the designated utterance content;
cepstrum conversion parameter calculation means for calculating the cepstrum conversion parameter using the learning acoustic data, the third cepstrum vector sequence, and the occupancy degree;
final cepstrum conversion parameter acquisition means for obtaining a cepstrum conversion parameter by repeating, based on a predetermined rule, the processing of the cepstrum conversion means, the processing of the occupancy degree calculation means, and the processing of the cepstrum conversion parameter calculation means; and
vocal tract length normalization parameter calculation means for calculating the vocal tract length normalization parameter based on the cepstrum conversion parameter obtained by the final cepstrum conversion parameter acquisition means.