JP2015068897A

JP2015068897A - Evaluation method and device for utterance and computer program for evaluating utterance

Info

Publication number: JP2015068897A
Application number: JP2013201242A
Authority: JP
Inventors: ヘフェルナンダニエル (Daniel Heffernan); 田中　久美子 (Kumiko Tanaka)
Original assignee: University of Tokyo NUC
Current assignee: University of Tokyo NUC
Priority date: 2013-09-27
Filing date: 2013-09-27
Publication date: 2015-04-13

Abstract

PROBLEM TO BE SOLVED: To provide an utterance evaluation system with a useful method for extracting clarity as a feature for evaluating utterances. SOLUTION: An utterance evaluation device includes: an input part to which a voice signal of a speaker's free utterance is input; a feature extraction part that extracts features used for evaluation from the input voice signal; a feature evaluation part that compares the features extracted by the feature extraction part with reference features stored in advance; and an output part that outputs the comparison result. The features used for evaluation include at least the clarity of utterance, which is represented by the ratio of obstruents to sonorants in the input voice signal. The feature extraction part obtains the obstruent-to-sonorant ratio using means for dividing the input voice signal into a plurality of segments and means for classifying the obtained segments into obstruents and sonorants.

Description

The present invention relates to a method and an apparatus for evaluating utterances (manner of speaking, speech, and the like) and to a computer program for evaluating utterances.

Everyone has some habit or characteristic in the way they speak. For example, some people speak fast, some speak loudly, and some articulate poorly. Moreover, people are often unaware of their own habits.

A common way to correct one's manner of speaking is to have a third party listen to a speech and give feedback afterward. Alternatively, the speech is recorded and evaluated later.

Having one's speech evaluated by others is not very pleasant. In addition, the evaluation takes a fair amount of time, so it is difficult to give the speaker an opportunity to correct the speech while speaking.

It is therefore conceivable to have a machine, rather than a human, evaluate utterances. Many features of a user's utterance are known to affect its perceived quality. For example, the degree of stuttering, slurring, the accuracy of word pronunciation, the rate of word production, pitch variation, and loudness all affect how the quality of speech is perceived. Loudness and pitch, for instance, are easy for both humans and machines to extract, and existing methods are available. The clarity of an utterance, however, is hard for speakers themselves to notice, and no useful existing method recognizes it.

Speech processing has long been studied in many ways. Representative areas include speech synthesis, speech recognition, and speaker recognition, but none of these aims to evaluate ordinary natural speech and feed the result back to the speaker. For example, the purpose of automatic speech recognition is to convert utterances into text. Although speech recognition modules are installed in smartphones, they currently require input pronounced with a certain degree of clarity, and they do not target natural free utterances.

In the field of language learning, utterances are evaluated using speech processing (Non-Patent Documents 1 and 2). Specifically, a speaker's reading of a text prepared in advance is compared with a pre-recorded native speaker's reading of the same text; free utterances are not evaluated.

Speech processing extracts various features depending on the purpose, but for a feature to serve as feedback to the speaker, it must be something a human can control. Loudness is a typical example of a feature that a person can easily control. In the field of phonetics, attempts have been made to extract and evaluate features by speech processing (Non-Patent Document 3), but the features extracted there can be controlled by humans only to a limited extent, so they are difficult to use as feedback to the speaker.

The inventors also considered how utterance features can be extracted and useful feedback provided, particularly on a mobile device such as a smartphone. To evaluate utterances in real time, particularly on a mobile device such as a smartphone, it is important that the features be inexpensive to compute.

Non-Patent Document 1: Franco, H., Abrash, V., Precoda, K., et al. The SRI EduSpeak™ system: Recognition and pronunciation scoring for language learning. Proc. InSTILL '00 (2000).
Non-Patent Document 2: Lai, Y., Tsai, H., and Yu, P. A multimedia English learning system using HMMs to improve phonemic awareness for English learning. Educational Technology and Society (2009).
Non-Patent Document 3: Rusilo, L., and de Camargo, Z. The validity of some acoustic measures to predict voice quality settings: trends between acoustic and perceptual correlates of voice quality. In ExLing '11 (2011).

An object of the present invention is to provide an utterance evaluation system that includes clarity as a feature for evaluating utterances.
Another object of the present invention is to provide an utterance evaluation system that gives evaluation results to the speaker in real time on a mobile device such as a smartphone.

The technical means adopted by the present invention is an utterance evaluation device comprising:
an input unit to which a voice signal of a speaker's free utterance is input;
a feature extraction unit that extracts features used for evaluation from the input voice signal;
a feature evaluation unit that compares the features extracted by the feature extraction unit with reference features stored in advance; and
an output unit that outputs the comparison result,
wherein the features used for the evaluation include at least clarity of utterance,
the clarity of utterance is represented by an obstruent-to-sonorant ratio in the input voice signal, and
the feature extraction unit obtains the obstruent-to-sonorant ratio using means for dividing the input voice signal into a plurality of segments and means for classifying the obtained segments into obstruents and sonorants.

In one aspect, the features used for the evaluation further include loudness, and the feature extraction unit obtains the loudness from the energy of the input voice signal.

In one aspect, the features used for the evaluation further include a pause rate.
The pause rate is the proportion of the length of the voice signal during which nothing is spoken.
The feature extraction unit obtains this proportion using voice activity detection means.
In one aspect, the voice activity detection means detects voiced intervals using the energy of the input voice signal.
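The energy-based pause rate described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the 400-sample frame length, and the RMS threshold of 0.02 are assumptions chosen for the example.

```python
import math

def frame_rms(samples, frame_len):
    """Split samples into non-overlapping frames and return the RMS energy of each."""
    starts = range(0, len(samples) - frame_len + 1, frame_len)
    return [math.sqrt(sum(x * x for x in samples[s:s + frame_len]) / frame_len)
            for s in starts]

def pause_rate(samples, frame_len=400, threshold=0.02):
    """Fraction of frames whose RMS energy is below the (assumed) threshold."""
    energies = frame_rms(samples, frame_len)
    if not energies:
        return 0.0
    silent = sum(1 for e in energies if e < threshold)
    return silent / len(energies)

# Three frames of a 200 Hz tone followed by one frame of silence, at 16 kHz.
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(1200)]
signal = tone + [0.0] * 400
print(pause_rate(signal))  # 0.25
```

A fixed threshold is the simplest choice; in practice it would likely need to adapt to the microphone level and background noise.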

In one aspect, the features used for the evaluation further include pitch.
The feature extraction unit obtains the pitch using pitch extraction means.
In one aspect, the pitch extraction means uses an autocorrelation method.
In the embodiment described later, ESACF (enhanced summary autocorrelation function) is used as the autocorrelation method.
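To illustrate the general idea of autocorrelation-based pitch extraction, the sketch below uses a plain time-domain autocorrelation. It is deliberately not ESACF, which adds further processing steps; the function name, the 60 to 400 Hz search range, and the use of the normalized peak as a periodicity score are assumptions for this example.

```python
import math

def autocorr_pitch(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate pitch in Hz and a periodicity score from one frame.

    Plain autocorrelation (not ESACF). Periodicity is the autocorrelation
    at the best lag divided by the frame energy (the lag-0 value).
    """
    n = len(frame)
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0.0, 0.0
    lag_min = int(sample_rate / fmax)
    lag_max = min(int(sample_rate / fmin), n - 1)
    best_lag, best_r = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(n - lag)) / energy
        if r > best_r:
            best_lag, best_r = lag, r
    if best_lag == 0:
        return 0.0, 0.0
    return sample_rate / best_lag, best_r

# Synthetic 200 Hz tone, 1024 samples at 16 kHz.
sr = 16000
frame = [math.sin(2 * math.pi * 200 * n / sr) for n in range(1024)]
f0, periodicity = autocorr_pitch(frame, sr)
print(f0, round(periodicity, 2))
```

For this synthetic tone the best lag is 80 samples, so the estimate is 200 Hz. The periodicity score is the kind of by-product that could also feed a later obstruent/sonorant classifier.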

In one aspect, the features used for the evaluation further include speaking rate.
The speaking rate is represented by the number of utterance events within a predetermined time.
The feature extraction unit obtains the speaking rate by counting the number of utterance events within the predetermined time.
In one aspect, the utterance event is a syllable.
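Counting utterance events in a time window can be sketched as below. The function name and the example timestamps are hypothetical; the sketch assumes the event times (for example, syllable onsets in seconds) are already available and sorted.

```python
from bisect import bisect_left

def speaking_rate(event_times, window_start, window_seconds):
    """Events per second within [window_start, window_start + window_seconds)."""
    lo = bisect_left(event_times, window_start)
    hi = bisect_left(event_times, window_start + window_seconds)
    return (hi - lo) / window_seconds

# Hypothetical syllable onset times in seconds (sorted).
onsets = [0.1, 0.35, 0.6, 0.9, 1.2, 1.4, 1.8, 2.3]
print(speaking_rate(onsets, 0.0, 2.0))  # 3.5 syllables per second
```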

In one aspect, the function representing loudness, with energy on the vertical axis and time on the horizontal axis, is called the loudness function.
The feature extraction unit takes local energy minima of the loudness function as syllable boundaries.
In one aspect, the feature extraction unit comprises:
first means for obtaining a loudness function from the input voice signal;
second means for computing a convex hull function from the loudness function; and
third means for searching for the point at which the difference between the computed convex hull function and the corresponding loudness function is largest and, if that difference is greater than or equal to a threshold, determining that point to be a local energy minimum,
wherein the first means, the second means, and the third means are applied recursively.
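The recursive convex-hull search described above might look like the following sketch. It assumes the "convex hull function" is the smallest curve that lies on or above the loudness function while rising monotonically to the peak and falling monotonically after it; that interpretation, the function names, and the threshold value are assumptions, not details confirmed by the text.

```python
def unimodal_hull(loud):
    """Smallest curve >= loud that rises to the global peak and then falls."""
    peak = max(range(len(loud)), key=loud.__getitem__)
    hull = list(loud)
    for i in range(1, peak + 1):                    # running maximum up to the peak
        hull[i] = max(hull[i], hull[i - 1])
    for i in range(len(loud) - 2, peak - 1, -1):    # running maximum from the end back
        hull[i] = max(hull[i], hull[i + 1])
    return hull

def find_boundaries(loud, lo=0, hi=None, threshold=2.0):
    """Recursively collect indices of local energy minima (syllable boundaries)."""
    if hi is None:
        hi = len(loud)
    if hi - lo < 3:
        return []
    segment = loud[lo:hi]
    hull = unimodal_hull(segment)
    diffs = [h - x for h, x in zip(hull, segment)]
    k = max(range(len(diffs)), key=diffs.__getitem__)
    if diffs[k] < threshold:
        return []                                   # no dip deep enough: one syllable
    cut = lo + k
    return (find_boundaries(loud, lo, cut, threshold) + [cut]
            + find_boundaries(loud, cut, hi, threshold))

# Toy loudness function with three bumps separated by two dips.
loudness = [0, 5, 9, 5, 1, 6, 10, 6, 2, 7, 9, 4, 0]
print(find_boundaries(loudness))  # [4, 8]
```

The hull touches the loudness function at both ends of every segment, so each cut falls strictly inside the segment and the recursion always terminates. The number of syllables in the toy example is the number of boundaries plus one.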

In one aspect, in the means for dividing the input voice signal into a plurality of segments, the segments are syllables, and the feature extraction unit obtains the plurality of segments using the syllable boundaries found when obtaining the speaking rate.

In one aspect, the means for classifying the segments into obstruents and sonorants uses supervised machine learning means.
In one aspect, the supervised machine learning means uses the segment duration, periodicity, and mel-frequency cepstral coefficients (MFCCs) as the feature vector. The periodicity can be obtained, for example, during pitch extraction. The computation of mel-frequency cepstral coefficients is well known to those skilled in the art.
In one aspect, the supervised machine learning means is a support vector machine (SVM).
Although training an SVM can be expensive, classification is fast and computationally light, which is advantageous.
The supervised machine learning means applicable to the present invention is not limited to an SVM and may be selected from existing means such as a naive Bayes classifier, the k-nearest neighbor method, Bayesian networks, HMMs (hidden Markov models), GMMs (Gaussian mixture models), clustering, discriminant analysis (linear discriminant functions or Mahalanobis distance), logistic regression analysis, and decision trees.

In one aspect, the obstruent-to-sonorant ratio is the ratio of the number of obstruent segments to the number of sonorant segments.
The higher the energy of the obstruents, the higher the clarity is considered to be. If the number of obstruent segments in an utterance of a given duration is large (as seen in its ratio to sonorants), the obstruent energy can be regarded as high, and the clarity can be judged to be high.
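As a small worked example, the count-based measure can be computed from a list of segment labels. The label strings and function names below are assumptions. Two variants are shown: the obstruent-to-sonorant count ratio stated above, and the obstruent share of all segments, which is the form used in the clarity-score example given in the figure descriptions (4 obstruents and 3 sonorants give 0.57).

```python
def count_labels(labels):
    """Return (number of obstruent segments, number of sonorant segments)."""
    obstruents = sum(1 for label in labels if label == "obstruent")
    sonorants = sum(1 for label in labels if label == "sonorant")
    return obstruents, sonorants

def obstruent_sonorant_ratio(labels):
    """Obstruent count divided by sonorant count, or None if there are no sonorants."""
    obstruents, sonorants = count_labels(labels)
    return obstruents / sonorants if sonorants else None

def clarity_score(labels):
    """Obstruent count divided by the total number of classified segments."""
    obstruents, sonorants = count_labels(labels)
    total = obstruents + sonorants
    return obstruents / total if total else None

labels = ["obstruent", "sonorant", "obstruent", "sonorant",
          "obstruent", "sonorant", "obstruent"]
print(round(clarity_score(labels), 2))  # 0.57
```

Both variants increase monotonically with the number of obstruent segments, so they rank utterances the same way; only the scale differs.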

In one aspect, the segments are phones or phonemes, and the obstruent-to-sonorant ratio is the ratio of obstruent energy to sonorant energy.
Phone or phoneme boundaries are extracted from the voice signal, each phone or phoneme is classified as either an obstruent or a sonorant, and the energy of the obstruents is measured; the higher the obstruent energy, the clearer the utterance is taken to be.
How high the obstruent energy is can be judged from the level of obstruent energy in an utterance of a given duration, and that level can be judged using the ratio of obstruent energy to sonorant energy (including any substantially equivalent index). If the ratio is high (the obstruent energy is high), the utterance is judged to be clear.
RMS, for example, can be used as the obstruent energy, but other measures of loudness may also be used.
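The energy-based variant can be sketched with RMS as the energy measure. The data layout (a list of label and sample pairs), the label strings, and the choice to pool all samples of each class before taking the RMS are assumptions made for this illustration.

```python
import math

def rms(samples):
    """Root mean square of a sample list; 0.0 for an empty list."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def energy_ratio(segments):
    """RMS of all obstruent samples divided by RMS of all sonorant samples.

    segments is a list of (label, samples) pairs. Returns None when the
    sonorant energy is zero, since the ratio is undefined in that case.
    """
    obstruent = [x for label, seg in segments if label == "obstruent" for x in seg]
    sonorant = [x for label, seg in segments if label == "sonorant" for x in seg]
    sonorant_rms = rms(sonorant)
    if sonorant_rms == 0.0:
        return None
    return rms(obstruent) / sonorant_rms

# Toy segments: a quiet obstruent and a louder sonorant.
segments = [("obstruent", [0.1, -0.1]), ("sonorant", [0.4, -0.4, 0.4, -0.4])]
print(round(energy_ratio(segments), 2))  # 0.25
```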

In one aspect, the reference features stored in the feature evaluation unit characterize the utterance that each user is aiming for.

In one aspect, the output unit is a display unit that shows the comparison result visually.
In one aspect, the feature extraction unit, the feature evaluation unit, and the output unit operate on the input voice signal in real time.

In one aspect, the utterance evaluation device is implemented on a mobile device such as a smartphone.

In one aspect, each feature is the average value of that feature over a set time period.
In one aspect, the standard deviation of the feature over the set time period is also calculated.
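A sliding window over per-frame feature values is one straightforward way to obtain the mean and standard deviation over a set time period. In this sketch the class name is hypothetical, the 40 frames-per-second rate is taken from the embodiment described later, and the use of the population standard deviation is an assumption (the text does not say which form is used).

```python
from collections import deque
from statistics import mean, pstdev

class FeatureWindow:
    """Keeps the most recent per-frame values of one feature."""

    def __init__(self, window_seconds, frame_rate=40):
        # Number of frames that fit in the configured time period.
        self.values = deque(maxlen=max(1, round(window_seconds * frame_rate)))

    def push(self, value):
        self.values.append(value)

    def summary(self):
        """Return (mean, population standard deviation), or None if empty."""
        if not self.values:
            return None
        return mean(self.values), pstdev(self.values)

window = FeatureWindow(window_seconds=0.1)   # 4 frames at 40 frames per second
for value in [1, 2, 3, 4, 5, 6]:
    window.push(value)                        # only 3, 4, 5, 6 remain
print(window.summary())
```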

Another technical means adopted by the present invention is an utterance evaluation method comprising:
a step of inputting a voice signal of a speaker's free utterance;
a feature extraction step of extracting features used for evaluation from the input voice signal;
a feature evaluation step of comparing the features extracted in the feature extraction step with reference features stored in advance; and
a step of outputting the comparison result,
wherein the features used for the evaluation include at least clarity of utterance,
the clarity of utterance is represented by an obstruent-to-sonorant ratio in the input voice signal, and
the feature extraction step includes a step of dividing the input voice signal into a plurality of segments, a step of classifying the obtained segments into obstruents and sonorants, and a step of obtaining the obstruent-to-sonorant ratio.

The descriptions of the various aspects given above for the utterance evaluation device also apply as aspects that limit the utterance evaluation method.

Another technical means adopted by the present invention is a computer program for causing a computer to execute the utterance evaluation method.

According to the present invention, it is possible to provide an utterance evaluation technique that includes, as one of its features, the clarity of utterance, which is hard for speakers themselves to notice and for which no useful recognition method previously existed.
According to the present invention, feeding back features that the speaker can control allows speakers to find and correct problems in their own utterances.
According to the present invention, storing reference features for a target or ideal utterance allows users to adjust their own utterances toward the target or ideal speech.
The features used in the present invention are relatively light to compute, enabling real-time feedback on resource-limited mobile devices such as smartphones.

FIG. 1 is an overall schematic diagram of the utterance evaluation system according to the present invention.
FIG. 2 is a front view of a smartphone running the utterance evaluation system; the left side shows the overview screen and the right side shows the detail screen.
FIG. 3 is a state transition diagram of the user interface of the utterance evaluation system.
FIG. 4 is a diagram outlining the architecture of the utterance evaluation system.
FIG. 5 is a diagram showing the share of CPU usage taken by the utterance evaluation system on a smartphone running it.
FIG. 6 is a diagram showing details of the feature extraction unit of the utterance evaluation system according to the present embodiment.
FIG. 7 shows autocorrelation values obtained by the ESACF algorithm applied to two 500 ms samples of speech. The speech was sampled at 16000 Hz and ESACF was applied with W = 1024. The upper panel shows high periodicity and low frequency; the lower panel shows low periodicity, noise, and high frequency.
FIG. 8 shows, in the upper panel, the original waveform and phone data from the TIMIT speech corpus and, in the lower panel, the loudness function obtained from it. The utterance is speaker FAEM0, sample SA2: "[...] carry an oily rag [...]."
FIG. 9 shows, in the upper panel, the same content as FIG. 8 and, in the lower panel, the convex hull of the loudness function of FIG. 8 and the segment boundaries.
FIG. 10 compares the results of the speaking-rate extraction algorithm of the present embodiment with the actual speaking rate for speech corpus samples.
FIG. 11 is similar to FIG. 9 and shows syllable segment boundaries on the loudness function.
FIG. 12 shows the syllable segments of FIG. 11 classified into obstruents (O) and sonorants (C). In this example, clarity score = number of obstruents / total number of segments = 4 / (4 + 3) = 0.57.
FIG. 13 shows the distribution of clarity values for clear speech (upper panel) and casual speech (lower panel) in the clearspeechjph S teaching material.
FIG. 14(a) shows the average value of each feature for four speakers over one minute. FIG. 14(b) shows the standard deviation of pitch and speaking rate for each speaker.

[A] Overview of the utterance evaluation system
As shown in FIG. 1, the utterance evaluation system according to the present embodiment includes an input unit (microphone) to which a voice signal of a speaker's free utterance is input, a feature extraction unit that extracts features used for evaluation from the input voice signal, a feature evaluation unit that compares the features extracted by the feature extraction unit with reference features stored in advance, and an output unit (screen) that outputs the comparison result. The system can be implemented on a computer (comprising an input unit, a display, a storage unit such as RAM and ROM, a processing unit built around a CPU, and so on).

The feature extraction unit is implemented by the processing unit of the computer and extracts the features of an utterance by processing the sound signal obtained by the microphone. In the present embodiment, the feature extraction unit obtains five features: loudness, pause rate, pitch, speaking rate, and clarity. These features are described in detail later.

The reference feature storage unit is implemented by the storage unit of the computer and stores reference features corresponding to the features extracted by the feature extraction unit, namely five reference features: loudness, pause rate, pitch, speaking rate, and clarity. In training mode, the features of a reference utterance (input from the microphone, or taken from a media library in which utterances are recorded) are extracted by the feature extraction unit, processed by the feature learning unit, and stored in the reference feature storage unit. The actual values of the reference features can differ from user to user.

The feature evaluation unit is implemented by the processing unit of the computer. It compares the features of the input utterance extracted by the feature extraction unit with the reference features stored in the reference feature storage unit and displays the comparison result on the screen.

The left side of FIG. 2 is the screen that displays an overview of the evaluation results. For the five features Energy (loudness), Pause (pause rate), Pitch (pitch), Speed (speaking rate), and Clarity (clarity), the reference features (training data) are placed at the vertices of a regular pentagon. The features of the user's utterance are displayed on a scale relative to the reference features. The average value of each feature of the user's utterance over a predetermined time (set by the user) forms a vertex, and the thick white lines are the sides connecting these vertices. The offset between each vertex of the user's pentagon and the regular pentagon represents the deviation from the reference feature.
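One way to place the user's features on such a chart is sketched below. The text only says the user's features are shown on a scale relative to the reference; using the plain ratio of user value to reference value, the axis order, and all numeric values here are assumptions for illustration.

```python
import math

FEATURES = ["Energy", "Pause", "Pitch", "Speed", "Clarity"]

def relative_scale(user, reference):
    """Ratio of each user feature to its reference value (1.0 means on the reference)."""
    return {name: user[name] / reference[name] for name in FEATURES}

def pentagon_vertices(ratios, radius=1.0):
    """Chart coordinates (x, y); a ratio of 1.0 lies on the regular pentagon."""
    points = []
    for k, name in enumerate(FEATURES):
        # First axis points straight up; the rest follow clockwise.
        angle = math.pi / 2 - 2 * math.pi * k / len(FEATURES)
        r = radius * ratios[name]
        points.append((r * math.cos(angle), r * math.sin(angle)))
    return points

# Hypothetical reference and user feature values.
reference = {"Energy": 0.2, "Pause": 0.25, "Pitch": 180.0, "Speed": 4.0, "Clarity": 0.5}
user = {"Energy": 0.1, "Pause": 0.25, "Pitch": 180.0, "Speed": 5.0, "Clarity": 0.25}
ratios = relative_scale(user, reference)
vertices = pentagon_vertices(ratios)
print(ratios)
```

In this example the user speaks more quietly and less clearly than the reference, so the Energy and Clarity vertices fall inside the regular pentagon, while the Speed vertex falls outside it.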

The screen also shows the region of the mean ± 1 standard deviation as a band along the sides of the pentagon connecting the user's features (see the left side of FIG. 2). As the standard deviation of a feature grows, the band widens and its width is no longer uniform. The standard deviation indicates how each feature varies over the predetermined time. Showing this temporal variation makes secondary features visible. For example, large variations in loudness and pitch over the predetermined time indicate dynamic speech (conversely, small variations indicate monotonous speech), and small variations in speaking rate indicate a steady rhythm.

Below the overview screen is a slider for setting the time frame over which the mean and standard deviation of the features are calculated. The slider consists of a horizontal track and a round button that moves along the track. By dragging the slider, for example with a thumb, the user can set the time frame to the desired length (for example, any length between 1 and 30 seconds).

A training button (Train) and a detail button (Details) are located at the upper right of the overview screen. Touching the detail button switches the screen to the detail view (see the right side of FIG. 2). Touching the training button switches to training mode. The utterance evaluation system can be trained either by live recording of speech from the device's microphone or by loading an audio recording from the media library stored in the device's storage unit (see FIG. 3). Training can be done offline.

The user interacts with the utterance evaluation system by speaking and receives visual feedback, and can compare their own manner of speaking with that of another person (or with a sample of their own speech). By training the system on a model manner of speaking that they want to imitate, users can see the difference between the target manner of speaking and their own and can try to adjust their speech toward the target. The utterance evaluation application runs on a smartphone. When the user's voice is input, it extracts the five features (loudness, pause rate, pitch, speaking rate, and clarity) and gives the user graphical feedback in real time in an easily understood form. Utterance evaluation can be run during a speech or presentation, and also during a phone call (for example, when a headset, hands-free unit, or speakerphone leaves the screen visible).

In the present embodiment, the utterance evaluation system runs on a smartphone. A smartphone has the hardware elements of a general-purpose computer, so the utterance evaluation system can be built on one. However, as shown in Table 1, a smartphone is more limited than a personal computer in CPU and storage capacity, so the digital signal processing and the on-screen rendering must be carried out with less computation.

FIG. 4 outlines the architecture of the utterance evaluation system according to the present embodiment. In this embodiment, a multithreaded model supports a computationally light architecture. The threads are an audio capture thread, a speech processing thread, and a rendering thread, which exchange data through job queues managed by Grand Central Dispatch. Audio capture is performed through the iOS Audio Unit API, which produces a stream of 25 ms non-overlapping frames of 16 kHz audio, giving a system that processes frames at 40 Hz. The speech processing algorithms (vector, matrix, and fast Fourier transform (FFT) computations) run on vDSP, which provides an API to SIMD (single instruction, multiple data) CPU instructions. Graphics rendering uses both Core Graphics ("Quartz") and OpenGL. Quartz is used when its advanced drawing capabilities are needed, and Quartz image data is cached in OpenGL texture memory on the device's GPU. OpenGL is used for all other drawing and can access the cached Quartz image data directly when needed.
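The frame arithmetic and the queue-based hand-off between threads can be illustrated with a small Python stand-in. This is not the iOS implementation: Python's `queue` and `threading` modules replace Grand Central Dispatch and the Audio Unit API, the function names are hypothetical, and per-frame RMS stands in for the full feature extraction.

```python
import math
import queue
import threading

SAMPLE_RATE = 16000
FRAME_MS = 25
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000    # 400 samples per frame
FRAMES_PER_SECOND = 1000 // FRAME_MS          # 40 frames per second

def capture(samples, frames_q):
    """Capture stage: emit non-overlapping frames, then a None sentinel."""
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN):
        frames_q.put(samples[start:start + FRAME_LEN])
    frames_q.put(None)

def process(frames_q, results_q):
    """Processing stage: compute the RMS energy of each frame."""
    while True:
        frame = frames_q.get()
        if frame is None:
            results_q.put(None)
            return
        results_q.put(math.sqrt(sum(x * x for x in frame) / FRAME_LEN))

def run_pipeline(samples):
    """Run both stages on worker threads and collect results on the caller's thread."""
    frames_q, results_q = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=capture, args=(samples, frames_q)),
               threading.Thread(target=process, args=(frames_q, results_q))]
    for worker in workers:
        worker.start()
    results = []
    while True:
        item = results_q.get()
        if item is None:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results

energies = run_pipeline([0.5] * SAMPLE_RATE)   # one second of constant signal
print(len(energies), energies[0])              # 40 0.5
```

Decoupling the stages with queues means a slow rendering or processing step does not block audio capture, which is the property the multithreaded design is after.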

FIG. 5 shows CPU usage when the utterance evaluation system runs for 10 seconds on a smartphone (an iPhone 4 with a single-core A4 CPU, running iOS 5.1). The utterance evaluation system accounts for 31.9% of CPU usage (A). Of that, audio processing accounts for 70.8% (B). Panel (C) shows a breakdown of the share taken by each feature within audio processing. CPU time increases as the features become more advanced.
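As a quick check of how figures (A) and (B) combine, the share of total CPU time spent on audio processing is the product of the two percentages:

```latex
0.319 \times 0.708 \approx 0.226
```

That is, roughly 22.6% of total CPU time goes to audio processing, which leaves roughly 9.3% (31.9% minus 22.6%) for the rest of the system's work.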

[B] Feature Extraction
Many features of a user's speech are known to affect the quality perceived by human listeners. For example, the degree of stuttering, slurring, accuracy of word pronunciation, rate of word production, pitch variation, and loudness all make a difference in the perceived quality of speech. The inventors considered how speech features could be extracted with a mobile device such as a smartphone to provide useful feedback. This embodiment proposes a technique that can be executed online while keeping computation and memory requirements as low as possible. In this embodiment, five features were selected for extraction: loudness, pause ratio, pitch, speaking speed, and clarity.

Table 2 shows the hierarchy of speech processing and speech features, together with the acoustic and speech features. Processing becomes more advanced in the order of digital signal processing (DSP), automatic speech recognition (ASR), and natural language processing (NLP), and the features obtained at each level differ. In this embodiment, the five features (loudness, pause ratio, pitch, speaking speed, and clarity) are extracted by DSP. The features that can be used in the present invention are not limited to these five; other features (such as stuttering or habitual sentence endings) may also be included.

Of the five features used in this embodiment, some can be obtained with existing techniques, as shown in Table 3. Lower-level features support higher-level ones. For example, clarity of speech depends on speaking speed and pitch, and speaking speed depends on loudness. The following description proceeds from the simplest feature (loudness) to the most complex (clarity).

As shown in FIG. 6, the audio signal input from the microphone undergoes FFT processing, the RMS amplitude is obtained, and the loudness feature is derived from it. Using this RMS amplitude, speech intervals are detected by the VAD algorithm and the pause ratio is obtained. Pitch is extracted by applying an autocorrelation method (ESACF) to the audio signal. In the fundamental frequency calculation for pitch extraction, unvoiced intervals are removed using the VAD algorithm. Mermelstein's syllabic-unit segmentation algorithm (hereinafter, the "Mermelstein algorithm") is applied to the loudness function obtained for the audio signal (a function with energy on the vertical axis and time on the horizontal axis) to obtain segments, and the speaking speed is obtained from the number of segments per unit time. Using these segments, the obstruent/sonorant classification means separates obstruents from sonorants to obtain the obstruent-to-sonorant ratio, which serves as the clarity score. The scores for the five features (loudness, pause ratio, pitch, speaking speed, and clarity) are averages over a predetermined time. The standard deviation over the predetermined time is also calculated for these features. For example, the standard deviation of pitch can be used as an indicator of speech intonation (a secondary feature).
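The per-feature score and its variation described above amount to a mean and a standard deviation over the per-frame values collected in the predetermined time. A minimal sketch follows; the function name `summarize` is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not say which is used.

```python
from statistics import fmean, pstdev


def summarize(values):
    """Return (mean, standard deviation) of per-frame feature values.

    Assumption: population standard deviation; the source does not specify.
    """
    if not values:
        raise ValueError("need at least one value")
    return fmean(values), pstdev(values)
```

For pitch, the first element would serve as the pitch score and the second as the intonation indicator.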

[B-1] Loudness
In this embodiment, the well-known measure of sound intensity is used to express loudness in a way that matches human perception. Intensity is calculated for each acoustic frame (for example, every 25 ms). It is a logarithmic measure of acoustic energy (here on the decibel scale) and approximates the human perception of volume. The RMS amplitude, for example, can be used as the acoustic energy.

That is, in this embodiment, loudness is expressed by the following expression.
Here, x is a vector representing an acoustic frame consisting of N samples of speech waveform (amplitude) values.
For sound intensity, see: Jurafsky, D., Martin, J., Kehler, A., et al. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, second ed. Prentice Hall, New Jersey, 2008.
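The loudness expression itself does not survive in this text. A form consistent with the surrounding description (a decibel-scale logarithm of the RMS amplitude of an N-sample frame x) would be the following. The symbol L(x), the factor of 20, and the absence of a reference amplitude are assumptions made here for illustration and are not taken from the source.

```latex
L(x) = 20 \log_{10} \sqrt{\frac{1}{N} \sum_{i=1}^{N} x_i^{2}}
```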

Sound intensity is computationally light and general-purpose. Experiments confirmed that the intensity-based method works well as long as the user's distance from the microphone does not change significantly.

[B-2] Pause Ratio
The pause ratio is defined as the amount of non-speaking time relative to the total length of the signal. A voice activity detection (VAD) algorithm is used to identify when the acoustic signal contains a human voice. Distinguishing speech from noise is a basic task in the speech processing domain, and various methods have been proposed, as described in the following work: Ramirez, J., Górriz, J., and Segura, J. Voice activity detection. Fundamentals and speech recognition system robustness. Robust Speech Recognition and Understanding (2007).

In this embodiment, an energy-based method is used to ensure computationally light feature estimation. The energy values (RMS amplitudes) of the most recent t seconds are stored in a circular buffer, and the minimum energy value is taken as the noise level. Energy that exceeds this minimum by more than a predetermined threshold is assumed to be speech content. As non-limiting values in this embodiment, t = 5 seconds and the threshold is 6 dB. This method is sensitive to noise, but in a relatively quiet environment it is effective and computationally cheap. Because it reuses the RMS amplitude already obtained for the loudness feature, rather than computing VAD separately and independently, little additional computation is required. Various VAD methods are known, and the VAD usable in the present invention is not limited to the above method.

[B-3] Pitch
The fundamental frequency (f₀) of the speech signal is estimated using an autocorrelation algorithm, and this estimate is used as the pitch measurement. In this embodiment, f₀ is estimated by applying the enhanced summary autocorrelation function (ESACF) to each frame. ESACF is a time-domain pitch estimation method (since it is entirely an autocorrelation method), and among the several algorithms tried it offered the best balance between accuracy and light computation. ESACF is publicly known; for details, see, for example: Tolonen, T., and Karjalainen, M. A computationally efficient multipitch analysis model. IEEE Trans. Speech Audio Process (2000). Mazzoni, D., and Dannenberg, R. Melody matching directly from audio. In Proc. ISMIR '01, Indiana University (2001).

In this embodiment, the above method was chosen instead of the common approach based on the frequency spectrum. The simplest approach would be to present the user with a frequency-domain spectrogram of the speech signal, but spectrograms are not easy to interpret. It is possible to estimate the fundamental frequency f₀ from the spectrum by analyzing peak positions, but this is not practical, because human speech is complex and contains many different frequencies and harmonics.

Autocorrelation algorithms, including ESACF, work by computing the correlation of the signal with itself at a number of time lags. An overlapping window of W samples is moved along the signal S, and the algorithm computes the correlation for lags 1 < τ < W/2. Each lag τ represents a frequency of (sample rate / τ) Hz, and the value at that lag represents the strength of that frequency in the signal. The fundamental frequency f₀ (that is, the pitch) can therefore be estimated by the following equation.
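The equation itself is not reproduced in this text. Based on the definitions above (each lag τ corresponds to a frequency of sample rate / τ, and the autocorrelation value at that lag gives the strength of that frequency), one consistent reading is that f₀ is the frequency of the strongest lag. The symbols f_s (sample rate) and R(τ) (autocorrelation value at lag τ) are introduced here for illustration and are assumptions.

```latex
f_0 = \frac{f_s}{\hat{\tau}}, \qquad \hat{\tau} = \arg\max_{1 < \tau < W/2} R(\tau)
```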

For frames that contain no voiced signal, ESACF tends to produce large values at small time lags, indicating a high f₀. FIG. 7 illustrates this problem with a large peak at 3200 Hz in the unvoiced sample (bottom). In this embodiment, these peaks are filtered using the VAD algorithm described above, which removes unvoiced sound from the f₀ estimation, and sudden increases of more than a factor of two in frequency (one octave) over one frame are ignored, which filters out unvoiced but high-energy speech. Various pitch estimation methods are known, and the pitch estimation method usable in the present invention is not limited to ESACF. Known pitch extraction methods for obtaining the fundamental frequency or pitch period of voiced sound include time-domain techniques (such as autocorrelation) and frequency-domain techniques (which detect the fundamental frequency from the harmonic structure of the spectrum), and these can also be used.
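The two filtering steps described above can be sketched as a pass over per-frame f₀ estimates. This is a minimal sketch under stated assumptions: the function name `filter_pitch_track` is hypothetical, the per-frame estimates and VAD flags are taken as given inputs, and comparing each frame against the last accepted estimate is one possible reading of "over one frame".

```python
def filter_pitch_track(f0_values, speech_flags):
    """Keep f0 estimates from speech frames, ignoring upward jumps of more than one octave.

    f0_values: per-frame f0 estimates in Hz.
    speech_flags: per-frame booleans from the VAD (True = speech).
    """
    kept = []
    last = None                      # last accepted estimate (assumption, see lead-in)
    for f0, is_speech in zip(f0_values, speech_flags):
        if not is_speech:
            continue                 # VAD filter: drop frames without speech
        if last is not None and f0 > 2.0 * last:
            continue                 # sudden increase of more than one octave
        kept.append(f0)
        last = f0
    return kept
```

For example, a track of 3200, 120, 125, 3200, 130, 128 Hz with the first and last frames flagged as non-speech reduces to 120, 125, 130 Hz.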

[B-4] Speed
To measure speed, the number of speech events is counted over time. These events are separated by energy peaks in the speech signal and represent consonants, syllable nuclei (central vowels), compound syllables, and the like.

In this embodiment, speed is extracted in real time using a variant of the Mermelstein algorithm. The Mermelstein algorithm itself is publicly known; see, for example:
Mermelstein, P. Automatic segmentation of speech into syllabic units. Journal of the Acoustical Society of America (JASA) (1975).
Villing, R., Ward, T., and Timoney, J. Performance limits for envelope based automatic syllable segmentation. In Proc. ISSC '06, IET (2006).

The inventors adopted the Mermelstein algorithm after comparing it with speed calculation based on ASR (automatic speech recognition) output and on other algorithms. One option is to count words, syllables, or phones in the text output of an ASR system (Litman, D. J., Hirschberg, K. B., and Swerts, M. Predicting automatic speech recognition performance using prosodic cues. In Proc. NAACL '00, ACM (2000)). However, in addition to the language and dialect dependence that can be inherent in ASR, current ASR has not yet reached a technical level at which it can accurately recognize natural, large-vocabulary speech in real time on a smartphone.

As a DSP approach, the inventors also considered directly searching for minima in the stream of intensity values produced by the loudness feature calculation and treating the portions containing sound as speech events. However, a resolution of one frame length is too coarse, which could affect the performance of the clarity feature described later. The segmentation method for speed calculation in the present invention is not limited to the Mermelstein algorithm and does not exclude methods that use ASR output or other algorithms.

The Mermelstein algorithm is applied directly to the waveform, which fits the requirement for light computation. In this application, downsampling reduces the number of samples processed in convex hull generation with almost no loss of resolution.

The Mermelstein algorithm computes the convex hull of the loudness function of the sound signal and detects high-energy intervals that correspond well to syllables. FIGS. 8 and 9 show an example of this procedure. The upper graph shows the original waveform (amplitude), and the lower graph shows the loudness function. The convex hull is shown hatched. The maximum of the difference between the loudness function and its convex hull is used to find a minimum in the loudness function. A maximum difference that exceeds a threshold is determined to be a syllabic-unit segment boundary, and the signal is split there. The algorithm is applied repeatedly until no further significant minima are found.
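The recursive splitting described above can be sketched as follows. This is a minimal offline sketch, not the embodiment's code: the function names are hypothetical, the hull is taken to be the smallest curve that lies on or above the loudness values while rising monotonically to the peak and falling monotonically after it, and the threshold (assumed non-negative) is a free parameter whose value the text does not give.

```python
def unimodal_hull(loudness):
    """Smallest curve >= loudness that rises to the peak and then falls."""
    n = len(loudness)
    peak = max(range(n), key=loudness.__getitem__)
    hull = list(loudness)
    for i in range(1, peak + 1):              # running maximum up to the peak
        hull[i] = max(hull[i - 1], loudness[i])
    for i in range(n - 2, peak - 1, -1):      # running maximum from the end back to the peak
        hull[i] = max(hull[i + 1], loudness[i])
    return hull


def find_boundaries(loudness, threshold, start=0):
    """Return sorted indices of segment boundaries found by recursive splitting.

    threshold must be non-negative; start is the absolute index of loudness[0].
    """
    if len(loudness) < 3:
        return []
    hull = unimodal_hull(loudness)
    gaps = [h - v for h, v in zip(hull, loudness)]
    split = max(range(len(gaps)), key=gaps.__getitem__)
    if gaps[split] <= threshold:
        return []                             # no significant minimum left
    left = find_boundaries(loudness[:split + 1], threshold, start)
    right = find_boundaries(loudness[split:], threshold, start + split)
    return left + [start + split] + right
```

With loudness values 0, 8, 1, 9, 2, 7, 0 and a threshold of 2, the deepest dip below the hull is at index 2, and the right-hand part then splits again at index 4, giving three segments.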

The Mermelstein algorithm in this embodiment differs from the original in that it runs online. The originally proposed algorithm assumes that the full length of the signal is known and stored in memory (that is, it runs offline). This is needed to find the maximum of the loudness function in order to generate the convex hull. The signal begins and ends at a minimum (silence) of the loudness function with a maximum in between, and the maximum is the point at which the convex hull changes from a monotonically increasing function to a monotonically decreasing one. In this embodiment, to run the Mermelstein algorithm online, convex hull generation and the iterative minimum search are applied to a variable window of the sound signal that starts at the most recent segment boundary and ends at the end of the most recent frame. The most recent segment boundary is a minimum of the loudness function, so this windowed approach yields the same boundaries as the original approach. To improve the performance of this online variant, the waveform of the sound signal, the loudness function, the convex hull, and the distance between the loudness function and the convex hull are cached from the most recent segment boundary onward. The cache is updated when a new sound frame is added and cleared when a new segment is found. For example, when the first segment boundary is found at α s, the data before α s is deleted, and the window (variable window) is extended from α s until the next segment boundary is found. When a segment boundary is found at β s within the window, the data up to β s is deleted and the window is shortened. This process is repeated.
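The variable-window idea can be sketched as a small streaming class. This is a simplified sketch under stated assumptions, not the embodiment's code: the class name `OnlineSegmenter` is hypothetical, the hull is the rise-then-fall hull used for illustration here, the threshold (assumed non-negative) is a free parameter, and the hull is recomputed on every new value rather than cached incrementally as the text describes.

```python
class OnlineSegmenter:
    """Keeps loudness values since the last boundary and reports new boundaries."""

    def __init__(self, threshold):
        self.threshold = threshold   # non-negative; value not given in the text
        self.window = []             # loudness values since the most recent boundary
        self.offset = 0              # absolute index of window[0]

    def _deepest_dip(self):
        """Index of the largest gap below the hull, or None if not significant."""
        w = self.window
        if len(w) < 3:
            return None
        peak = max(range(len(w)), key=w.__getitem__)
        hull = list(w)
        for i in range(1, peak + 1):               # rising side of the hull
            hull[i] = max(hull[i - 1], w[i])
        for i in range(len(w) - 2, peak - 1, -1):  # falling side of the hull
            hull[i] = max(hull[i + 1], w[i])
        gaps = [h - v for h, v in zip(hull, w)]
        split = max(range(len(gaps)), key=gaps.__getitem__)
        return split if gaps[split] > self.threshold else None

    def push(self, value):
        """Add one loudness value; return absolute indices of boundaries found."""
        self.window.append(value)
        found = []
        while True:
            split = self._deepest_dip()
            if split is None:
                return found
            found.append(self.offset + split)
            self.window = self.window[split:]      # delete data before the boundary
            self.offset += split
```

Feeding the values 0, 8, 1, 9, 2, 7, 0 one at a time with a threshold of 2 reports a boundary at index 2 when the value 9 arrives and another at index 4 when the value 7 arrives; memory use is bounded by the length of the current segment rather than the whole signal.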

To evaluate this algorithm, a text-to-speech engine (Apple's Speech Synthesis Manager with the "Daniel" (male, British English) voice) was used rather than human speakers, so that speech could be generated at precise speeds with minimal error. A small corpus of speech samples was generated by synthesizing speech at different speeds from 120 to 400 wpm (words per minute) at linear intervals. The text was the first ten paragraphs of Chapter 1 of Lewis Carroll's Alice's Adventures in Wonderland, downloaded from Project Gutenberg (http://www.gutenberg.org/ebooks/11). The online variant of the Mermelstein algorithm was applied to this data. The results are shown in FIG. 10. The number of segments per second recognized by the algorithm increases linearly with speed (words per minute). This linear result indicates that the algorithm successfully identified the speech rate.

[B-5] Enunciation Clarity
In this utterance evaluation system, clarity of enunciation is considered as the ability to pronounce clearly. It is closely related to slurring and other pronunciation problems that can reduce intelligibility. Although clarity of enunciation has been studied for many years, there is almost no useful prior literature on algorithms for estimating it. The concept of "clear speech" in phonetics refers to a deliberate, non-natural clear speaking style that people adopt in particular situations, such as speaking to a non-native speaker, to a child, or to a computer interface.

Pisoni (Pisoni, D. Clear Speech. In The Handbook of Speech Perception. Blackwell Publishing, 2005, ch. 9, 218-227) lists the following global features as elements of clear speech: sound intensity, speaking rate, pauses, fundamental frequency, spectral tilt, and temporal envelope modulation. The first four have been addressed in this specification.

Phonetic features have also been investigated, and the consonant-to-vowel ratio (CVR) has been proposed. This is the ratio of consonant energy to vowel energy, and it has been proposed on the grounds that high consonant energy indicates clear speech. However, studies using CVR have produced mixed results (Hazan, V., and Markham, D. Acoustic-phonetic correlates of talker intelligibility for adults and children. Journal of the Acoustical Society of America (JASA) (2004)). This is thought to stem from the following basic flaws. Consonants and vowels can be reasonably distinguished in written text, but there are many kinds of consonant sounds, and many consonants have acoustic features similar to those of vowels. It has also been reported that not all consonants are equally important in estimating clarity: plosives and strong fricatives are relatively important, while nasals are less so (Kennedy, E., Levitt, H., Neuman, A. C., and Weiss, M. Consonant-vowel intensity ratios for maximizing consonant recognition by hearing-impaired listeners. Journal of the Acoustical Society of America (JASA) (1998)).

The inventors therefore propose estimating clarity by comparing obstruents with sonorants. Accordingly, the present invention uses the obstruent-to-sonorant ratio (OSR), rather than the consonant-to-vowel ratio, as the measure of speech clarity. To obtain the OSR, phone boundaries must first be identified in the audio signal. This would be possible with an ASR-like system, for example one that uses a Gaussian mixture model (GMM) trained on frame features such as mel-frequency cepstral coefficients (MFCCs).

In this embodiment, however, the boundaries generated by the speed estimation algorithm described above are reused as a computationally light alternative to finding all phone boundaries. That algorithm does not find every phone boundary, but every boundary it does find identifies an energy minimum in the loudness function of the signal. For a boundary to lie between a sonorant and an obstruent, both sounds must have enough energy to form a significant minimum between their energy peaks. A low-energy obstruent will not have enough energy to produce such a minimum. By classifying each resulting segment as either a sonorant or an obstruent, high-energy obstruents can therefore be counted. A larger number of obstruents means a higher OSR and hence indicates clearer speech. In this embodiment, the measure is the number of obstruent segments relative to the total number of segments. FIG. 11 shows syllabic segment boundaries in the loudness function, which yield seven segments. As shown in FIG. 12, the seven segments were separated into four obstruents and three sonorants. In this case, the clarity score = number of obstruents / total number of segments = 4 / (4 + 3) = 0.57.
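The score in the worked example above is simply the fraction of segments classified as obstruents. A minimal sketch, in which the function name `clarity_score` and the string labels are hypothetical:

```python
def clarity_score(labels):
    """Fraction of segments labeled "obstruent" among all classified segments."""
    if not labels:
        raise ValueError("no segments to score")
    obstruents = sum(1 for label in labels if label == "obstruent")
    return obstruents / len(labels)
```

With four obstruent and three sonorant segments, as in FIG. 12, this gives 4 / 7, which rounds to 0.57.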

Distinguishing sonorants from obstruents is a binary classification problem. To solve it, a support vector machine (SVM) was trained in this embodiment to classify sounds into these two categories. Support vector machines are well known to those skilled in the art; see, for example, Cortes, C., and Vapnik, V. Support-vector networks. Machine Learning (1995). An SVM can perform classification quickly (although training takes some time) and proved to be an effective way to keep computation light. Because the model is trained offline and the application itself performs only binary classification, the computational load stays low.

To generate training and test data, each sentence in a phone-tagged corpus was segmented using the speed estimation algorithm described above. The phone annotations were then used to find the phones in each segment and determine whether the segment was a sonorant. A segment containing any sonorant was identified as a sonorant.

The feature vector describing each segment consists of 14 elements: duration, periodicity, and 12 mel-frequency cepstral coefficients (MFCCs). Thirteen MFCCs were generated, but the first coefficient was excluded because it depends on the absolute energy level (which is relatively constant in the TIMIT corpus but varies in real acoustic environments). For periodicity, the output of the ESACF used for pitch extraction can be used. The calculation of MFCCs is well known to those skilled in the art, so its description is omitted.
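Assembling the 14-element vector can be sketched as follows. The function name `segment_features` is hypothetical, and the inputs are assumed to be already computed per segment; how per-frame MFCCs are reduced to one set of 13 coefficients per segment is not stated in the text.

```python
def segment_features(duration_s, periodicity, mfccs):
    """Build the 14-element vector: duration, periodicity, and MFCCs 2 through 13.

    mfccs: the 13 coefficients for the segment; the first one is dropped
    because it tracks the absolute energy level.
    """
    if len(mfccs) != 13:
        raise ValueError("expected 13 MFCCs")
    return [duration_s, periodicity, *mfccs[1:]]
```

Dropping the first coefficient keeps the classifier from learning a dependence on recording level, which matters when the training corpus is recorded more uniformly than real use.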

TIMIT was used as the speech corpus (Garofolo, J., Lamel, L., Fisher, W., et al. TIMIT acoustic-phonetic continuous speech corpus. Linguistic Data Consortium (1993)). The TIMIT corpus contains 6,300 sentences spoken in eight different dialects of American English. The corpus was divided into training data (4,620 sentences, 464 speakers) and test data (1,680 sentences, 168 speakers). Some sentences are the same but spoken by different speakers.

Using the speed estimation algorithm, all sentences were segmented, and each segment was labeled according to whether it contained a sonorant. Vowels, semivowels, and glides were treated as sonorants. This yielded a training set of 49,409 voiced segments (70.2%) and 21,003 unvoiced segments (29.8%), and a test set of 18,109 voiced segments (69.8%) and 7,848 unvoiced segments (30.2%).

SVM^light was used as the SVM implementation, with a radial basis function (RBF) kernel. To find the optimal C and γ values for the RBF kernel, 2,000 random samples were drawn from the training data, and the grid parameter search tool [Hsu, C., Chang, C., Lin, C., et al. A practical guide to support vector classification, 2003.] of LIBSVM [Chang, C., and Lin, C. LIBSVM: a library for support vector machines. ACM Trans. Intelligent Systems and Technology (2011).] was used. With features consisting of segment duration, periodicity, and the 12 mel-frequency cepstral coefficients, segments were classified with an accuracy of 93.27% and a recall of 94.85%. When only duration and periodicity were used, the accuracy was 75.51% and the recall was 58.47%.
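The parameter search described above can be sketched in plain Python. This does not call LIBSVM or SVM^light: `rbf_kernel` implements the standard RBF kernel exp(-γ‖x − y‖²), `parameter_grid` builds an exponentially spaced (C, γ) grid, and `grid_search` picks the best pair for a caller-supplied scoring function. The exponent ranges are those suggested in the cited practical guide and are an assumption here, since the text does not state which ranges were searched. The scoring function stands in for cross-validated performance on the 2,000 samples.

```python
import math
from itertools import product
from typing import Callable, List, Sequence, Tuple


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def parameter_grid(
    c_exponents: Sequence[int] = range(-5, 16, 2),
    gamma_exponents: Sequence[int] = range(-15, 4, 2),
) -> List[Tuple[float, float]]:
    """Return all (C, gamma) pairs on a base-2 exponential grid (assumed ranges)."""
    return [(2.0 ** c, 2.0 ** g) for c, g in product(c_exponents, gamma_exponents)]


def grid_search(
    score: Callable[[float, float], float], grid: Sequence[Tuple[float, float]]
) -> Tuple[float, float]:
    """Return the (C, gamma) pair with the highest score."""
    return max(grid, key=lambda pair: score(*pair))


# Toy scoring function, for illustration only; in practice this would be
# a cross-validation score of an SVM trained with the given C and gamma.
def toy_score(c: float, gamma: float) -> float:
    return -abs(math.log2(c) - 1) - abs(math.log2(gamma) + 3)


best_c, best_gamma = grid_search(toy_score, parameter_grid())
```

A coarse exponential grid keeps the number of SVM trainings small, which is presumably why only a 2,000-sample subset of the training data was used for the search.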

As a corpus for verifying the evaluation of actual speech by this algorithm, material S (3 paragraphs, separated into 34 phrases) of the corpus titled clearspeechjph was used. Although its definition of clear speech is not the same as that of the present invention, it was adopted as a preliminary check of the accuracy of the OSR measure according to the present invention. The speakers uttered each sample in both a clear, easily understood style and a casual style. The results are shown in FIG. 13. In the upper panel, the clear speech samples are distributed near 1; in the lower panel, the casual speech samples are distributed near 0. Because casual speech samples are not necessarily poor utterances, they are spread over the whole range to some extent.
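One of the OSR definitions given in the claims, the number of obstruent ("inhibition sound") segments divided by the number of sonorant ("resonance sound") segments, can be sketched as follows. The function name `obstruent_sonorant_ratio` is hypothetical, and the sketch returns the raw ratio only; how that ratio is mapped onto the 0 to 1 scale of FIG. 13 is not described in this passage and is not reproduced here.

```python
from typing import Sequence


def obstruent_sonorant_ratio(is_sonorant: Sequence[bool]) -> float:
    """Return (#obstruent segments) / (#sonorant segments).

    `is_sonorant` holds one classifier decision per segment:
    True for a sonorant segment, False for an obstruent segment.
    """
    sonorants = sum(1 for flag in is_sonorant if flag)
    obstruents = len(is_sonorant) - sonorants
    if sonorants == 0:
        raise ValueError("ratio is undefined without any sonorant segment")
    return obstruents / sonorants


# Hypothetical classifier output for five segments.
ratio = obstruent_sonorant_ratio([True, True, False, True, False])
```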

[C] Evaluation of actual utterances
The utterances of four speakers, Barack Obama, Woody Allen, Meryl Streep, and Lisa Simpson, were evaluated using the utterance evaluation system according to the present embodiment. Each utterance lasts about one minute. To approximate actual use, the utterances were played back from a loudspeaker toward the internal microphone of an iPhone. The results are shown in FIG. 14. In actual use of the utterance evaluation system, the user interface displays the features of the user's utterance and, optionally, those of a model speaker; in FIG. 14(a), however, the evaluation results of the four speakers are shown together for comparison. The regular pentagon indicates the average of the four speakers' features. On the actual screen, the standard deviation, that is, the dynamism, is shown as a band, as in the left part of FIG. 2; in FIG. 14(a) the standard deviation is omitted and is shown separately in FIG. 14(b).
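The relation between a feature's displayed value and its dynamism can be sketched as a mean and a standard deviation over the values measured during one utterance. This is an illustration under assumptions: the helper name `summarize_feature` is hypothetical, the pitch values are made up, and the population standard deviation is used because the text does not say which variant applies.

```python
import statistics
from typing import Sequence, Tuple


def summarize_feature(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard deviation) of one feature over an utterance."""
    mean = statistics.fmean(values)
    # Assumption: population standard deviation serves as the dynamism value.
    dynamism = statistics.pstdev(values)
    return mean, dynamism


# Hypothetical pitch measurements in Hz, for illustration only.
pitch_track_hz = [100.0, 110.0, 120.0]
mean_hz, dynamism_hz = summarize_feature(pitch_track_hz)
```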

All energy levels are nearly equal because the recordings were played back at similar volume levels in the same environment. The similarity of the pause rates is probably because all of the speech was monologue. The pitch results show each speaker's pitch and its variation. Obama speaks in a relatively low voice, whereas Allen speaks in a high voice for a man. Streep speaks in a low voice for a woman, whereas Lisa speaks at a comically high pitch.

In terms of speaking speed, Streep delivers a regular, forceful monologue, whereas Allen has a halting, fast-paced, neurotic style.

The clarity results show that Allen, despite his fast pace, articulates words very carefully and distinctly. Obama's utterance has the lowest clarity score: compared with Allen's sharp articulation, Obama's low-energy consonants give a more rounded sound. The dynamism values (pitch and speed) in FIG. 14(b) show that Obama is monotone and steady in rhythm and pace, whereas Allen's pitch varies widely and his pace also fluctuates, giving a more dynamic impression.

The present invention can be installed on a mobile device such as a smartphone as a speech correction application.

Claims

An utterance evaluation apparatus comprising:
an input unit to which a voice signal of a speaker's free utterance is input;
a feature extraction unit that extracts features used for evaluation from the input voice signal;
a feature evaluation unit that compares the features extracted by the feature extraction unit with reference features stored in advance; and
an output unit that outputs a comparison result,
wherein the features used for the evaluation include at least clarity of utterance,
the clarity of utterance is represented by an obstruent-to-sonorant ratio in the input voice signal, and
the feature extraction unit acquires the obstruent-to-sonorant ratio using means for dividing the input voice signal into a plurality of segments and means for classifying the obtained segments into obstruents and sonorants.

Features used for the evaluation further include loudness,
The feature extraction unit obtains a voice volume from energy of an input voice signal;
The utterance evaluation apparatus according to claim 1.

The features used for the evaluation further include a pause rate,
The pause rate is the proportion of the length of the voice signal during which no speech is uttered,
The feature extraction unit acquires the proportion by voice section detection means.
The utterance evaluation apparatus according to claim 1.

The features used for the evaluation further include pitch,
The feature extraction unit obtains a pitch by a pitch extraction unit;
The utterance evaluation apparatus according to claim 1.

The features used for the evaluation further include speaking speed,
The speaking speed is represented by the number of utterance events within a predetermined time,
The feature extraction unit obtains the speaking speed by counting the number of utterance events within a predetermined time.
The utterance evaluation apparatus according to claim 1.

The utterance evaluation apparatus according to claim 5, wherein the utterance event is a syllable.

In the means for dividing the input voice signal into a plurality of segments, the segments are syllables,
In the feature extraction unit, the plurality of segments are acquired using the syllable boundaries obtained when acquiring the speaking speed.
The utterance evaluation apparatus according to claim 6.

The means for classifying the segments into obstruents and sonorants uses supervised machine learning means.
The utterance evaluation apparatus according to claim 1.

The supervised machine learning means uses segment duration, periodicity, and mel-frequency cepstral coefficients as a feature vector.
The utterance evaluation apparatus according to claim 8.

The utterance evaluation apparatus according to claim 1, wherein the obstruent-to-sonorant ratio is the ratio of the number of obstruent segments to the number of sonorant segments.

The utterance evaluation apparatus according to claim 1, wherein the segment is a single sound or a phoneme, and the obstruent-to-sonorant ratio is the ratio of obstruent energy to sonorant energy.

A speech evaluation method comprising:
a step of inputting a voice signal of a speaker's free utterance;
a feature extraction step of extracting features used for evaluation from the input voice signal;
a feature evaluation step of comparing the features extracted in the feature extraction step with reference features stored in advance; and
a step of outputting a comparison result,
wherein the features used for the evaluation include at least clarity of utterance,
the clarity of utterance is represented by an obstruent-to-sonorant ratio in the input voice signal, and
the feature extraction step includes a step of dividing the input voice signal into a plurality of segments, a step of classifying the obtained segments into obstruents and sonorants, and a step of acquiring the obstruent-to-sonorant ratio.

Features used for the evaluation further include loudness,
The feature extraction step includes obtaining a voice volume from energy of an input voice signal.
The speech evaluation method according to claim 12.

The features used for the evaluation further include a pause rate,
The pause rate is the proportion of the length of the voice signal during which no speech is uttered,
The feature extraction step includes obtaining the proportion by voice section detection means;
The speech evaluation method according to claim 12 or 13.

The features used for the evaluation further include pitch,
The feature extracting step includes obtaining a pitch by a pitch extracting means,
The speech evaluation method according to any one of claims 12 to 14.

The features used for the evaluation further include speaking speed,
The speaking speed is represented by the number of utterance events within a predetermined time,
The feature extraction step includes obtaining the speaking speed by counting the number of utterance events within a predetermined time,
The speech evaluation method according to any one of claims 12 to 15.

The speech evaluation method according to claim 16, wherein the speech event is a syllable.

In the step of dividing the input voice signal into a plurality of segments, the segments are syllables;
In the feature extraction step, the plurality of segments are acquired using the syllable boundaries obtained when acquiring the speaking speed.
The speech evaluation method according to claim 17.

The step of classifying the segments into obstruents and sonorants uses supervised machine learning means.
The speech evaluation method according to any one of claims 12 to 18.

The supervised machine learning means uses segment duration, periodicity, and mel-frequency cepstral coefficients as a feature vector.
The speech evaluation method according to claim 19.

The speech evaluation method according to any one of claims 12 to 20, wherein the obstruent-to-sonorant ratio is the ratio of the number of obstruent segments to the number of sonorant segments.

The speech evaluation method according to any one of claims 12 to 20, wherein the segment is a single sound or a phoneme, and the obstruent-to-sonorant ratio is the ratio of obstruent energy to sonorant energy.

A computer program for causing a computer to execute the method according to any one of claims 12 to 22.